Learn by doing LLM projects, evaluating RAG pipelines, and Self-RAG
Plus industry headlines, upcoming community events, and whether LLMs can critique and improve their own outputs
What’s up, everyone!
Thank you to everyone who joined the LangChain ZoomCamp on Friday.
Don't worry if you missed it; I’ve got the recording for you. I’ll stick to sending links directly to the Zoom recordings; here’s the link, where you can watch or download the videos.
It was a short, 30-minute session. Though I was struggling, losing my voice, and nearly hacking up a lung, I managed to talk about what I did at work this week and show a project I’m working on: instruction fine-tuning an LLM.
Here are some things I covered:
🛠 Discussed low-rank adaptation and quantization in fine-tuning language models.
🔄 Explored hyperparameter tuning, particularly adjusting parameters R and Alpha in QLoRA.
🐞 Talked about my plans for evaluation.
💾 Shared the decision to use an open-source embedding model and plans for GPT-4 preference analysis.
🔄 Reiterated engagement with the Zoomcamp on prompt engineering techniques and prompt management, inviting your suggestions for topics to cover.
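To make the R and Alpha discussion above concrete, here’s a minimal NumPy sketch of the low-rank update at the heart of (Q)LoRA. This is illustrative only, not the code from the session: a frozen weight matrix is adapted by a trainable low-rank product, scaled by alpha / r.

```python
import numpy as np

# Hypothetical dimensions; r and alpha are the hyperparameters discussed above.
d_out, d_in, r, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weights
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection, rank r
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    """Forward pass with the low-rank adapter applied: W x + (alpha/r) B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapter starts as a no-op on the base model:
assert np.allclose(lora_forward(x), W @ x)
```

The alpha / r ratio acts as a fixed scale on the adapter’s contribution, which is why tuning the two together (rather than independently) is a common rule of thumb.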
📚 I created a course on deep learning for image classification for the LinkedIn Learning platform.
Over 1100 people have taken the course in the two weeks since its launch, which is insane to me! It’s free to take if you have a LinkedIn Premium account or a subscription to LinkedIn Learning. Or, you can purchase the course outright for $45.
🗓️ Events from around the community
Nov 6th: How to Build a Product Reco App Using JSON + Unstructured Data. Many people have been asking about how to build recommendation engines using LLMs. This is the event for you!
My friend (and mentor) Chris Alexiuk has two events he’d like me to invite you to, and I highly recommend checking them out.
This is part of the AI Makerspace community, one of the best learning platforms for LLMs. I was a student in their cohort on LLMOps in September, and it was so good that I signed up for the current cohort on LLM Engineering.
Their sessions are value-packed and always hands-on. Plus, Chris is an absolute legend:
✨ Blog of the Week
🤔 Can large language models (LLMs) really critique and improve their own outputs?
This post by Cameron Wolfe discusses several papers that propose improvements in LLMs, including their ability to self-critique, the use of noisy embeddings for fine-tuning, safe reinforcement learning from human feedback, and the enhancement of retrieval-augmented generation. Here are my key takeaways from the article:
Self-Critique and Self-Reflection in LLMs:
LLMs often struggle to evaluate their outputs accurately, challenging the assumption they can self-critique and iteratively improve solutions.
Two papers highlight that LLMs can't reliably verify correct solutions, which is crucial for self-critique, particularly in graph colouring and classical planning problems.
NEFTune - Enhancing Fine-tuning:
NEFTune introduces a novel method: adding noise to word embeddings during fine-tuning, which has been shown to consistently improve LLM performance across different models and datasets.
This technique is simple to implement and has been integrated into HuggingFace's TRL package for easy adoption in fine-tuning pipelines.
Tackling Harmfulness in LLMs with RLHF:
Reinforcement learning from human feedback (RLHF) is crucial for aligning LLMs with human-defined criteria, balancing helpfulness against harmlessness.
The Safe RLHF approach introduces a more rigorous method for alignment, maximizing helpfulness while constraining harmfulness, offering a potential solution to this tension.
Improving RAG through Self-Reflection (Self-RAG):
Self-RAG improves upon the standard Retrieval-Augmented Generation (RAG) by incorporating a self-reflection process, ensuring the retrieved information is necessary and relevant.
Although promising, the complexity of Self-RAG may hinder its practicality, suggesting a need for simpler, more adaptive retrieval processes for LLMs.
Cameron Wolfe is one of my absolute favourite writers for all things language models. I highly recommend subscribing to his newsletter for deep dives into LLMs or at least checking out his archives and bookmarking some for future reading.
🛠️ GitHub Gems
Pere Marta has created, hands down, the most value-packed, resource-rich repo I've come across for learning everything LLMs!
I went through every module in the course and looked at the accompanying resources and notebooks, and the coverage and quality blew me away.
This is the perfect resource to start your GenAI journey, brush up on some fundamentals, or add a new skill to your tool belt.
Stars (or hearts on a newsletter) may be a vanity metric, but when you’re a creator who puts in time and effort to share their knowledge…this small gesture means a lot. It’s the fuel that keeps us going.
Let’s help Pere out - he’s clearly put a tonne of work into this course, and it’s out there absolutely free.
Go and give him a ⭐️ for his hard work to show your appreciation and to motivate him to keep going.
Let’s get this repo up to 1k stars by the end of the weekend; I know we can make it happen!
Get a shirt and support the newsletter.
I keep all my content freely available by partnering with brands for sponsorships. Lately, the pipeline for sponsorships has been a bit dry…so I launched a t-shirt line to gain community support.
Don’t want a shirt for yourself? The holidays are coming, so get one for your favourite data nerd!
You can check out my designs here and explore the excellent product descriptions I generated using GPT-4!
📰 Industry Pulse
🎵 Ever wondered how technology could breathe new life into old music? The Beatles' final song, "Now and Then," is now streaming, thanks to the power of AI and machine learning. The iconic rock band's last collaborative effort was made possible using breakthrough technology to piece together a finished track from an old lo-fi John Lennon recording. The song, which was initially attempted in the mid-'90s, was never completed due to technical issues. However, a new technology developed by Peter Jackson's team for the Get Back documentary allowed the remaining Beatles, Paul McCartney and Ringo Starr, to separate the different components of the song and give it the ending it deserved.
🛡️ Ever wondered which sectors are thriving amidst a tougher fundraising environment? TechCrunch reports that while AI startups have been a hotbed for dealmaking this year, another sector that has managed to maintain investor interest is defence tech. A prime example of this trend is Shield AI, a San Diego-based autonomous drone and aircraft startup, which recently raised a $200 million Series F round, valuing the company at $2.7 billion.
🤖 How close are we to having general-purpose robots in our everyday lives? Google DeepMind's head of robotics, Vincent Vanhoucke, shares some insights in a recent TechCrunch interview. In the interview, Vanhoucke discusses the development of Open X-Embodiment, a database of robotics functionality created in collaboration with 33 research institutes. The database, compared to the landmark ImageNet, is seen as a key step towards training a generalist model that can control many different types of robots. Vanhoucke also touches on the role of generative AI and the importance of office WiFi.
🤖 Could Elon Musk's new AI chatbot, Grok, be the next big thing in conversational AI? Elon Musk's AI startup, xAI, is developing a new AI model named Grok. This chatbot, which Musk has been teasing on Twitter, is designed to answer questions conversationally, possibly using a knowledge base similar to that of ChatGPT and other text-generating models. Grok also has internet browsing capabilities, allowing it to search the web for up-to-date information on specific topics.
🤖 Is Apple catching up in the AI race? According to a recent TechCrunch article, Apple CEO Tim Cook has confirmed that the tech giant is investing heavily in AI, particularly in generative AI technologies. In a Q4 earnings call, Cook highlighted the role of AI in Apple's recent technological developments, including new iOS 17 features like Personal Voice and Live Voicemail. He also confirmed that Apple is working on generative AI technologies, although he declined to share specific details. Despite not being labelled as "AI" by the company, Cook emphasized that AI and machine learning fundamentally power these features.
🎁 Wondering what to gift your tech-savvy loved ones this holiday season? How about some AI-powered gadgets that are not just fun but also productive? TechCrunch's AI reporter, Kyle Wiggers, has curated a list of AI-powered tech gifts that are worth considering for the holiday season of 2023. The list includes a variety of products, from toys to skincare, that showcase the potential fun and usefulness of AI, without focusing on its surveillance applications.
💡 Your Questions, Answered
Evaluating Retrieval-Augmented Generation (RAG) systems requires nuanced metrics that capture different aspects of performance.
Some common metrics that have been established as benchmarks for evaluation are:
Precision and Recall: These metrics are staples in information retrieval evaluation. Precision quantifies the proportion of relevant items within the retrieved documents, while recall measures the completeness of the retrieval process by assessing the proportion of relevant items successfully extracted from the data source.
BLEU (Bilingual Evaluation Understudy): BLEU measures how many words and phrases in the LLM's output match a reference translation, adjusted for the length to avoid favouring overly verbose responses. This technique is primarily used for evaluating machine translation.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Besides being a hell of a name, this set of metrics evaluates automatic summarization and machine translation systems by counting the overlap of n-grams between the system's output and a set of reference summaries.
BERTScore: This measures the semantic similarity between the generated text and reference text. Under the hood, it uses BERT embeddings to calculate the cosine similarity scores, thus capturing the meaning beyond the exact word matches.
BLEURT (Bilingual Evaluation Understudy with Representations from Transformers): This metric uses BERT's contextual understanding to predict human-like ratings of text quality. It's pre-trained on signals that include BLEU and ROUGE scores, then fine-tuned on human ratings, offering a nuanced evaluation that aligns closely with human judgment.
METEOR (Metric for Evaluation of Translation with Explicit Ordering): METEOR goes beyond basic word matches by including synonymy matching, stemming, and paraphrase recognition, intending to align more closely with human evaluators' scores.
Precision and recall offer a straightforward assessment of retrieval accuracy, while BLEU and ROUGE provide foundational measures of language generation quality. BERTScore and BLEURT introduce deeper semantic understanding into the evaluation process, and METEOR offers a more human-like assessment by considering linguistic nuances that other metrics might overlook.
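Retrieval precision and recall are simple enough to compute by hand. Here’s a minimal sketch using hypothetical document IDs: precision looks at what was retrieved, recall at what should have been.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Compute retrieval precision and recall over sets of document IDs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: the retriever returned four docs, three are relevant overall.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5"}

p, r = precision_recall(retrieved, relevant)
# 2 of 4 retrieved docs are relevant; 2 of 3 relevant docs were retrieved.
assert (p, r) == (0.5, 2 / 3)
```

The tension between the two is the classic retrieval trade-off: fetching more passages tends to raise recall at the expense of precision.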
In addition to what we just discussed, the Ragas framework introduces metrics that provide a more nuanced analysis of RAG pipelines:
Context Relevancy: This metric measures the signal-to-noise ratio in the retrieved contexts. It assesses how much of the retrieved information is pertinent to the query, giving a clear picture of the retrieval system's precision.
Context Recall: This metric evaluates the retriever's ability to fetch all necessary information to answer the query. It's crucial to determine whether the retrieval component is comprehensive enough to support the generation of accurate answers.
Faithfulness: Faithfulness measures the factual accuracy of the generated answer against the provided context. This is particularly important for ensuring that the generated responses are not only relevant but also accurate.
Answer Relevancy: Answer relevancy assesses how on-point the generated answers are to the questions. It ensures that the responses provided by the generation component of the RAG system are directly answering the queries posed.
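To build intuition for the signal-to-noise idea behind context relevancy, here’s a toy illustration. Note the caveat: Ragas derives these judgments with an LLM; the boolean labels below are hypothetical stand-ins for those judgments.

```python
# Hypothetical retrieved chunks for the query "What is the capital of France?",
# with placeholder relevance labels standing in for LLM judgments.
retrieved_chunks = [
    {"text": "Paris is the capital of France.", "relevant": True},
    {"text": "France borders Spain.",           "relevant": False},
    {"text": "The Eiffel Tower opened in 1889.", "relevant": False},
]

def context_relevancy(chunks: list) -> float:
    """Share of retrieved chunks pertinent to the query (signal-to-noise)."""
    relevant = sum(1 for c in chunks if c["relevant"])
    return relevant / len(chunks)

# Only one of three retrieved chunks carries signal for this query:
assert context_relevancy(retrieved_chunks) == 1 / 3
```

A low score here points at the retriever padding the context with noise, which wastes the generator’s context window and invites hallucination.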
References
Leng, Q. Best Practices for LLM Evaluation of RAG Applications. Databricks Blog.
Fang, F. Evaluating the Performance of Retrieval-Augmented LLM Systems. LastMile AI Blog.
Lawson, N. How to Measure the Success of Your RAG-based LLM System. Towards Data Science.
Langchain AI. Using Fixed Sources [Jupyter notebook]. GitHub.
Langchain. Evaluating RAG Pipelines with Ragas + LangSmith. LangChain Blog.
Exploding Gradients. Ragas. GitHub.
🔍 Research Refined
LLMs often falter on factuality, producing errors by relying solely on internal knowledge.
Self-RAG introduces “on-demand retrieval and self-reflection”, which improves response quality without compromising the model's versatility. It differs from typical RAG models by not just indiscriminately fetching passages; instead, it uses reflection tokens—special markers signaling the need for information retrieval or critiquing the output. These tokens let Self-RAG critique its own responses, improving relevance and factual accuracy.
🤔 Factuality Challenges: Current LLMs often struggle with factual inaccuracies. Self-RAG addresses this by enabling on-demand retrieval of information and self-reflection to assess and improve the generated content's accuracy.
🔄 Self-RAG vs. Conventional RAG: Self-RAG breaks the mold of traditional RAG by not defaulting to a set number of passages for context. It dynamically decides the necessity and relevance of retrieval, leading to more accurate and high-quality LLM responses.
🪙 Reflection Tokens Explained: In the Self-RAG toolkit, retrieval and critique tokens let the LLM question if retrieval is required and critique its own responses. This introspection leads to outputs that are both factual and of higher quality.
✅ Human Evaluation Reliability: Human evaluators have confirmed the reliability of Self-RAG’s reflection tokens, noting that the model's answers are generally plausible and well-supported by the evidence.
📊 Evaluation Metrics and Tasks: Self-RAG's performance was evaluated using a variety of metrics, including accuracy, FactScore, and citation precision, across tasks like fact verification, open-domain QA, biography generation, and long-form QA tasks.
Self-RAG presents a significant improvement over traditional RAG systems by incorporating a self-reflective step, enhancing factuality and contextual relevance. However, its complexity may pose practical challenges for implementation, suggesting a need for further refinement for real-world applications.
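The control flow described above can be sketched in a few lines. Everything here is a simplification: the toy stubs stand in for a trained LM that emits learned reflection tokens, and the bracketed names are just labels for the decisions those tokens encode.

```python
def self_rag_answer(query, generate, needs_retrieval, retrieve, critique):
    """Simplified Self-RAG loop: retrieve on demand, critique each draft."""
    if needs_retrieval(query):                 # the "[Retrieve]" decision
        scored = []
        for passage in retrieve(query):
            draft = generate(query, passage)
            # Critique step: score how well the passage supports the draft.
            scored.append((critique(draft, passage), draft))
        return max(scored)[1]                  # keep the best-supported answer
    return generate(query, None)               # answer from parametric knowledge

# Toy stubs standing in for the LM and retriever:
generate = lambda q, p: f"answer from {p}" if p else "parametric answer"
needs_retrieval = lambda q: "capital" in q
retrieve = lambda q: ["passage-a", "passage-b"]
critique = lambda draft, p: 1.0 if p == "passage-b" else 0.2

assert self_rag_answer("capital of France?", generate, needs_retrieval,
                       retrieve, critique) == "answer from passage-b"
assert self_rag_answer("hello", generate, needs_retrieval,
                       retrieve, critique) == "parametric answer"
```

Even in this stripped-down form you can see where the extra cost comes from: one generation and one critique call per candidate passage, which is the practicality concern raised above.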
You can check out the blog here. The GitHub repo with model weights can be found here.