What's the difference between RAG and Fine-Tuning?
An overview of some key terminology you should know when working with LLMs
What’s up, everyone!
Thank you to everyone who joined the LangChain ZoomCamp on Friday!
Don't worry if you missed it; you can find the recording in the podcast section of the Substack. The slides are available on GitHub.
This session was heavy on concepts (and slides), and I covered all the core modules in LangChain.
In the next session, we’ll see these concepts in code and combine them in a single mini-project.
🗓️ An upcoming series of panel discussions
I’ve partnered with the Generative AI World Summit to bring you a series of four panel discussions. These are interactive sessions, and I welcome your participation. I’ll have some seed questions prepared, but your input will guide the direction of the conversation.
One registration link will get all the sessions on your calendar, and I look forward to having you there. You can register here.
Below is the schedule and the tentative speakers for each session.
🗓️ Thursday, September 28th at 4pm CST: The Future of Generative AI: Vision and Challenges
This session will explore the potential of Generative AI in reshaping industries, its future trajectory, and the challenges ahead. Panelists will discuss the advancements they foresee in the next decade and the hurdles that must be overcome.
🗓️ Thursday, October 5th at 4pm CST: From Academia to Industry: Bridging the Gap in Generative AI
This panel will discuss translating academic research in Generative AI into real-world applications. Experts will discuss the challenges of this transition and share success stories.
🗓️ Thursday, October 12th at 4pm CST: Generative AI in Production: Best Practices and Lessons Learned
This session will cover the practical aspects of deploying Generative AI solutions, the challenges involved, and the lessons learned.
🗓️ Thursday, October 19th at 4pm CST: Ethics and Responsibility in Generative AI
Given the power of Generative AI, this session will address the ethical implications, potential misuse, and the responsibility of researchers and practitioners in ensuring its safe and beneficial use.
Let’s discuss the difference between RAG and fine-tuning a model
Fine-tuning simply means updating the parameters of a pretrained LLM via backpropagation on a smaller, task-specific dataset. You can “unfreeze” all or a subset of the model's parameters, allowing them to be updated to align the model for task-specific behaviours. It's computationally intensive and requires expertise.
RAG (Retrieval Augmented Generation) merges prompt engineering with context retrieval from external data sources. It couples information retrieval mechanisms with text generation models. Instead of solely relying on the model's knowledge, RAG fetches context from external data to provide richer answers. You won’t be updating any weights with this method; you’re simply augmenting the prompt with your specific data.
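Here’s roughly what that looks like in code. This is just a minimal sketch using a LangChain-style setup (circa the 0.0.x import paths) with a Chroma vector store and OpenAI embeddings; the documents and question are placeholders.

```python
# Minimal RAG sketch (LangChain-style, circa the 0.0.x import paths).
# The documents and question are placeholders; swap in your own data.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

docs = ["Our refund window is 30 days.", "Support is available 9am-5pm CST."]

# 1. Embed and index your data in an external store.
vectorstore = Chroma.from_texts(docs, embedding=OpenAIEmbeddings())

# 2. At query time, retrieve the relevant chunks and stuff them into the prompt.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    retriever=vectorstore.as_retriever(),
)
print(qa.run("What is the refund window?"))  # no weights are updated anywhere
```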
In-context learning (aka “prompting”) doesn't involve training network weights, either. Instead, it refines the input to guide the model's output. The model is guided to produce desired outputs without any parameter adjustments by providing specific instructions or examples in the input. This is using techniques like few-shot prompting, chain-of-thought, ReAct, etc.
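To make that concrete, here’s a toy few-shot prompt. The examples and labels are made up; nothing about the model changes, only the input string does.

```python
# Few-shot (in-context) prompt: the weights never change, only the input does.
prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day." -> positive
Review: "It stopped working after a week." -> negative
Review: "Setup took five minutes and it just works." ->"""
# Send `prompt` to any chat or completion endpoint; the examples steer the output.
```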
The main takeaway you need to know: fine-tuning is a technique for teaching an LLM tasks or behaviours.
It is not meant for teaching new knowledge.
To teach new knowledge, RAG is a better method. You can store your data in an external database and retrieve the relevant information to give the LLM context for your query.
However, even for specific tasks or behaviours, an LLM can often learn what you want from a good prompt alone.
So, now let’s talk a little bit about prompt tuning…
What’s prompt-tuning?
Prompt-tuning: Also known as soft prompt tuning, this involves concatenating the embeddings of the input tokens with a trainable tensor. This tensor can be optimized via backpropagation to improve performance on a target task. Unlike hard prompt tuning, where input tokens are directly changed, soft prompts are adjusted based on loss feedback from a labelled dataset.
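Here’s a minimal PyTorch sketch of the idea, with illustrative shapes: a small tensor of “virtual token” embeddings is prepended to the frozen model’s input embeddings, and only that tensor is optimized.

```python
import torch
import torch.nn as nn

# Soft prompt tuning sketch: learn a small tensor of "virtual token" embeddings and
# prepend it to the frozen model's input embeddings. The shapes here are illustrative.
class SoftPrompt(nn.Module):
    def __init__(self, n_virtual_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen LLM's embedding layer
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Only the soft prompt is passed to the optimizer; the LLM itself stays frozen,
# and this tensor is updated via backprop on a labelled dataset.
soft_prompt = SoftPrompt(n_virtual_tokens=20, embed_dim=768)
optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
```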
Prompting (In-context learning): This process (aka hard prompt tuning) involves iteratively refining the model's input (the string of text you give it) to guide its output without adjusting any of the model's parameters. Techniques include zero-shot prompting (providing specific instructions) and few-shot prompting (prepending a few examples to guide the model).
While both methods involve manipulating the input to achieve desired outputs, prompt-tuning introduces trainable tensors to the input embeddings, allowing for optimization based on labelled data.
In contrast, prompting solely relies on refining the input without any parameter adjustments.
A few ways you can fine-tune an LLM
Extract features and train a new classification head.
In this method, a pretrained LLM is loaded and applied to the target dataset.
The primary goal is to generate output embeddings for the training set, which can then be used as input features to train a classification model.
The classification model can be a logistic regression, a random forest, or XGBoost. However, linear classifiers like logistic regression often perform best in this context.
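As a rough sketch (the model name and data below are placeholders), feature extraction plus a linear head looks something like this with Hugging Face transformers and scikit-learn:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Feature-extraction sketch: embed the dataset with a frozen pretrained model, then
# train a cheap classifier on top. The model name and data are placeholders.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

texts = ["great product, would buy again", "terrible support, avoid"]
labels = [1, 0]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Mean-pool the last hidden state into one embedding (feature vector) per text.
    features = model(**batch).last_hidden_state.mean(dim=1).numpy()

clf = LogisticRegression().fit(features, labels)  # the new classification "head"
```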
Freeze all the early layers; train only the last few layers and update their weights.
In this approach, the parameters of the pretrained LLM are kept frozen, and only the newly added output layers are trained.
This is analogous to training a logistic regression classifier or a small multilayer perceptron on the embedded features.
It's a more resource-efficient method, but its performance might not be as high as fine-tuning all layers.
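A hedged sketch of that pattern with transformers (the model name is a placeholder, and the name of the head attribute varies by architecture):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Sketch: keep the pretrained body frozen and train only the new classification head.
# The model name is a placeholder; the head attribute ("classifier" here) varies by architecture.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

for param in model.parameters():              # freeze everything...
    param.requires_grad = False
for param in model.classifier.parameters():   # ...then unfreeze just the new head
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-4
)
```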
Update the weights in all layers.
This is the gold standard for adapting LLMs to new tasks.
Unlike the previous two approaches, in this method all the parameters of the pretrained LLM are updated.
It often results in superior modelling performance but comes at a higher computational cost.
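Continuing the previous snippet, full fine-tuning just means unfreezing everything, typically with a smaller learning rate:

```python
# Full fine-tuning sketch (continuing the snippet above): unfreeze everything and
# use a smaller learning rate, since every parameter now moves.
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```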
Of course, your chosen method depends on your task, how much compute you have available, and your judgement on the trade-off between performance and efficiency.
What are parameter-efficient fine-tuning methods?
Parameter-efficient fine-tuning allows you to adapt pretrained models while minimizing the computational and resource footprint. Techniques like prefix tuning, adapters, and low-rank adaptation add or tune a small number of parameters across multiple layers to achieve strong predictive performance at a low cost.
LoRA (Low-Rank Adaptation): Uses reparameterization to reduce the number of trainable parameters, allowing for efficient fine-tuning (see the peft sketch after this list).
Prefix Tuning: Adds trainable tensors to each transformer block, not just the input embeddings. It modifies more layers of the model by inserting a task-specific prefix to the input sequence.
Soft Prompt Tuning: Concatenates the embeddings of the input tokens with a trainable tensor optimized via backpropagation.
Transfer Learning: Preserves the useful representations the model has learned from past training when applying it to a new task. This often involves dropping the neural network's head and replacing it with a new one.
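As an example of how lightweight this can be, here’s a LoRA sketch using Hugging Face’s peft library. The base model and target module names are placeholders (they differ by architecture), so treat this as a starting point rather than a recipe.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# LoRA sketch with Hugging Face's peft library. The base model and the target
# module names are placeholders and differ by architecture.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                     # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```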
Fine-tuning language models is essential for tasks that are hard to describe or can't be captured with a few examples.
Imagine you were training someone to do the task you’re trying to get an LLM to do. If you had to spend more than a few days training someone to do that task, rather than giving them a few examples and sending them on their way, fine-tuning might be the answer.
Like any ML project, it’s a complex process. You’ll deal with the same issues: dataset design, experiment management, battling overfitting and catastrophic forgetting (the phenomenon where a neural network forgets previously learned information upon learning new tasks or data), plus cost and compute considerations.
Benefits of fine-tuning include:
Reliability: Fine-tuned models produce more focused, consistent outputs for your task.
Writing Style: A fine-tuned model can consistently match a particular voice, which is crucial for corporate communication.
Cultural Localization: Fine-tuning helps adapt tone and phrasing for non-English audiences.
RAG’s effectiveness is tied to search accuracy.
Current retrieval solutions, like vector search, have limitations, especially with domain-specific terms. RAG might not be sufficient if your domain's terminology isn't well represented in generic embeddings. This has led some to consider fine-tuning, but it's not always the right solution. Fine-tuning is best suited to specialized language, like that of the medical or legal fields.
Fine-tuning is a good investment if you’re working in one of those fields. If not, it’s better to proceed cautiously and try RAG instead.
✨ Blog of the Week
My pick this week is Leveraging qLoRA for Fine-Tuning of Task-Fine-Tuned Models Without Catastrophic Forgetting: A Case Study with LLaMA2(-chat)
The blog post goes into infusing domain-specific knowledge into an LLM. It emphasizes the importance of addressing challenges related to helpfulness, honesty, and harmlessness when designing LLM-powered applications of enterprise-grade quality.
The main focus is on the parametric approach to fine-tuning, which efficiently injects niche expertise into foundation models without compromising their general linguistic capabilities.
What will you learn by reading it?
You'll gain insights into the challenges and solutions of fine-tuning large language models, specifically focusing on the qLoRA approach for parameter-efficient fine-tuning.
You'll also learn about the practical steps involved in fine-tuning LLaMA2 using Amazon SageMaker and how to test the performance of the fine-tuned models.
It’s a thorough guide for GenAI practitioners like yourself who want to learn how to adapt LLMs to specific domains without compromising their general capabilities.
If you’re uncomfortable training a model on AWS, this is another great hands-on blog, and you can run it right in Colab.
🛠️ GitHub Gems
I attended a live session hosted by AI Makerspace titled Smart RAG: Domain-specific fine-tuning for end-to-end Retrieval.
During the session, we were taught how to jointly train a retriever and a generator to create a Domain Adapted Language Model (DALM) using an end-to-end RAG model. You can check out the notebook they used here.
E2E RAG training enables the creation of high-performing contextualized retrievers for the RAG system. The model we trained in the notebook learned how to retrieve documents from our context and generate responses in a unified system.
And they did this using an open-source library called DALM, developed by Arcee.
It’s a library worth checking out and starring on GitHub, and I’ll for sure be doing some projects with it.
📰 Industry Pulse
GitHub has released a public beta of GitHub Copilot Chat for all individual users, which integrates with both Visual Studio and VS Code. This AI-powered chat aims to revolutionize software development by allowing developers to code using natural language, reducing boilerplate work, and positioning natural language as a universal programming language. GitHub Copilot Chat offers features like real-time guidance, code analysis, security issue fixes, and simple troubleshooting, all within the integrated development environment (IDE).
Amazon is investing up to $4 billion in Anthropic to develop industry-leading foundation models for AI. This collaboration will leverage Anthropic's AI safety research and Amazon Web Services (AWS) infrastructure expertise to make AI more accessible and reliable for AWS customers. As part of the agreement, AWS will become Anthropic's primary cloud provider, and both companies are committed to the safe and responsible development and deployment of AI technologies.
DALL·E 3 is here! It’s an advanced version of OpenAI's text-to-image system and offers enhanced nuance and detail, translating textual ideas into highly accurate images. This system will be integrated with ChatGPT, allowing users to generate detailed prompts for DALL·E 3 and refine image outputs. OpenAI has implemented safety measures to ensure responsible content generation, including declining requests related to living artists and public figures, and is researching ways to identify AI-generated images.
💡 My Two Cents
I didn’t expect that tweet to get the reaction it did, and I guess it means a lot of people are feeling the same way.
scikit-learn was a staple in my toolbelt during the early years of my data science career. I still use some handy utility functions in scikit-learn, but I haven’t trained a scikit-learn model in almost three years.
I've primarily worked with deep learning over the last two years. So the stack I’ve been using doesn’t include old favorites like scikit-learn, SciPy, or pandas that much.
These days, my stack consists of:
PyTorch for all things tensors and deep learning
SuperGradients for computer vision (go and star the repository!)
The HuggingFace stack (transformers, bitsandbytes, peft, diffusers)
LangChain
LlamaIndex
You don’t need to know the entire ins and outs of these libraries if you’re just starting out.
For PyTorch, I recommend learning how to manipulate tensors, work with dataloaders, create a model, and write a training loop (a bare-bones example follows below). SuperGradients is an abstraction over PyTorch, so it’s quite easy to get up and running with it once you’re comfortable with PyTorch. Go through the basic “hello world” tutorial for the other libraries.
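Here’s the kind of bare-bones “hello world” I mean, with random data standing in for a real dataset; the point is the moving parts, not the numbers:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Bare-bones PyTorch loop: tensors, a DataLoader, a model, and a training loop.
# Random data stands in for a real dataset; the moving parts are what matter.
X, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```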
And then from there, start building! It’s the only way you’ll learn these libraries meaningfully.
🔍 Research Refined
This week’s paper looks at what the authors call the “Reversal Curse”: when trained on statements in the format "A is B," LLMs consistently fail to deduce or generalize the reverse, "B is A."
For example, even if they are trained on the information that "George Washington was the first US president," they often cannot correctly answer the question, "Who was the first US president?"
The study tested this phenomenon using various real-world examples, including celebrity-parent relationships, and found that GPT-4 could only answer 33% of reversed questions correctly, compared to 79% for forward ones.
Andrej Karpathy commented on this, suggesting that LLM knowledge is more "patchy" than anticipated.
The findings underscore that LLMs might be heavily relying on statistical patterns in their training data rather than achieving a genuine understanding of causal relationships.
This discovery emphasizes the importance of testing LLMs in diverse and comprehensive ways to truly gauge their capabilities and limitations.
If you're interested in testing the "Reversal Curse" phenomenon yourself, here's a step-by-step guide (with a small script after the list):
Choose a Model: You can use popular LLMs like ChatGPT or Claude. Many platforms offer interfaces to interact with these models directly.
Prepare Data: Create a list of statements in the "A is B" format. For a more challenging test, consider using lesser-known facts or C-list celebrities with different last names from their parents.
Test Forward Statements: Begin by verifying the model's knowledge. For instance, if you're using celebrity-parent relationships, ask the model, "Who is [celebrity name]'s mother?" to confirm it knows the answer.
Test Reversed Statements: In a separate dialogue, challenge the model with the reverse question without giving any hints from the previous question. For the earlier example, you'd ask, "Who is the child of [parent's name]?" Ensure you don't include the celebrity's name in the dialogue.
Analyze Results: Compare the accuracy of the model's answers for forward and reverse questions. If the model struggles more with the reversed questions, it's exhibiting the "Reversal Curse."
Document Findings: Record the model's responses for both types of questions. This will help you analyze patterns and understand where the model excels or falls short.
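If you want to script this instead of doing it by hand, here’s a rough sketch against the OpenAI Python client (the v1-style API). The fact pair is one of the paper’s celebrity-parent examples, and the model name is just a placeholder.

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in your environment

client = OpenAI()

# Fact pair from the celebrity-parent examples: "forward" asks in the "A is B"
# direction the model likely saw in training; "reverse" asks "B is A".
pairs = [
    {"forward": "Who is Tom Cruise's mother?",
     "reverse": "Who is Mary Lee Pfeiffer's son?"},
]

def ask(question: str) -> str:
    # A fresh single-turn call each time, so the reverse question gets no hints.
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder; any chat model works
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

for pair in pairs:
    print("forward:", ask(pair["forward"]))
    print("reverse:", ask(pair["reverse"]))
```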
I recommend checking out this thread by one of the paper’s authors. It’s super informative.
That’s it for this one.
See you next week, and if there’s anything you want me to cover, shoot me an email.
Cheers,
Harpreet