Evaluating LLM Apps: How and Why?
In-depth analysis of LLM applications evaluation, generative AI's role in drug discovery, the latest AI audio tool, and fresh industry insights.
What’s up, everyone?
This week’s newsletter is brought to you by Voxel 51!
✨ Blog of the Week
tl;dr
Classification: Recall, Precision, ROC-AUC, Separation of Distributions
Summarization: Factual consistency via NLI, Relevance via reward modeling
Translation: Measure quality via chrF, BLEURT, COMET, COMETKiwi
Toxicity: Test with adversarial prompts from RealToxicityPrompts and BOLD
Copyright: Test with text from popular books and code
This week, I enjoyed reading Eugene Yan's practical piece, "A Builder's Guide to Evals for LLM-based Applications."
Most off-the-shelf evaluation metrics like ROUGE, BERTScore, and G-Eval are unreliable and impractical for evaluating real-world LLM applications. They don't correlate well with application-specific performance.
The blog is a comprehensive guide for evaluating LLMs across various applications, from summarization and translation to copyright and toxicity assessments.
If you're working on creating products that use LLM technology, I highly recommend bookmarking Eugene's post as a useful reference guide. It contains practical tips and considerations to help you build your evaluation suite. Here's what I learned:
Summarization
When it comes to summarization, factual consistency and relevance are key measurements.
Eugene discusses framing these as binary classification problems and using natural language inference (NLI) models fine-tuned on open datasets as learned metrics. With sufficient task-specific data, these can outperform n-gram overlap and embedding similarity metrics.
This method improves with more data and is a robust tool against factual inconsistencies or "hallucinations" in summaries.
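As a rough sketch of that binary-classification framing: suppose an NLI model has already scored each (source, summary) pair with an entailment probability, and humans have labelled each summary as consistent or not. The probabilities and labels below are made up purely for illustration.

```python
# Sketch: framing factual consistency as binary classification.
# `entail_probs` would come from an NLI model scoring (source, summary)
# pairs; here they are made-up numbers for illustration.

def precision_recall(probs, labels, threshold=0.5):
    """Classify a summary as factually consistent when the NLI
    entailment probability exceeds `threshold`, then score the
    classifier against human labels (1 = consistent)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

entail_probs = [0.91, 0.12, 0.78, 0.33, 0.85]  # hypothetical NLI scores
human_labels = [1, 0, 1, 1, 1]                 # hypothetical annotations
p, r = precision_recall(entail_probs, human_labels, threshold=0.5)
```

Sweeping the threshold and re-computing these numbers gives you the ROC curve mentioned in the tl;dr, and more labelled data lets you pick the operating point that matches your app's tolerance for hallucinations.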
Translation
Yan discusses both statistical and learned evaluation metrics.
He highlights chrF, a character n-gram F-score, for its language-independent capabilities and ease of use across various languages. He also talks about metrics like BLEURT and COMET, which use models like BERT to assess translation accuracy, and COMETKiwi, a reference-free variant.
Reference-free models like COMETKiwi show promise in assessing translations even without gold-standard human references.
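chrF itself is simple enough to sketch in a few lines. This is a simplified, self-contained version of the idea (the standard implementation lives in sacrebleu, which also handles whitespace and word-order details this sketch ignores):

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score (chrF).
    Averages precision/recall over n-gram orders 1..max_n,
    then combines them with recall weighted beta^2 times more."""
    def ngrams(text, n):
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if hyp:
            precisions.append(overlap / sum(hyp.values()))
        if ref:
            recalls.append(overlap / sum(ref.values()))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

score = chrf("the cat sat on the mat", "the cat sat on the mat")
```

Because it works on characters rather than words, chrF needs no tokenizer, which is what makes it so easy to apply across languages.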
Copyright
Eugene also discusses measuring undesirable behaviours like copyright regurgitation and toxicity.
He presents methods to evaluate the extent to which models reproduce copyrighted content, a concern for both legal risks and ethical considerations. He outlines how to quantify exact and near-exact reproductions through specific datasets and metrics, providing a framework for assessing and mitigating these risks.
Running LLMs on carefully curated prompt sets drawn from books, code, and web text can help quantify the extent to which copyrighted content sneaks into the outputs.
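One simple way to quantify near-exact reproduction (a sketch of the idea, not necessarily the exact metric Eugene describes) is the longest verbatim span shared between a generation and the source text, which Python's stdlib `difflib` can find directly:

```python
from difflib import SequenceMatcher

def regurgitation_score(output, source):
    """Length of the longest verbatim span shared between a model
    output and a copyrighted source, as a fraction of the output.
    A score near 1.0 means the output is mostly reproduced text."""
    m = SequenceMatcher(None, output, source)
    match = m.find_longest_match(0, len(output), 0, len(source))
    return match.size / len(output) if output else 0.0

book = "It was the best of times, it was the worst of times."
generation = "it was the worst of times."
score = regurgitation_score(generation, book)
```

Run across a prompt set sampled from books and code, the distribution of these scores gives a quick read on how often, and how badly, a model regurgitates.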
Toxicity
Yan doesn't shy away from the topic of toxicity in LLM outputs.
He discusses using datasets like RealToxicityPrompts and BOLD to generate prompts that could lead to harmful language, employing tools like the Perspective API to measure and manage toxicity levels.
This section underscores the ongoing need for vigilance and human evaluation to ensure LLMs produce safe and respectful content.
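To make the Perspective API workflow concrete, here's a sketch of the request payload and score extraction. The field names follow the API's documented shape as I recall it, so double-check against the official reference before relying on it, and the sample response values below are made up:

```python
# Sketch of scoring a model output with the Perspective API.
# The endpoint expects a POST with this JSON body (plus an API key);
# the network call itself is omitted here.

PERSPECTIVE_URL = (
    "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
)

def build_request(text):
    """Request body asking Perspective for a TOXICITY probability."""
    return {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }

def extract_toxicity(response):
    """Pull the summary toxicity probability out of a response dict."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

payload = build_request("You are all wonderful people.")
# A hypothetical response, shaped like what the API returns:
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.02, "type": "PROBABILITY"}}
    }
}
toxicity = extract_toxicity(sample_response)
```

In an eval suite you'd feed each generation through this scoring step and flag anything above a toxicity threshold for human review.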
The Human Element
Despite the advancements in automated metrics, human judgment remains the gold standard for complex tasks and nuanced assessments.
Eugene emphasizes the enduring importance of human evaluation, especially for complex reasoning tasks. Strategically sampling instances to label can help boost our evals' precision, recall, and overall confidence.
He advocates a balanced approach, combining automated tools with human insights to refine and improve LLM evaluations continuously.
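One way to sample strategically is to send the automated metric's least-confident cases to human annotators first, topping up with a random sample of the rest. This is a hypothetical sketch of that idea, not a procedure from the post:

```python
import random

def sample_for_labeling(examples, scores, k=4, band=(0.4, 0.6), seed=0):
    """Pick instances for human review, prioritising the ones the
    automated metric is least sure about (scores near 0.5), then
    topping up with a random sample of the rest."""
    rng = random.Random(seed)
    uncertain = [e for e, s in zip(examples, scores) if band[0] <= s <= band[1]]
    rest = [e for e, s in zip(examples, scores) if not band[0] <= s <= band[1]]
    picked = uncertain[:k]
    if len(picked) < k:
        picked += rng.sample(rest, k - len(picked))
    return picked

examples = [f"example-{i}" for i in range(10)]
scores = [0.05, 0.45, 0.92, 0.55, 0.98, 0.51, 0.10, 0.44, 0.88, 0.60]
batch = sample_for_labeling(examples, scores, k=4)
```

Labelling the borderline cases first is where human judgment buys the most precision and recall per annotation dollar.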
Eugene Yan's guide is a valuable source of knowledge for anyone involved in developing, assessing, or just interested in the capabilities and challenges of LLM-based applications.
Let me know if you found it helpful, too!
🛠️ GitHub Gems
This week's GitHub gem is a shameless promo from me!
I spent a few days reviewing the accepted papers, workshops, challenges, and tutorials for the CVPR 2024 conference and compiled what stood out most in this GitHub repo. Some of the highlights include Pixel-level Video Understanding in the Wild, Vlogger: Make Your Dream A Vlog, and Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes, which I found particularly exciting due to their potential impact on vision-language and multimodal models.
Admittedly, the content I've included is skewed towards my interests. If you have a different focus and think I've missed something important, by all means, contribute to the repo and help make it a more comprehensive resource for the community.
Are you going to be at CVPR?
I'll be with the crew from Voxel 51 at Booth 1519. Come by to say hello and get some sweet swag. You can find us right next to the Meta and Amazon Science booths!
Check out the repo, smash that ⭐ Star button, share it with your network, add your CVPR highlights, and let me know your thoughts.
I’m looking forward to seeing you at the conference!
📰 Industry Pulse
🎧 Stability AI has introduced Stable Audio 2.0, an AI model that generates high-quality, full-length audio tracks up to three minutes long, with coherent musical structure, from natural language prompts. It also lets users upload and transform their own audio samples and expands the sound-effect generation and style-transfer capabilities of its predecessor. Under the hood, a highly compressed autoencoder and a diffusion transformer produce tracks with large-scale coherent structure. The model was trained on a licensed dataset from AudioSparx and is free to use on the Stable Audio website. Stable Radio, a 24/7 live stream of Stable Audio-generated tracks, is now available on YouTube.
🔬 McMaster and Stanford University researchers have created SyntheMol, a generative AI model that rapidly designs small-molecule antibiotics to fight antibiotic-resistant bacteria. SyntheMol uses Monte Carlo tree search to create compounds from 132,000 building blocks and 13 chemical reactions. The AI model was trained on a database containing over 9,000 bioactive compounds and 5,300 synthetic molecules screened for growth inhibition against Acinetobacter baumannii. After generating 150 high-scoring compounds and synthesizing 58, six non-toxic drug candidates with antibacterial properties were identified. This proof-of-concept demonstrates the potential of generative AI to speed up drug discovery, address antimicrobial resistance, and improve healthcare worldwide, with the entire pipeline completed in just three months.
📈 Time series forecasting uses historical data patterns to support decisions across industries, but the field has been slow to adopt advanced AI techniques. Recently, transformer-based models trained on large datasets have emerged, offering sophisticated capabilities for accurate forecasting and analysis. Notable examples include TimesFM, Lag-Llama, Moirai, Chronos, and Moment. These models help businesses and researchers navigate complex time series data, offering a glimpse into the future of time series analysis.
🧠 While generative AI has generated a lot of hype and excitement, Gartner's recent survey of over 500 business leaders found that only 24% are currently using or piloting generative AI tools. The majority, 59%, are still exploring or evaluating the technology. The top use cases are software/application development, creative/design work, and research and development. Barriers to adoption include governance policies, ROI uncertainty, and a lack of skills. Gartner predicts 30% of organizations will use generative AI by 2025.
🔍 Research Refined
Anthropic released a paper titled "Many-shot Jailbreaking," which unveils a startling vulnerability in long-context LLMs.
The paper lays out the problem of adversarial attacks on LLMs, the methodology for exploiting long-context capabilities, the key findings on how well such attacks scale, and the challenges of mitigating them effectively.
The research shows how even the most advanced large language models can be manipulated into undesirable behaviours by abusing their extended context windows. This is done by prepending hundreds of carefully crafted demonstrations, a technique the authors call many-shot jailbreaking (MSJ). MSJ extends few-shot jailbreaking, where the attacker prompts the model with a fictitious dialogue containing a series of queries the model would typically refuse to answer.
These queries could include instructions for picking locks, tips for home invasions, or other illicit activities.
Jason Corso published a blog post about how to read research papers a while back. His PACES (problem, approach, claim, evaluation, substantiation) method has been my go-to for understanding papers. Using the PACES method, here's a summary of the key aspects of the paper titled "Many-shot Jailbreaking".
Problem
The paper addresses LLMs' susceptibilities to adversarial attacks using long-context prompts demonstrating undesirable behaviour. This vulnerability emerges due to expanded context windows in models by Anthropic, OpenAI, and Google DeepMind, posing a new security risk.
Approach
The authors investigate "Many-shot Jailbreaking (MSJ)" attacks, where the model is prompted with hundreds of examples of harmful behaviour, exploiting the large context windows of recent LLMs. This method scales attacks from few-shot to many-shot, testing across various models, including Claude 2.0, GPT-3.5, GPT-4, Llama 2, and Mistral 7B. The study covers the effectiveness of these attacks across different models, tasks, and context formatting to understand and characterize scaling trends and evaluate potential mitigation strategies.
Claim
The paper claims that the effectiveness of many-shot jailbreaking attacks on LLMs follows a power law as the number of demonstrations increases, an effect that is most pronounced in longer contexts. This effectiveness challenges current mitigation strategies and suggests that long contexts offer a new attack surface for LLMs.
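A power law means the relationship is linear in log-log space, so you can check whether your own attack logs follow the paper's claimed trend with a simple regression. The compliance rates below are synthetic numbers for illustration, not figures from the paper:

```python
import math

def fit_power_law(shots, rates):
    """Least-squares fit of rate = a * shots^b via linear
    regression on (log shots, log rate); returns (a, b)."""
    xs = [math.log(s) for s in shots]
    ys = [math.log(r) for r in rates]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = math.exp(my - b * mx)
    return a, b

# Synthetic compliance rates that grow with the number of shots:
# each 4x increase in shots doubles the rate, i.e. rate ~ shots^0.5.
shots = [1, 4, 16, 64, 256]
rates = [0.01, 0.02, 0.04, 0.08, 0.16]
a, b = fit_power_law(shots, rates)
```

A positive exponent `b` on real logs would mean more in-context demonstrations reliably buy the attacker more compliance, which is exactly why longer context windows widen the attack surface.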
Evaluation
The authors evaluate the approach using datasets focused on malicious use cases, personality assessments, and opportunities for the model to insult. They measure success by the frequency of model compliance with harmful prompts and observe the scaling effectiveness of the attack across model sizes and formatting variations. Notably, the paper finds that current alignment techniques like supervised learning (SL) and reinforcement learning (RL) increase the difficulty of successful attacks but don't eliminate the threat.
Substantiation
The evaluation convincingly supports the paper's claim that MSJ is a potent attack vector against LLMs, with effectiveness that scales predictably with the number of shots. Despite efforts at mitigation, the findings suggest that long contexts allow for an attack that can circumvent current defences, posing risks for model security and future mitigation strategies.
Thanks for reading!
If you found the newsletter helpful, please do me a favour and smash that like button!
You can also help by sharing this with your network (whether via re-stack, tweet, or LinkedIn post) so they can benefit from it, too.
🤗 This small gesture and your support mean far more to me than you can ever imagine!
Cheers,
Harpreet