Talking Generative Faces, Open-Source AI, and Events Galore!

Plus AI Pilots Fighter Jets and Tiny LLMs Pack a Punch

Harpreet Sahota

Apr 28, 2024

What’s up, everyone?

This edition of the newsletter is brought to you by Voxel51!

But first, thanks for reading and for your support!

If you found the newsletter helpful, please do me a favour and smash that like button!

You can also help by sharing this with your network (whether via re-stack, tweet, or LinkedIn post) so they can benefit from it, too.

🤗 This small gesture and your support mean far more to me than you can ever imagine!

Let’s get to it!

🗓️ Community Events

Monday, April 29: Virtual Open Office Hours with Professor Jason Corso

Wednesday, May 1st: AI Makerspace: End-to-end Prototyping with Llama 3

Thursday, May 2nd: AI, Machine Learning and Data Science Meetup. This virtual meetup will have three talks:

Who needs RLHF When You Have SFT? by Srishti Gureja
Making LLMs Safe & Reliable by Shiv Sakhuja
Develop a Legal Search Application from Scratch using The Milvus Project and DSPy by Mert Bozkir

✨ Blog of the Week

This week's good read is Nathan Lambert's slides from his guest lecture session for Stanford CS25, titled Aligning Open Language Models.

Honestly, I wouldn’t normally recommend slides as a good read (because that’s weird)…but this is an exception for three reasons:

I’m a huge fan of Nathan Lambert.
The slides break all the design principles people say slides should follow, so they’re full of text.
He links out to a lot of really good material, so lots of alpha from that perspective.

In the slides, he showcases the rapid progress in aligning open language models, driven by innovative techniques, community efforts, and the availability of open-source resources.

• Nathan briefly traces the evolution of LMs from Claude Shannon's early work in 1948 to the emergence of transformers in 2017 and the subsequent release of influential models like GPT-1, BERT, and GPT-3.

• GPT-3's rise in 2020, with its surprising capabilities and potential harms, highlighted the need for aligning LMs with human values and intentions.

Nathan also provides a brief history of the following:

• Instruction Fine-tuning (IFT): This technique involves training base LMs on specific instructions to improve their ability to follow user commands and engage in dialogue.

• Self-Instruct/Synthetic Data: This approach utilizes existing high-quality prompts to generate additional training data, significantly expanding the IFT dataset.

• Early Open-Source Models: He highlights several early open-source instruct models, such as Alpaca, Vicuna, Koala, and Dolly, which were based on LLaMA and fine-tuned using various datasets like ShareGPT and OpenAssistant.

• Evaluation Challenges: The slides discuss the emergence of evaluation methods like Chatbot Arena, AlpacaEval, MT-Bench, and Open LLM Leaderboard, each with strengths and limitations.

• RLHF and LoRA: He explores the use of Reinforcement Learning from Human Feedback (RLHF) and Low-Rank Adaptation (LoRA) techniques to further improve the alignment and efficiency of open-source models.

• OpenAssistant: The release of OpenAssistant, a large human-generated instruction dataset, was crucial in advancing open-source aligned models.

• StableVicuna: This model marked a significant step by being the first open-source model trained with RLHF using PPO.

• QLoRA & Guanaco: The development of QLoRA, a memory-efficient fine-tuning method, enabled the training of larger models like Guanaco, achieving state-of-the-art performance at the time.
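If you haven't met LoRA before, the core idea from the list above fits in a few lines. This is a minimal sketch of my own, not code from Nathan's slides: it assumes a frozen weight matrix W plus two small trainable matrices A and B of rank r, with the update scaled by alpha / r. The helper names (matmul, lora_effective_weight, trainable_params) and the toy numbers are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows in A)."""
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)  # same shape as W: d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def trainable_params(d_out, d_in, r):
    """Count the parameters in A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Toy example (hypothetical values): a 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]   # r x d_in
B = [[0.0], [0.0]]      # d_out x r, zero-initialised so training starts from W

# With B at zero the adapted layer behaves exactly like the frozen one.
assert lora_effective_weight(W, A, B, alpha=2.0) == W

# For a 4096 x 4096 layer at rank 8, compare adapter size to the full matrix.
full = 4096 * 4096
adapter = trainable_params(4096, 4096, r=8)
print(adapter, full, adapter / full)
```

In that last example the adapter has 65,536 trainable parameters against roughly 16.8 million in the full matrix, which is where the efficiency comes from. QLoRA pushes this further by keeping the frozen weights in a quantized format, which this sketch does not attempt to show.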

🛠️ GitHub Gems

COCONut is a modernized large-scale segmentation dataset that improves upon COCO in annotation quality, consistency across segmentation tasks, and scale, and it introduces a challenging new validation set.

It contains approximately 383K images with over 5.18M human-verified panoptic segmentation masks.
COCONut harmonizes annotations across semantic, instance, and panoptic segmentation tasks to address inconsistencies in the original COCO dataset.

Dataset Highlights

High-quality, human-verified panoptic segmentation masks
Consistent annotations across semantic, instance, and panoptic segmentation
Meticulously curated validation set (COCONut-val) with 25K images and 437K masks
Refined class definitions compared to COCO to improve annotation consistency

Construction

Uses an assisted-manual annotation pipeline leveraging modern neural networks to generate high-quality mask proposals efficiently
Human raters inspect and refine the machine-generated proposals
Achieves a significant increase in both scale and annotation quality compared to existing datasets

Check out the GitHub here. The dataset is on Kaggle, which you can find here.

📰 Industry Pulse

✈️ AI WINGS OF FURY: AI Agent-Powered Fighter Jet Takes to the Skies

Ok, that was kind of a clickbaity headline, but it's pretty much what happened!

The US Air Force Test Pilot School and the Defense Advanced Research Projects Agency (DARPA) have successfully installed AI agents in the X-62A VISTA aircraft as part of the Air Combat Evolution (ACE) program.

The teams made over 100,000 lines of flight-critical software changes across 21 test flights, culminating in the first-ever AI vs human within-visual-range dogfights. The breakthrough demonstrates that AI can be used safely in aerospace applications, paving the way for future advances. The X-62A VISTA will continue to serve as a research platform to advance autonomous AI systems in aerospace.

Looks like AI has finally passed with 'flying' colors in aerospace! I highly recommend checking out this YouTube video that DARPA put out to learn more.

Phi-3: where less is more, and 'mini' means maximum impact!

Microsoft has overshadowed the Llama-3 launch with their latest line of small language models (SLMs) - Phi-3!

The Phi-3 family includes three models: phi-3-mini with 3.8 billion parameters, phi-3-small with 7 billion, and phi-3-medium with 14 billion. phi-3-small and phi-3-medium compete with or outperform GPT-3.5 on nearly all the reported benchmarks, including MT-Bench (and by a decent amount). The exception is TriviaQA, where the models are not so good due to their limited capacity to store "factual knowledge," but honestly, that’s not even an interesting benchmark to care about.

What is interesting, though, is how they curated their dataset. They created a dataset built from simple words that a 4-year-old could understand. They also created synthetic datasets called "TinyStories" and "CodeTextbook" using high-quality data generated by larger language models. This supposedly makes the models less likely to give wrong or inappropriate answers.

Microsoft's Phi-3 SLMs are proof that sometimes, smaller is smarter.

Another one from Microsoft: VASA-1!

VASA-1 is a Microsoft Research project that generates realistic talking faces in real-time based on audio input.

VASA-1 brings talking faces to life with its cutting-edge technology. The system generates facial movements and expressions that perfectly sync with audio input, creating a seamless and realistic experience. Moreover, it does this in real-time, crafting animations on the fly as the audio is spoken. The result is a lifelike appearance that's uncannily similar to real human faces, complete with intricate skin texture, facial features, and nuanced expressions.

Seriously, this thing is a trip. Go check out the website. None of the images are of real people, but the lip-audio synchronization, expressive facial movements, and natural head motions fooled me.

🔍 Research Refined

Evaluating retrieval-augmented generation (RAG) systems, which combine information retrieval and language generation, has been a challenging task due to the reliance on extensive human annotations.

A recent research paper introduces ARES, a novel framework that addresses this issue by providing an automated, data-efficient, and robust evaluation approach.

We’ll summarize the paper using the PACES method.

Problem

Traditional methods for evaluating the quality of a RAG system's generated responses rely heavily on expensive and time-consuming human annotations, which can introduce subjectivity and inconsistency into the evaluation process. An automated evaluation framework called ARES has been proposed to address this issue. It leverages synthetic data generation and machine learning techniques to provide reliable and data-efficient assessments of RAG system performance.

Approach

The ARES approach has four components:

1. Synthetic Data Generation: ARES uses large language models to generate a synthetic dataset of query-passage-answer triples.

2. LLM Judges: ARES trains lightweight "judge" models to predict the quality of RAG system outputs.

3. Prediction-Powered Inference (PPI): ARES employs PPI to estimate confidence intervals for quality scores.

4. RAG System Ranking: ARES applies trained judges to evaluate the outputs of various RAG systems and rank them based on scores and confidence intervals.
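The PPI step is the least intuitive of the four, so here is a rough sketch of the idea for a single score (say, the fraction of answers that are faithful). It is my own simplified illustration, not the ARES codebase: it assumes the judge emits 0/1 labels, uses the standard PPI estimator for a mean (judge average on unlabeled data, corrected by the average judge error on a small human-labeled set), and builds a normal-approximation interval. The function name ppi_mean_ci and all the numbers are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def ppi_mean_ci(judge_unlabeled, judge_labeled, human_labeled, confidence=0.95):
    """Estimate a mean score and its confidence interval with prediction-powered inference.

    judge_unlabeled: judge predictions on the large set with no human labels
    judge_labeled:   judge predictions on the small human-annotated set
    human_labeled:   human labels for that same small set
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("judge_labeled and human_labeled must be the same length")

    # The rectifier measures how far the judge is from the humans, on average.
    rectifier = [h - j for h, j in zip(human_labeled, judge_labeled)]
    estimate = mean(judge_unlabeled) + mean(rectifier)

    # Uncertainty comes from both the unlabeled judge scores and the rectifier.
    std_err = math.sqrt(
        variance(judge_unlabeled) / len(judge_unlabeled)
        + variance(rectifier) / len(rectifier)
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


# Hypothetical toy data: 1 = judged faithful, 0 = judged unfaithful.
judge_unlabeled = [1, 0, 1, 1, 0, 1, 0, 1]
judge_labeled = [1, 0, 1, 1]
human_labeled = [1, 0, 0, 1]

estimate, (low, high) = ppi_mean_ci(judge_unlabeled, judge_labeled, human_labeled)
print(estimate, low, high)
```

The design point is that the judge does the bulk of the work, while the small human-labeled set is only used to correct the judge's bias and to size the interval. With real data you would want far more than a handful of examples for the normal approximation to be reasonable.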

Claim

The paper's main claim is that ARES provides an effective and efficient framework for evaluating RAG systems without relying heavily on human annotations.

The authors argue that ARES can accurately assess the performance of RAG systems in terms of context relevance, answer faithfulness, and answer relevance while significantly reducing the need for time-consuming and expensive human evaluations.

They claim that ARES has the potential to become a standard evaluation framework for RAG systems, enabling researchers and practitioners to assess and compare the performance of different RAG architectures more effectively.

Evaluation

ARES is evaluated on various datasets and is shown to rank RAG systems accurately using only limited human annotations. However, testing ARES on broader datasets and RAG system architectures would be valuable.

Substantiation

The evaluation results substantiate the paper's main claim that ARES is an effective and efficient framework for evaluating RAG systems.

The high correlation between ARES rankings and human judgments across different datasets and evaluation dimensions supports the claim that ARES can provide reliable assessments of RAG system performance while requiring significantly fewer human annotations than traditional approaches.

The ablation studies and robustness experiments further strengthen the validity of the proposed framework.
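If you want to check this kind of agreement on your own systems, a common way to put a number on how well two rankings match is Kendall's tau. The sketch below is my own illustration and assumes you have one automated score and one human score per RAG system; the function name kendall_tau and the scores are invented for the example, and it implements the simple tau-a variant, where tied pairs count as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Return Kendall's tau-a between two score lists for the same systems."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same systems")
    if len(scores_a) < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # A pair is concordant when both scorers order the two systems the same way.
        direction = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1

    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four RAG systems.
automated = [0.62, 0.71, 0.80, 0.91]
human = [0.60, 0.70, 0.90, 0.85]   # humans swap the top two systems

print(kendall_tau(automated, human))
```

A value of 1.0 means the two rankings agree on every pair of systems and -1.0 means they are exact opposites, so values close to 1.0 are what "high correlation with human judgments" looks like in practice.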

Thanks for reading!