Get up to speed on Large Multimodal Models and hack around with one in a notebook I made for you!
Plus news from the industry, good learnings, and how to curate multimodal datasets
What’s up everyone!
This week’s newsletter is brought to you by Voxel51!
🗓️ Events from the Community
Wednesday, April 3 - Exploring and Understanding Multimodal Embeddings
Wednesday, April 3 - The State of AI Engineering Jobs in 2024
✨ Blog of the Week
Think about how you experience the world: you see, hear, touch, and talk.
As a human, you have the uncanny ability to process and interact with the world using multiple modes of data simultaneously. You can output data in various ways, whether speaking, writing, typing, drawing, singing, or more. Developing AI systems that can operate in the "real world" means building models that understand the world as you do. It requires models that can take in multiple input types, reason over that input, and generate output across different modalities.
This week's blog pick is one from Chip Huyen, who examined large multimodal models (LMMs) in depth.
Chip's blog post discusses the latest advancements in training multimodal models. It’s split into three parts:
Understanding Multimodal: This section discusses the context of multimodality, including its importance, different data modalities, and types of multimodal tasks. It emphasizes the significance of combining images and text, both to generate content and to build a more comprehensive understanding of the world.
Multimodal Training: This part discusses training a multimodal system, using CLIP and Flamingo as examples. It covers the components of a multimodal system, including encoders for each data modality, alignment of embeddings, and, for generative models, a language model to generate text responses (see the short CLIP sketch after this list).
Research Directions for LMMs: The final section discusses active research areas for LMMs, such as generating multimodal outputs, incorporating more data modalities, and creating adapters for more efficient multimodal training.
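To make the CLIP piece concrete, here's a minimal sketch of zero-shot image-text matching with the openai/clip-vit-base-patch32 checkpoint from Hugging Face Transformers. The demo image URL is the standard two-cats picture used in the Transformers docs, and the candidate captions are placeholders of mine:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Standard demo image from the Transformers docs (two cats on a couch)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["a photo of two cats", "a photo of a dog", "a diagram of a network"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a distribution over the candidate captions
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```

CLIP's contrastive training pulls the image encoder and text encoder into a shared embedding space, which is exactly the "alignment of embeddings" component Chip describes.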
It's a long read, but it's time well spent. I highly recommend checking it out. My main takeaways are summarized below.
The Essence of Multimodality
Chip outlines that multimodality involves interactions between different data types, including text, images, audio, etc. She mentions that it can mean one or more of the following:
Multiple Input and Output Modalities: A model could input text and output an image, or vice versa. This versatility allows AI systems to engage in more complex tasks, such as generating descriptive text from an image (image captioning) or creating a visual representation based on a textual description.
Multimodal Inputs: An example would be an AI system that can analyze and understand content that includes images, text, or audio to make decisions or provide insights. This approach requires a comprehensive understanding of different data types to accurately interpret the context or content, such as sentiment analysis across text and image data.
Multimodal Outputs: This could involve an AI system that, given a specific input, can produce a textual summary and a relevant image or graphical representation. This capability enhances user interactions and provides more dynamic and informative responses in applications like educational tools or content creation platforms.
Data Modalities
Data modalities refer to the different forms of data, such as text, images, audio, tabular data, etc. Understanding and working with multiple data modalities is crucial for AI systems to operate effectively in the real world, mirroring human intelligence's multimodal nature.
One highlighted aspect is the ability to represent or approximate one data modality in terms of another. For example:
Audio can be visualized as images through mel spectrograms (see the sketch after this list).
Speech can be transcribed into text, albeit losing some nuances like intonation and volume.
Images can be converted into vectors and then represented as sequences of text tokens.
Videos are treated as sequences of images, sometimes along with the audio.
Text can be visualized by taking a picture of it.
Data tables can be transformed into charts or images.
Images can represent other data types; visual data are abundant from sources like phones and webcams.
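As a concrete example of the first conversion above, here's a minimal sketch that renders audio as an image via a mel spectrogram using librosa. It loads librosa's bundled trumpet example clip (fetched on first use), and the output filename is a placeholder of mine:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load librosa's bundled trumpet example (downloaded on first use)
y, sr = librosa.load(librosa.ex("trumpet"))

# Compute a mel-scaled spectrogram and convert power to decibels
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

fig, ax = plt.subplots()
img = librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set_title("Audio rendered as an image: mel spectrogram")
fig.savefig("mel_spectrogram.png")  # placeholder filename
```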
Conversely, text is highlighted as a powerful output modality because of its applicability across various tasks, including summarization, translation, reasoning, and question-answering.
While there is a breadth of data modalities, images and text have the most widespread applicability, and existing research and development efforts are centered on these two modalities.
Chip Huyen's exploration into multimodality and LMMs offers an excellent overview of the topic, a thorough explanation of groundbreaking models like CLIP and Flamingo, plus a compelling glimpse into the future of AI, where the integration of diverse data types promises to unlock new levels of intelligence and utility.
🛠️ GitHub Gems
LLaVA-NeXT (aka LLaVA-1.6) is here!
It's the improved version of LLaVA (Large Language and Vision Assistant), an open-source multimodal AI assistant capable of processing both text and images.
Here are the key improvements in LLaVA-NeXT compared to the previous LLaVA-1.5 version:
According to the authors, LLaVA-NeXT-34B outperforms Gemini Pro, a state-of-the-art multimodal model, on several benchmarks. It achieves state-of-the-art performance across 11 benchmarks with simple modifications to the original LLaVA model.
The authors curated high-quality user instruction data that meets two criteria: diverse task instructions and superior responses. They combined two data sources: (1) existing GPT-4V data from LAION-GPT-V and ShareGPT-4V, and (2) a 15K visual instruction tuning dataset from the LLaVA demo covering various applications. They also filtered out potentially harmful or privacy-sensitive samples and generated responses with GPT-4V.
Major improvements include enhanced reasoning capabilities, optical character recognition (OCR), and world knowledge compared to the previous LLaVA-1.5 model.
LLaVA-NeXT has an emerging zero-shot capability in Chinese despite only being trained on English multimodal data. Its Chinese multimodal performance is surprisingly good.
The model is compute and data efficient. LLaVA-NeXT-34B was trained with 32 GPUs for about 1 day using only 1.3 million data samples, a compute and data cost 100-1000x lower than that of other models.
In qualitative tests, LLaVA-NeXT-34B demonstrated strong visual comprehension and question-answering abilities on images, such as identifying people and understanding context from social media screenshots.
LLaVA-NeXT has been open-sourced and is available in the Hugging Face Transformers library, making it one of the best open-source vision-language models currently available.
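Here's a minimal sketch of running it through Transformers, using the llava-hf/llava-v1.6-mistral-7b-hf checkpoint. The image URL and question are placeholders of mine, and the [INST] prompt template is specific to the Mistral-7B variant:

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder image URL -- swap in any image you like
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The [INST] ... [/INST] template is specific to the Mistral-7B variant
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```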
LLaVA-NeXT is a massive advancement in open-source multimodal AI, making powerful visual-language capabilities more widely accessible to researchers and developers.
Get hands-on with the model and see it in action yourself with this notebook I created for you!
📰 Industry Pulse
🤖 Databricks has entered the generative AI arena by announcing DBRX, a model that rivals OpenAI's GPT series and Google's Gemini. With a hefty $10 million investment and two months of training, DBRX is positioned as a language processing powerhouse available for research and commercial use. However, its high hardware requirements and potential limitations raise questions about its accessibility and overall impact.
🌍 Google is stepping up its game in the travel sector with updates to make travel planning more integrated and insightful. At the heart of these updates is an enhanced Search Generative Experience (SGE) roll-out that leverages AI to help users craft personalized travel itineraries. This feature, currently in its experimental phase and available only in English in the U.S., promises to revolutionize how we think about planning our trips. Drawing from a vast pool of web resources, reviews, and user-submitted content, Google's AI aims to deliver comprehensive trip ideas that cater to specific requests, such as a history-focused three-day trip to Philadelphia.
🚚 Tesla's Full Self-Driving (FSD) technology is making rapid progress, with implications that could dramatically alter the automotive and transportation industries. While many currently focus on the adoption rates among Tesla buyers and the future of Robotaxis, this article argues that Robotrucking represents an even more valuable and swiftly approaching reality. The transition to Robotrucking involves overcoming regulatory hurdles and retrofitting existing truck fleets with autonomous technology, which could dramatically increase Tesla's profits and influence in the transportation sector. With detailed analysis and projections, the piece explores the financial, logistical, and societal impacts of the widespread adoption of autonomous trucking technology.
🎨 Adobe has unveiled Firefly Services, a suite of over 20 generative and creative APIs, tools, and services. This new offering aims to democratize Adobe's AI-powered features from its renowned Creative Cloud tools, such as Photoshop, making them accessible to enterprise developers. The goal? To streamline content creation within custom workflows and inspire the development of novel solutions. Moreover, Adobe is introducing Custom Models, enabling businesses to refine Firefly models with their own assets, which are integrated seamlessly into Adobe’s GenStudio.
👨🏽‍💻 Good Tutorials and Learning
• The repo for Qwen1.5-MoE is clear and shows you how to get started with this new model easily. It’s a small MoE model with only 2.7 billion activated parameters, yet it matches the performance of state-of-the-art 7B models like Mistral 7B and Qwen1.5-7B (see the sketch after this list).
• HuggingFace CTO Thomas Wolf released a great video tutorial titled “A little guide to building LLMs in 2024”.
• Maxime Labonne shared a Colab notebook comparing FP16 Llama 2-7B against a 1-bit version quantized with HQQ + LoRA; SFT greatly improves the quantized model, making it possible to fit larger models into smaller memory footprints.
• Omar from Hugging Face wrote a post on X, a mini-tutorial on three types of Mixture of Experts (MoE): Pre-trained MoE, upcycled MoEs, and FrankenMoEs.
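If you want to poke at Qwen1.5-MoE without reading the whole repo, here's a minimal generation sketch using the Qwen/Qwen1.5-MoE-A2.7B-Chat checkpoint, assuming a transformers build recent enough to include Qwen2-MoE support; the prompt is a placeholder of mine:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Placeholder prompt of mine
messages = [{"role": "user", "content": "In one line, what is a mixture-of-experts model?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt tokens before decoding the reply
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```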
🔍 Research Refined
A while back, Jason Corso published a blog post about how to read research papers. His PACES (problem, approach, claim, evaluation, substantiation) method has been my go-to for understanding papers.
This week, I’ll apply his methodology to "The Role of Data Curation in Image Captioning" by Wenyan Li, Jonas F. Lotz, Chen Qiu, and Desmond Elliott.
Problem
Image captioning models are typically trained by treating all samples equally, ignoring variations in captions or the presence of mismatched or hard-to-caption data points. This negatively impacts a model's ability to generate captions accurately because it can "confuse" the model during training. This paper investigates whether actively curating difficult samples within datasets can enhance model performance without increasing the total number of samples.
Approach
The paper introduces three data curation methods to improve image captioning models by actively curating difficult samples within the dataset. These methods are:
REMOVAL: Completely removing high-loss samples from the training process (a minimal sketch of this strategy follows the list).
REPLACECAP: Replacing the caption of a high-loss sample with another caption from the dataset or a caption generated by a language model.
REPLACEIMG: Replacing the image of a high-loss sample with a newly synthesized image generated from its captions by a text-to-image model.
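To make the idea concrete, here's a minimal sketch of the REMOVAL strategy under a few assumptions of mine: a Hugging Face-style captioning model that returns a .loss, a map-style dataset, and a 10% drop fraction. The paper tunes the extent of curation per dataset, so treat this as illustrative rather than the authors' exact recipe:

```python
import torch

@torch.no_grad()
def rank_by_loss(model, dataset, collate_fn, device="cuda"):
    """Score each (image, caption) sample by the model's loss on it."""
    model.eval().to(device)
    scored = []
    for idx in range(len(dataset)):
        batch = collate_fn([dataset[idx]])  # batch of one sample
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss.item()   # assumes the model returns .loss
        scored.append((loss, idx))
    return sorted(scored, reverse=True)     # highest-loss samples first

def removal(dataset, scored, fraction=0.10):
    """REMOVAL: drop the top `fraction` of high-loss samples."""
    n_drop = int(len(dataset) * fraction)
    dropped = {idx for _, idx in scored[:n_drop]}
    return [dataset[i] for i in range(len(dataset)) if i not in dropped]
```

REPLACECAP and REPLACEIMG would reuse the same ranking step, swapping the drop for a caption rewrite or a text-to-image call on the flagged samples.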
Claim
The authors claim that actively curating difficult samples in datasets, without increasing the total number of samples, enhances image captioning performance through three data curation methods: complete removal of a sample, caption replacement, or image replacement via a text-to-image generation model.
Their experiments show that the best strategy varies between datasets but is generalizable across different vision-language models.
Evaluation
The methods were evaluated with two state-of-the-art pretrained vision-language models (BLIP and BEiT-3) on widely used datasets (MS COCO and Flickr30K), focusing on how these curation methods impact the performance of image captioning models. They used metrics like CIDEr and BLEU scores to measure improvements.
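As an illustration of how such scoring works mechanically, here's a hedged sketch using the BLEU metric from Hugging Face's evaluate library; the candidate and reference captions are toy examples of mine, and the paper itself uses standard COCO evaluation tooling:

```python
import evaluate

bleu = evaluate.load("bleu")

# Toy captions of mine: one prediction, two reference captions for the image
predictions = ["a dog runs across a grassy field"]
references = [[
    "a dog is running through the grass",
    "a brown dog runs across a field",
]]

print(bleu.compute(predictions=predictions, references=references)["bleu"])
```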
The study finds that Flickr30K benefits more from removing high-loss training samples, suggesting it may be noisier than MS COCO.
Substantiation
The study discovered that:
A hybrid approach combining REMOVAL and REPLACEIMG methods yielded the best results, suggesting that a sophisticated strategy leveraging multiple curation methods could offer more flexibility and effectiveness in curating datasets.
The extent of curation and the choice of method are critical and dataset-dependent, with some datasets benefiting more from certain curation strategies than others.
The potential of using synthesized images for training was acknowledged, although the quality of the generated images limited its benefits.
The findings underscore the importance of data curation in training more effective image captioning models and open up avenues for applying similar frameworks to other multimodal tasks.