Llama 4 is Natively Multimodal. What does that mean? - Awesome MLSS Newsletter

3rd Edition

Early last month, Meta unveiled the Llama 4 herd of models. The largest models in this series are natively multimodal, with Llama 4 Behemoth’s performance surpassing competitors like GPT-4.5, Claude 3.7 Sonnet and Gemini 2.0 Pro on multiple STEM benchmarks, even though it is still in training.

Models that accept inputs and produce outputs in multiple modalities are called multimodal: for example, a model that can generate and edit images from nothing more than text descriptions.

This week, we dive deep into natively multimodal models, right after a few updates.

Upcoming Summer School Announcements

Applications for most of the following summer schools are closing in the next 20 days. Make sure to apply to them before the application deadline!

For the complete list, please visit our website

Some Research Highlights

Never say never, because AI models cannot understand it

An MIT study reveals that vision-language models do not understand negation, i.e. words like ‘no’ or ‘doesn’t’.

Beauty? Nope. Creativity lies in the eyes of the beholder

A study from Aalto University finds that people are far more likely to judge work done by an AI as ‘creative’ the more of the creation process they are shown.

What’s happening in AI?

Most AI research so far has focused on language-based models, i.e. LLMs. This isn’t surprising, considering that most publicly available data is text. Language is the easiest way for people to express themselves, but we do not reason with language alone.

Yann LeCun has reiterated this point in several appearances: “In 4 years, a child has seen 50 times more data than the biggest LLMs.”

So how do we move past pure text and involve other modalities: speech, vision, video, and more? Enter Multimodal Models.

So how do they work?

All of these models work by compressing input data into a latent vector space and then processing it numerically. The resulting vectors are what we refer to as embeddings, and they form the basis of all transformer networks, whether for language, audio, or vision.
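As a rough illustration (the vocabulary size, patch size, and embedding dimension below are toy values we picked for the example, not anything a real model uses), here is how text tokens and image patches can both be mapped into vectors of the same size:

```python
import torch
import torch.nn as nn

d_model = 512  # toy embedding dimension

# Text: integer token ids -> learned embedding vectors
text_embed = nn.Embedding(num_embeddings=32000, embedding_dim=d_model)

# Vision: flattened 16x16 RGB patches -> linear projection into the same space
patch_embed = nn.Linear(16 * 16 * 3, d_model)

token_ids = torch.randint(0, 32000, (1, 10))  # a 10-token sentence
patches = torch.randn(1, 196, 16 * 16 * 3)    # a 224x224 image as 196 patches

text_vectors = text_embed(token_ids)    # shape (1, 10, 512)
image_vectors = patch_embed(patches)    # shape (1, 196, 512)
```

Once both modalities live in the same vector space, the rest of the network can treat them uniformly.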

From a bird’s-eye view, any multimodal network has three components: the input module, the fusion module, and the output module.

The input module takes data from the different modalities and converts it into corresponding embedding features. The fusion module then takes these input features and ‘fuses’ them, i.e. combines their information into a single latent vector space. The fused features are passed to the output module, which uses the encoded information to generate outputs in different modalities. If you’d like a more technical explanation, here is a great paper to read. Below is a very simplified view of the architecture; for a more detailed overview, we recommend reading Sebastian Raschka’s blog article.

A simplified overview of multimodal network architecture
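If code is easier to parse than diagrams, here is a toy PyTorch sketch of the same three modules. To be clear, the layer sizes, the concatenation-based fusion, and the language-modelling output head are our own illustrative choices, not how Llama 4 or any particular model is actually built:

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, patch_dim=16 * 16 * 3):
        super().__init__()
        # Input module: one encoder per modality, mapping raw inputs to embeddings
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_embed = nn.Linear(patch_dim, d_model)
        # Fusion module: a shared transformer that mixes information across modalities
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Output module: here, a language-modelling head over the fused representation
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, patches):
        text = self.text_embed(token_ids)                     # (batch, text_len, d)
        image = self.patch_embed(patches)                     # (batch, patch_len, d)
        fused = self.fusion(torch.cat([image, text], dim=1))  # one joint sequence
        return self.lm_head(fused)                            # per-position token logits

model = ToyMultimodalModel()
logits = model(torch.randint(0, 32000, (1, 10)), torch.randn(1, 196, 16 * 16 * 3))
```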

The key question, then, is when the fusion occurs.

Late Fusion combines outputs from separately trained models (like a language model and an image model) using a smaller bridging fusion model. It's simpler but less tightly integrated. 

Early Fusion trains everything jointly — all modalities processed together, enabling the model to learn richer correlations, like linking a caption to an image at a conceptual level.
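Here is a rough contrast between the two in the same toy PyTorch setup. The encoders are random stand-ins for what would normally be large pretrained language and vision models:

```python
import torch
import torch.nn as nn

d = 512
text_feats = torch.randn(1, 10, d)    # stand-in for a pretrained text encoder's output
image_feats = torch.randn(1, 196, d)  # stand-in for a pretrained image encoder's output

# Late fusion: pool each modality separately, then train only a small bridging head
# on top of the separately trained encoders.
bridge = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
late = bridge(torch.cat([text_feats.mean(dim=1), image_feats.mean(dim=1)], dim=-1))

# Early fusion: put the token/patch embeddings into one sequence and let a single
# jointly trained transformer attend across both modalities from the start.
joint_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
early = joint_layer(torch.cat([image_feats, text_feats], dim=1))
```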

Models that are trained jointly with early fusion across multiple modalities in a single neural network are called natively multimodal networks (NMMs). These models are designed to reason over all data modalities simultaneously, and that brings a number of benefits.

  1. Higher accuracy on cross-modal tasks: Natively multimodal models perform better at visual question answering, multimodal sentiment analysis, and similar tasks.

  2. Cross-modal transfer learning: The model learns from all modalities at once, so knowledge gained in one modality can improve reasoning in another. We are already seeing this in how complex design tasks can now be carried out from text descriptions alone.

  3. Flexible user experiences: Voice interactions with AI assistants previously needed a transcription stage, then an LLM, then a text-to-speech model. NMMs process speech to speech directly, and can reason over the audio without a separate pass. This reduces latency and opens up many new possibilities (see the sketch after this list).
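To make the difference in benefit 3 concrete, here is a sketch of the two designs. The objects and method names (asr_model.transcribe, llm.generate, tts_model.synthesize, nmm.generate) are hypothetical placeholders, not a real library's API:

```python
def cascaded_assistant(audio, asr_model, llm, tts_model):
    """Classic pipeline: three separate models, three separate passes."""
    transcript = asr_model.transcribe(audio)   # speech -> text
    reply_text = llm.generate(transcript)      # text -> text (all reasoning happens here)
    return tts_model.synthesize(reply_text)    # text -> speech

def native_assistant(audio, nmm):
    """Natively multimodal: one model listens, reasons, and replies in audio."""
    return nmm.generate(audio, output_modality="speech")  # hypothetical interface
```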

Of course, this is not an exhaustive list. There are many other benefits as well, but we wanted to highlight the ones with the most direct impact. Several open research questions also remain, such as:

  1. Data efficiency and alignment without pairing: how do we train multimodal models when data in one modality is sparse or noisy, or when we simply don’t have paired data?

  2. Adversarial and spurious correlations: how can we be sure that adversarial attacks delivered through one modality can be defended against? How can we verify that the cross-modal connections a model learns are valid?

  3. Scalability and continual learning: do scaling laws hold for multimodal models as well? How can we fine-tune one modality without destabilising the others?

  4. Explainability and controllability: how can we explain a model’s outputs? How can we steer a model to rely more on one modality than another?

Curious about one of these challenges? Reply to this email with the one that fascinates you most — we’ll send you curated papers, links, or even set up a discussion thread.

Awesome Machine Learning Summer Schools is a non-profit organisation that keeps you updated on ML Summer Schools and their deadlines. Simple as that.

Have any questions or doubts? Drop us an email! We would be more than happy to talk to you.

With love, Awesome MLSS
