Mixtral-8x7B: Understanding and Running the Sparse Mixture of Experts


How to efficiently outperform GPT-3.5 and Llama 2 70B

Benjamin Marie
Towards Data Science
Image by 8385 from Pixabay

Most of the recent large language models (LLMs) use very similar neural architectures. For instance, Falcon, Mistral, and Llama 2 all rely on a comparable combination of self-attention and MLP modules.

In contrast, Mistral AI, which also created Mistral 7B, just released a new LLM with a significantly different architecture: Mixtral-8x7B, a sparse mixture of 8 expert models.

In total, Mixtral contains 46.7B parameters. Yet, thanks to its architecture, Mixtral-8x7B can run efficiently on consumer hardware: inference is significantly faster than with other models of similar size, while Mixtral outperforms them on most tasks.

In this article, I explain what a sparse mixture of experts is and why it is faster for inference than a standard model. Then, we will see how to use and fine-tune Mixtral-8x7B on consumer hardware.

I have implemented a notebook demonstrating QLoRA fine-tuning and inference with Mixtral-8x7B here:

Get the notebook (#32)

Image by the author

A sparse mixture of experts (SMoE) is a type of neural network architecture designed to improve the efficiency and scalability of traditional models. The concept of a mixture of experts was introduced to allow a model to learn different parts of the input space using specialized “expert” sub-networks. In Mixtral, there are 8 expert sub-networks.
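To make this concrete, here is a minimal, illustrative sketch of a sparse MoE layer in PyTorch. It is not Mixtral's actual implementation (the class name and the simplified expert MLPs are my own): a small router scores the experts for every token, only the top-k experts are evaluated, and their outputs are combined with the renormalized router weights. In Mixtral, k=2, so only 2 of the 8 expert MLPs run for each token.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    # Illustrative sparse MoE layer: a router picks the top-k experts per
    # token and only those experts' MLPs are evaluated.
    def __init__(self, hidden_size, intermediate_size, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size, bias=False),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (num_tokens, hidden_size)
        logits = self.router(x)                # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

Because only 2 of the 8 expert MLPs are active per token, the compute per token is roughly that of a ~13B dense model rather than a 47B one, which is why inference is faster than with dense models of comparable total size.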

Note that the “8x7B” in the name of the model is slightly misleading. The model has a total of 46.7B parameters, which is almost 10B fewer than what 8x7B would suggest. Indeed, Mixtral-8x7B is not a 56B parameter model since several modules, such as the self-attention modules, are shared by the 8 expert sub-networks.
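As a sanity check, a back-of-the-envelope count reproduces the 46.7B figure. The calculation below is mine, not part of the original article, and assumes Mixtral's published configuration: 32 layers, hidden size 4096, expert intermediate size 14336, 8 experts per layer, grouped-query attention with 8 key-value heads of dimension 128, and a 32,000-token vocabulary.

# Rough parameter count for Mixtral-8x7B (assumed published config values)
hidden, inter, layers, experts, vocab = 4096, 14336, 32, 8, 32_000
kv_dim = 8 * 128                                        # 8 KV heads of dim 128

expert_mlp = 3 * hidden * inter                         # gate, up and down projections
attention  = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o + k, v projections
router     = hidden * experts                           # gating layer

per_layer  = experts * expert_mlp + attention + router
embeddings = 2 * vocab * hidden                         # input embeddings + LM head

total = layers * per_layer + embeddings
print(f"{total / 1e9:.1f}B parameters")                 # ≈ 46.7B, not 8 x 7B = 56B

The 8 expert MLPs account for roughly 45B of these parameters; the self-attention modules, router, and embeddings, which are shared, add only about 1.6B more.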

If you load and print the model with Transformers, the structure of the model is easier to understand:

MixtralForCausalLM(…
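The snippet below is a sketch I added for illustration, not the article's notebook. It assumes transformers, accelerate, and bitsandbytes are installed; the model is loaded in 4-bit so that the ~47B parameters take roughly 24 GB, which device_map="auto" will spread across GPU and CPU memory if your GPU is smaller. The final print produces the MixtralForCausalLM module tree shown above (truncated here).

# Minimal sketch: load Mixtral-8x7B quantized to 4-bit and print its module tree
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

print(model)   # shows the 8 experts inside each decoder layer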


