Google’s Latest Approaches to Multimodal Foundational Model | by Eileen Pangu | Aug, 2023

Multimodal foundational models are even more exciting than large language models. Let’s review Google Research’s recent progress to get a glimpse of the bleeding edge.

Eileen Pangu
Towards Data Science

While the hype around large language models (LLMs) is still red hot in the industry, leading research organizations have turned their attention to multimodal foundational models: models that have the same scale and versatility as LLMs but can handle data beyond text, such as images, audio, and sensor signals. Many believe multimodal foundational models are the key to unlocking the next phase of advances in Artificial Intelligence (AI).

In this blog post, we take a closer look at how Google approaches multimodal foundational models. The content covered in this blog post is drawn from the key methods and insights of Google’s recent papers, for which we provide references at the end of this article.

Why Should You Care

Multimodal foundational models are exciting, but why should you care? You may be:

  • an AI/ML practitioner who wants to catch up with the latest research developments in the field, but doesn’t have the patience to go through dozens of new papers and hundreds of pages of surveys.
  • a current or emerging industry leader who is wondering what’s next after large language models, and is thinking about how to align your business with the new trends in the tech world.
  • a curious reader who may end up being a consumer of current or future multimodal AI products, and wants a visual and intuitive understanding of how things work behind the scenes.

For all of these audiences, this article provides an overview to jump-start your understanding of multimodal foundational models, a cornerstone of more accessible and helpful AI in the future.

One more thing to note before we dive in: when people talk about multimodal foundational models, they often mean the input is multimodal, consisting of text, images, videos, signals, etc. The output, however, is always just text. The…

This post originally appeared on TechToday.