Gemini 2.0 is here and promises to be able to do (almost) anything
Google has used the final days of the year to launch its most anticipated artificial intelligence model: Gemini 2.0. This next-generation model promises a major step forward in intelligence and capabilities.
Where the previous version focused on multimodality, version 2.0 is built around AI agents, which can act more autonomously and solve complex problems with less human intervention. With this, Google places itself at the forefront of the race for the most advanced AI models on the market. Here are all the details!
Gemini 2.0 Introduction
On the occasion of the launch of Gemini 2.0, Sundar Pichai, CEO of Google and Alphabet, shared the following: “Information is at the heart of human progress. That’s why for more than 26 years we’ve been focused on our mission to organize the world’s information and make it accessible and useful. And that’s why we continue to push the boundaries of AI to organize that information across every input and make it accessible through any output, so that it can be truly useful to you. (…) Today, millions of developers are building with Gemini, which is helping us reimagine all of our products (including all 7 with 2 billion users) and create new ones. Over the last year, we have invested in developing more agentic models, meaning they can better understand the world around you, anticipate, and act on your behalf, under your supervision. Today we are excited to launch our next era of models built for this new agentic era: we are introducing Gemini 2.0, our most capable model yet. With new advances in multimodality (such as native image and audio output) and native tool use, it will allow us to build new AI agents that bring us closer to our vision of a universal assistant (…)”.
In this video you can see a summary of the new capabilities of the model:
Gemini 2.0 Flash
The first model released by the company is Gemini 2.0 Flash, the smallest and least powerful model in the family, yet one that still outperforms the current Pro model. According to Demis Hassabis, CEO of Google DeepMind, this model is more versatile and capable than previous models and can natively generate images and multilingual audio: “Flash even outperforms 1.5 Pro in key benchmarks, at twice the speed, and also comes with new capabilities. In addition to supporting multimodal inputs such as images, video, and audio, 2.0 Flash now supports multimodal outputs, such as natively generated images mixed with text and multilingual audio synthesized from text (TTS). It is also natively integrated with tools such as Google Search and code execution, as well as third-party user-defined functions.”
This model is already available to developers as an experimental release through the Gemini API, with support for multimodal input and text output, native text-to-speech, and image generation.
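To make this more concrete, here is a minimal sketch of what calling the experimental Flash model through the Gemini API could look like in Python with Google's google-genai SDK. The model identifier gemini-2.0-flash-exp and the GEMINI_API_KEY environment variable name are assumptions for illustration; check Google's documentation for the exact values.

```python
# Minimal sketch: text generation with Gemini 2.0 Flash via the Gemini API.
# Assumes the google-genai SDK is installed (pip install google-genai) and
# that "gemini-2.0-flash-exp" is the experimental model identifier.
import os

from google import genai

# The API key is read from an environment variable (the name is our choice here).
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # assumed experimental model id
    contents="Summarize the key new capabilities of Gemini 2.0 in two sentences.",
)

print(response.text)  # plain-text output from the model
```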
It will be widely available in January, along with more model sizes.
AI agents for Gemini 2.0
The biggest new feature of Gemini 2.0 is its AI agents. The model now includes native user-interface action capabilities, along with other enhancements such as multimodal reasoning, long-context understanding, complex instruction following and planning, compositional function calling, native tool use, and improved latency.
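As a rough illustration of what native tool use looks like from a developer's point of view, the sketch below enables the built-in Google Search tool on a Gemini API request using the same google-genai SDK as above. The configuration types and the model identifier are assumptions and may differ from the final API, so treat this as a sketch rather than a definitive recipe.

```python
# Sketch: asking Gemini 2.0 Flash to ground an answer with the native
# Google Search tool. Types and model id are assumptions; consult the docs.
import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # assumed experimental model id
    contents="What did Google announce alongside Gemini 2.0?",
    config=types.GenerateContentConfig(
        # Built-in Google Search tool; code execution could be enabled similarly.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

print(response.text)
```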
These AI agents will have a major influence over the next few years, and Google is exploring this field with several prototypes that can help people perform tasks like never before.
This work is still in its early stages of development, but one example is the updated Project Astra, a prototype that explores the future capabilities of a universal AI assistant.
There is also Project Mariner, which explores the future of human-agent interaction, starting with the browser, and Jules, an AI-powered code agent that helps developers with their tasks and integrates directly into a GitHub workflow.
Project Astra
A few months ago, Google introduced this project as an evolution of virtual assistants: it can analyze our surroundings and perform numerous actions, such as finding lost objects or describing situations.
With the arrival of Gemini 2.0, Project Astra has also been improved:
- Improved dialogue: it can now converse in several languages, with a better understanding of accents and less common words.
- New tool use: it can now use Google Search, Google Lens, and Maps.
- Improved memory: it now has up to 10 minutes of in-session memory and can remember past conversations, allowing for better personalization.
- Improved latency: thanks to new streaming capabilities and native audio understanding, the agent can understand language with a latency close to that of human conversation.
Project Mariner
As briefly mentioned above, Project Mariner is a research prototype built with Gemini 2.0 that explores the future of human-agent interaction.
It can understand and reason across information on the browser screen, including pixels and web elements such as text, code, images, and forms, and then uses that information through a Chrome extension to complete tasks for you.
It’s still at an early stage, but the results are looking very promising.
This raises the challenge of building it safely and responsibly: the agent can only type, scroll, or click in the active browser tab, and it asks the user for a final confirmation before performing certain sensitive actions.
With all these advances, Google and DeepMind have also emphasized their commitment to safety and responsibility when developing AI agents. They are taking an exploratory and incremental approach to development, testing multiple prototypes, building in safety training, working with trusted testers and external experts, and conducting thorough risk, safety, and assurance assessments.
Without a doubt, Gemini 2.0 and the new prototypes open the door to a new generation of smarter, more autonomous AI models, one that we look forward to exploring. We will be sharing demos using this new version very soon.