In a groundbreaking move, Google has infused its Bard AI chatbot with Gemini, a new model that brings a native understanding of video, audio, and photos. The upgrade is already rolling out in dozens of countries, initially offering text-based chat in English, and Google promises multimedia advancements, such as interpreting graphical data and deciphering children’s drawings, in the near future.
Unlike its predecessors, Gemini is designed to transcend the limitations of text-based communication. While text is vital, human understanding involves processing richer information from our dynamic three-dimensional world. Gemini aims to bridge this gap by incorporating complex communication abilities like speech and imagery, aligning more closely with our holistic perception of the world.
Gemini comes in three variants tailored for diverse computing power requirements:
- Gemini Nano: Designed for mobile phones, with variations catering to different memory levels. It will power new features on Google’s Pixel 8 phones, enhancing capabilities like conversation summarization in the Recorder app and suggesting message replies in WhatsApp through Google’s Gboard.
- Gemini Pro: Optimized for rapid responses, this version operates in Google’s data centers and will drive the latest iteration of Bard, available starting today.
- Gemini Ultra: Currently limited to a select test group, this high-end version will debut in the upcoming Bard Advanced chatbot, scheduled for an early 2024 release. While pricing details remain undisclosed, a premium cost is anticipated for this cutting-edge capability.
This release underscores the swift progress in the generative AI domain, where chatbots autonomously respond to prompts in plain language, eliminating the need for complex programming instructions. Google’s Gemini follows in the footsteps of its competitors, such as OpenAI, yet aims to surpass them with its third major AI model revision. The company envisions incorporating this technology into widely-used products like search, Chrome, Google Docs, and Gmail.
Eli Collins, Product Vice President at Google’s DeepMind division, remarked, “For a long time, we wanted to build a new generation of AI models inspired by the way people understand and interact with the world—an AI that feels more like a helpful collaborator and less like a smart piece of software. Gemini brings us a step closer to that vision.”
AI is getting smarter, but it’s not perfect
Multimedia will likely be a big change from text-only chat when it arrives. But what hasn’t changed are the fundamental problems of AI models trained by recognizing patterns in vast quantities of real-world data. They can turn increasingly complex prompts into increasingly sophisticated responses, but you still can’t trust that an answer is actually correct rather than merely plausible. As Google’s chatbot warns when you use it, “Bard may display inaccurate info, including about people, so double-check its responses.”
Gemini is the next generation of Google’s large language model, a sequel to the PaLM and PaLM 2 models that have been the foundation of Bard so far. By training Gemini simultaneously on text, programming code, images, audio and video, Google says it can handle multimedia input more efficiently than a system stitched together from separate but interlinked AI models, one for each mode of input.
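To make that distinction concrete, here is a minimal, purely hypothetical Python sketch. The class and method names are invented for illustration and do not reflect any real Google API; the point is only to contrast a single model that consumes every modality at once with a pipeline that reduces each modality to text before a language model ever sees it.

```python
# Illustrative only: hypothetical interfaces, not a real Google or Gemini API.
from dataclasses import dataclass


@dataclass
class Prompt:
    text: str
    image_bytes: bytes | None = None
    audio_bytes: bytes | None = None


class NativelyMultimodalModel:
    """One model trained jointly on text, code, images and audio."""

    def respond(self, prompt: Prompt) -> str:
        # A single forward pass consumes all modalities together, so
        # cross-modal reasoning (e.g. "explain this chart") happens in one model.
        return "...response grounded in every modality at once..."


class PipelinedModels:
    """Separate single-modality models glued together with text as the interface."""

    def respond(self, prompt: Prompt) -> str:
        caption = "caption of the image" if prompt.image_bytes else ""
        transcript = "transcript of the audio" if prompt.audio_bytes else ""
        # The language model only ever sees lossy text summaries of the other modalities.
        combined = " ".join(filter(None, [prompt.text, caption, transcript]))
        return f"...response based only on: {combined}..."
```

The design difference is that the pipelined approach loses whatever the captioner or transcriber fails to describe, while a natively multimodal model can, at least in principle, reason over the raw inputs directly.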
Examples of Gemini’s abilities, according to a Google research paper, are diverse.
Looking at a series of shapes consisting of a triangle, square and pentagon, it can correctly guess the next shape in the series is a hexagon. Presented with photos of the moon and a hand holding a golf ball and asked to find the link, it correctly points out that Apollo astronauts hit two golf balls on the moon in 1971. It converted four bar charts showing country-by-country waste disposal techniques into a labeled table and spotted an outlying data point, namely that the US throws a lot more plastic in the dump than other regions.
The company also showed Gemini processing a handwritten physics problem involving a simple sketch, figuring out where a student’s error lay, and explaining a correction. A more involved demo video showed Gemini recognizing a blue duck, hand puppets, sleight-of-hand tricks and other visual material. None of the demos were live, however, and it’s not clear how often Gemini fumbles such challenges.