- Google’s Gemini and OpenAI’s GPT-4 are cutting-edge AI models.
- Gemini emphasises multimodal understanding and on-device processing, while GPT-4 focuses on creative responses, an expanded context window, and diverse features.
- Both models are continuously refined for safety, marking remarkable progress in AI.
Artificial intelligence (AI) has made significant strides in recent years, with tech giants like Google and Microsoft-backed OpenAI leading the charge. Google’s Gemini and OpenAI’s GPT-4 are two of the most advanced AI models currently available, coexisting with a handful of other generative AI models. Both are the result of years of research and development, and each has its own features and capabilities.
Google took its next leap in artificial intelligence on December 6, 2023, with the launch of Project Gemini, a next-generation generative AI model trained to behave in human-like ways and likely to intensify the debate about the technology’s potential promise and perils. Members of the Google DeepMind team, the driving force behind Gemini alongside Google Research, gave a high-level overview of Gemini (technically “Gemini 1.0”) and its capabilities in a virtual press briefing.
Developed from scratch, Google Gemini is a multimodal model (i.e., it combines language, visual, and audio understanding), designed to seamlessly comprehend, interact with, and integrate various forms of information, including text, code, audio, images, and video. Google pitches Gemini as the most powerful AI it has ever built. It can hold human-style conversations, understand language and content, interpret images, code prolifically, drive data and analytics, and be used by developers to create new AI apps and APIs.
The Model Breakdown
Gemini is essentially a family of AI models. It comes in three flavors:
- Gemini Ultra, the flagship model
- Gemini Pro, a “lite” model
- Gemini Nano, which is distilled to run on mobile devices like the Pixel 8 Pro
Further, Gemini Nano comes in two model sizes, Nano-1 (1.8 billion parameters) and Nano-2 (3.25 billion parameters) — targeting low- and high-memory devices, respectively. The rollout will unfold in phases, with less sophisticated versions of Gemini called “Nano” and “Pro” being immediately incorporated into Google’s AI-powered chatbot Bard and its Pixel 8 Pro smartphone.
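For developers, the practical difference between these tiers largely comes down to which model is requested from the API. Below is a minimal sketch, assuming Google’s `google-generativeai` Python SDK and an API key from Google AI Studio; the model name and prompt are illustrative, and the Ultra and Nano tiers are not exposed this way at launch.

```python
import google.generativeai as genai

# Placeholder key; a real key comes from Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# "gemini-pro" is the mid-tier, text-oriented model; other tiers ship later
# or run on-device rather than through this API.
model = genai.GenerativeModel("gemini-pro")

response = model.generate_content("Summarise the Gemini model family in two sentences.")
print(response.text)
```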
What To Expect
Gemini’s biggest advances will come to light in early 2024, when its Ultra model will be used to launch “Bard Advanced”, a juiced-up version of the chatbot that will initially be offered only to a test audience. At first, the AI will work only in English worldwide, although Google executives assured reporters during a briefing that the technology will eventually expand to other languages.
Based on a demonstration of Gemini for a group of reporters, Google’s “Bard Advanced” might be capable of unprecedented AI multitasking by simultaneously recognising and understanding presentations involving text, photos, and video. Gemini will also eventually be infused into Google’s dominant search engine.
With Gemini providing a helping hand, Google promises Bard will become more intuitive and better at tasks that involve planning. On the Pixel 8 Pro, Gemini will be able to quickly summarise recordings made on the device and provide automatic replies on messaging services, starting with WhatsApp, according to Google.
“This is a significant milestone in the development of AI, and the start of a new era for us at Google,” declared Demis Hassabis, CEO of Google DeepMind, the AI division behind Gemini. Sissie Hsiao, GM of Google Assistant and Bard, said during the briefing that the fine-tuned Gemini Pro delivers improved reasoning, planning and understanding capabilities over the previous model driving Bard.
Topic For Debate
Google touts Gemini’s problem-solving skills as especially adept in math and physics, fuelling hopes among AI optimists that it may lead to scientific breakthroughs that improve life for humans. But on the other side of the AI debate, there is conjecture about the technology eventually eclipsing human intelligence, resulting in the loss of millions of jobs. It could also enable more destructive behaviour, such as amplifying misinformation or triggering the deployment of nuclear weapons.
“We’re approaching this work boldly and responsibly,” Google CEO Sundar Pichai wrote in a blog post. “That means being ambitious in our research and pursuing the capabilities that will bring enormous benefits to people and society, while building in safeguards and working collaboratively with governments and experts to address risks as AI becomes more capable.”
Gemini’s arrival escalates the AI competition that has been building for the past year between Google, San Francisco startup OpenAI, and long-time industry rival Microsoft. Backed by Microsoft’s financial muscle and computing power, OpenAI was already deep into developing its most advanced AI model, GPT-4, when it released the free ChatGPT tool last year. The chatbot rocketed to global fame, putting generative AI on the map and, in turn, pressuring Google to push out its Bard chatbot in response.
Gemini Vs. OpenAI’s GPT-4
With Gemini coming out, OpenAI may find itself trying to prove its technology remains smarter than Google’s. “I am in awe of what it’s capable of,” said Eli Collins, Google DeepMind vice president of product, referring to Gemini. A white paper released on December 6, 2023, showed the most capable version of Gemini outperforming GPT-4 on multiple-choice exams, grade-school math, and other benchmarks.
Gemini Ultra does several things better than rival OpenAI’s multimodal model, GPT-4 with Vision, which can understand the context of only two modalities: words and images. Gemini Ultra can transcribe speech and answer questions about audio and video (e.g. “What’s happening in this clip?”) in addition to art and photos.
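To make the multimodal comparison concrete, here is a minimal sketch of prompting a vision-capable Gemini model with an image plus a text question. It assumes the `google-generativeai` Python SDK and the Pillow library; the model name and file path are illustrative, and the audio and video understanding described above belongs to the larger Ultra model rather than to this API call.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# A vision-capable model accepts prompts that mix text and images.
model = genai.GenerativeModel("gemini-pro-vision")

# Hypothetical local frame grabbed from a clip; any PIL-readable image works.
frame = Image.open("clip_frame.jpg")

response = model.generate_content(["What's happening in this image?", frame])
print(response.text)
```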
While both models are highly advanced, they have different focuses. Gemini’s strength lies in its multimodal capabilities and its ability to run on-device for near-instant processing. GPT-4, on the other hand, excels in its creative and collaborative capabilities, its expanded context window, and its integration of various features.
Both models are still being fine-tuned and tested for safety, and both are expected to keep building on workplace, security, and productivity features, among others. At the same time, researchers acknowledge ongoing struggles in getting AI models to achieve higher-level reasoning skills.
The Demo And The Controversy
Google released a demo video showcasing Gemini’s capabilities. Called “Hands-on with Gemini: Interacting with multimodal AI”, the video has already drawn over a million views. The impressive demo “highlights some of our favorite interactions with Gemini,” showing how the multimodal model can be flexible and responsive to a variety of inputs.
To begin with, Gemini narrates an evolving sketch of a duck, from a squiggle to a completed drawing, and then responds to various voice queries about a toy duck. The demo then moves on to other show-off moves, like tracking a ball in a cup-switching game, recognising shadow-puppet gestures, reordering sketches of planets, and so on. It all appears very responsive.
The video does caution that “latency has been reduced and Gemini outputs have been shortened.” All in all, it looked like a mind-blowing show of force in multimodal understanding, until observers pointed out that it was not as real as it appeared. Google’s own description of the video explains: “We created the demo by capturing footage in order to test Gemini’s capabilities on a wide range of challenges. Then we prompted Gemini using still image frames from the footage, and prompting via text.”
Although Gemini might be able to do roughly what Google shows in the video, it did not, and perhaps could not, do those things live or in the way the video implies. In reality, the demo was a series of carefully tuned text prompts paired with still images, selected and shortened in a way that misrepresents what interacting with the model is actually like. Viewers are misled about the speed, accuracy, and fundamental mode of interaction with the model, which raises questions about Google’s integrity.
Some computer scientists see limits in how much can be done with large language models, which work by repeatedly predicting the next word in a sentence and are prone to fabricating information, errors known as hallucinations. “We made a ton of progress in what’s called factuality with Gemini. So Gemini is our best model in that regard. But it’s still, I would say, an unsolved research problem,” Collins said.
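The “repeatedly predicting the next word” behaviour Collins describes can be illustrated with a toy autoregressive loop. This sketch uses a hand-written bigram table instead of a real neural network, purely to show the mechanic; a real LLM computes the next-word distribution over its entire context, and hallucinations arise because fluent continuations are not guaranteed to be factual.

```python
import random

# Toy "model": for each word, a probability distribution over possible next words.
# A real LLM computes these probabilities with a neural network over the full context.
next_word_probs = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.7, "<end>": 0.3},
    "sat": {"quietly": 0.4, "<end>": 0.6},
    "quietly": {"<end>": 1.0},
}

def generate(max_words=10):
    words = ["<start>"]
    for _ in range(max_words):
        dist = next_word_probs[words[-1]]
        # Sample the next word from the predicted distribution, then repeat.
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == "<end>":
            break
        words.append(nxt)
    return " ".join(words[1:])

print(generate())
```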
Both Google Gemini and OpenAI GPT-4 represent significant advancements in the field of AI. They each have their unique strengths and capabilities, and it will be interesting to see how these models evolve and what new features and capabilities they will bring in the future.