General
MMLU
Representation of questions in 57 subjects (incl. STEM, humanities, and others)
Gemini 1.0 Pro: 71.8%
Gemini 1.0 Ultra: 83.7%
Gemini 1.5 Pro (Feb 2024): 81.9%
Gemini 1.5 Flash: 78.9%
Gemini 1.5 Pro (May 2024): 85.9%
Code
Natural2Code
Python code generation. Held out dataset HumanEval-like, not leaked on the web
Gemini 1.0 Pro: 69.6%
Gemini 1.0 Ultra: 74.9%
Gemini 1.5 Pro (Feb 2024): 77.7%
Gemini 1.5 Flash: 77.2%
Gemini 1.5 Pro (May 2024): 82.6%
Math
MATH
Challenging math problems (incl. algebra, geometry, pre-calculus, and others)
Gemini 1.0 Pro: 32.6%
Gemini 1.0 Ultra: 53.2%
Gemini 1.5 Pro (Feb 2024): 58.5%
Gemini 1.5 Flash: 54.9%
Gemini 1.5 Pro (May 2024): 67.7%
Reasoning
GPQA (main)
Challenging dataset of questions written by domain experts in biology, physics, and chemistry
Gemini 1.0 Pro: 27.9%
Gemini 1.0 Ultra: 35.7%
Gemini 1.5 Pro (Feb 2024): 41.5%
Gemini 1.5 Flash: 39.5%
Gemini 1.5 Pro (May 2024): 46.2%
Reasoning
Big-Bench Hard
Diverse set of challenging tasks requiring multi-step reasoning
Gemini 1.0 Pro: 75.0%
Gemini 1.0 Ultra: 83.6%
Gemini 1.5 Pro (Feb 2024): 84.0%
Gemini 1.5 Flash: 85.5%
Gemini 1.5 Pro (May 2024): 89.2%
Multilingual
WMT23
Language translation
Gemini 1.0 Pro: 71.7
Gemini 1.0 Ultra: 74.4
Gemini 1.5 Pro (Feb 2024): 75.2
Gemini 1.5 Flash: 74.1
Gemini 1.5 Pro (May 2024): 75.3
Image
MMMU
Multi-discipline college-level reasoning problems
Gemini 1.0 Pro: 47.9%
Gemini 1.0 Ultra: 59.4%
Gemini 1.5 Pro (Feb 2024): 58.5%
Gemini 1.5 Flash: 56.1%
Gemini 1.5 Pro (May 2024): 62.2%
Image
MathVista
Mathematical reasoning in visual contexts
Gemini 1.0 Pro: 46.6%
Gemini 1.0 Ultra: 53.0%
Gemini 1.5 Pro (Feb 2024): 54.7%
Gemini 1.5 Flash: 58.4%
Gemini 1.5 Pro (May 2024): 63.9%
Audio
FLEURS (55 languages)
Automatic speech recognition (based on word error rate, lower is better)
Gemini 1.0 Pro: 6.4%
Gemini 1.0 Ultra: 6.0%
Gemini 1.5 Pro (Feb 2024): 6.6%
Gemini 1.5 Flash: 9.8%
Gemini 1.5 Pro (May 2024): 6.5%
Video
EgoSchema
Video question answering
Gemini 1.0 Pro: 55.7%
Gemini 1.0 Ultra: 61.5%
Gemini 1.5 Pro (Feb 2024): 65.1%
Gemini 1.5 Flash: 65.7%
Gemini 1.5 Pro (May 2024): 72.2%
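
The FLEURS entry above is scored by word error rate, the only metric in this comparison where lower is better. As a rough illustration of what that number measures (not the evaluation code behind these results), the sketch below computes it as the word-level edit distance between a reference transcript and a model transcript, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization, which a real benchmark harness would likely handle differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: assumes whitespace tokenization and no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For instance, a transcript that drops one word from a six-word reference has a word error rate of 1/6, or about 16.7%; a perfect transcript scores 0.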