Gemini models

General

MMLU

Representation of questions in 57 subjects (incl. STEM, humanities, and others)

Gemini 1.0 Pro: 71.8%
Gemini 1.0 Ultra: 83.7%
Gemini 1.5 Pro (Feb 2024): 81.9%
Gemini 1.5 Flash: 78.9%
Gemini 1.5 Pro (May 2024): 85.9%

Code

Natural2Code

Python code generation. Held-out, HumanEval-like dataset, not leaked on the web

Gemini 1.0 Pro: 69.6%
Gemini 1.0 Ultra: 74.9%
Gemini 1.5 Pro (Feb 2024): 77.7%
Gemini 1.5 Flash: 77.2%
Gemini 1.5 Pro (May 2024): 82.6%

Math

MATH

Challenging math problems (incl. algebra, geometry, pre-calculus, and others)

Gemini 1.0 Pro: 32.6%
Gemini 1.0 Ultra: 53.2%
Gemini 1.5 Pro (Feb 2024): 58.5%
Gemini 1.5 Flash: 54.9%
Gemini 1.5 Pro (May 2024): 67.7%

Reasoning

GPQA (main)

Challenging dataset of questions written by domain experts in biology, physics, and chemistry

Gemini 1.0 Pro: 27.9%
Gemini 1.0 Ultra: 35.7%
Gemini 1.5 Pro (Feb 2024): 41.5%
Gemini 1.5 Flash: 39.5%
Gemini 1.5 Pro (May 2024): 46.2%

Reasoning

Big-Bench Hard

Diverse set of challenging tasks requiring multi-step reasoning

Gemini 1.0 Pro: 75.0%
Gemini 1.0 Ultra: 83.6%
Gemini 1.5 Pro (Feb 2024): 84.0%
Gemini 1.5 Flash: 85.5%
Gemini 1.5 Pro (May 2024): 89.2%

Multilingual

WMT23

Language translation

Gemini 1.0 Pro: 71.7
Gemini 1.0 Ultra: 74.4
Gemini 1.5 Pro (Feb 2024): 75.2
Gemini 1.5 Flash: 74.1
Gemini 1.5 Pro (May 2024): 75.3

Image

MMMU

Multi-discipline college-level reasoning problems

Gemini 1.0 Pro: 47.9%
Gemini 1.0 Ultra: 59.4%
Gemini 1.5 Pro (Feb 2024): 58.5%
Gemini 1.5 Flash: 56.1%
Gemini 1.5 Pro (May 2024): 62.2%

Image

MathVista

Mathematical reasoning in visual contexts

Gemini 1.0 Pro: 46.6%
Gemini 1.0 Ultra: 53.0%
Gemini 1.5 Pro (Feb 2024): 54.7%
Gemini 1.5 Flash: 58.4%
Gemini 1.5 Pro (May 2024): 63.9%

Audio

FLEURS (55 languages)

Automatic speech recognition (based on word error rate, lower is better)

Gemini 1.0 Pro: 6.4%
Gemini 1.0 Ultra: 6.0%
Gemini 1.5 Pro (Feb 2024): 6.6%
Gemini 1.5 Flash: 9.8%
Gemini 1.5 Pro (May 2024): 6.5%

Video

EgoSchema

Video question answering

Gemini 1.0 Pro: 55.7%
Gemini 1.0 Ultra: 61.5%
Gemini 1.5 Pro (Feb 2024): 65.1%
Gemini 1.5 Flash: 65.7%
Gemini 1.5 Pro (May 2024): 72.2%
