Leaderboard - TabMWP

Evaluation of different methods on the TabMWP test split (whole: 7,686 examples; mini: 1,000 examples). The table reports accuracy (%) for each question and answer category, together with the overall average.

😀 You are invited to contribute your results on the TabMWP test split! Please send your scores to this email or open a new issue in the GitHub repository.

| # | Model | Table | Method | Type | Source | Date | Avg | FREE | MC | INT | DEC | EXTR | BOOL | OTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| * | Human Performance | Image | - | - | Link | 22-09-29 | 90.22 | 84.61 | 93.32 | 84.95 | 83.29 | 97.18 | 88.69 | 96.20 |
| 1 | Chameleon (GPT-4) 🥇 | Text-GT | Few-shot | Tool | Link | 23-04-19 | 98.78 | 98.95 | 98.29 | 99.34 | 97.42 | 98.58 | 98.56 | 93.33 |
| 2 | Docugami-MATATA-8B 🥈 | Text-GT | Fine-tuned | Tool | Link | 24-12-02 | 98.13 | 98.35 | 97.49 | 98.41 | 98.11 | 97.26 | 99.56 | 81.90 |
| 3 | PoT GPT-4 🥉 | Text-GT | Few-shot (4) | Code | Link | 23-04-19 | 96.93 | 97.40 | 95.58 | 98.48 | 93.22 | 96.25 | 98.00 | 68.57 |
| 4 | CREATOR (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-05-23 | 94.7 | - | - | - | - | - | - | - |
| 5 | Chameleon (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-04-19 | 93.28 | 93.13 | 93.72 | 92.71 | 94.76 | 91.29 | 98.11 | 78.85 |
| 6 | TaCo (TAPEX-large) | Text-GT | Fine-tuned | CoT | Link | 23-12-06 | 92.91 | 91.69 | 93.47 | 92.54 | 88.41 | 96.05 | 91.44 | 86.67 |
| 7 | PoT ChatGPT + Doc | Text-GT | Zero-shot | Tool | Link | 23-08-01 | 92.69 | - | - | - | - | - | - | - |
| 8 | CoT GPT-4 | Text-GT | Few-shot (8) | CoT | Link | 23-04-19 | 90.81 | 88.48 | 97.49 | 86.16 | 97.51 | 96.86 | 99.11 | 89.52 |
| 9 | CoS-Planning (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-10-08 | 90.00 | - | - | - | - | - | - | - |
| 10 | PoT ChatGPT | Text-GT | Few-shot (4) | Code | Link | 23-04-19 | 89.49 | 90.24 | 87.35 | 89.31 | 93.82 | 92.10 | 85.89 | 55.24 |
| 11 | BM25 (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 89.2 | - | - | - | - | - | - | - |
| 12 | CRITIC (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 89.0 | - | - | - | - | - | - | - |
| 13 | RetICL (Codex) | Text-GT | Few-shot | CoT | Link | 23-05-23 | 88.51 | - | - | - | - | - | - | - |
| 14 | CRAFT (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 88.4 | - | - | - | - | - | - | - |
| 15 | CRITIC (GPT-3) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 87.6 | - | - | - | - | - | - | - |
| 16 | TaCo (TAPEX-base) | Text-GT | Fine-tuned | CoT | Link | 23-12-06 | 86.12 | 85.53 | 85.74 | 85.29 | 86.44 | 93.31 | 77.89 | 81.90 |
| 17 | SimCSE (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 83.8 | - | - | - | - | - | - | - |
| 18 | CoT ChatGPT | Text-GT | Few-shot (8) | CoT | Link | 23-04-19 | 82.03 | 78.43 | 92.32 | 75.38 | 90.30 | 92.30 | 92.89 | 87.62 |
| 19 | PoT-SC Codex | Text-GT | Few-shot (4) | Code | Link | 22-11-22 | 81.8 | 79.5 | 88.4 | 77.1 | 88.9 | 88.7 | 92.7 | 48.6 |
| 20 | SEGSBS-PAL (Codex) | Text-GT | Few-shot | Code | Link | 23-05-01 | 80.9 | - | - | - | - | - | - | - |
| 21 | CRITIC (LLaMA-2-70B) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 75.0 | - | - | - | - | - | - | - |
| 22 | ToRA (70B) | Text-GT | - | Tool | Link | 23-09-29 | 74.0 | - | - | - | - | - | - | - |
| 23 | ToRA-Code (34B) | Text-GT | - | Code | Link | 23-09-29 | 70.5 | - | - | - | - | - | - | - |
| 24 | CoT GPT-3 + PromptPG | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 68.23 | 66.17 | 74.11 | 64.12 | 74.16 | 76.19 | 72.81 | 65.71 |
| 25 | ToRA-Code (13B) | Text-GT | - | Code | Link | 23-09-29 | 65.4 | - | - | - | - | - | - | - |
| 26 | CodeLLaMA (PAL) (34B) | Text-GT | - | Code | Link | 23-09-29 | 63.1 | - | - | - | - | - | - | - |
| 27 | CoT GPT-3 | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 62.92 | 60.76 | 69.09 | 60.04 | 63.58 | 76.49 | 61.19 | 67.30 |
| 28 | CodeLLaMA (PAL) (13B) | Text-GT | - | Code | Link | 23-09-29 | 59.5 | - | - | - | - | - | - | - |
| 29 | LLaMA-2 (PAL) (70B) | Text-GT | - | Code | Link | 23-09-29 | 59.5 | - | - | - | - | - | - | - |
| 30 | TAPEX_Large | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 58.52 | 51.00 | 80.02 | 59.92 | 16.31 | 95.34 | 64.00 | 73.33 |
| 31 | CoT GPT-3 | Text-GT | Zero-shot | CoT | Link | 22-09-29 | 57.61 | 54.36 | 66.92 | 55.82 | 48.67 | 78.82 | 55.67 | 51.43 |
| 32 | LLaMA-2 (70B) | Text-GT | - | - | Link | 23-09-29 | 57.5 | - | - | - | - | - | - | - |
| 33 | UnifiedQA_Large | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 57.35 | 48.67 | 82.18 | 55.97 | 20.26 | 94.63 | 68.89 | 79.05 |
| 34 | GPT-3 | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 57.13 | 54.69 | 64.11 | 58.36 | 40.40 | 75.95 | 52.41 | 53.02 |
| 35 | GPT-3 | Text-GT | Zero-shot | CoT | Link | 22-09-29 | 56.96 | 53.57 | 66.67 | 55.55 | 45.84 | 78.22 | 55.44 | 54.29 |
| 36 | ToRA-Code (7B) | Text-GT | - | Code | Link | 23-09-29 | 51.6 | - | - | - | - | - | - | - |
| 37 | TAPEX_Base | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 48.27 | 39.59 | 73.09 | 46.85 | 11.33 | 84.19 | 61.33 | 69.52 |
| 38 | CodeLLaMA (PAL) (7B) | Text-GT | - | Code | Link | 23-09-29 | 47.3 | - | - | - | - | - | - | - |
| 39 | ToRA (13B) | Text-GT | - | Tool | Link | 23-09-29 | 47.2 | - | - | - | - | - | - | - |
| 40 | UnifiedQA_Base | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 43.52 | 34.02 | 70.68 | 40.74 | 7.90 | 84.09 | 55.67 | 73.33 |
| 41 | ToRA (7B) | Text-GT | - | Tool | Link | 23-09-29 | 42.4 | - | - | - | - | - | - | - |
| 42 | LLaMA-2 (13B) | Text-GT | - | - | Link | 23-09-29 | 39.5 | - | - | - | - | - | - | - |
| 43 | LLaMA-2 (7B) | Text-GT | - | - | Link | 23-09-29 | 31.1 | - | - | - | - | - | - | - |
| 44 | UnifiedQA_Small | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 29.79 | 22.27 | 51.31 | 27.27 | 2.83 | 52.28 | 48.11 | 69.52 |
| 45 | TAPEX_Large | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 18.59 | 8.80 | 46.59 | 10.62 | 1.72 | 46.91 | 48.11 | 30.48 |
| 46 | UnifiedQA_Large | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 15.96 | 4.48 | 48.80 | 5.19 | 1.72 | 48.33 | 50.33 | 40.00 |
| 47 | TAPEX_Base | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 15.73 | 7.32 | 39.76 | 8.68 | 2.06 | 35.06 | 47.11 | 20.95 |
| 48 | UnifiedQA_Base | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 14.56 | 4.60 | 43.02 | 5.28 | 1.97 | 37.08 | 50.11 | 38.10 |
| 49 | UnifiedQA_Small | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 12.18 | 1.18 | 43.62 | 1.37 | 0.43 | 38.70 | 49.78 | 37.14 |
| * | Heuristic Guess | - | - | - | Link | 22-09-29 | 15.29 | 6.71 | 39.81 | 8.37 | 0.26 | 30.80 | 51.22 | 26.67 |

Table formats

  • Image: the table is given to the model as an image
  • Text-GT: the table is given to the model as text parsed from the ground-truth annotation

Model types

  • PLM: pre-trained language model
  • CoT: large language model with chain-of-thought prompting
  • Code: code-augmented large language model
  • Tool: tool-augmented large language model

Accuracies for different question types

  • Avg: all problems (reporting the average accuracy)
  • FREE: free-text questions
  • MC: multi-choice questions
  • INT: questions with integer answers
  • DEC: questions with decimal answers
  • EXTR: questions with extractive text answers
  • BOOL: questions with Boolean text answers
  • OTH: questions with other text answers
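
To show how the columns above relate to per-example results, here is a minimal sketch that tallies accuracy per category. The field names `ques_type`, `ans_type`, and `correct`, and the label strings in the two mappings, are assumptions made for this example rather than a documented schema; adapt them to whatever your prediction file actually contains.

```python
from collections import defaultdict

# Assumed label strings; map per-example labels to leaderboard columns.
QUESTION_TYPE_COLUMN = {"free_text": "FREE", "multi_choice": "MC"}
ANSWER_TYPE_COLUMN = {
    "integer_number": "INT",
    "decimal_number": "DEC",
    "extractive_text": "EXTR",
    "boolean_text": "BOOL",
    "other_text": "OTH",
}


def accuracy_by_category(examples):
    """Return accuracy in percent for Avg and every category that occurs.

    Each example is a dict with the (assumed) keys:
      ques_type: a key of QUESTION_TYPE_COLUMN
      ans_type:  a key of ANSWER_TYPE_COLUMN
      correct:   True if the prediction matched the reference answer
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        # Every example counts toward Avg, one question type, one answer type.
        columns = (
            "Avg",
            QUESTION_TYPE_COLUMN[ex["ques_type"]],
            ANSWER_TYPE_COLUMN[ex["ans_type"]],
        )
        for col in columns:
            total[col] += 1
            correct[col] += int(ex["correct"])
    return {col: round(100.0 * correct[col] / total[col], 2) for col in total}


# Toy data, not real TabMWP results.
toy_examples = [
    {"ques_type": "free_text", "ans_type": "integer_number", "correct": True},
    {"ques_type": "free_text", "ans_type": "decimal_number", "correct": False},
    {"ques_type": "multi_choice", "ans_type": "boolean_text", "correct": True},
    {"ques_type": "multi_choice", "ans_type": "extractive_text", "correct": True},
]
scores = accuracy_by_category(toy_examples)
print(scores)
```

Because each example falls into exactly one question type and one answer type, Avg here is the plain fraction of correct examples, not a mean of the per-category columns.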