Evaluation of different methods on the test split (whole: 7,686 examples; mini: 1,000 examples). The accuracies across various categories and the overall average are reported below.
😀 You are invited to contribute your results on the TabMWP test split! Please send your scores to this email or open a new issue in the GitHub repository.
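For anyone preparing a submission, the following is a minimal sketch of how an overall average and per-category accuracies could be computed from graded predictions. It is not the official evaluation script: the record layout (a `correct` flag and a `categories` list) and the function name `accuracy_report` are assumptions made for illustration, and the toy data is invented.

```python
from collections import defaultdict


def accuracy_report(records):
    """Return accuracy in percent, overall ("Avg") and per category tag.

    Each record is a dict with two hypothetical keys:
      "correct"    -> bool, whether the prediction matched the gold answer
      "categories" -> list of category tags the example belongs to
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        # Every example counts toward "Avg" and toward each of its own tags.
        for tag in ["Avg"] + list(rec["categories"]):
            totals[tag] += 1
            hits[tag] += int(rec["correct"])
    return {tag: round(100.0 * hits[tag] / totals[tag], 2) for tag in totals}


# Invented toy data, only to exercise the function.
toy = [
    {"correct": True, "categories": ["FREE", "INT"]},
    {"correct": False, "categories": ["FREE", "DEC"]},
    {"correct": True, "categories": ["MC", "BOOL"]},
    {"correct": True, "categories": ["MC", "EXTR"]},
]
report = accuracy_report(toy)
print(report)
```

Because categories differ in size, the overall average is weighted by the number of examples per category rather than being the plain mean of the category columns.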
# | Model | Table | Method | Type | Source | Date | Avg | FREE | MC | INT | DEC | EXTR | BOOL | OTH |
* | Human Performance | Image | - | - | Link | 22-09-29 | 90.22 | 84.61 | 93.32 | 84.95 | 83.29 | 97.18 | 88.69 | 96.20 |
1 | Chameleon (GPT-4) 🥇 | Text-GT | Few-shot | Tool | Link | 23-04-19 | 98.78 | 98.95 | 98.29 | 99.34 | 97.42 | 98.58 | 98.56 | 93.33 |
2 | Docugami-MATATA-8B 🥈 | Text-GT | Fine-tuned | Tool | Link | 24-12-02 | 98.13 | 98.35 | 97.49 | 98.41 | 98.11 | 97.26 | 99.56 | 81.90 |
3 | PoT GPT-4 🥉 | Text-GT | Few-shot (4) | Code | Link | 23-04-19 | 96.93 | 97.40 | 95.58 | 98.48 | 93.22 | 96.25 | 98.00 | 68.57 |
4 | CREATOR (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-05-23 | 94.7 | - | - | - | - | - | - | - |
5 | Chameleon (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-04-19 | 93.28 | 93.13 | 93.72 | 92.71 | 94.76 | 91.29 | 98.11 | 78.85 |
6 | TaCo (TAPEX-large) | Text-GT | Fine-tuned | CoT | Link | 23-12-06 | 92.91 | 91.69 | 93.47 | 92.54 | 88.41 | 96.05 | 91.44 | 86.67 |
7 | PoT ChatGPT + Doc | Text-GT | Zero-shot | Tool | Link | 23-08-01 | 92.69 | - | - | - | - | - | - | - |
8 | CoT GPT-4 | Text-GT | Few-shot (8) | CoT | Link | 23-04-19 | 90.81 | 88.48 | 97.49 | 86.16 | 97.51 | 96.86 | 99.11 | 89.52 |
9 | CoS-Planning (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-10-08 | 90.00 | - | - | - | - | - | - | - |
10 | PoT ChatGPT | Text-GT | Few-shot (4) | Code | Link | 23-04-19 | 89.49 | 90.24 | 87.35 | 89.31 | 93.82 | 92.10 | 85.89 | 55.24 |
11 | BM25 (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 89.2 | - | - | - | - | - | - | - |
12 | CRITIC (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 89.0 | - | - | - | - | - | - | - |
13 | RetICL (Codex) | Text-GT | Few-shot | CoT | Link | 23-05-23 | 88.51 | - | - | - | - | - | - | - |
14 | CRAFT (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 88.4 | - | - | - | - | - | - | - |
15 | CRITIC (GPT-3) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 87.6 | - | - | - | - | - | - | - |
16 | TaCo (TAPEX-base) | Text-GT | Fine-tuned | CoT | Link | 23-12-06 | 86.12 | 85.53 | 85.74 | 85.29 | 86.44 | 93.31 | 77.89 | 81.90 |
17 | SimCSE (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 83.8 | - | - | - | - | - | - | - |
18 | CoT ChatGPT | Text-GT | Few-shot (8) | CoT | Link | 23-04-19 | 82.03 | 78.43 | 92.32 | 75.38 | 90.30 | 92.30 | 92.89 | 87.62 |
19 | PoT-SC Codex | Text-GT | Few-shot (4) | Code | Link | 22-11-22 | 81.8 | 79.5 | 88.4 | 77.1 | 88.9 | 88.7 | 92.7 | 48.6 |
20 | SEGSBS-PAL (Codex) | Text-GT | Few-shot | Code | Link | 23-05-01 | 80.9 | - | - | - | - | - | - | - |
21 | CRITIC (LLaMA-2-70B) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 75.0 | - | - | - | - | - | - | - |
22 | ToRA (70B) | Text-GT | - | Tool | Link | 23-09-29 | 74.0 | - | - | - | - | - | - | - |
23 | ToRA-Code (34B) | Text-GT | - | Code | Link | 23-09-29 | 70.5 | - | - | - | - | - | - | - |
24 | CoT GPT-3 + PromptPG | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 68.23 | 66.17 | 74.11 | 64.12 | 74.16 | 76.19 | 72.81 | 65.71 |
25 | ToRA-Code (13B) | Text-GT | - | Code | Link | 23-09-29 | 65.4 | - | - | - | - | - | - | - |
26 | CodeLLaMA (PAL) (34B) | Text-GT | - | Code | Link | 23-09-29 | 63.1 | - | - | - | - | - | - | - |
27 | CoT GPT-3 | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 62.92 | 60.76 | 69.09 | 60.04 | 63.58 | 76.49 | 61.19 | 67.30 |
28 | CodeLLaMA (PAL) (13B) | Text-GT | - | Code | Link | 23-09-29 | 59.5 | - | - | - | - | - | - | - |
29 | LLaMA-2 (PAL) (70B) | Text-GT | - | Code | Link | 23-09-29 | 59.5 | - | - | - | - | - | - | - |
30 | TAPEX_Large | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 58.52 | 51.00 | 80.02 | 59.92 | 16.31 | 95.34 | 64.00 | 73.33 |
31 | CoT GPT-3 | Text-GT | Zero-shot | CoT | Link | 22-09-29 | 57.61 | 54.36 | 66.92 | 55.82 | 48.67 | 78.82 | 55.67 | 51.43 |
32 | LLaMA-2 (70B) | Text-GT | - | - | Link | 23-09-29 | 57.5 | - | - | - | - | - | - | - |
33 | UnifiedQA_Large | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 57.35 | 48.67 | 82.18 | 55.97 | 20.26 | 94.63 | 68.89 | 79.05 |
34 | GPT-3 | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 57.13 | 54.69 | 64.11 | 58.36 | 40.40 | 75.95 | 52.41 | 53.02 |
35 | GPT-3 | Text-GT | Zero-shot | CoT | Link | 22-09-29 | 56.96 | 53.57 | 66.67 | 55.55 | 45.84 | 78.22 | 55.44 | 54.29 |
36 | ToRA-Code (7B) | Text-GT | - | Code | Link | 23-09-29 | 51.6 | - | - | - | - | - | - | - |
37 | TAPEX_Base | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 48.27 | 39.59 | 73.09 | 46.85 | 11.33 | 84.19 | 61.33 | 69.52 |
38 | CodeLLaMA (PAL) (7B) | Text-GT | - | Code | Link | 23-09-29 | 47.3 | - | - | - | - | - | - | - |
39 | ToRA (13B) | Text-GT | - | Tool | Link | 23-09-29 | 47.2 | - | - | - | - | - | - | - |
40 | UnifiedQA_Base | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 43.52 | 34.02 | 70.68 | 40.74 | 7.90 | 84.09 | 55.67 | 73.33 |
41 | ToRA (7B) | Text-GT | - | Tool | Link | 23-09-29 | 42.4 | - | - | - | - | - | - | - |
42 | LLaMA-2 (13B) | Text-GT | - | - | Link | 23-09-29 | 39.5 | - | - | - | - | - | - | - |
43 | LLaMA-2 (7B) | Text-GT | - | - | Link | 23-09-29 | 31.1 | - | - | - | - | - | - | - |
44 | UnifiedQA_Small | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 29.79 | 22.27 | 51.31 | 27.27 | 2.83 | 52.28 | 48.11 | 69.52 |
45 | TAPEX_Large | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 18.59 | 8.80 | 46.59 | 10.62 | 1.72 | 46.91 | 48.11 | 30.48 |
46 | UnifiedQA_Large | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 15.96 | 4.48 | 48.80 | 5.19 | 1.72 | 48.33 | 50.33 | 40.00 |
47 | TAPEX_Base | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 15.73 | 7.32 | 39.76 | 8.68 | 2.06 | 35.06 | 47.11 | 20.95 |
48 | UnifiedQA_Base | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 14.56 | 4.60 | 43.02 | 5.28 | 1.97 | 37.08 | 50.11 | 38.10 |
49 | UnifiedQA_Small | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 12.18 | 1.18 | 43.62 | 1.37 | 0.43 | 38.70 | 49.78 | 37.14 |
* | Heuristic Guess | - | - | - | Link | 22-09-29 | 15.29 | 6.71 | 39.81 | 8.37 | 0.26 | 30.80 | 51.22 | 26.67 |
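Many rows above report only the overall average and use "-" for the per-category columns. As a rough sketch of how the pipe-delimited rows could be loaded for analysis, the snippet below parses two rows copied from the table and converts "-" to `None`. The helper names `split_cells` and `parse_row` are hypothetical, not part of any released tooling.

```python
HEADER = ("# | Model | Table | Method | Type | Source | Date | "
          "Avg | FREE | MC | INT | DEC | EXTR | BOOL | OTH |")
SCORE_COLUMNS = ("Avg", "FREE", "MC", "INT", "DEC", "EXTR", "BOOL", "OTH")


def split_cells(line):
    """Split one pipe-delimited line into stripped cells, dropping the trailing pipe."""
    return [cell.strip() for cell in line.strip().rstrip("|").split("|")]


def parse_row(line, columns):
    """Map cells to column names; score cells become floats, "-" becomes None."""
    row = dict(zip(columns, split_cells(line)))
    for name in SCORE_COLUMNS:
        row[name] = None if row[name] == "-" else float(row[name])
    return row


columns = split_cells(HEADER)
rows = [parse_row(line, columns) for line in [
    "5 | Chameleon (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-04-19 | "
    "93.28 | 93.13 | 93.72 | 92.71 | 94.76 | 91.29 | 98.11 | 78.85 |",
    "4 | CREATOR (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-05-23 | "
    "94.7 | - | - | - | - | - | - | - |",
]]

# Keep only entries that report a per-category breakdown.
with_breakdown = [row["Model"] for row in rows if row["FREE"] is not None]
print(with_breakdown)
```

Treating "-" as `None` rather than 0 keeps missing breakdowns from being mistaken for zero accuracy when columns are compared or averaged.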
Table formats
Model types
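Accuracies for different question types: FREE (free-text questions), MC (multiple-choice questions), INT (integer answers), DEC (decimal answers), EXTR (extractive text answers), BOOL (Boolean text answers), OTH (other text answers).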