Leaderboard - TabMWP

Evaluation of different methods on the test split (whole: 7,686 examples; mini: 1,000 examples). The accuracies across various categories and the overall average are reported below.

😀 You are invited to contribute your results on the TabMWP test split! Please send your result scores to this email or open a new issue in the GitHub repository.

| # | Model | Table | Method | Type | Source | Date | Avg | FREE | MC | INT | DEC | EXTR | BOOL | OTH |
|---|-------|-------|--------|------|--------|------|-----|------|----|-----|-----|------|------|-----|
| * | Human Performance | Image | - | - | Link | 22-09-29 | 90.22 | 84.61 | 93.32 | 84.95 | 83.29 | 97.18 | 88.69 | 96.20 |
| 1 | Chameleon (GPT-4) 🥇 | Text-GT | Few-shot | Tool | Link | 23-04-19 | 98.78 | 98.95 | 98.29 | 99.34 | 97.42 | 98.58 | 98.56 | 93.33 |
| 2 | PoT GPT-4 🥈 | Text-GT | Few-shot (4) | Code | Link | 23-04-19 | 96.93 | 97.40 | 95.58 | 98.48 | 93.22 | 96.25 | 98.00 | 68.57 |
| 3 | CREATOR (ChatGPT) 🥉 | Text-GT | Few-shot | Tool | Link | 23-05-23 | 94.7 | - | - | - | - | - | - | - |
| 4 | Chameleon (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-04-19 | 93.28 | 93.13 | 93.72 | 92.71 | 94.76 | 91.29 | 98.11 | 78.85 |
| 5 | TaCo (TAPEX-large) | Text-GT | Fine-tuned | CoT | Link | 23-12-06 | 92.91 | 91.69 | 93.47 | 92.54 | 88.41 | 96.05 | 91.44 | 86.67 |
| 6 | PoT ChatGPT + Doc | Text-GT | Zero-shot | Tool | Link | 23-08-01 | 92.69 | - | - | - | - | - | - | - |
| 7 | CoT GPT-4 | Text-GT | Few-shot (8) | CoT | Link | 23-04-19 | 90.81 | 88.48 | 97.49 | 86.16 | 97.51 | 96.86 | 99.11 | 89.52 |
| 8 | CoS-Planning (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-10-08 | 90.00 | - | - | - | - | - | - | - |
| 9 | PoT ChatGPT | Text-GT | Few-shot (4) | Code | Link | 23-04-19 | 89.49 | 90.24 | 87.35 | 89.31 | 93.82 | 92.10 | 85.89 | 55.24 |
| 10 | BM25 (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 89.2 | - | - | - | - | - | - | - |
| 11 | CRITIC (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 89.0 | - | - | - | - | - | - | - |
| 12 | RetICL (Codex) | Text-GT | Few-shot | CoT | Link | 23-05-23 | 88.51 | - | - | - | - | - | - | - |
| 13 | CRAFT (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 88.4 | - | - | - | - | - | - | - |
| 14 | CRITIC (GPT-3) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 87.6 | - | - | - | - | - | - | - |
| 15 | TaCo (TAPEX-base) | Text-GT | Fine-tuned | CoT | Link | 23-12-06 | 86.12 | 85.53 | 85.74 | 85.29 | 86.44 | 93.31 | 77.89 | 81.90 |
| 16 | SimCSE (ChatGPT) | Text-GT | Few-shot | Tool | Link | 23-09-29 | 83.8 | - | - | - | - | - | - | - |
| 17 | CoT ChatGPT | Text-GT | Few-shot (8) | CoT | Link | 23-04-19 | 82.03 | 78.43 | 92.32 | 75.38 | 90.30 | 92.30 | 92.89 | 87.62 |
| 18 | PoT-SC Codex | Text-GT | Few-shot (4) | Code | Link | 22-11-22 | 81.8 | 79.5 | 88.4 | 77.1 | 88.9 | 88.7 | 92.7 | 48.6 |
| 19 | SEGSBS-PAL (Codex) | Text-GT | Few-shot | Code | Link | 23-05-01 | 80.9 | - | - | - | - | - | - | - |
| 20 | CRITIC (LLaMA-2-70B) | Text-GT | Few-shot | Tool | Link | 23-09-30 | 75.0 | - | - | - | - | - | - | - |
| 21 | ToRA (70B) | Text-GT | - | Tool | Link | 23-09-29 | 74.0 | - | - | - | - | - | - | - |
| 22 | ToRA-Code (34B) | Text-GT | - | Code | Link | 23-09-29 | 70.5 | - | - | - | - | - | - | - |
| 23 | CoT GPT-3 + PromptPG | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 68.23 | 66.17 | 74.11 | 64.12 | 74.16 | 76.19 | 72.81 | 65.71 |
| 24 | ToRA-Code (13B) | Text-GT | - | Code | Link | 23-09-29 | 65.4 | - | - | - | - | - | - | - |
| 25 | CodeLLaMA (PAL) (34B) | Text-GT | - | Code | Link | 23-09-29 | 63.1 | - | - | - | - | - | - | - |
| 26 | CoT GPT-3 | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 62.92 | 60.76 | 69.09 | 60.04 | 63.58 | 76.49 | 61.19 | 67.30 |
| 27 | CodeLLaMA (PAL) (13B) | Text-GT | - | Code | Link | 23-09-29 | 59.5 | - | - | - | - | - | - | - |
| 28 | LLaMA-2 (PAL) (70B) | Text-GT | - | Code | Link | 23-09-29 | 59.5 | - | - | - | - | - | - | - |
| 29 | TAPEX_Large | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 58.52 | 51.00 | 80.02 | 59.92 | 16.31 | 95.34 | 64.00 | 73.33 |
| 30 | CoT GPT-3 | Text-GT | Zero-shot | CoT | Link | 22-09-29 | 57.61 | 54.36 | 66.92 | 55.82 | 48.67 | 78.82 | 55.67 | 51.43 |
| 31 | LLaMA-2 (70B) | Text-GT | - | - | Link | 23-09-29 | 57.5 | - | - | - | - | - | - | - |
| 32 | UnifiedQA_Large | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 57.35 | 48.67 | 82.18 | 55.97 | 20.26 | 94.63 | 68.89 | 79.05 |
| 33 | GPT-3 | Text-GT | Few-shot (2) | CoT | Link | 22-09-29 | 57.13 | 54.69 | 64.11 | 58.36 | 40.40 | 75.95 | 52.41 | 53.02 |
| 34 | GPT-3 | Text-GT | Zero-shot | CoT | Link | 22-09-29 | 56.96 | 53.57 | 66.67 | 55.55 | 45.84 | 78.22 | 55.44 | 54.29 |
| 35 | ToRA-Code (7B) | Text-GT | - | Code | Link | 23-09-29 | 51.6 | - | - | - | - | - | - | - |
| 36 | TAPEX_Base | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 48.27 | 39.59 | 73.09 | 46.85 | 11.33 | 84.19 | 61.33 | 69.52 |
| 37 | CodeLLaMA (PAL) (7B) | Text-GT | - | Code | Link | 23-09-29 | 47.3 | - | - | - | - | - | - | - |
| 38 | ToRA (13B) | Text-GT | - | Tool | Link | 23-09-29 | 47.2 | - | - | - | - | - | - | - |
| 39 | UnifiedQA_Base | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 43.52 | 34.02 | 70.68 | 40.74 | 7.90 | 84.09 | 55.67 | 73.33 |
| 40 | ToRA (7B) | Text-GT | - | Tool | Link | 23-09-29 | 42.4 | - | - | - | - | - | - | - |
| 41 | LLaMA-2 (13B) | Text-GT | - | - | Link | 23-09-29 | 39.5 | - | - | - | - | - | - | - |
| 42 | LLaMA-2 (7B) | Text-GT | - | - | Link | 23-09-29 | 31.1 | - | - | - | - | - | - | - |
| 43 | UnifiedQA_Small | Text-GT | Fine-tuned | PLM | Link | 22-09-29 | 29.79 | 22.27 | 51.31 | 27.27 | 2.83 | 52.28 | 48.11 | 69.52 |
| 44 | TAPEX_Large | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 18.59 | 8.80 | 46.59 | 10.62 | 1.72 | 46.91 | 48.11 | 30.48 |
| 45 | UnifiedQA_Large | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 15.96 | 4.48 | 48.80 | 5.19 | 1.72 | 48.33 | 50.33 | 40.00 |
| 46 | TAPEX_Base | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 15.73 | 7.32 | 39.76 | 8.68 | 2.06 | 35.06 | 47.11 | 20.95 |
| 47 | UnifiedQA_Base | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 14.56 | 4.60 | 43.02 | 5.28 | 1.97 | 37.08 | 50.11 | 38.10 |
| 48 | UnifiedQA_Small | Text-GT | Pre-trained | PLM | Link | 22-09-29 | 12.18 | 1.18 | 43.62 | 1.37 | 0.43 | 38.70 | 49.78 | 37.14 |
| * | Heuristic Guess | - | - | - | Link | 22-09-29 | 15.29 | 6.71 | 39.81 | 8.37 | 0.26 | 30.80 | 51.22 | 26.67 |

Table formats

  • Image: taking the image format of the table as the input
  • Text-GT: taking the parsed ground-truth textual format of the table as the input (see the serialization sketch after this list)
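
To make the Text-GT input concrete, here is a minimal sketch of one plausible way to linearize a parsed table into text. The delimiter, layout, and function name are illustrative assumptions, not the dataset's official serialization.

```python
# Hypothetical sketch of a Text-GT style serialization: flattening a parsed
# table into plain text for a language model. The " | " delimiter and the
# header/row layout are assumptions, not the official TabMWP format.
def linearize_table(header: list[str], rows: list[list[str]]) -> str:
    lines = [" | ".join(header)]
    lines.extend(" | ".join(row) for row in rows)
    return "\n".join(lines)

print(linearize_table(
    ["Name", "Number of coins"],            # example table, invented here
    [["Braden", "76"], ["Camilla", "94"]],
))
# Name | Number of coins
# Braden | 76
# Camilla | 94
```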

Model types

  • PLM: pre-trained language model
  • CoT: chain-of-thought prompted large language model (contrasted with Code prompting in the sketch after this list)
  • Code: code-augmented large language model
  • Tool: tool-augmented large language model
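
To illustrate the difference between the CoT and Code types, a hedged sketch of the two prompting styles follows. The prompt wording and the generated program are invented for illustration and do not reproduce any leaderboard entry's actual prompts.

```python
# Illustrative contrast between CoT and Code (e.g., PoT/PAL) prompting.
# All strings here are invented examples, not the prompts used by the
# leaderboard entries above.
question = "Braden has 76 coins and Camilla has 94 coins. How many in total?"

# CoT: the LLM is asked to reason step by step in natural language and
# state the final answer at the end of its reasoning chain.
cot_prompt = f"{question}\nLet's think step by step."

# Code: the LLM is asked to emit a program; the evaluator executes it and
# reads the answer from a designated variable.
code_prompt = f"{question}\n# Write Python code that stores the result in `answer`."
generated_program = "braden = 76\ncamilla = 94\nanswer = braden + camilla"

scope: dict = {}
exec(generated_program, scope)  # run the model-written program
print(scope["answer"])          # -> 170
```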

Accuracies for different question types (a computation sketch follows the list)

  • Avg: all problems (reporting the average accuracy)
  • FREE: free-text questions
  • MC: multi-choice questions
  • INT: questions with integer answers
  • DEC: questions with decimal answers
  • EXTR: questions with extractive text answers
  • BOOL: questions with Boolean text answers
  • OTH: questions with other text answers
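
As a rough sketch of how these per-category numbers can be derived from example-level predictions: the record fields `prediction`, `answer`, `question_type`, and `answer_type` below are assumed names, not the official evaluation script's schema.

```python
# Hedged sketch: overall and per-category accuracy from example-level
# results. Field names are hypothetical; the official TabMWP evaluation
# code may organize records differently.
from collections import defaultdict

def category_accuracies(examples: list[dict]) -> dict[str, float]:
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        correct = int(ex["prediction"] == ex["answer"])
        # Each example counts toward Avg, its question type (FREE/MC),
        # and its answer type (INT/DEC/EXTR/BOOL/OTH).
        for cat in ("Avg", ex["question_type"], ex["answer_type"]):
            hits[cat] += correct
            totals[cat] += 1
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}

demo = [  # two invented examples, just to exercise the function
    {"prediction": "170", "answer": "170", "question_type": "FREE", "answer_type": "INT"},
    {"prediction": "no",  "answer": "yes", "question_type": "MC",   "answer_type": "BOOL"},
]
print(category_accuracies(demo))
# {'Avg': 50.0, 'FREE': 100.0, 'INT': 100.0, 'MC': 0.0, 'BOOL': 0.0}
```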