Yi-Kai Zhang1,2, Xu-Xiang Zhong1,2, Shiyin Lu3, Qing-Guo Chen3,
De-Chuan Zhan1,2, Han-Jia Ye1,2
1School of Artificial Intelligence, Nanjing University
2National Key Laboratory for Novel Software Technology, Nanjing University
3AI Business, Alibaba Group
Corresponding author, email: yehj@lamda.nju.edu.cn.
Abstract
The rapid advancements in Large Language Models (LLMs) have significantly expanded their applications, ranging from multilingual support to domain-specific tasks and multimodal integration. In this paper, we present OmniEvalKit, a novel benchmarking toolbox designed to evaluate LLMs and their omni-extensions across multilingual, multidomain, and multimodal capabilities. Unlike existing benchmarks that often focus on a single aspect, OmniEvalKit provides a modular, lightweight, and automated evaluation system. It is structured around a modular architecture comprising a Static Builder and a Dynamic Data Flow, promoting the seamless integration of new models and datasets. OmniEvalKit supports over 100 LLMs and 50 evaluation datasets, covering comprehensive evaluations across thousands of model-dataset combinations. OmniEvalKit is dedicated to being an ultra-lightweight and fast-deployable evaluation framework, making downstream applications more convenient and versatile for the AI community.
Keywords: Large Language Models (LLMs), Extensions of LLMs, Evaluation Toolbox
1 Introduction
The rapid development of Large Language Models (LLMs) (Du et al., 2022; Jiang et al., 2023; OpenAI, 2022; Touvron et al., 2023a; Bai et al., 2023a) has made their question-answering capabilities crucial in many applications. Recently, the inputs to LLMs have continually expanded to cover multiple languages and various specialized domains, including worldwide applications, code generation (Chen et al., 2021; Rozière et al., 2023), mathematical problem-solving (Romera-Paredes et al., 2024; Yang et al., 2024b), legal inference (Fei et al., 2024), economic decision-making (Xie et al., 2023), and medical diagnosis (Wang et al., 2023). Furthermore, Multimodal LLMs (MLLMs) (Chen et al., 2023; Zhu et al., 2024; Liu et al., 2023a, 2024; Yao et al., 2024; OpenAI, 2023, 2024a, 2024b) can integrate diverse forms of information, such as image (Bai et al., 2023b; Wang et al., 2024), video (Zhang et al., 2023a; Chi et al., 2021), or tabular (Hegselmann et al., 2023) inputs. These advances across multilingual, multidomain, and multimodal (M3) omni-applications are steering us toward Artificial General Intelligence (AGI) systems (Goertzel, 2014).
Unlike previous single-task models, LLMs and their extensions are expected to excel in comprehensive zero-shot capabilities while remaining effective on traditional text-only tasks (Zhang et al., 2024b). For instance, MLLMs enable real-time translation and enhance cross-cultural communication in international business, where their foundational language ability still plays a vital role. Domain-specific LLMs can serve as professionals in coding, mathematics, law, finance, or healthcare, which requires a solid foundation in general conversation. Moreover, MLLMs are particularly important when dealing with interleaved information, especially when processing the text-only parts within multimodal content. Driven by industry demands, LLMs and their omni-extensions have evolved through nearly ten generations, creating hundreds of models across different languages, domains, and modalities. Consequently, fair and comprehensive evaluations are crucial for recognizing model performance in various aspects and for making informed model selections that account for deployment costs and inference overheads. The evaluation process helps identify models with the most robust capabilities and the broadest application prospects.
While many LLM benchmarks are available, most concentrate on a specific type, such as a single language (Singh et al., 2024; Li et al., 2023), domain (Fei et al., 2023; Islam et al., 2023), or modality (Duan et al., 2024). Additionally, the underlying codebases often lack compatibility, and variations in preprocessing configurations and interfaces can be substantial. For instance, the popular Open LLM Leaderboard (Beeching et al., 2023; Fourrier et al., 2024) evaluates text-only question answering but lacks multimodal support due to its fixed default generation method. Similarly, platforms like OpenCompass (Contributors, 2023) implement separate codebases for text-only and multimodal evaluations; however, using multiple codebases for comprehensive evaluation is labor-intensive and may lead to biased outcomes. This situation prompts us to design a modular and lightweight benchmarking toolbox that comprehensively evaluates the diverse and rapidly evolving LLMs and their omni-extensions.
We introduce OmniEvalKit, a unified, comprehensive, and automated framework for evaluating the capabilities of LLMs and their omni-extensions. OmniEvalKit expands to M3 inputs, facilitating comprehensive evaluation, particularly of text-only vs. visual question answering, specific languages, and the domains covered by MLLMs. The OmniEvalKit framework, as illustrated in Figure 1, separates the evaluation process into two primary components: the Static Builder and the Dynamic Data Flow. The static builder is responsible for constructing the candidate model and establishing the evaluation facilities that operate on the dynamic data flow. Questions in different languages, fields, and modalities are forwarded into the model, which produces outputs that the evaluation facilities subsequently process. After extracting intent and answers, these outputs are passed to the metrics calculator to obtain results. The static builder and dynamic data flow decouple the model from the data. The evaluation facilities function like an industrial assembly line, with each component fulfilling a specific role. This modular architecture facilitates the seamless integration of new components as building blocks in the data flow and promotes quick extension and migration of new models or evaluation datasets via simple interface alignment. OmniEvalKit supports over 100 LLMs with their omni-extensions and approximately 50 evaluation datasets, enabling evaluations across thousands of model-dataset combinations. In addition to general text-only tasks covering broad natural language understanding, various exams, and specific knowledge items, it also includes over 12 general language evaluations, domain-specific evaluations in coding, mathematics, law, finance, healthcare, and other fields, as well as multimodal evaluations with image, video, or tabular inputs. Due to its highly flexible modularity, OmniEvalKit is also compatible with several LLM-related service models, such as vector extraction in retrieval embedding models, model selection, and auxiliary techniques for downstream reuse. The vision of OmniEvalKit is to build an open evaluation community that offers high availability for users, high accuracy in result efficacy, and comprehensive support for deployment.
2 OmniEvalKit for Comprehensive Evaluation
In this section, we first introduce the key components of the OmniEvalKit framework. In Figure 1, we illustrate how these components are connected across the entire evaluation pipeline, with data flowing through the facilities constructed by the static builder.
2.1 Key Features
- Model Types & Series: The LLMs and their M3 omni-extensions. In Figure 1, we present a list of models that OmniEvalKit considers, supporting over 100 LLMs and their extensions, including the Llama (Touvron et al., 2023a, b), Qwen (Bai et al., 2023a; Yang et al., 2024a, b; Hui et al., 2024; Bai et al., 2023b; Wang et al., 2024), Mistral (Jiang et al., 2024), and Phi (Abdin et al., 2024) series, alongside a variety of proprietary series. The extended LLMs are proficient at following specific instructions across language, domain, and modality dimensions, while retaining the original LLM's ability to comprehend general text-only instructions. OmniEvalKit also accommodates diverse structural inputs, such as image, video, tabular, and other non-text modalities.
- Question Types & Evaluation Benchmarks: Multiple tasks for general question-answering capabilities, covering multidomain knowledge, multilingual migration, and multimodal information fusion. OmniEvalKit is compatible with mainstream evaluation benchmarks, including examination assessment datasets such as MMLU/CMMLU (Hendrycks et al., 2021; Li et al., 2023), BBH (Suzgun et al., 2023), ARC-Easy/Challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018); typical language datasets like GLUE (Wang et al., 2019), HellaSwag (Zellers et al., 2019), and ANLI (Nie et al., 2020); and specialized knowledge datasets such as ACLUE (Zhang and Li, 2023), which focuses on ancient Chinese content, EQ-Bench (Paech, 2023) for emotional intelligence assessment, and PIQA (Bisk et al., 2020), which targets commonsense reasoning, among others. Additionally, OmniEvalKit includes multilingual extensions such as MMMLU (Hendrycks et al., 2024), XStoryCloze (Lin et al., 2022), and OALL (Elfilali et al., 2024), as well as multilingual translations of ARC and HellaSwag (Zellers et al., 2019). Furthermore, OmniEvalKit covers the five major domains of coding, mathematics, law, finance, and healthcare. It also covers a comprehensive range of general multimodal assessments, including MMMU (Yue et al., 2023), MME (Fu et al., 2023), MMBench (Liu et al., 2023b), MMStar (Chen et al., 2024b), and MM-Vet (Yu et al., 2023), as well as multimodal scientific question datasets like AI2D (Kembhavi et al., 2016), ScienceQA (Lu et al., 2022), and HallusionBench (Guan et al., 2023) for task-specific capabilities.
- Answer Extraction Facility: The omni-extensions of LLMs generate fluent, natural, and human-like responses during deployment, often including additional analysis and connecting words; for example, some models tend to embed lengthy reasoning in their responses. OmniEvalKit supports not only a rich set of pre-defined regular expressions for extracting key answers but also an auxiliary LLM that summarizes answers from the response. It features a variety of regular-expression templates, and the answer-extraction model interface aligns with the general model interface, enabling flexible customization. A minimal sketch of this regex-first, LLM-fallback pattern is given after this list.
- Model Generation Options: OmniEvalKit supports the standard generation approach as well as perplexity (PPL) measurements, which are also available for MLLMs. It provides a flexible assessment interface and offers multiple decoding modes to improve the generation process.
- Accuracy Calculation Center: In addition to the default predefined evaluation metrics such as accuracy, BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and others available in the source benchmarks, OmniEvalKit also offers customizable metrics. It supports various question formats, including single-choice, multiple-choice, yes-or-no, fill-in-the-blank, and free-form open-ended questions.
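As a rough illustration of the answer-extraction facility and the default accuracy metric described above, the following Python sketch tries regular expressions first and falls back to an optional summarizer model. All names here (CHOICE_PATTERNS, extract_choice, accuracy) are illustrative assumptions and do not mirror OmniEvalKit's actual classes, functions, or templates.

```python
import re
from typing import Callable, List, Optional

# Hypothetical patterns for single-choice questions; OmniEvalKit ships its own
# (richer) template set, which is not reproduced here.
CHOICE_PATTERNS = [
    r"answer\s*(?:is|:)\s*\(?([A-E])\)?",  # "the answer is (B)" / "Answer: C"
    r"^\(?([A-E])\)?[\.\)\s]",             # a bare leading option letter
]

def extract_choice(response: str,
                   fallback_llm: Optional[Callable[[str], str]] = None) -> Optional[str]:
    """Pull a single-choice answer out of a free-form response.

    Regular expressions are tried first; if none match and a summarizer model
    is supplied, it is asked to condense the response to a single option letter.
    """
    for pattern in CHOICE_PATTERNS:
        match = re.search(pattern, response, flags=re.IGNORECASE | re.MULTILINE)
        if match:
            return match.group(1).upper()
    if fallback_llm is not None:
        summary = fallback_llm(f"Reply with only the option letter: {response}")
        return summary.strip()[:1].upper() or None
    return None

def accuracy(predictions: List[Optional[str]], ground_truths: List[str]) -> float:
    """Default exact-match accuracy aggregated over a dataset."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / max(len(ground_truths), 1)

if __name__ == "__main__":
    response = "Let us think step by step... Therefore, the answer is (B)."
    print(extract_choice(response))          # -> "B"
    print(accuracy(["B", "C"], ["B", "A"]))  # -> 0.5
```

In practice, the fallback summarizer can be any constructed model whose interface matches the general one, so the extraction step remains interchangeable with the rest of the pipeline.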
2.2 Evaluation Pipeline
- Static Builder: Constructs the LLM or its omni-extension and the evaluation facilities.
  - Model Constructor: The configuration component for the evaluated LLM or its extension. The construction and initialization of the model are organized through independent modular files, and customizable interface parameters can be passed effortlessly, enhancing user adaptability. This module is responsible for building the model structure, initializing pretrained parameters, performing GPU mapping, and transforming tokens. Implementing a new model is streamlined to focus solely on the relevant model construction class while ensuring alignment with the interfaces for token preprocessing, prompt concatenation, and response generation. The constructor also supports customization of all generation choices, including different data settings and inference methods; when necessary, default settings are applied to simplify integration and maintain robustness throughout the evaluation.
  - Evaluation Facilities: Components dedicated to extracting answers from responses and computing metrics. This module comprises distinct member classes for filters and estimators. The filters remove excessive and redundant information while extracting the key answers embedded in the model's response. Each filtered answer is then evaluated against the ground truth by the estimators, and the results are aggregated across the entire dataset.
  - Other Components: Additional parts that assist the evaluation process. The flexibility of OmniEvalKit allows extra processing functions to be integrated seamlessly into the overall system. For example, a prompt-handling module can dynamically concatenate instructions for the corresponding additional keys in the JSON data files, and a token-preprocessing unit can read visual inputs such as images, video, or tabular information. These components serve as crucial supportive elements within the overall data flow.
- Data Flow: All datasets are stored in a unified JSON format as a list of dicts. Each dimension, such as domain, language, modality, instruction, and ground-truth answer, is recorded as key-value pairs. The file also includes settings for Chain-of-Thought (CoT) (Wei et al., 2022) and Few-Shot In-Context Learning (FSL/ICL) (Brown et al., 2020; Dong et al., 2024). This standardized and compact format facilitates the addition of new tasks within the OmniEvalKit framework, enhancing overall versatility and scalability.
The data flow drives the interaction with the model and the evaluation facilities. When instructions from the data flow are fed into the model, the corresponding responses are processed by the answer-extraction module and, subsequently, the evaluation module. Specifically, relevant keys in the data JSON record the prompts required for each instruction, such as in-context few-shot examples or relevant thoughts. As the data flows into the evaluated model, it is concatenated with highly configurable custom prompts. After inference, the outputs are filtered through the key-answer extraction module, where the core content of the responses is extracted using regular expressions or additional models. The estimator module then computes metrics per example or over the entire dataset and aggregates the results. This systematic approach guarantees coherent and reliable results, preserving both integrity and utility. A minimal sketch of this format and flow is given below.
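The sketch below illustrates how such a record could flow through prompt concatenation, model inference, answer filtering, and metric estimation. The key names (instruction, few_shot, cot_prompt, gold) and helper signatures (build_prompt, run_pipeline) are assumptions made for illustration only; they are not OmniEvalKit's actual schema or API.

```python
import json

# One record of a unified dataset file: a list of dicts, one dict per question.
# All key names are hypothetical stand-ins for the dimensions described above.
example_record = {
    "dataset": "demo_choice_subset",
    "language": "en",
    "domain": "general",
    "modality": "text",
    "instruction": "Which planet is known as the Red Planet?\n(A) Venus (B) Mars (C) Jupiter (D) Saturn",
    "few_shot": [],     # optional in-context examples
    "cot_prompt": "",   # optional chain-of-thought prefix
    "gold": "B",
}

def build_prompt(record: dict) -> str:
    """Concatenate optional few-shot examples and a CoT prefix with the instruction."""
    parts = [ex["instruction"] + "\n" + ex["gold"] for ex in record.get("few_shot", [])]
    if record.get("cot_prompt"):
        parts.append(record["cot_prompt"])
    parts.append(record["instruction"])
    return "\n\n".join(parts)

def run_pipeline(records: list, model, filter_fn, estimator) -> float:
    """Data flow: prompt -> model response -> filtered answer -> aggregated metric."""
    predictions, golds = [], []
    for record in records:
        response = model(build_prompt(record))   # the static builder supplies `model`
        predictions.append(filter_fn(response))  # answer-extraction facility
        golds.append(record["gold"])
    return estimator(predictions, golds)         # e.g. accuracy over the dataset

if __name__ == "__main__":
    dummy_model = lambda prompt: "The answer is (B)."
    dummy_filter = lambda resp: "B" if "(B)" in resp else None
    dummy_metric = lambda p, g: sum(x == y for x, y in zip(p, g)) / len(g)
    print(json.dumps(example_record, indent=2)[:80], "...")
    print(run_pipeline([example_record], dummy_model, dummy_filter, dummy_metric))  # -> 1.0
```

Under this assumption, adding a new dataset amounts to emitting JSON records in a compatible style, and a new model only needs to expose a prompt-to-response interface matching the call used in the pipeline.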
3 Conclusion & Derivative Fields
OmniEvalKit is a highly flexible and modular evaluation framework designed for assessing M3 types of LLMs and their omni-extensions. Its modular design allows new models and datasets to be added with just a single-file modification. To date, OmniEvalKit has integrated over 50 different evaluation datasets and generated more than 5,000 sets of results for various LLMs and their omni-extensions. This adaptability opens up numerous possibilities in related application areas:
- New Pattern and Law Exploration: Investigating the fundamental patterns and laws behind extensive evaluation results. For example, scaling laws are crucial for understanding the trends of LLMs and their omni-extensions with respect to relevant variables (Kaplan et al., 2020; Zhang et al., 2024a), such as model performance relative to dataset scale and training FLOPs. Scaling laws can guide the selection of hyperparameters and architectures, as well as the prediction of model capabilities on downstream tasks. Leveraging the extensive evaluation results from OmniEvalKit, custom metrics and training-related performance measurements can be used for scaling-law research (Kaplan et al., 2020). In particular, the set of evaluation models quickly deployable within OmniEvalKit and the flexible, adjustable metric-assessment module help researchers explore the relationships among model capabilities across different dimensions and hierarchical layers for extremely large-scale and diverse model types. A minimal curve-fitting sketch is given after this list.
- Evaluation and Selection of Special Models, Metrics, or Vertical Domains: When LLMs and their omni-extensions face deployment demands in vertical domains involving new modalities, domains, and tasks, rapidly evaluating existing solutions becomes crucial for guiding the pre-selection of optimal models. Some methods (You et al., 2021) rely on proxy metrics of generalization, measured against the evaluation results. Other learnable strategies (Zhang et al., 2023b) investigate how to learn generalized criteria for mapping model performance from existing data. Moreover, OmniEvalKit can be extended to specific model evaluations and data filtering, for example, guiding re-rankers to act as reward models in reinforcement learning from human feedback (Ouyang et al., 2022), batch-generating data-quality paradigms (Biderman et al., 2023), or initializing embedding models (Chen et al., 2024a) for a comprehensive assessment of data components.
- Exploration of New Embeddings and Key Outputs: OmniEvalKit's modular components continuously evolve and can be embedded in both the input and output data streams. The framework effectively captures the dynamics of each stage of the model inference process. These components not only log the representations produced at any layer but also track information about future predictions, enabling researchers to gain deeper insights into the model's behavior. Additionally, OmniEvalKit offers customizable decoding methods for generated sequences, creating a flexible environment for the unified evaluation of various search strategies. This comprehensive approach allows researchers to analyze and compare different generation strategies more effectively.
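As a small illustration of the scaling-law analyses mentioned above, the sketch below fits a power law of the form L(N) ≈ c · N^(−α) to a few made-up (model size, loss) pairs in log-log space; the numbers and variable names are purely hypothetical and are not OmniEvalKit results.

```python
import numpy as np

# Illustrative only: hypothetical (model size, loss) pairs such as those that
# could be collected from a benchmark sweep; the values are invented.
params = np.array([1.3e9, 7e9, 13e9, 70e9])   # model sizes N
losses = np.array([2.10, 1.85, 1.75, 1.55])   # e.g. average evaluation loss

# A power law is linear in log-log space: log L = log c - alpha * log N.
slope, intercept = np.polyfit(np.log(params), np.log(losses), deg=1)
alpha, c = -slope, np.exp(intercept)
print(f"fitted exponent alpha = {alpha:.3f}, coefficient c = {c:.3f}")

# Extrapolate the fitted trend to a larger model size.
n_new = 180e9
print(f"predicted loss at N={n_new:.0e}: {c * n_new ** (-alpha):.3f}")
```

The same fitting procedure applies to any custom metric exposed by the evaluation results, with dataset scale or training FLOPs substituted for model size as the independent variable.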
OmniEvalKit has shown stable performance across diverse devices and will continue to evolve to support additional GPU architectures and deep learning deployment frameworks.
References
- Abdin etal. (2024)M.I. Abdin, S.A. Jacobs, A.A. Awan, J.Aneja, A.Awadallah, etal.Phi-3 technical report: A highly capable language model locally on your phone.CoRR, abs/2404.14219, 2024.
- Bai etal. (2023a)J.Bai, S.Bai, Y.Chu, Z.Cui, K.Dang, etal.Qwen technical report.CoRR, abs/2309.16609, 2023a.
- Bai etal. (2023b)J.Bai, S.Bai, S.Yang, S.Wang, S.Tan, P.Wang, J.Lin, C.Zhou, and J.Zhou.Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023b.
- Beeching etal. (2023)E.Beeching, C.Fourrier, N.Habib, S.Han, N.Lambert, N.Rajani, O.Sanseviero, L.Tunstall, and T.Wolf.Open llm leaderboard.https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, 2023.
- Biderman etal. (2023)S.Biderman, H.Schoelkopf, Q.G. Anthony, H.Bradley, K.O’Brien, E.Hallahan, M.A. Khan, S.Purohit, U.S. Prashanth, etal.Pythia: A suite for analyzing large language models across training and scaling.In ICML, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430, 23–29 Jul 2023.
- Bisk etal. (2020)Y.Bisk, R.Zellers, R.L. Bras, J.Gao, and Y.Choi.PIQA: reasoning about physical commonsense in natural language.In AAAI, pages 7432–7439, 2020.
- Brown etal. (2020)T.Brown, B.Mann, N.Ryder, M.Subbiah, J.D. Kaplan, P.Dhariwal, A.Neelakantan, P.Shyam, G.Sastry, etal.Language models are few-shot learners.In NeurIPS, volume33, pages 1877–1901. Curran Associates, Inc., 2020.
- Chen etal. (2024a)J.Chen, S.Xiao, P.Zhang, K.Luo, D.Lian, and Z.Liu.Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024a.
- Chen etal. (2023)L.Chen, J.Li, X.Dong, P.Zhang, C.He, J.Wang, F.Zhao, and D.Lin.Sharegpt4v: Improving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793, 2023.
- Chen etal. (2024b)L.Chen, J.Li, X.Dong, P.Zhang, Y.Zang, Z.Chen, H.Duan, J.Wang, Y.Qiao, D.Lin, and F.Zhao.Are we on the right way for evaluating large vision-language models?CoRR, abs/2403.20330, 2024b.
- Chen etal. (2021)M.Chen, J.Tworek, H.Jun, Q.Yuan, H.P. deOliveiraPinto, etal.Evaluating large language models trained on code.CoRR, abs/2107.03374, 2021.
- Chi etal. (2021)P.Chi, P.Chung, T.Wu, C.Hsieh, Y.Chen, S.Li, and H.Lee.Audio albert: A lite bert for self-supervised learning of audio representation.In IEEE SLT, pages 344–350, 2021.
- Clark etal. (2018)P.Clark, I.Cowhey, O.Etzioni, T.Khot, A.Sabharwal, C.Schoenick, and O.Tafjord.Think you have solved question answering? try arc, the AI2 reasoning challenge.CoRR, abs/1803.05457, 2018.
- Contributors (2023)O.Contributors.Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass, 2023.
- Dong etal. (2024)Q.Dong, L.Li, D.Dai, C.Zheng, J.Ma, R.Li, H.Xia, J.Xu, Z.Wu, T.Liu, B.Chang, X.Sun, L.Li, and Z.Sui.A survey on in-context learning, 2024.URL https://arxiv.org/abs/2301.00234.
- Du etal. (2022)Z.Du, Y.Qian, X.Liu, M.Ding, J.Qiu, Z.Yang, and J.Tang.Glm: General language model pretraining with autoregressive blank infilling.In ACL, pages 320–335, 2022.
- Duan etal. (2024)H.Duan, J.Yang, Y.Qiao, X.Fang, L.Chen, etal.Vlmevalkit: An open-source toolkit for evaluating large multi-modality models.In ACM MM, pages 11198–11201. ACM, 2024.
- Elfilali etal. (2024)A.Elfilali, H.Alobeidli, C.Fourrier, B.E.A. Boussaha, R.Cojocaru, N.Habib, and H.Hacid.Open arabic llm leaderboard.https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard, 2024.
- Fei etal. (2023)Z.Fei, X.Shen, D.Zhu, F.Zhou, Z.Han, S.Zhang, K.Chen, Z.Shen, and J.Ge.Lawbench: Benchmarking legal knowledge of large language models.CoRR, abs/2309.16289, 2023.
- Fei etal. (2024)Z.Fei, X.Shen, D.Zhu, F.Zhou, Z.Han, etal.Lawbench: Benchmarking legal knowledge of large language models.In EMNLP, pages 7933–7962, 2024.
- Fourrier etal. (2024)C.Fourrier, N.Habib, A.Lozovskaya, K.Szafer, and T.Wolf.Open llm leaderboard v2.https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.
- Fu etal. (2023)C.Fu, P.Chen, Y.Shen, Y.Qin, M.Zhang, X.Lin, Z.Qiu, W.Lin, J.Yang, X.Zheng, etal.Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023.
- Goertzel (2014)B.Goertzel.Artificial general intelligence: Concept, state of the art, and future prospects.J. Artif. Gen. Intell., 5(1):1–48, 2014.
- Guan etal. (2023)T.Guan, F.Liu, X.Wu, R.Xian, Z.Li, X.Liu, X.Wang, L.Chen, F.Huang, Y.Yacoob, etal.Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models.arXiv preprint arXiv:2310.14566, 2023.
- Hegselmann etal. (2023)S.Hegselmann, A.Buendia, H.Lang, M.Agrawal, X.Jiang, and D.A. Sontag.Tabllm: Few-shot classification of tabular data with large language models.In AISTATS, volume 206 of Proceedings of Machine Learning Research, pages 5549–5581, 2023.
- Hendrycks etal. (2021)D.Hendrycks, C.Burns, S.Basart, A.Zou, M.Mazeika, D.Song, and J.Steinhardt.Measuring massive multitask language understanding.In ICLR, 2021.
- Hendrycks etal. (2024)D.Hendrycks, C.Burns, S.Basart, A.Zou, M.Mazeika, D.Song, and J.Steinhardt.Measuring massive multitask language understanding.https://huggingface.co/datasets/openai/MMMLU, 2024.
- Hui etal. (2024)B.Hui, J.Yang, Z.Cui, J.Yang, D.Liu, L.Zhang, T.Liu, J.Zhang, B.Yu, K.Dang, A.Yang, R.Men, F.Huang, X.Ren, X.Ren, J.Zhou, and J.Lin.Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024.
- Islam etal. (2023)P.Islam, A.Kannappan, D.Kiela, R.Qian, N.Scherrer, and B.Vidgen.Financebench: A new benchmark for financial question answering.CoRR, abs/2311.11944, 2023.
- Jiang etal. (2023)A.Q. Jiang, A.Sablayrolles, A.Mensch, C.Bamford, D.S. Chaplot, D.delas Casas, F.Bressand, G.Lengyel, G.Lample, L.Saulnier, L.R. Lavaud, M.-A. Lachaux, P.Stock, T.L. Scao, T.Lavril, T.Wang, T.Lacroix, and W.E. Sayed.Mistral 7b, 2023.
- Jiang etal. (2024)A.Q. Jiang, A.Sablayrolles, A.Roux, A.Mensch, B.Savary, etal.Mixtral of experts.CoRR, abs/2401.04088, 2024.
- Kaplan etal. (2020)J.Kaplan, S.McCandlish, T.Henighan, T.B. Brown, B.Chess, R.Child, S.Gray, A.Radford, J.Wu, and D.Amodei.Scaling laws for neural language models.CoRR, abs/2001.08361, 2020.
- Kembhavi etal. (2016)A.Kembhavi, M.Salvato, E.Kolve, M.Seo, H.Hajishirzi, and A.Farhadi.A diagram is worth a dozen images.In ECCV, pages 235–251, 2016.
- Li etal. (2023)H.Li, Y.Zhang, F.Koto, Y.Yang, H.Zhao, Y.Gong, N.Duan, and T.Baldwin.CMMLU: Measuring massive multitask language understanding in Chinese.arXiv preprint arXiv:2306.09212, 2023.
- Lin (2004)C.-Y. Lin.ROUGE: A package for automatic evaluation of summaries.In Text Summarization Branches Out, pages 74–81. Association for Computational Linguistics, 2004.
- Lin etal. (2022)X.V. Lin, T.Mihaylov, M.Artetxe, T.Wang, S.Chen, etal.Few-shot learning with multilingual generative language models.In EMNLP, pages 9019–9052, 2022.
- Liu etal. (2023a)H.Liu, C.Li, Q.Wu, and Y.J. Lee.Visual instruction tuning.NeurIPS, 36, 2023a.
- Liu etal. (2024)H.Liu, C.Li, Y.Li, B.Li, Y.Zhang, S.Shen, and Y.J. Lee.Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- Liu etal. (2023b)Y.Liu, H.Duan, Y.Zhang, B.Li, S.Zhang, W.Zhao, Y.Yuan, J.Wang, C.He, Z.Liu, etal.Mmbench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023b.
- Lu etal. (2022)P.Lu, S.Mishra, T.Xia, L.Qiu, K.-W. Chang, S.-C. Zhu, O.Tafjord, P.Clark, and A.Kalyan.Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 35:2507–2521, 2022.
- Mihaylov etal. (2018)T.Mihaylov, P.Clark, T.Khot, and A.Sabharwal.Can a suit of armor conduct electricity? A new dataset for open book question answering.In EMNLP, pages 2381–2391. Association for Computational Linguistics, 2018.
- Nie etal. (2020)Y.Nie, A.Williams, E.Dinan, M.Bansal, J.Weston, and D.Kiela.Adversarial NLI: A new benchmark for natural language understanding.In ACL, pages 4885–4901. Association for Computational Linguistics, 2020.
- OpenAI (2022)OpenAI.Chatgpt.https://openai.com/blog/chatgpt, 2022.
- OpenAI (2023)OpenAI.Gpt-4v(ision) system card.https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.Accessed: 2024-05-26.
- OpenAI (2024a)OpenAI.Hello gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024a.Accessed: 2024-05-26.
- OpenAI (2024b)OpenAI.Introducing gpt-4o: our fastest and most affordable flagship model.https://platform.openai.com/docs/guides/vision, 2024b.Accessed: 2024-05-26.
- Ouyang etal. (2022)L.Ouyang, J.Wu, X.Jiang, D.Almeida, C.L. Wainwright, etal.Training language models to follow instructions with human feedback.In NeurIPS, 2022.
- Paech (2023)S.J. Paech.Eq-bench: An emotional intelligence benchmark for large language models.CoRR, abs/2312.06281, 2023.
- Papineni etal. (2002)K.Papineni, S.Roukos, T.Ward, and W.Zhu.Bleu: a method for automatic evaluation of machine translation.In ACL, pages 311–318. Association for Computational Linguistics, 2002.
- Romera-Paredes etal. (2024)B.Romera-Paredes, M.Barekatain, A.Novikov, M.Balog, M.P. Kumar, E.Dupont, F.J.R. Ruiz, J.S. Ellenberg, P.Wang, O.Fawzi, P.Kohli, and A.Fawzi.Mathematical discoveries from program search with large language models.Nat., 625(7995):468–475, 2024.
- Rozière etal. (2023)B.Rozière, J.Gehring, F.Gloeckle, S.Sootla, I.Gat, etal.Code llama: Open foundation models for code.CoRR, abs/2308.12950, 2023.
- Singh etal. (2024)A.K. Singh, R.Murthy, V.kumar, J.Sen, and G.Ramakrishnan.Indic qa benchmark: A multilingual benchmark to evaluate question answering capability of llms for indic languages, 2024.URL https://arxiv.org/abs/2407.13522.
- Suzgun etal. (2023)M.Suzgun, N.Scales, N.Schärli, S.Gehrmann, Y.Tay, H.W. Chung, A.Chowdhery, Q.V. Le, E.H. Chi, D.Zhou, and J.Wei.Challenging big-bench tasks and whether chain-of-thought can solve them.In ACL, pages 13003–13051. Association for Computational Linguistics, 2023.
- Touvron etal. (2023a)H.Touvron, T.Lavril, G.Izacard, X.Martinet, M.-A. Lachaux, T.Lacroix, B.Rozière, N.Goyal, E.Hambro, F.Azhar, etal.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023a.
- Touvron etal. (2023b)H.Touvron, L.Martin, K.Stone, P.Albert, A.Almahairi, Y.Babaei, N.Bashlykov, S.Batra, P.Bhargava, etal.Llama 2: Open foundation and fine-tuned chat models.CoRR, abs/2307.09288, 2023b.doi: 10.48550/ARXIV.2307.09288.URL https://doi.org/10.48550/arXiv.2307.09288.
- Wang etal. (2019)A.Wang, A.Singh, J.Michael, F.Hill, O.Levy, and S.R. Bowman.GLUE: A multi-task benchmark and analysis platform for natural language understanding.In ICLR. OpenReview.net, 2019.
- Wang etal. (2023)H.Wang, C.Liu, N.Xi, Z.Qiang, S.Zhao, B.Qin, and T.Liu.Huatuo: Tuning llama model with chinese medical knowledge.CoRR, abs/2304.06975, 2023.
- Wang etal. (2024)P.Wang, S.Bai, S.Tan, S.Wang, Z.Fan, J.Bai, K.Chen, X.Liu, J.Wang, W.Ge, Y.Fan, K.Dang, M.Du, X.Ren, R.Men, D.Liu, C.Zhou, J.Zhou, and J.Lin.Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024.
- Wei etal. (2022)J.Wei, X.Wang, D.Schuurmans, M.Bosma, B.Ichter, F.Xia, E.H. Chi, Q.V. Le, and D.Zhou.Chain-of-thought prompting elicits reasoning in large language models.In NeurIPS, 2022.
- Xie etal. (2023)Q.Xie, W.Han, X.Zhang, Y.Lai, M.Peng, A.Lopez-Lira, and J.Huang.Pixiu: A large language model, instruction data and evaluation benchmark for finance, 2023.
- Yang etal. (2024a)A.Yang, B.Yang, B.Hui, B.Zheng, B.Yu, etal.Qwen2 technical report.CoRR, abs/2407.10671, 2024a.
- Yang etal. (2024b)A.Yang, B.Zhang, B.Hui, B.Gao, B.Yu, C.Li, D.Liu, J.Tu, J.Zhou, J.Lin, K.Lu, M.Xue, R.Lin, T.Liu, X.Ren, and Z.Zhang.Qwen2.5-math technical report: Toward mathematical expert model via self-improvement.CoRR, abs/2409.12122, 2024b.
- Yao etal. (2024)Y.Yao, T.Yu, A.Zhang, C.Wang, J.Cui, H.Zhu, T.Cai, H.Li, W.Zhao, Z.He, etal.Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024.
- You etal. (2021)K.You, Y.Liu, J.Wang, and M.Long.Logme: Practical assessment of pre-trained models for transfer learning.In ICML, volume 139, pages 12133–12143, 2021.
- Yu etal. (2023)W.Yu, Z.Yang, L.Li, J.Wang, K.Lin, Z.Liu, X.Wang, and L.Wang.Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023.
- Yue etal. (2023)X.Yue, Y.Ni, K.Zhang, T.Zheng, R.Liu, G.Zhang, S.Stevens, D.Jiang, W.Ren, Y.Sun, etal.Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.arXiv preprint arXiv:2311.16502, 2023.
- Zellers etal. (2019)R.Zellers, A.Holtzman, Y.Bisk, A.Farhadi, and Y.Choi.HellaSwag: Can a machine really finish your sentence?In ACL, pages 4791–4800, 2019.
- Zhang etal. (2024a)B.Zhang, Z.Liu, C.Cherry, and O.Firat.When scaling meets LLM finetuning: The effect of data, model and finetuning method.In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024a.
- Zhang etal. (2023a)H.Zhang, X.Li, and L.Bing.Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858, 2023a.
- Zhang and Li (2023)Y.Zhang and H.Li.Can large language model comprehend ancient chinese? A preliminary test on ACLUE.CoRR, abs/2310.09550, 2023.
- Zhang etal. (2023b)Y.Zhang, T.Huang, Y.Ding, D.Zhan, and H.Ye.Model spider: Learning to rank pre-trained models efficiently.In NeurIPS, 2023b.
- Zhang etal. (2024b)Y.Zhang, S.Lu, Y.Li, Y.Ma, Q.Chen, Z.Xu, W.Luo, K.Zhang, D.Zhan, and H.Ye.Wings: Learning multimodal llms without text-only forgetting.CoRR, abs/2406.03496, 2024b.
- Zhu etal. (2024)D.Zhu, J.Chen, X.Shen, X.Li, and M.Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advanced large language models.In ICLR, 2024.