LLMs' Creativity Leaderboard

This leaderboard aims to showcase the creativity evaluation results of state-of-the-art Large Language Models (LLMs) in multimodal scenarios. It is based on the Oogiri game, a creativity-driven task requiring humor, associative thinking, and the ability to generate unexpected responses to text, images, or their combination. The evaluation is further supported by LoTbench, an interactive and causality-aware evaluation benchmark built upon the Leap-of-Thought (LoT), specifically tailored to assess the creativity of multimodal LLMs.

Overview of LoTbench
As illustrated in (a) and (b), LoTbench extends standard evaluations, which assess an LLM's creativity via accuracy on selection and ranking tasks, by letting the LLM generate responses over multiple rounds. A causal evaluator judges whether each response approaches a high-quality human-level creative response (HHCR); if not, the model enters a rethinking phase for the next round, as shown in (c).

Compared with standard evaluations for assessing creativity in multimodal LLMs, LoTbench offers several key advantages.

  • (1) To address the information leakage and limited interpretability of standard evaluations, LoTbench trains LLMs to assist in generating specific high-quality human-level creative responses (HHCRs).
  • (2) With causal reasoning techniques, LoTbench measures creativity by analyzing the average number of rounds required for an LLM to reach HHCRs. Fewer required rounds indicate higher human-level creativity.
  • (3) LoTbench aligns with human cognitive theories and reveals that, while current LLMs' creativity is not yet high, it approaches human levels and has the potential to surpass human creativity.
The leaderboard currently evaluates a variety of multimodal LLMs, encompassing both closed-source and open-source models. The evaluation is conducted in a zero-shot setting to determine whether these models can produce high-quality, human-level creative outputs without fine-tuning on the benchmark. Each model undergoes multiple rounds of generation, with the final creativity score derived from these iterations. Notably, LoTbench provides three reference levels of human creativity—“Human (high)”, “Human (medium)”, and “Human (low)”—which represent the creativity performance tiers of human participants and serve as comparative benchmarks.
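The multi-round protocol above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not the benchmark's actual API: the names `generate`, `is_hhcr`, the round cap, and the inverse-mean scoring are all assumptions made for the example; the only property taken from the text is that the model retries after a failed round and that fewer average rounds to reach an HHCR means higher creativity.

```python
# Hypothetical sketch of LoTbench's multi-round loop: the model proposes a
# response, a causal evaluator checks whether it reaches a high-quality
# human-level creative response (HHCR); if not, the model "rethinks" and
# retries, conditioned on its failed attempt. All names are illustrative.
from statistics import mean

MAX_ROUNDS = 5  # assumed cap on rethinking rounds


def rounds_to_hhcr(generate, is_hhcr, prompt, max_rounds=MAX_ROUNDS):
    """Number of rounds the model needs to reach an HHCR (or the cap)."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        response = generate(prompt, feedback)  # model proposes a response
        if is_hhcr(response):                  # causal evaluator's verdict
            return round_no
        feedback = response                    # rethinking phase for next round
    return max_rounds                          # never reached an HHCR


def creativity_score(generate, is_hhcr, prompts):
    """Fewer average rounds => higher creativity (here: inverse of the mean)."""
    avg = mean(rounds_to_hhcr(generate, is_hhcr, p) for p in prompts)
    return 1.0 / avg


# Mock model for demonstration: fails its first attempt, succeeds once it
# receives feedback from the rethinking phase.
def mock_generate(prompt, feedback):
    return "dull" if feedback is None else "funny"


if __name__ == "__main__":
    score = creativity_score(mock_generate, lambda r: r == "funny", ["p1", "p2"])
    print(round(score, 3))  # each prompt needs 2 rounds -> score 0.5
```

The mock evaluator is a stand-in for the trained causal evaluator described above; in the real benchmark, the stopping criterion is whether the generated response matches a specific HHCR, not a simple string check.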


Model categories: Human Expert, Open-Source, Proprietary

Leaderboard columns: Rank, Model Name, Size, Creativity Score, Date