QuaCer-C: Quantitative Certification of Knowledge Comprehension in LLMs

Feb 15, 2024
Ishita Chaudhary
Vedaant Jain
Gagandeep Singh
Abstract
Large Language Models (LLMs) have demonstrated impressive performance on several benchmarks. However, traditional studies do not provide formal guarantees on the performance of LLMs. In this work, we propose a novel certification framework for LLMs, QuaCer-C, wherein we formally certify the knowledge-comprehension capabilities of popular LLMs. Our certificates are quantitative: they consist of high-confidence, tight bounds on the probability that the target LLM gives the correct answer on any relevant knowledge comprehension prompt. Our certificates for the Llama, Vicuna, and Mistral LLMs indicate that knowledge comprehension capability improves with an increase in the number of parameters and that the Mistral model is less performant than the rest in this evaluation.
Type
Publication
In SeTLLM@ICLR 2024

QuaCer-C introduces a novel certification framework for Large Language Models (LLMs), addressing the lack of formal guarantees in traditional performance evaluations. This work provides quantitative certificates that offer high-confidence, tight bounds on the probability of correct answers for knowledge comprehension prompts.

Key contributions of this research include:

  1. Formal specification of the knowledge comprehension property using popular knowledge graphs like Wikidata5m.
  2. Modeling certification as a probability estimation problem, leveraging Clopper-Pearson confidence intervals for provable, high-confidence bounds.
  3. Generation of certificates for popular LLMs including Llama 7B and 13B, Vicuna 7B and 13B, and Mistral-7B.
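To make the probability-estimation view in point 2 concrete, here is a minimal sketch, not the authors' code. It assumes the target LLM can be treated as a black box whose answer to a prompt is simply marked correct or incorrect, and that prompts are drawn independently from a fixed distribution. The names `sample_prompt`, `is_correct`, and `certify` are hypothetical. The interval itself is the standard exact Clopper-Pearson interval, computed here by bisection on the binomial CDF instead of beta-distribution quantiles so that it needs only the Python standard library.

```python
import math
import random
from typing import Any, Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def _bisect(f: Callable[[float], float], target: float, increasing: bool,
            iters: int = 100) -> float:
    """Find p in [0, 1] with f(p) == target, assuming f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        # Move toward the side of the interval where f crosses the target.
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) Clopper-Pearson interval for k successes in n trials."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k | p) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: the p at which P(X <= k | p) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


def certify(sample_prompt: Callable[[random.Random], Any],
            is_correct: Callable[[Any], bool],
            n: int = 500, alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Hypothetical certification loop: sample n prompts, count correct answers,
    and bound the probability of a correct answer with Clopper-Pearson."""
    rng = random.Random(seed)
    successes = sum(bool(is_correct(sample_prompt(rng))) for _ in range(n))
    return clopper_pearson(successes, n, alpha)


if __name__ == "__main__":
    # Toy stand-ins: a "prompt" is a random float and the "model" is correct
    # whenever it is below 0.7, so the true success probability is 0.7.
    lo, hi = certify(lambda rng: rng.random(), lambda prompt: prompt < 0.7, n=500)
    print(f"95% bounds on P(correct): [{lo:.3f}, {hi:.3f}]")
```

In a real setting, `sample_prompt` would draw a knowledge comprehension prompt from the specification (for example, one built from a Wikidata5m subgraph) and `is_correct` would query the LLM and check its answer. The interval width shrinks roughly with the square root of `n`, which is how more samples yield tighter certificates at the same confidence level.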

Our findings indicate that knowledge comprehension capability improves as the number of model parameters increases. Comparisons across model families show that Mistral performs less effectively than Llama and Vicuna in this evaluation.

This work contributes to the development of more reliable and verifiable AI systems, crucial for applications requiring trusted knowledge processing and comprehension.