FLASK

[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Overview

FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets) is a task-agnostic evaluation protocol for language models that uses fine-grained, instance-wise skill sets as its evaluation metrics. This repository is the official codebase for the FLASK project and provides guidelines for both model-based and human-based evaluation.

Features

  • Task-agnostic evaluation protocol for language models
  • Implementation of fine-grained instance-wise skill sets as metrics
  • GPT-4-based evaluation through the OpenAI API (requires an OpenAI API key)
  • Model inference and evaluation functionalities
  • Aggregation and analysis capabilities based on skills, domains, and difficulty levels
  • Metadata annotation for domain and skill set categorization
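
To make the aggregation feature concrete, the following is a minimal sketch of how per-instance skill scores could be averaged by skill, domain, or difficulty level. It is not the repository's implementation: the record layout, the field names (`domain`, `difficulty`, `skill_scores`), the helper names, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-instance evaluation records. The field names, skill labels,
# and scores are illustrative placeholders, not the repository's output schema.
records = [
    {"domain": "Math", "difficulty": 3,
     "skill_scores": {"Logical Correctness": 4, "Conciseness": 5}},
    {"domain": "Math", "difficulty": 5,
     "skill_scores": {"Logical Correctness": 2}},
    {"domain": "History", "difficulty": 1,
     "skill_scores": {"Factuality": 5, "Conciseness": 3}},
]


def aggregate_by_skill(records):
    """Average each skill's score over the instances annotated with that skill."""
    buckets = defaultdict(list)
    for rec in records:
        for skill, score in rec["skill_scores"].items():
            buckets[skill].append(score)
    return {skill: mean(scores) for skill, scores in buckets.items()}


def aggregate_by_field(records, field):
    """Average all skill scores of the instances sharing a metadata value
    (for example the same domain or the same difficulty level)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[field]].extend(rec["skill_scores"].values())
    return {value: mean(scores) for value, scores in buckets.items()}


by_skill = aggregate_by_skill(records)
by_domain = aggregate_by_field(records, "domain")
by_difficulty = aggregate_by_field(records, "difficulty")
```

Because each instance carries only the skills relevant to it, the per-skill averages are computed over different subsets of instances, which is why the sketch groups scores in lists before averaging instead of dividing by the total number of records.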