GenAI: Code Challenge (Pilot)
Evaluating AI-generated test code.
Overview
The NIST GenAI Pilot Code Challenge will measure and evaluate unit tests generated by Artificial Intelligence (AI) for testing elementary-level Python code. This pilot provides an evaluation framework that facilitates the development and improvement of AI large language model (LLM) technologies for writing effective tests for software code. The Pilot Code Challenge is open to all who wish to participate and abide by the rules and procedures described throughout this website and the evaluation plan.
What
The Code task is to automatically generate high-quality tests for Python code, given a textual specification of the task the code is intended to carry out. There will be two prompt conditions: fixed prompt and custom prompt. Participants will be given a fixed prompt for each trial_id; this fixed prompt will already have the textual specification incorporated. Every submission must contain an output produced with the fixed prompt. Participants will also have the opportunity to customize the prompts and may include up to 9 alternative custom prompts in a submission. Both the fixed prompt and 1 custom prompt per trial_id are required in a submission. These text specifications may include a method header with all the parameter names and a brief summary of the code's inputs, outputs, and intended behavior. For more details, please see the GenAI Code Pilot Evaluation Plan.
Who
Teams from academia, industry, and other research labs are invited and encouraged to contribute to Generative AI research through the GenAI platform. The platform is designed to support various modalities and both "Generator" and "Discriminator" technologies. The GenAI Code task focuses specifically on Generators, measuring the quality of AI-generated test code.
How
To take part in the GenAI Code evaluation, participants must register on this website and agree to the license to download data and upload submissions. NIST will make all necessary data resources available. Please refer to the published schedule for data release dates. Participants will be able to upload their AI-generated test code to the challenge website and see their results displayed on the scoreboard.
Task Coordinator
If you have any questions, please email the NIST GenAI team.
Schedule
Date | Milestone
---|---
July 16, 2025 | Evaluation Plan Posted
July 23, 2025 | Registration and Submissions Open
September 12, 2025 | Registration and Submissions Close
September 26, 2025 | GenAI Code Preliminary Results Available (Leaderboard)
TBD | GenAI Code Pilot Workshop
GenAI Code Resources
Trustworthy & Responsible AI Resource Center
Code Generation Instructions
GenAI Code participants are provided an input json file containing the code bank input specifications, separated and labelled by trial_id, along with the corresponding fixed prompts. From this input file, participants submit a json file of test outputs containing pytest tests that can test arbitrary implementations of the methods specified by the input specifications. This json file has outputs separated and labelled by trial_id and by prompt_number. All tests generated using the fixed prompts will have prompt_number 0. Please see the GenAI Code Pilot Evaluation Plan for detailed instructions.
Only GenAI Code participants who have completed and submitted all required forms will be allowed access. Please check the published schedule for testing data release dates.
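As a rough illustration only, the sketch below shows one way a participant might walk the input file and assemble outputs keyed by trial_id and prompt_number. The field names (fixed_prompt, prompt, tests) and the nesting used here are assumptions made for this sketch; the authoritative format is defined in the GenAI Code Pilot Evaluation Plan and the dry-run example submission.

import json

def generate_tests(prompt: str) -> str:
    """Stand-in for the participant's own LLM pipeline; returns pytest source text."""
    raise NotImplementedError("call your LLM here")

def build_outputs(input_path: str, output_path: str) -> None:
    # Hypothetical input structure: {trial_id: {"fixed_prompt": "..."}}.
    # The real schema is given by the Evaluation Plan and the dry-run data.
    with open(input_path) as f:
        trials = json.load(f)

    outputs = {}
    for trial_id, trial in trials.items():
        fixed_prompt = trial["fixed_prompt"]  # assumed key name
        outputs[trial_id] = {
            "0": {  # prompt_number 0 is reserved for the fixed prompt
                "prompt": fixed_prompt,
                "tests": generate_tests(fixed_prompt),
            },
            # custom prompts (prompt_number 1 through 9) would be added here
        }

    with open(output_path, "w") as f:
        json.dump(outputs, f, indent=2)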
Optional Dry-Run Track
In addition to the main pilot track, there is a dry-run track on a small development data set. For this development data set, participants will be given not only the input problem file but also the key file, one baseline submission, and correct and incorrect code implementations. Participants may submit to the dry-run track, and those submissions will be scored; however, dry-run submissions will not be further analyzed or posted on a scoreboard. All submissions to the dry-run track are optional.
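Because the dry-run data includes correct and incorrect code implementations, generated tests can be checked locally before submitting. The minimal sketch below assumes hypothetical file names (correct/add_impl.py, incorrect/add_impl.py, generated_test_add.py); it simply swaps each implementation in as genai_code_file.py, matching the import line required by the fixed prompt, and runs pytest on the generated tests.

import shutil
import pytest

def run_generated_tests(implementation_path: str, test_file: str) -> int:
    """Copy a candidate implementation into place and run the generated tests.

    The implementation and test file paths are placeholders for this sketch.
    """
    shutil.copy(implementation_path, "genai_code_file.py")
    # pytest.main returns 0 when all tests pass, nonzero otherwise.
    return pytest.main(["-q", test_file])

# Generated tests should pass on a provided correct implementation and
# fail on at least one check for an incorrect one, for example:
# run_generated_tests("correct/add_impl.py", "generated_test_add.py")
# run_generated_tests("incorrect/add_impl.py", "generated_test_add.py")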
Example Fixed Prompt
"We have python code that implements the following specification.
Specification:
def add(x: int, y: int) -> int:
    """
    Given two integers x, and y, return the sum of x and y. If either x or y is not
    an integer, raise a TypeError Exception.
    """
Please write python pytest test code that comprehensively tests the code for method add to determine if the code correctly meets the specification or not. When writing tests:
* write a comprehensive test suite,
* test edge cases,
* only generate tests that you are confident are correct, and
* include tests for TypeError cases.
Please write '###|=-=-=beginning of tests=-=-=|' before the tests. Write '###|=-=-=end of tests=-=-=|' immediately after the tests. Import any needed packages, including pytest. Additionally import the code being tested by adding the line `from genai_code_file import add` the line after '###|=-=-=beginning of tests=-=-=|'. Do not provide an implementation of the method add with the tests."
Submission Guidelines
- Please refer to the GenAI Code Pilot Evaluation Plan for specific instructions. Please see the dry-run submission provided with the data for an example submission format.
- Code should be compilable so that it can be run with the pytest package; a rough local syntax check is sketched after this list.
- Each submission should be a single .json file containing both the fixed prompt and custom prompt outputs.
- For all submissions and all problems, the prompt used must be provided as metadata within the submission.
- Submission notes: per the published schedule, the submission page (form) will be open and available via the GenAI website for teams to submit their data outputs. Please make sure to follow the schedule and submit on time, as extending the submission dates may not be possible.
- Upon submission, NIST will validate the data outputs uploaded and report any errors to the submitter.
- Please note that submitting your data outputs constitutes your agreement to the "Protocol and Rules" section of the Evaluation Plan.
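As a rough local pre-check (not the official validator), generated test strings can be compiled with Python's built-in compile() to catch syntax errors before packaging the submission. The key names below follow the hypothetical layout sketched earlier in this page, not the official schema; see the dry-run example submission for the real format.

import json

def check_submission_syntax(submission_path: str) -> None:
    # Assumed layout: {trial_id: {prompt_number: {"prompt": ..., "tests": ...}}}.
    with open(submission_path) as f:
        submission = json.load(f)
    for trial_id, prompts in submission.items():
        for prompt_number, entry in prompts.items():
            try:
                compile(entry["tests"], f"trial {trial_id} / prompt {prompt_number}", "exec")
            except SyntaxError as err:
                print(f"Syntax error in trial {trial_id}, prompt {prompt_number}: {err}")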
Example Sample Output:
from genai_code_file import add
import pytest

class TestCode(object):
    def test_add(self):
        assert add(2, 3) == 5
        with pytest.raises(TypeError):
            add("abc", "def")
Submission Validation
- NIST will provide participants a validator script that checks the format of output json files as well as content specific to the task guidelines (e.g., required attributes). The submission platform will run this validator and fail any submissions that do not pass it.
- Provided with the validator scripts are the scoring scripts (to allow participants to locally score dry-run submissions) and optional utility scripts that may help with certain components.
For the best user experience with the scoreboard, please use Google Chrome.
For the columns finds CI1 error (%), finds CIT error (%), finds CI1 & CIT errors (%), and 100% coverage & finds all errors (%), only programs with correct tests were counted. For mean coverage (%), only programs with correct tests were measured for coverage.
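For a rough local sense of coverage before submitting (not the official measurement), generated tests can be run with the pytest-cov plugin against a local genai_code_file.py containing an implementation under test; the plugin and the test file name below are assumptions for this sketch.

import pytest

# Requires the pytest-cov plugin; the official scoring pipeline may measure
# coverage differently.
pytest.main(["--cov=genai_code_file", "--cov-report=term-missing", "generated_test_add.py"])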
Coming Soon.