Notice ID: CAW-AIRD-25
To enable the U.S. economy to harness the full benefits of AI, the National Institute of Standards and Technology (NIST) focuses on fundamental research and improving AI measurement science, technology, standards and related tools.
The U.S. AI Safety Institute (AISI), housed within the National Institute of Standards and Technology, is developing testing, evaluations, and guidelines to help accelerate trustworthy AI innovation in the United States and around the world, with a focus on promoting measurement science for AI capabilities and helping to prevent flaws or misuses of AI technology that could undermine public safety or national security.
As part of this work, AISI is conducting testing, evaluation, validation, and verification (TEVV) on high-impact frontier models’ capabilities. In doing so, AISI seeks to ensure that its projects, evaluations, and tools reflect the best available science, and to coordinate closely with a diverse set of AI stakeholders who are developing and conducting evaluations to assess capabilities, functionality, and risks.
NIST is performing market research to identify potential sources for an anticipated contract to assist in developing evaluations and benchmarks of AI models’ relevant software engineering and AI research and development capabilities, functionality, and risks.
The Contractor must provide or develop resources for various aspects of assessing the capability of frontier AI models to assist in software engineering and AI research and development, including by assessing the quality or functionality of AI-generated outputs in these domains and any corresponding risks …
Contractors must provide or develop resources for one or more of the following:
- Developing benchmarks and scoring mechanisms for automated evaluation of AI models’ relevant capabilities (see the illustrative sketch after this list);
- Developing tasks for automated evaluation of AI models’ relevant capabilities with accompanying data on human baseline performance (e.g., how long the tasks take human experts to complete);
- Designing and implementing protocols or methods for evaluating AI models’ relevant capabilities …
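Purely as an illustration (not part of the notice), the sketch below shows one way an automated evaluation task of the kind described above might be structured, pairing a scoring mechanism with human-baseline metadata. `EvalTask`, `run_benchmark`, and `score_bugfix` are hypothetical names introduced here for clarity; nothing in the NIST/AISI notice specifies this design.

```python
# Minimal, hypothetical sketch of an automated evaluation task with a
# scoring mechanism and human-baseline metadata. Names and structure are
# illustrative only; they are not specified by the NIST/AISI notice.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    task_id: str
    instructions: str                    # prompt given to the model under test
    score: Callable[[str], float]        # maps a model output to a score in [0, 1]
    human_baseline_minutes: float        # e.g., how long human experts took on the task

def run_benchmark(tasks: list[EvalTask], generate: Callable[[str], str]) -> dict[str, float]:
    """Run each task through a model (the `generate` callable) and score its output."""
    return {t.task_id: t.score(generate(t.instructions)) for t in tasks}

# Toy bug-fixing task: the score checks whether the model's patched function works.
# (A real harness would sandbox untrusted model-generated code rather than exec it.)
def score_bugfix(output: str) -> float:
    namespace: dict = {}
    try:
        exec(output, namespace)
        return 1.0 if namespace["add"](2, 3) == 5 else 0.0
    except Exception:
        return 0.0

tasks = [
    EvalTask(
        task_id="bugfix-001",
        instructions="Fix this function so it adds its arguments:\ndef add(a, b):\n    return a - b",
        score=score_bugfix,
        human_baseline_minutes=2.0,
    )
]

# Stand-in "model" for demonstration; a real evaluation would call the system under test.
print(run_benchmark(tasks, generate=lambda prompt: "def add(a, b):\n    return a + b"))
```

Human baseline data of the kind the notice mentions (e.g., expert completion times) would sit alongside each task so that model performance can be compared against expert effort.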
Relevant frontier model capabilities to elicit, evaluate, and benchmark include:
- Capabilities that enable a model to assist with or automate software development activities such as designing and implementing projects based on specifications, identifying and correcting bugs, updating and refactoring code, or deploying code.
- Capabilities that enable a model to assist with or automate research activities associated with frontier AI model development, such as the ability to generate and test hypotheses relating to the design of AI models or to perform iterative experimentation …