Software Testing by Test Case Generation Using Generative AI

Software Testing: Using Large Language Models to save effort for test case derivation from safety requirements

The verification and validation of software components are based on extensive testing. The required test cases are derived from the specified requirements; the test cases are then executed, and the results are compared with their acceptance criteria. Even for relatively small systems, deriving test cases is a resource-intensive and therefore expensive endeavor. Assuming a conservative estimate of 5–10 minutes per test case, writing the test cases for a system with around 500 requirements can take more than twenty person-days of effort. By leveraging Large Language Models (LLMs), we can increase the efficiency of test case generation.

The development of complex systems starts with their requirement specifications. For dependable and safety-critical systems, this also includes safety requirements derived from the safety analysis of the system. Deriving test cases for these requirements manually is a time-consuming process. LLMs can improve this process: a textual representation of the requirements is used as input and is autonomously transformed into test cases and scenarios, either in plain-text format or in a formal specification such as the ASAM Open Test Specification. The current best practice of test case reviews by test engineers can ensure the integrity and correctness of these test cases. With LLMs, the test engineer's work can thus be reduced to formulating test cases for edge cases and to reviewing and refining the automatically derived test cases and scenarios.

Using Large Language Models can significantly reduce the time and costs needed to generate test cases.

LLM-based test case generator

We developed an LLM-based test case generator and applied it to a “Lane Keep Assist” use case. Since LLMs may suffer from inherent uncertainties and quality deficits, our basic architecture includes quality and uncertainty evaluation. The table below shows an excerpt of the basic requirements for the chosen scenario, while Figure 1 illustrates the process of deriving test cases using our LLM-based approach.
| Requirement ID | Requirement ID (ReqIF) | Category | Requirement Description |
| --- | --- | --- | --- |
| 1.1 | R001 | Lane Detection | The system shall detect lane markings on the road using cameras and/or sensors. |
| 1.2 | R002 | Lane Detection | The system shall identify lane boundaries under various lighting and weather conditions. |
| 2.1 | R003 | Lane Departure Warning | The system shall provide a warning to the driver if the vehicle is unintentionally drifting out of the lane. |
| 2.2 | R004 | Lane Departure Warning | The warning shall be provided through visual, auditory, and/or haptic feedback. |
| 3.1 | R005 | Steering Assistance | The system shall gently steer the vehicle back into the lane if it detects an unintentional departure. |
Figure 1: Basic architecture of the test case generator for software testing
The requirements can be provided as input in ReqIF, JSON, or CSV format. The LLM is used to generate test cases based on the given requirements. Since the data within the requirements may be confidential, we used our internally deployed LLM tool, which does not expose any information to external services.
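The interface of the internal tool is not public; the following Python sketch merely illustrates how requirements in CSV or JSON format could be loaded into a uniform structure before being handed to the generator. The file name is hypothetical, the column names follow the example table above, and ReqIF input (being XML-based) would additionally require an XML parser.

```python
import csv
import json
from pathlib import Path

def load_requirements(path: str) -> list[dict]:
    """Load requirements from a CSV or JSON file into a list of records.

    The expected columns follow the example table above
    ("Requirement ID", "Requirement ID (ReqIF)", "Category",
    "Requirement Description"); adjust them to your own export format.
    """
    file = Path(path)
    if file.suffix.lower() == ".json":
        return json.loads(file.read_text(encoding="utf-8"))
    with file.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))

# Hypothetical file name for the "Lane Keep Assist" use case.
requirements = load_requirements("lane_keep_assist_requirements.csv")
for req in requirements:
    print(req["Requirement ID (ReqIF)"], "-", req["Requirement Description"])
```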

To maintain confidentiality of the requirements and the generated test cases, internally deployed LLM models are used.

Automated test case generation

Large Language Models generate their output based on prompts. For generating the test cases, one can start with a simple prompt, such as “Generate the test case for the following requirement.” However, this may not yield the desired result. Studies have shown that LLMs produce better results when the prompts are as precise as possible. Using the guidelines of the ISO 26262 standard, we settled on a prompt that specifies in detail the expected output characteristics and attributes of a test case specification.
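The exact production prompt is not reproduced here. The sketch below only illustrates the idea of a template that spells out the expected attributes of a test case specification; the attribute list is an illustrative assumption, not a quote from ISO 26262 or from our prompt.

```python
# Illustrative prompt template; the attribute list is an assumption and not
# the exact wording of the prompt used in the project.
PROMPT_TEMPLATE = """You are a test engineer for safety-critical automotive software.
Derive test cases for the requirement below.

For each test case, provide the following attributes:
- Test case ID
- Reference to the requirement ID
- Test objective
- Preconditions
- Test inputs / test data
- Test steps
- Expected result (acceptance criteria)

Requirement {req_id} ({category}): {description}
"""

def build_prompt(req: dict) -> str:
    """Fill the template with one requirement record (see load_requirements above)."""
    return PROMPT_TEMPLATE.format(
        req_id=req["Requirement ID (ReqIF)"],
        category=req["Category"],
        description=req["Requirement Description"],
    )
```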

Quality evaluation

Once we obtain the test cases from the LLM, it is essential to evaluate their quality automatically. Even though we plan to have the test cases reviewed by a test engineer, an automated assessment lets us judge the quality beforehand, thereby reducing the time the test engineer needs for the evaluation, or triggering a new generation of the respective test case if quality defects are detected.
For the quality evaluation, we settled on an evaluation based on content availability and correctness. From the standards (ISO 26262, ISO 29119, etc.), we extracted the attributes that must be present in a test case. We then checked each generated test case to determine whether the required attributes are present or missing. Based on that, we assessed content completeness using a simple and a compound metric, as outlined below in Figure 2 and Figure 3.

 

Figure 2: Simple Quality of Conformance metric
Figure 3: Compound Quality of Conformance metric
The correctness of the generated test cases can be evaluated against the required criteria, either manually or automatically. For manual evaluation, the criteria defined in standards such as ISO 26262 and ISO 29119 are used. The table below shows some of these criteria.
| Sl. No | Criterion | Satisfied (Yes/No) | Comment |
| --- | --- | --- | --- |
| 1 | Language is simple and straightforward | | |
| 2 | Steps are specific and detailed | | |
| 3 | Steps are clear and unambiguous | | |
| 4 | Consistent terminology and format used | | |
| 5 | Inputs are clearly defined | | |
The simple and compound Quality of Conformance (QoC) metrics, along with the correctness criteria, can be used to evaluate the quality of the generated test cases. The evaluation can even be automated in cases where human-written test cases (ground-truth test cases) are available: these can be used to assess correctness with techniques such as fuzzy string matching. This approach can later be replaced with more sophisticated techniques, or even with LLM-based evaluation.
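As a minimal sketch of such an automated check, the example below uses Python's standard library `difflib` for fuzzy string matching between a generated test case and a human-written reference; the example texts and any acceptance threshold are purely illustrative.

```python
from difflib import SequenceMatcher

def correctness_score(generated: str, reference: str) -> float:
    """Similarity in [0, 1] between a generated test case and a
    human-written reference, based on fuzzy string matching."""
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()

# Purely illustrative example texts.
generated = "Drive at 80 km/h, drift towards the lane marking, expect a visual and haptic warning."
reference = "Drive at 80 km/h and drift toward the lane marking; the system shall issue a visual and haptic warning."

score = correctness_score(generated, reference)
print(f"Correctness score: {score:.2f}")  # scores below a project-specific threshold trigger a review
```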

Uncertainty evaluation

Although LLMs have significantly advanced the domain of natural language processing, they still face challenges related to uncertainty. We evaluated the uncertainty of five LLM models, focusing on those that can be deployed in-house: Pixtral-12B, Llama 2, Llama 3.1 (8B and 70B), and Gemma 2 (27B). The uncertainty evaluation was conducted on existing datasets such as GSM8K (mathematical reasoning, evaluating the ability to solve arithmetic and algebraic problems), Business Ethics (a subset of the MMLU dataset that measures the model’s understanding of ethical scenarios in business contexts), and Professional Law (a subset of the MMLU dataset that focuses on legal principles and professional reasoning). Figure 4 displays the results.
Figure 4: Performance comparison of the evaluated LLMs

Of all the evaluated models, Llama 3.1 (70B) and Pixtral-12B performed best.
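The article does not prescribe a specific uncertainty metric. One common proxy, shown in the hedged sketch below, is to sample several answers to the same benchmark question and use the normalized entropy of the answers as an uncertainty indicator; the `ask_model` function is a hypothetical placeholder for a call to the internally deployed LLM.

```python
import math
from collections import Counter

def answer_entropy(answers: list) -> float:
    """Normalized entropy of repeated answers to the same question:
    0.0 means fully consistent answers, 1.0 means maximal disagreement."""
    counts = Counter(answer.strip().lower() for answer in answers)
    n = len(answers)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n) if n > 1 else 0.0

def ask_model(question: str) -> str:
    """Hypothetical placeholder for a call to the internally deployed LLM."""
    raise NotImplementedError("Connect this to the in-house LLM deployment.")

def uncertainty_for_question(question: str, samples: int = 5) -> float:
    """Sample several answers and use their disagreement as an uncertainty proxy."""
    return answer_entropy([ask_model(question) for _ in range(samples)])
```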

Conclusion for software testing

In this work, we introduced a method to automatically generate test cases from requirements using LLMs. We further presented metrics to assess the quality of the generated test cases and evaluated the uncertainty of the LLMs. As the next step, we plan to automate the translation of test cases into the ASAM Open Test Specification format and execute them.
Every company specifies requirements in different ways: In a case study, we would be happy to generate insights into the improvement potential of our approach for your specific safety requirements. Contact us today to learn how a collaboration between Fraunhofer IESE and your company can be set up. Drop us a message, and we will arrange an introductory meeting where we will be happy to discuss your project and priorities.

 

References

  1. ISO 26262: Road vehicles – Functional safety
  2. ISO 29119: Software and systems engineering – Software testing
  3. Agrawal, Pravesh, et al. "Pixtral 12B." arXiv preprint arXiv:2410.07073 (2024).
  4. Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
  5. Dubey, Abhimanyu, et al. "The Llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
  6. Team, Gemma, et al. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024).
  7. Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021).
  8. Hendrycks, Dan, et al. "Measuring massive multitask language understanding (MMLU)." arXiv preprint arXiv:2009.03300 (2020).