Every new vulnerability disclosure piles more pressure on already overwhelmed security teams. But what if AI could help shoulder some of that burden? A recent study (https://arxiv.org/pdf/2512.06781) explores whether Large Language Models (LLMs) can automate vulnerability scoring, a critical but time-consuming task. While the results are promising in certain areas, they also reveal persistent challenges that prevent full automation. The short version: context still reigns supreme, even for the most advanced AI.
Security teams are drowning in a sea of vulnerabilities. In 2024 alone, over 40,000 CVEs were published, straining programs designed to assess their severity. Without timely and accurate scoring, prioritizing which threats to address first becomes a guessing game. This is where LLMs come in – could they provide a much-needed helping hand?
Researchers put six leading LLMs to the test: GPT-4o, GPT-5, Llama 3.3, Gemini 2.5 Flash, DeepSeek-R1, and Grok 3. Each model was tasked with scoring over 31,000 CVEs based solely on their brief descriptions, stripped of any identifying details like product names or CVE IDs. This forced the models to rely purely on the text, mimicking the real-world challenge of assessing vulnerabilities with limited information.
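As a rough illustration of that sanitization step, here is a minimal Python sketch. The function name and regex patterns are my own illustrative assumptions, not the paper's actual preprocessing; fully removing product names would additionally require some kind of name lookup.

```python
import re

def sanitize_description(text: str) -> str:
    """Strip identifying details from a CVE description so a model must
    rely on the vulnerability text alone. Patterns are illustrative only,
    not the study's exact pipeline."""
    # Remove CVE identifiers such as "CVE-2024-12345".
    text = re.sub(r"CVE-\d{4}-\d{4,}", "[ID]", text)
    # Remove version strings such as "2.5.1" that hint at specific products.
    text = re.sub(r"\b\d+(\.\d+)+\b", "[VERSION]", text)
    return text

print(sanitize_description(
    "CVE-2024-12345: Buffer overflow in FooServer 2.5.1 allows remote code execution."
))
# → [ID]: Buffer overflow in FooServer [VERSION] allows remote code execution.
```

Note that the product name ("FooServer") survives here; scrubbing names like that is exactly the part that needs external knowledge rather than a regex.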
But here's the catch: while the models showed impressive accuracy in some areas, their performance was far from perfect. Two metrics stood out as particularly well-handled:
Attack Vector: How an attacker gains access (network, local, physical, etc.). Gemini led the pack with around 89% accuracy, closely followed by GPT-5. This success likely stems from the fact that descriptions often explicitly mention the attack method.
User Interaction: Whether the exploit requires user action (clicking a link, opening a file). GPT-5 excelled here, also achieving around 89% accuracy. This metric benefits from descriptions that frequently detail user involvement.
Other metrics, like Confidentiality Impact (data exposure) and Integrity Impact (data tampering), showed moderate success, with GPT-5 scoring in the mid-to-high 70% range. When descriptions hinted at data compromise, the models often picked up on these cues.
However, the study also exposed significant weaknesses. Availability Impact, which assesses service disruption, proved the most challenging. GPT-5 managed only around 68% accuracy, with other models lagging further behind. Vague descriptions that merely mention a potential crash without specifying the severity of the outage left the models struggling.
Similarly, Privileges Required (the level of access needed by an attacker) and Attack Complexity (conditions for successful exploitation) were problematic. Descriptions often lacked the detail needed to differentiate between the "Low" and "None" privilege levels or to accurately assess the complexity of an attack.
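Per-metric accuracy figures like those above are straightforward to compute once each model's categorical predictions are lined up against the ground-truth CVSS labels. A minimal sketch (the metric abbreviations and data layout here are my own illustrative assumptions):

```python
from collections import defaultdict

def per_metric_accuracy(predictions, ground_truth):
    """predictions / ground_truth: lists of dicts mapping a CVSS metric
    name to its categorical value, one dict per CVE,
    e.g. {"AV": "Network", "UI": "Required"}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, truth in zip(predictions, ground_truth):
        for metric, true_value in truth.items():
            total[metric] += 1
            correct[metric] += int(pred.get(metric) == true_value)
    # Fraction of CVEs scored correctly, per metric.
    return {m: correct[m] / total[m] for m in total}

preds = [{"AV": "Network", "UI": "None"}, {"AV": "Local", "UI": "None"}]
truth = [{"AV": "Network", "UI": "Required"}, {"AV": "Local", "UI": "None"}]
print(per_metric_accuracy(preds, truth))  # → {'AV': 1.0, 'UI': 0.5}
```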
Interestingly, error analysis revealed a striking pattern: all six models made the same mistakes on a significant portion of CVEs, particularly for Availability Impact and Attack Complexity. This suggests that certain vulnerabilities consistently trip up even the most advanced LLMs, highlighting the limitations of relying solely on textual descriptions.
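Mechanically, that kind of shared-error analysis boils down to intersecting each model's error set for a given metric. A sketch under an assumed data layout (all names here are hypothetical, not from the paper):

```python
def shared_errors(model_predictions, truth):
    """CVE ids that every model misclassified, for a single CVSS metric.
    model_predictions: {model_name: {cve_id: predicted_label}}
    truth: {cve_id: true_label}"""
    error_sets = [
        {cve for cve, label in preds.items() if label != truth[cve]}
        for preds in model_predictions.values()
    ]
    # CVEs present in every model's error set.
    return set.intersection(*error_sets)

preds = {
    "model_a": {"c1": "High", "c2": "Low", "c3": "Low"},
    "model_b": {"c1": "High", "c2": "High", "c3": "Low"},
}
truth = {"c1": "Low", "c2": "High", "c3": "Low"}
print(shared_errors(preds, truth))  # → {'c1'}
```

A large intersection relative to each model's individual error set is what signals that the descriptions themselves, not any one model, are the bottleneck.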
To squeeze out every last drop of performance, researchers combined the predictions of all six models using meta-classifiers. This resulted in slight improvements across the board, with the biggest gain seen in Scope (whether an exploit spreads beyond the initial target). However, these improvements were modest, underscoring the fact that even combined AI models cannot overcome the lack of crucial context in many vulnerability descriptions.
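To make the ensembling idea concrete: the paper trains meta-classifiers on the models' outputs, but the simplest possible stand-in is a plurality vote over the six predicted labels. A sketch (this vote is my simplification, not the study's actual meta-classifier):

```python
from collections import Counter

def majority_vote(model_labels):
    """Combine one metric's predictions from several models by plurality
    vote; ties resolve to the first-seen label. A learned meta-classifier
    would instead weight models by their per-metric reliability."""
    return Counter(model_labels).most_common(1)[0][0]

votes = ["Network", "Network", "Local", "Network", "Adjacent", "Network"]
print(majority_vote(votes))  # → Network
```

The intuition matches the study's finding: when all six models share the same blind spot, no amount of combining their votes can recover the missing context.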
So, can LLMs truly revolutionize vulnerability scoring? While they show promise in automating certain aspects, the study makes it clear that human expertise and contextual understanding remain irreplaceable. The question remains: how can we best leverage AI to augment human capabilities without sacrificing accuracy and reliability? The debate is far from over, and we invite you to share your thoughts in the comments below. Do you see LLMs as a game-changer for cybersecurity, or are we placing too much trust in their abilities?