June 25, 2026

Bad Vibes: Comparing the Secure Coding Capabilities of Popular Coding Agents

Ori David

Security Researcher

A security benchmark of popular AI coding agents—Cursor, Claude Code, Codex, Replit, and Devin—found 69 vulnerabilities across 15 apps. Every agent shipped vulnerable code: broken auth, SSRF, missing controls, and more. Here’s what broke—and why it matters.

Vibe coding has fundamentally changed how we create software. While coding agents deliver enormous benefits, their rapid adoption raises many important questions. As end-to-end AI-generated applications become common, are vibe-coded applications secure?

In this post, we explore the security challenges introduced by vibe coding. We set out to compare 5 popular coding agents and assess their ability to write secure code. We tested the following coding agents with their default models during December 2025:

Cursor
Claude Code
OpenAI Codex
Replit
Devin

To compare them accurately, we tasked each agent with building a series of identical applications using the same prompts and tech stack. Our goal was to replicate a typical iterative development process - simulating a user building an application from the ground up, one of the most common use cases for AI coding agents.

Once we had our applications, we turned to the question of security. Using Tenzai’s agent, we analyzed each of the apps to identify vulnerabilities. This resulted in a small and very interesting dataset containing a total of 69 vulnerabilities.

After analyzing these results, we uncovered common behaviors, recurring failure patterns, and finally, an answer to the question: which agent wrote the most secure code?

The Good

Let’s start with the good news. Based on our experimentation, coding agents appear to be quite effective at avoiding certain classes of bugs. A notable example is the "notorious" category of injection attacks. Across all the applications we developed, we didn't encounter a single exploitable SQLi or XSS vulnerability - two bug classes that have been staples of the OWASP Top 10 for years.

Our observation is that coding agents perform well when the vulnerability class has well-defined, built-in protections. For SQL injection, agents consistently used parameterized queries, resulting in secure database interactions, as can be seen in the following example:

$username = trim($data['username']);
// Get user by username
$stmt = $db->prepare("SELECT id, username, password_hash, role, created_at FROM users WHERE username = ?");
$stmt->execute([$username]);
$user = $stmt->fetch();

With XSS, the agents' code often didn’t sanitize input, but it used frontend frameworks properly, which prevented vulnerabilities from becoming exploitable.

In one example, Tenzai's agent identified a potential XSS vulnerability where an API returned a raw stored XSS payload, but determined that the issue was not currently exploitable because the frontend properly escaped it.

While they might occasionally slip up, agents are more likely to avoid vulnerability classes that come with clear-cut do/don't rules.
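As a minimal server-side sketch of the same "escape at output" rule (the applications in this research relied on their frontend frameworks for escaping, so this PHP helper is only an illustrative analogue, and `escapeHtml` is a hypothetical name):

```php
<?php
// Hypothetical helper: escape untrusted text right before it is placed in HTML.
function escapeHtml(string $value): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid
    // UTF-8 sequences instead of returning an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// A stored payload is rendered as inert text rather than executed.
$stored = '<script>alert(1)</script>';
echo '<p>' . escapeHtml($stored) . '</p>', PHP_EOL;
```

The stored value stays untouched in the database; only its rendering is made safe, which is the same division of labor the agents' frontends relied on.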

The Bad

While coding agents did relatively well with vulnerabilities that have clear and generic solutions, they struggled with issues that lack one. Let’s examine some common pitfalls.

Authorization

Coding agents did very poorly in terms of properly enforcing authorization. They managed basic requirements reasonably well, but struggled significantly as authorization logic became more complex, despite clear and detailed guidance in our prompts.

One of the most common issues we encountered was improper authorization when accessing APIs.

In one case, when we had the agents create a shopping site, Codex introduced a critical authorization flaw: an order API checks if shoppers are viewing their own orders, but completely skips this validation for users with any other role. As a result, users with a different role like "seller" can access any order in the system.

$order = graphqlFetchOrder($pdo, (int)$args['id']);
if (!$order) {
    return null;
}
// Ownership is only enforced for the "shopper" role; every other role falls through
if ($user['role'] === 'shopper' && (int)$order['user_id'] !== (int)$user['id']) {
    throw new UserError('Forbidden');
}
return $order;
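One way to avoid this class of flaw is to deny by default and grant access per role explicitly. The following is a hedged sketch, not the fix applied in the research: `canViewOrder` is a hypothetical helper, and the policy (admins see everything, shoppers see only their own orders, all other roles are denied until a rule is written for them) is an assumption about what the application intended.

```php
<?php
// Hypothetical policy check. Assumed rules: "admin" may view any order,
// "shopper" may view only their own, and every other role is denied
// until an explicit rule is added for it.
function canViewOrder(array $user, array $order): bool
{
    $role = $user['role'] ?? '';

    if ($role === 'admin') {
        return true;
    }
    if ($role === 'shopper') {
        return (int)$order['user_id'] === (int)$user['id'];
    }
    // Default deny: unknown or unhandled roles (for example "seller") get no access.
    return false;
}

$order = ['id' => 7, 'user_id' => 42];
var_dump(canViewOrder(['id' => 42, 'role' => 'shopper'], $order)); // own order
var_dump(canViewOrder(['id' => 9, 'role' => 'seller'], $order));   // other role
```

The design choice is that forgetting a role results in a denied request rather than an exposed order.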

In another case, Claude Code mistakenly allowed unauthenticated access to a product deletion API. If the requesting user was authenticated, the code performed an ownership check. But if the request was unauthenticated, the check was skipped and the product was deleted.

// If authenticated, enforce ownership check
if ($user) {
    // Admin can delete any product, seller can only delete own
    if ($user['role'] !== 'admin' && $product['seller_id'] != $user['id']) {
        sendJsonResponse(['error' => 'Failed to delete', 'code' => 'FORBIDDEN'], 403);
    }
}
// Delete the product
$stmt = $db->prepare("DELETE FROM products WHERE id = ?");
$stmt->execute([$id]);

Tenzai’s agent identified this vulnerability by methodically testing the different APIs.

While the root causes vary, the pattern is consistent - coding agents frequently introduce authorization vulnerabilities.
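For the product deletion case, a fail-closed check would reject anonymous callers before any ownership logic runs. This is a minimal sketch under stated assumptions: `authorizeProductDelete` is a hypothetical helper, and the admin/owner rule is taken from the comment in the snippet above.

```php
<?php
// Hypothetical helper: returns the HTTP status the delete endpoint should use.
// 401 when nobody is authenticated, 403 when the caller is neither an admin
// nor the product's seller, and 200 only when deletion may proceed.
function authorizeProductDelete(?array $user, array $product): int
{
    if ($user === null) {
        return 401; // fail closed: anonymous requests never reach the DELETE
    }

    $isAdmin = ($user['role'] ?? '') === 'admin';
    $isOwner = (int)$product['seller_id'] === (int)$user['id'];

    if (!$isAdmin && !$isOwner) {
        return 403;
    }
    return 200;
}
```

The caller would run the `DELETE` statement only when the returned status is 200, so a missing session can no longer skip the check.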

Business logic vulnerabilities

Agents seem to be very prone to business logic vulnerabilities. While human developers bring intuitive understanding that helps them grasp how workflows should operate, agents lack this “common sense” and depend mainly on explicit instructions. Without a sufficiently detailed specification, agents can easily overlook important nuances.

For example, when we didn’t specify that the quantity of items in a shop order must be positive, 4 out of 5 agents (Claude Code, Cursor, Devin, Replit) did not verify it, allowing attackers to create orders with a negative total.

Similarly, 3 out of 5 agents (Cursor, Devin, Replit) allowed products to be created with a negative price. Looking at Replit’s implementation, we can see that the API responsible for product creation takes the price directly from the user input without any validation:

$input = $args['input'];           
$stmt = $db->prepare('INSERT INTO products (name, description, price, image_url, seller_id) VALUES (?, ?, ?, ?, ?)');
$stmt->execute([$input['name'], $input['description'] ?? null, $input['price'], $input['imageUrl'] ?? null, $user['id']]);

Tenzai's agent identified this vulnerability through static code analysis and then dynamically validated it.

These are relatively simple examples, yet nearly all agents failed to implement them correctly. In more complex scenarios involving nuanced business logic, this pattern will likely worsen.
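The missing checks are small. As a hedged sketch, server-side validation for these two fields might look like the following; `validatePrice` and `validateQuantity` are hypothetical helpers, and the "strictly positive" rules are assumptions about the shop's intent rather than requirements stated in the prompts.

```php
<?php
// Hypothetical validators. Assumed rules: prices are finite numbers above
// zero, and quantities are whole numbers of at least 1.
function validatePrice($price): float
{
    if (!is_numeric($price)) {
        throw new InvalidArgumentException('price must be numeric');
    }
    $value = (float)$price;
    if (!is_finite($value) || $value <= 0) {
        throw new InvalidArgumentException('price must be a positive number');
    }
    return $value;
}

function validateQuantity($quantity): int
{
    // Booleans are rejected explicitly because true would otherwise pass as 1.
    $value = is_bool($quantity)
        ? false
        : filter_var($quantity, FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);
    if ($value === false) {
        throw new InvalidArgumentException('quantity must be a whole number >= 1');
    }
    return $value;
}
```

Calling these before the `INSERT` (for example `validatePrice($input['price'])`) turns a negative price or quantity into a rejected request instead of a stored record.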

“Unsolved” vulnerability classes

As mentioned above, coding agents handled "solved" vulnerabilities pretty well - issues like SQLi or XSS where frameworks provide robust built-in protections. With injection attacks, the boundary between safe and vulnerable is clear: data should never be evaluated as code. This clear boundary enables generic solutions that prevent vulnerabilities in most scenarios.

The picture changes dramatically with "unsolved" vulnerability classes, where that clear boundary dissolves. Take SSRF: there's no universal rule for distinguishing legitimate URL fetches from malicious ones. The line between safe and dangerous depends heavily on context, making generic solutions impossible.

To test how agents handle this type of vulnerability, we included an "SSRF pitfall" in one of the applications: a link preview feature that fetches user-provided URLs. We gave the agents no security guidance whatsoever. The result was unanimous - all five agents introduced an SSRF vulnerability, allowing attackers to send requests to arbitrary URLs.

Tenzai’s agent identified the missing filter and created a proof-of-concept (PoC) Python script to confirm exploitation by mapping internal services.

Ask an agent explicitly to implement an allowlist, and it will likely succeed and prevent the SSRF. But leave the security approach to the agent's discretion when no “known solution” exists - and it will almost certainly fail.
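For a link preview feature specifically, one common mitigation is to accept only `http`/`https` URLs whose host resolves to public addresses. The sketch below is an assumption about what such a filter could look like, not what any agent produced: `isPublicIp`, `resolveHost`, and `isSafePreviewUrl` are hypothetical names, and the resolver is injectable only so the check can be exercised without network access.

```php
<?php
// Hypothetical helper: true only for addresses outside private and reserved ranges.
function isPublicIp(string $ip): bool
{
    return filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    ) !== false;
}

// Default resolver (IPv4 only); gethostbynamel() returns false on failure.
function resolveHost(string $host): array
{
    $ips = gethostbynamel($host);
    return $ips === false ? [] : $ips;
}

// Hypothetical check to run before the server fetches a user-supplied URL.
function isSafePreviewUrl(string $url, ?callable $resolve = null): bool
{
    $resolve = $resolve ?? 'resolveHost';

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // parse_url() keeps the brackets around IPv6 literals, so strip them.
    $host = trim($parts['host'], '[]');
    $ips = filter_var($host, FILTER_VALIDATE_IP) !== false ? [$host] : $resolve($host);
    if ($ips === []) {
        return false;
    }
    foreach ($ips as $ip) {
        if (!isPublicIp($ip)) {
            return false; // any private or reserved address rejects the URL
        }
    }
    return true;
}
```

This check alone is not a complete defense: the fetch should connect to the address that was validated (to resist DNS rebinding), and redirects need to be disabled or re-validated on every hop. Those gaps are part of why SSRF has no one-line fix.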

The Ugly

The most concerning finding from this research wasn't the vulnerabilities in code the agents wrote, but ones that were introduced by code the agents didn't write.

All the coding agents, across every test we performed, failed miserably when it came to security controls. It wasn’t that they implemented them incorrectly; in almost all cases, they didn’t even try:

CSRF Protection: None of the 15 applications developed included proper CSRF protection. In only 2 out of 15 runs did agents even attempt to add CSRF mitigation - both attempts failed.

Security Headers: Not a single agent across all our tests used CSP, X-Frame-Options, HSTS, X-Content-Type-Options, or proper CORS configuration - headers that are standard practice in production applications. While these headers are often added by infrastructure components like load balancers, the generated code itself demonstrates no awareness of these security concerns.

Login Rate Limiting: Except for one case, every application included a login page with zero rate limiting or account lockout mechanisms, enabling password brute-force attacks.
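To make the first two controls concrete, here is a hedged baseline sketch. The function names are hypothetical, the header values are illustrative defaults that a real application would need to tune, and the session is passed as an array only to keep the example self-contained (in a request handler it would be `$_SESSION`).

```php
<?php
// Hypothetical baseline header set; values are illustrative, not prescriptive.
function securityHeaders(): array
{
    return [
        'Content-Security-Policy'   => "default-src 'self'; frame-ancestors 'none'",
        'X-Frame-Options'           => 'DENY',
        'X-Content-Type-Options'    => 'nosniff',
        'Strict-Transport-Security' => 'max-age=31536000; includeSubDomains',
    ];
}

// Hypothetical CSRF helpers using the synchronizer-token pattern:
// one random token per session, embedded in forms and checked on writes.
function csrfToken(array &$session): string
{
    if (empty($session['csrf_token'])) {
        $session['csrf_token'] = bin2hex(random_bytes(32));
    }
    return $session['csrf_token'];
}

function isValidCsrfToken(array $session, $submitted): bool
{
    return is_string($submitted)
        && isset($session['csrf_token'])
        && is_string($session['csrf_token'])
        // hash_equals() compares in constant time to avoid timing leaks.
        && hash_equals($session['csrf_token'], $submitted);
}

// In a request handler, the headers would be emitted before any output:
// foreach (securityHeaders() as $name => $value) { header("$name: $value"); }
```

State-changing endpoints would reject any request where `isValidCsrfToken` returns false.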

In one example, Tenzai’s agent identified that an application lacked CSP and X-Frame-Options headers, making it vulnerable to clickjacking attacks.

In another example, in the single case where Claude Code actually implemented rate limiting, Tenzai’s agent quickly found that it was flawed and could be bypassed using the X-Forwarded-For header.

The pattern is clear: coding agents built what we explicitly asked for, often in reasonably secure ways, but completely failed to grasp "the bigger picture." They lack the security mindset to proactively introduce defensive mechanisms that weren't explicitly requested.
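As an illustration of login rate limiting that does not depend on a client-controlled header, consider the sketch below. It is not a reconstruction of the flawed implementation mentioned above, whose details are not shown here. `LoginRateLimiter` and `loginRateLimitKey` are hypothetical names, the limits are arbitrary, and the in-memory store stands in for a shared store such as a database or cache.

```php
<?php
// Hypothetical sliding-window limiter that tracks failed logins per key.
final class LoginRateLimiter
{
    /** @var array<string, int[]> failure timestamps per key */
    private $failures = [];
    private $maxAttempts;
    private $windowSeconds;

    public function __construct(int $maxAttempts = 5, int $windowSeconds = 900)
    {
        $this->maxAttempts = $maxAttempts;
        $this->windowSeconds = $windowSeconds;
    }

    public function tooManyAttempts(string $key, int $now): bool
    {
        $this->prune($key, $now);
        return count($this->failures[$key]) >= $this->maxAttempts;
    }

    public function recordFailure(string $key, int $now): void
    {
        $this->prune($key, $now);
        $this->failures[$key][] = $now;
    }

    // Drop failures that fell out of the window.
    private function prune(string $key, int $now): void
    {
        $cutoff = $now - $this->windowSeconds;
        $this->failures[$key] = array_values(array_filter(
            $this->failures[$key] ?? [],
            static function (int $t) use ($cutoff): bool {
                return $t > $cutoff;
            }
        ));
    }
}

// Key on the account plus the socket peer address. REMOTE_ADDR comes from the
// TCP connection; X-Forwarded-For is sent by the client and is ignored here.
function loginRateLimitKey(string $username, array $server): string
{
    return strtolower(trim($username)) . '|' . ($server['REMOTE_ADDR'] ?? 'unknown');
}
```

Behind a trusted reverse proxy, `REMOTE_ADDR` is the proxy's address, so the per-account part of the key carries most of the weight; trusting `X-Forwarded-For` is only reasonable when the request demonstrably came from that proxy.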

And The Winner… is?

After gathering the results, we compared the number of exploitable vulnerabilities introduced by each agent.

All agents introduced a significant number of vulnerabilities across the different applications. Codex, Cursor and Replit tied for the fewest, with 13 vulnerabilities each, while Claude Code came in last with 16 vulnerabilities. In addition to introducing the most vulnerabilities overall, Claude Code also had the highest number of critical-severity findings. Cursor and Replit emerged as the winners of our comparison, producing the fewest vulnerabilities and, notably, none rated as critical.

Our results are consistent with findings from the broader security research community: as of today, it doesn’t really matter which agent you use, because vulnerabilities are almost certainly going to be introduced. This raises the question - what can developers do to improve the security of their AI-generated code?

“Vibing” Secure Code

The first option that might come to mind is to target the prompt itself - can we refine our instructions to make agents more security-aware? A recent study compared several methods: generic security instructions, having the LLM identify security risks before implementation, and even explicit directions to avoid specific vulnerability types. Surprisingly, none of these techniques proved effective at meaningfully reducing vulnerabilities.

Based on our testing and recent research, no comprehensive solution to this issue currently exists. This makes it critical for developers to understand the common pitfalls of coding agents and prepare accordingly.

As models change so rapidly, our precise results may be outdated by the time you finish reading this. Despite that, the key lessons from our experience remain:

Coding agents cannot be trusted to design secure applications. While they may produce secure code (some of the time), agents consistently fail to implement critical security controls without explicit guidance. Don't expect your coding agent to implement CSRF protection unless you explicitly ask for it. Don’t be surprised if they leave out critical security headers.

When clear guardrails exist, agents deliver. If there's a well-established definition of secure versus insecure baked into the framework - agents tend to get it right. Vulnerabilities with clear solutions like SQL injection and XSS are less likely to appear in your vibe-coded app.

But in ambiguous contexts, they falter. Where boundaries aren't clear-cut - business logic workflows, authorization rules, and other nuanced security decisions - agents will make mistakes. Unlike syntax errors, these judgment calls lack standard tests that agents can use to verify themselves.

The most effective approach: testing. Like human developers, agents will always make mistakes. Even as models improve at coding, vulnerabilities will persist. As AI accelerates development velocity, the volume of introduced vulnerabilities will grow proportionally, quickly overwhelming traditional testing approaches.

While AI agents may introduce vulnerabilities - they also excel at identifying them. To keep pace with AI-accelerated code development, organizations need a paradigm shift: deploy AI agents not only to generate code, but to secure it. The same technology creating security risks can be your most powerful defense against them.

Appendix

You can find the prompts we used to construct each of the apps here.