A recent experiment by security researcher Kasra Rahjerdi highlights significant disparities in how well various AI models handle real-world security challenges. The test target was a vulnerable book review application whose exposed Firebase credentials allowed direct database access.
Over a dozen AI models were evaluated, each with a $10 budget and a two-hour time limit, for a total expenditure of $1,500. The standout performer was GPT-5.5, which solved the challenge in 7 of 10 trials at an average cost of $9.46 per solution. DeepSeek V4 Pro, by contrast, stood out for cost efficiency, solving the challenge in 3 of 10 runs at just $0.62 per success.
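To make the trade-off between success rate and cost concrete, the two sets of reported figures can be put side by side. The snippet below is only an illustrative sketch: the `ModelResult` class and its field names are hypothetical, and the only data used are the solve counts and per-success costs quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Hypothetical container for the figures reported in the article."""
    name: str
    solved: int              # runs in which the challenge was solved
    runs: int                # total runs attempted
    cost_per_success: float  # USD per solved run, as reported

    @property
    def success_rate(self) -> float:
        return self.solved / self.runs


# Figures taken directly from the reported results.
gpt = ModelResult("GPT-5.5", solved=7, runs=10, cost_per_success=9.46)
deepseek = ModelResult("DeepSeek V4 Pro", solved=3, runs=10, cost_per_success=0.62)

for r in (gpt, deepseek):
    print(f"{r.name}: {r.success_rate:.0%} solved, ${r.cost_per_success:.2f} per success")

# How many times more expensive one GPT-5.5 success is than one DeepSeek success.
cost_ratio = gpt.cost_per_success / deepseek.cost_per_success
print(f"Cost ratio per success: {cost_ratio:.1f}x")
```

On these numbers, GPT-5.5 solved the challenge more than twice as often, while each of its successes cost roughly 15 times as much as a DeepSeek V4 Pro success.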
Other models, including Claude Sonnet 4.6 and Claude Opus 4.8, each solved the challenge in 2 of 10 attempts. Gemini 3.1 Pro Preview fared worst: it largely refused to engage, with a median token count of only 9k, compared with over 100k for other models. Rahjerdi noted that Chinese models were more willing to interact with live databases than their Western counterparts.