Isn't SWE-bench based on public GitHub issues? Wouldn't the increase in performance also be explained by continuing to train on newer scraped GitHub data, i.e. training on the test set?
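For what it's worth, this is checkable in principle: a rough contamination sanity check is to compare when each benchmark issue was created against a model's claimed training-data cutoff. Here's a minimal Python sketch, assuming the HuggingFace dataset "princeton-nlp/SWE-bench" with its "created_at" timestamp field; the cutoff date is a made-up placeholder, not any particular model's.

    from datetime import datetime, timezone

    from datasets import load_dataset

    # Hypothetical training-data cutoff for the model under scrutiny.
    MODEL_CUTOFF = datetime(2024, 4, 1, tzinfo=timezone.utc)

    ds = load_dataset("princeton-nlp/SWE-bench", split="test")

    def parse(ts: str) -> datetime:
        # created_at is an ISO-8601 timestamp, e.g. "2023-01-15T04:12:40Z"
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    before_cutoff = sum(parse(row["created_at"]) < MODEL_CUTOFF for row in ds)
    print(f"{before_cutoff}/{len(ds)} test issues predate the cutoff "
          "and could plausibly appear in scraped training data.")

Of course this only tells you the issues *could* have been scraped, not that they were, and the fixing PRs (the actual "answers") are public too, which makes the overlap even harder to rule out.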
The pressure on AI companies to release a new SOTA model is real, as the technology rapidly becomes commoditised. I think people have good reason to be skeptical of these benchmark results.