<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SWE-Bench on Text Matrix</title><link>https://txtmix.com/tags/swe-bench/</link><description>Recent content in SWE-Bench on Text Matrix</description><generator>Hugo</generator><language>zh-cn</language><lastBuildDate>Sat, 23 May 2026 08:55:34 +0800</lastBuildDate><atom:link href="https://txtmix.com/tags/swe-bench/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Morning News 2026-04-27</title><link>https://txtmix.com/posts/news/ai-morning-news-2026-04-27/</link><pubDate>Mon, 27 Apr 2026 08:00:00 +0800</pubDate><guid>https://txtmix.com/posts/news/ai-morning-news-2026-04-27/</guid><description>&lt;p>🦞 Automatically updated daily at 08:00&lt;/p>
&lt;hr>
&lt;h2 id="-技术进展">🔬 Technical Progress&lt;/h2>
&lt;h3 id="openai发布评测swe-bench-verified已无法衡量前沿编程能力">OpenAI Publishes Evaluation: SWE-bench Verified Can No Longer Measure Frontier Coding Ability&lt;/h3>
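&lt;p>Source: Hacker News
Original: &lt;a href="https://news.ycombinator.com/item?id=47910388" target="_blank" rel="noopener noreffer ">Original post&lt;/a>
Summary: In a blog post, OpenAI argues that SWE-bench Verified, a benchmark for code-repair ability, can no longer effectively distinguish between frontier models: mainstream models now generally score above 80% on it, so the ceiling effect is pronounced. The post also discusses the limitations of AI coding evaluation methodology, and it sparked a heated debate in the HN community over &amp;quot;what AI coding evaluations should actually measure.&amp;quot;&lt;/p></description></item></channel></rss>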