Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
LLMs have become the go-to choice for code generation tasks, with an exponential increase in the training, development, and usage of LLMs specifically for code generation. To evaluate the ability of LLMs on code, both academic and industry practitioners rely on popular handcrafted benchmarks. However, prior benchmarks contain only a very limited set of problems, both in quantity and variety. Further, due to popularity and age, many benchmarks are prone to data leakage, where example solutions can be readily found on the web and thus potentially in training data. Such limitations inevitably lead us to inquire: Is the leaderboard performance on existing benchmarks reliable and comprehensive enough to measure the program synthesis ability of LLMs? To address this, we introduce EvoEval – a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains for a comprehensive evaluation of LLM coding abilities. Our study on 51 LLMs shows that, compared to the high performance obtained on standard benchmarks like HumanEval, there is a significant drop in performance (on average 39.4%) when using EvoEval.
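To make the core idea concrete, below is a minimal, hypothetical sketch of what "evolving" an existing benchmark problem into a targeted domain could look like. The prompt wording, the transformation instructions, and the `query_llm` callable are illustrative assumptions, not the authors' actual EvoEval pipeline; the seed problem is the well-known `has_close_elements` task from HumanEval.

```python
# A hedged sketch of LLM-driven benchmark evolution, assuming the caller supplies
# some query_llm(prompt) -> str function. Not the official EvoEval implementation.
from typing import Callable

SEED_PROBLEM = '''\
def has_close_elements(numbers: list, threshold: float) -> bool:
    """Check if in given list of numbers, any two numbers are closer
    to each other than the given threshold."""
'''

# Illustrative target domains; the real suite defines its own set of transformations.
TRANSFORMS = {
    "difficult": "Add extra constraints and reasoning steps so the task is harder.",
    "creative": "Recast the task as an unusual story or game while keeping it solvable.",
    "tool_use": "Rewrite the task so the solution must compose several helper functions.",
}

def evolve_problem(seed: str, domain: str, query_llm: Callable[[str], str]) -> str:
    """Ask an LLM to rewrite `seed` into the target `domain`, returning a new problem."""
    instruction = TRANSFORMS[domain]
    prompt = (
        "You are generating a new coding benchmark problem.\n"
        f"Original problem:\n{seed}\n"
        f"Transformation: {instruction}\n"
        "Return only the new problem as a Python function signature with a docstring."
    )
    return query_llm(prompt)

if __name__ == "__main__":
    # Stand-in LLM so the sketch runs end to end without network access.
    fake_llm = lambda prompt: "# (LLM-evolved problem would appear here)"
    print(evolve_problem(SEED_PROBLEM, "difficult", fake_llm))
```

Keeping the LLM behind a plain callable is only a convenience for the sketch; in practice the evolved problems would also need reference solutions and tests before they can score models.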
Further reading
- Access the paper on arXiv.org