1. Background
As agentic LLMs become increasingly important in the software engineering industry, software engineers will want clearer and more reliable benchmarks for deciding which coding LLM to use and when it is time to upgrade.
One particularly important case is an LLM's first shot, i.e. the output it produces from just a single prompt, before any follow-up input.
This is a benchmark that tests how well contemporary coding LLMs can implement a complex task in such a first shot, given modern standards and an intelligently configured environment.
In the following, the way this benchmark works is explained using the concrete example of "Matrix-based internet". In other studies, all architectural decisions made below can of course be adjusted.
2. Procedure
First, describe how the benchmark works.
An example of this type of benchmark:
Have a good idea for a software project you want to release into the public domain. For this example: creating an in-browser web browser which can display files served via Matrix.
Run a setup command for your environment. In this example, run the command "pnpm dlx shadcn@latest init" and create a new project. Given a correct setup of the executable (here "pnpm"), this command should be fully cross-compatible across all relevant operating systems.
Optionally, state any further manual setup steps you performed. For example, in this case, I added the ".cursorrules" file here: https://github.com/hermesloom/matrix-internet/blob/main/.cursorrules
Open the project in the LLM editor being benchmarked and enter the prompt. For example, the prompt could be this:
Replace the entire page by a Shadcn-based UI that looks like a web browser (within the user's actual web browser) and which has the same very basic functionality of a regular web browser, including using a Matrix client library to fetch data from Matrix users that serve files in a room with a predefined identifier/name, and being able to enter these usernames like URLs. The returned data is then in MDX, which should also be rendered.
In the "UI" of that browser, also include a button which explains you how to host your own website, with clear instructions how to do that for the "Element" Matrix client.
Run the code in the intended environment (in this case the web browser, accessible via a human-readable URL) for the intended users (in this case any human in the world), then test it thoroughly and record your findings.
To evaluate this across multiple runs from different people, store for each run (a machine-readable sketch of this record follows the list):
- the operating system and runtime versions used (e.g. "macOS 15.5, Node.js 20.17.0, pnpm 10.9.0")
- the versions used in the coding LLM setup (e.g. "Cursor 1.0.0, claude-4-sonnet")
- links to any files used for additional configuration of the coding LLM (e.g. https://github.com/hermesloom/matrix-internet/blob/main/.cursorrules)
- whether any other commands were necessary to start the local development environment after the code was generated (may be empty, as in this case, where the coding agent ran "npm run dev" automatically, which was satisfactory)
- whether this local development server started flawlessly, i.e. without errors or warnings (in this case, yes)
- the identifier reviewers need to try out the generated application; in this case, the URL where the generated web application's production server is hosted via a PaaS such as Vercel or whatever else you prefer
- the parameter to pass to "git clone" to obtain the source code of the repository (e.g. "git@github.com:hermesloom/matrix-internet.git")
- an overall, fully subjective rating of the output from 0 to 10
- a detailed justification of that rating with all observed findings (anywhere from "development server crashed immediately" to "instructions slightly unclear in sub-menu" to "footer really pretty" to anything else you find noteworthy)
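To keep these fields machine-readable and consistent across runs, they could be stored in a shape like the following TypeScript interface (a sketch; the field names are mine, not a fixed schema):

```typescript
// One possible record shape for a single benchmark run.
interface BenchmarkRun {
  environment: string;          // e.g. "macOS 15.5, Node.js 20.17.0, pnpm 10.9.0"
  llmSetup: string;             // e.g. "Cursor 1.0.0, claude-4-sonnet"
  configFiles: string[];        // links to extra LLM configuration, e.g. .cursorrules
  extraStartCommands: string[]; // empty if the agent started the dev server itself
  devServerFlawless: boolean;   // started without errors or warnings?
  reviewUrl: string;            // where reviewers can try the deployed application
  gitCloneTarget: string;       // e.g. "git@github.com:hermesloom/matrix-internet.git"
  rating: number;               // 0 to 10, fully subjective
  justification: string;        // all observed findings behind the rating
}
```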
If you publish your personal best shot at this, make sure to include the prompt in the repository's ".well-known/prompt" file, to indicate that this code was neither written nor edited by a human. For an example, see https://github.com/hermesloom/matrix-internet/blob/main/app/.well-known/prompt
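A reviewer could check this convention after cloning with a short script like the following (a sketch; it assumes the file lives at app/.well-known/prompt as in the example repository, and that the study prompt is passed as the first command-line argument):

```typescript
// Verify that the repository ships the study prompt verbatim.
import { readFileSync } from "node:fs";

const shipped = readFileSync("app/.well-known/prompt", "utf8").trim();
const expected = process.argv[2]?.trim(); // the study prompt, if provided

if (expected && shipped !== expected) {
  console.error("Prompt in .well-known/prompt does not match the study prompt.");
  process.exit(1);
}
console.log(`Prompt file present${expected ? " and matching" : ""}.`);
```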
To make the outputs of the runs comparable, the prompt should be kept static within a single study.
3. Goals
In this section, describe the functional requirements to be tested. In this case, the prompt explicitly tests how well the created systems perform in:
- interoperability (i.e. do all these "browsers" explain the creation of the rooms in Element in the same way, so that the created clients can "talk" to each other? see the sketch after this list)
- correct use of libraries, in this case in the Node.js/Next.js/shadcn ecosystem, especially the more niche ones needed here
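To make the interoperability goal concrete: two independently generated "browsers" can only talk to each other if they agree on how a username maps to the room serving its pages. Reusing the hypothetical alias convention from the fetch sketch above, a minimal cross-client check could look like this:

```typescript
// Sketch: assert that two generated clients resolve the same username to
// the same room alias. The convention itself is hypothetical.
import assert from "node:assert";

// Imagine these functions imported from two different generated codebases.
const aliasFromRunA = (u: string) => `#mdx-web-${u.slice(1).split(":")[0]}:${u.split(":")[1]}`;
const aliasFromRunB = (u: string) => `#mdx-web-${u.slice(1).split(":")[0]}:${u.split(":")[1]}`;

assert.strictEqual(
  aliasFromRunA("@alice:matrix.org"),
  aliasFromRunB("@alice:matrix.org"),
); // both must resolve the username to the same room alias
```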
4. Demo
If you think that the idea of the prompt above is cool, try it out in your favorite AI code editor and submit your result at https://docs.google.com/forms/d/e/1FAIpQLSfGtUmITNMTm86tBHylOs6DfuUhQXXJ5YWoFxu4XYFGN1MG_Q/viewform.
Then you can always observe the statistics in real time at https://docs.google.com/spreadsheets/d/1XSv7EBNsac9xH7LZ81jsPPNciODTzp7F-7DUd6404U8/edit.
Have fun!
5. Contact
@henophilia:matrix.org