Recently, I've been using AI to recommend arXiv papers relevant to my work finetuning VLMs, which has enabled me to discover several useful benchmarks only hours after publication.
Since my last Substack post, where we discussed making sense of 3 different evals, I've found yet another benchmark for assessing visual-spatial reasoning: SIRI-Bench.
I was excited to find the researchers' remark that SpaceQwen:
"is particularly noteworthy. Despite its small size, it shows competitive performance, with 25% of its predictions having errors below 80%."

That's why I'm experimenting with extensions to my AI research discovery engine that go a step beyond ranking relevant results for review by also extracting links to project pages and Hugging Face or GitHub repositories.
It's the kind of info you'd find associated with new whitepapers featured in Papers With Code or Hugging Face's Daily Papers. However, I want to go further still and automatically smoke-test the code and model artifacts.
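As a rough illustration of the link-extraction step, here's a minimal sketch that scans a paper's abstract page for GitHub and Hugging Face URLs. The arXiv ID and the regex are placeholders of my own, not part of any existing pipeline; real papers may put these links in the abstract, the comments field, or only in the PDF.

```python
import re
import requests

# Placeholder arXiv ID; swap in the paper you care about.
ARXIV_ABS_URL = "https://arxiv.org/abs/2506.00000"

html = requests.get(ARXIV_ABS_URL, timeout=30).text

# Match GitHub repos and Hugging Face model/dataset/project pages.
link_pattern = re.compile(
    r"https?://(?:github\.com|huggingface\.co)/[\w.\-]+/[\w.\-]+",
    re.IGNORECASE,
)

repo_links = sorted(set(link_pattern.findall(html)))
for link in repo_links:
    print(link)
```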
My goal was to generate a Dockerfile that installs all the required project dependencies for my top recommended paper, and to build it successfully.
After searching for similar efforts, I found the AI Docker Generator, which prompts an LLM to generate a Dockerfile based on the context of the GitHub repo packaged with Repomix.
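Here's a rough sketch of that idea, assuming the repo has already been packed into a single text file (e.g. repomix-output.xml) and using the OpenAI Python client. The prompt wording and model name are my own choices, not the AI Docker Generator's actual implementation.

```python
from pathlib import Path

from openai import OpenAI  # pip install openai

# Assumes the repo was already packed, e.g. into repomix-output.xml.
packed_repo = Path("repomix-output.xml").read_text()

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt; the real AI Docker Generator's prompt will differ.
prompt = (
    "You are given the full contents of a GitHub repository below. "
    "Write a single Dockerfile that installs every dependency needed to run the project. "
    "Return only the Dockerfile, no commentary.\n\n"
    f"{packed_repo}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

Path("Dockerfile").write_text(response.choices[0].message.content)
```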
Fortunately, tools like Repomix or Gitingest let you set filters to exclude certain parts of a repo. These filters are crucial because many repositories include large data files or clone third-party dependencies, either of which can easily exceed the context window limits of any LLM provider.
To help the LLM focus on the most relevant context and limit hallucination, you can set simple rules to ignore specific file extensions or directories like 'third_party'.
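A packing step with exclusions might look like the sketch below, calling the Repomix CLI from Python. The '--ignore' and '--output' flags with comma-separated patterns are my reading of the Repomix docs, so double-check the exact option names for the version you install; the pattern list itself is just an example.

```python
import subprocess

# Patterns to keep out of the packed context: bulky data files and vendored code.
ignore_patterns = ",".join([
    "**/*.csv",
    "**/*.parquet",
    "**/*.pt",        # model checkpoints
    "third_party/**",
    "data/**",
])

# Assumption: the Repomix CLI accepts a comma-separated --ignore list and an --output path.
subprocess.run(
    ["repomix", "--ignore", ignore_patterns, "--output", "repomix-output.xml"],
    check=True,
)
```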
However, Dockerfiles generated directly by the LLM can still include bugs and require testing. The team at Docker sees agents as a technology worth experimenting with for robustly building Docker images automatically.
A framework like Autogen makes it easy to pass a Dockerfile to an executor and test the build. When we encounter an error, the LLM can be re-prompted with the traceback to iterate on the Dockerfile in a loop.
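A minimal version of that loop, without any agent framework, could look like the sketch below: it shells out to 'docker build', and on failure feeds the build log back to the LLM for a revised Dockerfile. The prompt, retry cap, and model name are placeholders of my own, not Autogen's executor API.

```python
import subprocess
from pathlib import Path

from openai import OpenAI

client = OpenAI()
MAX_ATTEMPTS = 5  # arbitrary cap on repair iterations

for attempt in range(MAX_ATTEMPTS):
    # Try to build the image from the current Dockerfile in the repo root.
    result = subprocess.run(
        ["docker", "build", "-t", "paper-smoke-test", "."],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        print(f"Build succeeded on attempt {attempt + 1}")
        break

    # On failure, re-prompt the LLM with the Dockerfile and the build error.
    dockerfile = Path("Dockerfile").read_text()
    fix_prompt = (
        "This Dockerfile failed to build. Fix it and return only the corrected "
        f"Dockerfile.\n\nDockerfile:\n{dockerfile}\n\nBuild error:\n{result.stderr}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": fix_prompt}],
    )
    Path("Dockerfile").write_text(response.choices[0].message.content)
else:
    print("Gave up after hitting the attempt limit")
```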
A smooth build process is a strong signal of the quality and utility of new code releases, and we've seen how an agent can streamline your experimentation and filter for more actionable recommendations.
With a well-built Dockerfile, we're getting closer to running the code in the context of our application and determining where ideas like it fit into what we're building.
Can tools that let you test ideas sooner give you an edge in your AI development?
The competition for AI talent has reached new heights, but we're finding new ways to connect with and learn from fellow travelers in our quest for a deeper understanding.