Making Your Docs Rock: Lessons from 1,000 Hugging Face Models
You've found an exciting new Colab notebook covering the exact model training method you were looking for. After updating the dependencies, you've gotten past a memory issue and even adapted the script to your dataset. Finally, training has finished, and you can push to the Hub. All set, right?
Unfortunately, many developers stop short of documenting their model in its model card, which makes the artifacts hard to reproduce or use. Poor model documentation may limit the discovery of your work or even undermine trust in your brand.
Hugging Face has shared findings from UX research on the usability of model cards for different audiences, and even created a tool to help generate this documentation.
This post takes a larger-scale, data-driven view: we review 1,000 models from the Hugging Face Hub, using the Hub APIs to relate documentation quality to model popularity.
To measure popularity, we'll use the statistics directly available through the Hub APIs to retrieve 'download counts' and 'user likes.' Check out this notebook for HF Hub API examples for getting this information.
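As a rough illustration of pulling those two statistics, here is a minimal sketch using the `huggingface_hub` client. The helper names `to_row` and `fetch_popularity` are hypothetical, and selecting models by sorting on downloads is an assumption made for the example, not necessarily how the 1,000 models in this post were chosen.

```python
from types import SimpleNamespace


def to_row(model_info):
    """Flatten one Hub model record into the popularity fields we care about."""
    return {
        "model_id": model_info.id,
        # The Hub may leave these unset on some records, so fall back to 0.
        "downloads": getattr(model_info, "downloads", None) or 0,
        "likes": getattr(model_info, "likes", None) or 0,
    }


def fetch_popularity(limit=1000):
    """Fetch up to `limit` models from the Hub (requires network access)."""
    # Imported lazily so to_row() can be used without the package installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # Assumption: sorting by downloads; the sort order applied may depend on
    # the installed huggingface_hub version.
    return [to_row(m) for m in api.list_models(sort="downloads", limit=limit)]


# Offline example with a stand-in record instead of a live API response.
example = to_row(SimpleNamespace(id="org/model", downloads=120, likes=None))
print(example)
```

Calling `fetch_popularity(1000)` returns a list of plain dictionaries that can be loaded straight into a dataframe or written out as a dataset.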
To assess documentation quality, we'll apply LLM-as-a-Judge: we prompt frontier models to score each model card's content on a 1-5 Likert scale, using a rubric that rewards well-structured, comprehensive documentation and penalizes generically templated or incomplete cards.
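The judging loop can be sketched in a few lines. The rubric text below is a hypothetical stand-in, not the rubric used for this analysis, and `ask_llm` is an assumed callable that wraps whichever model client you prefer: it takes a prompt string and returns the model's reply as a string.

```python
import re

# Hypothetical rubric wording, for illustration only.
RUBRIC = """You are reviewing a model card. Score it from 1 to 5:
5 = well-structured and comprehensive (usage, limitations, training details)
3 = partially complete, some sections missing or thin
1 = generic template or nearly empty
Reply with a single integer from 1 to 5 and nothing else."""


def build_prompt(card_text):
    """Combine the rubric with the card text to be judged."""
    return f"{RUBRIC}\n\nModel card:\n{card_text}\n\nScore:"


def parse_score(reply):
    """Return the first standalone digit 1-5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_card(card_text, ask_llm):
    """Score one card; `ask_llm` is any callable mapping prompt -> reply."""
    return parse_score(ask_llm(build_prompt(card_text)))


# Example with a fake judge so the sketch runs without any API key.
print(judge_card("## Usage\n...", lambda prompt: "Score: 4"))
```

Returning `None` for unparseable replies makes it easy to spot cards that need a retry rather than silently recording a wrong score.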
Here is a new HF dataset, including the models and metrics relevant to this investigation at the time of writing.
From our table, here's a sample model with low-information documentation and low likes, despite hundreds of monthly downloads.
For comparison, this model includes comprehensive documentation and a couple more likes with similar download counts.
The following histogram shows the distribution of quality scores as determined by the LLM Judge for our collection of models.
After removing outliers, we can apply robust regression and Spearman rank correlation. Both point to modest positive correlations between model card quality scores and each popularity metric, 'download counts' and 'user likes.'
Perhaps not surprisingly, the relationship between high-scoring model card content and 'user likes' is the stronger of the two. One likely reason is that some models seem made for machine consumption; GGUF quantized models optimized for inference often end up with high download counts.
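For readers who want to try this kind of check on their own table, here is a minimal standard-library sketch. It computes Spearman rank correlation (Pearson correlation on average ranks) and uses the Theil-Sen estimator as one example of a robust regression; the post does not say which robust method was used, so Theil-Sen is an assumption here, and the sample numbers are made up.

```python
from itertools import combinations
from statistics import mean, median


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = tied_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def theil_sen(xs, ys):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip pairs with tied x, common with 1-5 scores
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


# Made-up sample: quality scores versus likes, with one very popular model.
scores = [1, 2, 2, 3, 4, 5]
likes = [0, 3, 1, 4, 10, 250]
print(spearman(scores, likes), theil_sen(scores, likes))
```

Because both methods depend on ranks or medians rather than raw magnitudes, a single viral model does not dominate the result the way it would with ordinary least squares.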
Using text embeddings to represent each model's documentation, we can use projection and clustering to identify common patterns. You'll recognize some models that belong to the big labs that originated the foundation models. Other popular segments include inference-optimized artifacts and models made by indie practitioners, who have specialized their models for a particular domain or for multilingual capabilities.
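To make the clustering step concrete, here is a small standard-library sketch of k-means over embedding vectors. It assumes the embeddings have already been computed by some text embedding model, leaves out the 2D projection used for plotting, and uses a deterministic farthest-point initialization chosen for reproducibility. None of these choices are confirmed as the setup behind the analysis above.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so distance reflects direction."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(farthest)
    return [list(c) for c in centers]


def kmeans(vectors, k, iterations=20):
    """Return one cluster label per input vector."""
    points = [normalize(v) for v in vectors]
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy 2D "embeddings": two cards pointing one way, two pointing another.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(kmeans(toy_embeddings, k=2))
```

With real embeddings of a few hundred dimensions, a numerical library would be much faster, but the two alternating steps stay the same.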
Using this analysis, I wanted to improve the documentation of some of the models I've shared, like this thinking VLM that uses test-time compute for enhanced quantitative spatial reasoning.
Popular models use tags and comprehensive metadata and offer inference-optimized GGUF quants. People want to see the main usage and limitations called out clearly, as well as code snippets or linked Spaces to quickly get the model running.
After generating some recommended changes based on our scoring rubric, I added more information about the model's limitations and info to help others reproduce training or better understand the dataset. I'll wait to see how these changes affect community engagement and traction for these latest models 🤞
Ultimately, our analysis supports what we knew must be true: high-quality documentation improves discoverability and trust. If it was worth pushing to the Hub, why kneecap its distribution by treating documentation as an afterthought?
The Hub is not just a place to back up those weights. By providing more of the context behind your model, you give it a chance to make a worldwide impact.
Want your weights heard 'round the hub? You gotta rock the docs!


