Your choice of base model architecture and size, fine-tuning regimen, data mixture, and many other engineering decisions influence the suitability of the foundation model powering your AI application. In the AI research and engineering community, we are continually learning how these factors interact to improve or degrade AI performance, measured in various ways.
That's why in last week's post, "Building a Knowledge Flywheel," we discussed ways to filter the firehose of new arXiv papers to identify the most promising hypotheses for improving our AI application. After all, the most popular paper in your social media feed may have little relevance to the challenges you're facing with what you're building.
By grounding our understanding of the experiment design space in a causal model built using real-world data, we can establish a more robust, scientific explanation of the factors that drive performance. That's what it means to go beyond surface-level correlations in the data to work with a causal understanding of the system.
Discovering these relationships starts by associating the treatments of our AI experiments with nodes in a structural causal model. Framed this way, we can test the significance of each decision to reach a better understanding of what truly matters.
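For instance, here is a minimal sketch of that framing, using networkx to hold the graph; the variable names are illustrative assumptions, not the actual factors from our experiments.

```python
# A minimal sketch: engineering decisions (treatments) and evaluation outcomes
# as nodes in a causal DAG. Node names are illustrative placeholders.
import networkx as nx

scm = nx.DiGraph()
scm.add_edges_from([
    ("base_model_size", "reasoning_capability"),
    ("fine_tuning_regimen", "reasoning_capability"),
    ("data_mixture", "reasoning_capability"),
    ("reasoning_capability", "spatial_reasoning_score"),
    ("benchmark_choice", "spatial_reasoning_score"),  # measurement-side influence
])

assert nx.is_directed_acyclic_graph(scm)
# Which decisions directly influence the measured score?
print(sorted(scm.predecessors("spatial_reasoning_score")))
```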
The core idea here is to estimate the average treatment effect from the data while controlling for biases that would invalidate the comparison. Using classic techniques such as A/B testing, we randomize treatment assignment to eliminate selection bias, allowing us to infer the effect robustly.
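As a hedged illustration, under randomized assignment the average treatment effect reduces to a simple difference in means; the evaluation scores below are synthetic placeholders.

```python
# Sketch: ATE from a randomized A/B test as a difference in mean eval scores,
# with a normal-approximation confidence interval. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
treated = rng.normal(0.72, 0.05, size=200)  # scores under the candidate change
control = rng.normal(0.68, 0.05, size=200)  # scores under the baseline

ate = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
print(f"ATE ≈ {ate:.3f} ± {1.96 * se:.3f} (95% CI)")
```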
However, A/B testing can be costly and sometimes infeasible. So we're motivated to do more with our offline evaluations, enabling us to deploy changes to our production AI systems with greater confidence.
Interpreting our VLM experiment data using the PC-SubQ algorithm, we found evidence supporting the view that reasoning capabilities drive quantitative spatial reasoning performance, prompting further investigation into ways to achieve additional improvements.
This week, I've been reviewing "Causal Discovery and Inference in Python" to learn about causal discovery algorithms, such as PC and NOTEARS, whose strengths and weaknesses complement those of AI-driven approaches. I've been incorporating tools based on gcastle and econML into my knowledge flywheel.
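For example, here is a minimal sketch of constraint-based discovery with gcastle's PC implementation on simulated data; in a real run, the columns would be our experiment factors and evaluation metrics.

```python
# Sketch: recover a causal graph from tabular data with gcastle's PC algorithm.
# The three simulated columns stand in for experiment factors and metrics.
import numpy as np
from castle.algorithms import PC

rng = np.random.default_rng(0)
n = 1000
data_mix = rng.normal(size=n)
reasoning = 0.8 * data_mix + rng.normal(scale=0.3, size=n)
spatial = 0.9 * reasoning + rng.normal(scale=0.3, size=n)
X = np.column_stack([data_mix, reasoning, spatial])

pc = PC()
pc.learn(X)
print(pc.causal_matrix)  # adjacency matrix of the recovered graph
```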
At the same time, researchers have released another comprehensive spatial reasoning benchmark, offering yet another point of comparison for our models, SpaceQwen and SpaceThinker.
It should come as no surprise that another benchmark provides another model ranking. How can we reconcile the differences and utilize this view to enhance our understanding of the latent spatial reasoning capabilities of our models?
Causal data fusion provides a methodology for integrating information from heterogeneous sources, each with its own biases, enabling us to relate these findings to our application context and scale our capacity for scientific discovery.
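As a simplified stand-in for a full fusion analysis, the sketch below pools per-benchmark effect estimates with inverse-variance weights; genuine causal data fusion would also model each benchmark's selection and measurement biases, and the numbers here are placeholders.

```python
# Sketch: fixed-effect, inverse-variance pooling of effect estimates reported
# by heterogeneous benchmarks. Estimates and variances are made up.
import numpy as np

effects = np.array([0.12, 0.08, 0.15])       # per-benchmark effect estimates
variances = np.array([0.002, 0.004, 0.003])  # their sampling variances

weights = 1.0 / variances
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
print(f"pooled effect ≈ {pooled:.3f} ± {1.96 * pooled_se:.3f}")
```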
Once you've adopted the causal framing, you can leverage expert knowledge as priors, or encode other modeling assumptions, to make efficient use of your data. You can apply interventions designed to improve the system and reduce uncertainty about it. You can expect your offline evaluations to be more transportable to your application.
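Here is one hedged sketch of that workflow, applying econML's LinearDML to simulated offline logs with an adjustment set chosen from domain knowledge; the variables and data-generating process are hypothetical.

```python
# Sketch: double machine learning (econML LinearDML) to estimate the effect of
# a fine-tuning change from non-randomized offline evaluations, controlling
# for confounders suggested by expert knowledge of the causal graph.
import numpy as np
from econml.dml import LinearDML

rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=(n, 2))                          # confounders, e.g., task difficulty
T = (W[:, 0] + rng.normal(size=n) > 0).astype(int)   # 1 if the new regimen was used
Y = 0.1 * T + 0.3 * W[:, 0] + rng.normal(scale=0.2, size=n)  # eval score

est = LinearDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=None, W=W)
print(est.ate())  # debiased estimate of the average treatment effect
```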
Causal AI engineering enables systematic and efficient experimentation while retaining the flexibility to adapt to a quickly shifting landscape. How do new results influence your beliefs about what is essential for your AI application?
It's time we moved on from AI evaluation based on static benchmarks and regarded these snapshots as evidence of a latent capability that we can describe using a causal model.
Causal AI engineering enables us to go beyond model fitting, providing an efficient, empirically driven, and explainable understanding of what matters for your AI.