When "Just Claude It" Isn't Enough: Hyperparam Beats Bespoke Scripts for Data Exploration
Hyperparam is a great way to rapidly clean and transform a large dataset. I’ve been working with the Hyperparam team on UX, and when I demo the tool to ML practitioners, I’ve sometimes heard a version of the same question: "This looks cool, but couldn't I just ask Claude to whip up a quick script for this?" It's a fair objection. LLM-based coding is incredibly powerful, and the promise of going from question to custom analysis in seconds is compelling.
(I’m sharing this post today because Hyperparam is now in public release — which means you can follow these steps along if you want!)
So I decided to test it. I took a real dataset, one that needed actual cleaning and exploration, and worked through the same analysis twice: once with Hyperparam, once with Claude Code generating bespoke Python. The dataset is MathX, a collection of math problems with AI-generated solutions and reasoning chains. My goal was straightforward: figure out how often the dataset’s AI reasoning chain actually got the right answer. That would be a key step if I wanted to use this dataset for any training scenario, and it seemed like it would be pretty easy to check. I figured it was a reasonable approximation of the kind of data cleaning I might do in the real world.
The dataset is a Parquet file with three key columns: the problem, the expected answer, and the AI's generated solution (including all its reasoning). The question, answer, and logic chain used LaTeX-like math notation liberally.
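(For anyone following along in a notebook instead, loading the file is a couple of lines of pandas. The filename and the "problem" column name are my placeholders; the other two column names are the ones that come up later in this post.)

import pandas as pd

# Load the MathX Parquet file. The path is a placeholder, and the exact
# column names may differ in your copy of the dataset.
df = pd.read_parquet("mathx.parquet")
print(df[["problem", "expected_answer", "generated_solution"]].head())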
Here's what I learned.
The Claude Code Approach: Powerful but Full of Friction
I started with Claude Code, working in a Jupyter notebook.
My first prompt was simple: "Extract the answer from the generated solution and compare it to the expected answer."
I had to iterate a few times before the code was useful:
Pass 1: The script looked for non-existent columns and came back with null results. That might be on me: I named the columns the wrong things in my prompt, and Claude didn’t correct me.
Pass 2: Once we came to an understanding about column names, Claude wrote a regex to extract answers. The regex was... optimistic. It didn't account for the fact that many answers were wrapped in LaTeX \boxed{} commands, or that the "answer" might not be the last thing in the reasoning chain.
Pass 3: I pasted some failed examples into Claude, and it rewrote the extraction in a much more intelligent way. This time it successfully pulled out the boxed content (roughly along the lines of the sketch below).
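Here's a rough reconstruction of where the extraction ended up. It's a sketch from memory rather than Claude's actual code, and it swaps the regex for simple brace counting so that nested expressions like \boxed{\frac{1}{2}} come out intact:

def extract_answer(solution: str):
    # Prefer the contents of the last \boxed{...} in the solution, walking
    # the braces so nested LaTeX survives.
    start = solution.rfind("\\boxed{")
    if start != -1:
        i = start + len("\\boxed{")
        depth, chars = 1, []
        while i < len(solution):
            ch = solution[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return "".join(chars).strip()
            chars.append(ch)
            i += 1
    # Fallback: treat the last non-empty line of the reasoning chain as the answer.
    lines = [line.strip() for line in solution.splitlines() if line.strip()]
    return lines[-1] if lines else None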
Now it was time to move on to comparison. The naive comparison logic was broken: it was doing exact string matches on mathematical expressions that can be written in multiple equivalent forms. Some answers were clean (the number 28 is unambiguous), but other parts of the dataset had an astonishing range of difficult examples.
I hopped into an iterative loop. I had the notebook show me ten examples that failed to match (there's a sketch of that cell after this list). I would:
Run the notebook cell
Look at the mismatches
Copy representative examples back into Claude's context
Ask it to improve the extraction or comparison logic
Rerun the cell
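The inspection cell itself was nothing special. Continuing from the loading and extraction sketches above, it was roughly:

# Apply the extractor, then surface ten mismatches to paste back into
# Claude's context.
df["extracted_answer"] = df["generated_solution"].map(extract_answer)
mismatches = df[df["extracted_answer"] != df["expected_answer"]]
for _, row in mismatches.head(10).iterrows():
    print("expected: ", row["expected_answer"])
    print("extracted:", row["extracted_answer"])
    print("-" * 60)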
Claude and I made progress. We handled the \boxed{} stripping and answers that were wrapped in LaTeX like \text{}. We caught some edge cases. But the dataset really called for fuzzy matching: 0.5 compared to \frac{1}{2}, or "a=23" vs. "the tower is 23 meters high". The AI kept wanting to add increasingly complex parsing logic, and I kept feeling like I was fighting the tooling.
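To give a sense of what "increasingly complex parsing logic" means in practice, here's a stripped-down sketch of the direction the comparison kept heading in. It's illustrative, not the notebook's actual code:

import re
from fractions import Fraction

def as_number(text: str):
    # Try to read an answer as one exact number, covering a couple of common
    # LaTeX forms; return None if it isn't obviously numeric.
    text = text.strip().strip("$")
    frac = re.fullmatch(r"\\[dt]?frac\{(-?\d+)\}\{(-?\d+)\}", text)
    if frac:
        return Fraction(int(frac.group(1)), int(frac.group(2)))
    try:
        return Fraction(text)  # handles "28", "55/3", and "0.5"
    except (ValueError, ZeroDivisionError):
        return None

def answers_match(a: str, b: str) -> bool:
    na, nb = as_number(a), as_number(b)
    if na is not None and nb is not None:
        return na == nb  # so 0.5 matches \frac{1}{2}
    return re.sub(r"\s+", "", a) == re.sub(r"\s+", "", b)

Even this only rescues the numeric cases. "a=23" versus "the tower is 23 meters high" needs something that actually reads the sentence, which is where per-row scripting started to feel like the wrong tool.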
The fundamental issue was that each iteration required explicit decision-making. Should we use a regex or parse the LaTeX? How should we handle fractions versus decimals? Do we need an LLM call per row, or can we script this? Every time I wanted to make progress, I had to think through the implementation strategy first.
The Hyperparam Approach: Conversational Data Work
I opened the same MathX dataset in Hyperparam and typed:
"I'd like to find out how often the answer in 'Generated solution' matches the answer in 'expected_answer.' Can you give me a column that shows the answer in generated_solution, and another that tells me if it matches?"
Five minutes later, I had results on the first thousand rows.
The Hyperparam UI, and the two new columns.
Hyperparam extracted the answers and created a match column. As it ticked along, I could see the first few answers instantly, even as other rows were still working. When I looked at the results, I saw plenty of "NO" cases that I wanted to examine further—but the formatting was a mess. All those \boxed{} answers were back!
I then asked:
can you pull out the "\boxed" crud from the extracted_answer column?
Here's where something interesting happened. Instead of making an LLM call to interpret each solution, Hyperparam generated a JavaScript snippet:
let cleaned = answer.replace(/\\boxed\{/, '').replace(/\}$/, '');
It recognized that stripping the LaTeX box wrapper was a simple string operation—no LLM needed. Fast, cheap, and exactly right for the job.
This let me check some examples easily.
I was pleasantly surprised: the matcher LLM had figured out that 165/9 was the same as 55/3, and that (Median = 91.5, Mode = 92) was the same as (Median 91.5, Mode 92).
This is one of Hyperparam's most exciting design choices: it can route work to either JavaScript for simple transformations or LLM calls for semantic tasks. I didn't have to think about that tradeoff—it just made the right call.
I was through the task in three or four minutes, and was easily able to move on to next steps: counting the number of times the chain-of-reasoning AI reset itself; looking at the string length of the AI’s reasoning process; and characterizing the problems by difficulty and type.
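Each of those is a one-sentence prompt in Hyperparam. For comparison, the first two have cheap notebook equivalents, though the backtracking marker below is a placeholder rather than anything I verified against the dataset:

# Length of each reasoning chain, plus a rough count of "resets". The
# "wait" marker is a guess at how backtracking shows up; substitute
# whatever phrase your chains actually use.
df["reasoning_length"] = df["generated_solution"].str.len()
df["reset_count"] = df["generated_solution"].str.lower().str.count(r"\bwait\b")
print(df[["reasoning_length", "reset_count"]].describe())

Characterizing problems by difficulty and type, on the other hand, is exactly the kind of semantic judgment that wants an LLM call per row.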
Hyperparam Limitations
To be sure, Hyperparam wasn’t seamless. When I first loaded my file, the viewer didn’t want to display one of the columns. It turned out that the compression used in the original Parquet file was a very inefficient one, and Hyperparam, trying to keep the data in browser memory, declined to show me some of the longest columns. Fortunately, hitting the “export” button in Hyperparam and then re-opening the exported file made that problem go away.
Hyperparam also doesn’t currently do cross-row operations well. In the notebook, I set up a counter to see what percentage of rows were matches, so that I could see how much lift I’d gotten between rounds of improving my regexes. Hyperparam doesn’t give me those kinds of aggregate views.
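In the notebook, that counter was a one-liner over the columns from the earlier sketches:

# Share of rows where the extracted answer agrees with the expected one --
# useful for measuring lift between rounds of regex tweaks.
match_rate = (df["extracted_answer"] == df["expected_answer"]).mean()
print(f"match rate: {match_rate:.1%}")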
Last, I wanted to explore the data numerically, to test hypotheses about whether the difficulty of a problem correlated with the accuracy of the AI. I could spot-check individual rows in Hyperparam, but couldn’t do bulk operations. After exporting the data file, though, I was able to pull it into a Python notebook and quickly throw together a few Seaborn scatterplots.
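For completeness, the plotting side was only a few lines. The "difficulty" column is whatever label came out of the Hyperparam step, and the export filename is a placeholder:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Open the file exported from Hyperparam (placeholder path).
df = pd.read_parquet("mathx_hyperparam_export.parquet")
df["match"] = df["extracted_answer"] == df["expected_answer"]
df["reasoning_length"] = df["generated_solution"].str.len()

# Does rated difficulty track with longer reasoning, or with misses?
sns.scatterplot(data=df, x="difficulty", y="reasoning_length", hue="match")
plt.show()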
In Conclusion
Hyperparam made it easy to transform my data, either with an LLM call per row or by generating a snippet of JavaScript. The system got me answers in seconds, letting me move onward with my data cleaning and analysis. Yes, I could have gotten much of the way there with some time and a Claude prompt, but the efficiency of just dropping my file into Hyperparam and asking for my column transformations couldn’t be beat.
If you decide to check out Hyperparam, I’d love to hear about your experience! Drop me a line and let’s talk about how it can support data analysis for you!