<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://sshtomar.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://sshtomar.github.io/" rel="alternate" type="text/html" /><updated>2026-02-20T05:28:57+00:00</updated><id>https://sshtomar.github.io/feed.xml</id><title type="html">Shubham Tomar</title><subtitle>A collection of thoughts</subtitle><author><name>Shubham Tomar</name></author><entry><title type="html">RAG for Miners</title><link href="https://sshtomar.github.io/2025/02/18/rag-for-miners/" rel="alternate" type="text/html" title="RAG for Miners" /><published>2025-02-18T00:00:00+00:00</published><updated>2025-02-18T00:00:00+00:00</updated><id>https://sshtomar.github.io/2025/02/18/rag-for-miners</id><content type="html" xml:base="https://sshtomar.github.io/2025/02/18/rag-for-miners/"><![CDATA[<p>I recently worked on building and evaluating a RAG-based Q&amp;A system for a heavy equipment manufacturer. The bot answers questions from technical documentation – manuals, parts catalogs, specs, service bulletins – for operators running underground drilling rigs.</p>

<p>The tooling is mature. We used Databricks and Azure. Technically it was a plumbing job.</p>

<p>The real work was understanding how miners actually use the thing, where it fails them, and whether it moves the needle for the business.</p>

<p><img src="/assets/images/rag-mining/drilling-rig.png" alt="Underground drilling rig" /></p>

<h2 id="miners-do-not-like-to-type">Miners do not like to type</h2>

<p>We built and tested the system with well-formed technical questions, and validated it with internal technical stakeholders.</p>

<p>When we went for customer trials, the average query was 30 characters. Some were a single word.</p>

<p>“fuses”</p>

<p>“oil”</p>

<p>“brakes are not working”</p>

<p>These are people underground, wearing gloves, on a tablet bolted to a drilling rig. They type the minimum they think will get them an answer.</p>

<p><img src="/assets/images/rag-mining/operator-tablet.png" alt="Operator using tablet underground" /></p>

<p>Every LLM app has this gap. The team tests with careful prompts. Real users show up with fragments, typos, and single keywords. The golden dataset I ended up with – built from actual thumbs-up production feedback – had almost no overlap with test queries provided by internal technical stakeholders. Not just different questions. Different grammar, different length, different assumptions about what the system already knows.</p>

<h2 id="the-failure-modes">The failure modes</h2>

<p>We manually reviewed hundreds of responses and categorized the failures: hallucinations, retrieval misses, and graceful non-answers, where the bot politely declines without giving the user anything to act on.</p>

<p>Credit to senseis <a href="https://x.com/sh_reya">@sh_reya</a> and <a href="https://x.com/HamelHusain">@HamelHusain</a>.</p>

<p>We also categorized our queries using topic modelling – something I picked up from <a href="https://x.com/jxnlco">@jxnlco</a>.</p>

<p>This pattern is everywhere in production RAG systems and almost never measured. Standard evals ask: did it retrieve relevant documents? Is the answer grounded? Is it correct? Nobody asks: when the system couldn’t answer, did it fail usefully or just shrug?</p>

<p>Measure the non-answer rate.</p>
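<p>Counting these shrugs is cheap to automate. Here is a minimal Python sketch of how a non-answer rate could be tracked; the deflection phrases below are illustrative assumptions, and the real list should be mined from your own thumbs-down feedback:</p>

```python
import re

# Hypothetical deflection phrases. These are illustrative; build the real
# list from production feedback rather than copying this one.
NON_ANSWER_PATTERNS = [
    r"(?:could not|couldn't|can't) find",
    r"no relevant (?:documents|information)",
    r"please rephrase",
    r"contact (?:support|your dealer)",
]

def is_non_answer(response: str) -> bool:
    """Flag a response that deflects instead of answering."""
    text = response.lower()
    return any(re.search(pattern, text) for pattern in NON_ANSWER_PATTERNS)

def non_answer_rate(responses: list[str]) -> float:
    """Fraction of responses that give the user nothing to act on."""
    if not responses:
        return 0.0
    return sum(is_non_answer(r) for r in responses) / len(responses)
```

<p>Tracked over time, this one number tells you whether retrieval fixes are actually reaching users.</p>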

<h2 id="build-evaluations-from-production-feedback">Build evaluations from production feedback</h2>

<p>I split evaluation into two tiers: quality metrics and exploration signals. More teams should do this.</p>

<p>Quality metrics drive pass/fail decisions. They need to be deterministic where possible and calibrated against real data. We started with simple rule-based checks: is the response non-empty? Reasonable length? Includes source citations? Completed without errors? Latency acceptable?</p>

<p>Then we added checks for the specific failure modes we found by reading thumbs-down feedback: the graceful non-answer pattern, formatting problems, filler responses that bury the actual answer.</p>

<p>These rule-based checks caught the majority of real issues without a single LLM call.</p>
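<p>As an illustration, checks like these can live in a single pure function over a response trace. This is a sketch, not our production code; the field names (<code>answer</code>, <code>citations</code>, <code>latency_ms</code>, <code>error</code>) and the thresholds are assumptions:</p>

```python
# Sketch of deterministic quality checks on one response trace.
# Field names and thresholds are illustrative, not a real schema.
def rule_based_checks(trace: dict, max_latency_ms: int = 10_000) -> dict:
    """Run cheap pass/fail checks on a single response trace."""
    answer = (trace.get("answer") or "").strip()
    return {
        "non_empty": bool(answer),
        "reasonable_length": 20 <= len(answer) <= 4_000,
        "has_citations": bool(trace.get("citations")),
        "no_error": trace.get("error") is None,
        "latency_ok": trace.get("latency_ms", 0) <= max_latency_ms,
    }
```

<p>Because every check is deterministic, a failing trace is immediately debuggable: you know exactly which rule fired and why.</p>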

<p>Exploration signals are LLM-as-judge scorers for finding interesting traces to review – not for automated decisions. When we validated generic judges (relevance, correctness) against human-labeled data, agreement was near-perfect. But domain-specific judges – “does the answer reference the right product?” or “are safety warnings included?” – dropped below 90%. Useful for surfacing traces worth looking at. Not reliable enough for quality gates.</p>
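<p>Validating a judge is nothing fancier than measuring how often it agrees with a human on the same traces. A minimal sketch, where the 90% bar mirrors the cutoff above and everything else is illustrative:</p>

```python
# Sketch: validate an LLM judge against human labels before trusting it.
# The 0.90 threshold is a tunable assumption, not a universal constant.
def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of traces where the judge and the human labeler agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        return 0.0
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

def usable_as_gate(agreement: float, threshold: float = 0.90) -> bool:
    """Promote a judge to an automated quality gate only above the threshold."""
    return agreement >= threshold
```
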

<p>The practical takeaway: deterministic scorers first. They’re boring and they work. LLM judges second, and only after you’ve validated them against human judgment.</p>

<h2 id="would-this-thing-actually-make-money">Would this thing actually make money?</h2>

<p>A stakeholder asked this. It’s the right question.</p>

<p>The end users are mining companies running million-dollar drilling rigs underground. When a rig breaks, the operator either fixes it or waits for a technician. Every hour of downtime is lost production.</p>

<p>The bot collapses the time between “something’s wrong” and “I know what to do.” Finding a part number, looking up a procedure, identifying a fault code. Even small time savings per incident compound across hundreds of machines.</p>

<p>But a bot that shrugs at 1 in 6 queries trains users to stop trusting it. They leave silently. Once they stop asking, ROI drops to zero regardless of how good the technology is.</p>

<p>This is what most teams get wrong about LLM ROI. They benchmark accuracy and project value from success cases. But adoption is driven by failure cases.</p>

<p><img src="/assets/images/rag-mining/break-room.png" alt="Break room inside the Kittila mine, Finland, almost 1000 meters underground" /></p>]]></content><author><name>Shubham Tomar</name></author><summary type="html"><![CDATA[I recently worked on building and evaluating a RAG-based Q&amp;A system for a heavy equipment manufacturer. The bot answers questions from technical documentation – manuals, parts catalogs, specs, service bulletins – for operators running underground drilling rigs.]]></summary></entry><entry><title type="html">Building Inquiro: When Code Becomes Cheap</title><link href="https://sshtomar.github.io/2025/01/31/building-inquiro/" rel="alternate" type="text/html" title="Building Inquiro: When Code Becomes Cheap" /><published>2025-01-31T00:00:00+00:00</published><updated>2025-01-31T00:00:00+00:00</updated><id>https://sshtomar.github.io/2025/01/31/building-inquiro</id><content type="html" xml:base="https://sshtomar.github.io/2025/01/31/building-inquiro/"><![CDATA[<p>Kailash Nadh recently wrote that LLMs have inverted the old adage: “Talk is cheap, show me the code” has become “Code is cheap, show me the talk.”</p>

<p>He’s right. And I have proof.</p>

<p>I built <a href="https://inquiro.dev">Inquiro</a>—a statistical analysis platform for researchers—in weeks, not months. Not because I cut corners. Because code is cheap now, and what I had was the talk.</p>

<h2 id="the-inversion">The Inversion</h2>

<p>For decades, typing was the bottleneck. You could imagine features faster than you could implement them. Ideas were cheap; execution was expensive.</p>

<p>That’s over.</p>

<p>“Programming is 90% thinking and 10% typing” was always true in spirit. Now it’s true literally. The thinking—what Nadh calls “the talk”—is the scarce resource. The ability to articulate what you want, to architect systems, to imagine solutions. The code follows.</p>

<p>Inquiro required design sensibility, statistical knowledge, and architectural judgment. It needed someone who understood what researchers actually need, why standard errors matter, how assumption checks should surface in a UI. The implementation? AI handled most of it.</p>

<h2 id="what-i-actually-did">What I Actually Did</h2>

<p>I didn’t write Inquiro. I directed it.</p>

<p>I described what I wanted: a clean interface that signals rigor. A pipeline that generates reproducible Python code with proper diagnostics. Automatic checks for heteroskedasticity, influential observations, pre-trend validation. Results that researchers can trust and cite.</p>

<p>AI became a collaborator that never tired. For well-defined problems—parsing outputs, formatting tables, handling edge cases—it generated working code faster than I could type. For ambiguous problems—how should the UX respond when an assumption fails?—I still had to think. AI implemented my answers.</p>

<p>Marc Andreessen describes this as the superpowered individual: “The really great people are becoming spectacularly great. If you’re very good at it and you can really harness AI, you can become not just great, but spectacularly productive.”</p>

<p>I felt this. The multiplier was real—maybe 3-5x overall. But gains weren’t uniform. Boilerplate: 10x faster. Complex statistical logic: 2x. Novel UX decisions: same speed, but with more options to consider.</p>

<h2 id="the-combination-effect">The Combination Effect</h2>

<p>Andreessen also talks about what he calls the “Mexican standoff” between product managers, engineers, and designers. Everyone now believes they can do each other’s jobs—and they’re correct.</p>

<p>But the real insight is this: “The additive effect of being good at two things is more than double. The additive effect of being good at three things is more than triple. You become a super relevant specialist in the combination of the domains.”</p>

<p>Building Inquiro required that combination. Product sense to know what researchers need. Engineering judgment to architect the system. Statistical knowledge to ensure methodological rigor. Design intuition to make it usable.</p>

<p>I’m not world-class at any one of these. But I’m competent at all of them. And AI amplified each one. The combination—plus leverage—let me build what would have required a team.</p>

<h2 id="the-nadh-warning">The Nadh Warning</h2>

<p>Nadh raises a concern worth taking seriously: younger developers risk never building foundational skills, becoming dependent on tools they don’t understand.</p>

<p>I agree this is a risk. But I’d frame it differently.</p>

<p>AI doesn’t eliminate the need for understanding—it raises the bar for what understanding means. You can’t evaluate AI output without knowing what good looks like. You can’t direct implementation without knowing what you’re building. You can’t debug generated code without understanding the system.</p>

<p>The builders who thrive won’t be those who type fastest. They’ll be those who think clearest. Who can articulate problems precisely. Who know enough about enough domains to spot when AI is wrong.</p>

<p>Code is cheap. Judgment is not.</p>

<h2 id="what-i-learned">What I Learned</h2>

<p>Inquiro embeds methodological best practices from J-PAL, DIME, and other research institutions. It knows about cluster-robust standard errors, pre-trend validation, influential observations—not because I hardcoded rules, but because I could articulate what rigor looks like and AI could implement it.</p>

<p>The product exists because I had the talk. The vision, the requirements, the quality bar. AI provided the code.</p>

<p>That’s the shift. Not AI doing the work. AI making the work cheap enough that one person with clear thinking can build what previously required many.</p>

<p>Show me the talk.</p>

<hr />

<p><strong>Sources:</strong></p>
<ul>
  <li>Kailash Nadh, <a href="https://nadh.in/blog/code-is-cheap/">Code is Cheap</a></li>
  <li>Marc Andreessen on Lenny’s Podcast, <a href="https://www.youtube.com/watch?v=87Pm0SGTtN8">The AI Future</a></li>
</ul>]]></content><author><name>Shubham Tomar</name></author><summary type="html"><![CDATA[Kailash Nadh recently wrote that LLMs have inverted the old adage: “Talk is cheap, show me the code” has become “Code is cheap, show me the talk.”]]></summary></entry><entry><title type="html">The Engineer Who Writes</title><link href="https://sshtomar.github.io/2025/01/31/the-engineer-who-writes/" rel="alternate" type="text/html" title="The Engineer Who Writes" /><published>2025-01-31T00:00:00+00:00</published><updated>2025-01-31T00:00:00+00:00</updated><id>https://sshtomar.github.io/2025/01/31/the-engineer-who-writes</id><content type="html" xml:base="https://sshtomar.github.io/2025/01/31/the-engineer-who-writes/"><![CDATA[<p>“The additive effect of being good at two things is more than double,” Andreessen says. “The additive effect of being good at three things is more than triple. You become a super relevant specialist in the combination of the domains.”</p>

<p>We like clear boundaries. In school we had subjects. The real world, however, is messy. The most interesting problems live at the boundaries; innovation happens in the interstices.</p>

<p>For decades, the industry mirrored those school subjects. Engineers wrote code. Product managers wrote specs. Designers drew screens. Everyone stayed in their lane, attended their standups, handed off their artifacts. The system worked because the tools were hard. Specialization was efficient.</p>

<p>That world is ending.</p>

<p>Every engineer now believes they can be a PM and a designer—because with AI, they can prototype entire products in an afternoon. Every PM thinks they can code and design—because they can. Every designer knows they can do both other jobs too.</p>

<h2 id="the-old-career-path-is-dead">The Old Career Path Is Dead</h2>

<p>The traditional path was linear: junior engineer, senior engineer, staff engineer, principal engineer. Or: APM, PM, senior PM, director of product. Climb the ladder in your lane.</p>

<p>That ladder is now horizontal.</p>

<p>The most valuable people I know are hard to title. They’re engineers who can run customer discovery calls. Product managers who can read a codebase and spot architectural risk. Designers who can write SQL and ship experiments.</p>

<h2 id="what-this-means-practically">What This Means Practically</h2>

<p>If you’re an engineer: learn to write. Not just code comments—actual prose. Write about technical decisions. Write about product tradeoffs. Write about what you’re learning. Writing forces clarity.</p>

<p>If you’re a PM: learn to read code. You don’t need to ship production features. You need to understand what’s possible, what’s expensive, and what’s being oversimplified. Technical literacy changes the questions you ask.</p>

<p>If you’re a designer: learn to prototype beyond Figma. The gap between mockup and working software is where ideas die. Close that gap yourself, and you control the outcome.</p>

<h2 id="fighting-slop">Fighting slop</h2>

<p>You still have to do the work. You still need to understand what makes a good design, and you need to understand systems design.</p>

<p>For a hands-on developer, reading and critically evaluating code has become more important than learning syntax and typing it out line by line. Writing code by hand is still an important skill, because the ability to read code effectively comes from it in the first place. But daily software development workflows have flipped completely: we outsource tasks to AI, and that frees up time to think and make better decisions.</p>

<p><strong>Sources:</strong></p>
<ul>
  <li>Kailash Nadh, <a href="https://nadh.in/blog/code-is-cheap/">Code is Cheap</a></li>
  <li>Marc Andreessen on Lenny’s Podcast, <a href="https://www.youtube.com/watch?v=87Pm0SGTtN8">The AI Future</a></li>
</ul>]]></content><author><name>Shubham Tomar</name></author><summary type="html"><![CDATA[“The additive effect of being good at two things is more than double,” Andreessen says. “The additive effect of being good at three things is more than triple. You become a super relevant specialist in the combination of the domains.”]]></summary></entry><entry><title type="html">The Disease of Obsession</title><link href="https://sshtomar.github.io/2024/12/26/obsession/" rel="alternate" type="text/html" title="The Disease of Obsession" /><published>2024-12-26T00:00:00+00:00</published><updated>2024-12-26T00:00:00+00:00</updated><id>https://sshtomar.github.io/2024/12/26/obsession</id><content type="html" xml:base="https://sshtomar.github.io/2024/12/26/obsession/"><![CDATA[<p>We often discuss work-life balance, but rarely work-life obsession. Yet every great achievement stems from someone’s inability to let go, to stop thinking, to “switch off.”</p>

<p>Obsession gets a bad rap. We celebrate the outcomes—the breakthrough products, the elegant solutions, the market-defining companies—while condemning the mindset that birthed them. This cognitive dissonance serves no one.</p>

<p>Here’s what obsession really means: you keep working on the problem in the shower. You wake up with solutions. Your mind constantly reorganizes information, finding new patterns, better approaches. You can’t help it. The problem has infected you.</p>

<p>This isn’t about hours worked. It’s about mental real estate occupied. When you’re genuinely obsessed with your craft, ego dissolves. Pride emerges not from being better than others, but from the work itself meeting your own standards.</p>

<p>The question isn’t whether to be obsessed, but what to be obsessed with. Choose wisely. The disease is beautiful, but it’s still a disease. Make sure it’s one worth having.</p>]]></content><author><name>Shubham Tomar</name></author><summary type="html"><![CDATA[We often discuss work-life balance, but rarely work-life obsession. Yet every great achievement stems from someone’s inability to let go, to stop thinking, to “switch off.”]]></summary></entry></feed>