Ivan Moshkov & Daria Gitman - How to Build an LLM for Math Reasoning without Proprietary Data?
PyData
Talk Summary
☀️ Quick Takes
Is this Talk Clickbait?
Our analysis suggests that the Talk is not clickbait because most parts address the title's claim by discussing methods and processes for building an LLM for math reasoning without proprietary data.
1-Sentence-Summary
Ivan Moshkov and Daria Gitman from Nvidia explore building a large language model for math reasoning using diverse, non-proprietary datasets and various solution approaches, while addressing challenges in model accuracy and performance through enhanced training techniques and robust evaluation methods.
Favorite Quote from the Author
"The goal is finally to get a general assistant which is capable of writing some letters and solving some tricky reasoning problems."
💨 tl;dr
Building effective LLMs for math reasoning hinges on high-quality datasets, diverse solution styles, and innovative training methods. Key strategies include generating synthetic data, managing errors, and fine-tuning models specifically for math tasks.
💡 Key Ideas
- Large Language Models (LLMs) require extensive datasets for training, involving pre-training and fine-tuning processes, with math reasoning being a key evaluation metric.
- The primary datasets for math problem-solving include Grade School Math and University-level problems, each with distinct characteristics and difficulties.
- LLMs struggle with arithmetic, impacting performance, but various solution styles (text-based, code-based, code interpreter) address this challenge.
- Model performance varies based on data usage policies, with some models performing comparably to top ones without proprietary data.
- Effective dataset labeling and synthetic data generation are crucial for improving model accuracy, with GPT-4 showing superior results in these processes.
- A method to prevent model cheating during training involves replacing intermediate computations with letter variables to maintain clean data.
- The Data Explorer tool enhances model development and data analysis, facilitating better insights and visualization of mathematical data.
- Synthetic data generation involves creating diverse solutions for each question, with correctness verified against ground truth answers, and sequential problem-solving improving outcomes.
- Managing randomness in LLM code generation is essential, with techniques like code recovery and extensive grading systems enhancing performance.
- The focus is on achieving high performance using significant compute, with potential for developing a dedicated math-focused language model to improve problem-solving capabilities.
🎓 Lessons Learnt
- Focus on Data Quality: High-quality, diverse datasets are crucial for building effective models, ensuring better performance and reliability.
- Use Multiple Solution Styles: Employ various approaches (text-based, code-based, etc.) to tackle math reasoning, as different styles can complement each other.
- Implement Interactive Problem Solving: Mimic human problem-solving by enabling the model to generate, write, and execute code, resulting in more natural solutions.
- Generate a Large Number of Solutions: Create extensive synthetic datasets (128-256 solutions per problem) for robust training coverage.
- Filter Out Incorrect Solutions: Removing incorrect answers from training data enhances the model's reliability and accuracy.
- Short Demonstrations Aid Learning: Providing brief examples of expected formats helps guide the model towards producing desired outputs.
- Evaluate Against Ground Truth: Always assess generated solutions against verified answers to ensure quality and correctness.
- Expect and Manage Errors: Recognizing that some degree of error is inevitable helps in refining models and improving overall performance.
- Optimize Code Execution Practices: Use isolated environments and efficient techniques for code execution to enhance model reliability and performance.
- Consider Specific Fine-Tuning: Fine-tuning models specifically for math tasks can lead to significant performance improvements.
🌚 Conclusion
To enhance math problem-solving capabilities in LLMs, focus on data quality, interactive problem-solving, and extensive solution generation. Embrace various approaches and continuously evaluate against ground truth to ensure reliability and accuracy.
In-Depth
Worried about missing something? This section includes all the Key Ideas and Lessons Learnt from the Talk. We've ensured nothing is skipped or missed.
All Key Ideas
Large Language Models Overview
- Large language models (LLMs) are neural networks that predict the next word given a prefix, requiring vast data corpuses for training.
- The training process involves pre-training on huge datasets, followed by supervised fine-tuning with smaller, curated datasets to enable specific task performance.
- Math reasoning is a significant challenge for LLMs, as it involves logical reasoning steps to arrive at answers, making it a key metric for evaluating model performance.
- Benchmark numbers are reported by researchers to showcase the reasoning abilities of their models, with math reasoning serving as a proxy for general model capabilities.
Datasets and Solutions for Math Problem Solving
- Two primary datasets used for training: the Grade School Math dataset (GSM8K, 7.5k problems) and a university-level math problem collection (the MATH dataset, 7.5k problems across five difficulty levels and various topics).
- Grade School Math problems require 2 to 5 steps to solve and always have integer solutions, while University-level problems are significantly harder and can stump even experienced mathematicians.
- Large Language Models (LLMs) struggle with arithmetic, which affects their performance despite having correct reasoning.
- Three solution styles for LLMs: text-based solutions, which are traditional but limited by LLMs' arithmetic skills; code-based solutions, where models write code (usually Python) to perform the calculations, reducing arithmetic errors; and code-interpreter style, where models interleave text and code, allowing a more human-like approach to problem solving (see the sketch after this list).
- The team created an instruction-tuning dataset of 8 million samples from GSM8K and MATH questions, and trained models on it without relying on proprietary data from OpenAI.
- Results show various models based on Llama 2 70B, with some utilizing code for reasoning and others not, highlighting the differences in performance based on data usage policies.
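To make the solution styles above concrete, here is a minimal sketch of what a code-interpreter-style solution to a GSM8K-type problem could look like: the reasoning is kept as text (here as comments), and every arithmetic step is delegated to Python so the model never has to compute numbers itself. The problem and variable names are invented for illustration and are not taken from the talk.

```python
# Hypothetical GSM8K-style problem (illustrative, not from the talk):
# "A bakery sells 23 muffins per day at $3 each and 17 cookies per day at $2 each.
#  How much revenue does it make in a 5-day week?"

# The text-style reasoning stays as comments; all arithmetic is done by Python,
# which is what makes the code-based styles robust to arithmetic errors.
muffin_revenue_per_day = 23 * 3      # revenue from muffins each day
cookie_revenue_per_day = 17 * 2      # revenue from cookies each day
daily_revenue = muffin_revenue_per_day + cookie_revenue_per_day
weekly_revenue = 5 * daily_revenue   # five-day week

print(weekly_revenue)                # the printed value becomes the final answer
```

In the code-interpreter style described in the talk, text and code blocks alternate and the interpreter's output is fed back into the generation before the model states its final answer.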
Model Performance and Dataset Considerations
- The model performs comparably to top models on GSM8K and slightly behind on MATH without using commercial data.
- GPT-4 significantly outperforms other models, achieving higher accuracy on the GSM8K and MATH benchmarks.
- A training dataset needs to be labeled, often consisting of questions and answers, but solutions are not always necessary.
- The effectiveness of the labeling model directly impacts the quality of results.
- Short demonstrations are used to guide the model in the expected solution format, which helps in generating accurate outputs (see the example prompt after this list).
- Synthetic dataset generation involves producing multiple solutions per problem and filtering out incorrect answers.
- GPT-4 provides better coverage and results in synthetic dataset generation compared to other models like Mixtral.
- The MATH dataset requires careful consideration of ground truth reference solutions for effective fine-tuning.
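The short demonstrations mentioned above are essentially few-shot examples that pin down the expected output format (for instance, a final answer wrapped in \boxed{}). A minimal sketch of such a prompt is shown below; the instruction wording, delimiters, and example problem are assumptions for illustration, not the team's actual prompt.

```python
# A minimal few-shot prompt template demonstrating the expected solution format.
# The instruction text and the worked example are illustrative placeholders.
FEW_SHOT_PROMPT = """Solve the math problem. Show your reasoning and put the
final answer inside \\boxed{{}}.

Question: Tom has 3 boxes with 12 apples each. He gives away 7 apples.
How many apples does he have left?
Solution: Tom starts with 3 * 12 = 36 apples. After giving away 7 he has
36 - 7 = 29 apples. The answer is \\boxed{{29}}.

Question: {question}
Solution:"""


def build_prompt(question: str) -> str:
    """Fill the template with a new question for the labeling model."""
    return FEW_SHOT_PROMPT.format(question=question)


print(build_prompt("What is 15% of 240?"))
```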
Model Development Insights
- The model can cheat by using final answers from reference solutions, which is undesirable for training data.
- A method was developed to replace intermediate computations with letter variables to prevent cheating while still guiding model reasoning (a rough sketch follows after this list).
- This approach improved training set coverage by 133% while maintaining clean data.
- Daria introduced the Data Explorer tool, created to assist in model development and data analysis.
- The model development loop involves creating a model, fine-tuning it, checking quality with a separate dataset, and iterating based on findings.
- The Data Explorer tool focuses on the inference stage and dataset analysis, simplifying processes for large models.
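The letter-variable trick mentioned in this block can be pictured as follows: numeric results of intermediate computations in the reference solution are replaced with symbolic names, so the prompted model still sees the structure of the reasoning but cannot simply copy the final number. This is a rough reconstruction of the idea; the regex and naming scheme below are assumptions, not the team's implementation.

```python
import re
from itertools import count


def mask_intermediate_results(solution: str) -> str:
    """Replace numeric results of intermediate computations (numbers that appear
    after an '=' sign) with letter variables x1, x2, ...

    This keeps the reasoning steps as hints while hiding the values the model
    could otherwise copy, including the final answer.
    """
    counter = count(1)

    def repl(match: re.Match) -> str:
        return f"= x{next(counter)}"

    # '= 36' -> '= x1', '= 29' -> '= x2', and so on.
    return re.sub(r"=\s*-?\d+(?:\.\d+)?", repl, solution)


reference = "Tom starts with 3 * 12 = 36 apples. After giving away 7 he has 36 - 7 = 29 apples."
print(mask_intermediate_results(reference))
# -> "Tom starts with 3 * 12 = x1 apples. After giving away 7 he has 36 - 7 = x2 apples."
```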
Data Analysis Tool Features
- Interactive parameter changes require Python scripting for effective visualization of prompt modifications.
- Visualization of mathematical data is crucial, particularly for LaTeX formulas, to avoid scrolling and enhance readability.
- The tool supports both single question inference and whole dataset analysis, providing statistics and data pathways.
- Custom functions for filtering, sorting, and statistics labeling are essential for effective data analysis in the tool (see the sketch after this list).
- The Data Explorer tool offers improved rendering of mathematical formulas compared to general data analysis tools.
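As a rough illustration of the custom filtering and statistics functions mentioned above, one could imagine operating on a table of generations as sketched below. The Data Explorer's actual API is not shown in the talk summary, so the column names and the pandas-based approach here are assumptions.

```python
import pandas as pd

# Hypothetical table of model generations; the schema is illustrative,
# not the Data Explorer's actual data format.
generations = pd.DataFrame(
    {
        "question_id": [1, 1, 2, 2],
        "predicted_answer": ["42", None, "7", "7"],
        "ground_truth": ["42", "42", "8", "8"],
        "has_execution_error": [False, True, False, False],
    }
)

# Custom filter: keep only generations that produced an answer without errors.
valid = generations[
    generations["predicted_answer"].notna() & ~generations["has_execution_error"]
]

# Custom statistic: per-question accuracy against the ground truth.
accuracy = (
    (valid["predicted_answer"] == valid["ground_truth"])
    .groupby(valid["question_id"])
    .mean()
)
print(accuracy)
```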
Evaluation and Improvement Insights
- The predicted answer field is absent when there is no boxed field in the generated solution, often due to execution errors (see the extraction sketch after this list).
- Handling issues like arithmetic errors improved results on the MATH dataset by 2% and on the GSM8K set by about 4%.
- Incorrect samples may still contain correct reasoning; evaluation can mislabel them as incorrect due to additional information in the boxed field.
- Samples marked as incorrect are not always actually wrong; some errors come from evaluation mistakes.
- The demo tool has features for comparing models and analyzing data deeply.
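Since the predicted answer is read from the \boxed{...} expression in the generated solution, a missing box (often caused by an execution error) leaves nothing to grade. A minimal extraction step might look like the sketch below; it is an illustration, not the team's code, and a real grader would also need to handle nested braces in LaTeX answers.

```python
import re
from typing import Optional


def extract_boxed_answer(solution: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a generated solution,
    or None if the model never produced one (e.g., after an execution error).

    Note: this simple regex does not handle nested braces such as
    \\boxed{\\frac{1}{2}}; a real grader needs a small brace-matching parser.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None


print(extract_boxed_answer(r"The total is 36 - 7 = 29, so the answer is \boxed{29}."))  # 29
print(extract_boxed_answer("Execution error: division by zero"))                        # None
```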
Synthetic Data Generation Process
- The generation of synthetic data uses the combined GSM8K and MATH datasets, with a significant number of questions (7.5k training questions, and test parts of 1k and 5k questions).
- They create multiple solutions for each question, ensuring some diversity even though duplicated answers are included.
- Correctness of generated solutions is verified by comparing model-generated answers with ground truth answers (a rough sketch of this loop follows after this list).
- Solutions incorporate both natural text and Python code, executed through a code interpreter for validity.
- They experimented with breaking down problem-solving into steps and found that a sequential approach improved results.
- Most errors arise from reasoning issues rather than extraction errors, often due to high sampling temperature affecting model output consistency.
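The generate-then-filter loop described in this block can be sketched roughly as follows. The `sample_solution` and `run_and_extract` callables stand in for the actual LLM sampling and code-interpreter execution, which are not detailed in the summary; the point is the structure: sample many candidates per question and keep only those whose answer matches the ground truth.

```python
from typing import Callable, Optional


def build_synthetic_dataset(
    questions: list[str],
    ground_truth: dict[str, str],
    sample_solution: Callable[[str], str],            # placeholder: sample one solution from the LLM
    run_and_extract: Callable[[str], Optional[str]],  # placeholder: execute its code, read the answer
    n_samples: int = 128,
) -> list[dict[str, str]]:
    """Sample many candidate solutions per question and keep only those whose
    extracted answer matches the ground truth answer."""
    kept = []
    for question in questions:
        for _ in range(n_samples):
            solution = sample_solution(question)
            answer = run_and_extract(solution)
            # Filtering step: drop failed executions and wrong answers.
            if answer is not None and answer == ground_truth[question]:
                kept.append({"question": question, "solution": solution})
    return kept
```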
Challenges and Techniques in LLM Code Generation
- The process involves using the LLM to generate Python code, which is then executed in a controlled sandbox to prevent harmful actions (see the sketch after this list).
- A significant challenge is managing the randomness and potential harmful outputs of the LLM.
- Optimizations can be made in the model's operations, such as reusing KV cache instead of recalculating it from scratch.
- The LLM is prompted with examples to generate solutions, even if it wasn't specifically trained for this task.
- Code recovery techniques can improve performance by regenerating failed code and using majority voting on outputs.
- The team did not initially use self-reflection techniques but utilized problem prompting to enhance model detail before coding.
- An extensive grading system is employed to compare outputs more robustly, accommodating various answer formats.
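A minimal way to picture the controlled sandbox is to run each generated snippet in a separate process with a timeout, so that crashes or infinite loops cannot take down the generation pipeline. The sketch below uses a plain subprocess as a stand-in; the team's actual sandbox and any stronger isolation (containers, restricted imports, no network) are not described in the summary.

```python
import subprocess
import sys


def run_generated_code(code: str, timeout_s: float = 10.0) -> tuple[str, str]:
    """Execute LLM-generated Python in a separate process with a timeout.

    Returns (stdout, stderr). A real sandbox would add stronger isolation
    (containers, restricted builtins, no network access), omitted here.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return "", "timeout: generated code did not finish in time"


out, err = run_generated_code("print(23 * 3 + 17 * 2)")
print(out.strip())  # 103
```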
Key Focus Areas in Model Development
- The focus of the work is on achieving high results with significant compute rather than solely on optimization for speed.
- Performance benchmarks are prioritized before optimizing for production inference latency.
- There hasn’t been an attempt to retrain models specifically on math data; the goal is to create a general assistant capable of both general writing tasks and solving tricky reasoning problems.
- General models trained on math data should perform comparably to models specifically pre-trained for math.
- There’s potential for creating a separate language model trained exclusively on math data, which could enhance math problem-solving capabilities.
All Lessons Learnt
Lessons Learnt
- Importance of Data Quality
- Stages of Model Training
- Reasoning as a Benchmark
- Supervised Fine-Tuning Focus
Strategies for Enhancing LLM Performance
- Use Multiple Solution Styles for LLMs: Employ different approaches like text-based, code-based, and code interpreter methods to enhance LLM performance in math reasoning. Each style has its pros and cons, so a combination is often more effective.
- Address Arithmetic Limitations: Recognize that LLMs struggle with arithmetic, which can lead to incorrect answers even when reasoning is correct. This necessitates using code execution to handle calculations more reliably.
- Leverage Open Source Data: Build models using open-source datasets to avoid legal complications with proprietary data. This allows for transparency and accessibility in model training and deployment.
- Interactive Problem Solving: Implement a solution workflow where the model can generate text, write code, and execute it, mimicking human problem-solving methods. This leads to more natural and flexible solutions.
Data Generation and Model Training Guidelines
- Use Brute Force Scaling for Data Generation: Generate a large number of solutions (n = 128 or 256) per problem to create a robust synthetic dataset. This helps in achieving good coverage of problem solutions.
- Filter Out Incorrect Solutions: After generating answers, remove those that led to incorrect answers to ensure the quality of the training set. This improves the reliability of the model.
- Diversity in Data is Crucial: To build strong models, ensure the dataset is diverse. Lack of coverage from synthetic solutions can limit the effectiveness of the fine-tuned models.
- Short Demonstrations Improve Formatting: Providing a few short demonstrations on the expected solution format can guide the model to produce answers in the desired format, enhancing the training process.
- Base Model Selection Matters: Choose a powerful base model for fine-tuning, as the capabilities of the labeling model directly influence the quality of the results.
Training Data Best Practices
- Avoid Cheating Solutions in Training Data
- Use Intermediate Variables for Hints
- Improve Training Set Coverage
- Streamline Inference Processes
Key Insights on Data Analysis and Visualization
- Visualization is Key: Proper visualization of prompts and outputs is crucial for understanding changes when working with LLMs. It helps in quickly seeing what has been altered without excessive scrolling.
- Custom Functions are Essential: When analyzing data, writing custom functions for filtering and sorting can significantly enhance the usability of your tool, making the analysis more tailored to specific needs.
- Importance of Proper Rendering: Without proper rendering of complex formulas, understanding and readability suffer. Good rendering is necessary for clarity in mathematical reasoning tasks.
- Comparability of Generations Matters: When dealing with different generations of data, ensure they are similar enough to combine meaningfully for statistical analysis; otherwise, comparisons may not yield useful insights.
Key Insights on Model Performance
- Handling Errors Improves Results: By addressing issues with error messages in generated solutions, they could improve results on the MATH dataset by 2%.
- Arithmetic Ability Matters: Improving the model's arithmetic ability led to a 4% increase in results on the GSM8K set, highlighting the importance of basic arithmetic skills in model performance.
- Evaluation Can Misclassify Correct Answers: Sometimes samples marked as incorrect could actually have the right reasoning; evaluation methods may need refinement to avoid misclassification.
- Correctness in Code vs. Boxed Answers: Differences between the model's generated answer and the boxed field can lead to incorrect evaluations, showing that output formats matter for accurate assessment.
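One way to reduce the misclassification described above is to normalize both answers before comparing them and to fall back to a symbolic check for expressions that are written differently but are mathematically equal. The grader below is a hedged sketch under that assumption, not the grading system used in the talk.

```python
from sympy import simplify, sympify


def answers_match(predicted: str, reference: str) -> bool:
    """Compare a predicted answer with the reference, tolerating formatting noise."""
    norm = lambda s: s.strip().replace(" ", "").replace("$", "").rstrip(".")
    if norm(predicted) == norm(reference):
        return True
    # Fall back to symbolic equality, e.g. "1/2" vs "0.5".
    try:
        return simplify(sympify(norm(predicted)) - sympify(norm(reference))) == 0
    except Exception:
        return False


print(answers_match("0.5", "1/2"))       # True: same value, different notation
print(answers_match("29 apples", "29"))  # False: extra text in the boxed field still fails
```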
Lessons Learnt
- Break down complex problems into steps.
- Use diverse solutions to improve model performance.
- Evaluate generated solutions against ground truth.
- Expect some degree of error in outputs.
Best Practices for Code Execution
- Use an isolated sandbox for code execution
- Optimize KV cache usage
- Incorporate code recovery techniques (see the sketch after this list)
- Employ chain of thought prompting
- Utilize a robust grading system
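Code recovery and majority voting can be combined roughly as in the sketch below: failed executions are regenerated up to a few times, and the final answer is chosen by voting over all successful samples. The `generate` and `execute` callables are placeholders for the actual LLM and sandbox calls, which are not given in the summary.

```python
from collections import Counter
from typing import Callable, Optional


def solve_with_recovery(
    generate: Callable[[], str],              # placeholder: sample one code solution from the LLM
    execute: Callable[[str], Optional[str]],  # placeholder: run it, return the answer or None on failure
    n_samples: int = 8,
    max_retries: int = 2,
) -> Optional[str]:
    """Sample several solutions, regenerate failed ones, and majority-vote the answers."""
    answers = []
    for _ in range(n_samples):
        answer = None
        for _ in range(max_retries + 1):  # code recovery: retry failed generations/executions
            answer = execute(generate())
            if answer is not None:
                break
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]  # majority vote over extracted answers


# Toy usage with stubbed generation and execution:
print(solve_with_recovery(lambda: "print(2 + 2)", lambda code: "4"))  # -> "4"
```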
Lessons Learnt
- Focus on achieving high performance first.
- Consider fine-tuning models specifically for math.
- Explore the potential of separate models for specific tasks.