Final Reflection 70B Update — My Post-Mortem

Matt Shumer

Matt Shumer completed his own training run and benchmarking using Sahil’s code and dataset, and shares his findings along with a timeline of the Reflection 70B events so the community can see what happened.
Reflection 70B was a curiosity-driven experiment aimed at improving stock LLM performance through prompt engineering and fine-tuning.
The initial idea behind Reflection-Tuning was to train LLMs to detect and correct mistakes in their reasoning processes, potentially enhancing performance.
Shumer's early experiments with “super-prompts” suggested that improving LLMs' reflective capabilities could lead to more accurate responses and better reasoning.
The project began as a side endeavor during a vacation, motivated by the desire to explore the Reflection-Tuning concept without impacting his main focus on HyperWrite.
Shumer collaborated with Sahil, CEO of Glaive, to draw on Glaive's expertise in synthetic data, since high-quality data was essential for the Reflection-Tuning experiment.
The initial training of the Reflection model started with Llama 3.1 8B to allow for faster iterations.

Initial training of the 8B models showed mixed results, leading to the hypothesis that model scale was inadequate for complex tasks.
The decision was made to scale up to 70B models, with initial training using LoRA yielding only mediocre results.
Full fine-tuning was chosen to maximize model performance, but resource limitations hindered multiple iterations.
Sahil took over the training process because he had extra compute resources, while Shumer focused on creative direction.
Shumer did not have direct access to the training process or the model weights, and relied on Sahil's reports and on testing through the API.
The launch was planned for September 5th, influenced by competitive pressures and a desire to get ahead of other models.
Shumer acknowledged that he did not thoroughly validate Sahil's claims and benchmarks before the launch.
The launch generated unexpected viral interest, leading to overwhelming traffic and subsequent issues with the API.
Problems arose post-launch, including tokenizer issues and discrepancies in model outputs compared to expectations.

Sahil retrained the model without special tokens, and the retrained model performed worse than the original API results.
Reports emerged about the API's inconsistent behavior when prompted with the word 'Claude.'
An investigation revealed discrepancies in the system prompt used for benchmarking, affecting the model's performance.
Sahil achieved varying HumanEval scores, but others could not replicate his results, indicating inconsistencies in the model's performance.
The model often added unnecessary output labels, even when the prompt was not a multiple-choice question.
Shumer decided to train the model from scratch using Sahil's dataset and scripts to verify the performance claims.
The newly trained model's scores were significantly lower than those initially reported, prompting Shumer to share his findings publicly.
Shumer acknowledges the need for greater vigilance and skepticism when evaluating performance claims, and recognizes the influence his public statements carry.
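
The verification step described above, re-running the benchmarks independently and comparing the results with the reported scores, can be sketched roughly as follows. This is only a minimal illustration, not the procedure Shumer actually used: the function name `compare_scores`, the tolerance of one point, the benchmark names, and all numbers are hypothetical placeholders and are not real Reflection 70B results.

```python
def compare_scores(reported, reproduced, tolerance=1.0):
    """Label each reported benchmark score by whether an independent run matched it.

    reported and reproduced map benchmark names to scores (in points).
    tolerance is an assumed allowance for run-to-run noise, not a standard value.
    """
    results = {}
    for name, claimed in reported.items():
        if name not in reproduced:
            # No independent run exists, so the claim is still unverified.
            results[name] = "not re-run"
        elif abs(claimed - reproduced[name]) <= tolerance:
            results[name] = "replicated"
        else:
            results[name] = "not replicated"
    return results


if __name__ == "__main__":
    # Placeholder values for illustration only; these are not real benchmark results.
    placeholder_reported = {"benchmark_a": 90.0, "benchmark_b": 80.0, "benchmark_c": 70.0}
    placeholder_reproduced = {"benchmark_a": 89.5, "benchmark_b": 72.0}
    for name, status in compare_scores(placeholder_reported, placeholder_reproduced).items():
        print(f"{name}: {status}")
```

Treating a benchmark with no independent run as "not re-run" rather than as a pass reflects the lesson in the summary: a claim that has not been checked is not yet a verified claim.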

The unexpected traction of the model announcement highlighted the responsibility of being a public figure.
Shumer went through an emotional low point but received support from the community.
He expresses gratitude to the team of investigators who dedicated their time to help.
He acknowledges his co-founder’s support during difficult times.
He thanks the family and friends who supported him through the emotional challenges.
Shumer is working on a new version of Reflection, taking a thorough and independent approach to its development.
He commits to not rushing the new version and to testing and validating it thoroughly.
Shumer remains optimistic about the idea of Reflection-Tuning and encourages community support.