madars 7 hours ago

One wonders at which point models will be sneaky enough to bypass simple eval sandboxes. The article has:

    # Evaluate the equation with restricted globals and locals
    result = eval(equation, {"__builtins__": None}, {})
but that's not enough as you can rebuild access to builtins from objects and then go from there: https://ideone.com/qzNtyu
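
For a sense of the general shape (a sketch of the well-known trick, not the exact payload behind that link):

    # Even with __builtins__ stripped out, a literal still gives you an object,
    # and from object you can walk back to (almost) everything else:
    payload = "().__class__.__base__.__subclasses__()"
    classes = eval(payload, {"__builtins__": None}, {})
    # Typically at least one of these classes has an __init__ whose __globals__
    # still holds the real __builtins__, which is enough to get __import__ back.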

By the way, writing this greatly benefited from DeepThink-r1 while o1 just gave me a lobotomized refusal (CoT: "The user's request involves injecting code to bypass a restricted Python environment, suggesting a potential interest in illegal activities. This is a serious matter and aligns closely with ethical guidelines."). So I just cancelled my ChatGPT subscription - why did we ever put up with this? "This distillation thingie sounds pretty neat!"

  • perching_aix 5 hours ago

    > why did we ever put up with this?

    Is this a serious question?

  • senko 6 hours ago

    > that's not enough as you can rebuild access to builtins from objects

    In this specific case, it's safe, as that wouldn't pass the regex just a few lines before the eval:

        # Define a regex pattern that only allows numbers,
        # operators, parentheses, and whitespace
        allowed_pattern = r'^[\d+\-*/().\s]+$'
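
    A quick illustration (my own sketch, reusing the two snippets above): the character class has no letters or underscores, so the usual object-walking payloads never reach the eval at all:

        import re

        allowed_pattern = r'^[\d+\-*/().\s]+$'

        def guarded_eval(equation):
            # Reject anything that isn't digits, + - * / . ( ) or whitespace.
            if not re.match(allowed_pattern, equation):
                return None
            return eval(equation, {"__builtins__": None}, {})

        print(guarded_eval("(1 + 2) * 3"))                             # 9
        print(guarded_eval("().__class__.__base__.__subclasses__()"))  # None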
    
    Commenting on the R1 reproduction: the heavy lifting there is done by huggingface's trl [0] library, plus a heavy dose of compute.

    [0] Transformer Reinforcement Learning - https://huggingface.co/docs/trl/en/index
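
    For anyone skimming, the wiring trl expects is roughly this (a sketch from the GRPOTrainer docs as I remember them, not the blog's actual script - reward functions get the sampled completions plus any dataset columns and return one score per completion):

        from datasets import Dataset
        from trl import GRPOConfig, GRPOTrainer

        def length_reward(completions, **kwargs):
            # Toy reward just to show the interface: score by completion length.
            return [min(len(c) / 100, 1.0) for c in completions]

        dataset = Dataset.from_dict({"prompt": ["Using 3, 5, 7 reach 16:"]})

        trainer = GRPOTrainer(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            reward_funcs=[length_reward],
            args=GRPOConfig(output_dir="grpo-demo"),
            train_dataset=dataset,
        )
        trainer.train()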

singularity2001 6 hours ago

"Conclusion

The release of DeepSeek R1 and its research paper might be a breakpoint for open-science and open-source development. Just a week after the DeepSeek release, we've been able to reproduce a simple version of R1's learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning and converges into a very specific "reasoning" format, it shows that the method is working.

In our mini R1 experiment we used GRPO with two rule-based rewards, but it already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model."
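
For context, the "two rule-based rewards" are about as simple as that sounds. My own paraphrase of the idea (assuming the <think>/<answer> output template the R1 reproductions use, not the blog's exact code):

    import re

    def format_reward(completion):
        # 1.0 if the output follows the <think>...</think><answer>...</answer> template.
        pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
        return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

    def correctness_reward(completion, numbers, target):
        # 1.0 if the equation inside <answer> uses exactly the given numbers
        # and evaluates to the target (the blog guards this eval with the
        # regex discussed upthread).
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if not match:
            return 0.0
        equation = match.group(1).strip()
        if sorted(int(n) for n in re.findall(r"\d+", equation)) != sorted(numbers):
            return 0.0
        try:
            return 1.0 if abs(eval(equation) - target) < 1e-6 else 0.0
        except Exception:
            return 0.0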

mxwsn 7 hours ago

What's surprising about this is how sparsely defined the rewards are. Even if the model learns the formatting reward, if it never chances upon a solution, there isn't any feedback/reward to push it to learn to solve the game more often.

So what are the chances of randomly guessing a solution?

The toy Countdown dataset here has 3 to 4 numbers, which are combined with the 4 operators (+, -, x, ÷). With 3 numbers there are 3! * 4^2 = 96 order/operator combinations (roughly double that once you count parenthesizations), and with 4 numbers 4! * 4^3 = 1536. By the tensorboard log [0], even after just 10 learning steps the model already has a success rate just below 10%. If we make the simplifying assumption that the model hasn't learned anything in those 10 steps, a random guesser succeeding at roughly 1/96 per attempt would be expected to land less than one hit across the 80 samples seen so far (8 generations are used per step), and the chance of a success rate anywhere near 10% by luck is essentially nil. One interpretation is to take this as a p-value and reject that the model's base success rate is random guessing - the base model already solves the 3-number Countdown game at above-chance rates.
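
As a back-of-the-envelope check (the 80 samples and the ~1/96 guess rate are the rough assumptions above, not anything logged in the run):

    from math import comb

    n, p = 80, 1 / 96   # samples in the first 10 steps, per-guess odds on 3-number problems
    at_least_one = 1 - (1 - p) ** n
    at_least_eight = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8, n + 1))
    print(f"{at_least_one:.2f}")     # ~0.57: one lucky hit would not be surprising
    print(f"{at_least_eight:.1e}")   # ~3e-06: a ~10% success rate by pure luck would be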

This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve Countdown decently well without any training. Though maybe not a 3B model?

The model likely "parlays" its successes on 3 numbers to start to learn to solve 4 numbers. Or has it? The final learned ~50% success rate matches the frequency of 4-number problems in Jiayi Pan's CountDown dataset [1]. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.

[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tenso... [1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3t...

  • senko 6 hours ago

    > What's surprising about this is how sparsely defined the rewards are

    Yeah, I would expect the rewards not to be binary. One could easily devise a scoring function in range [0-1] that would depend on how far the model is from the "correct" answer (for example, normalized Levenshtein distance). Whether that would actually do any good is anyone's guess.
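
    Something like this, say (my own illustration, not code from the post - it assumes you score against one reference solution, which is itself debatable since many equations can hit the target):

        def levenshtein(a, b):
            # Classic dynamic-programming edit distance.
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,                 # deletion
                                    curr[j - 1] + 1,             # insertion
                                    prev[j - 1] + (ca != cb)))   # substitution
                prev = curr
            return prev[-1]

        def graded_reward(completion, reference):
            # 1.0 for an exact match, decaying toward 0.0 as edits pile up.
            return 1.0 - levenshtein(completion, reference) / max(len(completion), len(reference), 1)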

thorum 7 hours ago

I was getting pretty hyped about the potential for GRPO in my own projects until you said 20 minutes for a single training step with batch size 1! Is that likely to improve?

  • NitpickLawyer 7 hours ago

    That has already improved a lot. Initially they were generating new samples w/ transformers, and were talking in github issues about using vLLM to batch generate samples. Further down in the blog post it seems they've already done that.
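
    If I'm reading the current trl docs right (treat this as a sketch - the API has been moving fast and names may have shifted between versions), switching generation over to vLLM is just a config flag:

        from trl import GRPOConfig

        training_args = GRPOConfig(
            output_dir="mini-r1",
            per_device_train_batch_size=1,
            num_generations=8,   # completions sampled per prompt for the group-relative baseline
            use_vllm=True,       # batch-generate with vLLM instead of model.generate
        )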

  • deneas 7 hours ago

    I'd imagine using optimized/faster reward functions could already make a difference.

yurlungur 7 hours ago
  • sitkack 3 hours ago

    They do mention it here

    > Note: This blog is inspired by Jiayi Pan [1] who initially explored the idea and proofed it with a small model.

    I might have written it as

    > Note: This blog is inspired by Jiayi Pan [1] who also reproduced the "Aha Moment" with their TinyZero [2] model.

    [1] https://x.com/jiayi_pirate/status/1882839370505621655 (1.1M views btw)

    [2] https://github.com/Jiayi-Pan/TinyZero

    A lot of people are busy reproing R1 right now. I think this is the spark.

rmrf100 8 hours ago

this is really cool!