madars 7 hours ago

One wonders at which point models will be sneaky enough to bypass simple eval sandboxes. The article has:

    # Evaluate the equation with restricted globals and locals
    result = eval(equation, {"__builtins__": None}, {})
but that's not enough as you can rebuild access to builtins from objects and then go from there: https://ideone.com/qzNtyu
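
For a sense of the general shape (a sketch of the well-known trick, not the exact payload behind that link):

    # Even with __builtins__ stripped out, a literal still gives you an object,
    # and from object you can walk back to (almost) everything else:
    payload = "().__class__.__base__.__subclasses__()"
    classes = eval(payload, {"__builtins__": None}, {})
    # Typically at least one of these classes has an __init__ whose __globals__
    # still holds the real __builtins__, which is enough to get __import__ back.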

By the way, writing this greatly benefited from DeepThink-r1 while o1 just gave me a lobotomized refusal (CoT: "The user's request involves injecting code to bypass a restricted Python environment, suggesting a potential interest in illegal activities. This is a serious matter and aligns closely with ethical guidelines."). So I just cancelled my ChatGPT subscription - why did we ever put up with this? "This distillation thingie sounds pretty neat!"

  • perching_aix 5 hours ago

    > why did we ever put up with this?

    Is this a serious question?

  • senko 6 hours ago

    > that's not enough as you can rebuild access to builtins from objects

    In this specific case, it's safe, as that wouldn't pass the regex just a few lines before the eval:

        # Define a regex pattern that only allows numbers,
        # operators, parentheses, and whitespace
        allowed_pattern = r'^[\d+\-*/().\s]+$'
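
    A quick illustration (my own sketch, reusing the two snippets above): the character class has no letters or underscores, so the usual object-walking payloads never reach the eval at all:

        import re

        allowed_pattern = r'^[\d+\-*/().\s]+$'

        def guarded_eval(equation):
            # Reject anything that isn't digits, + - * / . ( ) or whitespace.
            if not re.match(allowed_pattern, equation):
                return None
            return eval(equation, {"__builtins__": None}, {})

        print(guarded_eval("(1 + 2) * 3"))                             # 9
        print(guarded_eval("().__class__.__base__.__subclasses__()"))  # None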
    
    Commenting on the R1 reproduction: the heavy lifting there is done by huggingface's trl [0] library, plus a heavy dose of compute.

    [0] Transformer Reinforcement Learning - https://huggingface.co/docs/trl/en/index
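
    For anyone skimming, the wiring trl expects is roughly this (a sketch from the GRPOTrainer docs as I remember them, not the blog's actual script - reward functions get the sampled completions plus any dataset columns and return one score per completion):

        from datasets import Dataset
        from trl import GRPOConfig, GRPOTrainer

        def length_reward(completions, **kwargs):
            # Toy reward just to show the interface: score by completion length.
            return [min(len(c) / 100, 1.0) for c in completions]

        dataset = Dataset.from_dict({"prompt": ["Using 3, 5, 7 reach 16:"]})

        trainer = GRPOTrainer(
            model="Qwen/Qwen2.5-0.5B-Instruct",
            reward_funcs=[length_reward],
            args=GRPOConfig(output_dir="grpo-demo"),
            train_dataset=dataset,
        )
        trainer.train()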

singularity2001 6 hours ago

"Conclusion

The release of DeepSeek R1 and its research paper might be a breakpoint for open-science and open-source development. Just a week after the DeepSeek release, we've been able to reproduce a simple version of R1's learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning and converges into a very specific "reasoning" format, it shows that the method is working.

In our mini R1 experiment we used GRPO with two rule-based rewards, but it already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model."
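
For context, the "two rule-based rewards" are about as simple as that sounds. My own paraphrase of the idea (assuming the <think>/<answer> output template the R1 reproductions use, not the blog's exact code):

    import re

    def format_reward(completion):
        # 1.0 if the output follows the <think>...</think><answer>...</answer> template.
        pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
        return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

    def correctness_reward(completion, numbers, target):
        # 1.0 if the equation inside <answer> uses exactly the given numbers
        # and evaluates to the target (the blog guards this eval with the
        # regex discussed upthread).
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if not match:
            return 0.0
        equation = match.group(1).strip()
        if sorted(int(n) for n in re.findall(r"\d+", equation)) != sorted(numbers):
            return 0.0
        try:
            return 1.0 if abs(eval(equation) - target) < 1e-6 else 0.0
        except Exception:
            return 0.0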

mxwsn 7 hours ago

What's surprising about this is how sparsely defined the rewards are. Even if the model learns the formatting reward, if it never chances upon a solution, there isn't any feedback/reward to push it to learn to solve the game more often.

So what are the chances of randomly guessing a solution?

The toy Countdown dataset here has 3 to 4 numbers, which are combined with the 4 operators (+, -, x, ÷). With 3 numbers there are 3! * 4^2 = 96 order/operator combinations (roughly double that once you count parenthesizations), and with 4 numbers 4! * 4^3 = 1536. By the tensorboard log [0], even after just 10 learning steps the model already has a success rate just below 10%. If we make the simplifying assumption that the model hasn't learned anything in those 10 steps, a random guesser succeeding at roughly 1/96 per attempt would be expected to land less than one hit across the 80 samples seen so far (8 generations are used per step), and the chance of a success rate anywhere near 10% by luck is essentially nil. One interpretation is to take this as a p-value and reject that the model's base success rate is random guessing - the base model already solves the 3-number Countdown game at above-chance rates.
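
As a back-of-the-envelope check (the 80 samples and the ~1/96 guess rate are the rough assumptions above, not anything logged in the run):

    from math import comb

    n, p = 80, 1 / 96   # samples in the first 10 steps, per-guess odds on 3-number problems
    at_least_one = 1 - (1 - p) ** n
    at_least_eight = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8, n + 1))
    print(f"{at_least_one:.2f}")     # ~0.57: one lucky hit would not be surprising
    print(f"{at_least_eight:.1e}")   # ~3e-06: a ~10% success rate by pure luck would be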

This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve Countdown decently well without any training. Though maybe not a 3B model?

The model likely "parlays" its successes on 3 numbers to start to learn to solve 4 numbers. Or has it? The final learned ~50% success rate matches the frequency of 4-number problems in Jiayi Pan's CountDown dataset [1]. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.

[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tenso... [1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3t...

  • senko 6 hours ago

    > What's surprising about this is how sparsely defined the rewards are

    Yeah, I would expect the rewards not to be binary. One could easily devise a scoring function in range [0-1] that would depend on how far the model is from the "correct" answer (for example, normalized Levenshtein distance). Whether that would actually do any good is anyone's guess.
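
    Something like this, say (my own illustration, not code from the post - it assumes you score against one reference solution, which is itself debatable since many equations can hit the target):

        def levenshtein(a, b):
            # Classic dynamic-programming edit distance.
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,                 # deletion
                                    curr[j - 1] + 1,             # insertion
                                    prev[j - 1] + (ca != cb)))   # substitution
                prev = curr
            return prev[-1]

        def graded_reward(completion, reference):
            # 1.0 for an exact match, decaying toward 0.0 as edits pile up.
            return 1.0 - levenshtein(completion, reference) / max(len(completion), len(reference), 1)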

thorum 7 hours ago

I was getting pretty hyped about the potential for GRPO in my own projects until you said 20 minutes for a single training step with batch size 1! Is that likely to improve?

  • NitpickLawyer 7 hours ago

    That has already improved a lot. Initially they were generating new samples w/ transformers, and were talking in github issues about using vLLM to batch generate samples. Further down in the blog post it seems they've already done that.
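
    If I'm reading the current trl docs right (treat this as a sketch - the API has been moving fast and names may have shifted between versions), switching generation over to vLLM is just a config flag:

        from trl import GRPOConfig

        training_args = GRPOConfig(
            output_dir="mini-r1",
            per_device_train_batch_size=1,
            num_generations=8,   # completions sampled per prompt for the group-relative baseline
            use_vllm=True,       # batch-generate with vLLM instead of model.generate
        )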

  • deneas 7 hours ago

    I'd imagine using optimized/faster reward functions could already make a difference.

yurlungur 7 hours ago
  • sitkack 3 hours ago

    They do mention it here

    > Note: This blog is inspired by Jiayi Pan [1] who initially explored the idea and proofed it with a small model.

    I might have written it as

    > Note: This blog is inspired by Jiayi Pan [1] who also reproduced the "Aha Moment" with their TinyZero [2] model.

    [1] https://x.com/jiayi_pirate/status/1882839370505621655 (1.1M views btw)

    [2] https://github.com/Jiayi-Pan/TinyZero

    A lot of people are busy reproing R1 right now. I think this is the spark.

rmrf100 8 hours ago

this is really cool!