VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

(arxiv.org)

123 points | by timhigins 4 hours ago

11 comments

deftio 2 hours ago
There is some base level of intelligence any model needs to be useful, even in narrow tasks.
Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...
Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.
[-]
- swiftcoder 27 minutes ago
  > To drive a car requires being able to read
  Emphatically, it does not. Passing your driver's test may require being able to read, but plenty of illiterate people around the world drive just fine.
  There is a reason we made all the common road signs recognisable purely by shape/colour, after all.
- ygjb 1 hour ago
  > Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human in in their mid teens they have acquired the base knowledge...
  I would be interested to see a formal study of this. I say this not out of anything other than an observation that I think the only real blockers are a) judgement, and b) physical reflexes/strength. As a kid I was certainly aware of ice, snow, and rain, because I rode my bike year round and had low confidence in my own ability to control my bike on snowy or wet terrain, especially during season changes. That translated into learning to drive in northern Canada in the winter and applying those lessons to driving.
  In an environment devoid of consequences, I have seen kids operate driving simulations (both real simulations, and video games) with a degree of precision that is shocking, including seeing several 9-11 year olds play the simulations and games with a much higher degree of confidence than adult drivers. Children have an awareness that the simulations are consequence free, unless given other motivation. Adults that are consistent drivers have muscle memory and preconceived expectations that govern the decisions they make when playing the game. I am curious about the level of training and exposure required for children to overcome their lack of awareness of the hard limits and consequences of driving and driver error, versus the amount of training and exposure required for expert drivers that are novice gamers to stop applying their learned experience to consequence free simulations.
- jmalicki 33 minutes ago
  You can teach a dog to drive a car.
  https://www.youtube.com/watch?v=BWAK0J8Uhzk
  [-]
  - dkersten 18 minutes ago
    And AI tried telling me that Uber for Dogs (dogs are the drivers) was a terrible idea…
- universa1 1 hour ago
  A 10 year old definitely, and a 5 year old is close, but not unrealistic. To drive a car you don't need to be able to read... To drive a car on the road with other people is a whole other story :-)
  [-]
  - 3eb7988a1663 54 minutes ago
    I suspect plenty of five year olds can do a respectable job in Mario Kart, Gran Turismo, etc driving games. Gaming has too low of stakes to judge them on perfectly adhering to the rules of the road, but the ability is there.
- smokel 1 hour ago
  Being able to drive a car properly also depends on having the right exploration-exploitation balance. A three-year-old is likely to explore too much in a situation where mistakes can be dangerous.
  This requires not only knowledge, but also the control systems that develop with the prefrontal cortex. LLMs don't do much control yet.
- satvikpendem 58 minutes ago
  While I agree with your assessment, probably could've chosen a better example, as in many countries young kids even as young as 8 will learn how to drive.
- threatripper 45 minutes ago
  Ask people who grew up on a farm in a rural area. Sometimes you have to, even if you can't, and you do.
- wilg 30 minutes ago
  This is more of a question of the definition of "drive a car" than any specific issue about intelligence. Drive a car without errors? Impossible, and now we're into a subjective discussion about what feels intelligent. Pass the DMV test? Probably. How complicated are the conditions? There are plenty of drivers with bad judgement. It's a quicksand sort of discussion.
gslepak 2 hours ago
Note that these are Python-only results, the model will not do as well with other languages.
I'm glad to see more domain-focused SLMs, we need more of them! A programming focused MoE should work well across many languages.
[-]
- nsingh2 1 hour ago
  Lots of confusion about what this model is actually focused on.
  It is a cheap specialist for closed-world, verifiable reasoning tasks like math, self-contained coding problems, and similar.
  "Closed-world" means the needed information is already in the context. It is not a tool-using agent that can discover missing context. "Verifiable" means answers are hard to generate but easy to check.
  So no open ended research, repo wide agent work, factual Q&A, or SVG generation. More of a compact reasoning module for bounded problems.
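  As a rough illustration of "verifiable" (a sketch only, not taken from the paper; `verifiable_reward`, `TESTS`, and the two candidate functions are made-up names), a checker just runs a candidate against known cases, which is far cheaper than producing the candidate in the first place:

```python
def verifiable_reward(candidate, test_cases):
    """Return 1.0 if candidate(*args) equals the expected value on every case, else 0.0."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            # A crashing candidate counts as a failed answer.
            return 0.0
    return 1.0


# Hypothetical task: "return the list sorted in ascending order".
TESTS = [
    (([3, 1, 2],), [1, 2, 3]),
    (([],), []),
    (([2, 2, 1],), [1, 2, 2]),
]


def good_candidate(xs):
    return sorted(xs)


def bad_candidate(xs):
    return xs  # forgets to sort


print(verifiable_reward(good_candidate, TESTS))
print(verifiable_reward(bad_candidate, TESTS))
```

  Everything the checker needs is in the test cases themselves, which is the "closed-world" part.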
  [-]
  - nsingh2 34 minutes ago
    To follow up on this, I had it solve a nasty ODE problem that I saw in the recent Mathematica 15 release post:
```
    Solve the following first-order ODE for f(x):

    ((-1 - 2*x)*f(x)*tan(1 + x - exp(-61 - 2*x)*f(x)/x)
    + exp(61 + 2*x)*x*(1 - x*tan(1 + x - exp(-61 - 2*x)*f(x)/x))
    + x*tan(1 + x - exp(-61 - 2*x)*f(x)/x)*f'(x)) = 0

    Find the general solution f(x).
```
    And surprisingly it found a valid solution! Extra impressive because it runs at 25 tok/s on my measly RTX 2070 Super.
```
    f(x) = x*exp(61 + 2*x)*(1 + x - arccos(C/x))

    C is an arbitrary constant.
```
    Apparently Mathematica 14.3 couldn't solve this ODE.
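    For anyone who wants to check the claimed solution themselves, here is a minimal numerical sketch using only the standard library. Assumptions: the derivative is approximated with a central difference, the residual is divided by exp(61 + 2*x)*x to keep the numbers readable, and the sample points (chosen so that |C/x| <= 1 and C != 0) are arbitrary. The function names are illustrative.

```python
import math


def f(x, c):
    """Candidate solution from the comment above."""
    return x * math.exp(61 + 2 * x) * (1 + x - math.acos(c / x))


def f_prime(x, c, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h, c) - f(x - h, c)) / (2 * h)


def scaled_residual(x, c):
    """Left-hand side of the ODE, divided by exp(61 + 2*x) * x."""
    big = math.exp(61 + 2 * x)
    t = math.tan(1 + x - math.exp(-61 - 2 * x) * f(x, c) / x)
    lhs = (
        (-1 - 2 * x) * f(x, c) * t
        + big * x * (1 - x * t)
        + x * t * f_prime(x, c)
    )
    return lhs / (big * x)


for x, c in [(2.0, 1.0), (3.0, -1.5), (1.5, 0.5)]:
    print(x, c, scaled_residual(x, c))
```

    Residuals near zero at every sample point are consistent with the solution being valid on that branch; a symbolic check would be needed for a proof.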
secretslol 2 hours ago
Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why have models train on learning anything when you can just train them how to learn and let them get on with it from something as small as a Pi Zero and an internet connection.
[-]
- numlock86 1 hour ago
  This has been my dream ever since. Instead of encoding "all the knowledge" into those parameters, how about just making a model that has the same size, but all (or rather most) it does is reasoning? Just give it the ability to browse the net (e.g. language specifications, documentation and best practices) and just have it do its thing. Why does my coding agent need to know the population of New York, know a cheese cake recipe or the general lifespan of an ostrich? Just give it the bare minimum knowledge to think and reason about, and let it figure out the rest.
  Sadly that's not how LLMs work, since all they do is "token prediction". At least the models we have today ...
  [-]
  - 3eb7988a1663 1 hour ago
    It would also reduce training costs to nothing. Current methodology requires continual retraining to scoop up new facts. If you can do a one time "this is how to think" - that could conceptually work forever, just plug in a new database layer that can be queried as required.
  - tomaskafka 30 minutes ago
    Education had this sad 15 year period where it thought “competences” are all you need.
    Turns out that without the world knowledge to have a base of facts, it is not.
- Lerc 29 minutes ago
  I think you could probably train a model to consider boolean logic, modal logic, and mathematics reasonably well, but there is still a pretty big leap between that and thinking about things.
  Even the most basic questions, such as "put a ball in a cup and place it on a table upside down, then pick up the cup and put it in a box", require knowledge of things not mentioned in the question (notably gravity).
  Strict definition of all terms quickly gets you into a quagmire of complexity. Some base level of knowledge about things is required for you to give it instructions. If it only knows how to reason, it lacks any idea of what to aim to achieve.
  There is quite a pronounced disconnect between the vast stores of written data that models are trained on and robust consideration of a topic. I do wonder if the path can be directed by the order of training.
  For example, if you train a model to basic literacy using TinyStories, then math and philosophy texts, then psychology and sociology texts, and then finally the mass data of everything from conversations and rants to code and fiction, does that end up with a significantly different model to one that is trained on books on acting, creative writing, and fantasy novels before introducing the same final mass data set?
  How much does its current ability allow it to contextualise new training data?
- altmanaltman 42 minutes ago
  Yeah, but don't you think the metaphor is an oversimplification if we assume this model can do a smart human-level analysis and distillation of knowledge? I mean, if that were true (i.e. it's just like that) then yeah, there is no need for massive models, but I really doubt that.
  Even recent massive models do not work anything like a smart human does at the moment so why are we assuming this can?
NotSuspicious 1 hour ago
The interesting thing about models this small is they should be able to be put on a single Taalas chip (the HC1 already runs a Llama 3.1 8B model). We're already at the point where half-decent reasoning could be run on an ASIC (and at mind-boggling speeds).
[-]
- pants2 54 minutes ago
  Yeah, if they can fit an 8B model that's really good at improving the output by thinking, running at 16K tok/s on Taalas would be mind-blowing.
noperator 3 hours ago
Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
[-]
- hypfer 51 minutes ago
  > but I'm working around that in my harness.
  How?
- dummydummy1234 2 hours ago
  Can't you just force it to do structured output via constrained generation?
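  For readers unfamiliar with the idea, a toy sketch of constrained generation follows. Nothing here is a real inference API: `fake_model_scores` is a hypothetical stand-in for an LLM, and the vocabulary and allowed outputs are invented. The point is only that at each step the decoder masks out every token that would break the required format, then picks the best-scoring token among what remains.

```python
# The only outputs the harness will accept (invented for this example).
ALLOWED_OUTPUTS = ['{"vulnerable": true}', '{"vulnerable": false}']
VOCAB = ['{', '}', '"vulnerable"', ':', ' ', 'true', 'false', 'maybe', 'I think']


def fake_model_scores(prefix):
    """Hypothetical model: prefers chatty tokens over well-formed JSON."""
    preference = ['I think', 'maybe', 'false', 'true', '{', '"vulnerable"', ':', ' ', '}']
    return {tok: len(preference) - i for i, tok in enumerate(preference)}


def allowed_tokens(prefix):
    """Tokens that keep the text a prefix of some allowed output."""
    return [t for t in VOCAB if any(s.startswith(prefix + t) for s in ALLOWED_OUTPUTS)]


def constrained_generate():
    out = ''
    while out not in ALLOWED_OUTPUTS:
        scores = fake_model_scores(out)
        candidates = allowed_tokens(out)  # the constraint mask
        out += max(candidates, key=scores.__getitem__)
    return out


print(constrained_generate())
```

  Real engines apply the same masking to logits using a grammar or JSON schema rather than a list of literal strings.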
zkmon 24 minutes ago
Does python coding depend on political facts of the world?
It might appear not, but actually, the process of reasoning is not an isolated act. The right and wrong way of doing things is codified in social evolution that absorbed all facets of life. Why should you optimize a piece of code for performance? Why is performance needed? What is a bug? What features and UI themes would be more intuitive for humans?
There is a butterfly effect. Everything affects everything to some extent.
aero2146 3 hours ago
I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...
[-]
- fwipsy 3 hours ago
  I think this is predicted? Part of the story is how they were able to preserve core reasoning ability while cutting knowledge like "pelicans have wings."
  > these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.
  [-]
  - pylotlight 2 hours ago
    The only really essential item here is tool-calling capability, is it not? So I assume they tested for strong read/write/edit tool consistency?
    [-]
    - nsingh2 2 hours ago
      This model doesn't support tool calling; it was not part of its training. It's focused on Python (and I think C++) competitive programming and mathematics tasks, i.e. tasks with verifiable rewards. So if you have a task that fits that description, the size-to-capability ratio is good.
      These kinds of models might be more useful as tools to be used by larger orchestrator models, than being the orchestrators themselves.
    - btown 2 hours ago
      I'm not seeing any mention of tools in the paper, much less a bias towards "curiosity" to use those tools when it encounters gaps in its knowledge. So perhaps this is a good proof-of-concept that single-pass code generation is viable with this small a model - but we're still a long way from a viable solution.
- kristopolous 33 minutes ago
  Try it again, but give a careful explanation of what a bicycle and a pelican are and how the pelican would sit atop the bicycle. Then give it a reference to the SVG tags you want it to use, with documentation.
- physPop 3 hours ago
  It's for reasoning, not generating art?
  [-]
  - websap 3 hours ago
    Can you explain this a bit more
    [-]
    - tyre 3 hours ago
      Imagine you want to make a smaller model that is really good at one thing, say, driving a car. You could remove the parameters that lead it to correctly answer, "What is the powerhouse of the cell?" or, "Who was the first president of the United States?"
      It would look really dumb if someone asked it that, but that's fine. You're trying to make a model that is optimized for efficiency for a specific task. As much as possible, you should prune uncorrelated things.
    - pylotlight 2 hours ago
      SVG generation is a useless test, what's there more to know?
      [-]
      - steve_adams_86 2 hours ago
        What if you're reasoning about how to generate SVG correctly?
        [-]
          - Mtinie 2 hours ago
            In this case, I’d expect it should make a web search tool call to find the Python library best suited for SVG generation and manipulation, and then use what it learns there to execute the task you’ve asked it to do (either asking if you’d like to incorporate the library as a dependency or to roll its own implementation of a subset of the features if that was your preference).
            Assuming tool calling hasn’t been entirely stripped out of this model.
            (Edit) No tool calling, per this comment: https://news.ycombinator.com/item?id=48640189
- realitysballs 3 hours ago
  That’s all I needed to hear
  [-]
  - pylotlight 2 hours ago
    As in, you learnt that a useless test that no one should be using was tested here, that's what you meant right?
SwellJoe 2 hours ago
It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).
https://swelljoe.com/post/will-it-mythos/
[-]
- nsingh2 2 hours ago
  The lack of tool use will hinder it a lot, I think, since bug hunting requires collecting context across a code base and stitching it together. It might be good in a narrower sense, i.e. "is there a bug in this block of code", without considering how it interacts with the rest of the code base.
  That's also more aligned with its leetcode-style training data, where the code under test is fully in the context window. It might be interesting to have a bigger tool-use model go through the effort of collecting the context and feeding it into this kind of model for analysis only. It becomes more of a thinking tool, instead of the orchestrator.
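  A minimal sketch of that split, with every name hypothetical (`collect_context`, `build_prompt`, `analyze`, and the fake reasoner are placeholders, not any real model or agent API): the orchestrator decides which files matter, and the small model only ever sees a self-contained prompt.

```python
from typing import Callable, Dict, List


def collect_context(repo: Dict[str, str], wanted: List[str]) -> str:
    """Stand-in for the tool-using orchestrator: gather the files it judged relevant."""
    return "\n\n".join(f"# file: {path}\n{repo[path]}" for path in wanted if path in repo)


def build_prompt(context: str, question: str) -> str:
    """Pack everything the reasoner needs into one closed-world prompt."""
    return f"{context}\n\nQuestion: {question}\nAnswer using only the code above."


def analyze(repo: Dict[str, str], wanted: List[str], question: str,
            reasoner: Callable[[str], str]) -> str:
    """Orchestrator gathers context; the small model only does the analysis."""
    return reasoner(build_prompt(collect_context(repo, wanted), question))


# Toy usage with a fake reasoner in place of the small model.
repo = {
    "auth.py": "def check(pw): return pw == 'admin'",
    "util.py": "def add(a, b): return a + b",
}


def fake_reasoner(prompt: str) -> str:
    return "hardcoded credential" if "'admin'" in prompt else "no finding"


print(analyze(repo, ["auth.py"], "Is there a bug in this block?", fake_reasoner))
```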