- Aristotle uses modern AI techniques heavily, including language modeling.
- Aristotle can be guided by an informal (English) proof. If the proof is correct, Aristotle has a good chance at translating it into Lean (which is a strong vote of confidence that your English proof is solid). I believe that's what happened here.
- Once a proof is formalized into Lean (assuming you have formalized the statement correctly), there is no doubt that the proof is correct. This is the core of our approach: you can do a lot of (AI-driven) search, and once you find the answer you are certain it's correct no matter how complex the solution is.
How do you verify that the AI translation to Lean is a correct formalization of the problem? In other fields, generative AI is very good at making up plausible sounding lies, so I'm wondering how likely that is for this usage.
  ∀ᶠ ε : ℝ in 𝓝[>] 0, ∀ C > (0 : ℝ), ∀ C' > C,
    ∃ a b n : ℕ,
      0 < n ∧
      ε * n < a ∧
      ε * n < b ∧
      a ! * b ! ∣ n ! * (a + b - n)! ∧
      a + b > n + C * log n ∧
      a + b < n + C' * log n
Yes on the one hand, one needs to know enough about Lean to be sure that this formulation matches what we intend, and isn't stating something trivial. But on the other hand, this is not as hard as finding an error on some obscure line of a long proof.
That's what's covered by the "assuming you have formalized the statement correctly" parenthetical.
Given a formal statement of what you want, Lean can validate that the steps in a (tedious) machine-readable purported proof are valid and imply the result from accepted axioms. This is not AI, but a tiny, well reviewed kernel that only accepts correct formal logic arguments.
So, if you have a formal statement that you've verified to represent what you are interested in by some other means, Lean can tell you whether the proof created by genAI is correct. Basically, there is a nigh infallible checker that won't accept incorrect hallucinations.
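To make that concrete, here is a minimal Lean 4 sketch (illustrative names, nothing project-specific): the kernel accepts a proof term exactly when it really establishes the stated proposition, and rejects anything else.

  -- accepted: `Nat.add_comm a b` genuinely proves the stated proposition
  theorem my_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b

  -- rejected: `rfl` does not prove this, so the kernel refuses the "proof"
  -- example : 2 + 2 = 5 := rfl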
Yeah, people dramatically overestimate the difficulty of getting one's definitions correct for most problems, especially when you are doing an end-to-end proof rather than just axiomatizing some system. Definitions are still worth looking at carefully, especially for AI-generated proofs, where you don't get the immediate feedback you do as a human when something you expect to be hard goes through easily. But contrary to what seems to be popular belief here, they are generally much easier to verify than the corresponding proof. (For formally verified software, the analogy is verifying that the spec is what you want vs. verifying that the program matches the spec; the former is generally much easier.)
Everyone has a different perspective, based on their math background. From the OP's perspective, the formalization of this problem statement was apparently worth talking about. On the other hand, for you it's just a homework problem that belongs in an intro class.
Let's just be generous and try to accept these differences.
Are you an expert? Not gatekeeping here but I have no intuition for what is easy or hard to formalise. A lot of very simply stated graph theoretical results are apparently extremely hard to formalise.
I can't speak for ndriscoll, but I am a university math professor with extensive experience teaching these sorts of topics, and I agree with their comment in full.
You are right that some (other) statements are harder to formalize than they look. The Four Color Theorem from graph theory is an example. Generally speaking, discrete math, inequalities, for all/there exists, etc. are all easy to formalize. Anything involving geometry or topology is liable to be harder. For example, the Jordan curve theorem states that "any plane simple closed curve divides the plane into two regions, the interior and the exterior". As anyone who has slogged through an intro topology book knows, statements like this take more work to make precise (and still more to prove).
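For instance, "there are arbitrarily large primes" is a one-line statement in Lean with Mathlib, and you can check by eye that it says what you mean (a sketch; Nat.exists_infinite_primes is the existing Mathlib lemma):

  import Mathlib

  -- "for every n there is a prime at least as large as n"
  theorem infinitely_many_primes : ∀ n : ℕ, ∃ p, n ≤ p ∧ p.Prime :=
    Nat.exists_infinite_primes

The Jordan curve theorem, by contrast, needs formalized notions of curves, continuity, and connectedness of the plane before the statement can even be written down.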
When someone takes the time to explain undergrad-level concepts in a comment, responding with "are you an expert?" is a level of skepticism that's bordering on hostile. The person you're responding to is correct, it's rare that the theorem statement itself is particularly hard to formalize. Whatever you read likely refers to the difficulty of formalizing a proof.
To be fair, the comment did not explain any concept that I can see, or why this statement is simple. It gave the statement and said it was simple to formalise. It does seem simple enough to me (basic arithmetic statement with a few variables and a bunch of quantifiers) but if somebody has no expertise/intuition, I think it is a fair question, without any hostile intent assumed.
> it's rare that the theorem statement itself is particularly hard to formalize
That's very dependent on the problem area. For example, there's a gap between the high-school explanation of the central limit theorem and its actual formalization. And when dealing with Turing machines, you'll sometimes say that something grows, e.g., Omega(n), when what actually happens is that there's some subsequence of inputs for which it does. Generally, for complexity theory, plain-language explanations can be very vague, because of how insensitive the theory is to small changes, and you need to operate at a higher level of abstraction to have a chance of explaining a proof in reasonable time.
Yes, if the theorem statement itself is "hard to formalize" even given our current tools, formal foundations etc. for this task, this suggests that the underlying math itself is still half-baked in some sense, and could be improved to better capture the concepts we're interested in. Much of analysis-heavy math is in that boat at present, compared to algebra.
Lol it’s weird seeing high school redditors saying gatekeeping and are you an expert in the same thread as university professors, all talking about the same topic. But I guess that’s HN for you.
As the sibling comment says, in my experience, the difficulty of formalizing the problem varies greatly between different areas. Probability theory is another notorious field in which modelling a problem correctly can be very subtle and difficult.
On the other hand, many problems in number theory and discrete structures tend to be rather simple to formalize. If you want to take your own look at that, I'd recommend to check out the lean games[1]. I'd say after the natural numbers game, you most likely know enough lean to write that problem down in lean with the "sufficiently small" being the trickiest part.
Voters: please reconsider your ups and downs. I think the “Are you an expert” question triggered a lot of downvotes when it was in fact asked in good faith to judge the person’s perspective of easy and hard.
Isn't that kind of a general problem with proofs, even when they're written by humans? There's nothing stopping someone from accidentally writing their own Lean proof that has slightly different semantics than an English version of the same proof, or even for their English proof to subtly miss something important or make an incorrect logical leap. This seems like a bit of a double standard, although maybe there's nuance here I'm missing.
One nuance you are missing is that the discussion is about formalizing the statement (the theorem), not the proof. The latter is what the article is about, but that doesn't suffice if you can't trust that the statement is also correctly formalized.
These are the two main problems:
1. Formalizing a theorem.
2. Finding a formal proof.
Part 2 is where AI could help as proof search is full of heuristics. That's also how humans find proofs and is one of the main skills of a mathematician. The formal proof can then be machine checked with well known and mature techniques not involving AI.
Part 1 is the part that's missing and will always be hard. It's also the issue with formal verification of programs for which correctness criteria are often very complex and it's easy to mess up the formalization, so that even if you trust the proof, you can't trust that it proves the right thing.
They can read the statement, and the definitions that the statement references. If everything it references is in a well-tread part of the Lean library, you can have pretty high confidence in a few minutes of going over the syntax.
To borrow the systems-engineering distinction between verification and validation, this question is one of validation. Verification is performed by Lean's spec syntax and logic enforcement. But validation is the question of whether the Lean spec encodes a true representation of the problem statement (was the right thing specced). Validation at the highest levels is probably an irreplaceable human activity.
Also, on the verification side, there is a window of failure in which Lean itself has a hidden bug. And with automated systems that seek correctness, the risk is slightly elevated that some missed crack of a bug gets exploited in the dev-check-dev loop run by the AI.
I can read and understand e.g. Python, but I have seen subtle bugs that were hard to spot in code generated by AI. At least the last time I tried coding agents (mid 2025), it was often easier to write the code myself than to play "spot the bug" with whatever was generated. I don't know anything about Lean, so I was wondering if there were similar pitfalls here.
In this case the human written statement of the theorem is small. Can there be bugs? Absolutely! But it's essentially a few lines of code worth of thinking.
The lean proof checker then checks to make sure the proof actually proves the statement.
In this case an AI is generating the proof, but if it "compiles" it's correct. The only thing humans need to check is the statement to be proven.
(I don't know anything about this project, but I've played around with Lean and used other proof checkers more seriously.)
This is also an issue with non-AI and non-Lean proofs. Andrew Wiles' initial proof of Fermat's Last Theorem had an error in it. That was spotted by peer review, fixed, and an updated proof was submitted.
I don't disagree with you, but on the other hand, I feel the same way about code written by other humans. That's not because they're necessarily worse at writing code than me, but because for code I've written myself, I've already spent time thinking about it, so I don't have to start from scratch when re-reading it. It's also not that I write fewer bugs than my coworkers; it's just easier for me to catch my own up front as I'm coding than it is to catch theirs in code review.

The two main differences are that I can have a more meaningful conversation with my coworkers about their approach, what bugs might be worth looking out for, etc. compared to an LLM (which in my experience will claim completely incorrect things about the code it wrote far more often than any engineer I've worked with, even junior ones; the humans I've worked with have uniformly been able to report how confident they are that what they've produced matches what they were tasked with, without insane exaggerations), and that an LLM can produce a much higher volume of plausible-enough-looking code in a given unit of time than most humans I've worked with.

It's not obvious to me that these would be particularly severe problems in generating proofs, though. Unless the proof is so large that it would be infeasible to read through in a reasonable amount of time, I would expect mathematicians to be able to make up for the lower-quality conversations with the "author" by devoting more time to reading and thinking, or by having someone else also read through the proof and talk it through with them. If anything, it's possible that the timeline for writing up a paper about the results might be better for some mathematicians than the equivalent amount of time most engineers have to spend reviewing code before the pressure to get it merged and move on to the next thing. (I'm aware that there is certainly pressure to publish in academia, but I don't have firsthand experience, so I don't want to assume what the norms are; still, given the relatively wide range of time pressure engineers experience across the industry, I'd expect that at least some mathematicians have the flexibility to spend extra time reading through an LLM-written proof, especially if it's time they'd otherwise have spent trying to come up with the proof themselves.)
As I understand it Lean is not a general purpose programming language, it is a DSL focused on formal logic verification. Bugs in a DSL are generally easier to identify and fix.
It seems one side of this argument desperately needs AI to have failed, and the other side is just saying that it probably worked but it is not as important as presented, that it is actually just a very cool working methodology going forward.
A little bit of nitpicking, but according to books like AIMA, that is indeed AI. In the first chapter, even any control system is classified as AI.
Because of the reasons stated in the 1st chapter, I totally agree with the authors.
The whole system is AI. That part is a verifier in a chain of “suggestions/instinct -> verifier” like those used in neurosymbolic systems for automated driving, for example.
Soo, it can definitively tell you that 42 is the correct Answer to the Ultimate Question of Life, The Universe, and Everything. It just can't tell you if you're asking the right question.
1. you write a proof in English that there is an infinite number of primes.
2. the llm writes 2+2=4 in lean.
3. lean confirms that this is correct and it's impossible that this proof is wrong.
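In Lean terms, the failure is entirely in step 2, and step 3 cannot catch it (a toy sketch):

  -- what you meant (never actually written down):
  --   theorem infinitude_of_primes : ∀ n : ℕ, ∃ p, n ≤ p ∧ p.Prime := ...

  -- what got formalized and "verified" instead:
  theorem not_what_was_asked : 2 + 2 = 4 := rfl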
Calling Lean "AI" is quite a stretch. Though I'm also in the camp that dislikes the inflationary use of "AI" for LLMs, so I have sympathies for your viewpoint.
So basically anything we don’t know how to write an algorithm for? I see where you’re coming from - but at the same time it’s actually an AI meme and smells of permanently moving goalposts.
For this reason, when we announce results on e.g. the IMO, we formalize the statements by hand and inspect the proofs carefully to ensure they capture the full spirit of the problem.
However, there are some good heuristics. If you expect a problem to be hard and the proof is very short, you've probably missed something!
That doesn’t look like a counterexample to “we formalize the statements by hand and inspect the proofs carefully to ensure they capture the full spirit of the problem”.
To answer the question a different way, I think you are asking how we know the proof actually matches the description the human provided? And I'd say we can't know for sure, but the idea is that you can pretty concisely write and check yourself that the problem is accurate, i.e. "There are an infinite number of primes" or whatever, and then even if an LLM goes off and makes up a lean proof wildly different from your description, if lean says the proof is valid then you have proven the original statement. I guess in theory the actual proof could be way different than what you thought it would be, but ultimately all the logic will still check out.
I feel like even outside of AI translation, a formalization not capturing the spirit of the informal description is always a risk.
This is also a big risk when trying to prove code correctness: "prove this algo works" means you gotta define "works" along certain axes, and if you're very unlucky you might have a proof that exploits the uncertainty around a certain axis.
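A classic toy version of that spec gap, sketched in Lean with Mathlib (names made up for illustration): the spec asks only for sortedness and forgets to require that the output be a permutation of the input, so a useless implementation provably satisfies it.

  import Mathlib

  -- under-specified: sortedness only, no permutation requirement
  def WeakSortSpec (sort : List ℕ → List ℕ) : Prop :=
    ∀ l : List ℕ, List.Sorted (· ≤ ·) (sort l)

  -- exploits the gap: always returns the empty list
  def badSort (_ : List ℕ) : List ℕ := []

  theorem badSort_meets_spec : WeakSortSpec badSort :=
    fun _ => List.sorted_nil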
The statement is something you provide. It's the search you can have the LLM do. If this works for math it will immediately make code way higher quality via the same tools.
You're looking for the practical answer, but philosophically it isn't possible to translate an informal statement into a formal one 'correctly'. It is informal, i.e., vaguely specified. The only certain questions are whether the formal axioms and results are interesting, which is independent of the informal statement, and that can only be established by inspecting the proof independently of the informal spec.
Philosophically, this is not true in general, but that's for trivial reasons: "how many integers greater than 7 are blue?" doesn't correspond to a formal question. It is absolutely true in many specific cases. Most problems posed by a mathematician will correspond to exactly one formal proposition, within the context of a given formal system. This problem is unusual, in that it was originally misspecified.
I suppose there's no formally defined procedure that accepts a natural language statement and outputs either its formalization or "misspecified". And "absolutely true" means "the vast majority of mathematicians agree that there's only one formal proposition that corresponds to this statement".
I think you suppose wrong. A statement like "the area of the square whose side is the hypotenuse is equal to the sum of the areas of the squares on the other two sides" doesn't seam out of reach of an algorithmic procedure like a classical NLP.
Sure, we can write a procedure that recognizes some formal grammar, which intersects with the natural language. Defining the formal grammar that fully captures the current natural language understanding of the mathematical community is a bit harder.
This problem was even worse: it's matched by the formal grammar, but the naïve formalisation has a trivial answer, so it is clearly not what was intended.
Is anyone working on applying these techniques to formal verification of software?
My limited understanding of Rust is that it applies a fixed set of rules to guarantee memory safety. The rules are somewhat simple and limiting, for ease of understanding and implementation, but also because of undecidability.
Programmers run into situations where they know that their code won't cause memory errors, but it doesn't follow the rules. Wouldn't it be cool if something like Aristotle was integrated into the compiler? Any code for which a proof of correctness could be written would pass/compile, without having to add more and more rules
Formal methods are cool because, by contrast to tools like the borrow checker, you can prove some very "nonlocal" properties: this system does not deadlock, or it makes progress at least every N steps, etc.
An issue with this approach is that it may not be robust. That is, you could run into a case where a minor modification of your program is suddenly not provable anymore, even though it is still correct. The heuristic (AI or otherwise) necessarily has limits, and if you are close to the "edge" of its capabilities, then a minor change could push it across.
If the proof is rooted in the programmer's understanding who can give proof hints to the prover then any modification of the program can then be accompanied with a modification of the hints, still allowing automatic proofs. But if the human has no clue then the automatic system can get stuck without the human having a chance to help it along.
The same is true for optimization. One small change and the compiler's optimizer doesn't know anymore how to optimize the code, and your code is now slow. And there is no way for a programmer to fix it except by rolling back their changes or by inspecting the assembly output.
Formal verification of program correctness is also (for obvious reasons) key to unlocking AI-driven synthesis (i.e. 'vibe' coding) of "correct" programs that will verifiably meet the given spec.
Not all aspects of a spec can be formally encoded. But even half-way houses are good.
Eg you can give the vague spec 'build me a todo list app', but you can still formally prove that everything your app does finishes, or even that it finishes in reasonable time.
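In a prover like Lean, the "everything finishes" half falls out of totality: a plain definition is only accepted once the system can show it terminates (a minimal sketch, not tied to any particular app):

  -- accepted: structural recursion on n, so termination is proved automatically
  def countdown : Nat → List Nat
    | 0 => [0]
    | n + 1 => (n + 1) :: countdown n

  -- a definition with no termination argument must be marked `partial`,
  -- and then carries no such guarantee

The "reasonable time" half is a much stronger property and needs an explicit complexity argument on top.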
Yes. For sure we will never be able to 100% automate the whole SWE process. As you say, the first input is a human wish, and there comes the joke of the genie that always screws up the wishes by leaving out something obvious, because it was not explicitly specified. Also, I think at some point the halting problem will make some programs impossible to test. But it would be so great: program in a loose syntax, but with more safety than Rust and Ada together.
It will certainly help, but it's an extremely high bar. Almost all formal verification of software today is "does this pass the typechecker?".
Now this captures some errors, but it doesn't really capture high-level ones (whether a program is guaranteed not to deadlock is a hard one), and it doesn't capture the one that is important for business purposes (does this do what the customer wants). That requirement is more important than correctness (witness all the software that is described as "crap" but is nonetheless widely used).
I don't think this is a required key to unlocking vibe coding. That seems to be easy: does this provide business value? And there the answer seems roughly to be "yes".
Sometimes when I'm using new LLMs I'm not sure if it’s a step forward or just benchmark hacking, but formalized math results always show that the progress is real and huge.
When do you think Harmonic will reach formalizing most (even hard) human written math?
I saw an interview with Christian Szegedy (your competitor I guess) that he believes it will be this year.
Thank you! It depends on the topic. Some fields (algebra, number theory) are covered well by Lean's math library, and so I think we are already there; I recommend trying Aristotle for yourself to see how reliably it can formalize these theorems!
In other fields (topology, probability, linear algebra), many key definitions are not in Mathlib yet, so you will struggle to write down the theorem itself. (But in some cases, Aristotle can define the structure you are talking about on the fly!)
This is not an intrinsic limitation of Lean; it's just that nobody has taken the time to formalize much of those fields yet. We hope to dramatically accelerate this process by making it trivial to prove lemmas, which make up much of the work. For now, I still think humans should write the key definitions and statements of "central theorems" in a field, to ensure they are compatible with the rest of the library.
> If the proof is correct, Aristotle has a good chance at translating it into Lean
How does this depend on the area of mathematics of the proof? I was under the impression that it was still difficult to formalize most research areas, even for a human. How close is Aristotle to this frontier?
>assuming you have formalized the statement correctly
That's a pretty big assumption, though, isn't it? As we saw with the Navier-Stokes psychosis episode over the New Year holiday, formalizing correctly really isn't guaranteed.
What happens when this process is reversed and you translate from Lean to informal English? And does iterating this then help research better approaches toward writing proofs in human language?
I had the same thought, but unfortunately even if that translation is accurate it could still be bidirectional hallucination and would not really be sufficient evidence...
It's another reformulation rather than a true proof. Instead of wanting a proof of a theorem, we now just need to prove that this proof actually proves the theorem. The proof itself is so incomprehensible that it can't be trusted on its own, but if it can be shown that it can be trusted, then the theorem must be true.
Aristotle's output is formally verified in Lean, so you can run it for days on a hard problem and be assured that the answer, no matter how complex, is right without needing to manually check it.
Claude Code can write lean, but we do a heck of a lot of RL on theorem proving, so Aristotle winds up being much better at writing Lean than other coding agents are.
You seem to be openly contradicting your company's PR and language. Your description very clearly describes the "AI" as a tool to translate relatively informal specifications into formal proof logic, but does not itself do the proving.
Yes! I think that working with Mathlib is the best long term solution, because it's how people already collaborate on building out the formal "universe of mathematics." We want to speed that up, and hopefully we'll cover all of the common topics very soon!
This is the first time I've heard about Aristotle and I find it very interesting. First question first: is it available for the general public? I don't know if this is the page to try it? [1]
Second, when you say language modeling support, does it mean that it can better understand code representations (ASTs), or something else? I am just an AI user, not very knowledgeable in the field. My main interest is whether it would be great for static analysis oriented to computer security (SAST).
Based on Tao’s description of how the proof came about - a human is taking results backwards and forwards between two separate AI tools and using an AI tool to fill in gaps the human found?
I don’t think it can really be said to have occurred autonomously then?
Looks more like a 50/50 partnership with a super-expert human on the one side, which makes this way more vague in my opinion, and in line with my own AI tests, i.e. they are pretty stupid, even OPUS 4.5 or whatever, unless you're already an expert and are doing boilerplate.
EDIT: I can see the title has been fixed now from "solved" to "more or less solved", which I still think is a big stretch.
I'm not sure i understand the wild hype here in this thread then.
Seems exactly like the tests at my company, where even frontier models are revealed to be very expensive rubber ducks but completely fail with non-experts or anything novel or math-heavy.
Ie. they mirror the intellect of the user but give you big dopamine hits that'll lead you astray.
Yes, the contributions of the people promoting the AI should be considered, as well as the people who designed the Lean libraries used in-the-loop while the AI was writing the solution. Any talk of "AGI" is, as always, ridiculous.
But speaking as a specialist in theorem proving, this result is pretty impressive! It would have likely taken me a lot longer to formalize this result even if it was in my area of specialty.
How did you arrive at "ridiculous"? What we're seeing here is incredible progress over what we had a year ago. Even ARC-AGI-2 is now at over 50%. Given that this sort of process is also being applied to AI development itself, it's really not clear to me that humans would be a valuable component in knowledge work for much longer.
It requires constant feedback, critical evaluation, and checks. This is not AGI, it's cognitive augmentation. One that is collective, one that will accelerate human abilities far beyond what the academic establishment is currently capable of, but that is still fundamentally organic. I don't see a problem with this--AGI advocates treat machine intelligence like some sort of God that will smite non-believers and reward the faithful. This is what we tell children so that they won't shit their beds at night, otherwise they get a spanking. The real world is not composed of rewards and punishments.
It does seem that the venn diagram of "roko's basilisk" believers and "AGI is coming within our lifetimes" believers is nearly a circle. Would be nice if there were some less... religious... arguments for AGI's imminence.
I think the “Roko’s Basilisk” thing is mostly a way for readers of Nick Land to explain part of his philosophical perspective without the need for, say, an actual background in philosophy. But the simplicity reduces his nuanced thought into a call for a sheeplike herd—they don’t even need a shepherd! Or perhaps there is, but he is always yet to come…best to stay in line anyway, he might be just around the corner.
> It requires constant feedback, critical evaluation, and checks. This is not AGI, it's cognitive augmentation.
To me that doesn't sound qualitatively different from a PhD student. Are they just cognitive augmentation for their mentor?
In any case, I wasn't trying to argue that this system as-is is AGI, but just that it's no longer "ridiculous", and that this to me looks like a herald of AGI, as the portion being done by humans gets smaller and smaller
People would say the same thing about a calculator, or computation in general. Just like any machine it must be constructed purposefully to be useful, and once we require something which exceeds that purpose it must be constructed once again. Only time will tell the limits of human intelligence, now that AI is integrating into society and industry.
>AGI advocates treat machine intelligence like some sort of God that will smite non-believers and reward the faithful.
>The real world is not composed of rewards and punishments.
Most "AGI advocates" say that AGI is coming, sooner rather than later, and it will fundamentally reshape our world. On its own that's purely descriptive. In my experience, most of the alleged "smiting" comes from the skeptics simply being wrong about this. Rarely there's talk of explicit rewards and punishments.
> it's really not clear to me that humans would be a valuable component in knowledge work for much longer.
To me, this sounds like when we first went to the moon, and people were sure we'd be on Mars by the end of the 80's.
> Even ARC-AGI-2 is now at over 50%.
Any measure of "are we close to AGI" is as scientifically meaningful as "are we close to a warp drive" because all anyone has to go on at this point is pure speculation. In my opinion, we should all strive to be better scientists and think more carefully about what an observation is supposed to mean before we tout it as evidence. Despite the name, there is no evidence that ARC-AGI tests for AGI.
> To me, this sounds like when we first went to the moon, and people were sure we'd be on Mars by the end of the 80's.
Unlike space colonisation, there are immediate economic rewards from producing even modest improvements in AI models. As such, we should expect much faster progress in AI than space colonisation.
But it could still turn out the same way, for all we know. I just think that's unlikely.
The minerals in the asteroid belt are estimated to be worth in the $100s of quintillions. I would say that’s a decent economic incentive to develop space exploration (not necessarily colonization, but it may make it easier).
Not going to happen as long as the society we live in has this big of a hard on for capitalism and working yourself to the bone is seen as a virtue. Every time there’s a productivity boost, the newly gained free time is immediately consumed by more work. It’s a sick version of Parkinson’s law where work is infinite.
Let me put it like this: I expect AI to replace much of human wage labor over the next 20 years and push many of us, and myself almost certainly included, into premature retirement. I'm personally concerned that in a few years, I'll find my software proficiency to be as useful as my chess proficiency today is useful to Stockfish. I am afraid of a massive social upheaval both for myself and my family, and for society at large.
There are other bounds here at play that are often not talked about.
AI runs on computers. Consider the undecidability of Rice's theorem: compiled code for non-trivial properties may or may not be error-free, and even an AI can't guarantee its compiled code is error-free. Not because it couldn't write sufficient code to solve a problem, but because the code it writes is bounded by other externalities. Undecidability in general makes the dream of generative AI considerably more challenging than how it's being 'sold'.
Here “much of” is doing the heavy lifting. Are you willing to commit to a percentage or a range?
I work at an insurance company and I can’t see AI replacing even 10% of the employees here. Too much of what we do is locked up in decades-old proprietary databases that cannot be replaced for legal reasons. We still rely on paper mail for a huge amount of communication with policyholders. The decisions we make on a daily basis can’t be trusted to AI for legal reasons. If AI caused even a 1% increase in false rejections of claims it would be an enormous liability issue.
Yes, absolutely willing to commit. I can't find a single reliable source, but from what I gather, over 70% of people in the West do "pure knowledge work", which doesn't include any embodied activities. I am happy to put my money on these jobs starting to be fully taken over by AI soon (if they aren't already), and that by 2035, less than 50% of us will have a job that doesn't require "being there".
And regarding your example of an insurance company, I'm not sure about that industry, but seeing the transformation of banking over the last decade to fully digital providers like Revolut, I would expect similar disruption there.
I would easily take the other side of this bet. It just reminds me when everyone was sure back in 2010 that we’d have self driving cars within 10 years and human drivers would be obsolete. Today replacing human drivers fully is still about 10 years away.
This accurately mirrors my experience. It has never happened, so far, that the AI brought any novel insight at the level that I would see as an original idea. Presumably the case of TFA is different, but the normal interaction is that the solution to whatever you are trying to solve is a millimeter away from your understanding, and the AI won't bridge that gap until you do it yourself, and then it will usually prove to you that it was obvious. If it was so obvious then it probably should have made the suggestion...
Recent case:
I have a bar with a number of weights supported on either end:
|---+-+-//-+-+---|
What order and/or arrangement or of removing the weights would cause the least shift in center-of-mass? There is a non-obvious trick that you can pull here to reduce the shift considerably and I was curious if the AI would spot it or not but even after lots of prompting it just circled around the obvious solutions rather than to make a leap outside of that box and come up with a solution that is better in every case.
I wonder what the cause of that kind of blindness is.
That problem is not clearly stated, so if you’re pasting that into an AI verbatim you won’t get the answer you’re looking for.
My guess is: first move the weights to the middle, and only then remove them.
However “weights” and “bar” might confuse both machines and people into thinking that this is related to weight lifting, where there’s two stops on the bar preventing the weights from being moved to the middle.
The problem is stated clearly enough that humans that we ask the question of will sooner or later see that there is an optimum and that that optimum relies on understanding.
And no, the problem is not 'not clearly stated'. It is complete as it is and you are wrong about your guess.
And if machines and people think this is related to weight lifting then they're free to ask follow up questions. But even in the weight lifting case the answer is the same.
Illusion of transparency. You are imagining yourself asking this question, while standing in the gym and looking at the bar (or something like this). I, for example, have no idea how the weights are attached and which removal actions are allowed.
Yeah, LLMs have a tendency to run with some interpretation of a question without asking follow-up questions. Probably, it's a consequence of RLHFing them in that way.
And none of those details matter to solve the problem correctly. I'm purposefully not putting any answers here because I want to see if future generations of these tools suddenly see the non-obvious solution. But you are right about the fact that the details matter, one detail is mentioned very explicitly that holds the key.
Sure they do, the problem makes no sense as stated. The solution to the stated problem is to remove all weights all at once, solved. Or even two at a time, opposite the centre of gravity. Solved, but not what you're asking, I assume?
You didn't even label your ASCII art, so I've no clue what you mean, are the bars at the end the supports or weights? Can I only remove one weight at a time? Initially I assumed you mean a weightlifting bar the weights on which can only be removed from its ends. Is that the case or what? What's the double slash in the middle?
Also: "what order and/or arrangement or of removing the weights" this isn't even correct English. Arrangement of removing the weights? State the problem clearly, from first principles, like you were talking to a 5 year old.
The sibling comment is correct, you're clearly picturing something in your mind that you're failing to properly describe. It seems obvious to you, but it's not.
The problem is unclear. I think you have a labelled graph G=(V, E) with labels c:V->R, such that each node in V consists of a triple (L, R, S) where L is a sequence of weights that are on the left, R is a sequence of weights that are on the right, and S is a set of weights that have been taken off. Define c(L, R, S) to be the centre of mass. Introduce an undirected edge e={(L, R, S), (L', R', S')} between (L, R, S) and (L', R', S') either if (i) (L', R', S') results from taking the first weight off L and adding it to S, or (ii) (L', R', S') results from taking the first weight off R and adding it to S, or (iii) (L', R', S') results from taking a weight from W and adding it to L, or (iv) (L', R', S') results from taking a weight from W and adding it to R.
There is a starting node (L_0, R_0, {}) and an ending node ({}, {}, W), with the latter having L=R={}.
I think you're trying to find the path (L_n, R_n, S_n) from the starting node to the ending node that minimises the maximum absolute value of c(L_n, R_n, S_n).
Lots of users seem to think LLMs think and reason, so this sounds wonderful. A mechanical process isn't thinking, and it certainly does NOT mirror human thinking; the processes are altogether different.
Do you have any idea how many people here have paychecks that depend on the hype, or hope to be in that position? They were the same way for Crypto until it stopped being part of the get-rich-quick dream.
"Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver"
Not saying it's not an amazing setup, I just don't understand the word "AI" being used like this when it's the setup/system that's brilliant in conjunction with absolute experts.
> Ie. they mirror the intellect of the user but give you big dopamine hits that'll lead you astray.
This hits so close to home. Just today, in my field, a manager without expertise in a topic gave me an AI solution to something I have expertise in. The AI was very plainly and painfully wrong, but it came down to the user prompting really poorly. When I gave a well-formulated prompt on the same topic, I got the correct answer on the first go.
"the more interesting capability revealed by these events is the ability to rapidly write and rewrite new versions of a text as needed, even if one was not the original author of the argument." From the Tao thread. The ability to quickly iterate on research is a big change because "This is sharp contrast to existing practice where....large-scale reworking of the paper often avoided due both to the work required and the large possibility of introducing new errors."
Exactly "The Geordi LaForge Paradox" of "AI" systems. The most sophisticated work requires the most sophisticated user, who can only become sophisticated the usual way --- long hard work, trial and error, full-contact kumite with reality, and a degree of devotion to the field.
> There seems to be some confusion on this so let me clear this up. No, after the model gave its original response, I then proceeded to ask it if it could solve the problem with C=k/logN arbitrarily large. It then identified for itself what both I and Tao noticed about it throwing away k!, and subsequently repaired its proof. I did not need to provide that observation.
so it was literally "yo, your proof is weak!" - "naah, watch this! [proceeds to give full proof all on its own]"
I had the impression Tao/community weren't even finding the gaps, since they mentioned using an automatic proof verifier. And that the main back and forth involved re-reading Erdos' paper to find out the right problem Erdos intended. So more like 90/10 LLM/human. Maybe I misread it.
"This website was made by Thomas Bloom, a mathematician who likes to think about the problems Erdős posed. Technical assistance with setting up the code for the website was provided by ChatGPT" - from the FAQ
> EDIT: I can see the title has been fixed now from solved to "more or less solved" which is still think is a big stretch.
"solved more or less autonomously by AI" were Tao's exact words, so I think we can trust his judgment about how much work he or the AI did, and how this indicates a meaningful increase in capabilities.
Reconfiguring existing proofs in ways that have been tedious or obscured from humans, or using well framed methods in novel ways, will be done at superhuman speeds, and it'll unlock all sorts of capabilities well before we have to be concerned about AGI. It's going to be awesome to see what mathematicians start to do with AI tools as the tools become capable of truly keeping up with what the mathematicians want from the tools. It won't necessarily be a huge direct benefit for non-mathematicians at first, because the abstract and complex results won't have direct applications, but we might start to see millenium problems get taken down as legitimate frontier model benchmarks.
Or someone like Terence Tao might figure out how to wield AI better than anyone else, even the labs, and use the tools to take a bunch down at once. I'm excited to see what's coming this year.
> Reconfiguring existing proofs in ways that have been tedious or obscured from humans,
To a layman, that doesn't sound very AI-like? Surely there must be a dozen algorithms to effectively search this space already, given that mathematics is pretty logical?
I actually know about this a bit since it was part of what I was studying with my incomplete PhD.
Isabelle has had the "Sledgehammer" tool for quite a while [1]. It uses solvers like z3 to search and apply a catalog of proof strategies and then tries to construct a proof for your main goal or any remaining subgoals that you have to complete. It's not perfect, but it's remarkably useful (even if it does sometimes give you proofs that import like ten different libraries and are hard to read).
I think Coq has Coqhammer but I haven't played with that one yet.
1 Does this mean that Sledgehammer and Coqhammer offer concolic testing based on an input framework (say some computing/math system formalization) for some sort of system execution/evaluation or does this only work for hand-rolled systems/mathematical expressions?
Sorry for my probably senseless questions, as I'm trying to map the computing model of math solvers to common PL semantics. Probably there is better overview literature. I'd like to get an overview of proof system runtime semantics for later usage.
2 Is there an equivalent of fuzz testing (of computing systems) in math, say to construct the general proof framework?
3 Or how are proof frameworks (based on ideas how the proof could work) constructed?
4 Do I understand it correctly that math in proof systems works with term rewrite systems + the used theory/logic as the computing model of valid representations and operations? How is the step semantics then formally defined?
1. Sledgehammer/CoqHammer are automated proof-search/assistant tools that bridge interactive proof assistants (Isabelle/Coq) to external SMT provers; they aren't about concolic testing in a PL sense. They translate goals and context into formats the external provers understand, call those provers, and replay/translate returned proofs or proof fragments back into the ITP. That's search and translation of a complete proof, not running concrete+symbolic program executions like concolic testing.
2. there is no exact analogue of fuzz-testing bytecode for "theorem fuzzing", but arguably the closest match is counterexample generators - model finders, finite-model search, SMT instantiation with concrete valuations, all serve a role similar to fuzzers by finding invalid conjectures quickly.
these are weaker than fuzzing for finding execution bugs because mathematical statements are higher-level and provers operate symbolically.
3. here's how proof frameworks are constructed, at the high-level:
a. start by picking a logical foundation: e.g., first-order logic (FOL), higher-order logic (HOL), dependent type theory (Martin-Löf/CIC).
b. define syntax (terms, types, formulas) and typing/judgment rules (inference rules or typing rules).
c. define proof objects or proof rules: natural deduction, sequent calculus, Hilbert-style axioms, or type-as-proofs (Curry-Howard).
d. pick or design the kernel: small, trusted inference engine that checks proof objects (reduction rules, conversion, rule admissibility).
e. provide libraries of axioms/definitions and extraction/interpretation mechanisms (code extraction, models).
f. implement proof search strategies and heuristics (backtracking, heuristics, lemma databases, proof-producing solvers).
this is the standard engineering pipeline behind modern ITPs and automated systems.
4. yes, many proof assistants and automated provers treat computation inside proofs as:
a. term rewriting / reduction (beta-reduction, delta-unfolding, normalization) for computation oriented parts; this is the "computation" layer.
b. a separate deductive/logic layer for reasoning (inference rules, quantifier instantiation, congruence closure).
the combined model is: terms represent data/computation; rewriting gives deterministic computation semantics; logical rules govern valid inference about those terms. Dependent type theories conflate computation and proof via conversion (definitional equality) in the kernel.
5. here's how a proof step's semantics is defined:
proof steps are applications of inference rules transforming sequents/judgments. formally:
a. a judgment form J (e.g., Γ ⊢ t : T or Γ ⊢ φ) is defined.
b. inference rules are of the form: from premises J1,...,Jn infer J, written as (J1 ... Jn) / J.
c. a proof is a finite tree whose nodes are judgments; leaves are axioms or assumptions; each internal node follows an inference rule.
d. for systems with computation, a reduction relation → on terms defines definitional equality; many rules use conversion: if t →* t' and t' has form required by rule, the rule applies.
e. in type-theoretic kernels, the step semantics is checking that a proof object reduces/normalizes and that constructors/eliminators are used respecting typing/conversion; the kernel checks a small set of primitive steps (e.g., beta, iota reductions, rule applications).
operationally: a single proof step = instantiate a rule schema with substitutions for metavariables, perform any required reductions/conversions, and check side conditions (freshness, well-formedness).
equational reasoning and rewriting: a rewrite rule l → r can be applied to a term t at a position p if the subterm at p matches l under some substitution σ; result is t with that subterm replaced by σ(r). Confluence/termination properties determine global behavior.
higher-level tactics encode sequences of such primitive steps (rule applications + rewrites + searches) but are not part of kernel semantics; only the kernel rules determine validity.
relevant concise implications for mapping to PL semantics:
treat proof state as a machine state (context Γ, goal G, local proof term), tactics as programs that transform state; kernel inference rules are the atomic instruction set; rewriting/normalization are deterministic evaluation semantics used inside checks.
automation ≈ search processes over nondeterministic tactic/program choices; model finders/SMTs act as external oracles producing either counterexamples (concrete) or proof certificates (symbolic).
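A minimal Lean 4 illustration of that layering (computation by reduction, a single deductive step, and a tactic program whose output term is what the kernel actually checks):

  -- computation layer: both sides reduce to the same normal form, so `rfl` closes the goal
  example : 2 + 3 = 5 := rfl

  -- deduction layer: one explicit rule application (symmetry of equality)
  example (a b : Nat) (h : a = b) : b = a := Eq.symm h

  -- tactic layer: a small program transforming the proof state; only the
  -- proof term it produces matters for validity
  example (a b : Nat) (h : a = b) : a + a = b + b := by rw [h]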
1. Yes, the Sledgehammer suite contains three testing systems that I believe are concolic (quickcheck, nitpick, nunchaku), but they only work due to hand-coded support for the mathematical constructs in question. They'd be really inefficient for a formalised software environment, because they'd be operating at a much lower-level than the abstractions of the software environment, unless dedicated support for that software environment were provided.
2. Quickcheck is a fuzz-testing framework, but it doesn't help to construct anything (except as far as it constructs examples / counterexamples). Are you thinking of something that automatically finds and defines intermediary lemmas for arbitrary areas of mathematics? Because I'm not aware of any particular work in that direction: if computers could do that, there'd be little need for mathematicians.
3. By thinking really hard, then writing down the mathematics. Same way you write computer programs, really, except there's a lot more architectural work. (Most of the time, the computer can brute-force a proof for you, so you need only choose appropriate intermediary lemmas.)
4. I can't parse the question, but I suspect you're thinking of the meta-logic / object-logic distinction. The actual steps in the term rewriting are not represented in the object logic: the meta-logic simply asserts that it's valid to perform these steps. (Not even that: it just does them, accountable to none other than itself.) Isabelle's meta-logic is software, written in a programming language called ML.
> Logic solvers are useful, but not tractable as a general way to approach mathematics.
To be clear, there are explicitly computationally tractable fragments of existing logics, but they're more-or-less uninteresting by definition: they often look like very simple taxonomies (i.e. purely implicational) or like a variety of "modal" and/or "multi-modal" constructions over simpler logics.
Of course it would be nice to explicitly tease out and write down the "computationally tractable" general logical reasoning that some existing style of proof is implicitly relying on (AIUI this kind of inquiry would generally be comprised under "synthetic mathematics", trying to find simple treatments in axiom- and rule-of-inference style for existing complex theories) but that's also difficult.
I agree only with the part about reconfiguring existing proofs. That's the value here. It is still likely very tedious to confirm what the LLMs say, but at least it's better than waiting for humans to do this half of the work.
For all topics that can be expressed with language, the value of LLMs is shuffling things around to tease out a different perspective from the humans reading the output. This is the only realistic way to understand AI enough to make it practical and see it gain traction.
As much as I respect Tao, I feel like his comments about AI usage can be misleading without carefully reading what he is saying in the linked posts.
> It is still likely very tedious to confirm what the LLMs say,
A large amount of Tao's work is around using AI to assist in creating Lean proofs.
I'm generally on the more skeptical side of things regarding LLMs and grand visions, but assisting in the creation of Lean proofs is a huge area of opportunity for LLMs and really could change mathematics in fundamental ways.
One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case. We have proofs that are gigabytes (I believe even terabytes in some cases) in size, but we know they are correct because they check in Lean.
This particular pattern of using state of the art work in two different areas (LLMs and theorem proving) absolutely has the potential to fundamentally change how mathematics is done. There's a great picture on p. 381 of Type Theory and Formal Proof where you can easily see how LLMs can be placed in two of the most tricky parts of that diagram to solve.
Because the work is formally verified we can throw out entire classes of LLM problems (like hallucinations).
Personally I think strongly typed languages with powerful type systems are also the long-term ideal for coding with LLMs (but I'm less optimistic about devs following this path).
Perhaps you meant, "a decent amount of his recent work." He has been doing math long before LLMs, and is still regularly publishing papers with collaborators that have nothing to do with AI. The most recent was last week. https://arxiv.org/search/math?searchtype=author&query=Tao,+T
You are correct, I assumed context was implied, but I do mean "recent work with LLMs". Friends of mine were doing side projects with him about two decades ago, and I have a few of his books on my shelf, so yes, I am aware that Terry Tao was doing work in mathematics prior to the advent of LLMs.
> I don't believe that's what's happening in this specific example (and am probably wrong), but this is where a lot of Tao's enthusiasm lies.
It absolutely is. With the twist that ChatGPT 5.2 can now also "explain" an AI-generated Lean proof in human-readable terms. This is a game changer, because "refactoring" can now become end-to-end: if the human explanation of a Lean proof is hard to grok and could be improved, you can test changes directly on the formal text and check that the proof still goes through for the original statement.
If you consider the statement that perfect play by both sides in checkers results in a draw to be the statement of a theorem, then the proof is 237GB compressed :) And verifying it requires quite a lot of computation.
> One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case.
That’s one of the main reason why I did not pursue an academic math career. The pure joy of solving exam problems with elegant proofs is very hard to get on harder problems.
> One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case.
That's not a naïve belief. Intelligible proofs represent insight that can be applied to other problems. If our only proof is an opaque one, that means we don't really understand the area yet. Take, for example, the classification of finite simple groups (a ten-thousand-page proof): that is very much not a closed area of research, and we're still discovering new things in the vicinity of the problem.
This is what has excited me for many years - the idea I call "scientific refactoring"
What happens if we reason upwards but change some universal constants? What happens if we use Tau instead of Pi everywhere? These kinds of fun questions would otherwise require an enormous intellectual effort, whereas with the mechanisation and automation of thought, we might be able to run them and see!
Not just for math, but ALL of Science suffers heavily from the problem that even leading researchers are able to read less than 1% of the published work.
Google Scholar was a huge step forward for doing meta-analysis vs a physical library.
But agents scanning the vastness of PDFs to find correlations and insights that are far beyond human context-capacity will I hope find a lot of knowledge that we have technically already collected, but remain ignorant of.
This idea is just ridiculous to anyone who's worked in academia. The theory is nice, but academic publishing is currently in the late stages of a huge death spiral.
In any given scientific niche, there is a huge amount of tribal knowledge that never gets written down anywhere, just passed on from one grad student to the rest of the group, and from there spreads by percolation in the tiny niche. And papers are never honest about the performance of the results and what does not work, there is always cherry picking of benchmarks/comparisons etc.
There is absolutely no way you can get these kinds of insights beyond human context capacity that you speak of. The information necessary does not exist in any dataset available to the LLM.
No no, in comparison to academia, programmers have been extremely diligent at documenting exactly how stuff works and providing fairly reproducible artifacts since the 1960s.
Imagine trying to teach an AI how to code based only on slide decks from consultants. No access to documentation, no Stack Overflow, no open source code used in the training data; just sales pitches and success stories. That's close to how absurd this idea is.
Exactly, and I think not every instance can be claimed to be a hallucination, there will be so much latent knowledge they might have explored.
It is likely we might see some AlphaGo-type new styles in existing research workflows that AI might work out if there is some verification logic. Humans could probably never go into that space, or maybe none of the researchers ever ventured there for different reasons, as progress in general is almost always incremental.
Google Scholar is still ignoring a huge amount of scholarship that is decades old (pre-digital) or even centuries old (and written in now-unused languages that ChatGPT could easily make sense of).
In 1897, the Indiana General Assembly attempted to legislate a new value for pi, proposing it be defined as 3.2, which was based on a flawed mathematical proof. This bill, known as the Indiana pi bill, never became law due to its incorrect assertions and the prior proof that squaring the circle is impossible: https://en.wikipedia.org/wiki/Indiana_pi_bill
I don't think it's just the sheer number of symbols. It's also the fact that the symbol τ means "turn". So you can say "quarter-turn" instead of π/2.
I'm not sure why that point gets lost in these discussions. And personally, I think of the set of fundamental mathematical objects as having a unique and objective definition. So, I get weirdly bothered by the offset in the Gamma function.
I can write a sed command/program that replaces every occurrence of PI with TAU/2 in LaTeX formulas and it'll take me about 30 minutes.
The "intellectual effort" this requires is about 0.
Maybe you meant Euler's number? Since it also relates to PI, it can be used and might actually change the framework in an "interesting way" (making it more awkward in most cases - people picked PI for a reason).
I think they mean it in a more general way: thinking with tau instead of pi might shift the context toward another method or problem-solving algorithm, or there might be obscure or complex uses of tau or pi that haven't cross-fertilized in the literature. It might be natural to think of clever extensions or use cases in one context but not the other, and those extensions and extrapolations would be apparent to AI, within reach of a tedious and exhaustive review of the existing literature.
I think what they were getting at is something like this: The application of existing ideas that simply haven't been applied in certain ways because it's too boring or obvious or abstract for humans to have bothered with, but AI can plow through a year's worth of human drudgery in a day or a month or so, and that sort of "brute force" won't require any amazing new technical capabilities from AI.
I'm using LLMs to rewrite every formula featuring the Gamma function to instead use the factorial. Just let "z!" mean "Gamma(z+1)", substitute everywhere, and simplify. Then have the AI rewrite any prose.
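Concretely, the convention being described is just a shift by one; a quick LaTeX sketch of the substitution:

  z! := \Gamma(z+1)
  \Gamma(z) = (z-1)!
  \left(-\tfrac{1}{2}\right)! = \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}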
If this isn't AGI, what is? It seems unavoidable that an AI which can prove complex mathematical theorems would lead to something like AGI very quickly.
"I doubt that anything resembling genuine "artificial general intelligence" is within reach of current #AI tools. However, I think a weaker, but still quite valuable, type of "artificial general cleverness" is becoming a reality in various ways.
By "general cleverness", I mean the ability to solve broad classes of complex problems via somewhat ad hoc means. These means may be stochastic or the result of brute force computation; they may be ungrounded or fallible; and they may be either uninterpretable, or traceable back to similar tricks found in an AI's training data. So they would not qualify as the result of any true "intelligence". And yet, they can have a non-trivial success rate at achieving an increasingly wide spectrum of tasks, particularly when coupled with stringent verification procedures to filter out incorrect or unpromising approaches, at scales beyond what individual humans could achieve.
This results in the somewhat unintuitive combination of a technology that can be very useful and impressive, while simultaneously being fundamentally unsatisfying and disappointing - somewhat akin to how one's awe at an amazingly clever magic trick can dissipate (or transform to technical respect) once one learns how the trick was performed.
But perhaps this can be resolved by the realization that while cleverness and intelligence are somewhat correlated traits for humans, they are much more decoupled for AI tools (which are often optimized for cleverness), and viewing the current generation of such tools primarily as a stochastic generator of sometimes clever - and often useful - thoughts and outputs may be a more productive perspective when trying to use them to solve difficult problems."
This comment was made on Dec. 15, so I'm not entirely confident he still holds it?
I quickly noticed that my pre-ChatGPT-3.5 use of the term was satisfied by ChatGPT-3.5, but this turned out to be completely useless for 99% of discussions, as everyone turned out to have different boolean cut-offs not only for the generality, but also for the artificiality and the intelligence, and for what counts as "intelligence" in the first place.
That everyone can pick a different boolean cut-off for each initial means they're not really booleans.
Therefore, consider that this can't drive a car, so it's not fully general. And even those AI which can drive a car, can't do so in genuinely all conditions expected of a human, just most of them. Stuff like that.
A blind person does not have the necessary input (sight data) to make the necessary computation. A car autopilot would.
So no we do not deem a blind person to be unintelligent due to their lack of being able to drive without sight. But we might judge a sighted person as being not generally intelligent if they could not drive with sight.
AGI in its standard definition requires matching or surpassing humans on all cognitive tasks, not just on some, especially not just on some that only a handful of humans have ever taken a stab at.
Surely AGI would be matching humans on most tasks. To me, surpassing humans on all cognitive tasks sounds like superintelligence, while AGI "only" needs to perform most, but not necessarily all, cognitive tasks at the level of a human highly capable at that task.
Personally I could accept "most" provided that the failures were near misses as opposed to total face plants. I also wouldn't include "incompatible" tasks in the metric at all (but using that to game the metric can't be permitted either). For example the typical human only has so much working memory, so tasks which overwhelm that aren't "failed" so much as "incompatible". I'm not sure exactly what that looks like for ML but I expect the category will exist. A task that utilizes adversarial inputs might be an example of such.
This is very narrow AI, in a subdomain where results can be automatically verified (even within mathematics that isn't currently the case for most areas).
Not really. A completely unintelligent autopilot can fly an F-16. You cannot assume general intelligence from scaffolded tool-using success in a single narrow area.
I was assuming extreme performance from a general AI matching and exceeding average human intelligence, whether placed in an F-16 or in an equivalent "cockpit" specified for conducting math proofs.
That’s not agi at all. I don’t think you understand that LLMs will never hit agi even when they exceed human intelligence in all applicable domains.
The main reason is they don’t feel emotions. Even if the definition of agi doesn’t currently encompass emotions people like you will move the goal posts and shift the definition until it does. So as AI improves, the threshold will be adjusted to make sure they will never reach agi as it’s an existential and identity crisis to many people to admit that an AI is better than them on all counts.
I have kept track of a few instances where AI has been applied to real and genuine problems.
Not trivial problems. Issues with possible solutions, errors, and unresolved history.
AI did not "solve" any issues on its own, but what stood out to me was the speed at which concepts could be rewritten, restructured and tested for stress.
A mental model that has been useful to me is that AI is not particularly good at providing the first answer, however, it is very good at providing the second, third, and tenth versions of the answer, especially when the first answer has already been identified as weak by a human.
In these instances, the progress seemed to stem from the AI being able to: quickly reword and restate a given argument; convert implicit assumptions into explicit ones; and identify small gaps in logic before they became large.
What I have been grappling with is how to differentiate when AI is just clarifying versus when it is silently hallucinating structure. Is the output of AI being treated as a draft, a reviewer, a rubber duck, or some combination? When is the output so fast that the rigor of thought is compromised? I am interested in how others are using AI for hard thinking and not just for writing cleanup.
Can anyone with specific knowledge in a sophisticated/complex field such as physics or math tell me: do you regularly talk to AI models? Do you feel like there's anything to learn? As a programmer, I can come to the AI with a problem and it can come up with a few different solutions, some I may have thought about, some not.
Are you getting the same value in your work, in your field?
Context: I finished a PhD in pure math in 2025 and have transitioned to being a data scientist and I do ML/stats research on the side now.
For me, deep research tools have been essential for getting caught up with a quick lit review about research ideas I have now that I'm transitioning fields. They have also been quite helpful with some routine math that I'm not as familiar with but is relatively established (like standard random matrix theory results from ~5 years ago).
It does feel like the spectrum of utility is pretty aligned with what you might expect: routine programming > applied ML research > stats/applied math research > pure math research.
I will say ~1 year ago they were still useless for my math research area, but things have been changing quickly.
I don't have a degree in either physics or math, but what AI helps me do is stay focused on the job before me rather than dig through a mountain of textbooks, Wikipedia pages, or scientific papers trying to find an equation that I know I've seen somewhere but did not register the location of and did not copy down. This saves many days, every day. Even then I still check the references once I've found it, because errors can and do slip into anything these pieces of software produce, and sometimes quite large ones (those are easy to spot, though).
So yes, there is value here, and quite a bit, but it requires a lot of forethought in how you structure your prompts, and you need to be super skeptical about the output as well as able to check that output minutely.
If you would just plug in a bunch of data and formulate a query and would then use the answer in an uncritical way you're setting yourself up for a world of hurt and lost time by the time you realize you've been building your castle on quicksand.
I'm an active researcher in TCS. For me, AI has not been very helpful on technical things (or even technical writing), but has been super helpful for (1) literature reviews; (2) editing papers (e.g., changing a convention everywhere in the paper); and (3) generating Tikz figures/animations.
I did a theoretical computer science PhD a few years ago and write one or two papers a year in industry. I have not had much success getting models to come up with novel ideas or even prove theorems, but I have had some success asking them to prove smaller and narrower results and using them as an assistant to read papers (why are they proving this result, what is this notation they're using, expand this step of their proof, etc). Asking it to find any bugs in a draft before Arxiving also usually turns up some minor things to clarify.
Overall: useful, but not yet particularly "accelerating" for me.
I do / have done research in building deep learning models and custom / novel attention layers, architectures, etc., and AI (ChatGPT) is tremendously helpful in facilitating (semantic) search for papers in areas where you may not quite know the magic key words / terminology for what you are looking for. It is also very good at linking you to ideas / papers that you might not have realized were related.
I also found it can be helpful when exploring your mathematical intuitions about something, e.g. how a dropout layer might affect learned weights and matrix properties, etc. Sometimes it will find some obscure rigorous math that can be very enlightening or relevant to correcting clumsy intuitions.
I talk to them (math research in algebraic geometry), but they're not really helpful outside of literature search, unfortunately. Others around me get a lot more utility, so it varies. (The most powerful models I tried were Gemini 2.5 Deep Think and Gemini 3.0 Pro.) Not sure if the new GPTs are much better.
I work in quantum computing. There is quite a lot of material about quantum computing out there that these LLMs must have been trained on. I have tried a few different ones, but they all start spouting nonsense about anything that is not super basic.
But maybe that is just me. I have read some of Terence Tao's transcripts, and the questions he asks LLMs are higher complexity than what I ask. Yet, he often gets reasonable answers. I don't yet know how I can get these tools to do better.
This often feels like an annoying question to ask, but what models were you using?
The difference between free ChatGPT, GPT-5.2 Thinking, and GPT-5.2 Pro is enormous for areas like logic and math. Often the answer to bad results is just to use a better model.
Additionally, sometimes when I get bad results I just ask the question again with a slightly rephrased prompt. Often this is enough to nudge the models in the right direction (and perhaps get a luckier response in the process). However, if you are just looking at a link to a chat transcript, this may not be clear.
I have an openrouter account, so I can try different models easily. I have tried Sonnet, Opus, various versions of GPT, and Deepseek. There are certainly differences in quality. I also rephrase prompts all the time. But ultimately, I can't quite get them to work in quantum computing. It's far easier to get them to answer coding or writing related questions.
"I don't yet know how I can get these tools to do better."
I have wondered if he has access to a better model than I, the way some people get promotional merchandise. A year or two ago he was saying the models were as good as an average math grad student when to me they were like a bad undergrad. In the current models I don't get solutions to new problems. I guess we could do some debugging and try prompting our models with this Erdos problem and see how far we get. (edit: Or maybe not; I guess LLMs search the web now.)
I’m a hobbyist math guy (with a math degree) and LLMs can at least talk a little talk or entertain random attempts at proofs I make. In general they rebuke my more wild attempts, and will lead me to well-trodden answers for solved problems. I generally enjoy (as a hobby) finding fun or surprising solutions to basic problems more than solving novel maths, so LLMs are fun for me.
As the other person said, Deep Research is invaluable; but generating hypotheses is not as good at the true bleeding edge of research. The OG ChatGPT 4.0 with no guardrails briefly generated outrageously amazing hypotheses that actually made sense. After that they have all been neutered beyond use in this direction.
My experience has been mixed. Honestly though, talking to AI and discussing a problem with it is better than doing nothing and just procrastinating. It's mostly wrong, but the conversation helps me think. In the end, once my patience runs out and my own mind has been "refreshed" through the conversation (even if it was frustrating), I can work on it myself. Some bits of the conversation will help but the "one-shot" doesn't exist. tldr: ai chatbots can get you going, and may be better than just postponing and procrastinating over the problem you're trying to solve.
- Minor nit: The documentation mentions "uvx aristotlelib@latest aristotle" but that doesn't work; it should be "uvx --from aristotlelib@latest aristotle"
- It took me a minute or two of clicking around to figure out that the (only?) way to use it is to create an API key, then start aristotle in the terminal and interact with it there. It could be more obvious I think.
Edit: I just realized from https://news.ycombinator.com/item?id=46296801 that you're the CEO! - in that case maybe you, or whoever you think most appropriate from your organization, could submit it along with a text description of what it is, and what is the easiest and/or most fun way to try it out?
Absolutely - you've made something new that you want to show us and that we can try out. dang once posted some tips about making these types of submissions, which I'd recommend reading first: https://news.ycombinator.com/item?id=22336638
This is great, there is still so much potential in AI once we move beyond LLMs to specialized approaches like this.
EDIT: Look at all the people below just reacting to the headline and clearly not reading the posts. Aristotle (https://arxiv.org/abs/2510.01346) is key here folks.
EDIT2: It is clear many of the people below don't even understand basic terminology. Something being a transformer doesn't make it an LLM (vision transformers, anyone?), and if you aren't training on language (e.g. AlphaFold, or Aristotle on Lean material), it isn't a "language" model.
> It is clear many of the people below don't even understand basic terminology. Something being a transformer doesn't make it an LLM (vision transformers, anyone?), and if you aren't training on language (e.g. AlphaFold, or Aristotle on Lean material), it isn't a "language" model.
I think it's because it comes off as if you are saying that we should move off of GenAI, and a lot of people use LLM when they mean GenAI.
Ugh, you're right. This was not intended. Conflating LLMs with GenAI is a serious error, but you're right, it is obviously a far more common error than I realized. I clearly should have said "move beyond solely LLMs" or "move beyond LLMs in isolation", perhaps this would have avoided the confusion.
This is a really hopeful result for GenAI (fitting deep models tuned by gradient descent on large amounts of data), and IMO this is possible because of specific domain knowledge and approaches that aren't there in the usual LLM approaches.
"Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver"
It is far more than an LLM, and math != "language".
> Aristotle integrates three main components (...)
The second one being backed by a model.
> It is far more than an LLM
It's an LLM with a bunch of tools around it, and a slightly different runtime than ChatGPT. It's "only" that, but people - even here, of all places - keep underestimating just how much power there is in that.
Transformer != LLM. See my edited top-level post. Just because Aristotle uses a transformer doesn't mean it is an LLM, just as Vision Transformers and AlphaFold use transformers but are not LLMs.
LLM = Large Language Model. "Large" refers to both the number of parameters (and, in practice, depth) of the model and, implicitly, the amount of data used for training, and "language" means human (i.e. written, spoken) language. A Vision Transformer is not an LLM because it is trained on images, and AlphaFold is not an LLM because it is trained on molecular configurations.
Aristotle works heavily with formalized Lean statements and expressions. While you can certainly argue this is a language of sorts, it is not at all the same "language" as the "language" in LLMs. Calling Aristotle an "LLM" just because it has a transformer is more misleading than truthful, because every other single aspect of it is far more clever and involved.
Sigh. If you start with a pre-trained LLM architecture, and then do extensive further training / fine-tuning with different data and loss functions, custom similarity metrics for specialized search, specialized training procedures, and feedback from other automated systems, you are far, far more than an LLM. That's the point. Calling something like this an LLM is as deeply misleading as calling AlphaFold an LLM. These tools go far beyond simple LLMs. The special losses and metrics are really that important here and are why these tools can be so game-changing.
In this context, we're not even talking about "math" (as a broad, abstract concept). We're strictly talking about converting English to Lean. Both are just languages. Lean isn't just something that can be a language. It's a language.
There is no reason or framing where you can say Aristotle isn't a language model.
That's true, and a good fundamental point. But here it's much simpler than that: math is a language the same way code is, and if there's one thing LLMs excel at, it's reading and writing code and translating back and forth between code and natural language.
"Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver"
It is far more than an LLM, and math != "language".
1. "The search algorithm is a highly parallel Monte Carlo Graph Search (MCGS) using a large transformer as its policy and value functon." ... "We use a generative policy to take progressively widened [7] samples from the large action space of Lean tactics, conditioning on the Lean proof state, proof history, and, if available, an informal proof. We use the same model and prompt (up to a task token) to compute the value function which guides the search."
See that 'large transformer' phrase? That's where the LLM is involved.
2. "A lemma-based informal reasoning system which generates informal proofs of mathematical state-ments, breaks these proofs down into lemmas, formalizes each lemma into Lean, and iterates this process based on formal feedback" ... "First, the actions it generates consist of informal comments in addition to Lean tactics. Second, it uses a hidden chain of thought with a dynamically set thinking budget before predicting an action."
Unless you're proposing that this team solved AGI, "chain of thought" is a specific term of art in LLMs.
3. "A geometry solver which solves plane geometry problems outside of Lean using an approach based on AlphaGeometry [45]." ... following the reference: "AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. "
AlphaGeometry, like all of Deepmind's Alpha tools, is an LLM.
Instead of accusing people of not reading the paper, perhaps you should put some thought into what the things in the paper actually represent.
If you think "transformer" = LLM, you don't understand the basic terminology of the field. This is like calling AlphaFold an LLM because it uses a transformer.
No, it isn't. They call out ExIt as an inspiration as well as AlphaZero, and the implementation of these things (available in many of their authors' papers) is almost indistinguishable from LLMs. The architecture isn't novel, which is why this paper is about the pipeline instead of about any of the actual processing tools. Getting prickly about meaningless terminology differences is definitely your right, but for anyone who isn't trying to define a policy algorithm for a transformer network, the difference is immaterial to understanding the computation involved.
Equating LLMs and transformers is not a meaningless terminology difference at all, Aristotle is so different from the things people call LLMs in terms of training data, loss function, and training that this is a grievous error.
Very cool to see how far things have come with this technology!
Please remember that this is a theorem about integers that is subject to a fairly elementary proof that is well-supported by the existing Mathlib infrastructure. It seems that the AI relies on the symbolic proof checker, and the proofs that it is checking don't use very complex definitions in this result. In my experience, proofs like this which are one step removed from existing infra are much much more likely to work.
2026 should be interesting. This stuff is not magic, and progress is always going to be gradual with solutions to less interesting or "easier" problems first, but I think we're going to see more milestones like this with AI able to chip away around the edges of unsolved mathematics. Of course, that will require a lot of human expertise too: even this one was only "solved more or less autonomously by AI (after some feedback from an initial attempt)".
People are still going to be moving the goalposts on this and claiming it's not all that impressive, or that the solution must have been in the training data, or something, but at this point that's getting uncomfortably close to arguing that Terence Tao doesn't know what he's talking about, which, to say the least, is a rather perilous position.
At this point, I think I'm making a belated New Year's resolution to stop arguing with people who are still saying that LLMs are stochastic parrots that just remix their training data and can never come up with anything novel. I think that discussion is now dead. There are lots of fascinating issues to work out in how we can best apply LLMs to interesting problems (or get them to write good code), but to even start solving those issues you have to accept that they are at least somewhat capable of doing novel things.
In 2023 I would have bet hard against us getting to this point ("there's no way chatbots can actually reason their way through novel math!"), but here we are, three years later. I wonder what comes next?
Uh, this was exactly a "remix" of similar proofs that most likely were in the training data. It's just that some people misunderestimate how compelling that "remix" ability can be, especially when paired with a direct awareness of formal logical errors in one's attempted proof and how they might be addressed in the typical case.
> Then what sort of math problem would be a milestone for you where an AI was doing something novel?
What? If we're discussing novel synthesis, and it's being contrasted with answer-from-search / answer-from-remix.. the problem does not matter. Only the answer and the originality of the approach. Connecting two fields that were not previously connected is novel, or applying a new kind of technique to an old problem. Recognizing that an unsolved problem is very much like a solved one is search / remix. So what happened here? Tao says it is
> is largely consistent with other recent demonstrations of AI using existing methods to resolve Erdos problem
Existing. Methods. Tao also says "This is a demonstration of the genuine increase in capability of these tools in recent months". This is the sentence everyone will focus on, so what is that capability?
> the more interesting capability revealed by these events is the ability to rapidly write and rewrite new versions of a text as needed, even if one was not the original author of the argument.
Rejoice! But rejoice for the right reasons, and about what actually happened. Style and voice transformations, interesting new capabilities for fuzzy search. Correct usage of external tools for heavy-lifting with symbolics. And yes, actual problem solving. Novel techniques, creativity, originality though? IDK, sounds kind of optimistic based on the detail here.
If you squint hard enough, every new thing is an example of "answer-from-search / answer-from-remix". Solving any Erdős problem in this manner was largely seen as unthinkable just a year ago.
>the problem does not matter.
Really? All of the other Erdős problems? Millennium Problems? Anything at all? This gets us directly into the territory of "nothing can convince us otherwise".
The goalposts are still the same. We want to be able to independently verify that an AI can do something instead of just hearing such a claim from a corporation that is absolutely willing to lie through their teeth if it gets them money.
Not disagreeing with you, but I don't think Tao is blowing this out of proportion either. I think it's a pretty reasonable way of saying, "Hey, AI is now capable of something it wasn't able to do before".
Digging through the PDFs on Google Drive, this seems to be (one of) the generated proofs. I may be misunderstanding something, but 1400 lines of AI-generated code seems a very good place for some mistake in the translation to sneak in https://github.com/plby/lean-proofs/blob/main/src/v4.24.0/Er...
Though I suppose if the problem statement in Lean is human-generated and there are no ways to "cheat" in a Lean proof, the proof could be trusted without understanding it
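(For what it's worth, the usual ways to "cheat" in a Lean proof are mechanically detectable. A minimal sketch, assuming a standard Lean 4 setup; the theorem names are made up and the printed output is paraphrased:

  -- A "proof" that leans on sorry still elaborates, but only with a warning,
  -- and the sorry shows up when you audit the axioms the theorem depends on.
  theorem cheated : 1 = 2 := sorry
  #print axioms cheated    -- reports a dependency on sorryAx

  -- A genuine proof of a decidable fact needs no extra axioms at all.
  theorem honest : 1 + 1 = 2 := rfl
  #print axioms honest     -- reports no axiom dependencies

So a reviewer who trusts the kernel mostly has to audit the statement, plus check that no suspicious axioms were smuggled in.)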
The Erdős problems website says the theorem is formalized in Lean, but in the mathlib project there is just the theorem statement with a sorry. Does anyone know where I can find the Lean proof? Maybe it's in some random pull request I didn't find.
When Deep Blue beat Kasparov, it did not end the careers of human players. But since mathematics is not a sport with human players, what are the career prospects for mathematicians or mathematics-like fields?
1. This result is very far from showing something like "human mathematicians are no longer needed to advance mathematics".
2. Even if it did show that, as long as we need humans trained in understanding maths, since "professional mathematicians" are mostly educators, they probably aren't going anywhere.
I wouldn't say professional mathematicians are mostly educators. The educating that mathematicians do even at graduate level to non-future-mathematicians can mostly be done (not fully at parity due to depth of understanding that we accumulate but close) by non professional mathematicians. Most of the education is to other current/future mathematicians in my limited opinion.
> ... are mostly educators, they probably aren't going anywhere
The educator business has survived so far only because it provided in-person interactive knowledge transfer and credentials - neither of which was possible with static sources of knowledge such as libraries and the internet. But now all of that is possible without the involvement of human teachers.
Tao's broad project, which he has spoken about a few times, is for mathematics to move beyond the current game of solving individual theorems to being able to make statements about broad categories of problems. So not 'X property is true for this specific magma' but 'X property is true for all possible magmas', as an example I just came up with. He has experimented with this via crowdsourcing problems in a given domain on GitHub before, and I think the implications of how to use AI here are obvious.
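A Lean-flavoured sketch of that distinction (illustrative only; the names and laws here are made up, not taken from Tao's actual projects):

  -- A statement about one particular structure: ℕ with addition is commutative.
  example : ∀ a b : Nat, a + b = b + a := Nat.add_comm

  -- A statement about *every* magma satisfying a given law: on any carrier type,
  -- any idempotent operation also satisfies op (op x x) (op x x) = x.
  example (M : Type) (op : M → M → M) (idem : ∀ x, op x x = x) :
      ∀ x, op (op x x) (op x x) = x := by
    intro x
    simp [idem]

The second kind of statement is quantified over the structure itself, which is exactly the shift from "this magma" to "all possible magmas".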
I remember seeing a documentary with a bit about some guy whose life's work was computing pi to 30 digits. Imagine all that time to do what my computer can do in less than a second, plus a day or two to write the code using the algorithm he used (10 minutes if you use Newton's).
There is an ongoing effort to formalize a modern, streamlined proof of FLT in Lean, with all the needed prereqs. It's estimated that it will take approx. 5 years, but perhaps AI will lead to some meaningful speedup.
What I'm hoping to see is high volume automated formalization of the math literature, with the goal of formalizing (or finding flaws in) the entire thing.
And once we have that formalized corpus, it's all set up as training data for moving forward.
We can't really have across-the-board formalization of the math literature without getting the basics done first (including the whole undergrad curriculum) which is what the mathlib folks are working on. It will in fact be interesting to see if AI can meaningfully speed up that work (although they seem to be bottlenecked on review and merging at the moment, not new contribs per se. So a "coding" AI workflow may be a bit of a closer fit.)
It classifies the advancements based on the level of AI input. In particular, the entry in Table 1 related to the original post has both a green and yellow light, reflecting the skepticism from others.
We need you to stop posting shallow dismissals and cynical, curmudgeonly, and snarky comments.
We asked you about this just recently, but it's still most of what you're posting. You're making the site worse by doing this, right at the point where it's most vulnerable these days.
Your comment here is a shallow dismissal of exactly the type the HN guidelines ask users to avoid here:
Predictably, it led to by far the worst subthread on this article. That's not cool. I don't want to ban you because you're also occasionally posting good comments that don't fit these negative categories, but we need you to fix this and stop degrading the threads.
I respect your awareness of that, which I'm sure is much broader than mine is. HN consumes my attention; I'm all depth and no breadth.
What I'm interested in is how well HN does at fulfilling its own mandate in its own terms. On that scale, it's getting worse—in this respect, at least, which is a big one. We're going to do something about it, the same way we've always tried to stave off the decline of this place (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...).
I think it's both external and internal, but the internal factors are more important because a compromised immune system is more vulnerable to outer pathogens.
I'm not sure it's about HN getting larger, though. It's a bit hard to tell, but at least some of the upswing in cynical and curmudgeonly comments is coming from established users.
Whether powered by human or computer, it is usually easier (and requires far fewer resources) to verify a specific proof than to search for a proof to a problem.
Professors elsewhere can verify the proof, but not how it was obtained. My assumption was that the focus here is on how "AI" obtains the proof and not on whether it is correct. There is no way to reproduce this experiment in an unbiased, non-corporate, academic setting.
It seems to me that in your view the sheer openness to evaluate LLM use, anecdotally or otherwise, is already a bias.
I don't see how that's sensible, given that to evaluate the utility of something, it's necessary to accept the possibility of that utility existing in the first place.
On the other hand, if this is not just me strawmanning you, your rejection of such a possibility is absolutely a bias, and it inhibits exploration.
Willfully conflating "finding such an exploration illegitimate" with "finding the conclusions of someone who thinks otherwise illegitimate" strikes me as extremely deceptive. I don't appreciate being forced to engage with someone else's opinion covertly laundered in like that. And no, Tao's comments do not meet this criterion, as his position is not covert but explicit.
> ... Also, I would not put it past OpenAI to drag up a similar proof using ChatGPT, refine it and pretend that ChatGPT found it. ...
That's the best part! They don't even need to, because ChatGPT will happily do its own private "literature search" and then not tell you about it - even Terence Tao has freely admitted as much in his previous comments on the topic. So we can at least afford to be a bit less curmudgeonly and cynical about that specific dynamic: we've literally seen it happen.
> ChatGPT will happily do its own private "literature search" and then not tell you about it
Also known as model inference. This is not something "private" or secret [*]. AI models are lossily compressed data stores and always will be. The model doesn't report on such "searches" because they are not actual searches driven by model output, but just the regular operation of the model driven by the inference engine used.
> even Terence Tao has freely admitted as much
Bit of a (willfully?) misleading way of saying they actively looked for it on a best effort basis, isn't it?
[*] A valid point of criticism would be that the training data is kept private for the proprietary models Tao and co. using, so source finding becomes a goose chase with no definitive end to it.
A counterpoint that I think is valid, however, is that if locating such literature content is so difficult for subject matter experts, then the model being able to "do so" is in itself a demonstration of value - even if the model is not able to venture a backreference, by virtue of that not being an actual search.
This is reflected in many other walks of life too. One of my long held ideas regarding UX for example is that features users are not able to find "do not exist".
It genuinely seemed to me that they were looking for empirical reproductions of a formal proof, which is a nonsensical demand and objection given what formal proofs are. My question was spurred on by this and genuine.
It may be that there wasn't enough information in your comment for me to read its intent correctly. I thought you were taking a snarky swipe at the other commenter—especially because most people on HN can be presumed to know what a formal proof is.
Thank you, I'll try to keep it in mind. I'll admit that the curtness of my original question was not just you misreading it, but it did (also) come from a place of genuine confusion.
For what it's worth, it's not even that I don't see merit to their points. I'm just unable to trust that they're being genuine, not the least for how they conduct themselves (which I only fault them for so much). This also impacts my ability to reason about their points clearly.
Sadly, I'm not able to pitch any systematic solutions.
If you don't stop, we're going to ban you. As I said, I don't want to—but when you respond to requests to stop breaking the site guidelines by breaking them again, that's not good.
I get how it's activating and annoying when moderators show up and start fault-finding, so I can appreciate the irritation here. But really, we're just trying to have an internet forum that doesn't destroy itself. I can't imagine why you wouldn't want to contribute positively to that.
What scientific field do you reckon the regular usage of LLMs falls under? Do you genuinely think Tao was making scientific claims or just provided evidence that may eventually feed into some? It reads to me like just a plain recollection of events, an anecdotal experience.
This almost implies mathematicians aren’t some ungodly geniuses if something as absolutely dumb as an LLM can solve these problems via blind pattern matching.
Meanwhile I can’t get Claude code to fix its own shit to save my life.
There are "ungodly geniuses" within mathematics but no one is saying every mathematician is an "ungodly genius". The quality of results you get from an LLM can vary greatly depending on the environment you place it in and the context you provide it. This isn't to say it's your fault Claude Code can't fix whatever issue you're having.
To clear up a few misconceptions:
- Aristotle uses modern AI techniques heavily, including language modeling.
- Aristotle can be guided by an informal (English) proof. If the proof is correct, Aristotle has a good chance at translating it into Lean (which is a strong vote of confidence that your English proof is solid). I believe that's what happened here.
- Once a proof is formalized into Lean (assuming you have formalized the statement correctly), there is no doubt that the proof is correct. This is the core of our approach: you can do a lot of (AI-driven) search, and once you find the answer you are certain it's correct no matter how complex the solution is.
Happy to answer any questions!
The natural-language statement of the problem is (from https://www.erdosproblems.com/728):
> Let C>0 and ϵ>0 be sufficiently small. Are there infinitely many integers a,b,n with a≥ϵn and b≥ϵn such that a!b!∣n!(a+b−n)! and a+b>n+Clogn?
The Lean-language statement of the problem (which can be done either by hand or by AI) is the one at https://github.com/plby/lean-proofs/blob/f44d8c0e433ab285541...
Yes on the one hand, one needs to know enough about Lean to be sure that this formulation matches what we intend, and isn't stating something trivial. But on the other hand, this is not as hard as finding an error on some obscure line of a long proof.
(There's also an older formulation at https://github.com/google-deepmind/formal-conjectures/blob/f... but the new one is more in the spirit of what was intended: see the discussion starting at https://www.erdosproblems.com/forum/thread/728#post-2196 which gives a clear picture, as of course does Tao's thread in the OP that summarizes this discussion.)
Given a formal statement of what you want, Lean can validate that the steps in a (tedious) machine-readable purported proof are valid and imply the result from accepted axioms. This is not AI, but a tiny, well reviewed kernel that only accepts correct formal logic arguments.
So, if you have a formal statement that you've verified to represent what you are interested in by some other means, Lean can tell you whether the proof created by genAI is correct. Basically, there is a nigh infallible checker that won't accept incorrect hallucinations.
> Let C>0 and ϵ>0 be sufficiently small. Are there infinitely many integers a,b,n with a≥ϵn and b≥ϵn such that a!b!∣n!(a+b−n)! and a+b>n+Clogn?
Which seems like it's the type of thing you give as a homework problem to state formally in an intro class.
Let's just be generous and try to accept these differences.
I can't speak for ndriscoll, but I am a university math professor with extensive experience teaching these sorts of topics, and I agree with their comment in full.
You are right that some (other) statements are harder to formalize than they look. The Four Color Theorem from graph theory is an example. Generally speaking, discrete math, inequalities, for all/there exists, etc. are all easy to formalize. Anything involving geometry or topology is liable to be harder. For example, the Jordan curve theorem states that "any plane simple closed curve divides the plane into two regions, the interior and the exterior". As anyone who has slogged through an intro topology book knows, statements like this take more work to make precise (and still more to prove).
When someone takes the time to explain undergrad-level concepts in a comment, responding with "are you an expert?" is a level of skepticism that's bordering on hostile. The person you're responding to is correct, it's rare that the theorem statement itself is particularly hard to formalize. Whatever you read likely refers to the difficulty of formalizing a proof.
That's very dependent on the problem area. For example, there's a gap between a high school explanation of the central limit theorem and an actual formalization of it. And when dealing with Turing machines, sometimes you'll say that something grows, e.g., Omega(n), when what actually happens is that there's some subsequence of inputs for which it does. Generally, for complexity theory, plain-language explanations can be very vague because of how insensitive the theory is to small changes, and you need to operate on a higher level of abstraction to have a chance of explaining a proof in reasonable time.
On the other hand, many problems in number theory and discrete structures tend to be rather simple to formalize. If you want to take your own look at that, I'd recommend checking out the Lean games [1]. I'd say after the natural numbers game, you most likely know enough Lean to write that problem down, with the "sufficiently small" being the trickiest part (a sketch of one way to spell that out follows below).
1: https://adam.math.hhu.de/
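For a flavour of what the "sufficiently small" part can look like once spelled out, here is a minimal Lean sketch (illustrative only - it is not the formulation used for this Erdős problem, and it assumes a Mathlib import for the reals):

  import Mathlib

  -- "P holds for every sufficiently small ε > 0", written with an explicit
  -- threshold ε₀ rather than Mathlib's filter notation.
  def ForAllSufficientlySmall (P : ℝ → Prop) : Prop :=
    ∃ ε₀ : ℝ, 0 < ε₀ ∧ ∀ ε : ℝ, 0 < ε → ε < ε₀ → P ε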
These are the two main problems:
1. Formalizing a theorem.
2. Finding a formal proof.
Part 2 is where AI could help as proof search is full of heuristics. That's also how humans find proofs and is one of the main skills of a mathematician. The formal proof can then be machine checked with well known and mature techniques not involving AI.
Part 1 is the part that's missing and will always be hard. It's also the issue with formal verification of programs for which correctness criteria are often very complex and it's easy to mess up the formalization, so that even if you trust the proof, you can't trust that it proves the right thing.
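A toy Lean example of the split, using only core Lean and made-up names: part 1 is writing the statement, which typechecks even with the proof left as sorry; part 2 is replacing the sorry with an argument the kernel accepts.

  -- Part 1: formalizing the claim. This compiles (with a warning) before
  -- anyone has found a proof.
  theorem my_claim (a b : Nat) : a + b = b + a := sorry

  -- Part 2: finding a formal proof. This is where search (human or AI) comes in;
  -- the kernel then checks the finished term.
  theorem my_claim_proved (a b : Nat) : a + b = b + a := Nat.add_comm a b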
Also, on the verification side: there is a window of failure where Lean itself has a hidden bug. And with automated systems that seek correctness, the risk is slightly elevated that some overlooked bug gets exploited in the dev-check-dev loop run by the AI.
The lean proof checker then checks to make sure the proof actually proves the statement.
In this case an AI is generating the proof, but if it "compiles" it's correct. The only thing humans need to check is the statement to be proven.
(I don't know anything about this project, but I've played around with Lean and used other proof checkers more seriously.)
It seems one side of this argument desperately needs AI to have failed, and the other side is just saying that it probably worked but it is not as important as presented, that it is actually just a very cool working methodology going forward.
Sure. But it's fair to ask how to validate that assumption.
If it is easier to convert backwards, maybe the AI can at least describe what the equations mean…
A little bit of nitpicking, but according to books like AIMA, that is indeed AI. In the first chapter, even any control system is classified as AI.
Because of the reasons stated in the 1st chapter, I totally agree with the authors.
The whole system is AI. That part is a verifier in a chain of "suggestion/instinct -> verifier", like the neurosymbolic systems used for automated driving, for example.
1. you write a proof in English that there is an infinite number of primes. 2. the llm writes 2+2=4 in lean. 3. lean confirms that this is correct and it's impossible that this proof is wrong.
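To make that concrete: Lean only checks the statement as literally written, and the name attached to it carries no meaning, so a mislabeled triviality sails straight through (a deliberately silly sketch):

  -- The name promises the infinitude of primes; the statement says nothing of
  -- the sort. The kernel accepts it, because it checks proof-against-statement,
  -- not statement-against-intent.
  theorem infinitely_many_primes : 2 + 2 = 4 := rfl

Which is exactly why the formal statement, not the proof, is the thing humans have to audit.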
Aristotle is doing the matching AI part, the modern LLM approach, previously called fuzzy logic.
Both are AI.
To me "AI" is machine learning, statistical algorithms trained on data. That's not true for Lean.
However, there are some good heuristics. If you expect a problem to be hard and the proof is very short, you've probably missed something!
https://www.reddit.com/r/singularity/comments/1pv3nl3/commen...
This is also a big risk when trying to prove code correctness: "prove this algo works" means you gotta define "works" along certain axes, and if you're very unlucky you might have a proof that exploits the uncertainty around a certain axis.
My limited understanding of Rust is that it applies a fixed set of rules to guarantee memory safety. The rules are somewhat simple and limiting, for ease of understanding and implementation, but also because of undecidability.
Programmers run into situations where they know that their code won't cause memory errors, but it doesn't follow the rules. Wouldn't it be cool if something like Aristotle were integrated into the compiler? Any code for which a proof of correctness could be written would pass/compile, without having to add more and more rules.
Formal methods are cool because, by contrast to tools like the borrow checker, you can prove some very "nonlocal" properties: this system does not deadlock, or it makes progress at least every N steps, etc.
If the proof is rooted in the understanding of a programmer who can give proof hints to the prover, then any modification of the program can be accompanied by a modification of the hints, still allowing automatic proofs. But if the human has no clue, then the automatic system can get stuck without the human having a chance to help it along.
Eg you can give the vague spec 'build me a todo list app', but you can still formally prove that everything your app does finishes, or even that it finishes in reasonable time.
Now this captures some errors, but it doesn't really capture high-level ones ("is this program guaranteed to not deadlock" is a hard one), and it doesn't capture the one that is important for business purposes (does this do what the customer wants). That requirement is more important than correctness (witness all the software that is described as "crap" but is nonetheless widely used).
I don't think this is a required key to unlocking vibe coding. That seems to be easy: does this provide business value? And there the answer seems roughly to be "yes".
Sometimes when I'm using new LLMs I'm not sure if it’s a step forward or just benchmark hacking, but formalized math results always show that the progress is real and huge.
When do you think Harmonic will reach formalizing most (even hard) human written math?
I saw an interview with Christian Szegedy (your competitor I guess) that he believes it will be this year.
In other fields (topology, probability, linear algebra), many key definitions are not in Mathlib yet, so you will struggle to write down the theorem itself. (But in some cases, Aristotle can define the structure you are talking about on the fly!)
This is not an intrinsic limitation of Lean; it's just that nobody has taken the time to formalize much of those fields yet. We hope to dramatically accelerate this process by making it trivial to prove lemmas, which make up much of the work. For now, I still think humans should write the key definitions and statements of "central theorems" in a field, to ensure they are compatible with the rest of the library.
Do you have any plans to characterize these cases more fully, and perhaps propose your own contributions to mathlib itself on that basis?
How does this depend on the area of mathematics of the proof? I was under the impression that it was still difficult to formalize most research areas, even for a human. How close is Aristotle to this frontier?
That's a pretty big assumption, though, isn't it? As we saw with the Navier-Stokes psychosis episode over the New Year holiday, formalizing correctly really isn't guaranteed.
It's another reformulation rather than a true proof. Instead of wanting a proof of a theorem, we now just need to prove that this proof actually proves the theorem. The proof itself is so incomprehensible that it can't be trusted on its own, but if it can be shown to be trustworthy, then the theorem must be true.
Claude Code can write lean, but we do a heck of a lot of RL on theorem proving, so Aristotle winds up being much better at writing Lean than other coding agents are.
Have you tried Aristotle on other, non-Lean tasks? Is it better at logical reasoning in general?
Translate an informal description of the proof into this Lean?
Do you have any links to reading about how often lean core has soundness bugs or mathlib has correctness bugs?
Second, when you say language modeling support, does it mean it can better understand code representations (ASTs), or something else? I am just an AI user, not very knowledgeable in the field. My main interest is whether it would be useful for static analysis oriented toward computer security (SAST).
[1] https://aristotle.ai/
I don’t think it can really be said to have occurred autonomously then?
Looks more like a 50/50 partnership with a super-expert human on the one side, which makes this way more vague in my opinion - and in line with my own AI tests, i.e. they are pretty stupid, even Opus 4.5 or whatever, unless you're already an expert and are doing boilerplate.
EDIT: I can see the title has now been changed from "solved" to "more or less solved", which I still think is a big stretch.
Seems exactly like the tests at my company, where even frontier models are revealed to be very expensive rubber ducks that completely fail with non-experts or anything novel or math heavy.
I.e. they mirror the intellect of the user but give you big dopamine hits that'll lead you astray.
But speaking as a specialist in theorem proving, this result is pretty impressive! It would have likely taken me a lot longer to formalize this result even if it was in my area of specialty.
How did you arrive at "ridiculous"? What we're seeing here is incredible progress over what we had a year ago. Even ARC-AGI-2 is now at over 50%. Given that this sort of process is also being applied to AI development itself, it's really not clear to me that humans would be a valuable component in knowledge work for much longer.
To me that doesn't sound qualitatively different from a PhD student. Are they just cognitive augmentation for their mentor?
In any case, I wasn't trying to argue that this system as-is is AGI, but just that it's no longer "ridiculous", and that this to me looks like a herald of AGI, as the portion being done by humans gets smaller and smaller
>The real world is not composed of rewards and punishments.
Most "AGI advocates" say that AGI is coming, sooner rather than later, and it will fundamentally reshape our world. On its own that's purely descriptive. In my experience, most of the alleged "smiting" comes from the skeptics simply being wrong about this. Rarely there's talk of explicit rewards and punishments.
To me, this sounds like when we first went to the moon, and people were sure we'd be on Mars by the end of the 80s.
> Even ARC-AGI-2 is now at over 50%.
Any measure of "are we close to AGI" is as scientifically meaningful as "are we close to a warp drive" because all anyone has to go on at this point is pure speculation. In my opinion, we should all strive to be better scientists and think more carefully about what an observation is supposed to mean before we tout it as evidence. Despite the name, there is no evidence that ARC-AGI tests for AGI.
Unlike space colonisation, there are immediate economic rewards from producing even modest improvements in AI models. As such, we should expect much faster progress in AI than space colonisation.
But it could still turn out the same way, for all we know. I just think that's unlikely.
https://en.wikipedia.org/wiki/Parkinson%27s_law
AI runs on computers. Consider the undecidability of Rice's theorem, where compiled code for non-trivial statements may or may not be error free. Even an AI can't guarantee its compiled code is error free - not because it wouldn't write sufficient code to solve a problem, but because the code it writes is bounded by other externalities. Undecidability in general makes the dream of generative AI considerably more challenging than how it's being "sold".
I work at an insurance company and I can’t see AI replacing even 10% of the employees here. Too much of what we do is locked up in decades-old proprietary databases that cannot be replaced for legal reasons. We still rely on paper mail for a huge amount of communication with policyholders. The decisions we make on a daily basis can’t be trusted to AI for legal reasons. If AI caused even a 1% increase in false rejections of claims it would be an enormous liability issue.
And regarding your example of an insurance company, I'm not sure about that industry, but seeing the transformation of banking over the last decade to fully digital providers like Revolut, I would expect similar disruption there.
Recent case:
I have a bar with a number of weights supported on either end:
|---+-+-//-+-+---|
What order and/or arrangement or of removing the weights would cause the least shift in center-of-mass? There is a non-obvious trick that you can pull here to reduce the shift considerably and I was curious if the AI would spot it or not but even after lots of prompting it just circled around the obvious solutions rather than to make a leap outside of that box and come up with a solution that is better in every case.
I wonder what the cause of that kind of blindness is.
My guess is: first move the weights to the middle, and only then remove them.
However “weights” and “bar” might confuse both machines and people into thinking that this is related to weight lifting, where there’s two stops on the bar preventing the weights from being moved to the middle.
And no, the problem is not 'not clearly stated'. It is complete as it is and you are wrong about your guess.
And if machines and people think this is related to weight lifting then they're free to ask follow up questions. But even in the weight lifting case the answer is the same.
Yeah, LLMs have a tendency to run with some interpretation of a question without asking follow-up questions. Probably, it's a consequence of RLHFing them in that way.
If you do solve it don't post the answer.
You didn't even label your ASCII art, so I've no clue what you mean, are the bars at the end the supports or weights? Can I only remove one weight at a time? Initially I assumed you mean a weightlifting bar the weights on which can only be removed from its ends. Is that the case or what? What's the double slash in the middle?
Also: "what order and/or arrangement or of removing the weights" this isn't even correct English. Arrangement of removing the weights? State the problem clearly, from first principles, like you were talking to a 5 year old.
The sibling comment is correct, you're clearly picturing something in your mind that you're failing to properly describe. It seems obvious to you, but it's not.
There is a starting node (L_0, R_0, {}) and an ending node ({}, {}, W), with the latter having L = R = {}.
I think you're trying to find the path (L_n, R_n, S_n) from the starting node to the ending node that minimises the maximum absolute value of c(L_n, R_n, S_n).
I won't post a solution, as requested.
"Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver"
Not saying it's not an amazing setup, I just don't understand the word "AI" being used like this when it's the setup / system that's brilliant in conjunction with absolute experts.
https://en.wikipedia.org/wiki/Dartmouth_workshop
AI != AGI != neural networks != LLMs
But Tao did mention ChatGPT, so I believe LLMs were involved at least partially.
This hits so close to home. Just today in my field, a manager without expertise in a topic gave me an AI solution to something I have expertise in. The AI was plainly and painfully wrong, but it came down to the user prompting really poorly. When I gave a well-formulated prompt on the same topic, I got the correct answer on the first go.
> There seems to be some confusion on this so let me clear this up. No, after the model gave its original response, I then proceeded to ask it if it could solve the problem with C=k/logN arbitrarily large. It then identified for itself what both I and Tao noticed about it throwing away k!, and subsequently repaired its proof. I did not need to provide that observation.
so it was literally "yo, your proof is weak!" - "naah, watch this! [proceeds to give full proof all on its own]"
I'd say that counts
"solved more or less autonomously by AI" were Tao's exact words, so I think we can trust his judgment about how much work he or the AI did, and how this indicates a meaningful increase in capabilities.
Or someone like Terence Tao might figure out how to wield AI better than anyone else, even the labs, and use the tools to take a bunch down at once. I'm excited to see what's coming this year.
To a layman, that doesn't sound very AI-like. Surely there must be a dozen algorithms to effectively search this space already, given that mathematics is pretty logical?
Isabelle has had the "Sledgehammer" tool for quite a while [1]. It uses solvers like Z3 to search and apply a catalog of proof strategies, and then tries to construct a proof for your main goal or any remaining subgoals you have to complete. It's not perfect, but it's remarkably useful (even if it does sometimes give you proofs that import like ten different libraries and are hard to read).
I think Coq has Coqhammer but I haven't played with that one yet.
[1] https://isabelle.in.tum.de/dist/doc/sledgehammer.pdf
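For readers coming from the Lean side of this thread: Lean 4 / Mathlib has a roughly analogous workflow in tactics like exact?, apply? and aesop, which search the library for a closing step. A minimal sketch (the example lemma is arbitrary, chosen only for illustration):

-- Ask Lean's library-search automation to close a goal and report what it used.
example (a b : Nat) : a + b = b + a := by
  exact?   -- finds and applies `Nat.add_comm a b`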
Sorry for my probably senseless questions; I'm trying to map the computing model of math solvers onto common PL semantics. Probably there is better overview literature. I'd like to get an overview of proof-system runtime semantics for later use.
2. Is there an equivalent of fuzz testing (of computing systems) in math, say to construct the general proof framework?
3. Or: how are proof frameworks (based on ideas of how the proof could work) constructed?
4. Do I understand correctly that math in proof systems works with term rewrite systems plus the chosen theory/logic as the computing model of valid representations and operations? And how is the step semantics then formally defined?
2. there is no exact analogue of fuzz-testing bytecode for "theorem fuzzing", but arguably the closest match is counterexample generators - model finders, finite-model search, SMT instantiation with concrete valuations, all serve a role similar to fuzzers by finding invalid conjectures quickly.
these are weaker than fuzzing for finding execution bugs because mathematical statements are higher-level and provers operate symbolically.
3. here's how proof frameworks are constructed, at the high-level:
a. start by picking a logical foundation: e.g., first-order logic (FOL), higher-order logic (HOL), dependent type theory (Martin-Löf/CIC).
b. define syntax (terms, types, formulas) and typing/judgment rules (inference rules or typing rules).
c. define proof objects or proof rules: natural deduction, sequent calculus, Hilbert-style axioms, or propositions-as-types (Curry-Howard).
d. pick or design the kernel: small, trusted inference engine that checks proof objects (reduction rules, conversion, rule admissibility).
e. add automation layers: tactics, decision procedures, external ATP/SMT, rewriting engines.
f. provide libraries of axioms/definitions and extraction/interpretation mechanisms (code extraction, models).
g. implement proof search strategies and heuristics (backtracking, heuristics, lemma databases, proof-producing solvers).
this is the standard engineering pipeline behind modern ITPs and automated systems.
4. yes, many proof assistants and automated provers treat computation inside proofs as:
a. term rewriting / reduction (beta-reduction, delta-unfolding, normalization) for computation oriented parts; this is the "computation" layer.
b. a separate deductive/logic layer for reasoning (inference rules, quantifier instantiation, congruence closure).
the combined model is: terms represent data/computation; rewriting gives deterministic computation semantics; logical rules govern valid inference about those terms. Dependent type theories conflate computation and proof via conversion (definitional equality) in the kernel.
5. here's how a proof step's semantics is defined:
proof steps are applications of inference rules transforming sequents/judgments. formally:
a. a judgment form J (e.g., Γ ⊢ t : T or Γ ⊢ φ) is defined.
b. inference rules are of the form: from premises J1,...,Jn infer J, written as (J1 ... Jn) / J.
c. a proof is a finite tree whose nodes are judgments; leaves are axioms or assumptions; each internal node follows an inference rule.
d. for systems with computation, a reduction relation → on terms defines definitional equality; many rules use conversion: if t →* t' and t' has form required by rule, the rule applies.
e. in type-theoretic kernels, the step semantics is checking that a proof object reduces/normalizes and that constructors/eliminators are used respecting typing/conversion; the kernel checks a small set of primitive steps (e.g., beta, iota reductions, rule applications).
operationally: a single proof step = instantiate a rule schema with substitutions for metavariables, perform any required reductions/conversions, and check side conditions (freshness, well-formedness).
equational reasoning and rewriting: a rewrite rule l → r can be applied to a term t at a position p if the subterm at p matches l under some substitution σ; result is t with that subterm replaced by σ(r). Confluence/termination properties determine global behavior.
higher-level tactics encode sequences of such primitive steps (rule applications + rewrites + searches) but are not part of kernel semantics; only the kernel rules determine validity.
relevant concise implications for mapping to PL semantics:
treat proof state as a machine state (context Γ, goal G, local proof term), tactics as programs that transform state; kernel inference rules are the atomic instruction set; rewriting/normalization are deterministic evaluation semantics used inside checks.
automation ≈ search processes over nondeterministic tactic/program choices; model finders/SMTs act as external oracles producing either counterexamples (concrete) or proof certificates (symbolic).
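To make the "rewrite rule applied at a position under a substitution" step described above concrete, here is a minimal Python sketch. It is not any real prover's kernel; the term encoding (nested tuples, "?"-prefixed strings as pattern variables) is ad hoc and purely illustrative.

# Apply a rewrite rule lhs -> rhs at a position in a term, under the
# substitution obtained by matching. Terms are nested tuples whose first
# element is the operator; strings starting with "?" are pattern variables.

def match(pattern, term, subst=None):
    # Try to match `pattern` against `term`, extending the substitution.
    if subst is None:
        subst = {}
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if isinstance(pattern, tuple) and isinstance(term, tuple) and len(pattern) == len(term):
        for p, t in zip(pattern, term):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == term else None

def substitute(term, subst):
    # Replace pattern variables in `term` according to `subst`.
    if isinstance(term, str) and term.startswith("?"):
        return subst[term]
    if isinstance(term, tuple):
        return tuple(substitute(t, subst) for t in term)
    return term

def rewrite_at(term, pos, lhs, rhs):
    # Apply lhs -> rhs at position `pos` (a list of child indices).
    if not pos:
        s = match(lhs, term)
        if s is None:
            raise ValueError("rule does not match at this position")
        return substitute(rhs, s)
    children = list(term)
    children[pos[0]] = rewrite_at(term[pos[0]], pos[1:], lhs, rhs)
    return tuple(children)

# Example: the rule add(zero, ?x) -> ?x applied inside a larger term.
print(rewrite_at(("mul", ("add", "zero", "y"), "z"), [1], ("add", "zero", "?x"), "?x"))
# -> ('mul', 'y', 'z')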
1. Yes, the Sledgehammer suite contains three testing systems that I believe are concolic (quickcheck, nitpick, nunchaku), but they only work due to hand-coded support for the mathematical constructs in question. They'd be really inefficient for a formalised software environment, because they'd be operating at a much lower-level than the abstractions of the software environment, unless dedicated support for that software environment were provided.
2. Quickcheck is a fuzz-testing framework, but it doesn't help to construct anything (except as far as it constructs examples / counterexamples). Are you thinking of something that automatically finds and defines intermediary lemmas for arbitrary areas of mathematics? Because I'm not aware of any particular work in that direction: if computers could do that, there'd be little need for mathematicians.
3. By thinking really hard, then writing down the mathematics. Same way you write computer programs, really, except there's a lot more architectural work. (Most of the time, the computer can brute-force a proof for you, so you need only choose appropriate intermediary lemmas.)
4. I can't parse the question, but I suspect you're thinking of the meta-logic / object-logic distinction. The actual steps in the term rewriting are not represented in the object logic: the meta-logic simply asserts that it's valid to perform these steps. (Not even that: it just does them, accountable to none other than itself.) Isabelle's meta-logic is software, written in a programming language called ML.
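As a concrete illustration of the counterexample-generator-as-fuzzer idea from both replies, a tiny Python sketch; the conjecture tested (Euler's polynomial n^2 + n + 41 "is always prime") is a standard classroom example, chosen only because it fails at a small value.

def is_prime(k: int) -> bool:
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True

def find_counterexample(conjecture, domain):
    # Return the first element of the domain where the conjecture is false.
    for x in domain:
        if not conjecture(x):
            return x
    return None

print(find_counterexample(lambda n: is_prime(n * n + n + 41), range(100)))  # -> 40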
Logic solvers are useful, but not tractable as a general way to approach mathematics.
To be clear, there are explicitly computationally tractable fragments of existing logics, but they're more-or-less uninteresting by definition: they often look like very simple taxonomies (i.e. purely implicational) or like a variety of "modal" and/or "multi-modal" constructions over simpler logics.
Of course it would be nice to explicitly tease out and write down the "computationally tractable" general logical reasoning that some existing style of proof is implicitly relying on (AIUI this kind of inquiry would generally fall under "synthetic mathematics", trying to find simple treatments in axiom- and rule-of-inference style for existing complex theories) but that's also difficult.
The way you use the term AI tells me that you are speaking as a layman, so what precisely are you trying to define?
It might be helpful to understand the term artificial intelligence first:
https://kemendo.com/Understand-AI.html
For all topics that can be expressed with language, the value of LLMs is shuffling things around to tease out a different perspective from the humans reading the output. This is the only realistic way to understand AI enough to make it practical and see it gain traction.
As much as I respect Tao, I feel like his comments about AI usage can be misleading without carefully reading what he is saying in the linked posts.
A large amount of Tao's work is around using AI to assist in creating Lean proofs.
I'm generally on the more skeptical side of things regarding LLMs and grand visions, but assisting in the creation of Lean proofs is a huge area of opportunity for LLMs and really could change mathematics in fundamental ways.
One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case. We have proofs that are gigabytes (I believe even terabytes in some cases) in size, but we know they are correct because they check in Lean.
This particular pattern of using state-of-the-art work in two different areas (LLMs and theorem proving) absolutely has the potential to fundamentally change how mathematics is done. There's a great picture on p. 381 of Type Theory and Formal Proof where you can easily see the two trickiest parts of the diagram that LLMs can be slotted in to solve.
Because the work is formally verified we can throw out entire classes of LLM problems (like hallucinations).
Personally, I think strongly typed languages with powerful type systems are also the long-term ideal for coding with LLMs (but I'm less optimistic about devs following this path).
Perhaps you meant, "a decent amount of his recent work." He has been doing math long before LLMs, and is still regularly publishing papers with collaborators that have nothing to do with AI. The most recent was last week. https://arxiv.org/search/math?searchtype=author&query=Tao,+T
It absolutely is. With the twist that ChatGPT 5.2 can now also "explain" an AI-generated Lean proof in human-readable terms. This is a game changer, because "refactoring" can now become end-to-end: if the human explanation of a Lean proof is hard to grok and could be improved, you can test changes directly on the formal text and check that the proof still goes through for the original statement.
Formal verification combined with AI is, imho, exactly the type of thinking that gets the most value out of the current state of LLMs.
https://www.science.org/cms/asset/7f2147df-b2f1-4748-9e98-1a...
That's one of the main reasons why I did not pursue an academic math career. The pure joy of solving exam problems with elegant proofs is very hard to get on harder problems.
That's not a naïve belief. Intelligible proofs represent insight that can be applied to other problems. If our only proof is an opaque one, that means we don't really understand the area yet. Take, for example, the classification of finite simple groups (a ten-thousand-page proof): that is very much not a closed area of research, and we're still discovering new things in the vicinity of the problem.
What happens if we reason upwards but change some universal constants? What happens if we use tau instead of pi everywhere? These kinds of fun questions would otherwise require an enormous intellectual effort, whereas with the mechanisation and automation of thought we might be able to just run them and see!
Google Scholar was a huge step forward for doing meta-analysis vs a physical library.
But agents scanning the vastness of PDFs to find correlations and insights that are far beyond human context capacity will, I hope, find a lot of knowledge that we have technically already collected but remain ignorant of.
In any given scientific niche, there is a huge amount of tribal knowledge that never gets written down anywhere, just passed on from one grad student to the rest of the group, and from there spreads by percolation in the tiny niche. And papers are never honest about the performance of the results and what does not work, there is always cherry picking of benchmarks/comparisons etc.
There is absolutely no way you can get these kinds of insights beyond human context capacity that you speak of. The information necessary does not exist in any dataset available to the LLM.
Imagine trying to teach an AI how to code based only on slide decks from consultants. No access to documentation, no Stack Overflow, no open-source code in the training data; just sales pitches and success stories. That's close to how absurd this idea is.
We may well see some AlphaGo-style new moves in existing research workflows that AI works out wherever there is some verification logic. Humans could probably never go into that space, or maybe none of the researchers ever ventured there for various reasons, since progress in general is almost always incremental.
Literally nothing other than mild convenience. It’s just 2pi.
I'm not sure why that point gets lost in these discussions. And personally, I think of the set of fundamental mathematical objects as having a unique and objective definition. So, I get weirdly bothered by the offset in the Gamma function.
The "intellectual effort" this requires is about 0.
Maybe you meant Euler's number? Since it also relates to PI, it can be used and might actually change the framework in an "interesting way" (making it more awkward in most cases - people picked PI for a reason).
I think what they were getting at is something like this: The application of existing ideas that simply haven't been applied in certain ways because it's too boring or obvious or abstract for humans to have bothered with, but AI can plow through a year's worth of human drudgery in a day or a month or so, and that sort of "brute force" won't require any amazing new technical capabilities from AI.
https://ddcolrs.wordpress.com/2018/01/17/maxwells-equations-...
"I doubt that anything resembling genuine "artificial general intelligence" is within reach of current #AI tools. However, I think a weaker, but still quite valuable, type of "artificial general cleverness" is becoming a reality in various ways.
By "general cleverness", I mean the ability to solve broad classes of complex problems via somewhat ad hoc means. These means may be stochastic or the result of brute force computation; they may be ungrounded or fallible; and they may be either uninterpretable, or traceable back to similar tricks found in an AI's training data. So they would not qualify as the result of any true "intelligence". And yet, they can have a non-trivial success rate at achieving an increasingly wide spectrum of tasks, particularly when coupled with stringent verification procedures to filter out incorrect or unpromising approaches, at scales beyond what individual humans could achieve.
This results in the somewhat unintuitive combination of a technology that can be very useful and impressive, while simultaneously being fundamentally unsatisfying and disappointing - somewhat akin to how one's awe at an amazingly clever magic trick can dissipate (or transform to technical respect) once one learns how the trick was performed.
But perhaps this can be resolved by the realization that while cleverness and intelligence are somewhat correlated traits for humans, they are much more decoupled for AI tools (which are often optimized for cleverness), and viewing the current generation of such tools primarily as a stochastic generator of sometimes clever - and often useful - thoughts and outputs may be a more productive perspective when trying to use them to solve difficult problems."
This comment was made on Dec. 15, so I'm not entirely confident he still holds it?
https://mathstodon.xyz/@tao/115722360006034040
While I quickly noticed that my pre-ChatGPT-3.5 use of the term was satisfied by ChatGPT-3.5, this turned out to be completely useless for 99% of discussions, as everyone turned out to have different boolean cut-offs for not only the generality, but also the artificiality and the intelligence, and also what counts as "intelligence" in the first place.
That everyone can pick a different boolean cut-off for each initial, means they're not really booleans.
Therefore, consider that this can't drive a car, so it's not fully general. And even those AI which can drive a car, can't do so in genuinely all conditions expected of a human, just most of them. Stuff like that.
So blind people are not general intelligences?
So no, we do not deem a blind person to be unintelligent due to their inability to drive without sight. But we might judge a sighted person as not generally intelligent if they could not drive with sight.
AI can already beat humans in pretty much any game like Go or Chess or many videogames, but that doesn't make it general.
Just because a specialized human placed in an F-16 can fly at Mach 2.0, doesn't mean humans in general can fly.
What happens when we put an artificial general intelligence in an F-16? That's what happened here with this proof.
I assumed extreme performance of a general AI matching and exceeding average human intelligence when placed in an F16 or an equivalent cockpit specified for conducting math proofs.
That's not AGI at all. I don't think you understand that LLMs will never hit AGI even when they exceed human intelligence in all applicable domains.
The main reason is they don't feel emotions. Even if the definition of AGI doesn't currently encompass emotions, people like you will move the goalposts and shift the definition until it does. So as AI improves, the threshold will be adjusted to make sure they never reach AGI, since it's an existential and identity crisis for many people to admit that an AI is better than them on all counts.
Not trivial problems. Issues with possible solutions, errors, and unresolved history.
AI did not "solve" any issues on its own, but what stood out to me was the speed at which concepts could be rewritten, restructured, and stress-tested.
A mental model that has been useful to me is that AI is not particularly good at providing the first answer, however, it is very good at providing the second, third, and tenth versions of the answer, especially when the first answer has already been identified as weak by a human.
In these instances, the progress seemed to stem from the AI being able to:
- quickly reword and restate a given argument,
- convert implicit assumptions into explicit ones,
- identify small gaps in logic before they became large.
What I have been grappling with is how to differentiate when AI is just clarifying versus when it is silently hallucinating structure. Is the output of AI being treated as a draft, a reviewer, a rubber duck, or some combination? When is the output so fast that the rigor of thought is compromised? I am interested in how others are using AI for hard thinking and not just for writing cleanup.
Are you getting the same value in your work, in your field?
For me, deep research tools have been essential for getting caught up with a quick lit review about research ideas I have now that I'm transitioning fields. They have also been quite helpful with some routine math that I'm not as familiar with but is relatively established (like standard random matrix theory results from ~5 years ago).
It does feel like the spectrum of utility is pretty aligned with what you might expect: routine programming > applied ML research > stats/applied math research > pure math research.
I will say ~1 year ago they were still useless for my math research area, but things have been changing quickly.
So yes, there is value here, and quite a bit, but it requires a lot of forethought in how you structure your prompts, and you need to be super skeptical about the output as well as able to check that output minutely.
If you would just plug in a bunch of data and formulate a query and would then use the answer in an uncritical way you're setting yourself up for a world of hurt and lost time by the time you realize you've been building your castle on quicksand.
Overall: useful, but not yet particularly "accelerating" for me.
I also found it can be helpful when exploring your mathematical intuitions on something, e.g. how a dropout layer might affect learned weights and matrix properties, etc. Sometimes it will find some obscure rigorous math that can be very enlightening or relevant to correcting clumsy intuitions.
But maybe that is just me. I have read some of Terence Tao's transcripts, and the questions he asks LLMs are higher complexity than what I ask. Yet, he often gets reasonable answers. I don't yet know how I can get these tools to do better.
The difference between free ChatGPT, GPT-5.2 Thinking, and GPT-5.2 Pro is enormous for areas like logic and math. Often the answer to bad results is just to use a better model.
Additionally, sometimes when I get bad results I just ask the question again with a slightly rephrased prompt. Often this is enough to nudge the models in the right direction (and perhaps get a luckier response in the process). However, if you are just looking at a link to a chat transcript, this may not be clear.
I have wondered if he has access to a better model than I, the way some people get promotional merchandise. A year or two ago he was saying the models were as good as an average math grad student when to me they were like a bad undergrad. In the current models I don't get solutions to new problems. I guess we could do some debugging and try prompting our models with this Erdos problem and see how far we get. (edit: Or maybe not; I guess LLMs search the web now.)
- It took me a minute or two of clicking around to figure out that the (only?) way to use it is to create an API key, then start aristotle in the terminal and interact with it there. It could be more obvious I think.
- Your profile links to http://www.cs.stanford.edu/~tachim/ which doesn't work; should be http://cs.stanford.edu/~tachim/ (without the www) (I think Stanford broke something recently for the former not to work.)
Edit: I just realized from https://news.ycombinator.com/item?id=46296801 that you're the CEO! - in that case maybe you, or whoever you think most appropriate from your organization, could submit it along with a text description of what it is, and what is the easiest and/or most fun way to try it out?
Edit: Also https://hn.algolia.com/?dateRange=all&page=2&prefix=true&que... for the most popular Show HNs. Don't be discouraged; we like personal/open-source projects most often. Obsidian made #1.
EDIT: Look at all the people below just reacting to the headline and clearly not reading the posts. Aristotle (https://arxiv.org/abs/2510.01346) is key here folks.
EDIT2: It is clear that many of the people below don't even understand basic terminology. Something being a transformer doesn't make it an LLM (vision transformers, anyone?), and if you aren't training on language (e.g. AlphaFold, or Aristotle on Lean material), it isn't a "language" model.
I think it's because it comes off as if you are saying that we should move off of GenAI, and a lot of people use LLM when they mean GenAI.
This is a really hopeful result for GenAI (fitting deep models tuned by gradient descent on large amounts of data), and IMO this is possible because of specific domain knowledge and approaches that aren't there in the usual LLM approaches.
Do you mean that in this case, it was not a LLM?
It is far more than an LLM, and math != "language".
The second one being backed by a model.
> It is far more than an LLM
It's an LLM with a bunch of tools around it, and a slightly different runtime than ChatGPT. It's "only" that, but people - even here, of all places - keep underestimating just how much power there is in that.
> math != "language".
How so?
LLM = Large Language Model. "Large" refers both to the number of parameters (and, in practice, the depth) of the model and, implicitly, to the amount of data used for training; "language" means human (i.e. written, spoken) language. A Vision Transformer is not an LLM because it is trained on images, and AlphaFold is not an LLM because it is trained on molecular configurations.
Aristotle works heavily with formalized Lean statements and expressions. While you can certainly argue this is a language of sorts, it is not at all the same "language" as the "language" in LLMs. Calling Aristotle an "LLM" just because it has a transformer is more misleading than truthful, because every single other aspect of it is far more clever and involved.
There is no reason or framing where you can say Aristotle isn't a language model.
It's an LLM.
> It is far more than an LLM, and math != "language".
1. "The search algorithm is a highly parallel Monte Carlo Graph Search (MCGS) using a large transformer as its policy and value functon." ... "We use a generative policy to take progressively widened [7] samples from the large action space of Lean tactics, conditioning on the Lean proof state, proof history, and, if available, an informal proof. We use the same model and prompt (up to a task token) to compute the value function which guides the search."
See that 'large transformer' phrase? That's where the LLM is involved.
2. "A lemma-based informal reasoning system which generates informal proofs of mathematical state-ments, breaks these proofs down into lemmas, formalizes each lemma into Lean, and iterates this process based on formal feedback" ... "First, the actions it generates consist of informal comments in addition to Lean tactics. Second, it uses a hidden chain of thought with a dynamically set thinking budget before predicting an action."
Unless you're proposing that this team solved AGI, "chain of thought" is a specific term of art in LLMs.
3. "A geometry solver which solves plane geometry problems outside of Lean using an approach based on AlphaGeometry [45]." ... following the reference: "AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. "
AlphaGeometry, like all of Deepmind's Alpha tools, is an LLM.
Instead of accusing people of not reading the paper, perhaps you should put some thought into what the things in the paper actually represent.
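For readers unfamiliar with the terminology in point 1: here is a toy Python sketch of what "search guided by a policy and value function" means in general. It is a plain best-first search, not Aristotle's Monte Carlo Graph Search, and the policy/value functions and the counting domain below are hard-coded stand-ins for what would be transformer outputs over Lean proof states and tactics.

import heapq

GOAL = 37  # toy "theorem": reach 37 from 1 using +1 and *2

def policy(state):
    # Stand-in for the learned policy: propose candidate actions with priors.
    # (This toy search ignores the priors; a real MCGS would use them.)
    return [("add1", 0.5), ("double", 0.5)]

def apply_action(state, action):
    return state + 1 if action == "add1" else state * 2

def value(state):
    # Stand-in for the learned value function: how promising is this state?
    return -abs(GOAL - state)

def guided_search(start, max_expansions=10_000):
    # Best-first search: always expand the state the value function likes most.
    frontier = [(-value(start), start, [])]
    seen = {start}
    while frontier and max_expansions > 0:
        max_expansions -= 1
        _, state, path = heapq.heappop(frontier)
        if state == GOAL:
            return path
        for action, _prior in policy(state):
            nxt = apply_action(state, action)
            if nxt <= 2 * GOAL and nxt not in seen:  # crude pruning
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), nxt, path + [action]))
    return None

print(guided_search(1))  # prints some sequence of 'add1'/'double' steps reaching 37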
Please remember that this is a theorem about integers that is subject to a fairly elementary proof that is well-supported by the existing Mathlib infrastructure. It seems that the AI relies on the symbolic proof checker, and the proofs that it is checking don't use very complex definitions in this result. In my experience, proofs like this which are one step removed from existing infra are much much more likely to work.
Again though, this is really insanely cool!!
People are still going to be moving the goalposts on this and claiming it's not all that impressive or that the solution must have been in the training data or something, but at this point that's kind of dubiously close to arguing that Terence Tao doesn't know what he's talking about, which to say the least is a rather perilous position.
At this point, I think I'm making a belated New Year's resolution to stop arguing with people who are still saying that LLMs are stochastic parrots that just remix their training data and can never come up with anything novel. I think that discussion is now dead. There are lots of fascinating issues to work out with how we can best apply LLMs to interesting problems (or get them to write good code), but to even start solving those issues you have to at least accept that they are at least somewhat capable of doing novel things.
In 2023 I would have bet hard against us getting to this point ("there's no way chatbots can actually reason their way through novel math!"), but here we are, three years later. I wonder what comes next?
Or are you just saying that solving novel problems involves remixing ideas? Well, that's true for human problem solving too.
What? If we're discussing novel synthesis, and it's being contrasted with answer-from-search / answer-from-remix.. the problem does not matter. Only the answer and the originality of the approach. Connecting two fields that were not previously connected is novel, or applying a new kind of technique to an old problem. Recognizing that an unsolved problem is very much like a solved one is search / remix. So what happened here? Tao says it is
> is largely consistent with other recent demonstrations of AI using existing methods to resolve Erdos problem
Existing. Methods. Tao also says "This is a demonstration of the genuine increase in capability of these tools in recent months". This is the sentence everyone will focus on, so what is that capability?
> the more interesting capability revealed by these events is the ability to rapidly write and rewrite new versions of a text as needed, even if one was not the original author of the argument.
Rejoice! But rejoice for the right reasons, and about what actually happened. Style and voice transformations, interesting new capabilities for fuzzy search. Correct usage of external tools for heavy-lifting with symbolics. And yes, actual problem solving. Novel techniques, creativity, originality though? IDK, sounds kind of optimistic based on the detail here.
>the problem does not matter.
Really? All of the other Erdős problems? Millennium Problems? Anything at all? This gets us directly into the territory of "nothing can convince us otherwise".
Though I suppose if the problem statement in Lean is human-generated and there are no ways to "cheat" in a Lean proof, the proof could be trusted without understanding it
Edit: Found it here https://github.com/plby/lean-proofs/blob/main/src/v4.24.0/Er...
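On the "no ways to cheat" point: the two usual escape hatches in Lean are sorry and extra axioms, and both are visible when you ask Lean which axioms a theorem depends on. A sketch (the theorem name below is a placeholder, not the actual declaration in that repository):

-- The usual ways a Lean development can "cheat", and how to check for them:
--   1. a `sorry` leaves a goal unproven (the declaration then depends on sorryAx);
--   2. an `axiom` declaration can assert anything without proof.
-- Both show up when you ask Lean which axioms a theorem relies on:
#print axioms MyErdosTheorem   -- placeholder name
-- A trustworthy proof should list only the standard axioms
-- (propext, Classical.choice, Quot.sound) and nothing like sorryAx
-- or project-specific axioms.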
1. This result is very far from showing something like "human mathematicians are no longer needed to advance mathematics".
2. Even if it did show that, as long as we need humans trained in understanding maths, since "professional mathematicians" are mostly educators, they probably aren't going anywhere.
The educator business has survived so far only because it provided in-person interactive knowledge transfer and credentials - neither of which was possible with static sources of knowledge such as libraries and the internet. But now all of that is possible without the involvement of human teachers.
The METR institute predicts that the length of tasks AI agents can complete doubles every 7 months.
We should expect it to take until 2033 before AI solves Clay Institute-level problems with 50% reliability.
1. https://mppbench.com/
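For what it's worth, here is the kind of back-of-envelope arithmetic behind such an extrapolation; the baseline and target task horizons below are made-up illustrative numbers, not METR's figures.

# Rough doubling-time extrapolation under explicitly assumed horizons.
import math

doubling_months = 7      # the METR-observed doubling time cited above
baseline_hours = 4       # assumed: current ~50%-reliability task horizon
target_hours = 2000      # assumed: horizon of a months-long research effort

doublings = math.log2(target_hours / baseline_hours)
years = doublings * doubling_months / 12
print(f"{doublings:.1f} doublings, about {years:.1f} years")  # ~9 doublings, ~5.2 years
# i.e. roughly the early 2030s, under these (made-up) assumptions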
And once we have that formalized corpus, it's all set up as training data for moving forward.
https://github.com/teorth/erdosproblems/wiki/AI-contribution...
It classifies the advancements based on the level of AI input. In particular, the entry in Table 1 related to the original post has both a green and yellow light, reflecting the skepticism from others.
We asked you about this just recently, but it's still most of what you're posting. You're making the site worse by doing this, right at the point where it's most vulnerable these days.
Your comment here is a shallow dismissal of exactly the type the HN guidelines ask users to avoid here:
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something." (https://news.ycombinator.com/newsguidelines.html)
Predictably, it led to by far the worst subthread on this article. That's not cool. I don't want to ban you because you're also occasionally posting good comments that don't fit these negative categories, but we need you to fix this and stop degrading the threads.
If you're interested, https://news.ycombinator.com/item?id=46515507 and https://news.ycombinator.com/item?id=46508115 are other places I wrote about this recently.
It's the biggest problem facing HN, in my opinion.
Surprisingly, the latest increase in polarization around generative AI has impacted Hacker News the least out of all tech social spaces.
What I'm interested in is how well HN does at fulfilling its own mandate in its own terms. On that scale, it's getting worse—in this respect, at least, which is a big one. We're going to do something about it, the same way we've always tried to stave off the decline of this place (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...).
If you're right that the phenomenon is affecting other places even more than HN then I guess we get to be the-worst-internet-forum-except-for-all-the-others even more than before (https://news.ycombinator.com/item?id=13494318, https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...) - not a bad outcome for HN, though I think it might be overly optimistic.
I don't mean to express doubt (you would know better than I) but what makes you say that?
Do you think it's a consequence of external factors like the current political climate or of internal factors like HN getting larger?
I'm not sure it's about HN getting larger, though. It's a bit hard to tell, but at least some of the upswing in cynical and curmudgeonly comments is coming from established users.
It seems to me that in your view the sheer openness to evaluate LLM use, anecdotally or otherwise, is already a bias.
I don't see how that's sensible, given that to evaluate the utility of something, it's necessary to accept the possibility of that utility existing in the first place.
On the other hand, if this is not just me strawmanning you, your rejection of such a possibility is absolutely a bias, and it inhibits exploration.
Willfully conflating "I find such exploration illegitimate" with "the findings of someone who thinks otherwise are illegitimate" strikes me as extremely deceptive. I don't much appreciate being forced to think with someone else's opinion covertly laundered in. And no, Tao's comments do not meet this same criterion, as his position is not covert, but explicit.
That's the best part! They don't even need to, because ChatGPT will happily do its own private "literature search" and then not tell you about it - even Terence Tao has freely admitted as much in his previous comments on the topic. So we can at least afford to be a bit less curmudgeonly and cynical about that specific dynamic: we've literally seen it happen.
Also known as model inference. This is not something "private" or secret [*]. AI models are lossily compressed data stores and always will be. The model doesn't report on such "searches" because they are not actual searches driven by model output, but just the regular operation of the model driven by the inference engine used.
> even Terence Tao has freely admitted as much
Bit of a (willfully?) misleading way of saying they actively looked for it on a best effort basis, isn't it?
[*] A valid point of criticism would be that the training data is kept private for the proprietary models Tao and co. are using, so source-finding becomes a goose chase with no definitive end to it.
A counterpoint I think is valid, however, is that if locating such literature is so difficult for subject-matter experts, then the model being able to "do so" is in itself a demonstration of value, even if the model is not able to venture a backreference, by virtue of that not being an actual search.
This is reflected in many other walks of life too. One of my long held ideas regarding UX for example is that features users are not able to find "do not exist".
https://news.ycombinator.com/newsguidelines.html
I now see in the other subthread what they mean.
If that was the case, I apologize for misreading you! If you're interested, I can point you to https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... for past explanations about this type of misunderstanding and how to avoid it in the future.
For what it's worth, it's not even that I don't see merit to their points. I'm just unable to trust that they're being genuine, not the least for how they conduct themselves (which I only fault them for so much). This also impacts my ability to reason about their points clearly.
Sadly, I'm not able to pitch any systematic solutions.
I get how it's activating and annoying when moderators show up and start fault-finding, so I can appreciate the irritation here. But really, we're just trying to have an internet forum that doesn't destroy itself. I can't imagine why you wouldn't want to contribute positively to that.
https://news.ycombinator.com/newsguidelines.html
Meanwhile I can't get Claude Code to fix its own shit to save my life.
Maybe this should give you a hint that you're trying to use it in a different way than others do?