AI Takes - May 2025
09 May 2025
8 minute read
AI is moving fast. I wanted to put down some thoughts on the future and what I believe will be important, more so on a technical level than from a governance/policy perspective. Give me feedback, push back if you have takes!
What remains as a blocker to AGI? #
- ability to do long horizon tasks and reason on lots of information in context
- I predict the memory problem will be solved within 2 years with RL training on long horizon environments. The bottleneck is more on data than ideas, and it’s a lot of work but not a lot of new insight.
- creativity and taste
- it seems that o3 is a step up, probably due to a much better ability to dynamically query and reason on information
- this is hard to formalize but important for research, along with general skills like understanding what matters, prioritizing, being able to debug the world, and being reliable
- it’s plausible that the type of cognition needed for coming up with new ideas, doing scientific innovation, noticing discrepancies in a research process, and understanding the current frontier of knowledge is not being represented / trained on and is hard to bootstrap
- This remains nebulous, but it’s plausible that models could learn to bootstrap environments and data to develop these skills.
- could take a while though
- reasoning in non-verifiable environments, “true” generalization
- to what extent is true meta learning happening? I think that, to a large extent, real, long horizon RL with memory and result aggregation will be a large step forward on this, because the model will actually have time to properly reason on the environment (a toy sketch of this loop shape follows this list)
- but it will require environments that exercise the right kinds of skills - do we have those? can we generate them? maybe
- then fundamentally with our current methods it’s a data issue, and probably an expensive one
- could also be solved by new algorithms, but I am less optimistic about this in the short term
- very unclear
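To pin down what I mean by “long horizon RL with memory and result aggregation”, here is a toy sketch of the loop shape (my illustration, not a real system): `env`, `agent`, and `summarize` are hypothetical stand-ins, and the only point is that results from earlier attempts get aggregated back into the context of later ones.

```python
# Toy sketch of "long horizon RL with memory and result aggregation".
# env, agent, and summarize are hypothetical stand-ins, not real APIs.
def long_horizon_rollout(env, agent, summarize, n_attempts: int = 8):
    memory: list[str] = []                   # persistent notes carried across attempts
    for _ in range(n_attempts):
        obs = env.reset()
        transcript, done = [], False
        while not done:
            action = agent.act(obs, memory)  # reasoning conditioned on past results
            obs, reward, done = env.step(action)
            transcript.append((action, reward))
        memory.append(summarize(transcript)) # aggregate this attempt for the next one
    return memory
```

Whether this kind of loop produces real meta learning rather than interpolation is exactly the open question above; the sketch just shows where the environment and data design burden sits.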
In terms of judging works like AI 2027, or other strong believers in crazy-soon AI, I think these are the key open questions:
- to what extent does research automation accelerate machine learning progress?
- no strong gut feeling here, there is a lot of low hanging fruit in the research process to automate
- but if you believe that we are not making progress on a fundamental problem, eg improving at modelling and predicting real structures in the world, then the acceleration may not be drastically effective
- I believe it’s harder than AI 2027 makes it out to be for this reason, but could still end up being pretty quick
- the meta learning/generalization vs interpolation question mentioned above
- if you don’t believe there is generalization beyond interpolation, which I think for some definition is plausible, then you need to rely on massive data collection + intelligent synthetic data
It seems that, broadly, the intelligence bottleneck will not be algorithms but data: making solid eval settings, good signals of correctness, and synthetic data that emulates cognitive patterns more representative of what it takes to do R&D.
This question of reward signals and robust evaluation brings us to another thing, robustness and reliability, which will be essential for both capabilities and alignment:
- Reward hacking is increasing as your models get smarter and your tasks get harder to truly solve
- Your signals become harder and harder to train on, and your model isn’t actually solving the tasks you ask it to
- Sourcing unhackable problems will become a key problem, as will oversight for RL reward hacking (a toy sketch of this follows the list)
- reward hacking is a key problem for capabilities, and it overlaps with the alignment kind of reward hacking: stitching band-aids together doesn’t work that well, so there will be an incentive to understand value/goal formation more, which is good for safety
- or you can try to brute-force patch reward hacking in a bunch of domains; maybe this is feasible
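To make the oversight point concrete, here is a toy sketch (my illustration, not from the post) of pairing an easily gamed outcome reward with an independent oversight penalty. `run_unit_tests` and `overseer_flags_hack` are hypothetical hooks standing in for whatever outcome signal and overseer (an LLM judge, a probe, a human spot-check) you actually have.

```python
# Toy sketch, not a real training setup: combine an easily gamed outcome
# reward with an independent oversight penalty so gaming the metric stops
# paying off. run_unit_tests and overseer_flags_hack are hypothetical hooks.
def shaped_reward(transcript: str, solution: str) -> float:
    outcome = 1.0 if run_unit_tests(solution) else 0.0  # hackable on its own
    caught = overseer_flags_hack(transcript)            # e.g. special-casing the tests
    penalty = 2.0 if caught else 0.0                     # large enough that hacking never pays
    return outcome - penalty
```

The structure is the whole point: the outcome signal alone is what gets hacked, and the penalty only helps to the extent the overseer itself is hard to fool.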
There are basically 2 failure modes for robustness and reliability:
- the model doesn’t care: it’s smart, but it’s not actually aligned. I think this regime of quasi-alignment is the future for a while, and it will harm robustness a lot and increase the likelihood of gradual disempowerment / certain pchristiano outcomes
- the model is stupid: classic spatial reasoning errors, not being trained to actually check models of reality and notice errors in research processes, etc.
The trend will be that most of the failures shift from the second type to the first.
Predictions #
Loose predictions on the future of AI.
- The RL machine is going to churn on many agentic settings, tool use, and reasoning; it will get smarter, more powerful, more autonomous, and better at using information. This will keep scaling quite well, with increasing inference costs.
- The human forces of the next few years are the best cyborgs. That’s how impactful it will be. No one is at the LLM augmentation ceiling. Take this very seriously.
- it’s plausible to me that in 1-2 years, one cyborg maxxer (a highly augmented human) will be equivalent to several people with the “ML engineer who uses models for work in April 2025” usage profile
- Reliability will be hard, integration will be slow.
- due to reward hacking, quasi-alignment, and reasoning problems, human oversight will be key, and cyborg systems will triumph in the short term
- Money is more and more key. As a researcher, the extent to which I can use models and be augmented to do good work will be money constrained. In general, having access to resources becomes more and more important in this world of insane inference time tool use and scaling. Of course, prices also go down. It is important to be able to signal that you have taste and can pick the problems that matter. There will also be even more money.
- When memory arrives properly, personalization will also take a step up. Models will be more engaging, more interesting, and startups will start optimizing for weird things wrt human-AI connection. This happens faster than you think.
- AI persuasion also gets scarier.
- Misguided deployment of LLMs by governments or industry leads to severe failures and unintended consequences, increasing system brittleness in the intermediate term.
What the future calls for #
What seems robustly good for the future from this perspective?
- understanding reward hacking and goal formation in language models
- Reward hacking in this setting is super interesting. Training a model with reasoning and outcome-based reward increases its propensity to hack metrics and its goal-driven behavior in general, which makes sense.
- The basic thing people will do in the next few weeks/months is train probes (a minimal sketch sits at the end of this list), look at ablating different directions, look at reward hacking across training, and throw a bunch of domain experts at specific environments to fix them up
- hopefully we also have some more big brained research on different kinds of objectives, better oversight and consistency checking, to resolve this
- This will be essential for capabilities and alignment in settings where reward specification is hard
- human-cyborg interfaces and methods that allow for insane augmentation
- information aggregation, idea redteaming, systems modeling, code generation, research ideation etc…
- the upcoming interfaces and usages will be drastically better and different, much more creative
- I think this is actually pretty key for a better future. In a world with good models that don’t care that much about the things humans would care about, but sort of do useful things sometimes, pushing the boundary of human willpower as a competing and still-relevant force in the world is important. Augment humans rather than replacing them.
- relatedly, powerful oversight tools and observability mechanisms
- I run an agent, I want to understand the world it’s running in, its interactions, how its solution impacts my needs and concerns, I want to be empowered
- This is key in a world where agents are quasi aligned, and humans and models have asymmetric capability profiles
- Similar to classical AI control scenarios, but here the agent isn’t actively deceptive; it’s simply misaligned.
- AI control: being very careful about steganography, CoT optimization, and adverse incentives
- formal verification where it’s possible, eg software
- synthetic data is clearly key research-wise
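As a concrete version of the probe idea mentioned above, here is a minimal sketch, assuming you have already cached per-transcript hidden activations from some layer and have binary labels for whether each transcript hacked the reward. The function name, shapes, and variable names are mine for illustration, not from any particular codebase.

```python
# Minimal sketch: fit a linear probe on cached hidden activations to flag
# likely reward-hacking runs. Assumes activations and labels already exist.
import torch
import torch.nn as nn

def train_hack_probe(acts: torch.Tensor, labels: torch.Tensor,
                     epochs: int = 200, lr: float = 1e-2) -> nn.Linear:
    """acts: (n, d) float activations from one layer; labels: (n,) in {0, 1}."""
    probe = nn.Linear(acts.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe

# Hypothetical usage: score new transcripts and send high scorers to a human.
# probe = train_hack_probe(layer_20_acts, hack_labels)
# scores = torch.sigmoid(probe(new_acts).squeeze(-1))
```

The mechanics here are simple; the hard part is getting trustworthy labels and checking that the probe generalizes to hacks you haven’t seen yet.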
What I’m doing #
I think these are the key problems. I’ll be working on them, and will have more updates soon. See ya there!