LLM generates the ENTIRE output at once (world's first diffusion LLM)

33.7K views March 06, 2025

Register for 3-hour AI training with GrowthSchool! Free for the first 1000 people who sign up! https://web.growthschool.io/MWB

Join My Newsletter for Regular AI Updates 👇🏼
https://forwardfuture.ai

My Links 🔗
👉🏻 Subscribe: https://www.youtube.com/@matthew_berman
👉🏻 Twitter: https://twitter.com/matthewberman
👉🏻 Discord: https://discord.gg/xxysSXBxFW
👉🏻 Patreon: https://patreon.com/MatthewBerman
👉🏻 Instagram: https://www.instagram.com/matthewberman_ai
👉🏻 Threads: https://www.threads.net/@matthewberman_ai
👉🏻 LinkedIn: https://www.linkedin.com/company/forward-future-ai

Media/Sponsorship Inquiries ✅
https://bit.ly/44TC45V

Links:
https://www.inceptionlabs.ai/news
https://x.com/karpathy/status/1894923254864978091

0:00 a breakthrough in large language models
0:02 just happened and it claims to be 10
0:04 times faster and 10 times less expensive
0:06 using a completely novel technique
0:09 lifted from text to image generation
0:12 models this is diffusion large language
0:15 models the way that traditional large
0:17 language models work is it generates one
0:19 token and then it generates the next
0:21 token and so on and so forth so
0:23 sequential token generation and it
0:25 cannot generate the next token until it
0:27 has the previous one in come diffusion
0:30 large language models it actually
0:32 generates the entire response all at
0:35 once in a really rough way and then
0:38 iteratively refines it into what it
0:41 considers to be the right answer this is
0:44 exactly how diffusion text to image
0:46 generation models work so with text to
0:49 image generation models specifically
0:50 diffusion models it starts with a
0:52 completely noisy image and gradually
0:55 refines it and over time with enough
0:57 refinement it becomes an image that you
0:59 can actually tell what's going on and it
1:01 does that all at once it doesn't do one
1:02 pixel at a time and then goes on to the
1:04 next pixel and so they took that
1:06 approach and applied it to text based
1:09 models large language models the company
1:11 is called Inception labs and this is the
1:14 first production grade diffusion-based
1:16 large language model so on the left what
1:18 you're seeing is what a traditional
1:20 autoregressive LLM looks like so it
1:22 generates one token then it generates
1:24 the next and the next but over here you
1:26 can see much more quickly over 14
1:29 iterations versus 75 iterations it
1:32 actually starts with a really rough kind
1:34 of almost nonsensical set of text and
1:37 then refines it until it becomes the
1:40 actual solution it's pretty incredible
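The coarse-to-fine process described here can be sketched in a few lines of Python. This is only a toy visualization under my own assumptions, not Inception's actual algorithm: a real diffusion LLM re-predicts every token with a neural network at each denoising step, while this toy simply reveals a known answer a few positions at a time.

```python
import random

random.seed(0)

# Toy sketch of coarse-to-fine generation: the draft starts as pure noise
# and a few positions are "denoised" per iteration. A real diffusion LLM
# scores every token with a neural network at each step; here the final
# answer is hard-coded just to show the shape of the process.
target = "the quick brown fox jumps over the lazy dog".split()
noise_vocab = ["cat", "blue", "runs", "tree", "sky", "fast", "red", "mat"]

draft = [random.choice(noise_vocab) for _ in target]   # fully noisy draft
undecided = list(range(len(target)))                   # positions still noisy

step = 0
while undecided:
    step += 1
    # Resolve up to three random positions toward the final text per step.
    for i in random.sample(undecided, min(3, len(undecided))):
        draft[i] = target[i]
        undecided.remove(i)
    print(f"step {step}: {' '.join(draft)}")
```

The intermediate prints look like the "rough, almost nonsensical" drafts shown in the demo, and after three steps every position has been resolved.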
1:42 and again 10 times faster 10 times less
1:45 expensive and this is going to be
1:46 especially powerful with test time
1:48 compute with all of the scaling laws at
1:50 inference time these cutting-edge models
1:53 have actually become pretty slow to get
1:55 to the final answer if we're working at
1:57 40 50 60 tokens per second it could be
2:00 thinking for minutes before providing
2:02 you with the answer based on test time
2:04 compute but now at a thousand tokens per
2:07 second you might just be waiting a few
2:10 seconds for that answer and so all of a
2:12 sudden we're able to throw a lot more
2:14 test time compute at problems and still
2:16 get answers in a reasonable amount of
2:18 time and I've been saying this for a
2:20 while the biggest bottleneck right now
2:23 for scaling up intelligence is actually
2:26 the speed at which these models perform
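To put rough numbers on that bottleneck (my own arithmetic, using the throughput figures mentioned in this video; the thinking-token count is an assumed example):

```python
# Wait time for a reasoning model that emits a long chain of "thinking"
# tokens before its final answer. 10,000 tokens is an assumed example count.
thinking_tokens = 10_000

for tokens_per_second in (50, 1_000):  # roughly today's frontier speed vs. Mercury's claimed speed
    wait_seconds = thinking_tokens / tokens_per_second
    print(f"{tokens_per_second:>5} tok/s -> {wait_seconds:5.0f} s of waiting")
```

At 50 tokens per second that is over three minutes of waiting; at 1,000 it is ten seconds, which is why faster inference makes heavy test-time compute practical.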
2:28 for example if you're doing coding you
2:30 put in a prompt you're potentially
2:32 waiting 5 10 15 minutes for the agent to
2:36 come up with the solution iterate on the
2:38 solution and imagine if that were just
2:40 30 seconds I mean the potential is crazy
2:44 and it doesn't require custom hardware
2:46 either these numbers Mercury is 10 times
2:49 faster than frontier speed-optimized
2:51 LLMs it runs at over a thousand tokens per
2:54 second on an NVIDIA H100 that is not a
2:57 custom chip that is a standard chip that
2:59 every other large language model can and
3:01 usually does run on and this is actually
3:04 a code generation model so it
3:06 specializes in coding which is even
3:08 crazier because this really has the
3:11 potential to change how coding works
3:13 overnight so let me just show you it in
3:16 action really quickly before I dive into
3:18 the details make a particle system where
3:20 particles follow the mouse cursor add
3:22 controls for particle speed size and
3:24 color use HTML5 canvas for smooth
3:26 animation so I'm going to hit enter and
3:29 there we go
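For reference, the "simple bigram model in Python" prompt tried next in the demo is a classic warm-up task. A minimal hand-written version might look like this (the corpus and names are my own illustration, not Mercury's actual output):

```python
import random
from collections import defaultdict

# A simple bigram model: count how often each word follows each other word,
# then generate text by sampling in proportion to those counts.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

random.seed(0)
out = ["the"]
for _ in range(6):
    followers = counts[out[-1]]
    if not followers:  # dead end: word only seen at the end of the corpus
        break
    words, weights = zip(*followers.items())
    out.append(random.choices(words, weights=weights)[0])
print(" ".join(out))
```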
3:30 look how crazy fast that was literally
3:33 seconds and here we go here it is
3:35 obviously this is a very simple demo but
3:38 it could get more complicated so let's
3:40 increase the size of the particles and
3:43 it just works really well and the point
3:45 is it's insanely fast write a simple
3:48 bigram model in
3:52 Python okay so there it is and it's
3:54 interesting because when it's actually
3:55 running it looks sequential but that's
3:58 not actually what's happening it's
3:59 actually generating the entire thing all
4:02 at once in a really kind of rough and
4:03 noisy way and then iteratively improves
4:06 it and refines it and that's how you get
4:08 this speed it's a completely new
4:10 approach and if you've watched this
4:11 channel at all I've tried other
4:13 approaches like the Mamba architecture
4:16 and other things and they never quite
4:17 work that well but this seems to work
4:20 incredibly well now I actually see this
4:22 diffusion effect up here so you can
4:25 actually turn it on and you can actually
4:26 see what the diffusion process looks
4:28 like all right so let me show you one
4:29 more and actually I'm going to slow it
4:31 down because it's so fast it's hard to
4:34 see what's actually happening so you can
4:36 see just gibberish nonsense and then all
4:38 of a sudden over time and very quick
4:41 time it refines itself to actually make
4:43 sense and we have a snake game right in
4:45 the console and it works obviously again
4:48 very simple but that's all it takes and
4:51 it is crazy fast so I can't wait to try
4:54 this with vibe coding it's really going
4:55 to change everything because I'm kind of
4:57 tired of waiting so long in between
4:59 prompts and thanks to the sponsor of
5:01 this segment Growth School 2025 is a
5:03 critical year we have agents entering
5:05 the workforce and AI infiltrating
5:08 basically every aspect of our work and a
5:11 lot of people think AI is going to take
5:12 their jobs but I say other humans using
5:15 AI will take your jobs so to become
5:18 hyperproductive you need to learn how to
5:19 use AI if you're watching this channel
5:22 you're already ahead of the pack but you
5:23 can go even further a great way to learn
5:26 cutting-edge AI skills is by using
5:28 Growth School's courses Growth School is
5:30 offering a 3-hour Hands-On AI training
5:34 and they will teach you how to use over
5:36 25 different AI tools this is the way to
5:39 be the star of your company whether
5:41 you're in finance sales HR recruiting
5:44 you should be learning AI and you can do
5:46 so with Growth School Growth School has
5:47 helped over a million people upskill
5:50 across the globe this is normally a paid
5:52 training but right now for the first
5:54 thousand signups it will be free through
5:56 my link down in the description below so
5:58 check out Growth School thanks again to
5:59 them for sponsoring this segment and now
6:01 back to the video all right but how does
6:03 it perform obviously we have some
6:05 benchmarks let's take a look at the
6:06 details and benchmarks so here is
6:08 artificial analysis on the x-axis we
6:10 have output speed on the y-axis we have
6:13 the coding index so this is by
6:15 artificial analysis so all the way up in
6:18 the top left the output speed very slow
6:22 Claude 3.5 Haiku but a very high
6:25 score now all the way over here with
6:28 output speeds that are kind of in we
6:29 have Mercury Coder Small which is about
6:33 equal with GPT-4o mini and then we have
6:36 the Mercury Coder Mini all the way above
6:39 1100 tokens per second which is pretty
6:41 much equal with DeepSeek Coder V2 Lite and
6:44 other small models now here's the thing
6:46 with more test time compute these models
6:49 can just be better be smarter and when
6:52 you're running this fast when you have
6:54 this fast of inference speed you can run
6:56 a lot of test time compute in a very
6:58 short period of time so there's no
7:00 reason these models can't get better so
7:02 here is their pitch current large
7:04 language models are autoregressive that
7:06 basically just means it generates one
7:08 token and then the next and the next and
7:09 you can't generate the next before
7:11 generating the prior each token requires
7:13 evaluating a neural network with
7:15 billions of parameters frontier LLM
7:17 companies are betting on test time
7:19 computation to increase reasoning and
7:20 error correction capabilities but these
7:23 long generations these long thinking
7:26 times cost a lot both in terms of
7:28 latency and literal token cost but
7:31 diffusion models provide such a paradigm
7:33 shift these models operate with a
7:36 coarse-to-fine generation process where the
7:37 output is refined from pure noise over a
7:40 few denoising steps as illustrated in
7:42 the video above and it's not just that
7:44 they're faster they are actually
7:47 potentially better at reasoning listen
7:49 to this because diffusion models are not
7:51 restricted to only considering previous
7:53 output they are better at reasoning and
7:55 at structuring their responses and
7:57 because diffusion models can continually
7:59 refine their outputs they can correct
8:01 mistakes and hallucinations so they
8:04 generate the whole thing they can see
8:05 the whole thing and they iterate and
8:07 correct it all at once it's crazy to
8:10 think about and this is the first time
8:12 we've actually had a successful
8:13 diffusion-based text large language
8:16 model it supports all use cases
8:18 including RAG tool use and agentic
8:21 workflows here's another chart showing
8:24 really how fast it is these are the
8:25 Mercury coders right here and here is
8:28 the second fastest one the Qwen 2.5
8:31 Coder 7B that is a small model but a
8:34 fraction of the speed of this diffusion
8:37 based large language model and so what
8:39 if you can generate code so much more
8:42 quickly let me show you what that looks
8:44 like this is Mercury versus Claude and
8:46 ChatGPT and look how much faster it is
8:50 Mercury is going to finish in just 6
8:52 seconds whereas ChatGPT and Claude take
8:56 obviously a lot longer they actually had
8:57 to fast forward it to get it done in a
8:59 reasonable amount of time for the
9:00 video 36 seconds and 28 seconds really
9:03 just a substantial multiple-factor speed
9:06 increase this type of architecture with
9:09 this speed and this size footprint have
9:12 incredible implications one agents
9:15 that's the obvious one agents are only
9:18 limited by the speed of the model that
9:20 they're using because so much needs to
9:23 be generated between agents especially
9:25 if they're thinking based agents then
9:27 the speed is really the only bottleneck
9:29 there so all of a sudden agents can work
9:32 so much faster get so much more done
9:34 have higher quality because of that next
9:37 it can also do more advanced reasoning
9:40 with again such a cheaper architecture
9:43 such cheaper inference such faster
9:45 inference you could do a lot more
9:47 inference at test time that allows the
9:49 models to perform better we've already
9:51 seen multiple examples where the more
9:53 thinking time a model gets the better it
9:56 performs so now imagine you can compress
9:58 that thinking down to a fraction of the
10:00 amount of time and then you just let it
10:02 go for the same amount of time you get
10:04 just so much more compute and such
10:06 higher quality potential and this is an
10:08 interesting one that I hadn't actually
10:09 thought of so controllable generation
10:12 dLLMs can edit their output and generate
10:14 tokens in any order allowing users to
10:17 infill text align outputs with
10:19 objectives like safety or produce
10:21 outputs that reliably conform to user
10:23 specified formats again because it can
10:25 do everything all at once it kind of has
10:27 more control over that output and then
10:29 finally Edge applications and of course
10:31 that's what I'm really interested in
10:33 because the footprint of these models is
10:35 so small but they are so capable you can
10:37 run them on your laptop or your desktop
10:40 these are smaller models to run on the
10:42 edge so Andrej Karpathy one of the
10:45 leading minds in artificial intelligence
10:47 reposted this and added some comments
10:49 and if anybody is the right person to
10:52 give their opinion on this it's him so
10:54 listen to this most of the image video
10:56 generation AI tools actually work this
10:58 way this way being diffusion and use
11:01 diffusion not autoregression it's only
11:04 text and sometimes audio that have
11:07 resisted so it's been a bit of a mystery
11:09 to me and many others why for some
11:11 reason text prefers autoregression but
11:13 images videos prefer diffusion this
11:16 turns out to be a fairly deep rabbit
11:18 hole that has to do with the
11:20 distribution of information and noise
11:21 and our own perception of them in these
11:24 domains if you look close enough a lot
11:26 of interesting connections emerge
11:27 between the two as well all that is to
11:30 say that this model has the potential to
11:31 be different and possibly showcase new
11:33 unique psychology or new strengths and
11:36 weaknesses I encourage people to try it
11:38 out and there was a paper about a month
11:40 ago that came out called large language
11:41 diffusion models that kind of proposes
11:43 the same thing but it didn't actually
11:44 come with a working model but now we
11:47 have it and if you want to check out
11:48 this paper I'll drop it in the
11:50 description below and so I think this
11:52 really has the potential to be a new
11:55 type of model that elicits new behavior
11:58 out of these intelligent models and
12:00 I'm really excited to continue to try it
12:02 out I really want to plug it into cursor
12:04 and Windsurf if you enjoyed this video
12:06 please consider giving a like and
12:08 subscribe and I'll see you in the next