LLM generates the ENTIRE output at once (world's first diffusion LLM)

33.7K views March 06, 2025

Register for 3-hour AI training with GrowthSchool! Free for the first 1000 people who sign up! https://web.growthschool.io/MWB

Join My Newsletter for Regular AI Updates 👇🏼
https://forwardfuture.ai

My Links 🔗
👉🏻 Subscribe: https://www.youtube.com/@matthew_berman
👉🏻 Twitter: https://twitter.com/matthewberman
👉🏻 Discord: https://discord.gg/xxysSXBxFW
👉🏻 Patreon: https://patreon.com/MatthewBerman
👉🏻 Instagram: https://www.instagram.com/matthewberman_ai
👉🏻 Threads: https://www.threads.net/@matthewberman_ai
👉🏻 LinkedIn: https://www.linkedin.com/company/forward-future-ai

Media/Sponsorship Inquiries ✅
https://bit.ly/44TC45V

Links:
https://www.inceptionlabs.ai/news
https://x.com/karpathy/status/1894923254864978091

0:00 a breakthrough in large language models
0:02 just happened and it claims to be 10
0:04 times faster and 10 times less expensive
0:06 using a completely novel technique
0:09 lifted from text to image generation
0:12 models this is diffusion large language
0:15 models the way that traditional large
0:17 language models work is it generates one
0:19 token and then it generates the next
0:21 token and so on and so forth so
0:23 sequential token generation and it
0:25 cannot generate the next token until it
0:27 has the previous one in come diffusion
0:30 large language models it actually
0:32 generates the entire response all at
0:35 once in a really rough way and then
0:38 iteratively refines it into what it
0:41 considers to be the right answer this is
0:44 exactly how diffusion text to image
0:46 generation models work so with text to
0:49 image generation models specifically
0:50 diffusion models it starts with a
0:52 completely noisy image and gradually
0:55 refines it and over time with enough
0:57 refinement it becomes an image that you
0:59 can actually tell what's going on and it
1:01 does that all at once it doesn't do one
1:02 pixel at a time and then goes on to the
1:04 next pixel and so they took that
1:06 approach and applied it to text based
1:09 models large language models the company
1:11 is called Inception labs and this is the
1:14 first production grade diffusion-based
1:16 large language model so on the left what
1:18 you're seeing is what a traditional
1:20 autoregressive LLM looks like so it
1:22 generates one token then it generates
1:24 the next and the next but over here you
1:26 can see much more quickly over 14
1:29 iterations versus 75 iterations it
1:32 actually starts with a really rough kind
1:34 of almost nonsensical set of text and
1:37 then refines it until it becomes the
1:40 actual solution it's pretty incredible
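The coarse-to-fine process described here can be sketched in a few lines of Python. This is only a toy visualization under my own assumptions, not Inception's actual algorithm: a real diffusion LLM re-predicts every token with a neural network at each denoising step, while this toy simply reveals a known answer a few positions at a time.

```python
import random

random.seed(0)

# Toy sketch of coarse-to-fine generation: the draft starts as pure noise
# and a few positions are "denoised" per iteration. A real diffusion LLM
# scores every token with a neural network at each step; here the final
# answer is hard-coded just to show the shape of the process.
target = "the quick brown fox jumps over the lazy dog".split()
noise_vocab = ["cat", "blue", "runs", "tree", "sky", "fast", "red", "mat"]

draft = [random.choice(noise_vocab) for _ in target]   # fully noisy draft
undecided = list(range(len(target)))                   # positions still noisy

step = 0
while undecided:
    step += 1
    # Resolve up to three random positions toward the final text per step.
    for i in random.sample(undecided, min(3, len(undecided))):
        draft[i] = target[i]
        undecided.remove(i)
    print(f"step {step}: {' '.join(draft)}")
```

The intermediate prints look like the "rough, almost nonsensical" drafts shown in the demo, and after three steps every position has been resolved.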
1:42 and again 10 times faster 10 times less
1:45 expensive and this is going to be
1:46 especially powerful with test time
1:48 compute with all of the scaling laws at
1:50 inference time these cutting-edge models
1:53 have actually become pretty slow to get
1:55 to the final answer if we're working at
1:57 40 50 60 tokens per second it could be
2:00 thinking for minutes before providing
2:02 you with the answer based on test time
2:04 compute but now at a thousand tokens per
2:07 second you might just be waiting a few
2:10 seconds for that answer and so all of a
2:12 sudden we're able to throw a lot more
2:14 test time compute at problems and still
2:16 get answers in a reasonable amount of
2:18 time and I've been saying this for a
2:20 while the biggest bottleneck right now
2:23 for scaling up intelligence is actually
2:26 the speed at which these models perform
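To put rough numbers on that bottleneck (my own arithmetic, using the throughput figures mentioned in this video; the thinking-token count is an assumed example):

```python
# Wait time for a reasoning model that emits a long chain of "thinking"
# tokens before its final answer. 10,000 tokens is an assumed example count.
thinking_tokens = 10_000

for tokens_per_second in (50, 1_000):  # roughly today's frontier speed vs. Mercury's claimed speed
    wait_seconds = thinking_tokens / tokens_per_second
    print(f"{tokens_per_second:>5} tok/s -> {wait_seconds:5.0f} s of waiting")
```

At 50 tokens per second that is over three minutes of waiting; at 1,000 it is ten seconds, which is why faster inference makes heavy test-time compute practical.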
2:28 for example if you're doing coding you
2:30 put in a prompt you're potentially
2:32 waiting 5 10 15 minutes for the agent to
2:36 come up with the solution iterate on the
2:38 solution and imagine if that were just
2:40 30 seconds I mean the potential is crazy
2:44 and it doesn't require custom hardware
2:46 either these numbers Mercury is 10 times
2:49 faster than frontier speed-optimized
2:51 LLMs it runs at over a thousand tokens per
2:54 second on an NVIDIA H100 that is not a
2:57 custom chip that is a standard chip that
2:59 every other large language model can and
3:01 usually does run on and this is actually
3:04 a code generation model so it
3:06 specializes in coding which is even
3:08 crazier because this really has the
3:11 potential to change how coding works
3:13 overnight so let me just show you it in
3:16 action really quickly before I dive into
3:18 the details make a particle system where
3:20 particles follow the mouse cursor add
3:22 controls for particle speed size and
3:24 color use HTML5 canvas for smooth
3:26 animation so I'm going to hit enter and
3:29 there we go
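For reference, the "simple bigram model in Python" prompt tried next in the demo is a classic warm-up task. A minimal hand-written version might look like this (the corpus and names are my own illustration, not Mercury's actual output):

```python
import random
from collections import defaultdict

# A simple bigram model: count how often each word follows each other word,
# then generate text by sampling in proportion to those counts.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

random.seed(0)
out = ["the"]
for _ in range(6):
    followers = counts[out[-1]]
    if not followers:  # dead end: word only seen at the end of the corpus
        break
    words, weights = zip(*followers.items())
    out.append(random.choices(words, weights=weights)[0])
print(" ".join(out))
```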
3:30 look how crazy fast that was literally
3:33 seconds and here we go here it is
3:35 obviously this is a very simple demo but
3:38 it could get more complicated so let's
3:40 increase the size of the particles and
3:43 it just works really well and the point
3:45 is it's insanely fast write a simple
3:48 bigram model in
3:52 Python okay so there it is and it's
3:54 interesting because when it's actually
3:55 running it looks sequential but that's
3:58 not actually what's happening it's
3:59 actually generating the entire thing all
4:02 at once in a really kind of rough and
4:03 noisy way and then iteratively improves
4:06 it and refines it and that's how you get
4:08 this speed it's a completely new
4:10 approach and if you've watched this
4:11 channel at all I've tried other
4:13 approaches like the Mamba architecture
4:16 and other things and they never quite
4:17 work that well but this seems to work
4:20 incredibly well now I actually see this
4:22 diffusion effect up here so you can
4:25 actually turn it on and you can actually
4:26 see what the diffusion process looks
4:28 like all right so let me show you one
4:29 more and actually I'm going to slow it
4:31 down because it's so fast it's hard to
4:34 see what's actually happening so you can
4:36 see just gibberish nonsense and then all
4:38 of a sudden over time and very quick
4:41 time it refines itself to actually make
4:43 sense and we have a snake game right in
4:45 the console and it works obviously again
4:48 very simple but that's all it takes and
4:51 it is crazy fast so I can't wait to try
4:54 this with vibe coding it's really going
4:55 to change everything because I'm kind of
4:57 tired of waiting so long in between
4:59 prompts and thanks to the sponsor of
5:01 this segment Growth School 2025 is a
5:03 critical year we have agents entering
5:05 the workforce and AI infiltrating
5:08 basically every aspect of our work and a
5:11 lot of people think AI is going to take
5:12 their jobs but I say other humans using
5:15 AI will take your jobs so to become
5:18 hyperproductive you need to learn how to
5:19 use AI if you're watching this channel
5:22 you're already ahead of the pack but you
5:23 can go even further a great way to learn
5:26 cutting-edge AI skills is by using
5:28 Growth School's courses Growth School is
5:30 offering a 3-hour Hands-On AI training
5:34 and they will teach you how to use over
5:36 25 different AI tools this is the way to
5:39 be the star of your company whether
5:41 you're in finance sales HR recruiting
5:44 you should be learning AI and you can do
5:46 so with Growth School Growth School has
5:47 helped over a million people upskill
5:50 across the globe this is normally a paid
5:52 training but right now for the first
5:54 thousand signups it will be free through
5:56 my link down in the description below so
5:58 check out Growth School thanks again to
5:59 them for sponsoring this segment and now
6:01 back to the video all right but how does
6:03 it perform obviously we have some
6:05 benchmarks let's take a look at the
6:06 details and benchmarks so here is
6:08 artificial analysis on the x-axis we
6:10 have output speed on the y-axis we have
6:13 the coding index so this is by
6:15 artificial analysis so all the way up in
6:18 the top left the output speed very slow
6:22 Claude 3.5 Haiku but a very high
6:25 score now all the way over here with
6:28 output speeds that are kind of in we
6:29 have Mercury Coder Small which is about
6:33 equal with GPT-4o mini and then we have
6:36 the Mercury Coder Mini all the way above
6:39 1100 tokens per second which is pretty
6:41 much equal with DeepSeek Coder V2 Lite and
6:44 other small models now here's the thing
6:46 with more test time compute these models
6:49 can just be better be smarter and when
6:52 you're running this fast when you have
6:54 this fast of inference speed you can run
6:56 a lot of test time compute in a very
6:58 short period of time so there's no
7:00 reason these models can't get better so
7:02 here is their pitch current large
7:04 language models are autoregressive that
7:06 basically just means it generates one
7:08 token and then the next and the next and
7:09 you can't generate the next before
7:11 generating the prior each token requires
7:13 evaluating a neural network with
7:15 billions of parameters frontier LLM
7:17 companies are betting on test time
7:19 computation to increase reasoning and
7:20 error correction capabilities but these
7:23 long generations these long thinking
7:26 times cost a lot both in terms of
7:28 latency and literal token cost but
7:31 diffusion models provide such a paradigm
7:33 shift these models operate with a
7:36 coarse-to-fine generation process where the
7:37 output is refined from pure noise over a
7:40 few denoising steps as illustrated in
7:42 the video above and it's not just that
7:44 they're faster they are actually
7:47 potentially better at reasoning listen
7:49 to this because diffusion models are not
7:51 restricted to only considering previous
7:53 output they are better at reasoning and
7:55 at structuring their responses and
7:57 because diffusion models can continually
7:59 refine their outputs they can correct
8:01 mistakes and hallucinations so they
8:04 generate the whole thing they can see
8:05 the whole thing and they iterate and
8:07 correct it all at once it's crazy to
8:10 think about and this is the first time
8:12 we've actually had a successful
8:13 diffusion-based text large language
8:16 model it supports all use cases
8:18 including RAG tool use and agentic
8:21 workflows here's another chart showing
8:24 really how fast it is these are the
8:25 Mercury coders right here and here is
8:28 the second fastest one the Qwen 2.5
8:31 Coder 7B that is a small model but a
8:34 fraction of the speed of this diffusion
8:37 based large language model and so what
8:39 if you can generate code so much more
8:42 quickly let me show you what that looks
8:44 like this is Mercury versus Claude and
8:46 ChatGPT and look how much faster it is
8:50 Mercury is going to finish in just 6
8:52 seconds whereas ChatGPT and Claude take
8:56 obviously a lot longer they actually had
8:57 to fast forward it to get it done in a
8:59 reasonable amount of time for the
9:00 video 36 seconds and 28 seconds really
9:03 just a substantial multiple-factor speed
9:06 increase this type of architecture with
9:09 this speed and this size footprint have
9:12 incredible implications one agents
9:15 that's the obvious one agents are only
9:18 limited by the speed of the model that
9:20 they're using because so much needs to
9:23 be generated between agents especially
9:25 if they're thinking based agents then
9:27 the speed is really the only bottleneck
9:29 there so all of a sudden agents can work
9:32 so much faster get so much more done
9:34 have higher quality because of that next
9:37 it can also do more advanced reasoning
9:40 with again such a cheaper architecture
9:43 such cheaper inference such faster
9:45 inference you could do a lot more
9:47 inference at test time that allows the
9:49 models to perform better we've already
9:51 seen multiple examples where the more
9:53 thinking time a model gets the better it
9:56 performs so now imagine you can compress
9:58 that thinking down to a fraction of the
10:00 amount of time and then you just let it
10:02 go for the same amount of time you get
10:04 just so much more compute and such
10:06 higher quality potential and this is an
10:08 interesting one that I hadn't actually
10:09 thought of so controllable generation
10:12 dLLMs can edit their output and generate
10:14 tokens in any order allowing users to
10:17 infill text align outputs with
10:19 objectives like safety or produce
10:21 outputs that reliably conform to user
10:23 specified formats again because it can
10:25 do everything all at once it kind of has
10:27 more control over that output and then
10:29 finally Edge applications and of course
10:31 that's what I'm really interested in
10:33 because the footprint of these models is
10:35 so small but they are so capable you can
10:37 run them on your laptop or your desktop
10:40 these are smaller models to run on the
10:42 edge so Andrej Karpathy one of the
10:45 leading minds in artificial intelligence
10:47 reposted this and added some comments
10:49 and if anybody is the right person to
10:52 give their opinion on this it's him so
10:54 listen to this most of the image video
10:56 generation AI tools actually work this
10:58 way this way being diffusion and use
11:01 diffusion not autoregression it's only
11:04 text and sometimes audio that have
11:07 resisted so it's been a bit of a mystery
11:09 to me and many others why for some
11:11 reason text prefers autoregression but
11:13 images videos prefer diffusion this
11:16 turns out to be a fairly deep rabbit
11:18 hole that has to do with the
11:20 distribution of information and noise
11:21 and our own perception of them in these
11:24 domains if you look close enough a lot
11:26 of interesting connections emerge
11:27 between the two as well all that is to
11:30 say that this model has the potential to
11:31 be different and possibly showcase new
11:33 unique psychology or new strengths and
11:36 weaknesses I encourage people to try it
11:38 out and there was a paper about a month
11:40 ago that came out called large language
11:41 diffusion models that kind of proposes
11:43 the same thing but it didn't actually
11:44 come with a working model but now we
11:47 have it and if you want to check out
11:48 this paper I'll drop it in the
11:50 description below and so I think this
11:52 really has the potential to be a new
11:55 type of model that elicits new behavior
11:58 out of these intelligent models and
12:00 I'm really excited to continue to try it
12:02 out I really want to plug it into cursor
12:04 and Windsurf if you enjoyed this video
12:06 please consider giving a like and
12:08 subscribe and I'll see you in the next