Tesla's Dojo chip and supercomputer have mouth-watering specifications and capabilities, but cost, custom interconnects, memory constraints, the lack of software, and the fact that this chip won't arrive until 2022 or beyond are all things to keep in mind. SemiAnalysis investigates these issues while also exploring the partnerships with traditional semiconductor firms.
"their unique system on wafer packaging and chip design choices potentially allow an order magnitude advantage over competing AI hardware in training of massive multi-trillion parameter networks."
^ This is just not true, IMO (same as Cerebras). They've designed an amazing machine for _everything other than multi-trillion parameter networks_. They have a very high ratio of memory and interconnect bandwidth to FLOPs (meaning relatively poor FLOPs/$), so they can scale small models (and large-image convnets, both of which have low arithmetic intensity) much further than conventional GPU clusters. But they can't do any better for multi-trillion parameter Transformer models (with very high arithmetic intensity) than the same number of GPU dies.
the fact that they have a partner is not breaking news, it does not say anything, the tesla cars are built with thousands of components from other suppliers, and the M1 is built with supplies from many companies, TSMC nodes them self are built on partnerships and suppliers. but the model S is still called a tesla car, and the M1 is still a Apple chip.
i came off with the impression that the networking switch is a hurdle that tesla will overcome using that partner, but it will take much longer than tesla is projecting.
but regarding the software part, i think that they have not even studied this part! they just thought that it's doable!
and by the way, this is how they have designed the very first tesla car, this is how tesla achieved what they have achieved, they are targeting ambitious goals that no one dare to even think about, but Elon is oversimplifying it in his head, and then he works extremely hard to execute coming up with some scientific or engineering innovations during the way.
he has said many times about different products that it was more complicated than of what they initially thought... but after long delays he usually achieve most of his dreams.
so i would say, the mere fact that elon has entered the room will have a huge impact on the entire industry, although he will most likely be delayed by many years.
I am a little confused with your thesis. Is Tesla not able to use other's IP? This chip is not a car, and Tesla's car production is mostly constrained by battery cell production --- are you claiming this chip/system uses Lion battery cells? Obviously not, and your supposition on timing is pure poorly imformed speculation. I cannot comment on memory resources, but Tesla has very specific NN-processing targets, and I am sure they have designed this chip to those needs. They have said that they have determined that they can use CFP8, so likely took this into consideration. After all, it would not be difficult for them to have made alternate tradeoffs on resources, if they need to. The compiler difficulties are just that difficulties. They are recruiting, likely to solve this ... great opportunity for software engineers to help, and to make a name for themselves -- nothing like a challenge. As for the 'competition', we will have to see if they can perform to Tesla's level, since Tesla is nothing if not acutely focused on their goals, and are willing to abandon their angels if a change in direction is needed. The 'don't bet against Elon' is likely operative here.
Now that Dojo production is ramping, any update on your analysis from August 2021 would be appreciated.
A lot of this stuff came true: they did make a memory interface chip, and there were big delays. Here's a sort-of update:
https://www.semianalysis.com/p/tesla-ai-capacity-expansion-h100
I'm also confused that you're not at liberty to name... the company that, all indications suggest, is the one you named in your Dojo teaser post 😉
"their unique system on wafer packaging and chip design choices potentially allow an order magnitude advantage over competing AI hardware in training of massive multi-trillion parameter networks."
^ This is just not true, IMO (same as Cerebras). They've designed an amazing machine for _everything other than multi-trillion parameter networks_. They have a very high ratio of memory and interconnect bandwidth to FLOPs (meaning relatively poor FLOPs/$), so they can scale small models (and large-image convnets, both of which have low arithmetic intensity) much further than conventional GPU clusters. But they can't do any better for multi-trillion parameter Transformer models (with very high arithmetic intensity) than the same number of GPU dies.
Cerebras doesn't scale up to many WSEs as easily or as effectively.
(But that's okay, since they're not targeting this at giant transformers; they're targeting it at the models they actually train.)
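To make the arithmetic-intensity point above concrete, here's a back-of-the-envelope roofline sketch in Python. All numbers are made up for illustration (none of them are actual Dojo, Cerebras, or GPU specs); the point is only that a design with a high bandwidth-to-FLOPs ratio pulls ahead on low-intensity workloads but ties a conventional GPU, per die, on the high-intensity matrix multiplies that dominate multi-trillion-parameter training.

```python
# Minimal roofline-model sketch (illustrative numbers only, not real
# Dojo/Cerebras/GPU specs) showing why a high bandwidth-to-FLOPs ratio
# helps low-arithmetic-intensity workloads but not high-intensity ones.

def attainable_tflops(peak_tflops: float, bandwidth_tbps: float,
                      arithmetic_intensity: float) -> float:
    """Classic roofline: achievable throughput is capped either by
    compute (peak_tflops) or by memory traffic (bandwidth * AI),
    where arithmetic_intensity is in FLOPs per byte moved."""
    return min(peak_tflops, bandwidth_tbps * arithmetic_intensity)

# Two hypothetical accelerators with the same peak compute but very
# different memory/interconnect bandwidth per FLOP.
machines = {
    "bandwidth-rich design": dict(peak_tflops=100.0, bandwidth_tbps=10.0),
    "GPU-like ratio":        dict(peak_tflops=100.0, bandwidth_tbps=2.0),
}

# Two hypothetical workloads: a small-batch/convnet-style kernel with
# few FLOPs per byte, and a giant-transformer GEMM with many FLOPs per byte.
workloads = {
    "low arithmetic intensity (small models, convnet layers)": 10.0,    # FLOPs/byte
    "high arithmetic intensity (multi-trillion-param GEMMs)":  1000.0,  # FLOPs/byte
}

for wname, ai in workloads.items():
    print(wname)
    for mname, spec in machines.items():
        t = attainable_tflops(spec["peak_tflops"], spec["bandwidth_tbps"], ai)
        print(f"  {mname}: {t:.0f} attainable TFLOP/s")
    print()

# Low-AI workload: 100 vs 20 TFLOP/s -- the extra bandwidth pays off.
# High-AI workload: 100 vs 100 TFLOP/s -- both hit the compute roof, so the
# bandwidth-rich design buys nothing per die for giant transformers.
```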
Great analysis!
The fact that they have a partner is not breaking news; it doesn't say anything. Tesla cars are built with thousands of components from other suppliers, the M1 is built with parts from many companies, and TSMC's nodes themselves are built on partnerships and suppliers. But the Model S is still called a Tesla car, and the M1 is still an Apple chip.
I came away with the impression that the networking switch is a hurdle Tesla will overcome using that partner, but that it will take much longer than Tesla is projecting.
But regarding the software part, I think they haven't even studied it! They just assumed it's doable!
And by the way, this is how they designed the very first Tesla car, and how Tesla achieved what it has achieved: they target ambitious goals that no one else dares even to think about. Elon oversimplifies it in his head, then works extremely hard to execute, coming up with scientific or engineering innovations along the way.
He has said many times, about different products, that it turned out to be more complicated than they initially thought... but after long delays he usually achieves most of his dreams.
So I would say the mere fact that Elon has entered the room will have a huge impact on the entire industry, even though he will most likely be delayed by many years.
I am a little confused by your thesis. Is Tesla not able to use others' IP? This chip is not a car, and Tesla's car production is mostly constrained by battery cell production --- are you claiming this chip/system uses Li-ion battery cells? Obviously not, and your supposition on timing is pure, poorly informed speculation. I cannot comment on memory resources, but Tesla has very specific NN-processing targets, and I am sure they have designed this chip to those needs. They have said that they have determined they can use CFP8, so they likely took this into consideration. After all, it would not be difficult for them to have made alternate trade-offs on resources if they needed to. The compiler difficulties are just that: difficulties. They are recruiting, likely to solve this... a great opportunity for software engineers to help and to make a name for themselves -- nothing like a challenge. As for the 'competition', we will have to see if they can perform to Tesla's level, since Tesla is nothing if not acutely focused on their goals, and is willing to abandon their angles if a change in direction is needed. The 'don't bet against Elon' is likely operative here.