- in rust one semantic compilation unit is one crate
- in C one semantic compilation unit is one file
There are quite a few benefits to the Rust approach, but also drawbacks, like huge projects having to be split into multiple workspaces to maximize parallel building.
Oversimplified: the codegen-units setting tells the compiler into how many parts it is allowed to split a single semantic codegen unit.
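For reference, the knob lives in the Cargo profile settings; these are the documented defaults mentioned elsewhere in the thread (16 for release, 256 for debug):

```toml
# Cargo.toml
[profile.release]
codegen-units = 16    # default for release builds

[profile.dev]
codegen-units = 256   # default for debug builds
```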
Now it still seems strange (as in, it looks like a performance bug) that most of the time Rust was stuck on just one thread (instead of e.g. 8).
I haven't dug into the details, but it may not even be a performance bug, depending on how you define 'bug': the Rust compiler is not fully parallel itself yet. That's a bug in the sense of something that needs to be improved and fixed, but isn't one in the sense of "unexpected bad behavior".
Makes sense. We'd appreciate some more eyeballs here for sure. Between HN and a Reddit thread, there are a few hypotheses floating around. I've shared a repro here for anyone interested: https://github.com/feldera/feldera/issues/3882
codegen-units defaults to 16 in release builds, and by far the most time in the "passes" list is spent in LLVM passes (which is what codegen-units parallelizes), so most of the time it shouldn't be stuck with one high-load core (even if it's not 16 all the time).
so it looks a lot like something is preventing the intended codegen parallelization of the crate
Though it indeed might not have been a bug, e.g. before the change to split the generated code across crates, the source might have been structured in a way where the compiler couldn't split the crate into multiple units. Or maybe something made rustc believe splitting it was a bad idea, e.g. related to memory usage or similar.
- better optimizations (potentially, not always, sometimes not at all)
- how generics and compilation units interact (reduces the benefit of making each module a compilation unit)
- a lot of uncertainty about how Rust would develop in the future when this decision was made
Also, when people speak about Rust compiling slowly and splitting helping, it's mostly related to better caching of repeated builds (unrelated to the incremental build feature) and not the specific issue here. But there is definitely potential to improve on it to make humongous single crates work better (e.g. instead of just 16/256 internal splits you could factor in the crate size, maybe add an attribute to hint codegen-unit splits, etc.), but so far no one has deemed it important enough to invest their time into fixing it. I mean, splitting crates is often easy, so you do it once and are good forever, or at least for a long time.
I think it's due to the fact that (unlike crates) cyclic dependencies are allowed between modules without any extra ceremony (e.g. forward declarations in C).
Agreed, this is the main underlying issue. I've faced it before with generated C++ code too, and after long and painful refactorings, what ultimately helped the most was to just split the generated code into multiple compilation units to allow for parallel compilation. It comes with the drawback of potentially instantiating (and later throwing away) a lot more templates, though.
But I wonder if generating rust is the best approach. On the plus side, you can take advantage of the rich type and type checking system the compiler has. On the other hand, you're stuck with that compiler.
I wonder if the dynamic constraints can be expressed and checked through some more directly implemented mechanism. It should be both simpler to express exactly the constraints you want (no need to translate to a rust construct that rustc will check as desired), and, of course, should be a lot more efficient. Feldera may have no feasible way to get away from generated rust, but a potential competitor might avoid the issue. (That's not to say the runtime shouldn't/couldn't be implemented in rust. I'm just talking about the large amounts of generated rust.)
I would probably have gone with generating C here. You don't need all the safety of the Rust compiler, you're the one generating the code. As you point out, you can check all the constraints you want before you compile it.
I love Rust, but it seems like a really bad intermediate language if you have a compiler/transpiler which is sound. You don't need the type system from Rust telling you something's wrong.
That said, I could see how it would make writing the transpiler easier, so that's a win.
> back of the envelope calculation for how long it should take: 25 min / 128 = 12 sec (or maybe 24 sec since hyper-threads aren't real cores). Yet it takes 170s to compile everything.
I'd aim for this linear speedup for compiling (minus the overhead of compiling many small crates), but the linking part won't be faster, maybe even slower.
Maybe a slightly bigger envelope can tell you how much performance there is to extract and the cost of using "too many" crates (which I'm not even sure is too many; maybe your original crate was too big for incremental compilation to help?)
Yeah, I read that part too late. In that case it seems that there's indeed a lot of overhead when building many crates from a cold-start, but it pays off in wall time and can probably save resources in incremental builds.
Yes. We found both cold and incremental builds sped up. The incremental builds were the main win -- small changes to the SQL can sometimes complete in seconds for what used to be a full recompilation.
Yeah, I was working on a project to generate a Python C-API module from the SVG schema and found that when I generated one huge file (as opposed to one file per class), compilation was significantly faster. Or maybe it was the generated C++ SVG library; I don't quite remember which one was super slow, as I did both at around the same time since the code changes between the two were minimal.
Looks like I settled on the slower 'one class per file' compilation method for whatever reason, probably because generating a 200k+ line file didn't seem like such a good idea.
I just went through this with a project of mine, though unfortunately the code wasn't autogenerated, so I needed to do a lot of mind-numbingly boring search-and-replace commands. I cobbled together a little utility that allowed me to automate the process somewhat.
Mostly throwaway code with heavy input from Claude, so the docs are in the code itself :-)
The evidently misguided assumption was that whoever uses it will need to tweak it anyhow, so might as well read it through. As I wrote - it’s very close to throwaway code.
Anyway, I decided to experiment with Claude also writing a README; the result doesn't seem too terribly incorrect on first squint, and hopefully gives a slightly better impression of what that thing was attempting to do. (Disclaimer: I didn't test it much beyond my use case, so YMMV on whether it works at all.)
> The evidently misguided assumption was that whoever uses it will need to tweak it anyhow, so might as well read it through. As I wrote - it’s very close to throwaway code.
Even if that assumption is true for some of the potential users, they would still appreciate a starting point, you know.
Anyway, I have bookmarked it and will check it out at some point.
If you ever figure you want to invest some more effort into it: try making it into an LSP server so it can integrate with LSP Code Actions directly.
Thanks, hopefully it will be of some use and not too buggy! (It was a classic "worked on my machine for my needs", but I would be very suspicious of the autogenerated code being 100% bug-free.)
I looked briefly at LSP but have no experience with it, and it looked very overwhelming… (and given that I generally use vi, it seemed like a bit too much overhead to also start using a different editor or learn integrations, which I looked at but found a bit unsatisfying).
As a result this exercise got me into an entirely worse kind of shiny: writing my own TUI editor, with a function/type being the unit of editing rather than the file. Facepalm.
Probably an entirely worthless exercise, but it is a ton of fun, and that is what matters for now! :-)
Well, you can at least check out Zed and Helix first? Many people say they work perfectly for them and that they are simpler than both Emacs and [Neo]vim.
LSP Code Actions is super neat though. You can have a compiler error or a warning and when your cursor is positioned on it (inside the editor) you can invoke LSP Code Actions and have changes offered to you with a preview, then you can just agree to it and boom, it's done.
Obviously this might be too much work or too tedious for a hobby project, but it's good for you to know what's out there and how it's used. I don't use LSP Code Actions too often but find them invaluable when I do.
> Of course, we tried debug builds too. Those cut the time down to ~5 minutes — but they’re not usable in practice.
I wonder how true this is.
Haven't used Feldera, but with other Rust stuff I have, running debug builds causes serious performance problems. However, for testing, I have it compile a few crates that do heavy lifting, like `image`, with optimizations (and the vast majority as debug), and that is enough to make the performance issues unnoticeable. So if the multi-crate approach hadn't worked, possibly just compile only some of the stuff as optimized.
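The per-dependency trick above can be expressed with Cargo's per-package profile overrides; a minimal sketch (the `image` crate is just the example from the comment, the keys are standard Cargo profile settings):

```toml
# Cargo.toml: mostly-debug builds, but optimize one hot dependency
[profile.dev.package.image]
opt-level = 3
```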
The dependency costs seem to be self-evident but perhaps it would be enlightening to see a comparison of the energy costs, as well as the time costs and storage costs, of compiling Rust programs versus their C equivalents. For example, comparison might reveal that there is no difference and no trade-off or that any differences and trade-offs are small enough to be worth making in the interest of some higher purpose.
Rust documentation needs some work to emphasize that splitting the codebase into many smaller crates is the "correct" way to do things if you care about build time.
I wonder if it would make a difference, at the starting point, to use fully optimized compile for dependencies but only opt-level=1 for the main crate?
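That combination is expressible with Cargo profile overrides; a sketch under the assumption (per the Cargo docs) that the `"*"` wildcard applies to dependencies but not workspace members:

```toml
# Cargo.toml
[profile.release]
opt-level = 1               # the (generated) workspace crates: cheaper to compile

[profile.release.package."*"]
opt-level = 3               # all dependencies: fully optimized
```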
I agree that it is doing a LOT of work, but I believe OP and many others will compare it to other languages and notice that the Rust compiler is a LOT slower.
> despite having a 64-core machine with 128 threads, Rust barely puts any of them to work.
Rust is fast in theory, but if in practice they can't even get their compiler to squeeze any juice from the CPU, then what's the value of that language from a software engineering viewpoint?
Compilation is inherently pretty hard to parallelise, and various design decisions around how Rust modules/crates work make it even harder (as demonstrated by achieving greater parallelism here with many smaller crates). There's no particular reflection on the performance of Rust code here; it's really a design/algorithms problem.
And Rust's module design makes this significantly more complicated than in superficially-similar languages. That's why splitting it into modules brings drastic improvements - you are effectively giving the compiler clear boundaries across which it doesn't need to propagate as much information.
Yeah, my bad. I do think this terminology drives some confusion, honestly, since neither "module" nor "crate" is a very portable term to other languages' compilation schemes.
I also think it's kind of why people get confused by Rust's module system; they assume that it works like the module system of whatever language they're coming from. But they all work differently!
- Compilation is inherently hard to parallelize, and not just hard: there are a lot of trade-offs. These trade-offs aren't even Rust-specific (i.e. C, C++, etc. are affected just as much) and can lead to less performant generated binaries, and to much more memory pressure, which could make the compilation slower in total. Luckily, for most code this doesn't matter as long as you don't go too parallel. But it's the reason why the max codegen-units is 16, not #num_cpus (for release builds; 256 for debug builds).
- Codegen means producing machine code, i.e. we are speaking about LLVM, so the bug might not be Rust-specific and might affect other languages, too.
- This still should mean 16 threads at high load, not 1, so they seem to have hit a performance bug, i.e. not how things normally work. It's very unclear whether the bug is in Rust or LLVM; if it's the latter, C/C++ might be affected, too.
- While Rust does strongly prefer that you split your code into multiple crates, it still shouldn't be stuck at a single thread, i.e. from everything we can tell, we are hitting some performance bug.
- Algorithms usually dominate performance much more than whether your language is slightly faster or slower. This is clearly an algorithmic error, i.e. failing to parallelize where it normally does parallelize, and the issue might be in C/C++ code, so "Rust is fast" has pretty much nothing to do with this.
- Though it still should be mentioned that for certain design reasons, Rust doesn't want you to make a single crate too large. In most normal situations you end up splitting a crate for various reasons before it becomes too large, but if you have a huge blob of auto-generated code, that is easy to miss. Funnily, what can lead to compile-time issues if your crate is way too big also, in general, leads to better runtime performance, which brings us back to a lot of decisions in compilers having trade-offs.
> then what's the value of that language from a software engineering viewpoint?
You mean besides producing fast code in a way that is much easier to maintain than C++ (or many other languages), while keeping many of the benefits C++ has over C when it comes to reusing code and algorithms? That reuse practically makes it much easier to use better algorithms, which is often a far bigger performance gain than any normal language-level optimization, something that has been shown repeatedly in practice. Not even speaking about the fact that it tends to have fewer bugs, makes it much easier to communicate interface constraints in a reliable, maintainable way, etc.
The argument that a single case of hitting a compile-time performance bug (while doing something which isn't exactly a normal use case, and which ignores the common advice to split crates when they become large) somehow implies that Rust has no value for software engineers is just kind of dumb. I mean, you also wouldn't go around saying cats have no value for a family in general because of one specific case where a cat repeatedly scratched a teenager.
codegen in the OP article is machine code generation (i.e. running LLVM)
It's semantically kind of like having a single 100k-line file, but because Rust knows it often gets huge "files", there is a splitting step somewhere between parsing the AST and generating machine code (I think after generating MIR, but I'm not fully sure). And the codegen-units setting is how many parts Rust is allowed to split a thing which semantically is just one code unit into. By default, for release builds, that's 16 (and as it can affect the performance of generated code, it's not based on #cpus). But in their case there seems to be a bug which makes it effectively more like 1! Which is much worse than it should be. (But also, the statistics they show aren't sufficient to draw many conclusions.)
C++ is fast in theory, but if they can't get C++ compiler (LLVM) to squeeze any juice from CPU, then what's the value of that language from a software engineering viewpoint.
Hopefully, you can see why this reasoning is a problem. The main stumbling point being compilation speed != runtime speed.
Except the Rust ecosystem lacks the solutions we have in C++ land to compile fast and have easy parallelisation of builds.
Because the C and C++ communities, for historical reasons, embrace binary libraries and binary component frameworks like COM, so while in theory a full build from scratch takes a similar time as Rust, in practice that isn't the case.
Also note that D, a language as complex as C++, with three compilers (one of them based on LLVM), is largely faster to compile than Rust while using the LLVM backend, because Walter Bright made the right decisions about what to focus on for the development workflow.
except the company had a straightforward solution to the problem
and it's not always possible in C++ land either
and COM isn't part of C/C++ but a Microsoft-specific technology which solves very different issues, as it's for cross-application communication, while here we have compile-time perf issues inside a single library
> binary libraries
I'm not sure if you mean dynamic linking or binary objects, but here is the thing:
- dynamic linking isn't the issue here, as the issue has nothing to do with re-compilation and similar (where dynamic linking can help),
- binary object files, on the other hand, are also something Rust has and uses; it just sets the boundaries in different places (crate instead of file), which makes development easier, can lead to better runtime performance, etc. It just has the drawback that you sometimes have to split things into multiple crates.
> a language as complex as
how complex a language is to write has not much to do with how complex it is to split a single code unit into multiple parts for parallel compilation. C, C++ and D mainly sidestep this by making each file a compilation unit, while in Rust it's each crate. But that isn't fundamentally better or worse; it's trade-offs.
> because Walter Bright did the right decisions on what to focus for development workflow.
and so did Rust, just with different priorities and trade-offs
and given that D is mostly irrelevant and Rust increasingly more successful, maybe Rust was pursuing the more important priorities
Also, the OP case is about release builds, i.e. not the normal dev loop
nor does the splitting affect the dev experience, as it's all auto-generated code
and in projects which aren't auto-generated it's very normal to split out libraries etc. anyway, and whether you split them into their own module, file or crate doesn't matter too much as long as you keep the code clean (as in, not having super-entangled spaghetti code)
and I wouldn't even be sure D performs relevantly better in compile time if you compare it against the end result after they split the crate
COM is one approach among others, like SOM, DCE, ..., as a means to write binary libraries in polyglot languages.
Turns out that it is mostly used on Windows since the Vista days, as a means to have an OOP-based OS, with most libraries written in C++, and having a stable ABI for such components.
So while it isn't ISO C++, it is mostly used by and from C++. .NET land usually only reaches for COM when using Windows APIs.
Binary libraries means binary libraries; it doesn't matter if statically or dynamically linked.
My point is that the Rust ecosystem does not use them: you always need to compile the complete dependency tree from source after a git clone, some of it even multiple times due to different feature-flag configurations.
Not so with most commercial C and C++ development; we enjoy having binary libraries for dependencies, so after a git clone only the main code needs to be compiled from scratch.
Yes, there are ways to kind of do this with sccache, but it is additional tooling, not something that cargo will apparently ever support.
Also, if you watch recent talks from Microsoft regarding their Rust adoption, the lack of tooling support for binary library distribution is one of their pain points.
- how build steps are cached in one instance of the project
- how build steps are cached across projects with common dependencies
- linking to system libraries
- bundling dependencies
--
Let's first look at it from the POV of system dependencies vs. bundled dependencies:
For each dependency, you either link it as a system dependency (because you link against the one in the system and require systems to have it) or bundle it, no matter if it's C or Rust. The problem with system dependencies is that they don't just need API compatibility but also ABI compatibility, and not everything that is ABI compatible is actually API compatible. Which is all nice and fun, except that a ton of highly useful (some would say required) things do not work well (or at all) if you need ABI compatibility, and there are tons of potential security bugs from libraries that are seemingly API compatible but not actually compatible.
In general, history has shown that making most dependencies system dependencies is a complete shit show, not worth anyone's time and money, especially when people start mixing versions which seem compatible but aren't, leading to strange runtime bugs that aren't possible with supported builds but somehow are your fault as the library maintainer anyway.
Which is why the huge majority of the software industry _gave up on them_ for anything where they aren't strictly needed.
Rust can produce and use system dependencies using a C API, and in some less officially supported ways also using rlibs (i.e. the binary libraries Rust produces when compiling a crate; so yes, it is using binary libraries).
But mostly it's not worth bothering with it, in the same way the majority of the rest of the software ecosystem stopped doing it.
--
Then let's look at reusing builds i.e. caching.
By default Rust does that, but only within the scope of the project. I.e. you build a project, change something, and then rebuild it, and dependencies won't be built again (unless something requires rebuilding them; I come back to that later). To be clear, this is _not_ incremental building, which is a feature to re-use build parts at a more granular level than crates.
If you want it to cache things across projects, or with some company build server, you can do so using 3rd-party software, i.e. the same situation as with C.
> most commercial C and C++ development
Committing binary build artifacts to a source code repo is a huge anti-pattern and a terrible way to have distributed build caches. Stuff like that can easily make your company fail security reviews, or get you classified as having acted negligently if sued for damages (e.g. caused by a virus sneaked into your program).
Also, please _never ever_ check out a 3rd-party open source project that contains pre-built binary artifacts; it's a huge security threat.
So in C/C++ you also should use the additional tools.
> Rust ecosystem does not use them
as mentioned, Rust produces rlibs, which are binary libraries (or you could say binary libraries bundled with metadata, plus something roughly like how C++ templates are handled wrt. binary libraries)
And yes, the tooling for shipping pre-built rlibs could be better, and it probably will get better. It's not that it can't be done; priorities have just been elsewhere so far.
> even multiple times due to different feature flags configurations.
Features are strictly additive, so no, that won't happen.
The only reason for things being built multiple times is different incompatible versions of a package (which from Rust's POV are two different dependencies altogether). And while that initially seems kind of dumb (unnecessary binary size/build time), I can't overstate how much of a blessing this turned out to be.
> not something that apparently cargo will ever support.
yes, and make doesn't support distributed build caches without 3rd-party tools either. But it doesn't matter, as long as you can just pull in the 3rd-party tools when you need them.
EDIT: Rust features are like using #if and similar in the C/C++ preprocessor, i.e. if they change, you have to rebuild, just like in C/C++. Also, even without feature changes, a crate might have been only partially compiled before (e.g. only 1 of 3 functions), so if you start using the other parts they still need to be compiled, which will look a lot like recompilation (and without the incremental build feature might be a full rebuild).
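A sketch of why a feature change forces a rebuild (crate and feature names are hypothetical): a Cargo feature is just a `cfg` flag, much like a preprocessor define:

```toml
# Cargo.toml of a hypothetical crate
[features]
default = []
fast-math = []   # enabling this changes the cfg set the crate is compiled with
```

Code gated with `#[cfg(feature = "fast-math")]` is compiled in or out depending on the flag, so `cargo build --features fast-math` cannot reuse artifacts built without it, just as changing a `-D` macro invalidates C object files.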
While D is complex, it's a different beast than Rust. I suspect various checks, from lifetimes to trait resolution, might make Rust more complex to parallelize than C++.
Rust is already able to do more fine-grained parallel compilation than C or C++, at least in the codegen step. The "codegen units" concept doesn't work for those languages.
> The "codegen units" concept doesn't work for those languages.
Sure it does. Or at least could, depending on what you mean by that term. See, e.g., GCC's -flto-partition option, which supports various partitioning strategies for parallel LTO codegen.
I also use make -j for C++ with great success. And so I am also stuck with having to declare my functions in a separate file from where they're defined, and thus when stepping through code with a debugger or just a good source viewer that can jump to callers/callees, I never see the comments because they're in the header not the source. Not to mention the problems I run into when linking with a library that was compiled with different options or a different version of the source. And bending over backwards to implement concurrent algorithms to get decent runtime performance, and debugging the inevitable bugs that follow.
Rust focuses on a different set of optimizations. I'm still working out when I prefer the Rust set or the C++ set or the Python set. I want to love Rust, but when doing exploratory work with unfamiliar APIs, the slow recompile loop and need to get everything correct for each incremental experimental build I do to try to figure out how something works is pretty painful. I don't know how much better it gets with familiarity. Rust is very nice when I fully understand the problem I'm solving and the environment I'm working in, and I vastly prefer it to C++ for that. But I frequently dive into unfamiliar codebases to make modifications.
cargo has the `-j` flag and defaults it to #cpus (logical CPUs), so by default it uses what is, most of the time, the optimal choice there
And this will parallelize the compilation of all "jobs", roughly like with make, where a job is normally (oversimplified) compiling one code unit into one object file (.o).
And cargo does that too.
The problem is that where Rust and C/C++ (and I think D) set code unit boundaries differs.
In rust it's per crate. In C/C++ it's (oversimplified!!) per .h+.c file pair.
This has drawbacks and benefits. But one drawback is that it parallelizes less well. Hence why Rust internally splits one "semantic code unit" into multiple internal code units passed to LLVM. So this is an additional level of parallelism on top of the -j flag.
In general this works fine, and when people speak about Rust builds being slow, it is very rarely related to this aspect. But it puts a limit on how much code you want in a single crate, which people sometimes overlook.
But in the OP article they did run into it, due to placing like, idk, 100k lines of code (with proc macros maybe _way_ more than that) into a single crate. And then also running into a bug where this internal parallelization somehow failed.
Basically, imagine 100k+ lines of code in a single .cpp file; passing `-j` to the build will not help ;)
I think one important takeaway is that it could make sense to create awareness of this by emitting a warning if your crate becomes way too big, with a link to an in-depth explanation. Though practically, most projects either aren't affected or are split into crates way earlier for various reasons (which sometimes include build time, but related to caching and incremental rebuilds, not fully clean debug builds).
> Hence why rust internally split one "semantic code unit" into multiple internal code units passed to LLVM.
And the same has happened in C and C++ land, albeit in the opposite direction, where multiple compilation units can be optimized together, i.e. LTO. See, e.g., GCC's -flto-partition option for selecting strategies for partitioning symbols for LTO.
Also note that you can manually partition LTO in your Makefile by grouping compilation units into object files to be individually LTO'd.
Which is exactly what this project is now able to do. Your parallel make jobs don't help if you have one gigantic compilation unit, as they originally did.
LTO can be parallelized, both implicitly from the Makefile, and also within the compiler. GCC's -flto itself takes an optional argument to control the number of parallel threads/jobs. See also the -flto-partition option for selecting symbol partitioning strategies.
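A minimal Makefile sketch of those flags (file names are hypothetical; `-flto`, `-flto=N`, and `-flto-partition` are real GCC options):

```make
CC     = gcc
CFLAGS = -O2 -flto            # object files carry LTO bytecode instead of final code

app: a.o b.o c.o
	$(CC) $(CFLAGS) -flto=8 -flto-partition=balanced -o $@ $^  # partitioned, 8-way parallel LTO codegen at link time

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<
```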
Rust/Cargo does this automagically, except the only control you have are the crate and module boundaries. The analogous approach for C is to (optionally) manually group compilation units into a smaller set of object files in your Makefile, LTO'ing each object file in parallel (make -j), and then (optionally) telling the compiler to partition and parallelize a second time on the backend. Which is what Rust does, basically, IIUC--a crate is nominally the LTO codegen unit, except to speed up compilation Rust has heuristics for partitioning crates internally for parallel LTO.
> Given that we now fully utilize 128 threads or 64 cores for pretty much the entire compile time, we can do a back of the envelope calculation for how long it should take: 25 min / 128 = 12 sec (or maybe 24 sec since hyper-threads aren't real cores). Yet it takes 170s to compile everything.
Amdahl’s Law would like to have a word.
FWIW, the article carefully worded it as a "back of the envelope" calculation, says we can't expect a linear speedup in practice, and also gives the time it takes for linking (7 secs).
(Disclaimer: I am the author of the article and I am quite familiar with the law.)
This is an observed change of going from 1 core at 100% to 64 cores at 100%. This is establishing a lower bound, assuming there is no wasted work or contention for shared resources.
Amdahl's Law, like most 20th-century performance "wisdom" and metrics, focuses excessively on instruction count, neglecting memory and I/O pressure. 64 cores doesn't mean 64 independent cache pyramids and memory buses. In real life, the difference between CPU cycle frequency and memory latency is so great that memory pressure primarily determines performance, whereas core count really only matters to the extent that it contributes to that memory pressure.
Amdahl's Law is about coordination costs, so either you would expect cores to be starved or lots of extra coordination-related compute to be happening, which, I guess, is not totally crazy since there are that many crates, but as a first guess OP's back of the envelope is fine
Eminently pragmatic solution — I like it. In Rust, a crate is a compilation unit, and the compiler has limited parallelism opportunities, especially since rustc offloads much of the work to LLVM, which is largely single-threaded.
It’s not surprising they didn’t see a linear speedup from splitting into so many crates. The compiler now produces a large number of intermediate object files that must be read back and linked into the final binary. On top of that, rustc caches a significant amount of semantic information — lifetimes, trait resolutions, type inference — much of which now has to be recomputed for each crate, including dependencies. That introduces a lot of redundant work.
I also would expect this to hurt runtime performance as it likely reduces inlining opportunities (unless LTO is really good now?)
They mention that compiling one crate at a time (-j1) doesn't give the 7x slowdown, which rules out the object file/caching-in-rustc theories... I think the only explanation is the rustcs are sharing limited L3 cache.
The L3 cache angle is one of our hypotheses too. But it doesn't seem like we can do much about it.
The main issue here is:
- in rust one semantic compilation unit is one crate
- in C one semantic compilation unit is one file
There are quite a bunch of benefits in the rust approach, but also drawbacks, like huge projects have to be split into multiple workspaces to maximize parallel building.
Oversimplified: the codegen-units setting tells the compiler into how many parts it is allowed to split a single semantic codegen unit.
Now it still seems strange (as in it looks like a performance bug) that most of the time rust was stuck in just one thread (instead of e.g. 8).
> Now it still seems strange (as in it looks like a performance bug) that most of the time rust was stuck in just one thread (instead of e.g. 8).
Agreed, seems like there are some rustc performance bugs at play here.
I haven't dug into the details, but it may not even be a performance bug, depending on how you define 'bug': the Rust compiler is not fully parallel itself yet. That's a bug in the sense of something that needs to be improved and fixed, but isn't one in the sense of "unexpected bad behavior".
Makes sense. We'd appreciate some more eyeballs here for sure. Between HN and a Reddit thread, there are a few hypotheses floating around. I've shared a repro here for anyone interested: https://github.com/feldera/feldera/issues/3882
You may want to post on the Zulip, I think that's the way to get in touch with the team these days.
the thing is:
codegen-units defaults to 16 in release builds, and by far the most time in the "passes" list is spent in LLVM passes (which is what codegen-units parallelizes), so most of the time it shouldn't be stuck with 1 high-load core (even if it's not 16 all the time).
so it looks a lot like something is preventing the intended codegen parallelization of the crate
Though it indeed might not have been a bug: e.g. before the change to the generator to split output across crates, the source code might have been structured in a way where the compiler couldn't split the crate into multiple units. Or maybe something made rustc believe splitting it was a bad idea, e.g. related to memory usage or similar.
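For reference, the setting being discussed is a Cargo profile knob; a sketch with the defaults mentioned in this thread spelled out explicitly:

```toml
# Cargo.toml -- within-crate codegen parallelism
[profile.release]
codegen-units = 16    # default for release builds

[profile.dev]
codegen-units = 256   # default for debug builds
```

Raising the number allows more parallel LLVM work per crate, at some cost to optimization quality, since LLVM can't optimize across unit boundaries without LTO.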
Ah yeah, that does sound like a bug to me; it’s the earlier stages that I’m thinking of that aren’t parallel yet.
Rust has a great compromise between crate and file: module. I wonder why that's not the compilation unit?
- cyclic dependencies
- some subtleties related to (proc-)macros
- better optimizations (potentially, not always, sometimes not at all)
- how generics and compilation units interact (reduces the benefit of making each module a compilation unit)
- a lot of uncertainty about how rust would develop in the future when this decision was made
Also, when people speak about rust compiling slowly and splitting helping, it's usually related to better caching of repeated builds (unrelated to the incremental build feature) and not the specific issue here. But there is definitely potential to improve on it to make humongous single crates work better (like factoring crate size into the number of internal splits instead of a fixed 16/256, maybe adding an attribute to hint code unit splits, etc.), but so far no one has deemed it important enough to invest their time into fixing it. I mean, splitting crates is often easy, so you do that once and are good forever, or at least a long time.
Per a reddit comment, modules are allowed to have circular dependencies while crates are not.
I think it's due to the fact that (unlike crates) cyclic dependencies are allowed between modules without any extra ceremony (e.x. forward declarations in C.)
Agreed, this is the underlying main issue. I've faced it before with generated C++ code too, and after long and painful refactorings what ultimately helped the most was to just split the generated code into multiple compilation units to allow for parallel compilation. It comes with the drawback of potentially instantiating (and later throwing away) a lot more templates though.
> We're using rustc v1.83, and despite having a 64-core machine with 128 threads, Rust barely puts any of them to work.
> That’s right — 1,106 crates! Sounds excessive? Maybe. But in the end this is what makes rustc much more effective.
> What used to take 30–45 minutes now compiles in under 3 minutes.
I wonder if this kind of trick can be implemented in rustc itself in a more automated fashion to benefit more projects.
> I wonder if this kind of trick can be implemented in rustc itself in a more automated fashion to benefit more projects.
It partially is, with codegen units. The problem is that you can't generally do that until codegen time, because of circular dependencies.
1106 crates? Are they sure this is not a Javascript project?
They compile their customers' SQL to Rust code. Hence the preponderance of crates. It's a somewhat unique scenario.
That's correct. These aren't external dependencies but a dataflow graph being split into crates.
For any Rust compiler experts interested in taking a look, I've put together a short repro here: https://github.com/feldera/feldera/issues/3882
It will give you a workspace with a bunch of crates that seems to exercise some of the same bottlenecks the blog post described.
That's a cool project.
But I wonder if generating rust is the best approach. On the plus side, you can take advantage of the rich type and type checking system the compiler has. On the other hand, you're stuck with that compiler.
I wonder if the dynamic constraints can be expressed and checked through some more directly implemented mechanism. It should be both simpler to express exactly the constraints you want (no need to translate to a rust construct that rustc will check as desired), and, of course, should be a lot more efficient. Feldera may have no feasible way to get away from generated rust, but a potential competitor might avoid the issue. (That's not to say the runtime shouldn't/couldn't be implemented in rust. I'm just talking about the large amounts of generated rust.)
I would probably have gone with generating C here. You don't need all the safety of the Rust compiler, you're the one generating the code. As you point out, you can check all the constraints you want before you compile it.
The right way done by the likes of Oracle and SQL Server is to JIT compile their queries and stored procedures, with PGO data from query analyser.
We think a JIT compiler is the right approach too. Will be a substantial effort though, so we're waiting to get a bit of bandwidth on that front.
Are there any performance implications for the final binary because you’re splitting it up into thousands of crates?
Loss of inlining
Loss of automatic inlining of non-generics without LTO, to be a little pedantic.
Functions marked #[inline] can still be handled across crates.
LTO can inline across crates but, of course, at a substantial compile time cost.
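To make the pedantic point above concrete, here's a minimal sketch; a module stands in for a separate crate so the file is self-contained (in a real workspace `util` would be its own crate):

```rust
// Sketch only: `util` plays the role of a separate crate here.
mod util {
    // Without #[inline], a non-generic function's body is normally not
    // available for cross-crate inlining unless LTO is enabled.
    #[inline]
    pub fn square(x: i64) -> i64 {
        x * x
    }

    // Generic functions are monomorphized in the calling crate anyway,
    // so they can be inlined across crates without any attribute.
    pub fn double<T: std::ops::Add<Output = T> + Copy>(x: T) -> T {
        x + x
    }
}

fn main() {
    assert_eq!(util::square(7), 49);
    assert_eq!(util::double(21), 42);
}
```

So splitting into many crates mainly costs inlining of plain, unannotated, non-generic functions when LTO is off.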
I love Rust, but it seems like a really bad intermediate language if you have a compiler/transpiler which is sound. You don't need the type system from Rust telling you something's wrong.
That said, I could see how it would make writing the transpiler easier, so that's a win.
> back of the envelope calculation for how long it should take: 25 min / 128 = 12 sec (or maybe 24 sec since hyper-threads aren't real cores). Yet it takes 170s to compile everything.
I'd aim for this linear speedup for compiling (sans overhead to compile a small crate), but the linking part won't be faster, maybe even slower. Maybe a slightly bigger envelope can tell you how much performance is there to extract and the cost of using "too many" crates (which I'm not even sure is too many; maybe your original crate was too big to ease incremental compilation?)
Towards the end, the article says it takes 7s for linking using mold.
Yeah, I read that part too late. In that case it seems that there's indeed a lot of overhead when building many crates from a cold-start, but it pays off in wall time and can probably save resources in incremental builds.
Yes. We found both cold and incremental builds sped up. The incremental builds were the main win -- small changes to the SQL can sometimes complete in seconds for what used to be a full recompilation.
Rewrite it in Zig? You might even be able to sidestep the LLVM bottleneck entirely. https://news.ycombinator.com/item?id=43016944
Yeah, I was working on this project to generate a python C-API module from the SVG schema and found when I generated one huge file (as opposed to one file per class) the compilation times were significantly faster. Or maybe it was generating the C++ SVG library, don't quite remember which one was super slow as I did both at around the same time since the code changes between the two were minimal.
Looks like I settled on the slower 'one class per file' compilation method for whatever reason, probably because generating a 200k+ file didn't seem like such a good idea.
I have just gone through this with a project of mine, though unfortunately the code wasn't autogenerated, so I needed to do a lot of mind-numbingly boring search-and-replace commands. I cobbled together a little utility that allowed me to automate the process somewhat.
Mostly a throwaway code with a heavy input from Claude, so the docs are in the code itself :-)
But in case anyone can find it useful:
https://github.com/ayourtch/tweak-code
Zero documentation? Do you expect potential users to figure it out by themselves?
Thanks for the feedback !
The evidently misguided assumption was that whoever uses it will need to tweak it anyhow, so might as well read it through. As I wrote - it’s very close to throwaway code.
Anyway, I decided to experiment with Claude also writing a README - the result doesn't seem too terribly incorrect on first squint, and hopefully gives a slightly better impression of what that thing was attempting to do. (Disclaimer: I didn't test it much beyond my use case, so YMMV on whether it works at all.)
That's much better, thanks.
> The evidently misguided assumption was that whoever uses it will need to tweak it anyhow, so might as well read it through. As I wrote - it’s very close to throwaway code.
Even if that assumption is true for part of the potential users, they would appreciate a starting point, you know.
Now I have bookmarked it and will check it out at one point.
If you ever figure you want to invest some more effort into it: try make it into an LSP server so it can integrate with LSP Code Actions directly.
Thanks, hopefully it will be of any use and not too buggy ! (It was a classic “worked on my machine for my needs”, but I would be very suspicious of the autogenerated code being 100% bug-free).
I looked shortly at LSP but never had experience with it, and it looked very overwhelming… (and given that I generally use vi, it seemed like a bit too much overhead to also start using a different editor or learn integrations - which I looked at but they seemed a bit unsatisfying).
As a result this exercise got me into an entirely worse kind of shiny: writing my own TUI editor, with a function/type being the unit of editing rather than file. facepalm.
probably entirely worthless exercise, but it is a ton of fun and that is what matters for now ! :-)
Well, you can at least check out Zed and Helix first? Many people say they work perfectly for them and that they are simpler than both Emacs and [Neo]vim.
LSP Code Actions is super neat though. You can have a compiler error or a warning and when your cursor is positioned on it (inside the editor) you can invoke LSP Code Actions and have changes offered to you with a preview, then you can just agree to it and boom, it's done.
Obviously this might be too much work or too tedious for a hobby project, but it's good for you to know what's out there and how it's used. I don't use LSP Code Actions too often but find them invaluable when I do.
I had looked at Zed, but it’s a GUI, and I would rather stay inside a terminal for a variety of reasons. Helix I didn’t try though…
It might be a bit closer to what I am after, so I will definitely give it a try even if as a source of inspiration for my reinvention of bicycle !
Thanks a lot !
> Of course, we tried debug builds too. Those cut the time down to ~5 minutes — but they’re not usable in practice.
I wonder how true this is.
Haven't used Feldera, but with other rust stuff I have, if I run as debug it has serious performance problems. However, for testing I have it compile a few crates that do work, like `image`, to be optimized (and the vast majority as debug), and that is enough to make the performance issues not noticeable. So if the multi-crate approach hadn't worked, possibly just compile only some of the stuff as optimized.
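The trick described here is Cargo's per-package profile override; a sketch (the crate name `image` is from the comment, opt levels illustrative):

```toml
# Keep your own code in debug, but optimize a hot dependency:
[profile.dev.package.image]
opt-level = 3

# Or optimize all dependencies while your code stays quick to compile:
[profile.dev.package."*"]
opt-level = 2
```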
The dependency costs seem to be self-evident but perhaps it would be enlightening to see a comparison of the energy costs, as well as the time costs and storage costs, of compiling Rust programs versus their C equivalents. For example, comparison might reveal that there is no difference and no trade-off or that any differences and trade-offs are small enough to be worth making in the interest of some higher purpose.
One thing I am curious about is why the need for crates? Did they try modules or was the initial compiler using them??
Edit: grammar
Rust documentation needs some work to emphasize that splitting the codebase into many smaller crates is the "correct" way to do things if you care about build time.
I wonder if it would make a difference, at the starting point, to use fully optimized compile for dependencies but only opt-level=1 for the main crate?
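That asymmetric setup is expressible with Cargo profile overrides; a sketch assuming release builds are the target (values illustrative):

```toml
[profile.release]
opt-level = 1                    # main crate: quicker to compile

[profile.release.package."*"]
opt-level = 3                    # dependencies: fully optimized
```

The `"*"` override applies to dependencies, not workspace members, which is exactly the split being asked about.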
Not sure about the nature of the generated code, but wouldn’t doing the equivalent of dynamic linking mitigate this problem?
How can you get around Rust's lack of a stable ABI?
The same way as in other ecosystems that lack it, compile everything with the same toolchain.
And then you have to recompile everything every time you upgrade the toolchain.
the rust compiler is so impressively slow
It actually isn't.
It becomes more evident when you consider the amount of work it is doing as well.
I agree that it is doing a LOT of work, but I believe OP and many others will compare it to other languages and notice that the Rust compiler is a LOT slower.
> despite having a 64-core machine with 128 threads, Rust barely puts any of them to work.
Rust is fast in theory, but if in practice they can't even get their compiler to squeeze any juice from the CPU, then what's the value of that language from a software engineering viewpoint?
Compilation is inherently pretty hard to parallelise, and various design decisions around how Rust modules/crates work make it even harder (as demonstrated by achieving greater parallelism here with many smaller crates). There's no particular reflection on the performance of rust code here, it's really a design/algorithms problem.
> Compilation is inherently pretty hard to parallelise
I don't agree. In a large project there is going to be a lot of stuff that can be compiled in parallel without problems.
And Rust's module design makes this significantly more complicated than in superficially-similar languages. That's why splitting it into modules brings drastic improvements - you are effectively giving the compiler clear boundaries across which it doesn't need to propagate as much information.
Splitting it into crates, not modules.
Yeah, my bad. I do think this terminology drives some confusion, honestly, since neither "module" nor "crate" is a very portable term to other languages' compilation schemes.
It's all good! I agree it's frustrating.
I also think it's kind of why people get confused by Rust's module system; they assume that it works like the module system of whatever language they're coming from. But they all work differently!
you are missing a lot of things
- Compilation is an inherently hard thing to parallelize, and not just hard to parallelize, but there are a lot of trade-offs. These trade-offs aren't even rust specific (i.e. C, C++, etc. are affected as much) and can lead to less performant generated binaries. And much more memory pressure, which could make the compilation slower in total. Luckily for most code this doesn't matter as long as you don't go too parallel. But it's the reason why max codegen units is 16 and not #num_cpus (for release builds; 256 for debug builds).
- Codegen means producing machine code, i.e. we are speaking about of LLVM so the bug might not be rust specific and might affect other languages, too.
- This still should mean 16 threads at high load, not 1, so they seem to have hit a performance bug, i.e. not how things normally work. It's very unclear if the bug is in rustc or LLVM; if it's the latter, C/C++ might be affected, too.
- While rust does strongly prefer you to split your code into multiple crates, it still shouldn't be stuck at a single thread, i.e. from everything we can tell we are hitting some performance bug.
- Algorithms usually dominate performance much more than whether your language is slightly faster or slower. This is clearly an algorithmic error, i.e. failing to parallelize when it normally does parallelize, and the issue might be in C/C++ code, so "rust is fast" has pretty much nothing to do with this.
- Though it should still be mentioned that for certain design reasons rust doesn't want you to make a single crate too large. In most normal situations you end up splitting a crate for various reasons before it becomes too large, but if you have a huge blob of auto-generated code that is easy to miss. Funnily, what can lead to compile time issues if your crate is way too big also "in general" leads to better runtime performance, which brings us back to a lot of decisions in compilers having trade-offs.
> then what's the value of that language from a software engineering viewpoint?
You mean besides producing fast code in a way that is much easier to maintain than C++ (or many other languages), while keeping many of the benefits C++ has over C when it comes to reusing code and algorithms? That practically makes it much easier to use better algorithms, which is often a much bigger performance gain than any normal language-level optimization, something which has shown up repeatedly in practice. Not even speaking of the fact that it tends to have fewer bugs and makes it much easier to communicate interface constraints in a reliable, maintainable way.
The argument that a single case of hitting a compile time performance bug, while already doing something which isn't exactly a normal use case and doesn't follow the common advice to split crates when they become large, somehow implies that rust has no value for software engineers is just kind of dumb. I mean, you also wouldn't go around saying cats have no value for a family in general because of one specific case where a cat repeatedly scratched a teenager.
It's a single Rust file with 100k lines of code spit out by a code generator.
yesn't
codegen in the OP article is machine code generation (i.e. running LLVM)
It's semantically kinda like having a single 100k file, but because rust knows it often generates huge "files" there is a splitting step, somewhere between parsing the AST and generating machine code (I think after generating MIR, but not fully sure). And the codegen-units setting is how many parts rust is allowed to split a thing which semantically is just one code unit into. By default for release builds it's 16 (and since it can affect performance of the generated code, it's not based on #cpus). But in their case there seems to be a bug which makes it effectively more like 1! Which is much worse than it should be. (But also, the statistics they show aren't sufficient to draw firm conclusions.)
C++ is fast in theory, but if they can't get C++ compiler (LLVM) to squeeze any juice from CPU, then what's the value of that language from a software engineering viewpoint.
Hopefully, you can see why this reasoning is a problem. The main stumbling point being compilation speed != runtime speed.
Except the Rust ecosystem lacks the solutions we have in C++ land to compile fast and have easy parallelisation of builds.
Because C and C++ communities for historical reasons embrace binary libraries, and binary component frameworks like COM, so while in theory a full build from scratch takes similar time as Rust, in practice that isn't the case.
Also note that D, a language as complex as C++, with three compilers, one of them being based on LLVM, is largely faster to compile than Rust while using the LLVM backend, because Walter Bright did the right decisions on what to focus for development workflow.
> have easy paralelisation of builds.
except the company had a straight forward solution to the problem
and it's not always possible in C++ land either
and COM isn't part of C/C++ but a Microsoft-specific extension which solves very different issues, as it's for cross-application communication, while here we have compile time perf issues inside a single library
> binary libraries
I'm not sure if you mean dynamic linking or binary objects but there is the thing:
- dynamic linking isn't the issue here as the issue has nothing to do with re-compilation and similar (where dynamic linking can help),
- binary object files on the other hand are also something rust has and uses, it just sets the boundaries in different places (crate instead of file) which makes development easier, can lead to better runtime performance etc. It just has the drawback that you sometimes have to split things into multiple crates.
> a language as complex as
how complex a language is to write has not too much to do with how complex it is to split a single code unit into multiple parts for parallel compilation. C, C++ and D mainly sidestep this by making each file a compilation unit, while in rust it's each crate. But that isn't fundamentally better or worse; it's trade-offs.
> because Walter Bright did the right decisions on what to focus for development workflow.
and so did rust, just with different priorities and trade offs
and given that D is mostly irrelevant and rust increasingly more successful, maybe it was pursuing the more important priorities
Also the OP case is about release builds i.e. not the normal dev loop
neither does the splitting affect dev experience as it's all auto generated code
and in projects which aren't auto generated it's very normal to split out libraries etc. anyway, and whether you split them into their own module, file or crate doesn't matter too much as long as you keep code clean (as in not having super-entangled spaghetti code)
and I wouldn't even be sure D performs relevantly better in compile time if you compare its performance against the end result after they split the crate
COM is one approach, among others like SOM, DCE, ... as a means to write binary libraries in polyglot languages.
Turns out that it is mostly used on Windows since Vista days, as means to have an OOP based OS, with most libraries written in C++, and having a stable ABI for such components.
So while it isn't ISO C++, it is mostly used by and from C++. .NET land usually only reaches for COM when using Windows APIs.
Binary libraries, mean binary libraries, doesn't matter if static or dynamically linked.
My point is that Rust ecosystem does not use them, you always need to compile from source code the complete dependency tree after a git clone, some of them even multiple times due to different feature flags configurations.
Not so with most commercial C and C++ development; we enjoy having binary libraries for dependencies, so after a git clone only the main code needs to be compiled from scratch.
Yes there are ways to kind of do with sccache, but it is additionally tooling, not something that apparently cargo will ever support.
Also if you watch recent talks from Microsoft regarding their Rust adoption, the lack of tooling support for binary libraries distribution is one of their pain points.
you are mixing up use cases and concepts
mainly the concepts of
- how build steps are cached in on instance of the project
- how build steps are cached across projects of common dependencies
- linking to system libraries
- bundling dependencies
--
Lets first look at it from a POV of system dependencies vs. bundling dependencies:
For each dependency you either link against the one on the system, making it a system dependency (and requiring systems to have it), or you bundle it, doesn't matter if it's C or Rust. The problem with system dependencies is they don't just need API compatibility but also ABI compatibility, and not everything that looks API compatible is actually ABI compatible. Which is all nice and fun, except a ton of highly useful (some would say required) things do not work well (or at all) if you need ABI compatibility, and seemingly-compatible-but-not libraries are a source of tons of potential security bugs.
In general, history has shown that making most dependencies system dependencies is a complete shit show not worth anyone's time and money, especially when people start mixing versions which seem compatible but aren't, leading to strange runtime bugs that aren't possible with supported builds but are somehow your fault as library maintainer anyway.
Which is why the huge majority of the software industry _gave up on them_ for anything where they aren't strictly needed.
Rust can produce and use system dependencies, using a C API and, in a less officially supported way, also using rlibs (i.e. the binary libraries rust produces when compiling a crate; so yes, it is using binary libraries).
But mostly it's not worth bothering with it, in the same way the majority of the rest of the software ecosystem stopped doing it.
--
Then let's look at reusing builds i.e. caching.
By default rust does that, but only on the scope of the project. I.e. you build a project, change something, then rebuild it, and dependencies won't be built again (except if you need to rebuild them; I come back to that later). To be clear this is _not_ incremental building, which is a feature to re-use build parts on a more granular level than crates.
If you want it to cache things across projects or with some company build server you can do so using 3rd party software, i.e. same situation as with C.
> most commercial C and C++ development
Committing binary build artifacts to a source code repo is a huge anti-pattern and a terrible way to have distributed build caches. Stuff like that can easily make your company fail security reviews or be classified as having acted negligently if sued for damages (e.g. caused by a virus sneaked into your program).
Also please _never ever_ check out a 3rd party open source project with any pre-built binary artifacts in it; it's a huge security threat.
So in C/C++ you also should use the additional tools.
> Rust ecosystem does not use them
as mentioned they produce rlibs which are binary libraries (or you could say binary libraries bundled with metadata, and stuff which is roughly like how C++ templates are handled wrt. binary libraries)
And yes, the tooling for shipping pre-built rlibs could be better, and it probably will get better. It's not that it can't be done; priorities have just been elsewhere so far.
> even multiple times due to different feature flags configurations.
Features are strictly additive, so no, that won't happen.
The only reason for them being built multiple times is different incompatible versions of the package (which from rust's POV are two different dependencies altogether). And while that seems initially kinda dumb (unnecessary binary size/build time), I can't overstate how much of a huge blessing this turned out to be.
> not something that apparently cargo will ever support.
yes and make doesn't support distributed build caches without including 3rd party tools either. But it doesn't matter as long as you can just pull in the 3rd party tools if you need them.
EDIT: Rust features are like using #if and similar in the C/C++ pre-processor, i.e. if they change you have to rebuild. Like in C/C++. Also, even without feature changes, a crate might have been only partially compiled before (e.g. only 1 of 3 functions), so if you start using the other parts they still need to be compiled, which will look a lot like recompilation (and without the incremental build feature might be a rebuild).
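To illustrate the additivity point: if two crates in one build graph ask for different features of the same dependency, Cargo unifies them rather than building twice (crate names illustrative):

```toml
# crate_a/Cargo.toml
[dependencies]
serde = { version = "1", features = ["derive"] }

# crate_b/Cargo.toml, same workspace, would contain:
#   serde = { version = "1", features = ["rc"] }
#
# Cargo then builds serde once, with the union {derive, rc} enabled.
```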
> Also note that D, a language as complex as C++
While D is complex, it's a different beast than Rust. I suspect various checks, from lifetimes to trait resolution, might make it more complex to parallelize than C++.
Rust is already able to do more fine-grained parallel compilation than C or C++, at least in the codegen step. The "codegen units" concept doesn't work for those languages.
> The "codegen units" concept doesn't work for those languages.
Sure it does. Or at least could, depending on what you mean by that term. See, e.g., GCC's -flto-partition option, which supports various partitioning strategies for parallel LTO codegen.
LTO is distinct, and happens after what codegen units does.
That said, it is a more fine grained parallelism, for sure. Rust does LTO as well as codegen-units.
Really, as you gesture towards, on some level, this is all semantics: our linkers are also basically compilers too, at this point.
I have been using make's -j flag to compile C++ projects with great success.
The main point being, Rust focuses on the wrong type of optimizations.
I also use make -j for C++ with great success. And so I am also stuck with having to declare my functions in a separate file from where they're defined, and thus when stepping through code with a debugger or just a good source viewer that can jump to callers/callees, I never see the comments because they're in the header not the source. Not to mention the problems I run into when linking with a library that was compiled with different options or a different version of the source. And bending over backwards to implement concurrent algorithms to get decent runtime performance, and debugging the inevitable bugs that follow.
Rust focuses on a different set of optimizations. I'm still working out when I prefer the Rust set or the C++ set or the Python set. I want to love Rust, but when doing exploratory work with unfamiliar APIs, the slow recompile loop and need to get everything correct for each incremental experimental build I do to try to figure out how something works is pretty painful. I don't know how much better it gets with familiarity. Rust is very nice when I fully understand the problem I'm solving and the environment I'm working in, and I vastly prefer it to C++ for that. But I frequently dive into unfamiliar codebases to make modifications.
> codegen units is quite the same as the `-j` flag
cargo has the `-j` flag and defaults it to the number of logical CPUs, so by default it's using what is most often the optimal choice there
And this will parallelize the compilation of all "jobs", roughly like with make. Where a job is normally (oversimplified) compiling one code unit into one object file (.o).
And cargo does that too.
The problem is that Rust and C/C++ (and I think D) set compilation unit boundaries differently.
In Rust it's per crate. In C/C++ it's (oversimplified!!) per .h+.c file pair.
This has drawbacks and benefits. But one drawback is that it parallelizes less well. Hence why rustc internally splits one "semantic code unit" into multiple internal codegen units passed to LLVM. So this is an additional level of parallelism on top of the -j flag.
In general this works fine, and when people say Rust builds are slow it is very rarely related to this aspect. But it does put a limit on how much code you want in a single crate, which people sometimes overlook.
But the project in the OP article did run into it, by placing something like 100k lines of code (with proc macros maybe _way_ more than that) into a single crate. And then they also hit a bug where this internal parallelization somehow failed.
Basically, imagine 100k+ lines of code in a single .cpp file: passing `-j` to the build will not help ;)
I think one important takeaway is that it could make sense to create awareness of this by emitting a warning, with a link to an in-depth explanation, when a crate becomes way too big. Though in practice most projects either aren't affected or split into crates much earlier for various reasons (which sometimes include build time, but related to caching and incremental rebuilds, not fully clean debug builds).
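To make the analogy concrete, here's a sketch of what `make -j` can and can't parallelize (file names made up):

```makefile
# `make -j4` can build these four objects in parallel...
app: a.o b.o c.o d.o
	$(CC) -o $@ $^

%.o: %.c
	$(CC) -c -o $@ $<

# ...but if all the code lives in one giant.c, -j has nothing to schedule:
# the single `cc -c giant.c` job runs on one core (absent LLVM-style internal splits).
```

This is exactly the situation a single huge crate puts rustc in, which is why it does its own internal partitioning.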
> Hence why rust internally split one "semantic code unit" into multiple internal code units passed to LLVM.
And the same has happened in C and C++ land, albeit in the opposite direction, where multiple compilation units can be optimized together, i.e. LTO. See, e.g., GCC's -flto-partition option for selecting strategies for partitioning symbols for LTO.
Also note that you can manually partition LTO in your Makefile by grouping compilation units into object files to be individually LTO'd.
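A sketch of that manual partitioning (group names invented; `-flinker-output=nolto-rel` is, as I understand GCC's docs, the option that makes a relocatable `-r` link run LTO and emit a regular object file):

```makefile
CFLAGS := -O2 -flto

# LTO each group into an ordinary relocatable object; `make -j` runs the groups in parallel.
frontend.o: parser.c lexer.c
	$(CC) $(CFLAGS) -r -flinker-output=nolto-rel -o $@ $^

backend.o: codegen.c emit.c
	$(CC) $(CFLAGS) -r -flinker-output=nolto-rel -o $@ $^

# Final link is a plain non-LTO link of the already-LTO'd groups.
app: frontend.o backend.o
	$(CC) -o $@ $^
```

The trade-off is that cross-group optimization is lost, which is the same trade-off crate boundaries impose in Rust.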
> C and C++ land, albeit in the opposite direction, [...] LTO
Rust also has that, to link different crates (and the internal splits for parallel codegen) together
Which is exactly what this project is now able to do. Your parallel make jobs don't help if you have one gigantic compilation unit, as they originally did.
LTO can be parallelized, both implicitly from the Makefile, and also within the compiler. GCC's -flto itself takes an optional argument to control the number of parallel threads/jobs. See also the -flto-partition option for selecting symbol partitioning strategies.
Rust/Cargo does this automagically, except the only control you have are the crate and module boundaries. The analogous approach for C is to (optionally) manually group compilation units into a smaller set of object files in your Makefile, LTO'ing each object file in parallel (make -j), and then (optionally) telling the compiler to partition and parallelize a second time on the backend. Which is what Rust does, basically, IIUC--a crate is nominally the LTO codegen unit, except to speed up compilation Rust has heuristics for partitioning crates internally for parallel LTO.