There's a major concurrency problem in Wine, the Windows compatibility layer for Linux, related to this, which I encountered several years ago. Wine has its own set of DLLs which provide malloc-level services. When you grow a buffer but it can't be grown in place, a new buffer has to be allocated and the old one copied. The trouble is that the copying takes place while holding a spinlock on the entire buffer pool. I had a multi-threaded Rust program go into futex congestion and drop performance by two orders of magnitude because of this: 20 threads were all stuck compute-bound trying to acquire the lock. There are several layers of locking involved, and they fight.
Wine bug only; works OK on real Windows.[1][2] (bugs.winehq.org seems to be down)
[1] https://forum.winehq.org/viewtopic.php?t=37688
[2] https://bugs.winehq.org/show_bug.cgi?id=54979
Before using realloc, you have to wrap it in something along these lines:
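```
#include <stdlib.h>

/* a sketch of the wrapper meant here: a null pointer means allocate,
   size zero means free, everything else defers to realloc */
void *sane_realloc(void *ptr, size_t size)
{
    if (ptr == NULL)
        return malloc(size);
    if (size == 0) {
        free(ptr);
        return NULL;
    }
    return realloc(ptr, size);
}
```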
ISO C realloc has braindamaged corner cases. Some implementations behave like the above, in which case you can just have #define sane_realloc realloc on those targets.

With the above you can initialize a vector to null, with size zero, and use nothing but realloc for the entire lifetime management: growing it from zero to nonzero size allocates it, shrinking it down to zero frees it.
malloc(0) doesn't necessarily return null; it can return some non-null pointer that can be passed to free. We can get such a thing if we call sane_realloc(0, 0), and can avoid that if we change the malloc line to something like:
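```
    if (ptr == NULL)
        return size ? malloc(size) : NULL;   /* never actually call malloc(0) */
```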
According to the standard `realloc(NULL, size)` should already behave like `malloc(size)`. You shouldn't need that special case unless you're working on a system with a very buggy/non-compliant libc.
I think that you can omit the call to malloc, as realloc(NULL, size) does the same thing as malloc(size), and free(NULL) does nothing.
Unfortunately, when shrinking an array down to 0, you run into a complication: detecting allocation failure now requires checking both size > 0 and whether sane_realloc returned null. To simplify this further, just always allocate a non-zero size, along these lines:
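```
#include <stdlib.h>

/* second sane_realloc: bump a zero size up to 1 so realloc never sees 0 */
void *sane_realloc(void *ptr, size_t size)
{
    return realloc(ptr, size ? size : 1);
}
```

But this second sane_realloc now never frees. That's a problem shared by the ISO C realloc.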
According to ISO C, size zero can behave like this:
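```
free(old)
return malloc(0)
```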
and if malloc(0) allocates something, we have not achieved freeing.

There are ways to implement malloc(0) such that it returns unique pointers without allocating memory, or at least not very much memory. For instance, we can use the 64-bit address space to set aside some range of (unmapped) virtual addresses where we allocate bytes, and use a compact bitmask (actually allocated somewhere) to keep track of them.
Such a scheme was described by Tim Rentsch in the Usenet newsgroup comp.lang.c.
If an implementation does such a thing, adjusting the size to 1 will defeat it; allocations of size 1 need real memory.
(I can't fathom why we would need malloc(0) to be a source of unique values, or why someone would implement that as efficiently as possible, when it's implementation-defined behavior that portable programs cannot rely on. Why wouldn't you use some library module for unique, space-efficient pointers instead?)
I would never rely on malloc(0) to obtain unique pointers at all, let alone pray that it is efficient for that purpose.
I'd be happy with a malloc(0) which returns, for instance, ((void *) -1) which can be hidden behind some #define symbol.
saner_realloc isn't realloc; it is our API, and we can make it do something like this:
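```
#include <stdlib.h>

#define SANE_REALLOC_EMPTY ((void *) -1)

/* sketch: a null return always means failure; SANE_REALLOC_EMPTY stands
   in for a zero-sized allocation and is accepted back as an argument */
void *saner_realloc(void *ptr, size_t size)
{
    if (ptr == SANE_REALLOC_EMPTY)
        ptr = NULL;                  /* the empty token owns no storage */
    if (size == 0) {
        free(ptr);                   /* free(NULL) is a no-op */
        return SANE_REALLOC_EMPTY;   /* non-null, so never looks like failure */
    }
    return realloc(ptr, size);       /* realloc(NULL, n) acts like malloc(n) */
}
```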
Now, a null return always means failure. The shrink-to-zero and allocate-zero cases give us SANE_REALLOC_EMPTY, which tests unequal to null, and we accept that value for growing or freeing.

The caller can also pass in something returned by malloc(0) that is not equal to null or SANE_REALLOC_EMPTY.
I forgot to mention my own opinion. I think that malloc(0) ought to return a 0-sized object, and likewise realloc(ptr, 0) ought to resize the object to 0. Malloc and realloc always allocating feels more consistent to me. For a practical example, I have some code that reads an entire FILE stream into memory. This requires realloc'ing, doubling the size every time the buffer fills. After finishing, I'd like to resize the buffer down to the total bytes read. If it reads 0 bytes, I'd like it to resize the buffer to 0.
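A sketch of the shape I mean (a hypothetical read_all helper); the n ? n : 1 in the final shrink is exactly the wart I'd like realloc(ptr, 0) to remove:

```
#include <stdio.h>
#include <stdlib.h>

/* hypothetical helper: slurp a whole FILE into memory, doubling the
   buffer as it fills, then shrink to the exact number of bytes read */
char *read_all(FILE *f, size_t *len)
{
    size_t cap = 4096, n = 0;
    char *buf = malloc(cap), *p;
    if (!buf)
        return NULL;
    for (size_t got; (got = fread(buf + n, 1, cap - n, f)) > 0; ) {
        n += got;
        if (n == cap) {                        /* full: double the buffer */
            if (!(p = realloc(buf, cap *= 2))) {
                free(buf);
                return NULL;
            }
            buf = p;
        }
    }
    p = realloc(buf, n ? n : 1);   /* sidestep the realloc(ptr, 0) trap */
    if (p)
        buf = p;
    *len = n;
    return buf;
}
```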
I think my sane_realloc, the one that never frees, has much simpler behavior. As much as I hate the needless waste of 1 byte, if my code allocates thousands of 0-sized objects, I'd rather fix that before adding complexity to my sane_realloc.
Since yours solves the 1-byte problem, though, it still interests me. We can simplify your code slightly.
At least with glibc's allocator, realloc() will already go straight to mremap() if your buffer is on the order of megabytes. And if the objects in the buffer can't be moved bytewise (e.g., because they have pointers into themselves), then moving them with mremap() instead of realloc() is hardly going to help.
Of course, C++ being C++, the language-level position is "it's all UB" either way (except for implicit-lifetime types), and even the proposals for trivial relocation make you go through a special function [0].
[0] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p27...
This reminds me why I keep going back to C. C++ features often clash with each other (especially when mixing the object-oriented features with classic C style), and from the C programmer's point of view the additional features don't really add much but end up massively constraining the user.
The only thing I ever miss from C++ is a way of guaranteeing that some code runs on every exit path of a function. I guess this is why Zig became so popular among C developers.
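In C the closest you get is funneling every path through a cleanup label; a minimal sketch:

```
#include <stdio.h>
#include <stdlib.h>

/* goto-cleanup: every return path funnels through one exit label */
int process(const char *path)
{
    int ret = -1;
    char *buf = NULL;
    FILE *f = fopen(path, "rb");
    if (!f)
        goto out;
    if (!(buf = malloc(4096)))
        goto out;
    /* ... real work here ... */
    ret = 0;
out:
    free(buf);          /* free(NULL) is a no-op */
    if (f)
        fclose(f);
    return ret;
}
```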
Oh man, I remember in 2000 when I first started working in the industry we had this database build process written in Java that took almost 30 days to run. The delivery schedule was monthly, and if anything went wrong we'd have to restart from checkpoint and the database build would be late. It also pegged a 32-CPU SMP DEC Alpha machine for the entire time, which was, well... CPUs would regularly (once every other build or so) cook the socket they were in and have to be replaced. The GS-320 would hot-swap (semi-reliably) so it wasn't a HUGE deal, but it would slow it down and inevitably the build would be a day or two late.
Enter myself and a buddy of mine. First thing we discovered was that they were using regular java.lang.Strings for all the string manipulation, and it'd garbage collect for between 30 and 50 seconds every minute once the process got rolling. It used a positively criminal number of threads as well in our predecessor's desperate attempt to make it go faster. SO much time was spent swapping threads on CPUs and garbage collecting that almost no real work got done.
Enter the StringBuffer rotation scheme. John and I decided to use the backup GS-160 as a hub to read source data and distribute it among 16 of our floor's desktop machines as an experiment. The hub was written in C++ and did very little other than read a series of fixed-length records from a number of source files and package them up into payloads to ship over socket to the readers.
The readers gut-rehabbed the Java code and swapped out String for StringBuffer (and io for nio) to take the majority of garbage collection out of the picture.
The trick we employed was to pre-allocate a hoard of StringBuffers with a minimum storage size and put them in a checkin/checkout "repository" where the process could ask for N buffers (generally one per string column) and it'd get a bunch of randomly selected ones from the repo. They'd get used and checked back in dirty. Any buffer that was over a "terminal length" when it was checked in would be discarded and a new buffer would be added in its place.
We poked and prodded and when we were finally happy with it, we were down to one garbage collection every 10 minutes on each server. The final build was cut from 30 days to 2.8 and we got allocated a permanent "beowulf cluster" to run our database build.
Way back in the ASP3/IIS days, we had these "big" pages, and because of how the engine built big strings, they were slow. So I made this little COM library (Linum) which would simply pre-allocate a buffer for writing into. Yes, a StringBuffer. For those scripts the runtime improved by something like 500%. Memory usage was only about 10% worse.
I have found that the general pattern of setting up a memory arena, or anything of that kind, does help, especially if one anticipates more frequent / heavier usage of the heap. It helps with performance (e.g. faster allocations), memory fragmentation (which itself also helps performance indirectly, of course), and other things. It can be very simple.[1]
Or it can be more complicated (if anticipating a lot of allocation work for very varying buffer sizes), e.g. slab allocators. memcached is an example where this is used; a couple of pictures explain the gist.[2]
[1]: note: it can be even simpler of course, but as a quick example, structs along these lines can be used:
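```
#include <stddef.h>

/* illustrative sketch of a minimal bump-allocator arena */
typedef struct arena {
    unsigned char *base;   /* backing memory block */
    size_t cap;            /* total bytes in the block */
    size_t used;           /* bytes handed out so far */
} arena;

void *arena_alloc(arena *a, size_t n)
{
    n = (n + 15) & ~(size_t)15;   /* keep allocations 16-byte aligned */
    if (a->cap - a->used < n)
        return NULL;              /* arena exhausted */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}
```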
[2]: https://siemens.blog/posts/memcached-memory-model/ - I'm sure that if the heap were visualised, it would show how this helps keep fragmentation at bay as well (it helps waste fewer memory pages, too).
In addition to growing pages using VirtualAlloc, it's also possible to have the same memory pages mapped at different locations by using memory mapped 'files' (even when the backing file is just the paging file). This means that if your new block isn't contiguous, you can remap the existing memory without a copy.
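On Windows that looks roughly like this (a sketch, error handling elided: a pagefile-backed section mapped at two addresses):

```
#include <windows.h>

/* sketch: create a pagefile-backed section and map the same pages twice */
void demo(SIZE_T size)
{
    HANDLE sec = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                    PAGE_READWRITE, 0, (DWORD)size, NULL);
    char *a = MapViewOfFile(sec, FILE_MAP_ALL_ACCESS, 0, 0, size);
    char *b = MapViewOfFile(sec, FILE_MAP_ALL_ACCESS, 0, 0, size);
    a[0] = 42;              /* both views see the same physical pages, */
                            /* so b[0] == 42 here */
    UnmapViewOfFile(a);
    UnmapViewOfFile(b);
    CloseHandle(sec);
}
```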
However, system calls also have overhead (thanks, Meltdown/Spectre mitigations), and you might not come out ahead by avoiding memory copies.
Is there another option besides memfd_create? I'm currently using those to implement a circular buffer that avoids having to reconstruct objects if they happen to cross the size boundary by mapping the same memory twice consecutively into the process space.
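For reference, the usual shape of that trick (a sketch, error handling elided; size must be a multiple of the page size):

```
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* sketch: map one memfd twice, back to back, so reads and writes that
   cross the end of the buffer wrap around transparently */
char *ring_create(size_t size)
{
    int fd = memfd_create("ring", 0);
    ftruncate(fd, size);
    /* reserve 2*size of contiguous address space first */
    char *base = mmap(NULL, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* then map the same pages into both halves */
    mmap(base, size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + size, size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    return base;
}
```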
On Unix-like platforms? sysv shared memory (shmget, etc). https://github.com/gnzlbg/slice_deque/blob/045fb28701d3b674b... does this. (Though that code looks racy to me; unmapping the second half's virtual address space before mapping again is wrong. I guess it does that because shmat's SHM_REMAP flag is Linux-specific. At least it will fail rather than silently overwrite some other mapping. edit: oh, and it retries too, so not bad.)
Seems like you could use shm_open + mmap also.
On Linux, you could probably also do some juggling with mremap + MREMAP_DONTUNMAP, but I don't know why you'd prefer this over memfd_create.
Any virtual memory manipulation will have massive overhead.
I don't really understand the point of all this (on Linux). Just mmap(2) a virtual address range as large as you think you will ever need and let demand paging do the rest. No need to explicitly reallocate anything.
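That is, something like this (a sketch; the 1 TiB figure is arbitrary):

```
#include <stddef.h>
#include <sys/mman.h>

/* reserve a huge range up front; physical pages only materialize
   when first touched (MAP_NORESERVE skips swap accounting) */
void *reserve_big(void)
{
    size_t cap = (size_t)1 << 40;   /* 1 TiB of address space, not RAM */
    return mmap(NULL, cap, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}
```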
This means every buffer uses a minimum of 4k (one page) of RAM. Sounds very wasteful.
I mean, it depends...
I'm surprised that they only ran the measurements ten times. It's hard to trust any data in this post given such a small sample.
For Win10+ there is also VirtualAlloc2, which permits reserving placeholders, as well as a flag for allocating 64k pages.
Windows is no longer economically viable for the kind of work in which you would drop down to doing this kind of thing.
This seems to miss the point that, with the giant address space available on a 64-bit system, you can mmap whatever size you want, because nothing is actually allocated until you touch a byte in the page
(including the page tables)
2^48 is 2^16*2^32 = 65536*4GB, not 256*4GB