Memory mapped files, allocation and JVM crash

I have some experience with memory-mapped (mmap) files and database storage. A long time ago I added mmap storage to the H2 SQL database, and MapDB was the first pure-Java database to really utilize memory-mapped files.

I wrote an older article about mmap files last year. This article summarises the experience I have had with mmap files so far. It also outlines a new approach I use in IODB, a storage engine developed for blockchain applications. This work is developed for and sponsored by Input Output HK.

Common problems with mmap files

Let’s start with the common issues memory-mapped (mmap) files have. They have been part of Java for 15 years, since the 1.4 release, but the mmap implementation is far from perfect. The JVM has a number of unfixed bugs related to memory-mapped files. One of them is 15 years old now, and there is a small group of fans who celebrate its birthday every year ;-)

Those issues are:

Limited number of mmap handles

There is a limited number of mmap handles per process (64K on Linux). If you use too many handles, the JVM process will not be able to mmap new files. Usually your code will fail with an exception:

Caused by: java.lang.OutOfMemoryError: Map failed
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)

But this failure may happen anywhere in the JVM (when it loads a new library, for example), and could even cause a JVM crash.

You can count the number of mmap handles a process is using with this command: sudo cat /proc/$PID/maps | wc -l

You can also increase the number of mmap handles in /proc/sys/vm/max_map_count.
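The same check can be done from inside the JVM. Here is a minimal sketch, assuming Linux with /proc mounted. The class name MmapHandleCount and its methods are made up for this article and are not part of MapDB or IODB.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, Linux only: reads mapping counts from /proc. */
final class MmapHandleCount {

    /** Number of mappings held by the current JVM, or -1 if /proc is not available. */
    static long used() throws IOException {
        Path maps = Paths.get("/proc/self/maps");
        if (!Files.isReadable(maps)) return -1;
        long count = 0;
        // ISO-8859-1 never fails on odd bytes in mapped file names
        try (BufferedReader in = Files.newBufferedReader(maps, StandardCharsets.ISO_8859_1)) {
            while (in.readLine() != null) count++; // one line per mapping
        }
        return count;
    }

    /** Per-process limit from vm.max_map_count, or -1 if it cannot be read. */
    static long limit() throws IOException {
        Path max = Paths.get("/proc/sys/vm/max_map_count");
        if (!Files.isReadable(max)) return -1;
        try (BufferedReader in = Files.newBufferedReader(max, StandardCharsets.ISO_8859_1)) {
            return Long.parseLong(in.readLine().trim());
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("mappings used: " + used() + ", limit: " + limit());
    }
}
```

Note that the count includes every mapping in the process (heap, libraries, thread stacks), not only the mapped ByteBuffers, which is exactly why the limit can be hit in unexpected places.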

No unmap

Each mapped ByteBuffer uses one mmap handle, but there is no official way to close a memory-mapped buffer and release this handle. The JVM releases the handle only after the ByteBuffer gets garbage collected. If GC is not invoked for some time (large free heap), you will run out of handles.

The workaround is to use the cleaner hack. It uses reflection to access an undocumented API that releases the mmap handle before the ByteBuffer gets garbage collected.

However, the ByteBuffer cannot be used after it was unmapped; any access to it leads to an invalid memory operation and causes the JVM to crash. So the cleaner hack should be used together with some locking mechanism that makes the ByteBuffer inaccessible after it was closed.
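To illustrate the combination of cleaner hack and locking, here is a rough sketch. It is not the code MapDB uses. The class GuardedMmap is hypothetical, and the unmap step assumes Java 9 or newer, where sun.misc.Unsafe has an invokeCleaner method. On other JVMs the reflection fails and the sketch silently falls back to waiting for GC.

```java
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical wrapper: every access takes a lock, so nothing touches the buffer after unmap. */
final class GuardedMmap implements AutoCloseable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private MappedByteBuffer buffer; // set to null once unmapped

    GuardedMmap(Path file, long size) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            buffer = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }

    byte get(int index) {
        lock.readLock().lock(); // many readers may hold this at once
        try {
            if (buffer == null) throw new IllegalStateException("buffer was unmapped");
            return buffer.get(index);
        } finally {
            lock.readLock().unlock();
        }
    }

    void put(int index, byte value) {
        lock.readLock().lock(); // the lock only guards against a concurrent unmap
        try {
            if (buffer == null) throw new IllegalStateException("buffer was unmapped");
            buffer.put(index, value);
        } finally {
            lock.readLock().unlock();
        }
    }

    @Override
    public void close() {
        lock.writeLock().lock(); // waits until all accessors have left
        try {
            if (buffer == null) return;
            MappedByteBuffer old = buffer;
            buffer = null; // make it unreachable before releasing the mapping
            unmap(old);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Cleaner hack via reflection (assumption: Java 9+); on failure the mapping stays until GC. */
    private static void unmap(MappedByteBuffer b) {
        try {
            Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
            Field theUnsafe = unsafeClass.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            Method invokeCleaner = unsafeClass.getMethod("invokeCleaner", ByteBuffer.class);
            invokeCleaner.invoke(theUnsafe.get(null), b);
        } catch (ReflectiveOperationException | RuntimeException e) {
            // undocumented API not available: leave the mapping to the garbage collector
        }
    }
}
```

The price is a lock acquisition on every access. The buffer must also never leak out of the wrapper, because a stray reference could still be read after unmap.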

Windows file delete

If a file region was mmaped on Windows, it is not possible to delete that file. The file remains undeletable even after it was correctly closed and unmapped and the JVM process exits. To delete this file, one has to restart the operating system (does anyone remember the deleteAfterRestart hook in Delphi? :-)) This was reported, but the JVM team cannot solve it, since it is a Windows feature.

This problem is so annoying that we recommend not using mmap files on Windows.

JVM crash when disk gets full

It is possible to crash the JVM using only the official API:

  • create an empty file
  • mmap a large section of this file for write
  • write into the mmaped ByteBuffer
  • the operating system will keep the modified memory pages for delayed write
  • if the delayed write fails, the OS will send a SIGBUS signal to the JVM process
  • the JVM process does not handle the SIGBUS signal, so it gets killed by the Linux kernel

One would think IO errors are rare. But running out of free disk space is a very common situation. Most operating systems (such as Linux) use sparse files, which means no disk space is really allocated until the first write. If the operating system fails to allocate space (no free space on disk), the delayed write fails and the JVM gets killed.

There is an old bug report opened for MapDB. It was solved by preallocating space: when the file expands, the new part is filled using FileOutputStream to ensure all file space is allocated. However, this slows down MapDB inserts significantly, and a better solution is needed.
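The preallocation idea looks roughly like this. It is a simplified sketch and not the actual MapDB code. The class ZeroFill and the 1MB block size are my choices for illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/** Hypothetical helper: grows a file by really writing zeroes instead of leaving a sparse hole. */
final class ZeroFill {

    /** Appends zeroes until the file is newSize bytes long; does nothing if it is already that long. */
    static void expand(File file, long newSize) throws IOException {
        byte[] zeroes = new byte[1024 * 1024];
        long remaining = newSize - file.length();
        try (FileOutputStream out = new FileOutputStream(file, true)) { // true = append mode
            while (remaining > 0) {
                int n = (int) Math.min(zeroes.length, remaining);
                out.write(zeroes, 0, n); // a full disk shows up here as an IOException
                remaining -= n;
            }
            out.getFD().sync(); // push the zeroes to disk before the region gets mmaped
        }
    }
}
```

The important part is where the failure happens: a full disk is reported as an ordinary IOException during expand(), before any mmaped write can hit unallocated space. The cost is that every new byte of the file is written twice.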

mmap regions have fixed size, and do not expand

FileChannel#map returns a fixed-size ByteBuffer. There is no way to change its size after the file expands.

MapDB solved this problem by mapping files in small buffers (1MB). However, that uses too many handles, and would cause a crash when large stores are opened (>64GB).

It is also very slow on large files. It takes a long time to mmap a few thousand new ByteBuffers (1MB regions and a 20GB file) when the file is opened. And it takes a long time to sync such a file to disk (iterate over all ByteBuffers and call force() on each).

2GB limit per single region

ByteBuffer uses a 32-bit signed integer for addressing, which limits the size of a single mmap region to 2GB. One has to mmap multiple regions to handle larger files.
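Addressing a file through several regions can be sketched as follows. The class ChunkedMmap is hypothetical. The chunk size is a parameter so that the same code covers both 1MB chunks and chunks close to 2GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical read-only view over a file of any size, built from fixed-size mmap regions. */
final class ChunkedMmap {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;

    ChunkedMmap(FileChannel channel, long chunkSize) throws IOException {
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE)
            throw new IllegalArgumentException("chunk must fit into a single ByteBuffer");
        this.chunkSize = chunkSize;
        long fileSize = channel.size();
        int count = (int) ((fileSize + chunkSize - 1) / chunkSize); // round up
        chunks = new MappedByteBuffer[count];
        for (int i = 0; i < count; i++) {
            long start = i * chunkSize;
            long length = Math.min(chunkSize, fileSize - start); // last chunk may be shorter
            chunks[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, length);
        }
    }

    /** Translates a 64-bit file offset into (chunk index, position inside chunk). */
    byte get(long offset) {
        return chunks[(int) (offset / chunkSize)].get((int) (offset % chunkSize));
    }
}
```

Each chunk costs one mmap handle, so the chunk size decides how quickly the handle limit is reached. Multi-byte values that cross a chunk boundary need extra care; this sketch only reads single bytes.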

New approach

Over time I learned a better way to handle memory-mapped files. Here are a few pieces of the puzzle:

Read-only mmap, write with FileChannel

LMDB is a key-value store which provides a btree on top of mmap files. It is written in C, uses a low-level API and is pretty efficient. I recently studied its source code to see how its design could be reused in IODB.

The most interesting part is that LMDB does not use mmap files for writing. Files are mmaped in read-only mode, and a FileChannel-like method is used to write changes to disk. Once changes are written and synced, they are visible to the read-only mmaped ByteBuffer.

This approach has some interesting features:

  • There is no JVM crash if disk becomes full and delayed write fails

  • EDIT: An earlier version of this point said that FileChannel preserves write order (data are flushed to disk in the same order they were written to FileChannel), and that using mmap files for writes would instead require a write-ahead log for crash protection. That was not accurate. Write order is important to prevent data corruption, but FileChannel (or any buffered writer) does not preserve write order unless the file is append-only. LMDB does a double sync to make changes durable.

  • Performance difference between writable mmap files and FileChannel is minimal if larger blocks are used.

  • This works extremely well for append-only files

So this could be a new way to write append-only files such as a write-ahead log.
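Here is a small sketch of the read-only mmap plus FileChannel write combination in Java. The class ReadMmapWriteChannel is made up for this article. It assumes an operating system where channel writes are visible through an existing mapping of the same file, as on Linux; the Java specification leaves this OS dependent.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo: data goes in through FileChannel and comes out through a read-only mmap. */
final class ReadMmapWriteChannel {

    /** FileChannel.write may write fewer bytes than requested, so loop until done. */
    private static void writeFully(FileChannel ch, ByteBuffer src, long position) throws IOException {
        while (src.hasRemaining()) position += ch.write(src, position);
    }

    /** Writes value at offset 0 via the channel and reads it back via the mapped buffer. */
    static long writeThenRead(Path file, long value) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // give the file one page of zeroes so the mapping lies fully inside it
            writeFully(ch, ByteBuffer.allocate(4096), 0);
            MappedByteBuffer readOnly = ch.map(FileChannel.MapMode.READ_ONLY, 0, 4096);

            ByteBuffer data = ByteBuffer.allocate(Long.BYTES);
            data.putLong(value).flip();
            writeFully(ch, data, 0); // a full disk would fail here with an IOException
            ch.force(false);         // sync data before readers rely on it

            return readOnly.getLong(0); // the change is visible through the mapping
        }
    }
}
```

Any attempt to modify readOnly throws ReadOnlyBufferException, so all writes are forced through the channel, where errors arrive as exceptions.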

mmap beyond end of file

Here is a solution to another problem: how to increase the size of an mmaped ByteBuffer when the underlying file expands, but buffers have a fixed size? MapDB maps the file in 1MB increments, but that is very slow and requires too many resources (handles).

The solution is very simple, and works pretty well on Linux. Let’s mmap beyond the end of the file:

  • create an empty file
  • open a writable FileChannel for this file
  • use FileChannel#map() to map a 2GB read-only ByteBuffer over this file
  • the file size grows to 2GB, but it is a sparse file, so no disk space is actually used
  • call FileChannel#truncate() to shrink the file size back to zero

Now there is an empty file with a large read-only ByteBuffer mmaped over it. The file can be safely modified (and expanded) with FileChannel. File changes will be visible in the mmaped ByteBuffer.

Once the file grows over 2GB, mmap another 2GB region and use FileChannel#truncate() again.

This approach does not work on all operating systems. On Windows, FileChannel.truncate() does not shrink the file, so the empty file still takes 2GB. The method can still be used on Windows; we just use smaller regions (128MB vs 2GB), and the file can be truncated when the store is closed (after unmap).
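The steps can be sketched in Java like this. It is a minimal illustration for Linux only. The class MmapBeyondEof is hypothetical, and the region size is a parameter so it can be tried with something smaller than 2GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical helper: maps a read-only region that reaches past the current end of file. */
final class MmapBeyondEof {

    /**
     * The channel must be open for reading and writing. Mapping past the end makes
     * the JDK extend the file, so it is truncated back to its previous size afterwards.
     */
    static MappedByteBuffer mapBeyondEnd(FileChannel ch, long regionSize) throws IOException {
        long originalSize = ch.size();
        MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_ONLY, 0, regionSize);
        ch.truncate(originalSize); // file shrinks again, the mapping stays in place
        return region;
    }
}
```

A possible usage, again only a sketch:

```java
// ch is a FileChannel opened with READ and WRITE on an empty file
MappedByteBuffer region = MmapBeyondEof.mapBeyondEnd(ch, 64L * 1024 * 1024);
ch.write(ByteBuffer.wrap(new byte[]{1, 2, 3}), 0); // file grows to 3 bytes
byte first = region.get(0);                        // reads the byte just written
```

One thing to keep in mind: the buffer must only be read at offsets below the current file size. Touching a part of the mapping that lies beyond the end of the file raises SIGBUS, so the store has to track the file size itself.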

Fast disk space allocation

Until recently I thought that a region of a sparse file has only two states:

  • Large file exists, its disk space is not allocated, data were not written yet
  • Large file exists, its disk space is allocated and data were written

But apparently on Linux there is a third state, where the disk space is allocated, but no data were written yet. There is the fallocate call which handles that.

The JVM does not provide any “legal” way to call fallocate directly. There are other options:

  • A JNA call, for example in this library.
  • Another option is to execute the fallocate command (part of util-linux, usually installed as /usr/bin/fallocate).
  • And finally, FileChannel.transferFrom() from a sparse file seems to have the same effect.

Why is this excellent? If a sparse file is in this allocated state, the delayed write cannot fail when the disk is full. So when the file expands, we do not have to overwrite the new region with zeroes, but can use fallocate instead. That is much faster.
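The second option, executing the command, could look roughly like this. It is a sketch that assumes the fallocate utility from util-linux is on the PATH and that the file system supports the call. The class Fallocate is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;

/** Hypothetical helper: allocates disk space by running the external fallocate command. */
final class Fallocate {

    /**
     * Runs "fallocate -l size file". Returns true on success, false if the command
     * is missing or fails (for example on a file system without fallocate support).
     */
    static boolean allocate(Path file, long size) throws InterruptedException {
        try {
            Process p = new ProcessBuilder("fallocate", "-l", Long.toString(size), file.toString())
                    .inheritIO() // let the command print its errors to our stderr
                    .start();
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false; // command could not be started
        }
    }
}
```

When this returns false, the store can fall back to filling the new region with zeroes. Starting a process on every expansion is not free, so it makes sense to allocate in large steps.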
