Log Store format and compaction

Designing reliable append-only stores is hard. This blog post outlines design for new append-only store for MapDB4. It has following features:

Store is basic low level storage layer in MapDB. It maps recid (record ID) into binary record (byte[]). Here is interface definition

Files

LogStore uses directory with multiple subdirectories:

data - contains data files
blob - stores large records, those are too big to be stored directly in log file
generations - file generation markers, used for replay

Data file has maximal size
- store wide, configured at store creation (typically 128MB)
- maximal value is 2GB (ByteBuffer limit), file offset is signed int
New records are appended to end of file
When file becomes too big
- It is closed and
- New file is started
Each Data File has 4byte number
- File number starts from 0 and increments for each file
- Newer file has higher file number

Some records are too big for log
- Compaction moves data, large records would cause overhead
Large records are stored in separate files, in dedicated ‘blob’ directory
Cut off size is configurable at store creation, typically 64KB

File number is N-bit unsigned long
- N depends on maximal Data File size, together they form 8-byte long
Overtime this number overflows and starts again from #0
To deal with overflow ‘file generations’ are used
- It ensures that data file #0 is deleted by the time it overflows
- It finds oldest log file for log replay (data file #0 might not be oldest)
There are 64K file generations, highest two bytes in file number
Generations form cyclic buffer
- Used generations are continously allocated
  - As new file are added, old deleted this allocated window moves in cyclic buffer
- There is a ‘hole’ in cyclic buffer, log starts at upper end of this hole
  - For example if used generations are 0-1 the hole is 2-64K, upper end of hole is 0
Each generation has empty file in ‘generations’ directory
- if file with given name exists, generation is used
New Data File can be only created if its generation is used
New Data File can not be created if there is no hole in generation cyclic buffer
- it would be impossible to find oldest Data File and start of the log

Log Entry Headers (they will most likely change in final version):

0 - invalid header
- reserved to detect data corruption
1 - eof
- end replay and skip to next file
2 - skip single byte
- used for padding
3 - commit
- written by tx commit
- apply changes from last commit
4 - rollback
- written by tx rollback
- discard changes since last commit
5 - record
- followed by 4 byte signed int (record size) and
- 8 byte recid
- byte[] that contains record
6 - blob record
- followed by 8-byte recid
- followed by 8-byte long, blob file number

Compaction removes older version of records and reclaims space.
It takes old Data File and moves active record into newer Data File, old file is then deleted

Algorithm:

Select file #N
- big part of file should be unused, some file stats are needed
Replay file, iterate over all records
If record is active (its recid is in Index Table)
- append this record to new file
  - use FileChannel#transferTo()
- if not, ignore this record
at this point file should have no active records
unmap delete

TODO: can transaction spread over multiple files?

Most recent version of record is determined by Index Table
Index Table maps Recid (long) to File Position (long)
Index Table is updated when record is inserted, updated or deleted
When store is opened, all Data Files must be replayed to reconstruct index table
- Index Table can be saved every Nth files, that will require only a few Data Files
Snapshots makes immutable copy of Index Table
Index Table can be stored in
- on-heap Map<Long,Long>
- memory mapped flat temporary file, offset=recid*8
- some sort of persistent (crash resistant) file, to prevent full log replay