Log Store format and compaction

Designing reliable append-only stores is hard. This blog post outlines the design of a new append-only store for MapDB4. It has the following features:

  • Append-only log based files. Old records are never overwritten
  • Compaction runs in background, does not block IO
  • Lock-free snapshots and isolation
  • Parallel writers
  • Transactions with rollback

Store is the basic low-level storage layer in MapDB. It maps a recid (record ID) to a binary record (byte[]).
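Here is a minimal sketch of what that interface could look like; the method names are assumptions based on the description above, not the final MapDB4 API:

```java
public interface Store {

    /** Insert a new record, return its recid. */
    long put(byte[] record);

    /** Return the current version of a record, or null if it was deleted. */
    byte[] get(long recid);

    /** Replace the record stored under an existing recid. */
    void update(long recid, byte[] record);

    /** Delete the record stored under a recid. */
    void delete(long recid);
}
```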

Files

LogStore uses a directory with multiple subdirectories:

  • data - contains data files
  • blob - stores large records that are too big to be stored directly in a log file
  • generations - file generation markers, used for replay

Data files

  • Each data file has a maximal size
    • store-wide, configured at store creation (typically 128MB)
    • maximal value is 2GB (ByteBuffer limit), since the file offset is a signed int
  • New records are appended to the end of the file
  • When a file becomes too big
    • it is closed and
    • a new file is started (see the sketch after this list)
  • Each Data File has a 4-byte number
    • File numbers start at 0 and increment with each new file
    • Newer files have higher file numbers
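A minimal sketch of this roll-over rule, assuming a store-wide maximal file size and illustrative file naming (none of this is the actual MapDB4 code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of append and file roll-over; names and naming scheme are illustrative.
class DataFileWriter {
    final long maxFileSize;          // store-wide, e.g. 128MB
    final Path dir;
    long fileNumber = 0;             // starts at 0, increments per file
    FileChannel currentFile;

    DataFileWriter(Path dir, long maxFileSize) throws IOException {
        this.dir = dir;
        this.maxFileSize = maxFileSize;
        this.currentFile = openDataFile(fileNumber);
    }

    FileChannel openDataFile(long number) throws IOException {
        return FileChannel.open(dir.resolve(number + ".data"),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void append(ByteBuffer entry) throws IOException {
        if (currentFile.size() + entry.remaining() > maxFileSize) {
            currentFile.close();                    // file became too big
            fileNumber++;                           // newer file, higher number
            currentFile = openDataFile(fileNumber); // start a new file
        }
        currentFile.write(entry);                   // append at end of file
    }
}
```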

Blob files

  • Some records are too big for the log
    • Compaction moves data around, and large records would cause too much overhead
  • Large records are stored in separate files, in a dedicated ‘blob’ directory
  • Cut-off size is configurable at store creation, typically 64KB (see the sketch below)
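A sketch of the cut-off decision at write time; the constant and the helper methods are hypothetical and only illustrate where the branch sits:

```java
import java.io.IOException;

// Sketch of the blob cut-off decision; all names here are illustrative.
abstract class RecordWriter {
    static final int BLOB_CUTOFF = 64 * 1024;   // configurable at store creation

    abstract long writeBlobFile(byte[] record) throws IOException;  // file in 'blob' dir
    abstract void appendBlobEntry(long recid, long blobFileNumber) throws IOException;
    abstract void appendRecordEntry(long recid, byte[] record) throws IOException;

    void write(long recid, byte[] record) throws IOException {
        if (record.length > BLOB_CUTOFF) {
            // too big for the log: store in a separate blob file,
            // keep only a pointer entry in the log
            long blobFileNumber = writeBlobFile(record);
            appendBlobEntry(recid, blobFileNumber);
        } else {
            appendRecordEntry(recid, record);   // small enough for the log
        }
    }
}
```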

File Generations

  • The file number is an N-bit unsigned integer
    • N depends on the maximal Data File size; file number and file offset together form an 8-byte long (see the packing sketch after this list)
  • Over time this number overflows and starts again from #0
  • To deal with overflow, ‘file generations’ are used
    • They ensure that data file #0 has been deleted by the time the number overflows
    • They identify the oldest log file for log replay (data file #0 might not be the oldest)
  • There are 64K file generations, encoded in the highest two bytes of the file number
  • Generations form a cyclic buffer
    • Used generations are allocated contiguously
      • As new files are added and old ones deleted, this allocated window moves through the cyclic buffer
    • There is a ‘hole’ in the cyclic buffer; the log starts at the upper end of this hole
      • For example, if the used generations are 0-1, the hole is 2-64K and its upper end wraps around to generation 0, where the log starts
  • Each generation has an empty marker file in the ‘generations’ directory
    • If a file with the given name exists, the generation is in use
  • A new Data File can only be created if its generation is marked as used
  • A new Data File can not be created if there is no hole in the generation cyclic buffer
    • It would then be impossible to find the oldest Data File and the start of the log
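A sketch of how file number, offset and generation could share one 8-byte long, assuming 128MB (2^27 byte) data files; the exact bit layout is an assumption, only the "highest two bytes of the file number" rule comes from the description above:

```java
// Illustrative packing of a File Position into a single long.
class FilePosition {
    static final int OFFSET_BITS  = 27;                   // log2(128MB)
    static final int FILE_BITS    = 64 - OFFSET_BITS;     // the N-bit file number
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;

    static long pack(long fileNumber, long offset) {
        return (fileNumber << OFFSET_BITS) | offset;
    }

    static long fileNumber(long packed) { return packed >>> OFFSET_BITS; }

    static long offset(long packed) { return packed & OFFSET_MASK; }

    // highest two bytes of the file number form its generation (0..64K-1)
    static long generation(long fileNumber) {
        return fileNumber >>> (FILE_BITS - 16);
    }
}
```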

Log Entries

  • Each data file is composed of Log Entries
  • The log starts at the beginning of the file; log replay traverses each file from start to end
  • Each Log Entry starts with a 1-byte header that identifies its type

Log Entry Headers (they will most likely change in the final version):

  • 0 - invalid header
    • reserved to detect data corruption
  • 1 - eof
    • ends replay of this file and skips to the next file
  • 2 - skip single byte
    • used for padding
  • 3 - commit
    • written by tx commit
    • applies changes made since the last commit
  • 4 - rollback
    • written by tx rollback
    • discards changes made since the last commit
  • 5 - record
    • followed by a 4-byte signed int (record size),
    • an 8-byte recid,
    • and a byte[] that contains the record
  • 6 - blob record
    • followed by an 8-byte recid
    • followed by an 8-byte long, the blob file number
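A replay loop driven by these headers could look roughly like this; the header values match the list above, while the callbacks and buffer handling are assumptions:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Sketch of replaying a single data file from start to end.
abstract class Replayer {
    abstract void applyChangesSinceLastCommit();            // commit
    abstract void discardChangesSinceLastCommit();          // rollback
    abstract void record(long recid, ByteBuffer data);      // regular record
    abstract void blobRecord(long recid, long blobFileNumber);

    void replayFile(ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            switch (buf.get()) {                            // 1-byte header
                case 0: throw new IOException("invalid header: corrupted file");
                case 1: return;                             // eof: skip to next file
                case 2: break;                              // skip single padding byte
                case 3: applyChangesSinceLastCommit(); break;
                case 4: discardChangesSinceLastCommit(); break;
                case 5: {                                   // record
                    int size = buf.getInt();                // 4-byte signed size
                    long recid = buf.getLong();             // 8-byte recid
                    ByteBuffer data = buf.slice();          // record bytes
                    data.limit(size);
                    buf.position(buf.position() + size);
                    record(recid, data);
                    break;
                }
                case 6: {                                   // blob record
                    long recid = buf.getLong();             // 8-byte recid
                    blobRecord(recid, buf.getLong());       // 8-byte blob file number
                    break;
                }
                default: throw new IOException("unknown header");
            }
        }
    }
}
```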

Compaction

  • Compaction removes older versions of records and reclaims space
  • It takes an old Data File and moves the active records into a newer Data File; the old file is then deleted

Algorithm:

  • Select file #N
    • a big part of the file should be unused; some file statistics are needed to pick it
  • Replay the file, iterating over all records
  • If a record is active (its recid is in the Index Table)
    • append this record to the new file
      • use FileChannel#transferTo()
  • If not, ignore the record
  • At this point the file should contain no active records
  • Unmap and delete the file (see the sketch after this list)
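A sketch of one compaction pass, under the same assumptions as the earlier packing sketch; the {recid, offset, size} triples are assumed to come from replaying file #N:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.util.Map;

class Compaction {
    static void compactFile(FileChannel oldFile, long oldFileNumber,
                            FileChannel newFile, long newFileNumber,
                            Iterable<long[]> entries,   // {recid, offset, size}
                            Map<Long, Long> indexTable) throws IOException {
        for (long[] e : entries) {
            long recid = e[0], offset = e[1], size = e[2];
            Long current = indexTable.get(recid);
            // active means the Index Table still points at this exact version;
            // slightly stricter than "recid is in the Index Table", so stale
            // versions of a still-live recid are skipped too
            if (current != null
                    && current == FilePosition.pack(oldFileNumber, offset)) {
                long newOffset = newFile.position();
                oldFile.transferTo(offset, size, newFile);  // zero-copy append
                indexTable.put(recid,
                        FilePosition.pack(newFileNumber, newOffset));
            }
        }
        // the old file now holds no active records; unmap and delete it
    }
}
```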

TODO: can a transaction spread over multiple files?

Index Table

  • The most recent version of a record is determined by the Index Table
  • The Index Table maps a recid (long) to a File Position (long)
  • The Index Table is updated when a record is inserted, updated or deleted
  • When the store is opened, all Data Files must be replayed to reconstruct the Index Table
    • The Index Table could be saved every Nth file, so that replay only needs the last few Data Files
  • Snapshots make an immutable copy of the Index Table

  • The Index Table can be stored in
    • an on-heap Map<Long,Long>
    • a memory-mapped flat temporary file, offset = recid*8 (see the sketch below)
    • some sort of persistent (crash-resistant) file, to avoid full log replay
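A sketch of the memory-mapped option; note that a single mapping is itself limited to 2GB (the ByteBuffer limit), so a real implementation would map the file in chunks:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of a memory-mapped flat Index Table: the slot for a recid
// lives at offset recid * 8. Names and sizing are assumptions.
class MmapIndexTable {
    private final MappedByteBuffer slots;

    MmapIndexTable(Path file, long maxRecid) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // the mapping stays valid after the channel is closed
            slots = ch.map(FileChannel.MapMode.READ_WRITE, 0, maxRecid * 8);
        }
    }

    void put(long recid, long filePosition) {
        slots.putLong((int) (recid * 8), filePosition);
    }

    long get(long recid) {
        return slots.getLong((int) (recid * 8));
    }
}
```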