Lessons from MapDB development

MapDB is a great project, but for many reasons it is falling behind other projects that emerged around the same time (Hazelcast, Redis…). In this post I will outline the mistakes I made over the years while working on MapDB.

  • concurrency
    • JDBM3 back in 2012 used a single ReadWriteLock to handle concurrency.
      • That allows parallel readers, but only a single writer.
      • SQLite has a similar approach
      • Antirez from Redis is a big advocate of simplifying things by avoiding concurrency
    • That was redesigned in MapDB 1, to allow parallel writers
    • Parallel writers did not bring real performance benefits
      • j.u.c.ConcurrentSkipListMap scales linearly with the number of cores
      • MapDB only scales up to 4 cores
    • Concurrency greatly complicated things and is responsible for many delays
    • Most benefits of concurrency could be achieved under a single lock
      • The fail-fast iterator in JDBM3 would throw ConcurrentModificationException; that can be fixed under a single lock
      • Concurrent scalability is still possible under a single lock with sharding and other trivial tricks
      • A single lock would still allow a background writer thread, which is the main benefit for latency
  • writing a database is a hard task
    • working over raw binary files is tough
      • good luck debugging a wrong file offset in a 1TB store
    • even now in 2017, there are not many database engines
      • most of them use relatively simple ideas (BTree, LSM)
    • papers describing algorithms are often imprecise
      • there is a B-Link-Tree paper which describes a concurrent BTree
        • published in the 1980s, many citations
        • but even today it is not clear how to handle some concurrent cases (root update)
        • initial implementation took one week
        • it took about 3 months of work to nail it and make it thread safe
  • too many features
    • MapDB had too many ways to open files, handle concurrency, etc.
    • that created too many combinations to test
    • it was hard to document and explain all the features
  • code duplication and not invented here
    • I spent a long time writing code which was already written in other libraries
    • MapDB was self-contained, with no dependencies
  • MapDB does not integrate with default tools and de facto standards
    • TFile and HFile and other formats
      • data exported from MapDB could be used by other databases
      • MapDB could operate directly over data created by other tools
    • other file formats
    • reimplementation of an existing API (LevelDB Java binding)
      • this way MapDB could be used as a drop-in replacement for other libraries
  • did not follow test driven development
    • automated testing in MapDB is still fairly good, for example we test process crash recovery with kill -9 PID
    • we had the same problem with JDBM 1.5 back in 2008
    • long running tests were broken for a long time (fixed now)
    • it took too long to run default unit tests (fixed now)
  • not enough performance testing
    • no performance regression testing
    • Single Entry locks destroyed performance in 3.x branch (fixed in 3.0.5)
    • old code had concurrent scalability problems
      • it used segmented locks
      • too many embedded locks, sometimes semaphores without a memory barrier would do
  • file format and API changed way too many times
    • it was necessary to fix early design mistakes
    • should have started with in-memory store
    • code change is sometimes necessary
  • documentation
    • spent way too much time deciding on format (Markdown versus reStructuredText)
    • too much time went into generating PDFs, not many people are using them
    • mapdb.org was originally generated by Maven Site plugin (haha)
    • a GitHub wiki would have been fine from the start
    • MapDB needs way more code examples
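
To make the sharding point from the concurrency section concrete, here is a minimal Java sketch. It is not MapDB code: the class name ShardedMap, the shard count and the hash-spreading step are assumptions made up for this illustration. Each shard is a plain HashMap guarded by its own ReadWriteLock, so every shard is as simple as a single-lock store, while writers that hit different shards do not block each other.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical example class, not part of MapDB or JDBM.
final class ShardedMap<K, V> {

    // One shard = one ordinary map protected by one lock.
    private static final class Shard<K, V> {
        final ReadWriteLock lock = new ReentrantReadWriteLock();
        final Map<K, V> data = new HashMap<>();
    }

    private final Shard<K, V>[] shards;

    @SuppressWarnings("unchecked")
    ShardedMap(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        shards = new Shard[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = new Shard<>();
        }
    }

    // Pick a shard from the key's hash; floorMod keeps the index non-negative.
    private Shard<K, V> shardFor(K key) {
        int h = key.hashCode();
        h ^= (h >>> 16); // spread high bits, an assumption borrowed from common hash-map practice
        return shards[Math.floorMod(h, shards.length)];
    }

    // Readers of the same shard can run in parallel.
    V get(K key) {
        Shard<K, V> s = shardFor(key);
        s.lock.readLock().lock();
        try {
            return s.data.get(key);
        } finally {
            s.lock.readLock().unlock();
        }
    }

    // Only one writer per shard, but writers on different shards do not contend.
    V put(K key, V value) {
        Shard<K, V> s = shardFor(key);
        s.lock.writeLock().lock();
        try {
            return s.data.put(key, value);
        } finally {
            s.lock.writeLock().unlock();
        }
    }

    // Sums shard sizes one by one; the result is not an atomic snapshot under concurrent writes.
    int size() {
        int total = 0;
        for (Shard<K, V> s : shards) {
            s.lock.readLock().lock();
            try {
                total += s.data.size();
            } finally {
                s.lock.readLock().unlock();
            }
        }
        return total;
    }
}
```

With something like new ShardedMap<String, Integer>(16), up to 16 writers can make progress at once if their keys land in different shards, and none of the code needs anything more complicated than one lock around one map.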
