Lessons from MapDB development

MapDB is a great project, but for many reasons it is falling behind other projects that arose around the same time (Hazelcast, Redis…). In this post I will outline the mistakes I made over the years while working on MapDB.

concurrency
- JDBM3 back in 2012 used a single ReadWriteLock to handle concurrency.
  - That allows parallel readers, but only a single writer.
  - SQLite has a similar approach
  - Antirez from Redis is a big advocate of simplifying things by avoiding concurrency
- That was redesigned in MapDB 1 to allow parallel writers
- Parallel writers did not bring real performance benefits
  - j.u.c.ConcurrentSkipListMap scales linearly with the number of cores
  - MapDB only scales up to 4 cores
- Concurrency greatly complicated things and is responsible for many delays
- Most benefits of concurrency could be achieved under a single lock
  - The fail-fast iterator in JDBM3 would throw ConcurrentModificationException; that can be fixed under a single lock
  - Concurrent scalability is still possible under a single lock with sharding and other trivial tricks
  - A single lock would still allow a background writer thread, which is the main benefit for latency
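
The sharding trick mentioned above can be sketched roughly as follows. This is only an illustration, not code from MapDB or JDBM3: the class name `ShardedLockMap` and its methods are made up for this example. Each shard is a plain `HashMap` guarded by its own single `ReentrantReadWriteLock`, so every shard keeps the simple "many readers, one writer" model while writers on different shards do not block each other.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical illustration of sharding over single-lock stores; not MapDB code. */
class ShardedLockMap<K, V> {

    /** One shard: a non-concurrent map protected by exactly one ReadWriteLock. */
    private static final class Shard<K, V> {
        final Map<K, V> data = new HashMap<>();
        final ReadWriteLock lock = new ReentrantReadWriteLock();
    }

    private final Shard<K, V>[] shards;

    @SuppressWarnings("unchecked")
    ShardedLockMap(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        shards = (Shard<K, V>[]) new Shard[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = new Shard<>();
        }
    }

    /** Picks the shard from the key hash; floorMod keeps the index non-negative. */
    private Shard<K, V> shardFor(K key) {
        return shards[Math.floorMod(key.hashCode(), shards.length)];
    }

    /** Readers of the same shard run in parallel under the read lock. */
    V get(K key) {
        Shard<K, V> shard = shardFor(key);
        shard.lock.readLock().lock();
        try {
            return shard.data.get(key);
        } finally {
            shard.lock.readLock().unlock();
        }
    }

    /** Only one writer per shard at a time; other shards stay available. */
    V put(K key, V value) {
        Shard<K, V> shard = shardFor(key);
        shard.lock.writeLock().lock();
        try {
            return shard.data.put(key, value);
        } finally {
            shard.lock.writeLock().unlock();
        }
    }

    /** Sums shard sizes one by one; the total is not an atomic snapshot. */
    int size() {
        int total = 0;
        for (Shard<K, V> shard : shards) {
            shard.lock.readLock().lock();
            try {
                total += shard.data.size();
            } finally {
                shard.lock.readLock().unlock();
            }
        }
        return total;
    }
}
```

With `new ShardedLockMap<String, Integer>(16)`, two threads writing keys that hash to different shards proceed in parallel, while the code inside each shard stays as simple as the single-lock design.
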
writing a database is a hard task
- working over raw binary files is tough
  - good luck debugging a wrong file offset in a 1TB store
- even now in 2017, there are not many database engines
  - most of them use relatively simple ideas (BTree, LSM)
- papers describing algorithms are often imprecise
  - there is a B-Link-Tree paper which describes a concurrent BTree
    - published in the 1980s, many citations
    - but even today it is not clear how to handle some concurrent cases (root update)
    - initial implementation took one week
    - it took about 3 months of work to nail it and make it thread safe
too many features
- MapDB had too many ways to open files, handle concurrency, and so on
- that created too many combinations to test
- it was hard to document and explain all the features
code duplication and not invented here
- I spent a long time writing code which was already written in other libraries
- MapDB was self-contained, with no dependencies
MapDB does not integrate with default tools and de facto standards
- TFile and HFile and other formats
  - data exported from MapDB could be used by other databases
  - MapDB could operate directly over data created by other tools
- other file formats
- reimplementation of an existing API (LevelDB Java binding)
  - this way MapDB could be used as a drop-in replacement for other libraries
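
The drop-in idea could look roughly like the sketch below. Everything here is an assumption made for illustration: `ByteStore` is a made-up interface that only loosely resembles a LevelDB-style get/put/delete binding, not the real LevelDB Java API, and `SortedMapByteStore` is a hypothetical adapter. A `ConcurrentSkipListMap` stands in for the backing collection; in a real adapter that would be a MapDB collection.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical minimal key-value API; NOT the actual LevelDB Java binding. */
interface ByteStore {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
    void delete(byte[] key);
}

/** Hypothetical adapter exposing a sorted map of byte arrays through ByteStore. */
final class SortedMapByteStore implements ByteStore {

    /** Compares arrays byte by byte as unsigned values, shorter prefix first. */
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    // The map must order keys by content, because byte[] has identity-based equals.
    private final ConcurrentNavigableMap<byte[], byte[]> map;

    SortedMapByteStore(ConcurrentNavigableMap<byte[], byte[]> map) {
        this.map = map;
    }

    /** In-memory stand-in for a real store, used only for this example. */
    static SortedMapByteStore inMemory() {
        return new SortedMapByteStore(new ConcurrentSkipListMap<>(UNSIGNED_LEXICOGRAPHIC));
    }

    @Override
    public byte[] get(byte[] key) {
        byte[] value = map.get(key);
        // Return a copy so callers cannot mutate the stored value.
        return value == null ? null : value.clone();
    }

    @Override
    public void put(byte[] key, byte[] value) {
        // Copy on the way in so later changes to the caller's arrays do not leak in.
        map.put(key.clone(), value.clone());
    }

    @Override
    public void delete(byte[] key) {
        map.remove(key);
    }
}
```

Code written against `ByteStore` would not need to know which engine sits behind it, which is the point of reimplementing an API that other libraries already share.
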
did not follow test driven development
- automated testing in MapDB is still fairly good, for example we test process crash recovery with kill -9 PID
- we had the same problem with JDBM 1.5 back in 2008
- long running tests were broken for a long time (fixed now)
- it took too long to run default unit tests (fixed now)
not enough performance testing
- no performance regression testing
- Single Entry locks destroyed performance in 3.x branch (fixed in 3.0.5)
- old code had concurrent scalability problems
  - it used segmented locks
  - too many embedded locks, sometimes semaphores without memory barrier would do
file format and API changed way too many times
- it was necessary to fix early design mistakes
- should have started with in-memory store
- code change is sometimes necessary
documentation
- spent way too much time deciding on a format (Markdown versus reStructuredText)
- too much time went into generating PDFs, not many people are using them
- mapdb.org was originally generated by Maven Site plugin (haha)
- a GitHub wiki would have been fine from the start
- MapDB needs way more code examples

Comments

xamde • 4 years ago

Thanks for sharing. I watched the Java embedded database space thoroughly, there are few real candidates, MapDB always looking like the best of them. My main problem was stability. I needed far less features, cared less about ultra-performance, but need a stable, reliable, somewhat scalable (otherwise I could use in-memory) key-value store first. I am happy to see MapDB lives on and I hope there will be stable, maintained, releases with stable on-disk formats.

–

Nick Apperley • 3 years ago

Since Kotlin is being used with MapDB have Kotlin Coroutines ( https://www.youtube.com/wat… ) been considered for non blocking concurrency? Some NoSQL DB systems use Coroutines ( https://en.wikipedia.org/wi… ) to handle concurrency ( https://medium.com/software… ).