Database Comparisons

Some uses B Tree (mongoDB)
Mostly uses LSM Tree (Memtable + SSTable)
- Memtable is a balanced binary search tree in memory
Fast writes to memory
Search is slow because it may have to search many SSTables for the target key. Each SSTable requires O(Log(n)) to search (binary search because each SSTable is sorted), and O(n) for all the SSTables

partitioning: divide a table into subsets within the same node
sharding: a more advanced and distributed form of horizontal partitioning
- range-based sharding: good locality, but bad if hot spots exist
- hash-based sharding: evenly distributed, but no data locality for range queries
  - to mitigate data locality issue, we can use a secondary index
- geo-based sharding:

additional index on a different column
local secondary index:
- an additional index on the column in each of the sharded nodes
- reads need to query from all nodes
- write is same as before, still write to one shard
global secondary index:
- copy all data, order them by the new index, and shard those
- two copies of the data, possibly belonging to two differents nodes
- faster reads, because no need to read from all nodes anymore,
- horrible write, because we need to maintain consistency between the two seconds of data possiblly located on two nodes. Distributed transaction and 2 phase locking is required in this case.
example of local secondary index:
- table is sharded based on user_id
- secondary index is based on age
- query users whose age is in (30, 60)
- since shard key was user_id, there is no guarantee where those users are distributed, so all shards needs to be queried, and results needs to be merged
example of global secondary index:
- "table1" is sharded based on user_id
- another duplicated table "table2" is sharded based on age
- merge(all_nodes("table1")) = merge(all_nodes("table2"))
- but on any specific node, table1 != table2 because they're indexed on diffrent columns
- reading is fast, but writing requires distributed transaction, or 2 phase locking

Two phases
- Growing Phase: the transaction acquires all the necessary locks on data items it needs to access
- Shrinking Phase: the transaction holds all acquired locks and begins releasing htem
distributed transactions
It prevents data anomalies like dirty reads, lost updates, and phantom reads
but it may lead to deadlocks