Fundamentals and High Level Overview

System Design Process

Gather requirments
Back of envelope calculations
API design
Architectual design

1 Database

sequential reads and writes are always cheaper than random reads and writes
- want to keep data that is frequently read together close together
- want to keep data that is frequently written to together close together
indexes can be used to speed up reads, but often at the expense of slowing down writes
without index, we can just write data sequentially, that is basically a WAL (Write Ahead Log), but reads will be slow: O(n)

1.1 Hash Index

reads and writes are O(1)
usually in memory
range query is impossible, because each data is written into random memory/disk addresses, they're not sequential, so range query is essentailly N reads, where N is the size of the range

1.2 B+ Tree Index

tree structure allows for O(log n) read speed
data are written together on disk (for primary keys), so range query is fast
for indexes other than primary keys, range query is also fast because the pointers to the range data are stored together (the leaves of the B+ tree)
but data on disk are stored sequentially according to the primary key, not second indexes
- which means even though pointers are used for fast query reads, but the actual data are scattered around the disk, so querying is still not as efficient as primary key index

1.3 LSM Tree Index (Memtable + SSTables)

writes are fast because they're written to the memtable in memory first
memtable itself is also a tree - self balanced binary tree
data in memtable is being flushed to disk periodically as SSTables
SSTables are compacted into a new SSTable for the next layer to reduce the size because only the latest update for a key is relevant and stored
read is slow, because it requires a linear scan (O(n)) of all the memtable and SSTables until the key is found
- if a key is really really old, at worst case scenario, all SSTables would need to be scaned to find that key

ACID1.4 PrincipleTransactions

ACID
~~Atomicity~~:
- Atomicity: ~~entire~~ ~~transaction~~ ~~happens at once,~~all or ~~doesn'~~nothing write
- Consistency: write does not break rules
- Isolation: concurrent transactions don't ~~happen~~intefere atwith ~~all~~
  one another
- Durability: data are recoverable during system failure
~~Consistency~~~~: database must be consistent before and after transactions (follow all rules, no violations)~~

~~Isoluation~~~~: multiple~~ ~~transactions~~ ~~can occur independently without inteference~~

~~Durability~~~~: changes of a successful transaction occurs even if the system failure occurs~~

Isolation Levels
- Phantom ~~Non-repeatable~~Read: ~~Read~~~~: during the course of~~in a transaction, ~~data~~running the range query twice result in ~~a row changes~~
- ~~Phantom Read~~~~: two identical queries return~~ different ~~results~~sets (aof ~~new row has just been added by another transaction)~~
- ~~MySQL is~~ ~~repeatable read~~~~, PostgreSQL is~~ ~~read commited~~ data
- https://www.youtube.com/watch?v=GAe5oB742dw

1.4.1 Serializable

it appears to the writer as if all transactions ran on a single thread, no need to worry about race conditions

There are several ways of ensuring serializability in a database:
- actually running on a single thread
- pessimistic concurrency control: suitable for high probability of concurrent write, because locking is better than read-write-write overhead
- optimistic concurrency control: suitable for low probability of concurrent write

Pessimistic Concurrency Control

hold locks on all read-from and written-to database rows, others must wait

For example, MySQL supports Pessimistic

# acquires an exclusive lock on the selected rows
select ... for update
# locks entire tables for exclusive access by the current transaction
lock tables ... write:

Optimistic Concurrency Control

allow concurrent updates, but any update needs to compare data before and after, if they dont match, that means another process has updated the data, so abort. but abort is expensive.

For example, MySQL supports Pessimistic, but optimistic concurrency control con be achieved with a "version" column in table:

select id, version from table where id = <id>
Update table where id = <id> and version = <stored_version>
# if affected rows == 0, resolve the conflict

1.4.2 Who supports transactions?

Most SQL DB do

NoSQL generally don't support transactions because transactions are expensive

1.5 Row vs Column Store

it all depends how application consumes data

if data in the row are always needed together, like a user profile, then row store makes sense

but if data in the column are always needed together, like in a time series analysis, then column store makes sense

1.6 Replication

database failures are common

replication logs are used to keep rep DB synced

Synchronous Replication
- all replicas needs to be updated before success confirmation
- strong consistency
- but it is slow, uses a lot of network bandwidth
- and if a single replica is down, we cannot commit!

Asynchronous Replication
- eventual consistency
- only a subset of replicas must confirm success
- but some replica can contain stale data
- and write conflicts may occur (if not using single leader replication)
- even though, it is still preferable than synchronous replication

Single Leader Replication
- write to one, replicate to many
- simple to implement
- write throughput is limited
- if leader fails, a new one needs to be picked, and the new one may not be up to date
- during failover, we cannot complete any writes

Multi Leader Replication
- increased throughput, but need to resolve write conflicts
- Last Write Wins to resolve write conflicts
  - but timestamps are unreliable

Bloom Filter

in situations where read is not efficient (like a LSM tree), use bloom filter can eliminate some of the non-existing keys, thus return "no, key is not there" to save read time.
it may produce falst positives (it says it's there but it is actually not there) because of hash collisions
How to choose an appropriate size for bloom filter?
- for a chosen expected number of keys and acceptable false positive rates, use this formula:
  - m = -n * log2(p) / (log(2))^2
  - n = number of keys
  - p = acceptable false positive rates
  - m = size of bloom filter in bits
- optimal number of hash functions (k):
  - k = (m / n) * log(2)

C10K problem?

the challenge of designing and implementing a server that can efficiently handle a large number of concurrent client connections, specifically 10,000.
solution:
- asynchronous I/O
- Non-Blocking I/O
- Efficient Data Structures
- Connection Pooling

Change Data Capture

captures changes made within a database
Implmentations:
- Query based
- Trigger based
- Log based
- Proprietary (developed by database vendor) based

Idempotent

some operation is idempotent if it always returns the same result.
Ex, x=5 is idempotent while x++ is not

Thundering Herd problem?

when a large number of requests try to access a resources in a small time frame due to reasons like cache eviction, hardware restart, etc.
Flooded backend may further cause system failures
To mitigate this issue
- randomized cache expiration time
- rate limiting
- backend sharding
- asynchronous cache updates
- backoff time on client side

Gossip Protocol

a protocol to detect node failures in a distributed and de-centralized cluster
each node passes information (heartbeat, timestamp, etc) to a random set of other nodes to let them know its still alive
other nodes, upon receiving the information, adds information about it self, and pass those information to another set of random nodes
eventually all nodes will be ping and get the updated information.
If after a while, all of the nodes haven't heard about one particular node, than that node is considered as down.
Cassandar uses Gossip Protocol to detect node failures.

123