Message Broker / Stream Processing Cheatsheet
Stream Processing
- a broker between producers and consumers
- the broker handles TCP/UDP connections, differences in processing speed, etc.
- In memory vs Log based (on disk)
- if order doesn't matter and there's no need to persist data, use an in-memory message broker
- if order is important, or you need the "raw" data again sometime in the future, use a log-based message broker
- why using a message broker? can't we just send events directly from producers to consumers?
- in a distributed system, producers and consumers form a many-to-many relation; sending events directly in real time would need O(n^2) connections
- Common use cases
- metrics/logging: grouping and bucketing of events over time (batch-style processing)
- grouping events based on time windows (see the sketch below)
- change data capture
- get data in sync with derived data sources (built for different purposes); see the sketch below
- event sourcing
- store events to message brokers
- so that some time in the future, we can still get all events to process them for different purposes
- key point: keep a record of all the "raw" data (sketch below)
Exactly Once Message Processing
- at least once: fault tolerant brokers
- disk persistence, replication, and consumer acknowledgement
- no more than once:
- two-phase commit (bad, want to avoid), or idempotence (good, deduplicate with a unique key; sketch below)
In Memory Message Brokers
- RabbitMQ, ActiveMQ, Amazon SQS
- Messages are stored in a queue
- broker sends messages to consumers
- consumer acknowledges back
- after acknowledgement, the message is deleted from the queue (consume/ack sketch after this list)
- no guarantee on the processing order of messages: delivery is asynchronous, and network partitions, processing failures, or varying processing times can all cause messages to be processed out of order
- ideal for situations like: maximum throughput, order doesn't matter
- e.g., users upload videos to YouTube; YouTube has to process them all, and it doesn't matter which one finishes before the others
- users posting tweets that will be sent to "news feed caches" of followers
Log Based Message Brokers
- Kafka, Amazon Kinesis
- sequential writes on disk, messages are next to one another on disk
- reads are also sequential, so processing order is guaranteed
- each consumer's current progress (its offset) is saved in the broker, so when the consumer asks for the next message, the broker knows where it is (sketch after this list)
- the sequential-order guarantee also means throughput is limited: if a message takes a long time to process, it blocks all subsequent reads
- to increase throughput, add a second log (a partition, in Kafka terms) for other consumers; ideally, each consumer has its own partition
- messages are durable, they're on disk, and not deleted
- ideal for situations where order matters
- sensor metrics, where we want to take the average of the last 20 readings
- each write from a database that we will put in a search index
Stream Joins