Message Broker / Stream Processing Cheatsheet
Stream Processing
- a broker between producers and consumers
- the broker handles TCP/UDP connections, differences in processing speed, etc.
- In memory vs Log based (on disk)
- if order doesn't matter and there's no need to persist data, use an in-memory message broker
- if order is important, or you need the "raw" data again sometime in the future, use a log-based message broker
- why using a message broker? can't we just send events directly from producers to consumers?
- in a distributed system, producers and consumers form a many-to-many relation; sending events directly in real time would need O(n^2) connections
- Common use cases
- metrics/logging: grouping and bucketing of events over time (batch-style processing)
- grouping events based on time windows (see the sketch below)
- change data capture
- get data in sync with derived data sources (built for different purposes); see the sketch below
- event sourcing
- store events to message brokers
- so that some time in the future, we can still get all events to process them for different purposes
- key point: keep a record of all the "raw" data (sketch below)
Exactly Once Message Processing
- at least once: fault tolerant brokers
- disk persistence, replication, and consumer acknowledgement
- no more than once:
- two-phase commit (bad, want to avoid), or idempotence (good, deduplicate with a unique key; sketch below)
In Memory Message Brokers
- RabbitMQ, ActiveMQ, Amazon SQS
- Messages are stored in a queue
- broker sends messages to consumers
- consumer acknowledges back
- after acknowledgement, the message is deleted from the queue (consume/ack sketch after this list)
- no guarantee on the processing order of messages: delivery is asynchronous, and network partitions, processing failures, or varying processing times can all cause messages to be processed out of order
- ideal for situations like: maximum throughput, order doesn't matter
- e.g., users upload videos to YouTube; YouTube has to process them all, and it doesn't matter which one finishes before the others
- users posting tweets that will be sent to "news feed caches" of followers
Log Based Message Brokers
- Kafka, Amazon Kinesis
- sequential writes on disk, messages are next to one another on disk
- reads are also sequential, so processing order is guaranteed
- each consumer's current progress (its offset) is saved in the broker, so when the consumer asks for the next message, the broker knows where it is (sketch after this list)
- the sequential-order guarantee also means throughput is limited: if a message takes a long time to process, it blocks all subsequent reads
- to increase throughput, add a second log (a partition, in Kafka terms) for other consumers; ideally, each consumer has its own partition
- messages are durable, they're on disk, and not deleted
- ideal for situations where order matters
- sensor metrics, where we want to take the average of the last 20 readings
- each write from a database that we will put in a search index
Stream Joins