the broker deals with TCP/UDP connections, processing speed mismatches, etc.
In-memory vs log-based (on disk)
if order doesn't matter and there's no need to persist data, use an in-memory message broker
if order is important and you need to get the "raw" data back at some point in the future, use a log-based message broker
why use a message broker? can't we just send events directly from producers to consumers?
in a distributed, many-to-many setup, direct delivery needs O(n^2) connections (every producer talking to every consumer), and it only works in real time, while the consumer happens to be up; a broker cuts this to one connection per node and buffers events
Common use cases
grouping and bucketing of events (batch processing)
grouping events based on time windows (see the sketch after this list)
change data capture
to keep derived data sources (search indexes, caches, etc.) in sync with the source database (for different purposes)
event sourcing
store events in the message broker
so that at some point in the future, we can still replay all events and process them for different purposes
key point: keep a record of all the "raw" data
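a minimal sketch of the time-window grouping idea, assuming events arrive as (timestamp, payload) pairs read off the broker; the 60-second window size is just an illustrative choice:

```python
from collections import defaultdict

def bucket_by_window(events, window_seconds=60):
    """Group events into fixed (tumbling) time windows.

    `events` is assumed to be an iterable of (timestamp_seconds, payload)
    tuples consumed from the broker; the shape is hypothetical.
    """
    windows = defaultdict(list)
    for ts, payload in events:
        window_start = int(ts // window_seconds) * window_seconds
        windows[window_start].append(payload)
    return dict(windows)

# usage: events consumed from the broker, bucketed for batch processing
events = [(0, "a"), (59, "b"), (61, "c"), (125, "d")]
print(bucket_by_window(events))
# {0: ['a', 'b'], 60: ['c'], 120: ['d']}
```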
Exactly Once Message Processing
at least once: fault-tolerant brokers
disk persistence, replication, and consumer acknowledgement
no more than once:
two-phase commit (bad, want to avoid), or idempotence (good: attach a unique key to each message and skip keys already seen; sketched below)
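a minimal sketch of the idempotence approach: the broker guarantees at-least-once delivery, and the consumer deduplicates by key so each message is processed no more than once; the message shape and the in-memory set of seen keys are assumptions (a real consumer would keep them in a durable store):

```python
processed_keys = set()  # in practice this would live in a durable store

def handle(message):
    """Process a message at most once, assuming each message carries a unique key."""
    key = message["key"]              # hypothetical message shape
    if key in processed_keys:
        return "duplicate, skipped"   # redelivery after a lost ack: do nothing
    # ... apply the side effect here (write to DB, send email, etc.) ...
    processed_keys.add(key)
    return "processed"

# at-least-once delivery may hand us the same message twice
print(handle({"key": "msg-1", "body": "hello"}))  # processed
print(handle({"key": "msg-1", "body": "hello"}))  # duplicate, skipped
```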
In Memory Message Brokers
RabbitMQ, ActiveMQ, Amazon SQS
Messages are stored in a queue
Broker sends messages to consumers
consumer acknowledges back
after acknowledgment, message is deleted from the queue
no guarantee on the processing order of messages: delivery is asynchronous, so network partitions, processing failures, or differing processing times can cause messages to be processed out of order
ideal for situations like: maximum throughput, order doesn't matter (see the toy broker sketch after these examples)
like users posting videos on YouTube: YouTube has to process them, and it doesn't matter which one finishes before the others
users posting tweets that will be sent to "news feed caches" of followers
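a toy in-memory broker to make the deliver/ack/delete cycle concrete, and to show how a failed delivery gets requeued out of its original order; the names and the requeue policy are illustrative, not any real broker's API:

```python
from collections import deque

class InMemoryBroker:
    """Toy queue broker: deliver, wait for ack, delete on ack, requeue otherwise."""

    def __init__(self):
        self.queue = deque()
        self.unacked = {}   # message id -> payload, awaiting acknowledgement
        self.next_id = 0

    def publish(self, payload):
        self.queue.append((self.next_id, payload))
        self.next_id += 1

    def deliver(self):
        """Hand the next message to a consumer; it stays in `unacked` until acked."""
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.unacked[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        """Consumer acknowledged: the message is deleted for good."""
        self.unacked.pop(msg_id, None)

    def nack(self, msg_id):
        """Consumer failed (or timed out): requeue, possibly out of the original order."""
        if msg_id in self.unacked:
            self.queue.append((msg_id, self.unacked.pop(msg_id)))

broker = InMemoryBroker()
broker.publish("encode video A")
broker.publish("encode video B")
msg_id, payload = broker.deliver()
broker.nack(msg_id)                   # processing failed; "A" now sits behind "B"
print([p for _, p in broker.queue])   # ['encode video B', 'encode video A']
```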
Log Based Message Brokers
Kafka, Amazon Kinesis
sequential writes on disk, messages are next to one another on disk
reads are also sequential, so ordering is guaranteed
each consumer's current progress (its offset) is saved in the broker, so when the consumer asks for the next message, the broker knows where it is (see the toy log sketch at the end of this section)
the sequential order guarantee also means that throughput is limited: if a message takes a long time to process, it blocks all subsequent reads
to increase throughput, we can add another queue (partition) for other consumers, so that, ideally, each consumer has its own queue
messages are durable, they're on disk, and not deleted
ideal for situations where order matters
sensor metrics, where we want to take the average of the last 20 readings
each write from a database that we will put into a search index
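a toy log-based broker to make the append-only log and per-consumer offsets concrete; a single partition is assumed, and the names are illustrative rather than Kafka's actual API:

```python
class LogBroker:
    """Toy single-partition log: sequential appends, per-consumer read offsets."""

    def __init__(self):
        self.log = []        # append-only; messages are never deleted
        self.offsets = {}    # consumer name -> index of the next message to read

    def append(self, payload):
        self.log.append(payload)

    def poll(self, consumer):
        """Return the next message for this consumer, in strict log order."""
        offset = self.offsets.get(consumer, 0)
        if offset >= len(self.log):
            return None                      # nothing new yet
        self.offsets[consumer] = offset + 1  # broker remembers the progress
        return self.log[offset]

broker = LogBroker()
for reading in [20.1, 19.8, 20.5]:
    broker.append(reading)

# two independent consumers replay the same durable log at their own pace
print(broker.poll("averager"))      # 20.1
print(broker.poll("averager"))      # 19.8
print(broker.poll("search-index"))  # 20.1  (starts from the beginning)
```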