Message Broker / Stream Processing Cheatsheet

Stream Processing

  • a broker sits between producers and consumers
    • the broker handles TCP/UDP connections, differences in processing speed, etc.
  • In memory vs Log based (on disk)
    • if order doesn't matter and there's no need to persist data, use an in-memory message broker
    • if order is important and you need the "raw" data again sometime in the future, use a log-based message broker
  • why use a message broker? can't we just send events directly from producers to consumers?
    • in a distributed, many-to-many setup, direct delivery would require O(n^2) connections, and every consumer would have to be available in real time
  • Common use cases
    • grouping and bucketing of events (batch processing)
      • grouping events based on time windows (see the sketch after this list)
    • change data capture
      • to keep a derived data source (e.g. a search index or cache) in sync with the primary database
    • event sourcing
      • store events in a message broker
      • so that at any point in the future, we can still replay all events and process them for different purposes
      • key point: keep a record of all the "raw" data
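
A minimal sketch of the time-window grouping idea, assuming events arrive as (unix timestamp, payload) pairs and using a hypothetical 60-second tumbling window:

    from collections import defaultdict

    WINDOW_SECONDS = 60  # assumed window size, for illustration

    def window_key(timestamp: float) -> int:
        """Bucket a unix timestamp into the start of its tumbling window."""
        return int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS

    def group_by_window(events):
        """events: iterable of (timestamp, payload) pairs."""
        buckets = defaultdict(list)
        for ts, payload in events:
            buckets[window_key(ts)].append(payload)
        return buckets

    events = [(0.5, "a"), (59.9, "b"), (60.1, "c")]
    print(dict(group_by_window(events)))  # {0: ['a', 'b'], 60: ['c']}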

Exactly Once Message Processing

  • at least once: fault-tolerant brokers
    • disk persistence, replication, and consumer acknowledgement
  • no more than once:
    • two-phase commit (heavyweight, best avoided) or idempotence (preferred, deduplicate on a key)
  • combining at-least-once delivery with idempotent processing gives effectively exactly-once results (sketched below)
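
A minimal sketch of the idempotence approach: the broker redelivers on lost acknowledgements (at least once), and deduplicating on a unique message key gives no-more-than-once processing. The message ids and the in-memory `seen` set are illustrative; a real system would keep the dedup record in a durable store, ideally updated atomically with the side effect:

    seen = set()  # stand-in for a durable dedup store

    def process(payload: str) -> None:
        print("processed:", payload)  # the actual side effect

    def handle(message_id: str, payload: str) -> None:
        if message_id in seen:    # duplicate redelivery: safe to skip
            return
        process(payload)
        seen.add(message_id)      # record only after success

    handle("msg-1", "hello")
    handle("msg-1", "hello")      # redelivered after a lost ack; processed once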

In Memory Message Brokers

  • RabbitMQ, ActiveMQ, Amazon SQS
  • messages are stored in a queue
  • broker sends messages to consumers
  • consumer acknowledges back
  • after acknowledgement, the message is deleted from the queue (see the sketch after this list)
  • no guarantee on message processing order: messages are sent asynchronously, and network partitions, processing failures, or varying processing times can all reorder processing
  • ideal when maximum throughput matters and order doesn't
    • e.g. users post videos on YouTube; YouTube has to process them all, and it doesn't matter which one finishes first
    • users posting tweets that are fanned out to the "news feed caches" of their followers
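
A minimal sketch of the consume/acknowledge/delete cycle using the pika client, assuming a local RabbitMQ and a hypothetical queue named video_jobs:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="video_jobs")

    def on_message(ch, method, properties, body):
        print("processing", body)
        # the ack tells the broker this message can be deleted from the queue;
        # if the consumer crashes before acking, the broker redelivers it
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="video_jobs", on_message_callback=on_message)
    channel.start_consuming()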

Log Based Message Brokers

  • Kafka, Amazon Kinesis
  • sequential writes on disk: messages sit next to one another on disk
  • reads are also sequential, so read order is guaranteed
  • each consumer's current offset is saved in the broker, so when the consumer asks for the next message, the broker knows where it is
  • the sequential-order guarantee also limits throughput: if one message takes a long time to process, it blocks all subsequent reads
  • to increase throughput, we can add a second queue (partition) for other consumers; ideally, each consumer has its own queue
  • messages are durable: they live on disk and are not deleted after consumption
  • ideal for situations where order matters (see the sketch after this list)
    • sensor metrics, where we want the average of the last 20 readings
    • database writes that we replay, in order, into a search index
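
A minimal sketch of the sensor-metrics case with the kafka-python client; the topic, group id, and numeric message encoding are assumptions. Offsets are committed back to the broker only after processing, which is how the broker remembers each consumer's progress:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "sensor-metrics",               # hypothetical topic
        bootstrap_servers="localhost:9092",
        group_id="averager",            # hypothetical consumer group
        enable_auto_commit=False,       # commit progress manually
    )

    window = []
    for record in consumer:             # offset order within a partition
        window.append(float(record.value))  # assumes ASCII-encoded numbers
        window = window[-20:]           # keep the last 20 readings
        print("avg:", sum(window) / len(window))
        consumer.commit()               # save this consumer's offset in the broker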

Stream Joins

  • stream-stream joins
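
A minimal sketch of a windowed stream-stream join (e.g. matching ad impressions with clicks on the same ad id); the event shapes and the 60-second window are assumptions, and real systems like Kafka Streams or Flink also expire buffered state:

    from collections import defaultdict

    WINDOW = 60.0                       # join events at most 60s apart

    left_buf = defaultdict(list)        # key -> [(ts, payload)]
    right_buf = defaultdict(list)

    def emit(key, left, right):
        print("joined:", key, left, right)

    def on_left(key, ts, payload):
        left_buf[key].append((ts, payload))
        for rts, rp in right_buf[key]:
            if abs(ts - rts) <= WINDOW:
                emit(key, payload, rp)

    def on_right(key, ts, payload):
        right_buf[key].append((ts, payload))
        for lts, lp in left_buf[key]:
            if abs(ts - lts) <= WINDOW:
                emit(key, lp, payload)

    on_left("ad-42", 10.0, "impression")
    on_right("ad-42", 15.0, "click")    # -> joined: ad-42 impression click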