Streaming Systems

Notes on the book of the same name

01: Streaming 101

- Bounded: a stream with an end
- Unbounded: a stream with no theoretical end

- Event time: when the event occurred
- Processing time: when the event was processed
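
A minimal sketch of why the two time domains give different answers. The events, the 60-second window size, and the helper names are all made up for illustration; nothing here comes from a real streaming library. The same three events are summed per fixed window, once keyed by event time and once by processing time.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed-window length

# (event_time, processing_time, value): hypothetical sample data.
events = [
    (10, 12, 5),
    (70, 71, 3),
    (50, 95, 4),  # occurred at t=50 but was only processed at t=95
]


def window_start(timestamp, size=WINDOW_SIZE):
    """Start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def sum_per_window(events, time_index):
    """Sum values per fixed window, keyed by the chosen time domain.

    time_index=0 groups by event time, time_index=1 by processing time.
    """
    totals = defaultdict(int)
    for event in events:
        totals[window_start(event[time_index])] += event[2]
    return dict(totals)


by_event_time = sum_per_window(events, time_index=0)
by_processing_time = sum_per_window(events, time_index=1)
```

The delayed event lands in window 0 when grouped by event time, but in window 60 when grouped by processing time, so the per-window sums differ.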

Triggers:
- Repeated update
- Completeness: only materializes output when it considers a window complete
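
A rough sketch of the two trigger types for a single window, in plain Python. The stream format, the "every 2 elements" rule, and the `run` helper are assumptions made for illustration, not an API from any real system.

```python
WINDOW_END = 60  # hypothetical end of the single window [0, 60)

# Hypothetical input: events carry a value, watermarks carry a timestamp.
stream = [
    ("event", 5),
    ("event", 3),
    ("watermark", 30),
    ("event", 4),
    ("event", 1),
    ("watermark", 60),
]


def run(stream, trigger):
    """Return the panes (running sums) emitted for the window.

    trigger="repeated": emit after every 2nd element (repeated update).
    trigger="completeness": emit once the watermark reaches the window end.
    """
    panes, total, count = [], 0, 0
    for kind, payload in stream:
        if kind == "event":
            total += payload
            count += 1
            if trigger == "repeated" and count % 2 == 0:
                panes.append(total)
        elif kind == "watermark" and trigger == "completeness":
            if payload >= WINDOW_END:
                panes.append(total)
    return panes


repeated_panes = run(stream, "repeated")
completeness_panes = run(stream, "completeness")
```

The repeated update trigger gives early, partial results; the completeness trigger gives a single result, and only once the window is believed complete.
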
Watermarks
- Help define a sense of completeness
- Types:
  - Perfect
  - Heuristic
- Allowed lateness: the time limit within which late events may still arrive and be accepted
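
A small sketch of a heuristic watermark combined with allowed lateness. The heuristic used here (max event time seen minus a fixed expected delay) and all constants are assumptions chosen for illustration; a perfect watermark would need full knowledge of the input, which this toy version does not have.

```python
WINDOW_SIZE = 60         # seconds; hypothetical fixed-window length
MAX_EXPECTED_DELAY = 10  # heuristic guess at how out-of-order the data is
ALLOWED_LATENESS = 30    # how long a window stays open for late events


def classify(event_times):
    """Label each event (in arrival order) as 'on time', 'late' or 'dropped'."""
    watermark = float("-inf")
    labels = []
    for event_time in event_times:
        window_end = event_time - (event_time % WINDOW_SIZE) + WINDOW_SIZE
        if watermark < window_end:
            labels.append("on time")   # window not yet considered complete
        elif watermark < window_end + ALLOWED_LATENESS:
            labels.append("late")      # complete, but still accepting data
        else:
            labels.append("dropped")   # past the allowed lateness
        # Heuristic watermark: advances with the data, never moves backwards.
        watermark = max(watermark, event_time - MAX_EXPECTED_DELAY)
    return labels


labels = classify([10, 55, 75, 50, 130, 40])
```

Here the event at t=50 arrives after the watermark has passed the end of its window but within the allowed lateness, so it is late yet accepted; the event at t=40 arrives after the limit and is dropped.
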
Accumulation
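
A sketch of the three accumulation modes the book describes (discarding, accumulating, accumulating and retracting), applied to the panes of one window. The pane contents and function names are hypothetical.

```python
def discarding(pane_inputs):
    """Each pane reports only the values that arrived since the last pane."""
    return [sum(pane) for pane in pane_inputs]


def accumulating(pane_inputs):
    """Each pane reports the running total for the whole window so far."""
    results, total = [], 0
    for pane in pane_inputs:
        total += sum(pane)
        results.append(total)
    return results


def accumulating_and_retracting(pane_inputs):
    """Each pane reports the new total plus a retraction of the previous one."""
    results, previous = [], None
    for total in accumulating(pane_inputs):
        retraction = None if previous is None else -previous
        results.append((total, retraction))
        previous = total
    return results


# Hypothetical inputs: the values that arrived before each trigger firing.
panes = [[5, 3], [4], [1, 2]]

discarded = discarding(panes)
accumulated = accumulating(panes)
retracted = accumulating_and_retracting(panes)
```

Summing the discarding panes gives the final total; with accumulating panes only the last pane should be used; retractions let a downstream consumer undo the previous pane before applying the new one.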

MapReduce paper (Google, 2004)
Hadoop open-sourced the MapReduce idea
Flume was a Google-internal high-level API for optimizing MapReduce pipelines
Apache Storm: the first widely used streaming system; it accepted weak consistency in exchange for low latency
Spark removed much of the need for the Lambda Architecture and replaced MapReduce, offering streaming and batch in one system with much better performance thanks to RDDs
MillWheel provides true streaming (Spark covers only a subset) and was added to Flume as its streaming API. MillWheel itself is being replaced by its successor, Windmill
Kafka is a transport layer that provides safety by making streams durable and replayable
Dataflow, from Google, tries to simplify things by bringing everything (batch and streaming) under a single API (not open source)
Flink adopted the Dataflow/Beam programming model and delivered high accuracy at a fraction of the cost of other systems
Beam is a semantic layer that aims to implement only the best ideas in streaming