Streaming Systems
Notes regarding the book with the same name
01: Streaming 101
Bounded - Stream com fim Unbounded - Stream sem fim teórico
Event time - quando ocorreu o evento Process time - quando foi processado
Window types
- Sliding (hopping)
 - Fixed
 - Sessions
 - Tuple-based - numero fixo de elementos a processar
 
2: The what, where, when and how of data processing
- Triggers:
- Repeated update
 - Completeness: only materializes when it considers a windows complete
 
 - Watermarks
- Help define a sense of completeness
 - Types:
- Perfect
 - Heuristic
 
 - Allowed lateness: tempo limite para chegarem late events
 
 - Accumulation
 
3: Watermarks
4: Advanced Windowing
5: Exactly-Once and Side Effects
6: Streams and Tables
7: The practicalities of Persistent State
8: Streaming SQL
9: Streaming Joins
10: THe evolution of Large-scale Data Processing
- Mapreduce paper
 - Hadoop open sourced mapreduce ideia
 - Flume was a google owned high level api to optimize mapreduce pipelines
 - Apache Storm: First highly used streaming system with weak consistency replaced with low latency
 - Spark replaced the need of lambda architecture and replaced mapreduce giving a streaming and batch all at once and with much better performance due to RDD
 - MillWhell provides true streaming (spark does a subset) and was added to flume as a streaming API. Being replaced by MillWheel
 - Kafka is a transport layer that gives a sense of security by allowing durability and replayability of streaming
 - Dataflow from google tries to simplify and bring everything under a single API (not open source)
 - Flink adopted the beam programming model and added great accuracy at a fraction of the cost of other systems
 - Beam is a semantic layer that hopes to implement only the best ideas of streaming