Process Streaming Data with Apache Flink

Apache Flink

A framework and distributed processing engine 

Flink enables stateful computations over both unbounded and bounded data streams.


Stateful Computations on Data Streams

Guaranteed correctness

  • Exactly-once state consistency
  • Event-time processing
  • Sophisticated late data handling

Scalability

  • Scale-out architecture
  • Support for very large state
  • Incremental Checkpoints

Performance

  • Low latency
  • High throughput
  • In-Memory computing

Layered APIs

  • SQL on Stream & Batch Data
  • DataStream API & DataSet API
  • ProcessFunction (Time & State)

Operational focus

  • Flexible deployment
  • High-availability setup
  • Savepoints

Maximize the benefits with Flink

  • Process Unbounded and Bounded Data

  • Deploy Applications Anywhere

  • Run Applications at any Scale

  • Leverage In-Memory Performance

Building Blocks of Apache Flink

And how Flink handles each of them

Streams

  • Bounded and unbounded streams: Flink has sophisticated features to process unbounded streams, and also dedicated operators to efficiently process bounded streams.
  • Real-time and recorded streams: Flink applications can process recorded streams, such as data persisted in a file system or object store, as well as streams that are generated in real time.

State

  • Multiple State Primitives: Flink has state primitives for data structures, e.g. atomic values, lists, or maps. 
  • Pluggable State Backends: Flink stores state in memory or in an embedded on-disk data store. Custom state backends can be plugged in too.
  • Exactly-once state consistency: Flink guarantees the consistency of application state in case of a failure.
  • Very Large State: Flink can maintain application state of several terabytes in size.
  • Scalable Applications: Flink supports scaling of stateful applications.
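
The interplay of keyed state, checkpoints, and exactly-once consistency can be illustrated with a small sketch. This is plain Python, not the Flink API; the class `CountingJob` and its methods are hypothetical names chosen for illustration. The idea it shows is that state and the source position are snapshotted together, so that after a failure the job can roll back to the snapshot and replay the source from the recorded offset without counting any event twice.

```python
import copy


class CountingJob:
    """Toy stateful job (hypothetical, not Flink): counts events per key."""

    def __init__(self):
        self.counts = {}   # keyed state: key -> running count
        self.offset = 0    # read position in the replayable source

    def process(self, source, upto):
        # Consume events from the current offset up to (not including) `upto`.
        while self.offset < upto:
            key = source[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        # Snapshot state and offset together so they stay consistent.
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    def restore(self, snapshot):
        self.counts = copy.deepcopy(snapshot["counts"])
        self.offset = snapshot["offset"]


source = ["a", "b", "a", "c", "a", "b"]

job = CountingJob()
job.process(source, upto=3)
snapshot = job.checkpoint()      # consistent point: three events counted
job.process(source, upto=5)      # progress that is lost in the simulated crash

recovered = CountingJob()
recovered.restore(snapshot)      # roll back to the last checkpoint
recovered.process(source, upto=len(source))  # replay from the saved offset

result = recovered.counts
```

Because the offset is part of the snapshot, the replay starts exactly where the restored counts left off, which is why each event affects the state only once.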

Time

  • Event-time Mode: Event-time processing of Flink allows for accurate and consistent results.
  • Watermark Support: Flink employs watermarks to reason about time in event-time applications.
  • Late Data Handling: Flink features multiple options to handle late events, such as rerouting them to a separate output or updating previously emitted results.
  • Processing-time Mode: Flink also supports processing-time semantics, which perform computations triggered by the wall-clock time of the processing machine.
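
To make the roles of event time, watermarks, and late data more concrete, here is a minimal sketch in plain Python (not the Flink API). It assumes tumbling windows of 10 time units and a watermark that trails the highest timestamp seen by 3 units; both constants and the function name `run` are illustrative assumptions. A window fires once the watermark passes its end, and any event that arrives for an already fired window is rerouted to a separate list rather than dropped.

```python
from collections import defaultdict

WINDOW_SIZE = 10      # tumbling window length in event-time units (assumed)
MAX_OUT_OF_ORDER = 3  # assumed bound on how far events may arrive out of order


def run(events):
    """Sum values per event-time window; events are (event_time, value) pairs
    given in arrival order, which may differ from event-time order."""
    open_windows = defaultdict(int)  # window start -> running sum
    results = {}                     # window start -> final sum
    late = []                        # events rerouted because their window fired
    watermark = float("-inf")

    for ts, value in events:
        start = ts - ts % WINDOW_SIZE
        if start + WINDOW_SIZE <= watermark:
            late.append((ts, value))          # window already closed: reroute
        else:
            open_windows[start] += value
        # The watermark never moves backwards.
        watermark = max(watermark, ts - MAX_OUT_OF_ORDER)
        # Fire every window whose end the watermark has passed.
        for w in sorted(open_windows):
            if w + WINDOW_SIZE <= watermark:
                results[w] = open_windows.pop(w)

    # End of a bounded input: flush the windows that are still open.
    for w in sorted(open_windows):
        results[w] = open_windows.pop(w)
    return results, late


arrivals = [(1, 1), (4, 1), (12, 1), (8, 1), (14, 1), (3, 1), (21, 1)]
results, late = run(arrivals)
```

In this run the event with timestamp 8 arrives out of order but still lands in its window, because the watermark has not yet passed 10. The event with timestamp 3 arrives after the first window has fired, so it ends up in `late`.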

When do you need Flink?


Event Driven Applications

An event-driven application is a stateful application that ingests events from one or more event streams and reacts to incoming events by triggering computations, state updates, or external actions.

  • Fraud detection
  • Anomaly detection
  • Rule-based alerting
  • Business process monitoring
  • Web application (social network)
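
As a rough illustration of the event-driven pattern, the following plain-Python sketch (not the Flink API) keeps a small piece of state per account and raises an alert when a rule matches. The rule itself, a very small transaction immediately followed by a large one, and the thresholds `SMALL` and `LARGE` are hypothetical and chosen only for the example.

```python
SMALL = 1.00    # hypothetical threshold for a "test" transaction
LARGE = 500.00  # hypothetical threshold for a suspicious follow-up


def detect(transactions):
    """Return the accounts that trigger the rule, in the order of the alerts.

    transactions: iterable of (account, amount) pairs in arrival order.
    """
    last_was_small = {}  # keyed state: account -> was the previous amount small?
    alerts = []
    for account, amount in transactions:
        # React to the incoming event using the state kept for this key.
        if last_was_small.get(account, False) and amount > LARGE:
            alerts.append(account)   # stands in for an external action
        # Update the state for the next event of the same account.
        last_was_small[account] = amount < SMALL
    return alerts


transactions = [
    ("acct-1", 0.50),
    ("acct-2", 700.00),
    ("acct-1", 900.00),   # small then large on acct-1: alert
    ("acct-2", 0.20),
    ("acct-2", 20.00),    # resets the flag for acct-2
    ("acct-2", 600.00),
]
alerts = detect(transactions)
```

The state is local to each key, so events of different accounts do not interfere with each other. This is the property that lets such applications be partitioned and scaled out.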

Stream & Batch Analytics

Analytical jobs extract information and insight from raw data. Apache Flink supports traditional batch queries on bounded data sets and real-time, continuous queries from unbounded, live data streams.

  • Quality monitoring of Telco networks
  • Analysis of product updates & experiment evaluation in mobile applications
  • Ad-hoc analysis of live data in consumer technology
  • Large-scale graph analysis

Data Pipelines & ETL

Extract-transform-load (ETL) is a common approach to convert and move data between storage systems.

  • Real-time search index building in e-commerce
  • Continuous ETL in e-commerce

Setting up Flink to Work with Kafka

Combining Apache Flink with Apache Kafka makes it possible to build reliable, scalable, low-latency real-time data processing pipelines with fault tolerance and exactly-once processing guarantees. The combination is a strong option for businesses that want to evaluate streaming data and gain insights from it immediately.


Unlike alternative systems such as ActiveMQ, RabbitMQ, etc., Kafka offers the capability to durably store data streams indefinitely, enabling consumers to read streams in parallel and replay them as necessary. This aligns with Flink’s distributed processing model and fulfills a crucial requirement for Flink’s fault tolerance mechanism.

Kafka can be used by Flink applications as a source as well as a sink by utilizing tools and services available in the Kafka ecosystem. Flink offers native support for commonly used formats like Avro, JSON, and Protobuf, similar to Kafka’s support for these formats.

Events are added to a Flink table in the same way that they are appended to a Kafka topic. A topic in a Kafka cluster is mapped to a table in Flink. In Flink, each table is equivalent to a stream of events that describe the changes being made to that table. When a query refers to the table, the query result is updated automatically as new events arrive, and the result is either materialized or emitted as a stream of changes.
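
The relationship between an append-only topic, a table, and a stream of changes can be sketched in a few lines of plain Python (not the Flink API). The function below is a hypothetical stand-in for a continuous query that counts events per key. For each appended event it updates a materialized result table and also emits changelog rows, using the row-kind markers `+I` (insert), `-U` (retract the old value), and `+U` (the updated value) that Flink uses when it prints changelog streams.

```python
def continuous_count(topic):
    """Maintain a count per key over an append-only sequence of events.

    Returns the materialized result table and the changelog that describes
    every modification made to it.
    """
    table = {}      # materialized result: key -> count
    changelog = []  # emitted changes: (row kind, key, count)
    for key in topic:
        if key in table:
            changelog.append(("-U", key, table[key]))  # retract the old row
            table[key] += 1
            changelog.append(("+U", key, table[key]))  # emit the updated row
        else:
            table[key] = 1
            changelog.append(("+I", key, 1))           # first row for this key
    return table, changelog


topic = ["click", "view", "click"]   # events in the order they were appended
table, changelog = continuous_count(topic)
```

Replaying `changelog` from the start reproduces `table`, which is the sense in which a table and its stream of changes carry the same information.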

Success Stories