Process Streaming Data with Apache Flink

Apache Flink

A framework and distributed processing engine for stateful computations over both unbounded and bounded data streams.



Stateful Computations on Data Streams

Guaranteed correctness

  • Exactly-once state consistency
  • Event-time processing
  • Sophisticated late data handling

Scalability

  • Scale-out architecture
  • Support for very large state
  • Incremental checkpoints

Performance

  • Low latency
  • High throughput
  • In-memory computing

Layered APIs

  • SQL on Stream & Batch Data
  • DataStream API & DataSet API
  • ProcessFunction (Time & State); a sketch of these layers follows below
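
To make the layers concrete, here is a minimal sketch, assuming a Flink 1.x Java project with both the DataStream API and the Table/SQL bridge (flink-table-api-java-bridge) on the classpath; the element values and class name are illustrative, not from the original text.

```java
// Hypothetical sketch: the same stream handled at two API layers.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class LayeredApisSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // DataStream API layer: explicit, programmatic transformations on a stream.
        DataStream<String> words = env.fromElements("flink", "kafka", "flink");

        // SQL layer: register the stream as a table and query it declaratively.
        // An atomic String stream is exposed as a single column named f0.
        tableEnv.createTemporaryView("Words", words);
        Table counts = tableEnv.sqlQuery(
            "SELECT f0 AS word, COUNT(*) AS cnt FROM Words GROUP BY f0");

        // The grouped result is a changelog; print its inserts and updates.
        tableEnv.toChangelogStream(counts).print();
        env.execute("layered-apis-sketch");
    }
}
```

The same counting logic could also be pushed down to a ProcessFunction when explicit access to timers and state is needed.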

Operational focus

  • Flexible deployment
  • High-availability setup
  • Savepoints

In short, Apache Flink lets you:

  • Process Unbounded and Bounded Data
  • Deploy Applications Anywhere
  • Run Applications at any Scale
  • Leverage In-Memory Performance

Building Blocks of Apache Flink

And how Flink handles each of them

Streams

  • Bounded and unbounded streams: Flink has sophisticated features for processing unbounded streams, as well as dedicated operators for efficiently processing bounded streams.
  • Real-time and recorded streams: Flink applications can process streams recorded in a file system or object store as well as streams generated in real time (see the sketch below).
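
As a sketch of the two kinds of streams, the hypothetical job below reads a bounded, recorded stream from a file and an unbounded, real-time stream from a socket. The path, host, and port are assumptions, and the FileSource requires the flink-connector-files dependency (Flink 1.15+).

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedAndUnboundedSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded, recorded stream: a file is read once from start to end.
        FileSource<String> recorded = FileSource
            .forRecordFormat(new TextLineInputFormat(), new Path("/tmp/events.log"))
            .build();
        env.fromSource(recorded, WatermarkStrategy.noWatermarks(), "recorded-events")
           .print();

        // Unbounded, real-time stream: the socket keeps producing lines until the job stops.
        env.socketTextStream("localhost", 9999)
           .print();

        env.execute("bounded-and-unbounded-sketch");
    }
}
```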

State

  • Multiple State Primitives: Flink provides state primitives for different data structures, e.g. atomic values, lists, or maps (a small example follows this list).
  • Pluggable State Backends: Flink stores state in memory or in an embedded on-disk data store. Custom state backends can be plugged in too.
  • Exactly-once state consistency: Flink guarantees the consistency of application state in case of a failure.
  • Very Large State: Flink can maintain application state of several terabytes.
  • Scalable Applications: Flink supports scaling of stateful applications.
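
A minimal sketch of keyed state, assuming a Flink 1.x Java project: a ValueState flag de-duplicates events per key. The class and state names are made up for illustration; the state itself lives in whichever backend is configured (JVM heap or embedded RocksDB) and is covered by Flink's exactly-once checkpoints.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class Deduplicate extends KeyedProcessFunction<String, String, String> {

    // Keyed state: one Boolean per key, stored in the configured state backend.
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
            new ValueStateDescriptor<>("seen", Types.BOOLEAN));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(value);   // forward only the first occurrence per key
        }
    }
}
```

It would be applied as `stream.keyBy(v -> v).process(new Deduplicate())`.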

Time

  • Event-time Mode: Flink's event-time processing derives results from the timestamps embedded in the events, which allows for accurate and consistent results even when events arrive out of order.
  • Watermark Support: Flink employs watermarks to reason about time in event-time applications.
  • Late Data Handling: Flink offers multiple options for handling late events, such as rerouting them via side outputs or updating previously emitted results (see the sketch after this list).
  • Processing-time Mode: Flink also supports processing-time semantics which performs computations as triggered by the wall-clock time of the processing machine.
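
The sketch below ties these pieces together, assuming a Flink 1.x DataStream API and a hypothetical Event type carrying its own key and timestamp: watermarks bound the expected out-of-orderness, slightly late events update already-emitted windows, and anything later is rerouted to a side output.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class EventTimeSketch {

    /** Hypothetical event carrying its own key and timestamp. */
    public static class Event {
        public String key;
        public long timestampMillis;
    }

    // Events arriving after the allowed lateness are rerouted here instead of being dropped.
    public static final OutputTag<Event> LATE_EVENTS = new OutputTag<Event>("late-events") {};

    public static SingleOutputStreamOperator<Long> countsPerMinute(DataStream<Event> events) {
        return events
            // Event-time mode: watermarks tolerate up to 5 seconds of out-of-order data.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, recordTs) -> event.timestampMillis))
            .keyBy(event -> event.key)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .allowedLateness(Time.seconds(30))   // slightly late events update emitted results
            .sideOutputLateData(LATE_EVENTS)     // anything later goes to the side output
            .process(new ProcessWindowFunction<Event, Long, String, TimeWindow>() {
                @Override
                public void process(String key, Context ctx, Iterable<Event> input, Collector<Long> out) {
                    long count = 0;
                    for (Event ignored : input) {
                        count++;
                    }
                    out.collect(count);   // one count per key and one-minute window
                }
            });
    }
}
```

Rerouted late records could then be retrieved with `countsPerMinute(events).getSideOutput(EventTimeSketch.LATE_EVENTS)`.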

When do you need Flink?


Event-Driven Applications

An event-driven application is a stateful application that ingests events from one or more event streams and reacts to incoming events by triggering computations, state updates, or external actions. Typical use cases include the following; a rule-based alerting sketch follows the list.

  • Fraud detection
  • Anomaly detection
  • Rule-based alerting
  • Business process monitoring
  • Web application (social network)
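
As a sketch of the pattern, the hypothetical function below implements a simple rule-based alert: it keeps a per-key count in state, uses a processing-time timer to reset it every minute, and emits an alert string once a threshold is exceeded. The threshold, timings, and names are assumptions for illustration.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ThresholdAlert extends KeyedProcessFunction<String, String, String> {

    private static final int THRESHOLD = 3;   // illustrative rule: max events per key per minute

    private transient ValueState<Integer> countInWindow;

    @Override
    public void open(Configuration parameters) {
        countInWindow = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count-in-window", Types.INT));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        Integer count = countInWindow.value();
        if (count == null) {
            // First event for this key in the current window: schedule a reset in one minute.
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 60_000);
            count = 0;
        }
        count += 1;
        countInWindow.update(count);

        // External action: emit an alert downstream once the rule is violated.
        if (count > THRESHOLD) {
            out.collect("ALERT: key " + ctx.getCurrentKey()
                + " exceeded " + THRESHOLD + " events per minute");
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // Window elapsed: clear the count so the rule starts fresh.
        countInWindow.clear();
    }
}
```

It would be applied to a keyed stream, e.g. `events.keyBy(e -> e).process(new ThresholdAlert())`.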


Stream & Batch Analytics

Analytical jobs extract information and insights from raw data. Apache Flink supports both traditional batch queries on bounded data sets and real-time, continuous queries on unbounded, live data streams; a sketch showing the same pipeline in batch and streaming mode follows the examples below.

  • Quality monitoring of Telco networks
  • Analysis of product updates & experiment evaluation in mobile applications
  • Ad-hoc analysis of live data in consumer technology
  • Large-scale graph analysis
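
A minimal sketch of this duality, assuming Flink 1.12 or later where the DataStream API has an explicit runtime mode: the same pipeline can run as a batch job over a bounded, recorded data set or as a continuous streaming job, simply by switching the mode. The input values are illustrative.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchOrStreamingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // BATCH emits one final result per key once the bounded input is exhausted;
        // STREAMING emits incremental updates as records arrive.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(1L, 2L, 3L, 4L, 5L, 6L)
           .keyBy(n -> n % 2)        // group into even and odd values
           .reduce(Long::sum)        // sum per group (running in STREAMING, final in BATCH)
           .print();

        env.execute("batch-or-streaming-sketch");
    }
}
```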


Data Pipelines & ETL

Extract-transform-load (ETL) is a common approach to converting and moving data between storage systems. In Flink, such pipelines can run as continuous streaming jobs instead of being triggered periodically; the Kafka example in the next section shows a continuous ETL pipeline.

  • Real-time search index building in e-commerce
  • Continuous ETL in e-commerce

Setting up Flink to Work with Kafka
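
A minimal setup sketch, assuming the flink-connector-kafka dependency is on the classpath; the broker address, topic names, and group id below are placeholders. It reads records from one topic, applies a small continuous ETL step, and writes the result to another topic, with checkpointing enabled to back the delivery guarantees.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000);   // checkpoints back the delivery guarantees

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")       // placeholder broker
            .setTopics("input-events")                    // placeholder topic
            .setGroupId("flink-etl")                      // placeholder group id
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("cleaned-events")               // placeholder output topic
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
            .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
            .build();

        // Continuous ETL: read, transform, and write back to Kafka.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-input")
           .map(line -> line.trim())
           .filter(line -> !line.isEmpty())
           .sinkTo(sink);

        env.execute("kafka-etl-sketch");
    }
}
```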

Case studies
