zhangsquared: Introduction to Kafka

What is Kafka

Kafka is an open-source, distributed messaging system.

Features:

Fast: it can handle hundreds of megabytes of data per second, sent by thousands of clients.
Scales out quickly.
All data is persisted to disk.

Design Strategy

Defensive Design

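Producer: the side that sends messages, using a push model. It has built-in retry logic. Retries on their own give at-least-once delivery; exactly-once delivery additionally requires the idempotent producer or transactions.
Consumer: actively fetches messages from Kafka, using a pull model. The consumer keeps its own state (the offset). Consumers need to send heartbeats to the Group Coordinator to be considered alive.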
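To illustrate why retry logic by itself does not amount to exactly-once delivery, here is a minimal sketch with a toy in-memory broker. The names `FlakyBroker` and `send_with_retry`, and the per-message sequence number, are hypothetical and only for illustration; they are not Kafka's client API.

```python
class FlakyBroker:
    """Toy broker: every write succeeds, but the ack may get lost.

    ack_outcomes is a scripted list of booleans (True = ack delivered),
    so the behaviour is deterministic. Hypothetical class, not Kafka's API.
    """

    def __init__(self, ack_outcomes, dedupe=False):
        self.log = []
        self._acks = iter(ack_outcomes)
        self._dedupe = dedupe
        self._seen = set()  # sequence numbers already written

    def append(self, seq, message):
        # With dedupe on, a resend of an already-written sequence is skipped.
        if not (self._dedupe and seq in self._seen):
            self.log.append(message)
            self._seen.add(seq)
        return next(self._acks, True)


def send_with_retry(broker, seq, message, max_retries=5):
    """Resend until an ack arrives; return the number of attempts made."""
    for attempt in range(1, max_retries + 2):
        if broker.append(seq, message):
            return attempt
    raise RuntimeError("no ack received, giving up")


plain = FlakyBroker([False, False, True])
send_with_retry(plain, seq=0, message="m1")   # message is written three times

deduped = FlakyBroker([False, False, True], dedupe=True)
send_with_retry(deduped, seq=0, message="m1")  # message is written once
```

Without deduplication the lost acks turn one logical send into several stored copies (at-least-once). Tracking a sequence number on the broker side is one way to collapse those copies back into a single write.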

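Push model: the sender actively pushes messages out. Throughput is relatively high, but the server logic is more complex.
Pull model: the receiver actively fetches messages. The server logic is simpler, and messages can be replayed.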
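The pull model can be sketched in a few lines. In this toy version the "server" is just a list and the consumer owns its position, which is what makes replay cheap. `PullConsumer`, `poll`, and `seek` are hypothetical names chosen for this example, not the real client classes.

```python
class PullConsumer:
    """Toy pull-model consumer that tracks its own offset."""

    def __init__(self, partition, start_offset=0):
        self.partition = partition  # a plain list standing in for a partition
        self.offset = start_offset  # state lives on the consumer side

    def poll(self, max_records=2):
        """Fetch up to max_records messages starting at the current offset."""
        batch = self.partition[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Move the position; seeking backwards replays old messages."""
        self.offset = offset


partition = ["a", "b", "c"]
consumer = PullConsumer(partition)
first = consumer.poll()      # reads from offset 0
second = consumer.poll()     # continues where the last poll stopped
consumer.seek(0)             # rewind
replayed = consumer.poll(3)  # the same messages are read again
```

Because the server never has to remember what each consumer has seen, its logic stays simple, at the cost of consumers having to ask for data.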

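Kafka categorizes messages by topic. Physically, a topic is made up of partitions. A partition can be thought of simply as an array, and each partition belongs to exactly one topic.
Topics in Kafka are always multi-subscriber: a topic can have a single consumer or many.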

An offset is like an array index: a consumer uses the offset to locate a message within a partition.

All Kafka messages are stored in log files.
Each message has its own offset.
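The topic / partition / offset relationship can be modeled with nested lists. This is a minimal sketch: the `Topic` class is hypothetical, and choosing the partition by hashing a key with CRC32 is an assumption made for the example (the notes above do not say how a partition is picked).

```python
import zlib


class Topic:
    """Toy topic: a fixed number of append-only partitions (lists)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append value and return (partition, offset) of the new message."""
        # Assumption for this sketch: same key -> same partition via CRC32.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # offset is the array index

    def read(self, partition, offset):
        """Locate a message by partition and offset."""
        return self.partitions[partition][offset]


topic = Topic("orders", num_partitions=3)
p1, o1 = topic.append("user-1", "created")
p2, o2 = topic.append("user-1", "paid")
# Both messages share a key, so they land in the same partition
# with consecutive offsets.
```

Offsets are only meaningful within one partition; two partitions of the same topic each start counting from 0.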

Log File Retention

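Kafka has its own configurable retention mechanism:
Time-based: for example, messages older than 7 days are deleted.
Size-based: for example, once a log grows beyond 1 GB, the oldest data is deleted.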
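The two retention policies can be sketched as a single function over an in-memory log. This is a simplification: it drops individual messages, whereas a real broker removes whole log segments. The function name and parameters are hypothetical.

```python
def apply_retention(log, now, max_age_seconds=None, max_bytes=None):
    """Return the messages that survive retention.

    log is a list of (timestamp_seconds, payload_bytes), oldest first.
    """
    kept = list(log)
    if max_age_seconds is not None:
        # Time-based: drop everything older than the cutoff.
        kept = [(ts, p) for ts, p in kept if now - ts <= max_age_seconds]
    if max_bytes is not None:
        # Size-based: drop from the oldest end until the log fits.
        total = sum(len(p) for _, p in kept)
        while kept and total > max_bytes:
            _, payload = kept.pop(0)
            total -= len(payload)
    return kept


DAY = 24 * 60 * 60
log = [(0, b"aaaa"), (5 * DAY, b"bbbb"), (9 * DAY, b"cccc")]
by_time = apply_retention(log, now=10 * DAY, max_age_seconds=7 * DAY)
by_size = apply_retention(log, now=10 * DAY, max_bytes=8)
```

In this example both policies happen to remove only the oldest message: it is 10 days old, and dropping it brings the log from 12 bytes down to 8.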

Reads and writes use sequential disk access.

Kafka Message Format: every message carries a certain amount of overhead, so sending very small payloads one at a time is not cost-effective.
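A quick back-of-the-envelope calculation shows the effect. The 60-byte per-message overhead used below is a made-up figure for illustration only, not Kafka's actual record overhead, and `overhead_fraction` is a hypothetical helper.

```python
def overhead_fraction(payload_bytes, overhead_bytes, payloads_per_message=1):
    """Fraction of the bytes sent that is overhead rather than payload."""
    total_payload = payload_bytes * payloads_per_message
    return overhead_bytes / (overhead_bytes + total_payload)


# Assumed numbers: 10-byte payloads, 60 bytes of per-message overhead.
single = overhead_fraction(10, 60)                            # 60 / 70
packed = overhead_fraction(10, 60, payloads_per_message=100)  # 60 / 1060
```

Under these assumed numbers, about 86% of the bytes are overhead when each 10-byte payload travels alone, versus under 6% when 100 payloads share one message.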

How to manage schema:
Avro file: Avro schema (metadata) + Avro content
Parquet file: column-based format (commonly used with Spark)
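The "schema + content" split can be illustrated with a hand-rolled encoder for two Avro primitive types. This is only a sketch: it does not use the Avro library and does not produce the Avro container file format; `USER_SCHEMA` and the helper names are hypothetical. The point is that field names live in the schema, so the encoded content contains only values.

```python
def encode_long(n):
    """Zigzag plus variable-length encoding, as Avro uses for int/long."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes -> small numbers
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"long": encode_long, "string": encode_string}

# The schema is the metadata: it carries names and types.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}


def encode_record(schema, record):
    """Concatenate field values in schema order; no field names are written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


content = encode_record(USER_SCHEMA, {"id": 1, "name": "foo"})
```

A reader needs the same schema to interpret the bytes, which is why the schema has to be stored or shipped alongside the content.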

Serialization
Encoding

Data Replication

The unit of replication is the partition. Each replica of a partition takes one of two roles: leader or follower.
All reads and writes must go to the partition leader.

ISR (In-Sync Replicas): the replicas that are currently caught up with the leader.

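Kafka is generally used together with ZooKeeper.
ZooKeeper provides service discovery and can detect when a broker has gone down. If a partition leader goes down, the follower whose replication is most up to date (an in-sync replica) is promoted to leader; this election is coordinated through ZooKeeper.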
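Leader failover from the in-sync replicas can be sketched as follows. This is a toy model under stated assumptions: "in sync" is defined here by a message-count lag (`max_lag`), and the new leader is simply the most caught-up survivor, following the description above. The `Partition` class and its methods are hypothetical, and there is no ZooKeeper in the sketch.

```python
class Partition:
    """Toy replicated partition tracking how far each replica has caught up."""

    def __init__(self, replicas, max_lag=1):
        self.leader = replicas[0]
        self.log_end = {r: 0 for r in replicas}  # messages held per replica
        self.alive = set(replicas)
        self.max_lag = max_lag  # assumed definition of "in sync"

    def write(self, count=1):
        """All writes go to the leader."""
        self.log_end[self.leader] += count

    def replicate(self, follower):
        """The follower catches up fully with the leader."""
        self.log_end[follower] = self.log_end[self.leader]

    def isr(self):
        """Live replicas whose lag behind the leader is within max_lag."""
        lead = self.log_end[self.leader]
        return [r for r in self.log_end
                if r in self.alive and lead - self.log_end[r] <= self.max_lag]

    def fail(self, broker):
        """Mark a broker dead; if it led, promote the best in-sync replica."""
        in_sync = [r for r in self.isr() if r != broker]
        self.alive.discard(broker)
        if broker == self.leader:
            if not in_sync:
                raise RuntimeError("no in-sync replica available")
            self.leader = max(in_sync, key=lambda r: self.log_end[r])


p = Partition(["b1", "b2", "b3"])
p.write(5)
p.replicate("b2")  # b2 is caught up, b3 is 5 messages behind
p.fail("b1")       # b2 is the only in-sync follower, so it takes over
```

Restricting the choice to in-sync replicas means the new leader already holds (nearly) everything the old leader had acknowledged, so little or no data is lost on failover.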


Saturday, November 3, 2018
