Messages are appended to a topic-partition in the order they are sent.
Consumers read messages in the order stored in a topic-partition.
With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down.
This is why a replication factor of 3 is a good idea: it allows one broker to be taken down for maintenance, and still tolerates another broker failing unexpectedly.
As long as the number of partitions remains constant for a topic (no new partitions), the same key will always go to the same partition.
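To see why, here is a minimal sketch of hash-based partitioning. Kafka's default partitioner actually hashes keys with murmur2; crc32 is used here purely as a stand-in hash for illustration:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Toy partitioner: hash the key, mod the partition count.
    (Kafka's default partitioner uses murmur2; crc32 is a stand-in here.)"""
    return zlib.crc32(key) % num_partitions

# Same key and same partition count -> always the same partition.
assert partition_for(b"user-42", 3) == partition_for(b"user-42", 3)
```

This also shows why adding partitions breaks the guarantee: the modulus changes, so the same key can map to a different partition.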
Zookeeper:
- manages brokers (keeps a list of them);
- helps perform leader election for partitions;
- sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, topic deleted, etc.);
- is required by Kafka (Kafka can't work without Zookeeper);
- by design operates with an odd number of servers (3, 5, 7);
- has a leader (handles writes); the rest of the servers are followers (handle reads).
(Since Kafka v0.10, consumer offsets are stored in Kafka itself, NOT in Zookeeper.)
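The odd server counts come from majority-quorum math: an ensemble of N servers needs a majority (N//2 + 1) to operate, so it tolerates floor((N-1)/2) failures. A quick sketch:

```python
def zk_fault_tolerance(n_servers: int) -> int:
    """A Zookeeper ensemble needs a majority (quorum) to keep working,
    so n servers tolerate n - (n // 2 + 1) failures."""
    quorum = n_servers // 2 + 1
    return n_servers - quorum

# 3 servers tolerate 1 failure; 4 servers ALSO tolerate only 1 --
# which is why even ensemble sizes buy you nothing.
assert zk_fault_tolerance(3) == 1
assert zk_fault_tolerance(4) == 1
assert zk_fault_tolerance(5) == 2
```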
Every Kafka broker is also called a "bootstrap server".
That means that you only need to connect to one broker, and you will be connected to the entire cluster.
Each broker knows about all brokers, topics and partitions (metadata).
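A toy model of that bootstrapping idea (the names here are made up for illustration, not a real Kafka API): since every broker can serve the full cluster metadata, a client only needs one reachable broker.

```python
# Shared cluster metadata: every broker can serve all of it.
CLUSTER_METADATA = {
    "brokers": [100, 101, 102],  # broker IDs (arbitrary numbering)
    "topics": {"orders": {"partitions": 3}},
}

class Broker:
    def __init__(self, broker_id: int):
        self.broker_id = broker_id

    def get_metadata(self) -> dict:
        # Any broker returns the same, complete metadata.
        return CLUSTER_METADATA

def bootstrap(brokers: list, bootstrap_id: int) -> dict:
    """Connect to ONE broker and learn about the whole cluster."""
    first = next(b for b in brokers if b.broker_id == bootstrap_id)
    return first.get_metadata()

brokers = [Broker(i) for i in (100, 101, 102)]
meta = bootstrap(brokers, 101)
assert meta["brokers"] == [100, 101, 102]
```

It doesn't matter which broker you bootstrap from; the answer is the same.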
Kafka stores the offsets at which a consumer group has been reading.
The offsets committed live in a Kafka topic named __consumer_offsets.
When a consumer in a group has processed the data it received from Kafka, it should commit the offsets.
If a consumer dies, it will be able to read back from where it left off thanks to the committed consumer offsets!
Consumers choose when to commit offsets.
There are three delivery semantics: at most once (offsets are committed as soon as the message is received, so a message is lost if processing fails), at least once (offsets are committed after processing, so a message may be processed more than once), and exactly once (achievable for Kafka-to-Kafka workflows via the transactional APIs).
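The difference between at-most-once and at-least-once comes down to *when* the offset is committed relative to processing. A toy consumer loop (no real broker involved) makes this concrete:

```python
def consume(log, committed, commit_before_processing, crash_at=None):
    """Toy consumer loop: whether you commit before or after processing
    decides the delivery semantics. Returns (processed, committed offset)."""
    processed = []
    offset = committed
    while offset < len(log):
        if commit_before_processing:        # at-most-once
            committed = offset + 1
        if offset == crash_at:              # simulate a crash before processing
            return processed, committed
        processed.append(log[offset])
        if not commit_before_processing:    # at-least-once
            committed = offset + 1
        offset += 1
    return processed, committed

log = ["m0", "m1", "m2"]

# At-most-once: crash at m1 -> its offset was already committed, so m1 is lost.
_, committed = consume(log, 0, commit_before_processing=True, crash_at=1)
resumed, _ = consume(log, committed, commit_before_processing=True)
assert resumed == ["m2"]

# At-least-once: crash at m1 -> m1 is re-read (and re-processed) after restart.
_, committed = consume(log, 0, commit_before_processing=False, crash_at=1)
resumed, _ = consume(log, committed, commit_before_processing=False)
assert resumed == ["m1", "m2"]
```

This is also why at-least-once processing should be idempotent: the same message can show up twice.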
Consumers read data in a Consumer Group.
Each consumer within a group reads from exclusive partitions.
If you have more consumers than partitions, some consumers will be inactive.
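A small sketch of partition assignment shows both points at once: each partition goes to exactly one consumer, and extra consumers get nothing. (Real Kafka uses pluggable assignors such as range or round-robin; this is a simplified round-robin illustration.)

```python
def assign_round_robin(partitions, consumers):
    """Toy assignment: deal partitions out to consumers in turn.
    Each partition goes to exactly one consumer (exclusive ownership)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers: one consumer ends up idle.
a = assign_round_robin([0, 1, 2], ["c1", "c2", "c3", "c4"])
assert a == {"c1": [0], "c2": [1], "c3": [2], "c4": []}
```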
Okay, so we know all about topics, we know all about brokers, we know about replication, but now how do we get data into Kafka?
Well, that's the role of a Producer.
This article attempts to explain what Kafka's Producer is.
At a high level, just remember the core concepts: message keys, acknowledgements, and round-robin partitioning.
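Those concepts fit together like this: a producer with no key spreads messages round-robin across partitions, while a keyed message is hashed to a fixed partition. A toy sketch (Kafka's real default partitioner uses murmur2 and, in newer clients, sticky batching; crc32 stands in here):

```python
import itertools
import zlib

class ToyProducer:
    """Toy producer partitioning logic, for illustration only."""
    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._rr = itertools.count()  # round-robin counter for keyless sends

    def partition(self, key=None) -> int:
        if key is None:
            return next(self._rr) % self.num_partitions  # round robin
        return zlib.crc32(key) % self.num_partitions     # keyed: stable hash

p = ToyProducer(3)
# Keyless messages rotate across partitions...
assert [p.partition() for _ in range(4)] == [0, 1, 2, 0]
# ...while the same key always lands on the same partition.
assert p.partition(b"order-7") == p.partition(b"order-7")
```

Acknowledgements (acks) are the third piece: they control how many replicas must confirm a write before the producer considers it successful.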
Purpose of this article:
Our goal is to replace the Kafka Broker nodes. Two new nodes have been added to the CDP PvC Base cluster, and we will migrate two Kafka Brokers that were originally in use onto these new nodes.
This article uses Cloudera CDP official documentation as a guide.
As you've seen, Kafka is a distributed system.
We may have three Brokers or one hundred. In the big-data world, a distributed system needs replication: if a machine goes down, things still work, and replication is what makes that possible. This article looks at what the Replication Factor is in Kafka.
OK, so we've talked about Topics, but what holds the Topics?

Brokers:
- A Kafka cluster is composed of multiple brokers (servers).
- Each Broker is identified by its ID (an integer).
- Each Broker contains certain topic partitions.
- After connecting to any Broker (called a bootstrap broker), you will be connected to the entire cluster.
- A good number to get started is 3 Brokers, but some big clusters have over 100 Brokers.
- In these examples we choose to number Brokers starting at 100 (arbitrary).
The basic concept in Kafka is the Topic, which is split into one or more Partitions for storage. A Topic is a logical concept; a Partition is a physical entity. The unit of data written to Kafka is the message. Each message is stored at an Offset within a Partition. Offsets grow without bound, and an offset is only meaningful within a specific Partition.
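A tiny model of these relationships, with hypothetical class names: a partition is an append-only log where each message receives the next offset, and a topic is just a named set of partitions.

```python
class ToyPartition:
    """A partition is an append-only log; each message gets the next offset."""
    def __init__(self):
        self.log = []

    def append(self, message) -> int:
        self.log.append(message)
        return len(self.log) - 1  # the offset assigned to this message

class ToyTopic:
    """A topic is just a named set of partitions."""
    def __init__(self, num_partitions: int):
        self.partitions = [ToyPartition() for _ in range(num_partitions)]

t = ToyTopic(2)
assert t.partitions[0].append("a") == 0
assert t.partitions[1].append("b") == 0  # offsets are per-partition
assert t.partitions[0].append("c") == 1
```

Note that both partitions handed out offset 0: "offset 0" alone identifies nothing, only (topic, partition, offset) does.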