Big Data

HPE Ezmeral Data Fabric - Store and Protect Cluster Data - Work with Volumes

This course begins by teaching you about volumes and topology and how to design and implement a volume plan to manage your data. It then covers snapshots, which provide protection against user or application errors. Finally, you will learn how to use mirror volumes for load balancing, deployment, backup, or disaster recovery. Source: [HPE Ezmeral Learn On-Demand](https://learn.ezmeral.software.hpe.com/store-and-protect-cluster-data)

Use Flink in a kerberos-enabled CDP cluster to connect to Kafka that is not managed by Cloudera Manager

A customer asked how to use Flink in CDP to connect to a Kafka cluster outside the cluster (that is, a Kafka cluster not managed by Cloudera Manager). I looked for related Flink demos in Cloudera's official GitHub repository. In general, whether the client is Flink or Spark, it must connect to Kafka using one of the security protocols that the Kafka cluster is configured to accept.
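As a sketch of what that configuration might look like, here is a hypothetical set of Kafka client properties for a Kerberized Flink job talking to an external Kafka cluster. The hostnames are placeholders, and the protocol and mechanism must match whatever the external cluster actually exposes:

```properties
# Hypothetical client properties; hostnames are placeholders.
bootstrap.servers=ext-kafka-1:9092,ext-kafka-2:9092
# Must match the external cluster's listener, e.g. PLAINTEXT,
# SASL_PLAINTEXT, SASL_SSL, or SSL.
security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
```

If the external cluster is not Kerberized at all, `security.protocol=PLAINTEXT` and no SASL settings would apply instead.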

Learning Kafka | Theory-9 | Kafka Guarantees

Messages are appended to a topic-partition in the order they are sent. Consumers read messages in the order they are stored in a topic-partition. With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down. This is why a replication factor of 3 is a good idea: it allows one broker to be taken down for maintenance while another goes down unexpectedly. As long as the number of partitions remains constant for a topic (no new partitions), the same key will always go to the same partition.
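The key-to-partition guarantee can be sketched as a pure function. This is a simplified model: Kafka's default partitioner actually uses a murmur2 hash, while md5 is used here only to keep the sketch deterministic and dependency-free.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition (simplified model;
    Kafka's real default partitioner uses murmur2, not md5)."""
    digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return digest % num_partitions

# Same key, same partition count -> always the same partition.
p1 = partition_for(b"truck-42", 6)
p2 = partition_for(b"truck-42", 6)

# Adding partitions may change the mapping, which is why per-key
# ordering is only guaranteed while the partition count is constant.
p3 = partition_for(b"truck-42", 7)
```

Because the hash is deterministic, `p1 == p2` always holds; only a change in `num_partitions` can move a key to a different partition.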

Learning Kafka | Theory-8 | Zookeeper

Zookeeper manages brokers (keeps a list of them). Zookeeper helps in performing leader election for partitions. Zookeeper sends notifications to Kafka in case of changes (e.g. a new topic, a broker dies, a broker comes up, topics are deleted, etc.). Kafka can't work without Zookeeper. Zookeeper by design operates with an odd number of servers (3, 5, 7). Zookeeper has a leader (handles writes); the rest of the servers are followers (handle reads). (As of Kafka v0.10, Zookeeper no longer stores consumer offsets; they live in Kafka itself.)
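For reference, an odd-sized ensemble is declared in Zookeeper's `zoo.cfg`. This is an illustrative sketch with placeholder hostnames, not a production-tuned configuration:

```properties
# Sketch of a 3-node Zookeeper ensemble (zoo.cfg); hostnames are placeholders.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk-1:2888:3888
server.2=zk-2:2888:3888
server.3=zk-3:2888:3888
```

With three servers, the ensemble keeps a quorum (2 of 3) even if one server fails, which is why odd sizes are preferred.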

Learning Kafka | Theory-7 | Kafka Broker Discovery

Every Kafka broker is also called a "bootstrap server". That means you only need to connect to one broker, and you will be connected to the entire cluster. Each broker knows about all brokers, topics, and partitions (metadata).
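The bootstrap mechanism can be modeled as a toy function (not real client code): contacting any single broker returns metadata describing the whole cluster.

```python
# Toy model of broker discovery; broker and topic names are invented.
CLUSTER_METADATA = {
    "brokers": ["broker-1:9092", "broker-2:9092", "broker-3:9092"],
    "topics": {"orders": 6, "payments": 3},  # topic -> partition count
}

def bootstrap(initial_broker: str) -> dict:
    """Ask one broker for metadata; every broker can answer, so any
    single reachable broker is enough to discover the entire cluster."""
    assert initial_broker in CLUSTER_METADATA["brokers"]
    return CLUSTER_METADATA

# Connecting to just one broker yields the full broker list.
meta = bootstrap("broker-2:9092")
```

This is why real clients accept a `bootstrap.servers` list: any one entry that is reachable suffices to find the rest.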

Learning Kafka | Theory-6 | Consumer Offsets and Delivery Semantics

Kafka stores the offsets at which a consumer group has been reading. The committed offsets live in a Kafka topic named __consumer_offsets. When a consumer in a group has processed data received from Kafka, it should commit the offsets. If a consumer dies, it will be able to read back from where it left off thanks to the committed consumer offsets! Consumers choose when to commit offsets. There are three delivery semantics: at most once, at least once, and exactly once.
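The difference between at-most-once and at-least-once is simply *when* the offset is committed relative to processing. A toy sketch (all names invented) shows what happens when processing fails mid-batch:

```python
def consume(records, process, commit_first):
    """Process records; on failure, the committed offset determines
    what would be re-read after a restart."""
    committed = 0
    processed = []
    for i, rec in enumerate(records):
        if commit_first:          # at-most-once: commit, then process
            committed = i + 1
            try:
                processed.append(process(rec))
            except Exception:
                break             # record lost: offset already past it
        else:                     # at-least-once: process, then commit
            try:
                processed.append(process(rec))
            except Exception:
                break             # record will be re-read: not committed
            committed = i + 1
    return processed, committed

def flaky(rec):
    if rec == "bad":
        raise ValueError(rec)
    return rec.upper()

records = ["a", "bad", "c"]
at_most_once = consume(records, flaky, commit_first=True)    # (["A"], 2)
at_least_once = consume(records, flaky, commit_first=False)  # (["A"], 1)
```

With commit-first, the failed record "bad" is skipped on restart (possible data loss); with process-first, it is re-read (possible duplicates), which is why at-least-once processing should be idempotent.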

Learning Kafka | Theory-5 | Consumers and Consumer Groups

Consumers read data in a Consumer Group. Each consumer within a group reads from exclusive partitions. If you have more consumers than partitions, some consumers will be inactive.
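A toy round-robin assignment (a simplification of the real group rebalance protocol) illustrates both rules: partitions are divided exclusively among consumers, and surplus consumers sit idle.

```python
# Toy assignment of partitions to consumers in one group; names invented.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 3 consumers: one partition each.
balanced = assign([0, 1, 2], ["c1", "c2", "c3"])
# 3 partitions, 5 consumers: c4 and c5 get nothing (inactive).
surplus = assign([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
```

A useful rule of thumb follows directly: the partition count is the upper bound on useful parallelism within one consumer group.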

Learning Kafka | Theory-4 | Producers and Message Keys

We know about topics, brokers, and replication, but how do we get data into Kafka? That is the role of a producer. This article explains Kafka's producer: at a high level, remember the concepts of message keys, acknowledgements, and round-robin partitioning.
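The two routing behaviors can be sketched in a toy producer (invented class, not the real client API): messages without a key are spread round-robin, while messages with a key always land on the key's partition. The md5 hash stands in for Kafka's real murmur2-based default partitioner.

```python
import hashlib

class ToyProducer:
    """Toy model of producer partition routing (not real client code)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._next = 0

    def partition(self, key=None):
        if key is None:                      # no key: round-robin
            p = self._next % self.num_partitions
            self._next += 1
            return p
        # keyed: deterministic hash (simplified; real default is murmur2)
        digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
        return digest % self.num_partitions

producer = ToyProducer(3)
keyless = [producer.partition() for _ in range(4)]         # [0, 1, 2, 0]
keyed = {producer.partition(b"user-1") for _ in range(4)}  # one partition
```

Keyless sends balance load across partitions; keyed sends trade that balance for per-key ordering.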

Migrate brokers by modifying broker IDs in meta.properties

Purpose of this article: to replace Kafka broker nodes. Two new nodes have been added to the CDP PvC Base cluster, and we will migrate two Kafka brokers that were originally in use onto these new nodes. This article uses the official Cloudera CDP documentation as a guide.
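The core of this kind of migration is the `broker.id` line in the log directory's `meta.properties`. The sketch below edits a sample file in a temporary directory; the paths, IDs, and cluster ID are illustrative, and on a real broker you would stop the service before touching this file.

```shell
# Build a sample meta.properties in a scratch directory (values illustrative).
workdir=$(mktemp -d)
cat > "$workdir/meta.properties" <<'EOF'
version=0
broker.id=25
cluster.id=abc123
EOF

# Point the migrated log directory at the target broker ID (e.g. 25 -> 27)
# so the new node comes up as that broker.
sed -i 's/^broker\.id=.*/broker.id=27/' "$workdir/meta.properties"
grep '^broker.id' "$workdir/meta.properties"
```

The ID written here must match the broker ID configured for that host in Cloudera Manager, or the broker will refuse to start against the copied data.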

Learning Kafka | Theory-3 | Topic Replication

As you've seen, Kafka is a distributed system: we may have three brokers or one hundred. Any distributed system in the big data world needs replication so that if a machine goes down, things still work. This article analyzes what the replication factor in Kafka is.