1.7. Troubleshooting

We have spent thousands of hours with the Kafka stack. For every Fast Data release we test both our code and Kafka itself in order to weed out any bugs. Below are some of the most common issues we encountered.

1.7.1. Service Starts but Doesn’t Work

In such cases the best bet is to check the service’s role log from Cloudera Manager. In our experience there are cases where a service process may start but get stuck during initialization. The logs will usually show the reason.

An example is the Kafka Brokers: if they start with too little RAM available, they won’t be able to start their listeners.
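
If you suspect a broker started but never brought up its listeners, a quick check on the node itself can confirm it. This is only a sketch, assuming the default plaintext listener port 9092; adjust it to your configuration:

# Check that the broker process is actually running.
ps aux | grep -i '[k]afka'

# Check that the listener port is open (9092 is the default, adjust if needed).
ss -ltn | grep 9092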

Unfortunately given the facilities provided by Kafka and Cloudera Manager, it is very difficult for us to catch such errors for the time being.

Starting with CDH 5.8, Cloudera Manager can detect some errors in JVM based applications. We do make use of this feature but your mileage may vary.

1.7.2. Kafka REST Avro Support

For Kafka REST Proxy to support Avro, access to a running Schema Registry is needed.
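
A quick way to verify the wiring is to check that the Schema Registry answers and then produce a test Avro record through the REST Proxy. The sketch below assumes the default ports (8081 for Schema Registry, 8082 for Kafka REST) and a hypothetical topic name; adjust hosts, ports and topic to your deployment:

# Verify the Schema Registry is reachable from the REST Proxy node.
curl http://schema-registry-host:8081/subjects

# Produce a test Avro record through the REST Proxy (v1 API).
curl -X POST http://rest-proxy-host:8082/topics/avro-test \
  -H "Content-Type: application/vnd.kafka.avro.v1+json" \
  --data '{"value_schema": "{\"type\": \"record\", \"name\": \"Test\", \"fields\": [{\"name\": \"f1\", \"type\": \"string\"}]}", "records": [{"value": {"f1": "hello"}}]}'

If the Schema Registry is down or unreachable, the Avro produce request will fail instead of returning the offsets of the written records.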

1.7.3. Schema Registry Starts but Fails Later

Schema Registry uses a Kafka producer to store information in its topics. On some rare occasions we’ve seen this producer fail to start due to memory issues. Increasing the memory limit for Schema Registry doesn’t seem to help.

There seems to be some connection between the number of bootstrap servers (listeners) and this issue, so you could try to lower the number of listeners if you can, e.g. if you have multiple listeners for each broker.

1.7.4. A Schema Registry Instance Fails to Start

An instance of Schema Registry at version 3.2.2 may fail to start when there is already a master instance running, or when all instances try to start together. Usually it is enough to restart the failed instance, and doing so is completely safe.

1.7.5. Connect Doesn’t Start

There are usually two scenarios that lead to this issue.

  1. The first time a Connect role is started, the setup script creates the topics for Connect Distributed’s data with a replication factor of 3, as suggested by Confluent. If you have fewer than 3 brokers, the topics will fail to be created. Adjust the replication factor for Connect’s topics and start the Connect Distributed instances again (see the sketch after this list).

    Should you ever need to increase the replication factor, either create a new cluster or perform the steps described in Kafka’s documentation.

  2. The Connect topics are only created once, but every time the Connect workers start they check whether they can read from and write to the configuration topics. If this test fails, the Connect worker will exit. In such a case please ensure that the brokers are in a good state. There are cases where the brokers may be up but unable to work properly. The monitoring CSD can help identify such issues: for example, if a broker is up but doesn’t work, its partitions will show as offline in Grafana.
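
To check the replication factor of Connect’s internal topics, you can describe them with the kafka-topics tool shipped in the parcel (it may be named kafka-topics.sh depending on the distribution). This is only a sketch: the topic names below (connect-configs, connect-offsets, connect-status) and the ZooKeeper address are assumptions, so substitute the names the setup script actually used:

# Describe Connect's internal topics and check the ReplicationFactor column.
for t in connect-configs connect-offsets connect-status; do
  kafka-topics --zookeeper zookeeper-host:2181 --describe --topic "$t"
done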

1.7.6. Logs Error: No such file or directory

We have found that on some occasions Cloudera Manager doesn’t give proper ownership to the logging directory, so the Kafka services can’t write to their logs.

The Fast Data services run under the fastdata user and fastdata group. If you encounter this error, please make sure that, on the nodes that run Fast Data services, the /var/log/fastdata directory belongs to this user and group. If not, assign ownership manually:

chown fastdata:fastdata /var/log/fastdata
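
To verify the current ownership on a node, you can simply list the directory; both the owner and the group in the output should read fastdata:

ls -ld /var/log/fastdata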

1.7.7. Service starts but exits without errors in logs

This could be due to the process being killed for using too much memory (an out of memory error). Cloudera Manager 5.5 and onwards will usually show this in the process status. Try to increase the heap size for the process. For Kafka REST you may also try to decrease the consumer.request.max.bytes property.
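
If you suspect an out of memory kill, the kernel log on the node usually records it. A minimal sketch (the exact wording of the messages varies between kernel versions):

# Look for OOM killer activity around the time the process exited.
dmesg | grep -iE 'out of memory|killed process'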

Another scenario that can lead to such an issue is when incompatible classes are loaded. This holds especially true for Kafka Connect: adding a connector to the classpath that shadows libraries used by Kafka with its own versions may lead to the Connect worker crashing with no message whatsoever (even if the log level is set to TRACE). This is the reason we compile the Stream Reactor collection of connectors with the exact same version of Kafka we provide in our KAFKA_CONFLUENT parcel.

1.7.8. HDFS Connector does not work

Note

This issue happened with the 3.1.x version of the KAFKA_CONFLUENT parcel. Please contact us if you still see this issue with 3.2.x.

If you have installed and activated the FAST_DATA_STREAM_REACTOR parcel, it can cause problems with the HDFS connector that comes with the FAST_DATA_KAFKA_CONFLUENT parcel, due to the included HBase connector jars shadowing some Hadoop Java classes.

Please deactivate the FAST_DATA_STREAM_REACTOR parcel. We will issue a proper solution in the future that will let you disable only specific connectors.
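
If you want to confirm which jars are responsible before deactivating the parcel, you can scan them for bundled Hadoop classes. This is only a sketch and assumes the parcel is installed under the default /opt/cloudera/parcels path:

# List Stream Reactor jars that bundle Hadoop classes and may shadow the
# ones used by the HDFS connector.
for jar in $(find /opt/cloudera/parcels/FAST_DATA_STREAM_REACTOR -name '*.jar'); do
  unzip -l "$jar" 2>/dev/null | grep -q 'org/apache/hadoop/' && echo "$jar"
done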

1.7.9. Underreplicated partitions with Cached zkVersion not equal to that in zookeeper error

In some cases you may notice underreplicated partitions and brokers out of sync whilst your cluster is all green with a log message like:

INFO Partition [topic,n] on broker m: Cached zkVersion [xxx] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

This is a documented issue. It may be fixed with a rolling restart of the brokers. If your Cloudera license does not permit automatic rolling restart, just restart the brokers one by one by hand.
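
To confirm which partitions are affected and that the rolling restart resolved the problem, you can query Kafka directly. A sketch, assuming the kafka-topics tool from the parcel (possibly named kafka-topics.sh) and your ZooKeeper quorum address:

# List partitions that are currently under-replicated; run it before and
# after the rolling restart.
kafka-topics --zookeeper zookeeper-host:2181 --describe --under-replicated-partitions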

To prevent the issue from recurring you can try to increase the ZooKeeper session timeout for the brokers, via the zookeeper.session.timeout.ms setting in Fast Data’s configuration page. Since this change requires a restart of the brokers, you can adjust it before performing the rolling restart that will fix the issue.

(Figure: Increase the ZooKeeper Session Timeout for the Brokers to possibly avoid this in the future.)

1.7.10. Kafka REST does not return messages

The consumer.request.timeout.ms option is set to 1000 by default. Frequently this leads to Kafka REST not returning any messages for topics with more than a couple of thousand messages, especially if the consumer’s offset reset is set to earliest.

To fix this, consumer.request.timeout.ms should be set to 30000. Due to a bug in version 3.2.x of Kafka REST, setting it lower isn’t possible because this setting leaks into the Kafka REST consumer’s request.timeout.ms option, which can’t be lower than a certain threshold (it depends on other configuration values).
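
If you want to reproduce the symptom outside your application, you can drive a REST consumer with curl. This is a hedged sketch, assuming your Kafka REST version supports the v2 consumer API and using a hypothetical consumer group, instance and topic name; adjust host, port and names to your setup:

# 1. Create a consumer instance that starts from the earliest offset.
curl -X POST http://rest-proxy-host:8082/consumers/test-group \
  -H "Content-Type: application/vnd.kafka.v2+json" \
  --data '{"name": "test-instance", "format": "json", "auto.offset.reset": "earliest"}'

# 2. Subscribe the instance to a topic.
curl -X POST http://rest-proxy-host:8082/consumers/test-group/instances/test-instance/subscription \
  -H "Content-Type: application/vnd.kafka.v2+json" \
  --data '{"topics": ["my-topic"]}'

# 3. Fetch records; with the default 1000 ms timeout this often comes back
#    empty on large topics, while a 30000 ms timeout returns data.
curl http://rest-proxy-host:8082/consumers/test-group/instances/test-instance/records \
  -H "Accept: application/vnd.kafka.json.v2+json"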

1.7.11. Brokers fail to start with Zookeeper Authentication

In version 3.2.x Kafka brokers may fail to start if ZooKeeper authentication (authenticate.zookeeper.connection) is enabled and no broker is already up.

The workaround is to perform a rolling restart whenever a restart of all brokers is needed. If for some reason all brokers are down, please temporarily disable ZooKeeper authentication, start one broker, re-enable ZooKeeper authentication, start the rest of the brokers and finally restart the first one.

1.7.12. Generic Debug Information

A suggested debug process of a failing instance/role is:

  1. Check the stdout and stderr output for any messages. If the startup failed due to misconfiguration, the setup script may print a useful message in stdout.
  2. Check the role log for any messages at ERROR or FATAL level (see the example after this list). Also check the timestamp of the log messages, because the role may not have written any messages and Cloudera Manager may be showing the messages from a previous instance start.
  3. Set the Logging Threshold to DEBUG or TRACE, start the instance again and check the log messages. Usually this method is used to find lower-level issues such as networking problems.
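
For step 2, a quick way to scan the role logs directly on the node is a simple grep. This assumes the default /var/log/fastdata logging directory; the exact log file names depend on the role:

# Show the most recent error-level messages from the Fast Data role logs.
grep -E 'ERROR|FATAL' /var/log/fastdata/*.log | tail -n 50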