Thanks .. it worked!

On Wed, Jul 19, 2017 at 3:17 AM, Abdoulaye Diallo <abdoulaye...@gmail.com> wrote:
> Look into these 2 props:
>
>   rotate.schedule.interval.ms
>   flush.size
>
> On Tue, Jul 18, 2017 at 2:46 PM, Abdoulaye Diallo
> <abdoulaye...@gmail.com> wrote:
>
>> Hi Debasish,
>>
>> > flush.size=3
>> This means every batch of 3 messages in that topic ends up in its own
>> HDFS file, which is probably why you end up with so many files that ls
>> hurts. You should flush bigger batches, or flush after a high enough
>> interval.
>>
>> > tasks.max=1
>> Unless you have a single-partition topic, you need to raise this number
>> for better parallelism.
>>
>> HTH,
>> Abdoulaye
>>
>> On Tue, Jul 18, 2017 at 11:12 AM, Debasish Ghosh
>> <ghosh.debas...@gmail.com> wrote:
>>
>>> Hi -
>>>
>>> I have a Kafka Streams application that generates Avro records in a
>>> topic, which is being read by a Kafka Connect process that uses the
>>> HDFS sink connector. The topic has around 1.6 million messages, and
>>> the Kafka Connect script is as follows ..
>>>
>>> bin/connect-standalone \
>>>   etc/schema-registry/connect-avro-standalone.properties \
>>>   etc/kafka-connect-hdfs/quickstart-hdfs.properties
>>>
>>> where quickstart-hdfs.properties contains the following ..
>>>
>>> name=hdfs-sink
>>> connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
>>> tasks.max=1
>>> topics=avro-topic
>>> hdfs.url=hdfs://0.0.0.0:9000
>>> flush.size=3
>>>
>>> The problem is that the Kafka Connect process looks to be running in
>>> an infinite loop, with messages like the following ..
>>>
>>> [2017-07-18 20:02:04,487] INFO Starting commit and rotation for topic
>>> partition avro-topic-0 with start offsets {partition=0=1143033} and
>>> end offsets {partition=0=1143035}
>>> (io.confluent.connect.hdfs.TopicPartitionWriter:297)
>>> [2017-07-18 20:02:04,491] INFO Committed
>>> hdfs://0.0.0.0:9000/topics/avro-topic/partition=0/avro-topic+0+0001143033+0001143035.avro
>>> for avro-topic-0 (io.confluent.connect.hdfs.TopicPartitionWriter:625)
>>>
>>> The result is that so many Avro files are created that I cannot do an
>>> ls on the folder.
>>>
>>> $ hdfs dfs -ls /topics/avro-topic
>>> Found 1 items
>>> drwxr-xr-x   - debasishghosh supergroup          0 2017-07-18 20:02
>>> /topics/avro-topic/partition=0
>>>
>>> Trying to list one level deeper in the HDFS folder results in an
>>> OutOfMemoryError ..
>>>
>>> $ hdfs dfs -ls /topics/avro-topic/partition=0
>>> 17/07/18 20:02:19 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead
>>> limit exceeded
>>>         at java.util.Arrays.copyOfRange(Arrays.java:3664)
>>>         at java.lang.String.<init>(String.java:207)
>>>         at java.lang.String.substring(String.java:1969)
>>>         at java.net.URI$Parser.substring(URI.java:2869)
>>>         at java.net.URI$Parser.parseHierarchical(URI.java:3106)
>>>         ...
>>>
>>> Why is the Kafka Connect program going into an infinite loop? How can
>>> I prevent it?
>>>
>>> I am using Confluent 3.2.2 for the schema registry and Avro
>>> serialization, and Apache Kafka 0.10.2.1 for the Kafka Streams client
>>> and the broker.
>>>
>>> Help?
>>>
>>> regards.

--
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh

Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg
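
For reference, a minimal sketch of how quickstart-hdfs.properties might
look after applying the two properties Abdoulaye mentions above. The
specific values (flush.size=10000, the 10-minute rotation schedule,
tasks.max=4) are illustrative assumptions, not taken from the thread:

  name=hdfs-sink
  connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
  # At most one task per topic partition actually does work, so size
  # this to the partition count of avro-topic (4 is a placeholder).
  tasks.max=4
  topics=avro-topic
  hdfs.url=hdfs://0.0.0.0:9000
  # Commit a file after 10000 records instead of every 3 ...
  flush.size=10000
  # ... or at least every 10 minutes of wall-clock time, whichever
  # comes first (illustrative value, in milliseconds).
  rotate.schedule.interval.ms=600000
  # The HDFS connector expects timezone to be set when schedule-based
  # rotation is enabled.
  timezone=UTC

With settings along these lines, each partition produces at most one
file per flush.size records or per rotation interval, which keeps the
file count under /topics/avro-topic/partition=0 small enough for
hdfs dfs -ls to handle.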