Thanks .. it worked!

On Wed, Jul 19, 2017 at 3:17 AM, Abdoulaye Diallo <abdoulaye...@gmail.com> wrote:
> Look into these 2 props:
>
>   rotate.schedule.interval.ms
>   flush.size
>
> On Tue, Jul 18, 2017 at 2:46 PM, Abdoulaye Diallo
> <abdoulaye...@gmail.com> wrote:
>
>> Hi Debasish,
>>
>> > flush.size=3
>> This means every batch of 3 messages in that topic ends up in its own
>> HDFS file, which is probably why you end up with so many files that ls
>> hurts. You should flush bigger batches, or flush after a high enough
>> interval.
>>
>> > tasks.max=1
>> Unless you have a single-partition topic, you need to raise this number
>> for better parallelism.
>>
>> HTH,
>> Abdoulaye
>>
>> On Tue, Jul 18, 2017 at 11:12 AM, Debasish Ghosh
>> <ghosh.debas...@gmail.com> wrote:
>>
>>> Hi -
>>>
>>> I have a Kafka Streams application that generates Avro records in a
>>> topic, which is being read by a Kafka Connect process that uses the
>>> HDFS sink connector. The topic has around 1.6 million messages, and
>>> the Kafka Connect script is as follows ..
>>>
>>> bin/connect-standalone \
>>>   etc/schema-registry/connect-avro-standalone.properties \
>>>   etc/kafka-connect-hdfs/quickstart-hdfs.properties
>>>
>>> where quickstart-hdfs.properties contains the following ..
>>>
>>> name=hdfs-sink
>>> connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
>>> tasks.max=1
>>> topics=avro-topic
>>> hdfs.url=hdfs://0.0.0.0:9000
>>> flush.size=3
>>>
>>> The problem is that the Kafka Connect process looks to be running in
>>> an infinite loop, with messages like the following ..
>>>
>>> [2017-07-18 20:02:04,487] INFO Starting commit and rotation for topic
>>> partition avro-topic-0 with start offsets {partition=0=1143033} and
>>> end offsets {partition=0=1143035}
>>> (io.confluent.connect.hdfs.TopicPartitionWriter:297)
>>> [2017-07-18 20:02:04,491] INFO Committed
>>> hdfs://0.0.0.0:9000/topics/avro-topic/partition=0/avro-topic+0+0001143033+0001143035.avro
>>> for avro-topic-0 (io.confluent.connect.hdfs.TopicPartitionWriter:625)
>>>
>>> The result is that so many Avro files are created that I cannot do an
>>> ls on the folder.
>>>
>>> $ hdfs dfs -ls /topics/avro-topic
>>> Found 1 items
>>> drwxr-xr-x   - debasishghosh supergroup          0 2017-07-18 20:02
>>> /topics/avro-topic/partition=0
>>>
>>> Trying to list one level deeper in the HDFS folder results in an
>>> OutOfMemoryError ..
>>>
>>> $ hdfs dfs -ls /topics/avro-topic/partition=0
>>> 17/07/18 20:02:19 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead
>>> limit exceeded
>>>         at java.util.Arrays.copyOfRange(Arrays.java:3664)
>>>         at java.lang.String.<init>(String.java:207)
>>>         at java.lang.String.substring(String.java:1969)
>>>         at java.net.URI$Parser.substring(URI.java:2869)
>>>         at java.net.URI$Parser.parseHierarchical(URI.java:3106)
>>>         ...
>>>
>>> Why is the Kafka Connect program going into an infinite loop? How can
>>> I prevent it?
>>>
>>> I am using Confluent 3.2.2 for the schema registry and Avro
>>> serialization, and Apache Kafka 0.10.2.1 for the Kafka Streams client
>>> and the broker.
>>>
>>> Help?
>>>
>>> regards.

--
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh

Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg
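
For reference, a minimal sketch of how quickstart-hdfs.properties might
look after applying the two properties Abdoulaye mentions above. The
specific values (flush.size=10000, the 10-minute rotation schedule,
tasks.max=4) are illustrative assumptions, not taken from the thread:

  name=hdfs-sink
  connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
  # At most one task per topic partition actually does work, so size
  # this to the partition count of avro-topic (4 is a placeholder).
  tasks.max=4
  topics=avro-topic
  hdfs.url=hdfs://0.0.0.0:9000
  # Commit a file after 10000 records instead of every 3 ...
  flush.size=10000
  # ... or at least every 10 minutes of wall-clock time, whichever
  # comes first (illustrative value, in milliseconds).
  rotate.schedule.interval.ms=600000
  # The HDFS connector expects timezone to be set when schedule-based
  # rotation is enabled.
  timezone=UTC

With settings along these lines, each partition produces at most one
file per flush.size records or per rotation interval, which keeps the
file count under /topics/avro-topic/partition=0 small enough for
hdfs dfs -ls to handle.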