From your original email, option 1 is the correct approach. You are
right that it is performing extra deserialization/serialization, but
this is necessary to deal with the encoded schema references which
really are modified Avro records.

In option 2, if you take a whole bunch of records where the content is
"schema ref + bare avro" and merge them together one after another, the
result is not valid Avro. Nothing understands how to read it, and no
reader expects multiple messages like this in a single flow file; that
is why PutParquet can't read it.
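To make the problem concrete, here is a rough Python sketch. It assumes a hypothetical schema-reference header layout of 1-byte protocol id + 8-byte schema id + 4-byte schema version (the actual layout depends on your Schema Registry configuration, so treat these field sizes as an illustration only):

```python
import struct

# Assumed (hypothetical) schema-reference header layout:
# 1-byte protocol id, 8-byte schema id, 4-byte schema version.
HEADER = struct.Struct(">BqI")

def split_message(msg: bytes):
    """Split one Kafka message into its schema reference and bare Avro datum."""
    protocol_id, schema_id, version = HEADER.unpack_from(msg, 0)
    return (protocol_id, schema_id, version), msg[HEADER.size:]

# Two fake messages: schema-ref header + bare Avro datum bytes
# (a bare datum carries no schema and no Avro file header).
m1 = HEADER.pack(1, 42, 3) + b"\x06foo"
m2 = HEADER.pack(1, 42, 3) + b"\x06bar"

# What binary concatenation (MergeContent) would produce:
merged = m1 + m2

# A readable Avro data file must start with the magic bytes b"Obj\x01",
# followed by file metadata (including the schema) and sync-marker-delimited
# data blocks. The merged bytes have none of that, which is why an
# AvroReader cannot parse them.
print(merged.startswith(b"Obj\x01"))  # False

ref, datum = split_message(m1)
print(ref, datum)  # (1, 42, 3) b'\x06foo'
```

Option 1 pays the deserialize/serialize cost precisely to strip these per-message headers and rewrite the records into one proper Avro container file.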

In order to understand the slowdown, we will need more info...

- What version of Kafka broker are you using?
- Are you using the corresponding version of the record processor?
(i.e. if broker is 1.0.0 then should use ConsumeKafkaRecord_1_0_0)
- How many partitions does your topic have?
- How many nodes in your NiFi cluster?
- How many concurrent tasks configured for ConsumeKafkaRecord?
- What is the record batch size for ConsumeKafkaRecord?

On Wed, Dec 12, 2018 at 5:07 AM Krzysztof Zarzycki <[email protected]> wrote:
>
> Hello,
> I just pull the thread up, if someone knows how to make the avro messages 
> consumption faster, I would be grateful.
> Some more info: When we switched from ConsumeKafka with jsons to 
> ConsumeKafkaRecord with avro messages, we experienced a serious slowdown 
> (mutliple X) . I can get more data what slowdown precisely, but my question 
> about ConsumeKafka/MergeContent based flow becomes even more relevant to me.
> Or maybe I'm doing something wrong, that ConsumeKafkaRecord is so slower?
>
> BTW, I'm on NiFi 1.7.1.
>
> Thank you,
> Krzysztof Zarzycki
>
>
> pt., 7 gru 2018 o 22:24 Krzysztof Zarzycki <[email protected]> napisaƂ(a):
>>
>> Hi everyone,
>> I think I have quite a standard problem, and maybe the answer is quick, 
>> but I can't find it on the internet.
>> We have Avro messages in a Kafka topic, written with an HWX schema reference. 
>> We're able to read them with e.g. ConsumeKafkaRecord with an AvroReader.
>>
>> Now we would like to merge smaller flowfiles into larger files, because we 
>> load these files into HDFS. What combination of processors should we use to 
>> achieve this with the highest performance?
>> Option 1: ConsumeKafkaRecord with AvroReader and AvroRecordSetWriter, then 
>> MergeRecord with AvroReader/AvroRecordSetWriter. It works and seems 
>> straightforward, but to me it looks like there are too many interpretations 
>> and rewrites of records. Each record interpretation is an unnecessary cost 
>> of deserialization and then serialization through the Java heap.
>>
>> Option 2: somehow configure ConsumeKafka and MergeContent to do this? We 
>> used this combination for simple JSONs (with binary concatenation), but we 
>> can't get it right with Avro messages carrying a schema reference (the 
>> PutParquet processor can't read the merged files with AvroReader). On the 
>> other hand, this should be the fastest, as there is no data interpretation, 
>> just byte-for-byte rewriting. Maybe we just haven't tried the right 
>> combination of configurations?
>>
>> Maybe other options?
>>
>> Thank you for any advice.
>> Krzysztof
