Re: MergeRecord takes too much time

2024-03-12 Thread Chris Sampson
I’m not an expert with MergeRecord, but looking at your screenshots, I’d guess 
that your setup is taking that long to reach one of the defined “maximum” 
settings, e.g. 2GB, 5,000,000 records, or 3600 seconds (1 hour).

How large (number of records and content size in bytes) are the typical 
FlowFiles you’re sending to MergeRecord, and how often do they arrive? For 
example, if you’re getting results from Elasticsearch in chunks of 10,000 
records per response and that’s 10MB in size every 1 second, it’s going to take 
a long time to meet any of the defined maximums.

How have you configured the Elasticsearch processor and what are you trying to 
combine together in your flow? Is it that you’re outputting a single FlowFile 
per Response (the default setting for “Search Results Split”), then trying to 
merge together all responses from a single query into one FlowFile? If so, I’d 
suggest changing the Elasticsearch processor’s “Search Results Split” to 
“Per Query” instead, and increasing the “Size” setting (leaving this blank will 
use the Elasticsearch default page size, which is often set to “10”). You might 
then be able to avoid the need for MergeRecord at all, and the conversion of 
JSON to Parquet could be done with a ConvertRecord processor instead, for 
example.
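
As a rough, purely illustrative calculation (the numbers below are hypothetical, 
not taken from your flow), the page size drives how many small FlowFiles end up 
queued in front of MergeRecord when splitting per response:

    # Illustrative only -- hypothetical query and page sizes.
    total_hits = 5_000_000      # documents matched by one query
    default_page_size = 10      # Elasticsearch default when "Size" is left blank
    larger_page_size = 10_000   # hypothetical value after raising "Size"

    # With "Search Results Split" = "Per Response", one FlowFile per page:
    print(total_hits // default_page_size)  # 500,000 FlowFiles to merge
    print(total_hits // larger_page_size)   # 500 FlowFiles to merge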


Cheers,

---
Chris Sampson
IT Consultant
chris.samp...@naimuri.com


> On 12 Mar 2024, at 08:38, edi mari  wrote:
> 
> Hi, 
> My task is to query Elastic and save the results in a Parquet file.
> I'm querying Elastic using the PaginatedJsonQueryElasticsearch processor. 
> The files coming from Elastic are in JSON format, and I'm using the 
> MergeRecord processor to convert the JSON to Parquet format and merge the 
> result into one file. 
> The Record Reader uses the JsonTreeReader and Record Writer uses the 
> ParquetRecordSetWriter controller, which uses AVRO schema (Schema Text) to 
> help with the conversion task. 
> 
> The process works fine; the only problem is that it takes too much time. 
> Converting 5 MB takes 30 minutes. 
> 
> Do you have any idea how to enhance the process?  
> 



Re: MergeRecord performance

2020-06-01 Thread Robert R. Bruno
I have back pressure object threshold set to 10 on that queue and my
swap threshold is 20.  I don't think the number of flow files in the queue in
question was very high when I had the issue, though, since the issue was now at
updaterecord, after I did a mergecontent that greatly reduced the number of
flow files.

On Mon, Jun 1, 2020, 16:02 Mark Payne  wrote:

> Hey Robert,
>
> How big are the FlowFile queues that you have in front of your
> MergeContent/MergeRecord processors? Or, more specifically, what do you
> have configured for the back pressure threshold? I ask because there was a
> fix in 1.11.0 [1] that had to do with ordering when swapping and ensuring
> that data remains in the same order after being swapped out and swapped
> back in when using the FIFO prioritizer.
>
> Some of the changes there can actually change the thresholds when we
> perform swapping. So I’m curious if you’re seeing a lot of swapping of
> FlowFiles to/from disk when running in 1.11.4 that you didn’t have in
> 1.9.2. Are you seeing logs about swapping occurring? And of note, when I
> talk about swapping, I’m talking about NiFi-level FlowFile swapping, not
> OS-level swapping.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-7011
>
>
> On May 22, 2020, at 10:35 AM, Robert R. Bruno  wrote:
>
> Sorry one other thing I thought of that may help.  I noticed on 1.11.4
> when I would stop the updaterecord processor it would take a long period of
> time for the processor to stop (threads were hanging), but when I went back
> to 1.9.2 the processor would stop in a very timely manner.  Not sure if
> that helps, but just another data point.
>
> On Fri, May 22, 2020 at 9:22 AM Robert R. Bruno  wrote:
>
>> I had more updates on this.
>>
>> Yesterday I again attempted to upgrade one of our 1.9.2 clusters that is
>> now using mergecontent vs mergerecord.  The flow had been running on 1.9.2
>> for about a week with no issue.  I did the upgrade to 1.11.4, and saw about
>> 3 of 10 nodes not being able to keep up.  The load on these 3 nodes became
>> very high.  For perspective, a load of 80 is about as high as we like to
>> see these boxes, and some were getting as high as 120.  I saw one
>> bottleneck forming at an updaterecord.  I tried giving that processor a few
>> more threads to see if it would help work off the backlog.  No matter what
>> I tried (lowering threads, changing mergecontent sizes, etc) the load
>> wouldn't go down on those 3 boxes and they had either a slowly growing
>> backlog or would maintain the backlog they had.
>>
>> I then decided to downgrade the nifi back to 1.9.2 without rebooting the
>> boxes.  I kept all flow files and content as they were.  Upon downgrading
>> no loads were above 50 and this was only on the boxes that had the backlog
>> that formed when we did the upgrade.  The backlog on the 3 boxes worked off
>> with no issue at all, and without me having to make changes to the flow.
>> Once backlogs were worked off then our loads all sat around 20.
>>
>> This is a similar behavior from what we saw before, but just in another
>> part of the flow.  Has anyone else seen anything like this on 1.11.4?
>> Unfortunately for now we can't upgrade due to this problem.  Any thoughts
>> from anyone would be greatly appreciated.
>>
>> Thanks,
>> Robert
>>
>> On Fri, May 8, 2020 at 4:47 PM Robert R. Bruno  wrote:
>>
>>> Sorry for the delayed answer, but was doing some testing this week and
>>> found a few more things out.
>>>
>>> First to answer some of your questions.
>>>
>>> I would say with no actual raw numbers, it was worse than a 10%
>>> degradation.  I say this since the flow was badly backing up, and a 10%
>>> decrease in performance should not have caused this since normally we can
>>> work off a backlog of data with no issues.  I looked at my mergerecord
>>> settings, and I am largely using size as the limiting factor.  I have a max
>>> size of 4MB and a max bin age of 1 minute followed by a second mergerecord
>>> with a max size of 32MB and a max bin age of 5 minutes.
>>>
>>> I changed our flow a bit on a test system that was running 1.11.4, and
>>> discovered the following:
>>>
>>> I changed mergerecords to mergecontents.  I used pretty much all of the
>>> same settings in the mergecontent but had the mergecontent deal with the
>>> avro natively.  In this flow, it currently seems like I don't need to chain
>>> multiple mergecontents together like I did with mergerecords.
>>>
>>> I then fed the merged avro from the mergecontent to a convertrecord to
>>> convert the data to parquet.  The convertrecord was tremendously slower
>>> than the mergecontent and became a bottleneck.  I then switched the
>>> convertrecord to the convertavrotoparquet processor.  Convertavrotoparquet
>>> can easily handle the output speed of the mergecontent and then some.
>>>
>>> My hope is to make these changes to our actual flow soon, and then
>>> upgrade to 1.11.4 again.  I'll let you know how that goes.

Re: MergeRecord performance

2020-06-01 Thread Mark Payne
Hey Robert,

How big are the FlowFile queues that you have in front of your 
MergeContent/MergeRecord processors? Or, more specifically, what do you have 
configured for the back pressure threshold? I ask because there was a fix in 
1.11.0 [1] that had to do with ordering when swapping and ensuring that data 
remains in the same order after being swapped out and swapped back in when 
using the FIFO prioritizer.

Some of the changes there can actually change the thresholds when we perform 
swapping. So I’m curious if you’re seeing a lot of swapping of FlowFiles 
to/from disk when running in 1.11.4 that you didn’t have in 1.9.2. Are you 
seeing logs about swapping occurring? And of note, when I talk about swapping, 
I’m talking about NiFi-level FlowFile swapping, not OS-level swapping.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-7011


On May 22, 2020, at 10:35 AM, Robert R. Bruno 
mailto:rbru...@gmail.com>> wrote:

Sorry one other thing I thought of that may help.  I noticed on 1.11.4 when I 
would stop the updaterecord processor it would take a long period of time for 
the processor to stop (threads were hanging), but when I went back to 1.9.2 the 
processor would stop in a very timely manner.  Not sure if that helps, but just 
another data point.

On Fri, May 22, 2020 at 9:22 AM Robert R. Bruno 
mailto:rbru...@gmail.com>> wrote:
I had more updates on this.

Yesterday I again attempted to upgrade one of our 1.9.2 clusters that is now 
using mergecontent vs mergerecord.  The flow had been running on 1.9.2 for 
about a week with no issue.  I did the upgrade to 1.11.4, and saw about 3 of 10 
nodes not being able to keep up.  The load on these 3 nodes became very high.  
For perspective, a load of 80 is about as high as we like to see these boxes, 
and some were getting as high as 120.  I saw one bottleneck forming at an 
updaterecord.  I tried giving that processor a few more threads to see if it 
would help work off the backlog.  No matter what I tried (lowering threads, 
changing mergecontent sizes, etc) the load wouldn't go down on those 3 boxes 
and they had either a slowly growing backlog or would maintain the backlog 
they had.

I then decided to downgrade the nifi back to 1.9.2 without rebooting the boxes. 
 I kept all flow files and content as they were.  Upon downgrading no loads 
were above 50 and this was only on the boxes that had the backlog that formed 
when we did the upgrade.  The backlog on the 3 boxes worked off with no issue 
at all, and without me having to make changes to the flow.  Once backlogs were 
worked off then our loads all sat around 20.

This is a similar behavior from what we saw before, but just in another part of 
the flow.  Has anyone else seen anything like this on 1.11.4?  Unfortunately 
for now we can't upgrade due to this problem.  Any thoughts from anyone would 
be greatly appreciated.

Thanks,
Robert

On Fri, May 8, 2020 at 4:47 PM Robert R. Bruno 
mailto:rbru...@gmail.com>> wrote:
Sorry for the delayed answer, but was doing some testing this week and found a 
few more things out.

First to answer some of your questions.

I would say with no actual raw numbers, it was worse than a 10% degradation.  I 
say this since the flow was badly backing up, and a 10% decrease in performance 
should not have caused this since normally we can work off a backlog of data 
with no issues.  I looked at my mergerecord settings, and I am largely using 
size as the limiting factor.  I have a max size of 4MB and a max bin age of 1 
minute followed by a second mergerecord with a max size of 32MB and a max bin 
age of 5 minutes.

I changed our flow a bit on a test system that was running 1.11.4, and 
discovered the following:

I changed mergerecords to mergecontents.  I used pretty much all of the same 
settings in the mergecontent but had the mergecontent deal with the avro 
natively.  In this flow, it currently seems like I don't need to chain multiple 
mergecontents together like I did with mergerecords.

I then fed the merged avro from the mergecontent to a convertrecord to convert 
the data to parquet.  The convertrecord was tremendously slower than the 
mergecontent and became a bottleneck.  I then switched the convertrecord to the 
convertavrotoparquet processor.  Convertavrotoparquet can easily handle the 
output speed of the mergecontent and then some.

My hope is to make these changes to our actual flow soon, and then upgrade to 
1.11.4 again.  I'll let you know how that goes.

Thanks,
Robert

On Mon, Apr 27, 2020 at 9:26 AM Mark Payne 
mailto:marka...@hotmail.com>> wrote:
Robert,

What kind of performance degradation were you seeing here? I put together some 
simple flows to see if I could reproduce using 1.9.2 and current master.
My flow consisted of GenerateFlowFile (generating 2 CSV rows per FlowFile) -> 
ConvertRecord (to Avro) -> MergeRecord (read Avro, write Avro) -> 
UpdateAttribute to try to mimic what you’ve got, given the details that I have.

I did 

Re: MergeRecord performance

2020-05-08 Thread Robert R. Bruno
Sorry for the delayed answer, but was doing some testing this week and
found a few more things out.

First to answer some of your questions.

I would say with no actual raw numbers, it was worse than a 10%
degradation.  I say this since the flow was badly backing up, and a 10%
decrease in performance should not have caused this since normally we can
work off a backlog of data with no issues.  I looked at my mergerecord
settings, and I am largely using size as the limiting factor.  I have a max
size of 4MB and a max bin age of 1 minute followed by a second mergerecord
with a max size of 32MB and a max bin age of 5 minutes.
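
As a quick back-of-the-envelope sketch, assuming the bins actually fill to their
maximum size rather than expiring on bin age, those settings imply roughly:

    first_max = 4 * 1024 * 1024      # first mergerecord: 4MB max bin size
    second_max = 32 * 1024 * 1024    # second mergerecord: 32MB max bin size
    print(second_max // first_max)   # -> about 8 first-stage outputs per final bin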

I changed our flow a bit on a test system that was running 1.11.4, and
discovered the following:

I changed mergerecords to mergecontents.  I used pretty much all of the
same settings in the mergecontent but had the mergecontent deal with the
avro natively.  In this flow, it currently seems like I don't need to chain
multiple mergecontents together like I did with mergerecords.

I then fed the merged avro from the mergecontent to a convertrecord to
convert the data to parquet.  The convertrecord was tremendously slower
than the mergecontent and became a bottleneck.  I then switched the
convertrecord to the convertavrotoparquet processor.  Convertavrotoparquet
can easily handle the output speed of the mergecontent and then some.

My hope is to make these changes to our actual flow soon, and then upgrade
to 1.11.4 again.  I'll let you know how that goes.

Thanks,
Robert

On Mon, Apr 27, 2020 at 9:26 AM Mark Payne  wrote:

> Robert,
>
> What kind of performance degradation were you seeing here? I put together
> some simple flows to see if I could reproduce using 1.9.2 and current
> master.
> My flow consisted of GenerateFlowFile (generating 2 CSV rows per FlowFile)
> -> ConvertRecord (to Avro) -> MergeRecord (read Avro, write Avro) ->
> UpdateAttribute to try to mimic what you’ve got, given the details that I
> have.
>
> I did see a performance degradation on the order of about 10%. So on my
> laptop I went from processing 2.49 MM FlowFiles in 1.9.2 in 5 mins to 2.25
> MM on the master branch. Interestingly, I saw no real change when I enabled
> Snappy compression.
>
> For a point of reference, I also tried removing MergeRecord and just
> Generate -> Convert -> UpdateAttribute. I saw the same roughly 10%
> performance degradation.
>
> I’m curious if you’re seeing more than that. If so, I think a template
> would be helpful to understand what’s different.
>
> Thanks
> -Mark
>
>
> On Apr 24, 2020, at 4:50 PM, Robert R. Bruno  wrote:
>
> Joe,
>
> In that part of the flow, we are using avro readers and writers.  We are
> using snappy compression (which could be part of the problem).  Since we
> are using avro at that point the embedded schema is being used by the
> reader and the writer is using the schema name property along with an
> internal schema registry in nifi.
>
> I can see what could potentially be shared.
>
> Thanks
>
> On Fri, Apr 24, 2020 at 4:41 PM Joe Witt  wrote:
>
>> Robert,
>>
>> Can you please detail the record readers and writers involved and how
>> schemas are accessed?  There can be very important performance related
>> changes in the parsers/serializers of the given formats.  And we've added a
>> lot to make schema caching really capable but you have to opt into it.  It
>> is of course possible MergeRecord itself is the culprit for performance
>> reduction, but let's get a fuller picture here.
>>
>> Are you able to share a template and sample data which we can use to
>> replicate?
>>
>> Thanks
>>
>> On Fri, Apr 24, 2020 at 4:38 PM Robert R. Bruno 
>> wrote:
>>
>>> I wanted to see if anyone else has experienced performance issues with
>>> the newest version of nifi and MergeRecord?  We have been running on nifi
>>> 1.9.2 for awhile now, and recently upgraded to nifi 1.11.4.  Once upgraded,
>>> our identical flows were no longer able to keep up with our data mainly at
>>> MergeRecord processors.
>>>
>>> We ended up downgrading back to nifi 1.9.2.  Once we downgraded, all was
>>> keeping up again.  There were no errors to speak of when we were running
>>> the flow with 1.11.4.  We did see higher load on the OS, but this may have
>>> been caused by the fact there was such a tremendous backlog built up in the
>>> flow.
>>>
>>> Another side note, we saw one UpdateRecord processor producing errors
>>> when I tested the flow with nifi 1.11.4 with a small test flow.  I was able
>>> to fix this issue by changing some parameters in my RecordWriter.  So
>>> perhaps some underlying ways records are being handled since 1.9.2 caused
>>> the performance issue we saw?
>>>
>>> Any insight anyone has would be greatly appreciated, as we very much
>>> would like to upgrade to nifi 1.11.4.  One thought was switching the
>>> MergeRecord processors to MergeContent since I've been told MergeContent
>>> seems to perform better, but not sure if this is actually true.  We are
>>> using the pattern of chaining a few MergeRecord processors together to help
>>> with performance.

Re: MergeRecord performance

2020-04-27 Thread Mark Payne
Robert,

What kind of performance degradation were you seeing here? I put together some 
simple flows to see if I could reproduce using 1.9.2 and current master.
My flow consisted of GenerateFlowFile (generating 2 CSV rows per FlowFile) -> 
ConvertRecord (to Avro) -> MergeRecord (read Avro, write Avro) -> 
UpdateAttribute to try to mimic what you’ve got, given the details that I have.

I did see a performance degradation on the order of about 10%. So on my laptop 
I went from processing 2.49 MM FlowFiles in 1.9.2 in 5 mins to 2.25 MM on the 
master branch. Interestingly, I saw no real change when I enabled Snappy 
compression.

For a point of reference, I also tried removing MergeRecord and just Generate 
-> Convert -> UpdateAttribute. I saw the same roughly 10% performance 
degradation.

I’m curious if you’re seeing more than that. If so, I think a template would be 
helpful to understand what’s different.

Thanks
-Mark


On Apr 24, 2020, at 4:50 PM, Robert R. Bruno 
mailto:rbru...@gmail.com>> wrote:

Joe,

In that part of the flow, we are using avro readers and writers.  We are using 
snappy compression (which could be part of the problem).  Since we are using 
avro at that point the embedded schema is being used by the reader and the 
writer is using the schema name property along with an internal schema registry 
in nifi.

I can see what could potentially be shared.

Thanks

On Fri, Apr 24, 2020 at 4:41 PM Joe Witt 
mailto:joe.w...@gmail.com>> wrote:
Robert,

Can you please detail the record readers and writers involved and how schemas 
are accessed?  There can be very important performance related changes in the 
parsers/serializers of the given formats.  And we've added a lot to make schema 
caching really capable but you have to opt into it.  It is of course possible 
MergeRecord itself is the culprit for performance reduction, but let's get a 
fuller picture here.

Are you able to share a template and sample data which we can use to replicate?

Thanks

On Fri, Apr 24, 2020 at 4:38 PM Robert R. Bruno 
mailto:rbru...@gmail.com>> wrote:
I wanted to see if anyone else has experienced performance issues with the 
newest version of nifi and MergeRecord?  We have been running on nifi 1.9.2 for 
awhile now, and recently upgraded to nifi 1.11.4.  Once upgraded, our identical 
flows were no longer able to keep up with our data mainly at MergeRecord 
processors.

We ended up downgrading back to nifi 1.9.2.  Once we downgraded, all was 
keeping up again.  There were no errors to speak of when we were running the 
flow with 1.11.4.  We did see higher load on the OS, but this may have been 
caused by the fact there was such a tremendous backlog built up in the flow.

Another side note, we saw one UpdateRecord processor producing errors when I 
tested the flow with nifi 1.11.4 with a small test flow.  I was able to fix 
this issue by changing some parameters in my RecordWriter.  So perhaps some 
underlying ways records are being handled since 1.9.2 caused the performance 
issue we saw?

Any insight anyone has would be greatly appreciated, as we very much would like 
to upgrade to nifi 1.11.4.  One thought was switching the MergeRecord 
processors to MergeContent since I've been told MergeContent seems to perform 
better, but not sure if this is actually true.  We are using the pattern of 
chaining a few MergeRecord processors together to help with performance.

Thanks in advance!



Re: MergeRecord performance

2020-04-24 Thread Robert R. Bruno
Joe,

In that part of the flow, we are using avro readers and writers.  We are
using snappy compression (which could be part of the problem).  Since we
are using avro at that point the embedded schema is being used by the
reader and the writer is using the schema name property along with an
internal schema registry in nifi.

I can see what could potentially be shared.

Thanks

On Fri, Apr 24, 2020 at 4:41 PM Joe Witt  wrote:

> Robert,
>
> Can you please detail the record readers and writers involved and how
> schemas are accessed?  There can be very important performance related
> changes in the parsers/serializers of the given formats.  And we've added a
> lot to make schema caching really capable but you have to opt into it.  It
> is of course possible MergeRecord itself is the culprit for performance
> reduction, but let's get a fuller picture here.
>
> Are you able to share a template and sample data which we can use to
> replicate?
>
> Thanks
>
> On Fri, Apr 24, 2020 at 4:38 PM Robert R. Bruno  wrote:
>
>> I wanted to see if anyone else has experienced performance issues with
>> the newest version of nifi and MergeRecord?  We have been running on nifi
>> 1.9.2 for awhile now, and recently upgraded to nifi 1.11.4.  Once upgraded,
>> our identical flows were no longer able to keep up with our data mainly at
>> MergeRecord processors.
>>
>> We ended up downgrading back to nifi 1.9.2.  Once we downgraded, all was
>> keeping up again.  There were no errors to speak of when we were running
>> the flow with 1.11.4.  We did see higher load on the OS, but this may have
>> been caused by the fact there was such a tremendous backlog built up in the
>> flow.
>>
>> Another side note, we saw one UpdateRecord processor producing errors
>> when I tested the flow with nifi 1.11.4 with a small test flow.  I was able
>> to fix this issue by changing some parameters in my RecordWriter.  So
>> perhaps some underlying ways records are being handled since 1.9.2 caused
>> the performance issue we saw?
>>
>> Any insight anyone has would be greatly appreciated, as we very much
>> would like to upgrade to nifi 1.11.4.  One thought was switching the
>> MergeRecord processors to MergeContent since I've been told MergeContent
>> seems to perform better, but not sure if this is actually true.  We are
>> using the pattern of chaining a few MergeRecord processors together to help
>> with performance.
>>
>> Thanks in advance!
>>
>


Re: MergeRecord performance

2020-04-24 Thread Joe Witt
Robert,

Can you please detail the record readers and writers involved and how
schemas are accessed?  There can be very important performance related
changes in the parsers/serializers of the given formats.  And we've added a
lot to make schema caching really capable but you have to opt into it.  It
is of course possible MergeRecord itself is the culprit for performance
reduction, but let's get a fuller picture here.

Are you able to share a template and sample data which we can use to
replicate?

Thanks

On Fri, Apr 24, 2020 at 4:38 PM Robert R. Bruno  wrote:

> I wanted to see if anyone else has experienced performance issues with the
> newest version of nifi and MergeRecord?  We have been running on nifi 1.9.2
> for awhile now, and recently upgraded to nifi 1.11.4.  Once upgraded, our
> identical flows were no longer able to keep up with our data mainly at
> MergeRecord processors.
>
> We ended up downgrading back to nifi 1.9.2.  Once we downgraded, all was
> keeping up again.  There were no errors to speak of when we were running
> the flow with 1.11.4.  We did see higher load on the OS, but this may have
> been caused by the fact there was such a tremendous backlog built up in the
> flow.
>
> Another side note, we saw one UpdateRecord processor producing errors when
> I tested the flow with nifi 1.11.4 with a small test flow.  I was able to
> fix this issue by changing some parameters in my RecordWriter.  So perhaps
> some underlying ways records are being handled since 1.9.2 caused the
> performance issue we saw?
>
> Any insight anyone has would be greatly appreciated, as we very much would
> like to upgrade to nifi 1.11.4.  One thought was switching the MergeRecord
> processors to MergeContent since I've been told MergeContent seems to
> perform better, but not sure if this is actually true.  We are using the
> pattern of chaining a few MergeRecord processors together to help with
> performance.
>
> Thanks in advance!
>


Re: Re: MergeRecord can not guarantee the ordering of the input sequence?

2019-10-20 Thread wangl...@geekplus.com.cn
Hi Koji, 

My test is as follows.
ProcessorA is scheduled only on the primary node and with only one concurrent task. 
The result of ProcessorA is load balanced to ProcessorB.  The strategy is by 
attribute.  All the output FlowFiles of ProcessorA have the same attribute used 
for balancing, so all FlowFiles will be balanced to the same node. 
The order in which ProcessorB receives them will probably not be the same as the 
order in which ProcessorA emitted them. And the order is nondeterministic. 

Thanks,
Lei



wangl...@geekplus.com.cn
 
From: Koji Kawamura
Date: 2019-10-20 18:02
To: users
Subject: Re: Re: MergeRecord can not guarantee the ordering of the input 
sequence?
Hi Lei,
 
Does 'balance strategy' mean load balance strategy? Which strategy
are you using? I thought Prioritizers are applied on the destination
node after load balancing has transferred FlowFiles. Are those A, B
and C flow files generated on different nodes and sent to a single
node to merge them?
 
Thanks,
Koji
 
On Fri, Oct 18, 2019 at 7:12 PM wangl...@geekplus.com.cn
 wrote:
>
>
> Seems it is because of the balance strategy that is used.
> The balance will not guarantee the order.
>
> Thanks,
> Lei
>
> 
> wangl...@geekplus.com.cn
>
>
> From: wangl...@geekplus.com.cn
> Date: 2019-10-16 10:21
> To: dev; users
> CC: dev
> Subject: Re: Re: MergeRecord can not guarantee the ordering of the input 
> sequence?
> Hi Koji,
> Actually I have set all connections to FIFO and concurrent tasks to 1 for 
> all processors.
> Before and after the MergeRecord, I add a LogAttribute to debug.
>
> Before MergeRecord, the order in the logfile is A, B, C in three flowfiles.
> After  MergeRecord, the order becomes {A,C,B} in one flowfile
> This is nondeterministic.
>
> I think I should look up the MergeRecord code and do further debug.
>
> Thanks,
> Lei
>
>
>
>
> wangl...@geekplus.com.cn
> From: Koji Kawamura
> Date: 2019-10-16 09:46
> To: users
> CC: dev
> Subject: Re: MergeRecord can not guarantee the ordering of the input sequence?
> Hi Lei,
> How about setting FIFO prioritizer at all the preceding connections
> before the MergeRecord?
> Without setting any prioritizer, FlowFile ordering is nondeterministic.
> Thanks,
> Koji
> On Tue, Oct 15, 2019 at 8:56 PM wangl...@geekplus.com.cn
>  wrote:
> >
> >
> > If  FlowFile A, B, C enter the MergeRecord sequentially, the output should 
> > be one FlowFile {A, B, C}
> > However, when testing with  large data volume, sometimes the output order 
> > will not be the same as the order they entered in. And this result is nondeterministic
> >
> > This really confuses me a lot.
> > Anybody has any insight on this?
> >
> > Thanks,
> > Lei
> >
> > 
> > wangl...@geekplus.com.cn


Re: Re: MergeRecord can not guarantee the ordering of the input sequence?

2019-10-20 Thread Koji Kawamura
Hi Lei,

Does 'balance strategy' mean load balance strategy? Which strategy
are you using? I thought Prioritizers are applied on the destination
node after load balancing has transferred FlowFiles. Are those A, B
and C flow files generated on different nodes and sent to a single
node to merge them?

Thanks,
Koji

On Fri, Oct 18, 2019 at 7:12 PM wangl...@geekplus.com.cn
 wrote:
>
>
> Seems it is because of the balance strategy that is used.
> The balance will not guarantee the order.
>
> Thanks,
> Lei
>
> 
> wangl...@geekplus.com.cn
>
>
> From: wangl...@geekplus.com.cn
> Date: 2019-10-16 10:21
> To: dev; users
> CC: dev
> Subject: Re: Re: MergeRecord can not guarantee the ordering of the input 
> sequence?
> Hi Koji,
> Actually I have set all connections to FIFO and concurrent tasks to 1 for 
> all processors.
> Before and after the MergeRecord, I add a LogAttribute to debug.
>
> Before MergeRecord, the order in the logfile is A, B, C in three flowfiles.
> After  MergeRecord, the order becomes {A,C,B} in one flowfile
> This is nondeterministic.
>
> I think I should look up the MergeRecord code and do further debug.
>
> Thanks,
> Lei
>
>
>
>
> wangl...@geekplus.com.cn
> From: Koji Kawamura
> Date: 2019-10-16 09:46
> To: users
> CC: dev
> Subject: Re: MergeRecord can not guarantee the ordering of the input sequence?
> Hi Lei,
> How about setting FIFO prioritizer at all the preceding connections
> before the MergeRecord?
> Without setting any prioritizer, FlowFile ordering is nondeterministic.
> Thanks,
> Koji
> On Tue, Oct 15, 2019 at 8:56 PM wangl...@geekplus.com.cn
>  wrote:
> >
> >
> > If  FlowFile A, B, C enter the MergeRecord sequentially, the output should 
> > be one FlowFile {A, B, C}
> > However, when testing with  large data volume, sometimes the output order 
> > will not be the same as the order they entered in. And this result is nondeterministic
> >
> > This really confuses me a lot.
> > Anybody has any insight on this?
> >
> > Thanks,
> > Lei
> >
> > 
> > wangl...@geekplus.com.cn


Re: Re: MergeRecord can not guarantee the ordering of the input sequence?

2019-10-18 Thread wangl...@geekplus.com.cn

Seems it is because of the balance strategy that is used. 
The balance will not guarantee the order.

Thanks,
Lei



wangl...@geekplus.com.cn
 
From: wangl...@geekplus.com.cn
Date: 2019-10-16 10:21
To: dev; users
CC: dev
Subject: Re: Re: MergeRecord can not guarantee the ordering of the input 
sequence?
Hi Koji, 
Actually I have set all connections to FIFO and concurrent tasks to 1 for all 
processors.
Before and after the MergeRecord, I add a LogAttribute to debug.
 
Before MergeRecord, the order in the logfile is A, B, C in three flowfiles. 
After  MergeRecord, the order becomes {A,C,B} in one flowfile
This is nondeterministic.
 
I think I should look up the MergeRecord code and do further debug.
 
Thanks, 
Lei
 
 
 
 
wangl...@geekplus.com.cn
From: Koji Kawamura
Date: 2019-10-16 09:46
To: users
CC: dev
Subject: Re: MergeRecord can not guarantee the ordering of the input sequence?
Hi Lei,
How about setting FIFO prioritizer at all the preceding connections
before the MergeRecord?
Without setting any prioritizer, FlowFile ordering is nondeterministic.
Thanks,
Koji
On Tue, Oct 15, 2019 at 8:56 PM wangl...@geekplus.com.cn
 wrote:
>
>
> If  FlowFile A, B, C enter the MergeRecord sequentially, the output should be 
> one FlowFile {A, B, C}
> However, when testing with  large data volume, sometimes the output order 
> will not be the same as the order they entered in. And this result is nondeterministic
>
> This really confuses me a lot.
> Anybody has any insight on this?
>
> Thanks,
> Lei
>
> 
> wangl...@geekplus.com.cn


Re: Re: MergeRecord can not guarantee the ordering of the input sequence?

2019-10-15 Thread wangl...@geekplus.com.cn
Hi Koji, 
Actually I have set all connections to FIFO and concurrent tasks to 1 for all 
processors.
Before and after the MergeRecord, I add a LogAttribute to debug.

Before MergeRecord, the order in the logfile is A, B, C in three flowfiles. 
After  MergeRecord, the order becomes {A,C,B} in one flowfile
This is nondeterministic.

I think I should look up the MergeRecord code and do further debug.

Thanks, 
Lei




wangl...@geekplus.com.cn
 
From: Koji Kawamura
Date: 2019-10-16 09:46
To: users
CC: dev
Subject: Re: MergeRecord can not guarantee the ordering of the input sequence?
Hi Lei,
 
How about setting FIFO prioritizer at all the preceding connections
before the MergeRecord?
Without setting any prioritizer, FlowFile ordering is nondeterministic.
 
Thanks,
Koji
 
On Tue, Oct 15, 2019 at 8:56 PM wangl...@geekplus.com.cn
 wrote:
>
>
> If  FlowFile A, B, C enter the MergeRecord sequentially, the output should be 
> one FlowFile {A, B, C}
> However, when testing with  large data volume, sometimes the output order 
> will not be the same as the order they entered in. And this result is nondeterministic
>
> This really confuses me a lot.
> Anybody has any insight on this?
>
> Thanks,
> Lei
>
> 
> wangl...@geekplus.com.cn


Re: MergeRecord can not guarantee the ordering of the input sequence?

2019-10-15 Thread Koji Kawamura
Hi Lei,

How about setting FIFO prioritizer at all the preceding connections
before the MergeRecord?
Without setting any prioritizer, FlowFile ordering is nondeterministic.

Thanks,
Koji

On Tue, Oct 15, 2019 at 8:56 PM wangl...@geekplus.com.cn
 wrote:
>
>
> If  FlowFile A, B, C enter the MergeRecord sequentially, the output should be 
> one FlowFile {A, B, C}
> However, when testing with  large data volume, sometimes the output order 
> will be not the same as they enter. And this result is nondeterministic
>
> This really confuses me a lot.
> Anybody has any insight on this?
>
> Thanks,
> Lei
>
> 
> wangl...@geekplus.com.cn


Re: MergeRecord, queue & backpressure

2018-04-13 Thread Juan Sequeiros
Good afternoon,

Another thing to help you out maybe ...

You can also tweak the nifi.properties setting:

nifi.queue.swap.threshold=2
This setting controls the max flowfile count on a connection; if it is
exceeded, those flowfiles will be flushed (swapped) to disk.

I am not sure, however, whether there is a distinction between a "record flowfile",
since they live in memory, and the traditional way of thinking of flowfiles.

On Fri, Apr 13, 2018 at 10:49 AM Mark Payne  wrote:

> Aurélien,
>
> In that case you're looking to merge about 500,000 FlowFiles into a single
> FlowFile, so you'll
> definitely want to use a cascading approach. I'd shoot for about 1 MB for
> the first MergeRecord
> and then merge 128 of those together for the second MergeRecord.
>
> The provenance backpressure is occurring because of the large number of
> provenance events being
> generated. One event will be generated, more or less, for each time that a
> Processor touches a FlowFile.
> So if you are merging the FlowFiles together as early as possible, you'll
> reduce the load that you're putting
> on the Provenance Repository.
>
> Also, depending on how you're getting the data into your flow, if you're
> able, it is best to receive a larger "micro-batch"
> of records per flowfile to begin with and not split them up. This would
> greatly alleviate the pressure on the Provenance
> Repository and avoid needing multiple MergeRecord processors as well.
>
> Also, of note, there is a newer version of the Provenance Repository that
> you can switch to, by changing the
> "nifi.provenance.repository.implementation" property in nifi.properties
> from "org.apache.nifi.provenance.PersistentProvenanceRepository"
> to "org.apache.nifi.provenance.WriteAheadProvenanceRepository". The
> Write-Ahead version is quite a bit faster
> and behaves differently than the Persistent Provenance Repo, so you won't
> see those warnings about provenance
> backpressure.
>
> I hope this helps!
> -Mark
>
>
>
> > On Apr 13, 2018, at 10:30 AM, DEHAY Aurelien <
> aurelien.de...@faurecia.com> wrote:
> >
> > Hello.
> >
> > It's me again regarding my mergerecord question.
> >
> > I still don't manage to get what I want. I think I may have understood how
> > bin-based processors work; this is for clarification and a question regarding
> > performance.
> >
> > I want to merge a huge number of 300-octet flowfiles into a 128 MB parquet
> > file.
> >
> > My understanding is that, for mergerecord to be able to create a bin with
> > 128MB of data, that data must already be in the queue. We can't feed the bin
> > "one flow at a time", so working with small flowfiles, I have to set the
> > backpressure parameter to something really high, or completely remove the
> > flowfile-count backpressure limit.
> >
> > I understood by reading
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeRecord/additionalDetails.html
> > that it's not the "good" way to do it, and that I should cascade multiple
> > merges to "slowly" make the flowfiles bigger?
> >
> > I've made some tests with a single level but I hit the "provenance
> > recording rate". Will multiple levels help?
> >
> > Thanks for any help.
> >
> > Aurélien.
> >
> >
>
>


Re: MergeRecord, queue & backpressure

2018-04-13 Thread Mark Payne
Aurélien,

In that case you're looking to merge about 500,000 FlowFiles into a single 
FlowFile, so you'll
definitely want to use a cascading approach. I'd shoot for about 1 MB for the 
first MergeRecord
and then merge 128 of those together for the second MergeRecord.
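
As a rough sanity check on those numbers (illustrative only, using the ~300-byte 
records and 128 MB target file mentioned earlier in this thread):

    record_size = 300                    # bytes per incoming FlowFile (approx.)
    target_size = 128 * 1024 * 1024      # desired final Parquet file size
    first_stage = 1 * 1024 * 1024        # first MergeRecord bin size (~1 MB)

    print(target_size // record_size)    # ~447,000 records end up in one final file
    print(first_stage // record_size)    # ~3,495 records per first-stage bin
    print(target_size // first_stage)    # 128 first-stage FlowFiles per second bin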

The provenance backpressure is occurring because of the large number of 
provenance events being
generated. One event will be generated, more or less, for each time that a 
Processor touches a FlowFile.
So if you are merging the FlowFiles together as early as possible, you'll 
reduce the load that you're putting
on the Provenance Repository.

Also, depending on how you're getting the data into your flow, if you're able, 
it is best to receive a larger "micro-batch"
of records per flowfile to begin with and not split them up. This would greatly 
alleviate the pressure on the Provenance
Repository and avoid needing multiple MergeRecord processors as well.

Also, of note, there is a newer version of the Provenance Repository that you 
can switch to, by changing the
"nifi.provenance.repository.implementation" property in nifi.properties from 
"org.apache.nifi.provenance.PersistentProvenanceRepository"
to "org.apache.nifi.provenance.WriteAheadProvenanceRepository". The Write-Ahead 
version is quite a bit faster
and behaves differently than the Persistent Provenance Repo, so you won't see 
those warnings about provenance
backpressure.

I hope this helps!
-Mark



> On Apr 13, 2018, at 10:30 AM, DEHAY Aurelien  
> wrote:
> 
> Hello.
> 
> It's me again regarding my mergerecord question.
> 
> I still don't manage to get what I want. I think I may have understood how 
> bin-based processors work; this is for clarification and a question regarding 
> performance.
> 
> I want to merge a huge number of 300-octet flowfiles into a 128 MB parquet file. 
> 
> My understanding is that, for mergerecord to be able to create a bin with 128MB of 
> data, that data must already be in the queue. We can't feed the bin "one flow at a 
> time", so working with small flowfiles, I have to set the backpressure 
> parameter to something really high, or completely remove the flowfile-count 
> backpressure limit.
> 
> I understood by reading 
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeRecord/additionalDetails.html
> that it's not the "good" way to do it, and that I should cascade multiple merges to 
> "slowly" make the flowfiles bigger?
> 
> I've made some tests with a single level but I hit the "provenance recording 
> rate". Will multiple levels help?
> 
> Thanks for any help.
> 
> Aurélien.
> 
> 



RE: MergeRecord

2018-04-13 Thread DEHAY Aurelien
Hello.

We looked first at InferAvroSchema to see if there was an option for 
that.

Anyway, thank you very much, it works fine with the update attribute. 


Aurélien DEHAY
Big Data Architect
+33 616 815 441
aurelien.de...@faurecia.com 

2 rue Hennape - 92735 Nanterre Cedex – France



-Original Message-
From: Koji Kawamura [mailto:ijokaruma...@gmail.com] 
Sent: Friday, 13 April 2018 09:20
To: users@nifi.apache.org
Subject: Re: MergeRecord

Hi,

Just FYI,
If I replace the schema doc comment using UpdateAttribute, I am able to merge 
records.
${inferred.avro.schema:replaceAll('"Type inferred from [^"]+"', '""')}

I looked at InferAvroSchema and underlying Kite source code, but there's no 
option to suppress the doc comment when inferring schema unfortunately.

Thanks,
Koji

On Fri, Apr 13, 2018 at 4:11 PM, Koji Kawamura <ijokaruma...@gmail.com> wrote:
> Hi,
>
> I've tested InferAvroSchema and MergeRecord scenario.
> As you described, records are not merged as expected.
>
> The reason in my case is, InferAvroSchema generates schema text like this:
> inferred.avro.schema
> { "type" : "record", "name" : "example", "doc" : "Schema generated by 
> Kite", "fields" : [ { "name" : "Key", "type" : "long", "doc" : "Type 
> inferred from '4'" }, { "name" : "Value", "type" : "string", "doc" :
> "Type inferred from 'four'" } ] }
>
> And, MergeRecord uses that schema text as groupId even if 
> 'Correlation Attribute' is specified.
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-stand
> ard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/proc
> essors/standard/MergeRecord.java#L348
>
> So, even if schema is the same, if actual values vary, merging group 
> id will be different.
> If you can use SchemaRegistry, it should work as expected.
>
> Thanks,
> Koji
>
> On Fri, Apr 13, 2018 at 2:45 PM, DEHAY Aurelien 
> <aurelien.de...@faurecia.com> wrote:
>>
>> Hello.
>>
>> Thanks for the answer.
>>
>> The 20k is just the last test; I’ve tested with 100 and 1,000, with an input 
>> queue of 10k, and it doesn’t change anything.
>>
>> I will try to simplify the test case and to not use the inferred schema.
>>
>> Regards
>>
>>> On 13 Apr 2018, at 04:50, Koji Kawamura <ijokaruma...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I checked your template. Haven't run the flow since I don't have 
>>> sample input XML files.
>>> However, when I looked at the MergeRecord processor configuration, I found 
>>> that:
>>> Minimum Number of Records = 2
>>> Max Bin Age = 10 sec
>>>
>>> From briefly looking at the MergeRecord source code, it expires a bin that 
>>> is not complete after Max Bin Age.
>>> Do you have 20,000 records to merge always within 10 sec window?
>>> If not, I recommend to lower the minimum number of records.
>>>
>>> I haven't checked actual MergeRecord behavior so I may be wrong, but 
>>> it is worth changing the configuration.
>>>
>>> Hope this helps,
>>> Koji
>>>
>>>
>>> On Fri, Apr 13, 2018 at 12:26 AM, DEHAY Aurelien 
>>> <aurelien.de...@faurecia.com> wrote:
>>>> Hello.
>>>>
 Please see the template attached. The problem we have is that, whatever 
 configuration we set in the mergerecord, we can't get it to actually 
 merge records.

 All the records are in the same format; we put an inferschema in so as not to 
 have to write it down ourselves. The only difference between schemas is then 
 that the doc="" fields are different. Is it possible for that to prevent the 
 merging?
>>>>
>>>> Thanks for any pointer or info.
>>>>
>>>>
>>>> Aurélien DEHAY
>>>>
>>>>
>>>>

Re: MergeRecord

2018-04-13 Thread Koji Kawamura
Hi,

Just FYI,
If I replace the schema doc comment using UpdateAttribute, I am able to
merge records.
${inferred.avro.schema:replaceAll('"Type inferred from [^"]+"', '""')}

I looked at InferAvroSchema and underlying Kite source code, but
there's no option to suppress the doc comment when inferring schema
unfortunately.
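
For illustration, here is the same substitution sketched outside NiFi (plain
Python rather than NiFi Expression Language, with made-up schema fragments): once
the inferred doc comments are blanked out, schemas inferred from different records
become identical, so MergeRecord can group them together.

    import re

    # Hypothetical fragments of two inferred schemas that differ only in "doc".
    schema_a = '{ "name" : "Key", "type" : "long", "doc" : "Type inferred from \'4\'" }'
    schema_b = '{ "name" : "Key", "type" : "long", "doc" : "Type inferred from \'7\'" }'

    # Same pattern as the UpdateAttribute expression above.
    pattern = r'"Type inferred from [^"]+"'
    print(re.sub(pattern, '""', schema_a) == re.sub(pattern, '""', schema_b))  # True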

Thanks,
Koji

On Fri, Apr 13, 2018 at 4:11 PM, Koji Kawamura  wrote:
> Hi,
>
> I've tested InferAvroSchema and MergeRecord scenario.
> As you described, records are not merged as expected.
>
> The reason in my case is, InferAvroSchema generates schema text like this:
> inferred.avro.schema
> { "type" : "record", "name" : "example", "doc" : "Schema generated by
> Kite", "fields" : [ { "name" : "Key", "type" : "long", "doc" : "Type
> inferred from '4'" }, { "name" : "Value", "type" : "string", "doc" :
> "Type inferred from 'four'" } ] }
>
> And, MergeRecord uses that schema text as groupId even if
> 'Correlation Attribute' is specified.
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/MergeRecord.java#L348
>
> So, even if schema is the same, if actual values vary, merging group
> id will be different.
> If you can use SchemaRegistry, it should work as expected.
>
> Thanks,
> Koji
>
> On Fri, Apr 13, 2018 at 2:45 PM, DEHAY Aurelien
>  wrote:
>>
>> Hello.
>>
>> Thanks for the answer.
>>
>> The 20k is just the last test; I’ve tested with 100 and 1,000, with an input 
>> queue of 10k, and it doesn’t change anything.
>>
>> I will try to simplify the test case and to not use the inferred schema.
>>
>> Regards
>>
>>> On 13 Apr 2018, at 04:50, Koji Kawamura  wrote:
>>>
>>> Hello,
>>>
>>> I checked your template. Haven't run the flow since I don't have
>>> sample input XML files.
>>> However, when I looked at the MergeRecord processor configuration, I found 
>>> that:
>>> Minimum Number of Records = 2
>>> Max Bin Age = 10 sec
>>>
>>> From briefly looking at the MergeRecord source code, it expires a bin that is
>>> not complete after Max Bin Age.
>>> Do you have 20,000 records to merge always within 10 sec window?
>>> If not, I recommend to lower the minimum number of records.
>>>
>>> I haven't checked actual MergeRecord behavior so I may be wrong, but
>>> it is worth changing the configuration.
>>>
>>> Hope this helps,
>>> Koji
>>>
>>>
>>> On Fri, Apr 13, 2018 at 12:26 AM, DEHAY Aurelien
>>>  wrote:
 Hello.

 Please see the template attached. The problem we have is that, whatever 
 configuration we set in the mergerecord, we can't get it to actually 
 merge records.

 All the records are in the same format; we put an inferschema in so as not to 
 have to write it down ourselves. The only difference between schemas is then 
 that the doc="" fields are different. Is it possible for that to prevent the 
 merging?

 Thanks for any pointer or info.


 Aurélien DEHAY



>>


Re: MergeRecord

2018-04-13 Thread Koji Kawamura
Hi,

I've tested InferAvroSchema and MergeRecord scenario.
As you described, records are not merged as expected.

The reason in my case is, InferAvroSchema generates schema text like this:
inferred.avro.schema
{ "type" : "record", "name" : "example", "doc" : "Schema generated by
Kite", "fields" : [ { "name" : "Key", "type" : "long", "doc" : "Type
inferred from '4'" }, { "name" : "Value", "type" : "string", "doc" :
"Type inferred from 'four'" } ] }

And, MergeRecord uses that schema text as groupId even if
'Correlation Attribute' is specified.
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/MergeRecord.java#L348

So, even if schema is the same, if actual values vary, merging group
id will be different.
If you can use SchemaRegistry, it should work as expected.

Thanks,
Koji

On Fri, Apr 13, 2018 at 2:45 PM, DEHAY Aurelien
 wrote:
>
> Hello.
>
> Thanks for the answer.
>
> The 20k is just the last test; I’ve tested with 100 and 1,000, with an input queue 
> of 10k, and it doesn’t change anything.
>
> I will try to simplify the test case and to not use the inferred schema.
>
> Regards
>
>> On 13 Apr 2018, at 04:50, Koji Kawamura  wrote:
>>
>> Hello,
>>
>> I checked your template. Haven't run the flow since I don't have
>> sample input XML files.
>> However, when I looked at the MergeRecord processor configuration, I found 
>> that:
>> Minimum Number of Records = 2
>> Max Bin Age = 10 sec
>>
>> From briefly looking at the MergeRecord source code, it expires a bin that is
>> not complete after Max Bin Age.
>> Do you have 20,000 records to merge always within 10 sec window?
>> If not, I recommend to lower the minimum number of records.
>>
>> I haven't checked actual MergeRecord behavior so I may be wrong, but
>> it is worth changing the configuration.
>>
>> Hope this helps,
>> Koji
>>
>>
>> On Fri, Apr 13, 2018 at 12:26 AM, DEHAY Aurelien
>>  wrote:
>>> Hello.
>>>
>>> Please see the template attached. The problem we have is that, whatever 
>>> configuration we set in the mergerecord, we can't get it to actually merge 
>>> records.
>>>
>>> All the records are in the same format; we put an inferschema in so as not to 
>>> have to write it down ourselves. The only difference between schemas is then 
>>> that the doc="" fields are different. Is it possible for that to prevent the 
>>> merging?
>>>
>>> Thanks for any pointer or info.
>>>
>>>
>>> Aurélien DEHAY
>>>
>>>
>>>
>


Re: MergeRecord

2018-04-12 Thread DEHAY Aurelien

Hello. 

Thanks for the answer. 

The 20k is just the last test; I’ve tested with 100 and 1,000, with an input queue 
of 10k, and it doesn’t change anything. 

I will try to simplify the test case and to not use the inferred schema. 

Regards

> On 13 Apr 2018, at 04:50, Koji Kawamura  wrote:
> 
> Hello,
> 
> I checked your template. Haven't run the flow since I don't have
> sample input XML files.
> However, when I looked at the MergeRecord processor configuration, I found 
> that:
> Minimum Number of Records = 2
> Max Bin Age = 10 sec
> 
> From briefly looking at the MergeRecord source code, it expires a bin that is
> not complete after Max Bin Age.
> Do you have 20,000 records to merge always within 10 sec window?
> If not, I recommend to lower the minimum number of records.
> 
> I haven't checked actual MergeRecord behavior so I may be wrong, but
> it is worth changing the configuration.
> 
> Hope this helps,
> Koji
> 
> 
> On Fri, Apr 13, 2018 at 12:26 AM, DEHAY Aurelien
>  wrote:
>> Hello.
>> 
>> Please see the template attached. The problem we have is that, whatever 
>> configuration we set in the mergerecord, we can't get it to actually merge 
>> records.
>> 
>> All the records are in the same format; we put an inferschema in so as not to 
>> have to write it down ourselves. The only difference between schemas is then 
>> that the doc="" fields are different. Is it possible for that to prevent the merging?
>> 
>> Thanks for any pointer or info.
>> 
>> 
>> Aurélien DEHAY
>> 
>> 
>> 




Re: MergeRecord

2018-04-12 Thread Koji Kawamura
Hello,

I checked your template. Haven't run the flow since I don't have
sample input XML files.
However, when I looked at the MergeRecord processor configuration, I found that:
Minimum Number of Records = 2
Max Bin Age = 10 sec

From briefly looking at the MergeRecord source code, it expires a bin that is
not complete after Max Bin Age.
Do you have 20,000 records to merge always within 10 sec window?
If not, I recommend to lower the minimum number of records.
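
As a quick, illustrative check of whether a bin can even fill before it expires
(using the figures discussed above):

    min_records = 20_000    # minimum number of records per bin
    max_bin_age = 10        # seconds

    print(min_records / max_bin_age)  # need >= 2000 records/second arriving,
                                      # otherwise bins expire before filling up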

I haven't checked actual MergeRecord behavior so I may be wrong, but
it is worth changing the configuration.

Hope this helps,
Koji


On Fri, Apr 13, 2018 at 12:26 AM, DEHAY Aurelien
 wrote:
> Hello.
>
> Please see the template attached. The problem we have is that, whatever 
> configuration we set in the mergerecord, we can't get it to actually merge 
> records.
>
> All the records are in the same format; we put an inferschema in so as not to 
> have to write it down ourselves. The only difference between schemas is then 
> that the doc="" fields are different. Is it possible for that to prevent the merging?
>
> Thanks for any pointer or info.
>
>
> Aurélien DEHAY
>
>
>