Re: Expected mergerecord performance

2023-01-30 Thread Richard Beare
Hi Everyone, I'm tentatively reporting that I've identified my problem. The nifi instance I've been testing is running on one docker container among a set of other docker-based services. The logging on the nifi container was configured to write to a persistent volume, and the nifi-app.log file

Re: Expected mergerecord performance

2022-12-21 Thread Richard Beare
Yes, embedded avro schema on both. The part that mystifies me most is the different behaviour I see on different test hosts, making me suspect there's some sort of network timeout happening in the background (probably desperation on my part). The problem host is sitting in a hospital datacentre

Re: Expected mergerecord performance

2022-12-21 Thread Bryan Bende
That works. Can you elaborate on the changes you made to AvroRecordSetWriter where it went from instant to slow? It sounds like the instant case had Schema Access Strategy: Inherit Record Schema and the slow case has Schema Access Strategy: Use Schema Text. What about Schema Write Strategy?
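For concreteness, a sketch of the two writer configurations being compared (property names as they appear on NiFi's AvroRecordSetWriter; the exact values in Richard's flow are not visible in this thread, and the write strategy is assumed, since that is what Bryan is asking about):

  Fast case:
    Schema Access Strategy: Inherit Record Schema
    Schema Write Strategy:  Embed Avro Schema (assumed)

  Slow case:
    Schema Access Strategy: Use 'Schema Text' Property
    Schema Text:            ${avro.schema} or a pasted schema JSON
    Schema Write Strategy:  Embed Avro Schema (assumed)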

Re: Expected mergerecord performance

2022-12-21 Thread Bryan Bende
Can you show your configuration of JoltTransformRecord and AvroRecordSetWriter? On Wed, Dec 21, 2022 at 2:51 AM Lars Winderling wrote: > Hi Richard, > > it's not related, but for the logical types timestamp-millis you should > use a "long" instead of a "string" (cf >

Re: Expected mergerecord performance

2022-12-20 Thread Lars Winderling
Hi Richard, it's not related, but for the logical types timestamp-millis you should use a "long" instead of a "string" (cf https://avro.apache.org/docs/1.11.1/specification/#timestamp-millisecond-precision) afaik. Best, Lars On 21 December 2022 08:29:54 CET, Richard Beare wrote: >I have
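As a hedged illustration of the change Lars is pointing at (the field name below is a placeholder, not taken from the real schema), timestamp-millis is a logical type annotation on an Avro long rather than a plain string field:

  { "name": "document_datetime",
    "type": { "type": "long", "logicalType": "timestamp-millis" } }

  rather than

  { "name": "document_datetime", "type": "string" }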

Re: Expected mergerecord performance

2022-12-20 Thread Richard Beare
I have found a way to force the schema to be used, but I've missed something in my configuration. When I use a default generic avro writer in my jolttransformrecord processor the queue of 259 entries (about 1.8M) is processed instantly. If I configure my avrowriter to use the schema text property

Re: Expected mergerecord performance

2022-12-20 Thread Richard Beare
I've made progress with Jolt and I think I'm close to achieving what I'm after. I am missing one conceptual step, I think. I rearrange my json so that it conforms to the desired structure and I can then write the results as avro. However, that is generic avro. How do I ensure that I conform to
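A minimal sketch of the kind of explicit schema text that can be given to the Avro writer so the output is no longer generic (field names and types here are assumptions loosely based on the record sample quoted further down this page, not the project's actual schema):

  {
    "type": "record",
    "name": "document",
    "fields": [
      { "name": "text", "type": "string" },
      { "name": "metadata", "type": { "type": "map", "values": "string" } }
    ]
  }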

Re: Expected mergerecord performance

2022-12-20 Thread Richard Beare
Thanks - I'll have a look at that. It is helpful to get guidance like this when the system is so large. On Wed, Dec 21, 2022 at 5:30 AM Matt Burgess wrote: > Thanks Vijay! I agree those processors should do the trick but there > were things in the transformation between input and desired

Re: Expected mergerecord performance

2022-12-20 Thread Matt Burgess
Thanks Vijay! I agree those processors should do the trick, but there were things in the transformation between input and desired output whose origin I wasn't sure of. If you are setting constants you can use either a Shift or Default spec; if you are moving fields around you can use a Shift
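A hedged example of the two spec types Matt mentions, with invented field names: the shift operation moves fields to new locations in the output, and the default operation then fills in a constant.

  [
    { "operation": "shift",
      "spec": {
        "text": "text",
        "tika_metadata": { "parsed_by": "metadata.X_TIKA_Parsed_By" }
      } },
    { "operation": "default",
      "spec": { "source_system": "example-source" } }
  ]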

Re: Expected mergerecord performance

2022-12-20 Thread Vijay Chhipa
Hi Richard Have you tried JoltTransformJSON or JoltTransformRecord I believe you should be able to do this Quick start here: https://community.cloudera.com/t5/Community-Articles/Jolt-quick-reference-for-Nifi-Jolt-Processors/ta-p/244350 > On Dec 20, 2022, at 4:13 AM, Richard Beare wrote: >

Re: Expected mergerecord performance

2022-12-20 Thread Richard Beare
Hi Everyone, Still struggling to fix this issue and may need to try some different things. What is the recommended way of transforming a record structure? At the moment I have a groovy script doing this but the downstream processing is very slow, as discussed in the preceding thread. The

Re: Expected mergerecord performance

2022-12-13 Thread Richard Beare
Any thoughts on this? Are there some extra steps required when creating an avro file from a user defined schema? On Thu, Dec 8, 2022 at 2:56 PM Richard Beare wrote: > Here's another result that I think suggests there's something wrong with > the avro files created by the groovy script, although

Re: Expected mergerecord performance

2022-12-07 Thread Richard Beare
Here's another result that I think suggests there's something wrong with the avro files created by the groovy script, although I can't see what the problem might be. The test is as follows. Output of the groovy script creating avro files is passed to convertrecord, configured with an avro reader
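For reference, a sketch of how this ConvertRecord test is presumably wired (the writer settings are assumptions, since the preview is cut off here):

  ConvertRecord
    Record Reader: AvroReader, Schema Access Strategy = Use Embedded Avro Schema
    Record Writer: AvroRecordSetWriter, Schema Access Strategy = Inherit Record Schema,
                   Schema Write Strategy = Embed Avro Schema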

Re: Expected mergerecord performance

2022-12-07 Thread Richard Beare
I'm diving into the convertrecord tests a bit deeper on the production server. The first test case - 259 documents, total of 1M when in avro format in the input queue to the convert record processor. These avro files were not created by the groovy script - they start life as a database query and

Re: Expected mergerecord performance

2022-12-07 Thread Richard Beare
Hi All, Some progress on debugging options. I've found a flow that exhibits the problem using synthetic data. However the results are host dependent. On my laptop a "run-once" click of merge record gives me two flowfiles of 100 records, while the same flow on the production server produces several

Re: Expected mergerecord performance

2022-12-07 Thread Mark Payne
> Is there something about this structure that is likely to be causing the > problem? Could there be other issues with the avro generated by the script? I don’t think the structure should matter. And as long as the avro produced is proper Avro, I don’t think it should matter. Unless perhaps

Re: Expected mergerecord performance

2022-12-05 Thread Richard Beare
The script generating the avro files is: https://github.com/CogStack/CogStack-NiFi/blob/master/nifi/user-scripts/parse-tika-result-json-to-avro.groovy On Mon, Dec 5, 2022 at 9:58 PM Richard Beare wrote: > Further - I performed another test in which I replaced the custom json to > avro script

Re: Expected mergerecord performance

2022-12-05 Thread Richard Beare
Further - I performed another test in which I replaced the custom json to avro script with a ConvertRecord processor - merge record appears to work as expected in that case. Output of convertrecord looks like this: [ { "text" : " No Alert Found \n\n", "metadata" : { "X_TIKA_Parsed_By" :

Re: Expected mergerecord performance

2022-12-05 Thread Richard Beare
I've reset the backpressure to the default. This remains something of a mystery. The merge with synthetic data happily creates flowfiles with 100 records, and the join says "Records merged due to: Bin is full" or "Records merged due to: Bin is full enough". No timeouts in that case, even with the

Re: Expected mergerecord performance

2022-12-04 Thread Richard Beare
Thanks very much - I'll work through these one at a time and figure out what is going on. The host is an on-prem 48 core with 512G RAM. Can't remember the volume size, but large. I hadn't realised I had modified the backpressure, so that is my top suspect for the difference I'm seeing. Also good

Re: Expected mergerecord performance

2022-12-04 Thread Mark Payne
Hey Richard, So a few things that I’ve done/looked at. I generated some Avro data (random JSON that I downloaded from a Random JSON Generator and then converted to Avro). I then ran this avro data into both the MergeRecord processors. Firstly, I noticed that both are very slow. Found that was

Re: Expected mergerecord performance

2022-12-03 Thread Mark Payne
Richard, I think just the flow structure should be sufficient. Thanks -Mark On Dec 3, 2022, at 4:32 PM, Richard Beare wrote: Thanks for responding, I re-tested with max bins = 2, but the behaviour remained the same. I can easily share a version of the functioning workflow (and data), which

Re: Expected mergerecord performance

2022-12-03 Thread Richard Beare
Thanks for responding, I re-tested with max bins = 2, but the behaviour remained the same. I can easily share a version of the functioning workflow (and data), which is part of a public project. The problem workflow (which shares many of the same components) is part of a health research project,

Re: Expected mergerecord performance

2022-12-03 Thread Mark Payne
Hi Richard, Can you try increasing the Maximum Number of Bins? I think there was an issue that was recently addressed in which the merge processors had an issue when Max Number of Bins = 1. If you still see the same issue, please provide a copy of the flow that can be used to replicate the

Re: Expected mergerecord performance

2022-12-03 Thread Richard Beare
Hi, Pretty much the same - I seem to end up with flowfiles containing about 7 records, presumably always triggered by the timeout. I had thought the timeout needed to be less than the run schedule, but it looks like it can be the same. Here's a debug dump 10:13:43 UTC DEBUG

Re: Expected mergerecord performance

2022-12-02 Thread Joe Witt
Hello. Run schedule should be 0. 50 should be the min number of records. 5 seconds is the max bin age it sounds like you want. Start with these changes and let us know what you're seeing. Thanks On Fri, Dec 2, 2022 at 10:12 PM Richard Beare wrote: > Hi, > I'm having a great deal of trouble
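A sketch of the MergeRecord settings Joe is suggesting (only the values stated here; everything else is assumed left at its default, with Maximum Number of Bins kept above 1 per the issue Mark mentions above):

  Scheduling tab:
    Run Schedule: 0 sec
  Properties:
    Minimum Number of Records: 50
    Max Bin Age: 5 sec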

Expected mergerecord performance

2022-12-02 Thread Richard Beare
Hi, I'm having a great deal of trouble configuring the mergerecord processor to deliver reasonable performance and I'm not sure where to look to correct it. One of my upstream processors requires a single record per flowfile, but I'd like to create larger flowfiles before passing to the next