Hi Everyone,
I'm tentatively reporting that I've identified my problem.
The NiFi instance I've been testing runs in one Docker container among a
set of other Docker-based services. The logging on the NiFi container was
configured to write to a persistent volume, and the nifi-app.log file
Yes, embedded Avro schema on both.
The part that mystifies me most is the different behaviour I see on
different test hosts, making me suspect there's some sort of network
timeout happening in the background (probably desperation on my part). The
problem host is sitting in a hospital datacentre
That works. Can you elaborate on the changes you made to AvroRecordSetWriter
where it went from instant to slow?
It sounds like the instant case had:
- Schema Access Strategy: Inherit Record Schema
and slow case has:
- Schema Access Strategy: Use Schema Text
What about Schema Write Strategy?
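For comparison, the two writer configurations under discussion would look roughly like this (a sketch of NiFi UI property values; the Schema Write Strategy values are assumptions, not confirmed in the thread):

```
Fast case:
  Schema Access Strategy: Inherit Record Schema
  Schema Write Strategy:  Embed Avro Schema        (assumed)

Slow case:
  Schema Access Strategy: Use 'Schema Text' Property
  Schema Text:            <full Avro schema JSON pasted here>
  Schema Write Strategy:  Embed Avro Schema        (assumed)
```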
Can you show your configuration of JoltTransformRecord and
AvroRecordSetWriter?
On Wed, Dec 21, 2022 at 2:51 AM Lars Winderling wrote:
Hi Richard,
it's not related, but for the logical types timestamp-millis you should use a
"long" instead of a "string" (cf
https://avro.apache.org/docs/1.11.1/specification/#timestamp-millisecond-precision)
afaik.
Best, Lars
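As a sketch of what that spec describes, a timestamp-millis field is declared on a long base type rather than a string (the field name here is illustrative, not from the actual schema):

```json
{
  "name": "document_datetime",
  "type": { "type": "long", "logicalType": "timestamp-millis" }
}
```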
On 21 December 2022 08:29:54 CET, Richard Beare wrote:
I have found a way to force the schema to be used, but I've missed
something in my configuration. When I use a default generic Avro writer in
my JoltTransformRecord processor, the queue of 259 entries (about 1.8M) is
processed instantly.
If I configure my Avro writer to use the Schema Text property
I've made progress with Jolt and I think I'm close to achieving what I'm
after. I am missing one conceptual step, I think.
I rearrange my JSON so that it conforms to the desired structure and I can
then write the results as Avro. However, that is generic Avro. How do I
ensure that I conform to
Thanks - I'll have a look at that. It is helpful to get guidance like
this when the system is so large.
On Wed, Dec 21, 2022 at 5:30 AM Matt Burgess wrote:
Thanks Vijay! I agree those processors should do the trick, but there
were things in the transformation between input and desired output
whose origin I wasn't sure of. If you are setting constants you
can use either a Shift or Default spec; if you are moving fields
around you can use a Shift
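As a minimal, hypothetical illustration of that distinction (input field names borrowed from the sample output later in the thread; the target names and constant value are made up):

```json
[
  {
    "operation": "shift",
    "spec": {
      "text": "document_text",
      "metadata": { "X_TIKA_Parsed_By": "parser" }
    }
  },
  {
    "operation": "default",
    "spec": { "source_system": "tika" }
  }
]
```

The shift spec moves fields to new output paths, while the default spec fills in constants where no value is already present.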
Hi Richard,
Have you tried JoltTransformJSON or JoltTransformRecord? I believe you
should be able to do this.
Quick start here:
https://community.cloudera.com/t5/Community-Articles/Jolt-quick-reference-for-Nifi-Jolt-Processors/ta-p/244350
On Dec 20, 2022, at 4:13 AM, Richard Beare wrote:
Hi Everyone,
Still struggling to fix this issue and may need to try some different
things.
What is the recommended way of transforming a record structure? At the
moment I have a groovy script doing this but the downstream processing is
very slow, as discussed in the preceding thread.
The
Any thoughts on this? Are there some extra steps required when creating an
Avro file from a user-defined schema?
On Thu, Dec 8, 2022 at 2:56 PM Richard Beare wrote:
Here's another result that I think suggests there's something wrong with
the avro files created by the groovy script, although I can't see what the
problem might be.
The test is as follows. Output of the groovy script creating Avro files is
passed to ConvertRecord, configured with an Avro reader
I'm diving into the ConvertRecord tests a bit deeper on the production
server.
The first test case - 259 documents, total of 1M when in Avro format in the
input queue to the ConvertRecord processor. These Avro files were not
created by the groovy script - they start life as a database query and
Hi All,
Some progress on debugging options. I've found a flow that exhibits the
problem using synthetic data. However, the results are host dependent. On my
laptop a "run-once" click of MergeRecord gives me two flowfiles of 100
records, while the same flow on the production server produces several
> Is there something about this structure that is likely to be causing the
> problem? Could there be other issues with the avro generated by the script?
I don't think the structure should matter, and as long as the output is
proper Avro, that shouldn't matter either. Unless perhaps
The script generating the avro files is:
https://github.com/CogStack/CogStack-NiFi/blob/master/nifi/user-scripts/parse-tika-result-json-to-avro.groovy
On Mon, Dec 5, 2022 at 9:58 PM Richard Beare wrote:
Further - I performed another test in which I replaced the custom JSON-to-
Avro script with a ConvertRecord processor - MergeRecord appears to work
as expected in that case.
Output of ConvertRecord looks like this:
[ {
"text" : " No Alert Found \n\n",
"metadata" : {
"X_TIKA_Parsed_By" :
I've reset the backpressure to the default
This remains something of a mystery. The merge with synthetic data happily
creates flowfiles with 100 records, and the join says "Records merged due
to: Bin is full" or "Records merged due to: Bin is full enough". No
timeouts in that case, even with the
Thanks very much - I'll work through these one at a time and figure out
what is going on.
The host is an on-prem 48 core with 512G RAM. Can't remember the volume
size, but large. I hadn't realised I had modified the backpressure, so that
is my top suspect for the difference I'm seeing.
Also good
Hey Richard,
So a few things that I’ve done/looked at.
I generated some Avro data (random JSON that I downloaded from a Random JSON
Generator and then converted to Avro).
I then ran this Avro data into both MergeRecord processors.
Firstly, I noticed that both are very slow. Found that was
Richard,
I think just the flow structure should be sufficient.
Thanks
-Mark
On Dec 3, 2022, at 4:32 PM, Richard Beare wrote:
Thanks for responding,
I re-tested with max bins = 2, but the behaviour remained the same. I can
easily share a version of the functioning workflow (and data), which is
part of a public project. The problem workflow (which shares many of the
same components) is part of a health research project,
Hi Richard,
Can you try increasing the Maximum Number of Bins? I think there was a
recently addressed issue in which the merge processors misbehaved when Max
Number of Bins = 1.
If you still see the same issue, please provide a copy of the flow that can be
used to replicate the
Hi,
Pretty much the same - I seem to end up with flowfiles containing about 7
records, presumably always triggered by the timeout.
I had thought the timeout needed to be less than the run schedule, but it
looks like it can be the same.
Here's a debug dump
10:13:43 UTC
DEBUG
Hello,
Run schedule should be 0.
50 should be the min number of records.
5 seconds sounds like the max bin age you want.
Start with these changes and let us know what you're seeing.
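Roughly, the suggested settings on the MergeRecord processor would be (a sketch; Run Schedule lives on the Scheduling tab, the other two on the Properties tab):

```
Run Schedule:              0 sec     (Scheduling tab)
Minimum Number of Records: 50
Max Bin Age:               5 sec
```

With these values the processor merges as soon as a bin reaches 50 records, and flushes any partial bin once it is 5 seconds old.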
Thanks
On Fri, Dec 2, 2022 at 10:12 PM Richard Beare wrote:
Hi,
I'm having a great deal of trouble configuring the MergeRecord processor to
deliver reasonable performance and I'm not sure where to look to correct
it. One of my upstream processors requires a single record per flowfile,
but I'd like to create larger flowfiles before passing to the next