Hey Mark,
I should have mentioned that the PutElasticsearchHttp processors are going
to 2 different clusters.  We did play with different thread counts for each
of them.  At one point we were wondering if too large a Batch Size would
make the threads block each other.

It looks like PutElasticsearchHttp parses every FlowFile to verify it's
a well-formed JSON document [1].  That alone feels pretty CPU expensive.
In our case, we already know we have valid JSON.  Just as an
anecdotal benchmark: a combination of [MergeContent + 2x InvokeHTTP] uses
a total of 9 threads to accomplish the same thing that [2x DistributeLoad +
2x PutElasticsearchHttp] does with 50 threads.  The DistributeLoad processors
need 5 threads each to keep up.  Each PutElasticsearchHttp needs about 10.
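
To make the comparison concrete, here is a rough sketch of what the MergeContent + InvokeHTTP path amounts to: documents that are already known to be valid JSON are framed into an Elasticsearch `_bulk` body by plain string concatenation, with no per-document parse. The function name, the `validate` flag, and the sample index name are illustrative assumptions, not code from NiFi.

```python
import json


def build_bulk_body(docs, index, validate=False):
    """Frame single-line JSON documents as an Elasticsearch _bulk (NDJSON) body.

    docs     -- iterable of JSON strings, one document each (assumed single-line)
    index    -- target index name
    validate -- if True, parse each document first (the per-FlowFile cost
                discussed above); if False, trust the caller
    """
    # One action line is reused for every document in the batch.
    action = json.dumps({"index": {"_index": index}})
    lines = []
    for doc in docs:
        if validate:
            json.loads(doc)  # raises ValueError on malformed JSON
        lines.append(action)
        lines.append(doc)
    # The bulk API requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_bulk_body(['{"a": 1}', '{"a": 2}'], index="example-index")
    print(body, end="")
```

With `validate=False` the only work per document is a list append, which is roughly the trade we are making by merging first and posting the result ourselves.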

PutElasticsearchHTTP is configured like this:
Index: ${esIndex}
Batch Size: 3000
Index Operation: Index

For the ./nifi.sh diagnostics --verbose diagnostics1.txt, I had to export
TOOLS_JAR on the command line to the path where tools.jar was located.

I'm not getting a file written out though.  I still have the "full" NiFi up
and running.  I assume it should be?  Do I need to change my logback.xml
levels at all?


[1]
https://github.com/apache/nifi/blob/aa741cc5967f62c3c38c2a47e712b7faa6fe19ff/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/PutElasticsearchHttp.java#L299

Thanks,
Ryan

On Thu, Sep 17, 2020 at 11:43 AM Mark Payne <marka...@hotmail.com> wrote:

> Ryan,
>
> Why are you using DistributeLoad to go to two different
> PutElasticsearchHttp processors? Does that perform better for you than a
> single PutElasticsearchHttp processor with multiple concurrent tasks? It
> shouldn’t really. I’ve never used that processor, but if two instances of
> the processor perform significantly better than 1 instance with 2
> concurrent tasks, that’s probably worth looking into.
>
> -Mark
>
>
> On Sep 17, 2020, at 11:38 AM, Ryan Hendrickson <
> ryan.andrew.hendrick...@gmail.com> wrote:
>
> @Joe I can't export the flow.xml.gz easily, although it's pretty simple.
> We put just the following on its own server because DistributeLoad (bug
> [1]) and PutElasticsearchHttp have a hard time keeping up.
>
>    1. Input Port
>    2. ControlRate (data rate | 1.7GB | 5 min)
>    3. Update Attributes (Delete Attribute Regex)
>    4. JoltTransformJSON
>    5. FlattenJSONArray (custom; takes a one-level JSON array and turns it
>    into objects)
>    6. DistributeLoad
>       1. PutElasticsearchHttp
>       2. PutElasticsearchHttp
>
>
> Unrelated: we're experimenting with a MergeContent + InvokeHTTP combo to
> see if that's more performant than PutElasticsearchHttp. The Elastic one
> uses an ObjectMapper, string replacements, etc.  It seems to cap out
> around 2-3 GB per 5 minutes.
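
To put these figures in comparable units, the sketch below converts a "GB per 5 minutes" rate into MB/s and FlowFiles per second. It assumes decimal units (1 GB = 1000 MB) and the ~4 KB FlowFile size quoted in the original message; both are approximations.

```python
def per_second(gb_per_window, window_minutes=5, flowfile_kb=4):
    """Return (MB/s, FlowFiles/s) for a rate given as GB per time window."""
    mb_per_s = gb_per_window * 1000 / (window_minutes * 60)
    flowfiles_per_s = mb_per_s * 1000 / flowfile_kb
    return mb_per_s, flowfiles_per_s


if __name__ == "__main__":
    # 1.7 is the ControlRate setting; 2.0 and 3.0 bracket the observed cap.
    for gb in (1.7, 2.0, 3.0):
        mb, ff = per_second(gb)
        print(f"{gb} GB / 5 min = {mb:.1f} MB/s, about {ff:.0f} FlowFiles/s")
```

Under those assumptions the ControlRate setting works out to a bit under 6 MB/s, and the observed cap to somewhere between about 7 and 10 MB/s.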
>
> @Mark I'll check the diagnostics.
>
> @Jim definitely disk space 100% used.
>
> [1] https://issues.apache.org/jira/browse/NIFI-1121
>
> Ryan
>
> On Thu, Sep 17, 2020 at 11:33 AM Williams, Jim <jwilli...@alertlogic.com>
> wrote:
>
>> Ryan,
>>
>>
>>
>> Is this maybe a case of exhausting inodes on the filesystem rather
>> than exhausting the space available?  If you do a ‘df -i’ on the system
>> what do you see for inode usage?
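
The same inode check can be scripted. This is a minimal sketch using Python's `os.statvfs` (Unix only); the path is an example, and some filesystems report zero total inodes, in which case no percentage is computed.

```python
import os


def inode_usage(path):
    """Return (total, used, percent_used) inodes for the filesystem at path."""
    st = os.statvfs(path)
    total = st.f_files
    used = total - st.f_ffree
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    percent = (100.0 * used / total) if total else None
    return total, used, percent


if __name__ == "__main__":
    # Example path; point this at the content repository partition.
    total, used, percent = inode_usage("/")
    suffix = f" ({percent:.1f}%)" if percent is not None else ""
    print(f"inodes: {used}/{total} used{suffix}")
```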
>>
>>
>>
>> Warm regards,
>>
>>
>>
>> *Jim Williams* | Manager, Site Reliability Engineering
>>
>> O: +1 713.341.7812 | C: +1 919.523.8767 | jwilli...@alertlogic.com |
>> alertlogic.com <http://www.alertlogic.com/>
>>
>>
>>
>> *From:* Joe Witt <joe.w...@gmail.com>
>> *Sent:* Thursday, September 17, 2020 10:19 AM
>> *To:* users@nifi.apache.org
>> *Subject:* Re: Content Claims Filling Disk - Best practice for small
>> files?
>>
>>
>>
>> can you share your flow.xml.gz?
>>
>>
>>
>> On Thu, Sep 17, 2020 at 8:08 AM Ryan Hendrickson <
>> ryan.andrew.hendrick...@gmail.com> wrote:
>>
>> 1.12.0
>>
>>
>>
>> Thanks,
>>
>> Ryan
>>
>>
>>
>> On Thu, Sep 17, 2020 at 11:04 AM Joe Witt <joe.w...@gmail.com> wrote:
>>
>> Ryan
>>
>>
>>
>> What version are you using? I do think we had an issue that kept items
>> around longer than intended that has been addressed.
>>
>>
>>
>> Thanks
>>
>>
>>
>> On Thu, Sep 17, 2020 at 7:58 AM Ryan Hendrickson <
>> ryan.andrew.hendrick...@gmail.com> wrote:
>>
>> Hello,
>>
>> I've got ~15 million FlowFiles, each roughly 4KB, totaling about 55GB
>> of data on my canvas.
>>
>>
>>
>> However, the content repository (on its own partition) is
>> completely full with 350GB of data.  I'm pretty certain the way Content
>> Claims store the data is responsible for this.  In previous experience,
>> we've had files that are larger, and haven't seen this as much.
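
A quick check of the numbers above, as a sketch: the FlowFile count and size are the approximate figures from this message, and decimal units are assumed.

```python
def repo_overhead(flowfile_count, flowfile_kb, repo_gb):
    """Return (live_gb, ratio): live content size and repo-to-live ratio."""
    live_gb = flowfile_count * flowfile_kb / 1_000_000  # KB -> GB (decimal)
    return live_gb, repo_gb / live_gb


if __name__ == "__main__":
    live, ratio = repo_overhead(15_000_000, 4, 350)
    print(f"live content ~{live:.0f} GB, repository is ~{ratio:.1f}x that")
```

Rounding to exactly 4 KB gives about 60 GB of live content; with the 55GB stated above the ratio is about 6.4x. Either way the repository holds roughly six times more data than the FlowFiles on the canvas account for.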
>>
>>
>>
>> My guess is that as data was streaming through and being added to a
>> claim, it isn't always released as the small files leave the canvas.
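
The guess above can be illustrated with a toy model. It assumes that many small FlowFiles are appended to one shared claim file and that the file can only be reclaimed once no FlowFile references it. The claim size, FlowFile size, and survival rate are made-up parameters, and this is not NiFi's actual implementation.

```python
import random


def simulate(num_flowfiles, flowfile_kb, claim_kb, survive_fraction, seed=0):
    """Return (live_kb, pinned_kb) under a simple reference-counting model.

    FlowFiles are appended to claims in arrival order. A claim's full size
    stays on disk while at least one of its FlowFiles is still in the flow.
    """
    rng = random.Random(seed)
    per_claim = max(1, claim_kb // flowfile_kb)
    live_kb = 0
    pinned_claims = set()
    for i in range(num_flowfiles):
        if rng.random() < survive_fraction:
            live_kb += flowfile_kb
            pinned_claims.add(i // per_claim)
    pinned_kb = 0
    for claim in pinned_claims:
        # The last claim may be only partially filled.
        count = min(per_claim, num_flowfiles - claim * per_claim)
        pinned_kb += count * flowfile_kb
    return live_kb, pinned_kb


if __name__ == "__main__":
    live, pinned = simulate(100_000, flowfile_kb=4, claim_kb=1024,
                            survive_fraction=0.05)
    print(f"live: {live} KB, pinned on disk: {pinned} KB")
```

In this model a small fraction of surviving FlowFiles scattered across claims is enough to keep most claim files on disk, which would match the gap between canvas size and repository size.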
>>
>>
>>
>> We've run into this issue enough times that I figure there's probably a
>> "best practice for small files" for the content claims settings.
>>
>>
>>
>> These are our current settings:
>>
>>
>> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
>>
>> nifi.content.claim.max.appendable.size=1 MB
>>
>> nifi.content.claim.max.flow.files=100
>>
>> nifi.content.repository.directory.default=/var/nifi/repositories/content
>>
>> nifi.content.repository.archive.max.retention.period=12 hours
>>
>> nifi.content.repository.archive.max.usage.percentage=50%
>>
>> nifi.content.repository.archive.enabled=true
>>
>> nifi.content.repository.always.sync=false
>>
>>
>>
>>
>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#content-repository
>>
>>
>>
>>
>> There's 1024 folders on the disk (0-1023) for the Content Claims.
>>
>> Each file inside the folders is roughly 2 MB to 8 MB (which is odd
>> because I thought the max appendable size would make this no larger than
>> 1MB.)
>>
>>
>>
>> Is there a way to expand the number of folders and/or reduce the number
>> of individual FlowFiles that are stored in the claims?
>>
>>
>>
>> I'm hoping there might be a best practice out there though.
>>
>>
>>
>> Thanks,
>>
>> Ryan
>>
>>
>>
>>
>
>
