Re: Nifi hardware recommendation

Joe Witt Fri, 14 Oct 2016 09:01:50 -0700

I'd also add to Mark's great reply that another good use of RAM, beyond the
heap, disk caching, and avoiding swapping, is off-heap native storage of
things like reference datasets that can be wired into NiFi flows for
high-speed enrichment. You can even hot-swap older and newer versions of
those reference sets.


On Fri, Oct 14, 2016 at 8:41 AM, Mark Payne <[email protected]> wrote:

> Hi Ali,
>
> Typically, we see people using a 4-8 GB heap with NiFi. 8 GB is pretty
> typical for a flow that is expected to have
> pretty high throughput in terms of the number of FlowFiles, or a large
> number of processors. However, one thing
> that you will want to consider in terms of RAM is disk caching. While NiFi
> may not use a huge amount of RAM
> directly, the operating system's disk cache is immensely valuable. Because
> the content of FlowFiles is written
> to disk, having a small amount of RAM can result in the next processor
> needing to read that content from disk.
> However, with a sufficient amount of RAM, you will see by looking at
> operating system metrics (such as iostat -xmh 5, on Linux)
> that NiFi almost never reads FlowFile content from disk. Instead, it is
> able to get all it needs from the disk cache.
> Frequently querying provenance data also shows a huge difference in
> performance if you have enough RAM.
>
> So the ideal case, I would say, is to have enough RAM for NiFi's heap, as
> well as the content size of all FlowFiles
> that will be actively in your flow at once, plus all other things that
> need to go on, on that box. That said, NiFi should
> certainly work fine reading the content from disk if it needs to - just
> with lower performance.
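
A minimal sketch of the sizing rule above (heap, plus the content of all FlowFiles active at once, plus everything else on the box). The function name `estimate_ram_gb` and the example figures of 20 GB of active content and 4 GB for the OS and other processes are hypothetical placeholders, not numbers from this thread; only the 8 GB heap comes from the message.

```python
def estimate_ram_gb(heap_gb, active_content_gb, other_gb):
    """Add up the three components named in the sizing guidance.

    heap_gb           -- JVM heap given to NiFi (4-8 GB is described as typical)
    active_content_gb -- total content size of FlowFiles in the flow at once
    other_gb          -- OS and any other processes on the box (assumed value)
    """
    parts = {
        "heap_gb": heap_gb,
        "active_content_gb": active_content_gb,
        "other_gb": other_gb,
    }
    for name, value in parts.items():
        if value < 0:
            raise ValueError(f"{name} must not be negative")
    return sum(parts.values())


# Hypothetical example: 8 GB heap, 20 GB of in-flight content, 4 GB for the rest.
print(estimate_ram_gb(heap_gb=8, active_content_gb=20, other_gb=4), "GB")
```

With less RAM than this estimate the flow should still work, as noted above, but more content reads will go to disk instead of the disk cache.
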
>
> Does this answer your question?
>
> Thanks
> -Mark
>
>
> On Oct 13, 2016, at 7:47 PM, Ali Nazemian <[email protected]> wrote:
>
> Hi,
>
> I have another question regarding the hardware recommendation. As far as I
> found out, Nifi uses on-heap memory currently, and it will not try to load
> the whole object in memory. From the garbage collection perspective, it is
> not recommended to dedicate more than 8-10 GB to JVM heap space. In this
> case, may I say spending money on system memory is useless? Probably 16 GB
> per each system is enough according to this architecture. Unless some
> architecture changes appear in the future to use off-heap memory as well.
> However, I found some articles about best practices, and in terms of memory
> recommendation it does not make sense. Would you please clarify this part
> for me?
> Thank you very much.
>
> Best regards,
> Ali
>
>
> On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian <[email protected]>
> wrote:
>
>> Thank you very much.
>> I would be more than happy to provide some benchmark results after the
>> implementation.
>> Sincerely yours,
>> Ali
>>
>> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt <[email protected]> wrote:
>>
>>> Ali,
>>>
>>> I agree with your assumption.  It would be great to test that out and
>>> provide some numbers but intuitively I agree.
>>>
>>> I could envision certain scatter/gather data flows that could challenge
>>> that sequential access assumption but honestly with how awesome disk
>>> caching is in Linux these days I think practically speaking this is the
>>> right way to think about it.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian <[email protected]>
>>> wrote:
>>>
>>>> Dear Joe,
>>>>
>>>> Thank you very much. That was a really great explanation.
>>>> I investigated the Nifi architecture, and it seems that most of the
>>>> read/write operations for flow file repo and provenance repo are random.
>>>> However, for content repo most of the read/write operations are sequential.
>>>> Let's say cost does not matter. In this case, even choosing SSD for content
>>>> repo cannot provide a huge performance gain over HDD. Am I right?
>>>> Hence, it would be better to spend content repo SSD money on network
>>>> infrastructure.
>>>>
>>>> Best regards,
>>>> Ali
>>>>
>>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt <[email protected]> wrote:
>>>>
>>>>> Ali,
>>>>>
>>>>> You have a lot of nice resources to work with there.  I'd recommend
>>>>> the series of RAID-1 configurations personally, provided you keep in
>>>>> mind this means you can only lose a single disk for any one partition.
>>>>> As long as they're being monitored and would be quickly replaced this in
>>>>> practice works well.  If there could be lapses in monitoring or time to
>>>>> replace then it is perhaps safer to go with more redundancy or an
>>>>> alternative RAID type.
>>>>>
>>>>> I'd say do the OS, app installs w/user and audit db stuff, application
>>>>> logs on one physical RAID volume.  Have a dedicated physical volume for
>>>>> the flow file repository.  It will not be able to use all the space but
>>>>> it certainly could benefit from having no other contention.  This could
>>>>> be a great thing to have SSDs for actually.  And for the remaining
>>>>> volumes split them up for content and provenance as you have.  You get
>>>>> to make the overall performance versus retention decision.  Frankly,
>>>>> you have a great system to work with and I suspect you're going to see
>>>>> excellent results anyway.
>>>>>
>>>>> Conservatively speaking expect say 50MB/s of throughput per volume in
>>>>> the content repository so if you end up with 8 of them you could achieve
>>>>> upwards of 400MB/s sustained.  You'll also then want to make sure you have
>>>>> a good 10G based network setup as well.  Or, you could dial back on the
>>>>> speed tradeoff and simply increase retention or disk loss tolerance.  Lots
>>>>> of ways to play the game.
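
The arithmetic behind that estimate, as a rough sketch. The 50 MB/s per volume and the 8 volumes are the figures from the paragraph above; the function names are made up for illustration, and the link capacity is the raw line rate (1 Gbit/s = 125 MB/s), ignoring protocol overhead.

```python
def content_repo_throughput_mb_s(volumes, per_volume_mb_s=50):
    """Aggregate sustained throughput if every content volume is kept busy."""
    return volumes * per_volume_mb_s


def link_capacity_mb_s(gigabits_per_second):
    """Raw line rate of a network link in MB/s (decimal units, no overhead)."""
    return gigabits_per_second * 1000 / 8


disk_mb_s = content_repo_throughput_mb_s(8)
for gbit in (1, 10):
    link_mb_s = link_capacity_mb_s(gbit)
    verdict = "network is the bottleneck" if link_mb_s < disk_mb_s else "enough headroom"
    print(f"{gbit}G link: {link_mb_s:.0f} MB/s vs {disk_mb_s} MB/s from disk -> {verdict}")
```

Under these assumptions a single 1G link cannot carry what eight content volumes can sustain, which is why the 10G recommendation follows from the disk estimate.
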
>>>>>
>>>>> There are no published SSD vs HDD performance benchmarks that I am
>>>>> aware of though this is a good idea.  Having a hybrid of SSDs and HDDs
>>>>> could offer a really solid performance/retention/cost tradeoff.  For
>>>>> example having SSDs for the OS/logs/provenance/flowfile with HDDs for the
>>>>> content - that would be quite nice.  At that rate to take full advantage
>>>>> of the system you'd need to have very strong network infrastructure
>>>>> between NiFi and any systems it is interfacing with and your flows would
>>>>> need to be well tuned for GC/memory efficiency.
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Dear Nifi Users/ developers,
>>>>>> Hi,
>>>>>>
>>>>>> I was wondering whether there is any benchmark on whether it is better
>>>>>> to give NiFi direct control of the disks or to use RAID for this
>>>>>> purpose. For example, which of these scenarios is recommended from a
>>>>>> performance point of view?
>>>>>> Scenario 1:
>>>>>> 24 disk in total
>>>>>> 2 disk- raid 1 for OS and flowfile repo
>>>>>> 2 disk- raid 1 for provenance repo1
>>>>>> 2 disk- raid 1 for provenance repo2
>>>>>> 2 disk- raid 1 for content repo1
>>>>>> 2 disk- raid 1 for content repo2
>>>>>> 2 disk- raid 1 for content repo3
>>>>>> 2 disk- raid 1 for content repo4
>>>>>> 2 disk- raid 1 for content repo5
>>>>>> 2 disk- raid 1 for content repo6
>>>>>> 2 disk- raid 1 for content repo7
>>>>>> 2 disk- raid 1 for content repo8
>>>>>> 2 disk- raid 1 for content repo9
>>>>>>
>>>>>>
>>>>>> Scenario 2:
>>>>>> 24 disk in total
>>>>>> 2 disk- raid 1 for OS and flowfile repo
>>>>>> 4 disk- raid 10 for provenance repo1
>>>>>> 18 disk- raid 10 for content repo1
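
To compare the two layouts on paper, here is a small sketch that tallies disks and usable capacity. The `Volume` class and `summarize` helper are hypothetical names for illustration. The one modelling assumption is that the two-disk RAID 1 pairs and the RAID 10 sets each keep exactly one mirror copy, so usable capacity is half the raw disk count; it says nothing about measured performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    purpose: str  # "os", "provenance" or "content"
    disks: int    # physical disks in the RAID set

    def usable_disks(self) -> int:
        # Assumption: two-disk RAID 1 and RAID 10 both mirror every block once,
        # so usable capacity is half of the raw disks.
        return self.disks // 2


def summarize(volumes):
    """Return disk totals and how many separate content volumes NiFi would see."""
    content = [v for v in volumes if v.purpose == "content"]
    return {
        "total_disks": sum(v.disks for v in volumes),
        "content_volumes": len(content),
        "content_usable_disks": sum(v.usable_disks() for v in content),
    }


scenario_1 = [Volume("os", 2)] + [Volume("provenance", 2)] * 2 + [Volume("content", 2)] * 9
scenario_2 = [Volume("os", 2), Volume("provenance", 4), Volume("content", 18)]

for name, layout in (("Scenario 1", scenario_1), ("Scenario 2", scenario_2)):
    print(name, summarize(layout))
```

Both layouts use all 24 disks and give the content repository nine disks' worth of usable space; they differ in whether NiFi gets nine independent content volumes or a single striped one.
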
>>>>>>
>>>>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>>>>> Thank you very much.
>>>>>>
>>>>>> Best regards,
>>>>>> Ali
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> A.Nazemian
>>>>
>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>
>
>
> --
> A.Nazemian
>
>
>
