I think the notes on multiple locations for a repository assume independent disks, not shared storage. That's why I don't think multiple locations will help in a shared storage environment.
Yes, I can see a potential performance loss when NiFi is given multiple locations for a repository and the underlying storage (shared or otherwise) does not provide a performance gain greater than the overhead of managing multiple storage locations, but both the gain and the overhead will vary with the system and flow.

On Wed, May 17, 2017 at 10:30 AM, Ali Nazemian <[email protected]> wrote:

> Hi Joe,
>
> I understand the situation of using DAS and it is a recommended option for
> a production environment, but in the case of shared storage like SAN or
> NAS, I am not sure how we can see slightly more throughput by having
> multiple disk volumes for the content repo.
>
> At the storage layer, data is written and read from multiple disks anyway.
> NiFi moves content to content repos in a round-robin way. On the other
> hand, shared storage distributes data through a RAID mechanism. Could we
> face a situation where throughput actually decreases due to a conflict
> between the shared storage distribution mechanism and the NiFi round-robin
> approach?
>
> Cheers,
> Ali
>
> On Thu, May 18, 2017 at 12:21 AM, Ali Nazemian <[email protected]>
> wrote:
>
>> Hi Juan,
>>
>> Thank you very much, I have already seen those documents. So it is
>> completely clear to me for a Direct Attached Storage scenario, but I am
>> investigating the situation of a fully virtualized platform with shared
>> storage.
>>
>> Cheers,
>> Ali
>>
>> On Thu, May 18, 2017 at 12:00 AM, Joe Skora <[email protected]> wrote:
>>
>>> What I meant is that in general, multiple disks have a higher potential
>>> maximum throughput than a single disk. For example, if a single 1TB disk
>>> capable of 160MB/s is split into 4x 250GB volumes the total combined
>>> bandwidth of the volumes is still 160MB/s, but if data is distributed
>>> across four 250GB disks capable of 160MB/s the potential throughput is up
>>> to 640MB/s. The motherboard, operating system, volume of files, file
>>> sizes, and physical distribution of data across the disks will all affect
>>> the actual bandwidth seen.
>>>
>>> On virtualized disks, the disk configuration and physical distribution
>>> of data cannot be controlled so splitting the volumes doesn't give the
>>> same performance benefit.
>>>
>>> On Wed, May 17, 2017 at 9:27 AM, Ali Nazemian <[email protected]>
>>> wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> Can you please explain why we would still see a performance increase
>>>> from using multiple volumes for each repository? So, practically, using
>>>> different volumes for FlowFile, Provenance and Content would overcome
>>>> the space collision situation. Based on the example mentioned, 100GB
>>>> FlowFile, 1TB prov and 4TB Content Repo should still have less
>>>> throughput than 100GB FlowFile, 2x500GB prov and 8x500GB content repo in
>>>> practice for a fully virtualized environment.
>>>>
>>>> Regards,
>>>> Ali
>>>>
>>>> On Wed, May 17, 2017 at 10:06 PM, Joe Skora <[email protected]> wrote:
>>>>
>>>>> Ali,
>>>>>
>>>>> If you can separate the repositories onto separate physical spindles I
>>>>> would expect a performance benefit, but if they are all on virtualized
>>>>> storage I'd expect less performance benefit from separate volumes. But,
>>>>> even on virtualized storage, separate volumes can help reduce space
>>>>> collision problems, preventing runaway system logs or the provenance
>>>>> repository, for instance, from filling the disk and running the content
>>>>> repository out of space.
>>>>>
>>>>> Regards,
>>>>> Joe S
>>>>>
>>>>> On Wed, May 17, 2017 at 5:00 AM, Ali Nazemian <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I was wondering whether there is any throughput difference between
>>>>>> having multiple disk mount points for FlowFile, Provenance and Content
>>>>>> and using a single mount point for all of them if we are using a fully
>>>>>> virtualized deployment with shared storage. Suppose we have 500TB
>>>>>> disks in the shared storage. Which one do you suggest: 100GB for
>>>>>> FlowFile, 2x500GB for Provenance and 8x500GB for the Content
>>>>>> repository, or a single mount point of 5.1TB for the entire instance?
>>>>>> In other words, would it be better for NiFi to keep track of load
>>>>>> among the disk mount points, or to delegate that entirely to the
>>>>>> shared storage?
>>>>>>
>>>>>> Regards,
>>>>>> Ali
>>>>>>
>>>>
>>>> --
>>>> A.Nazemian
>>>>
>>
>> --
>> A.Nazemian
>>
>
> --
> A.Nazemian
>
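
For anyone who wants to check the arithmetic in Joe's disk example from the thread above, here is a rough Python sketch. It assumes an idealized model in which each physical disk contributes its full rated bandwidth and all volumes carved from the same disk share that bandwidth. The function name and the volume and disk labels are made up for illustration, and real throughput will be lower depending on the factors Joe lists (motherboard, operating system, file sizes, data distribution).

```python
def potential_throughput_mb_s(volume_to_disk, disk_bandwidth_mb_s):
    """Upper bound on combined throughput (MB/s) for a set of volumes.

    volume_to_disk maps each volume name to the physical disk backing it.
    Volumes on the same disk share that disk's bandwidth, so only the
    distinct disks add to the total.
    """
    distinct_disks = set(volume_to_disk.values())
    return sum(disk_bandwidth_mb_s[disk] for disk in distinct_disks)


# Case 1: one 1TB disk (160MB/s) split into four 250GB volumes.
split_single_disk = {"vol1": "diskA", "vol2": "diskA",
                     "vol3": "diskA", "vol4": "diskA"}
single_disk_bandwidth = {"diskA": 160}

# Case 2: four separate 250GB disks, each capable of 160MB/s.
four_separate_disks = {"vol1": "diskA", "vol2": "diskB",
                       "vol3": "diskC", "vol4": "diskD"}
four_disk_bandwidth = {"diskA": 160, "diskB": 160, "diskC": 160, "diskD": 160}

print(potential_throughput_mb_s(split_single_disk, single_disk_bandwidth))  # 160
print(potential_throughput_mb_s(four_separate_disks, four_disk_bandwidth))  # 640
```

On shared or virtualized storage the volume-to-disk mapping is generally not visible to NiFi, so this upper bound cannot be computed from the mount points alone, which is the reason extra volumes are not expected to give the same benefit there.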
