Hey Kevin,

Sure! We were using the default HDFS blockcache settings and -Xmx6g -XX:MaxDirectMemorySize=6g.
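In case a concrete example helps, this is roughly what that looks like in solr.in.sh. Treat it as a sketch rather than a copy of our config: the blockcache properties are just the out-of-the-box defaults spelled out (assuming the stock HdfsDirectoryFactory placeholders in solrconfig.xml), and the slab math assumes the default 16384 x 8KB blocks per slab.

  # Heap and off-heap limits we ran with
  SOLR_HEAP="6g"
  SOLR_OPTS="$SOLR_OPTS -XX:MaxDirectMemorySize=6g"

  # HDFS block cache left at its defaults
  SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.enabled=true"
  SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.direct.memory.allocation=true"
  SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.blocksperbank=16384"
  SOLR_OPTS="$SOLR_OPTS -Dsolr.hdfs.blockcache.slab.count=1"

  # Rough sizing: one slab ~= 16384 blocks * 8KB = 128MB of direct memory,
  # so a 6GB MaxDirectMemorySize has room for ~40 slabs with headroom left.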
Thanks!

Kyle

On Thu, 20 Dec 2018 at 13:15, Kevin Risden <kris...@apache.org> wrote:

> Kyle - Thanks so much for the followup on this. Rarely do we get to
> see results compared in detail.
>
> Can you share the Solr HDFS configuration settings that you tested
> with? Blockcache and direct memory size? I'd be curious just as a
> reference point.
>
> Kevin Risden
>
> On Thu, Dec 20, 2018 at 10:31 AM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> >
> > Hi All,
> >
> > To close this off, I'm sad to report that we've come to an end with
> > Solr on HDFS.
> >
> > Here's what we finally did:
> > - created two brand-new identical SolrCloud clusters, one on HDFS and
> >   one on local disk.
> > - 1 replica per node. Each node 16GB RAM.
> > - Added documents.
> > - Compared start-up times for a single node after a graceful shutdown.
> >
> > What we observe:
> > - On startup, the replica transitions from "Gone" to "Down" fairly
> >   quickly (a few seconds).
> > - The replica then spends some time in the "Down" state before
> >   transitioning to "Recovering".
> > - The replica stays in "Recovering" for some time before transitioning
> >   to "Active".
> >
> > Results for 75M docs in the replica, replica size 28.5GB:
> >
> > - HDFS
> >   - Time in "Down": 4m 49s
> >   - Time in "Recovering": 2m 30s
> >   - Total time to restart: 7m 9s
> >
> > - Local disk
> >   - Time in "Down": 0m 5s
> >   - Time in "Recovering": 0m 8s
> >   - Total time to restart: 0m 13s
> >
> > Results for 100M docs in the replica, replica size 37GB:
> >
> > - HDFS
> >   - Time in "Down": 8m 30s
> >   - Time in "Recovering": 5m 19s
> >   - Total time to restart: 13m 49s
> >
> > - Local disk
> >   - Time in "Down": 0m 4s
> >   - Time in "Recovering": 0m 10s
> >   - Total time to restart: 0m 14s
> >
> > Conclusions:
> > - As the index size grows, Solr on HDFS shows a trend towards
> >   increasing restart times that is not seen on local disk.
> >
> > Notes:
> > - HDFS in our environment is FINE. The network is FINE. We have HBase
> >   servers running on the same ESXi hosts as Solr, they access the same
> >   HDFS filesystem, and HBase bandwidth regularly exceeds 2GB/s. All
> >   latencies are sub-millisecond.
> > - The values reported above are averages. There's some variance in the
> >   results, but the averages are representative of the times we're
> >   seeing.
> >
> > Thanks for reading!
> >
> > Kyle
> >
> > On Mon, 10 Dec 2018 at 14:14, lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> >
> > > Hi Guys,
> > >
> > > > What OS is it on?
> > > CentOS 7
> > >
> > > > With your indexes in HDFS, the HDFS software running
> > > > inside Solr also needs heap memory to operate, and is probably
> > > > going to set aside part of the heap for caching purposes.
> > > We still have the solr.hdfs.blockcache.slab.count parameter set to
> > > the default of 1, but we're going to tune this a bit and see what
> > > happens.
> > >
> > > > but for this setup, I'd definitely want a LOT more than 16GB.
> > > So where would you start? We can easily double the number of servers
> > > to 6, and put one replica on each (probably going to do this
> > > anyways). Would you go bigger than 6 x 16GB? Keeping in mind, even
> > > with our little 3 x 16GB we haven't had performance problems... This
> > > thread kind of diverged that way, but really the initial issue was
> > > just that the whole index seems to be read on startup. (Which I fully
> > > understand may be resource related, but I have yet to try to
> > > reproduce on a smaller scale to confirm/deny.)
> > >
> > > > As Solr runs, it writes a GC log. Can you share all of the GC log
> > > > files that Solr has created? There should not be any proprietary
> > > > information in those files.
> > >
> > > This I can do. Actually, I've collected a lot of things, redacted any
> > > private info, and collected it here into a series of logs /
> > > screenshots.
> > >
> > > So what I did:
> > > - 16:49 GMT -- stopped solr on one node (node 4) using bin/solr stop,
> > >   keeping the others alive. Captured the solr log as it was stopping,
> > >   and uploaded here:
> > >   - https://pastebin.com/raw/UhSTdb1h
> > >
> > > - 17:00 GMT -- restarted solr on the same node (the other two stayed
> > >   up the whole time) and let it run for an hour. Captured the solr
> > >   logs since the startup here:
> > >   - https://pastebin.com/raw/S4Z9XVrG
> > >
> > > - Observed the outbound network traffic from HDFS to this particular
> > >   solr instance during this time, screenshotted it, and put the image
> > >   here (times are in EST for that screenshot):
> > >   - https://imagebin.ca/v/4PY63LAMSVV1
> > >
> > > - Screenshotted the resource usage on the node according to the solr
> > >   UI:
> > >   - https://imagebin.ca/v/4PY6dYddWGXn
> > >
> > > - Captured the GC logs for the 20 mins after restart, and pasted here:
> > >   - https://pastebin.com/raw/piswTy1M
> > >
> > > Some notes:
> > > - the main collection (the big one) is called "main"
> > > - there is an empty collection on the system called "history", but
> > >   this has 0 documents.
> > > - I redacted any private info in the logs... if there are
> > >   inconsistencies it might be due to this manual process (but I think
> > >   it's okay)
> > >
> > > Thanks!
> > >
> > > Kyle
> > >
> > > On Mon, 10 Dec 2018 at 12:43, Shawn Heisey <apa...@elyograg.org> wrote:
> > >
> > >> On 12/7/2018 8:54 AM, Erick Erickson wrote:
> > >> > Here's the trap: _Indexing_ doesn't take much memory. The memory
> > >> > is bounded by ramBufferSizeMB, which defaults to 100.
> > >>
> > >> This statement is completely true. But it hides one detail: a large
> > >> amount of indexing will allocate this buffer repeatedly. So although
> > >> indexing doesn't take a huge amount of memory space at any given
> > >> moment, the amount of total memory allocated by large indexing will
> > >> be enormous, keeping the garbage collector busy. This is particularly
> > >> true when segment merging happens.
> > >>
> > >> Going over the whole thread:
> > >>
> > >> Graceful shutdown on Solr 7.5 (for non-Windows operating systems)
> > >> should allow up to three minutes for Solr to shut down normally
> > >> before it hard-kills the instance. On Windows it only waits 5
> > >> seconds, which is not enough. What OS is it on?
> > >>
> > >> The problems you've described do sound like your Solr instances are
> > >> experiencing massive GC pauses. This can make *ALL* Solr activity
> > >> take a long time, including index recovery operations. Increasing
> > >> the heap size MIGHT alleviate these problems.
> > >>
> > >> If every machine is handling 700GB of index data and 1.4 billion
> > >> docs (assuming one third of the 2.1 billion docs per shard replica,
> > >> two replicas per machine), you're going to need a lot of heap memory
> > >> for Solr to run well. With your indexes in HDFS, the HDFS software
> > >> running inside Solr also needs heap memory to operate, and is
> > >> probably going to set aside part of the heap for caching purposes.
> > >> I thought I saw something in the thread about a 6GB heap size. This
> > >> is probably way too small. For everything you've described, I have
> > >> to agree with Erick ... 16GB total memory is VERY undersized. It's
> > >> likely unrealistic to have enough memory for the whole index ... but
> > >> for this setup, I'd definitely want a LOT more than 16GB.
> > >>
> > >> As Solr runs, it writes a GC log. Can you share all of the GC log
> > >> files that Solr has created? There should not be any proprietary
> > >> information in those files.
> > >>
> > >> Thanks,
> > >> Shawn
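PS - for anyone else who wants to pull together the same GC data Shawn asked about: Solr writes those logs on its own under its logs directory, so something like the below is enough to bundle them up. The /opt/solr path is just an assumed install location, not necessarily ours; adjust it for your layout.

  # GC logs rotate as solr_gc.log.0, solr_gc.log.1, ... on Solr 7.x
  tar czf solr_gc_logs.tgz /opt/solr/server/logs/solr_gc.log*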