Hey Kevin,

Sure! We were using the default HDFS blockcache settings and
-Xmx6g -XX:MaxDirectMemorySize=6g
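
That is, we left the directoryFactory block in solrconfig.xml at the stock
values, which (if I'm reading the docs right) works out to roughly this;
the solr.hdfs.home path below is just a placeholder:

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <int name="solr.hdfs.blockcache.slab.count">1</int>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  </directoryFactory>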

Thanks!

Kyle

On Thu, 20 Dec 2018 at 13:15, Kevin Risden <kris...@apache.org> wrote:

> Kyle - Thanks so much for the follow-up on this. Rarely do we get to
> see results compared in this much detail.
>
> Can you share the Solr HDFS configuration settings that you tested
> with? Blockcache and direct memory size? I'd be curious just as a
> reference point.
>
> Kevin Risden
>
> On Thu, Dec 20, 2018 at 10:31 AM lstusr 5u93n4 <lstusr...@gmail.com>
> wrote:
> >
> > Hi All,
> >
> > To close this off, I'm sad to report that we've come to an end with Solr
> > on HDFS.
> >
> > Here's what we finally did:
> >  - Created two brand-new, identical SolrCloud clusters, one on HDFS and
> > one on local disk.
> >  - 1 replica per node. Each node has 16GB RAM.
> >  - Added documents.
> >  - Compared start-up times for a single node after a graceful shutdown.
> >
> > What we observe:
> >  - On startup, the replica transitions from "Gone" to "Down" fairly
> > quickly (a few seconds).
> >  - The replica then spends some time in the "Down" state before
> > transitioning to "Recovering".
> >  - The replica stays in "Recovering" for some time before transitioning
> > to "Active".
> >
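> > (For anyone who wants to reproduce the timing: the state transitions can
> > be watched by polling the collections API and noting the timestamps.
> > Hostname and port below are placeholders.)
> >
> >   while true; do
> >     date -u
> >     curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS" \
> >       | grep -o '"state":"[a-z]*"' | sort | uniq -c
> >     sleep 5
> >   done
> >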
> > Results for 75M docs in the replica, replica size 28.5GB:
> >
> >   - HDFS
> >      - Time in "Down": 4m 49s
> >      - Time in "Recovering": 2m 30s
> >      - Total time to restart: 7m 9s
> >
> >   - Local Disk
> >      - Time in "Down": 0m 5s
> >      - Time in "Recovering": 0m 8s
> >      - Total time to restart: 0m 13s
> >
> >
> > Results for 100M docs in the replica, replica size 37GB:
> >
> >   - HDFS
> >      - Time in "Down": 8m 30s
> >      - Time in "Recovering": 5m 19s
> >      - Total time to restart: 13m 49s
> >
> >   - Local Disk
> >      - Time in "Down": 0m 4s
> >      - Time in "Recovering": 0m 10s
> >      - Total time to restart: 0m 14s
> >
> >
> > Conclusions:
> >  - As the index size grows, restart times for Solr on HDFS grow with it,
> > a trend we don't see on local disk.
> >
> > Notes:
> >  - HDFS in our environment is FINE. The network is FINE. We have HBase
> > servers running on the same ESXi hosts as Solr; they access the same HDFS
> > filesystem, and HBase bandwidth regularly exceeds 2GB/s. All latencies
> > are sub-millisecond.
> >  - The values reported above are averages. There's some variance in the
> > results, but the averages are representative of the times we're seeing.
> >
> > Thanks for reading!
> >
> > Kyle
> >
> >
> >
> > On Mon, 10 Dec 2018 at 14:14, lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> >
> > > Hi Guys,
> > >
> > > >  What OS is it on?
> > > CentOS 7
> > >
> > > >  With your indexes in HDFS, the HDFS software running
> > > > inside Solr also needs heap memory to operate, and is probably going
> > > > to set aside part of the heap for caching purposes.
> > > We still have the solr.hdfs.blockcache.slab.count parameter set to the
> > > default of 1, but we're going to tune this a bit and see what happens.
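> > >
> > > (For reference, if I'm reading the docs right, each slab is
> > > blocksperbank (16384) x 8KB blocks = 128MB of off-heap memory, so the
> > > slab count has to fit inside -XX:MaxDirectMemorySize. Something like
> > > this in the directoryFactory block would use ~5GB of our 6GB direct
> > > memory limit:
> > >
> > >   <int name="solr.hdfs.blockcache.slab.count">40</int>
> > >
> > > Those numbers are from memory, so double-check against the ref guide.)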
> > >
> > > > but for this setup, I'd definitely want a LOT more than 16GB.
> > > So where would you start? We can easily double the number of servers
> > > to 6, and put one replica on each (probably going to do this anyway.)
> > > Would you go bigger than 6 x 16GB? Keeping in mind, even with our little
> > > 3 x 16GB we haven't had performance problems... This thread kind of
> > > diverged that way, but really the initial issue was just that the whole
> > > index seems to be read on startup. (Which I fully understand may be
> > > resource related, but I have yet to try to reproduce on a smaller scale
> > > to confirm/deny.)
> > >
> > > > As Solr runs, it writes a GC log.  Can you share all of the GC log
> > > > files that Solr has created?  There should not be any proprietary
> > > > information in those files.
> > >
> > > This I can do. Actually, I've collected a lot of things, redacted any
> > > private info, and gathered it all here into a series of logs and
> > > screenshots.
> > >
> > > So what I did:
> > >  - 16:49 GMT -- stopped Solr on one node (node 4) using bin/solr stop,
> > > and kept the other two alive. Captured the Solr log as it was stopping,
> > > and uploaded it here:
> > >      - https://pastebin.com/raw/UhSTdb1h
> > >
> > >  - 17:00 GMT -- restarted Solr on the same node (the other two stayed
> > > up the whole time) and let it run for an hour. Captured the Solr logs
> > > since the startup here:
> > >     - https://pastebin.com/raw/S4Z9XVrG
> > >
> > >  - Observed the outbound network traffic from HDFS to this particular
> > > Solr instance during this time, screenshotted it, and put the image
> > > here (times are in EST for that screenshot):
> > >     - https://imagebin.ca/v/4PY63LAMSVV1
> > >
> > >  - Screenshotted the resource usage on the node according to the Solr
> > > UI:
> > >    - https://imagebin.ca/v/4PY6dYddWGXn
> > >
> > >  - Captured the GC logs for the 20 mins after restart, and pasted here:
> > >    - https://pastebin.com/raw/piswTy1M
> > >
> > > Some notes:
> > >  - The main collection (the big one) is called "main".
> > >  - There is an empty collection on the system called "history", but it
> > > has 0 documents.
> > >  - I redacted any private info in the logs... if there are
> > > inconsistencies, they might be due to this manual process (but I think
> > > it's okay).
> > >
> > > Thanks!
> > >
> > > Kyle
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Mon, 10 Dec 2018 at 12:43, Shawn Heisey <apa...@elyograg.org> wrote:
> > >
> > >> On 12/7/2018 8:54 AM, Erick Erickson wrote:
> > >> > Here's the trap: _Indexing_ doesn't take much memory. The memory
> > >> > is bounded by ramBufferSizeMB, which defaults to 100.
> > >>
> > >> This statement is completely true.  But it hides one detail:  A large
> > >> amount of indexing will allocate this buffer repeatedly.  So although
> > >> indexing doesn't take a huge amount of memory space at any given
> > >> moment, the amount of total memory allocated by large indexing will be
> > >> enormous, keeping the garbage collector busy.  This is particularly
> > >> true when segment merging happens.
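> > >>
> > >> (For reference, that buffer is the indexConfig setting in
> > >> solrconfig.xml, something like:
> > >>
> > >>   <indexConfig>
> > >>     <ramBufferSizeMB>100</ramBufferSizeMB>
> > >>   </indexConfig>
> > >>
> > >> and heavy indexing churns through that allocation over and over.)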
> > >>
> > >> Going over the whole thread:
> > >>
> > >> Graceful shutdown on Solr 7.5 (for non-Windows operating systems)
> > >> should allow up to three minutes for Solr to shut down normally before
> > >> it hard-kills the instance.  On Windows it only waits 5 seconds, which
> > >> is not enough.  What OS is it on?
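> > >>
> > >> (If I remember right, that wait is the SOLR_STOP_WAIT value used by
> > >> bin/solr, 180 seconds by default, so on Linux it can be raised by
> > >> setting something like this in solr.in.sh:
> > >>
> > >>   SOLR_STOP_WAIT=300
> > >>
> > >> That part is from memory, so check your bin/solr script.)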
> > >>
> > >> The problems you've described do sound like your Solr instances are
> > >> experiencing massive GC pauses.  This can make *ALL* Solr activity
> > >> take a long time, including index recovery operations.  Increasing the
> > >> heap size MIGHT alleviate these problems.
> > >>
> > >> If every machine is handling 700GB of index data and 1.4 billion docs
> > >> (assuming one third of the 2.1 billion docs per shard replica, two
> > >> replicas per machine), you're going to need a lot of heap memory for
> > >> Solr to run well.  With your indexes in HDFS, the HDFS software running
> > >> inside Solr also needs heap memory to operate, and is probably going to
> > >> set aside part of the heap for caching purposes.  I thought I saw
> > >> something in the thread about a 6GB heap size.  This is probably way
> > >> too small.  For everything you've described, I have to agree with Erick
> > >> ... 16GB total memory is VERY undersized.  It's likely unrealistic to
> > >> have enough memory for the whole index ... but for this setup, I'd
> > >> definitely want a LOT more than 16GB.
> > >>
> > >> As Solr runs, it writes a GC log.  Can you share all of the GC log
> > >> files that Solr has created?  There should not be any proprietary
> > >> information in those files.
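> > >>
> > >> (If memory serves, they land under server/logs/ next to solr.log,
> > >> named solr_gc.log*, i.e. one current file plus rotated copies.)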
> > >>
> > >> Thanks,
> > >> Shawn
> > >>
> > >>
>
