On 7/25/2013 6:53 PM, Tim Vaillancourt wrote:
> Thanks for the reply, Shawn, I can always count on you :).
> 
> To answer the JVM question: we are using 10GB heaps and have over 100GB
> of OS cache free. The young generation is about 50% of the heap, all
> under CMS. Our max number of processes for the JVM user is 10k, which is
> where Solr dies when it blows up with 'cannot create native thread'.
> 
> I also want to say this is system-related, but I am seeing this occur on
> all 3 servers, which are brand-new Dell R720s. I'm not saying that's
> impossible, but I don't see much to suggest it, and it would need to be
> one hell of a coincidence.

Nice hardware.  I have some R720xd servers for another project unrelated
to Solr, love them.

I know a little about Dell servers.  If you haven't done so already, I
would install the OpenManage repo and get the firmware fully updated -
BIOS, RAID, and LAN in particular.  The instructions here are pretty easy
to follow:

http://linux.dell.com/repo/hardware/latest/
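
If I'm remembering the repo setup right (double-check against that page,
since the exact steps may have changed), on a yum-based system it was
roughly:

wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash
yum install dell_ft_install
yum install $(bootstrap_firmware)
update_firmware --yes

Then reboot so the new BIOS/RAID/LAN firmware actually takes effect.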

For process/file limits, I have the following in
/etc/security/limits.conf on systems that aren't using SolrCloud
(substitute your Solr user for ncindex):

ncindex         hard    nproc   6144
ncindex         soft    nproc   4096

ncindex         hard    nofile  65535
ncindex         soft    nofile  49151
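
One gotcha: limits.conf is applied at login through PAM, so verify the
new limits from a fresh shell as the Solr user before concluding
anything, e.g.:

ulimit -u    # max user processes
ulimit -n    # max open files

If the numbers haven't changed, look for an override in
/etc/security/limits.d/ (RHEL/CentOS 6 ships a 90-nproc.conf that caps
nproc) and make sure your init script doesn't set its own ulimits.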

> To add more confusion to the mix, we actually run a 2nd SolrCloud cluster
> on the same Solr, Jetty and JVM versions that does not exhibit this issue,
> although it uses a completely different schema, different servers, and
> different access patterns; it is also at high TPS. That is some evidence
> that the current software stack is OK, or maybe this only occurs under an
> extreme load that the 2nd cluster does not see, or only with a certain
> schema.

This is a big reason why I think you should make sure you're fully up to
date on your firmware, as the hardware seems to be one strong
difference.  As much as I love Dell server hardware, firmware issues are
relatively common, especially on early versions of the latest
generation, which includes the R720.

> Lastly, to add a bit more detail to my original description, so far I have
> tried:
> 
> - Entirely rebuilding my cluster from scratch, reinstalling all deps and
> configs, and reindexing the data (in case I screwed up somewhere). The
> EXACT same issue occurs under load about 20-45 minutes in.
> - Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue
> occurs after some load.
> - Restarting SolrCloud / forcing rebuilds of cores. Same issue occurs
> after some load.

The only other thing I can think of is increasing your zkClientTimeout
to 30 seconds or so and trying Solr 4.4 so you have SOLR-4899 and
SOLR-4805.  That's very definitely a shot in the dark.
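
For reference, in the new-style solr.xml that 4.4 uses, I believe the
setting lives in the <solrcloud> section; 30000 below is just the 30
second suggestion in milliseconds:

  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <int name="zkClientTimeout">30000</int>
  </solrcloud>

With the old-style solr.xml it's the zkClientTimeout attribute on the
<cores> element instead, and if your config still has the
${zkClientTimeout:15000} default, you can also just pass
-DzkClientTimeout=30000 to the JVM.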

Thanks,
Shawn
