I tried a number of variations before we found and tried that linux/EC2 tuning 
page, including:
  - EC2 instance type: r4, c4, and i3
  - Ubuntu version: Xenial and Trusty
  - EBS vs local storage
  - Stock openjdk vs Zulu openjdk (Recent java8 in both cases - I’m aware of 
the issues with early java8 versions and I’m not using G1)

Most of those attempts were aimed at reducing differences between the data center 
and the EC2 cluster. I re-indexed from scratch each time, and saw the same very 
high system-time symptom in every case. With the linux changes in place, we 
settled on r4/Xenial/EBS/Stock openjdk.

Again, this was a slightly modified Solr 5.4. (I added backup requests, plus two 
memory allocation rate tweaks that have long since been merged into mainline - 
released in 6.2, I think. I can dig up the jira numbers if anyone’s interested.) 
I’ve never used Solr 6.x in production, though. 
The only reason I mentioned 6.x at all is because I’m aware that ES 5.x is 
based on Lucene 6.2. I don’t believe my coworker spent any time on tuning his 
ES setup, although I think he did try G1.

I definitely do want to binary-search those settings until I understand better 
what exactly did the trick. 
The problem is the long cycle time per test, but hopefully I’ll get to it in the 
next couple of weeks.
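
If it does turn out to be the clocksource, at least that one is quick to check. 
Roughly, on a Xen-based instance (a sketch, not a recommendation):

  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  # "xen" makes gettimeofday a real syscall; switching to "tsc" (if it's listed
  # as available) lets the vDSO answer it in userspace. Note this does not
  # persist across reboots:
  echo tsc | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource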



On 5/1/17, 7:26 AM, "John Bickerstaff" <j...@johnbickerstaff.com> wrote:

    It's also very important to consider the type of EC2 instance you are
    using...
    
    We settled on the R4.2XL...  The R series is labeled "High-Memory"
    
    Which instance type did you end up using?
    
    On Mon, May 1, 2017 at 8:22 AM, Shawn Heisey <apa...@elyograg.org> wrote:
    
    > On 4/28/2017 10:09 AM, Jeff Wartes wrote:
    > > tldr: Recently, I tried moving an existing solrcloud configuration from
    > a local datacenter to EC2. Performance was roughly 1/10th what I’d
    > expected, until I applied a bunch of linux tweaks.
    >
    > How very strange.  I knew virtualization would have overhead, possibly
    > even measurable overhead, but that's insane.  Running on bare metal is
    > always better if you can do it.  I would be curious what would happen on
    > your original install if you applied similar tuning to that.  Would you
    > see a speedup there?
    >
    > > Interestingly, a coworker playing with a ElasticSearch (ES 5.x, so a
    > much more recent release) alternate implementation of the same index was
    > not seeing this high-system-time behavior on EC2, and was getting
    > throughput consistent with our general expectations.
    >
    > That's even weirder.  ES 5.x will likely be using Points field types for
    > numeric fields, and although those are faster than what Solr currently
    > uses, I doubt it could explain that difference.  The implication here is
    > that the ES systems are running with stock EC2 settings, not the tuned
    > settings ... but I'd like you to confirm that.  Same Java version as
    > with Solr?  IMHO, Java itself is more likely to cause issues like you
    > saw than Solr.
    >
    > > I’m writing this for a few reasons:
    > >
    > > 1.       The performance difference was so crazy I really feel like this
    > should really be broader knowledge.
    >
    > Definitely agree!  I would be very interested in learning which of the
    > tunables you changed were major contributors to the improvement.  If it
    > turns out that Solr's code is sub-optimal in some way, maybe we can fix it.
    >
    > > 2.       If anyone is aware of anything that changed in Lucene between
    > 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from
    > this? If it’s the clocksource that’s the issue, there’s an implication that
    > Solr was using tons more system calls like gettimeofday that the EC2 (xen)
    > hypervisor doesn’t allow in userspace.
    >
    > I had not considered the performance regression in 6.4.0 and 6.4.1 that
    > Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x version?
    >
    > =============
    >
    > Specific thoughts on the tuning:
    >
    > The noatime option is very good to use.  I also use nodiratime on my
    > systems.  Turning atime/diratime updates off can have *massive* positive
    > impacts on disk performance.  If those options are the source of the
    > speedup, then the machine doesn't have enough spare memory.
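    >
    > As a minimal sketch, the relevant /etc/fstab entry looks something like
    > this (device and mount point are placeholders):
    >
    >     /dev/xvdf  /var/solr  ext4  defaults,noatime,nodiratime  0  2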
    >
    > I'd be wary of the "nobarrier" mount option.  If the underlying storage
    > has battery-backed write caches, or is SSD without write caching, it
    > wouldn't be a problem.  Here's info about the "discard" mount option; I
    > don't know whether it applies to your amazon storage:
    >
    >        discard/nodiscard
    >               Controls whether ext4 should issue discard/TRIM commands to
    >               the underlying block device when blocks are freed.  This is
    >               useful for SSD devices and sparse/thinly-provisioned LUNs,
    >               but it is off by default until sufficient testing has been
    >               done.
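    >
    > As a quick sketch of how that plays out in practice (mount point is just a
    > placeholder): "lsblk -D" shows whether the device advertises TRIM at all,
    > and many people skip the "discard" mount option and run TRIM on a schedule
    > instead:
    >
    >     lsblk -D
    >     sudo fstrim -v /var/solr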
    >
    > The network tunables would have more of an effect in a distributed
    > environment like EC2 than they would on a LAN.
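    >
    > Purely as an illustration (typical values, not a recommendation for your
    > setup), that kind of network tuning usually lands in /etc/sysctl.d/:
    >
    >     # /etc/sysctl.d/99-tuning.conf  (example values only)
    >     net.core.somaxconn = 4096
    >     net.ipv4.tcp_max_syn_backlog = 4096
    >     net.ipv4.tcp_slow_start_after_idle = 0
    >
    > and gets applied with "sysctl --system".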
    >
    > Thanks,
    > Shawn
    >
    >
    
