Re: push to the limit without going over

2018-07-05 Thread Erick Erickson
Arturas:

" it is becoming incredibly difficult to find working code"

Yeah, I sympathize totally. What I usually do is go into the test code
of whatever version of Solr I'm using and find examples there. _That_
code _must_ be kept up to date ;).

About batching docs: what you gain is basically more efficient I/O; you
don't have to wait around for the client to connect/disconnect for
every doc. Here are some numbers:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/ with
all the caveats that YMMV.
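
For illustration, here's a minimal sketch of what a batched version of the
indexing loop might look like (SolrJ 7.x; the 1,000-doc batch size and the
"jobs" list are placeholders to tune, and the SolrClient is assumed to be
built as in the code quoted below):

    // needs java.util.*, java.nio.file.*, org.apache.solr.client.solrj.SolrClient,
    // org.apache.solr.common.SolrInputDocument
    void indexBatched(SolrClient solrClient, List<Path> jobs) throws Exception {
        List<SolrInputDocument> batch = new ArrayList<>();
        for (Path p : jobs) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("content", new String(Files.readAllBytes(p)));
            batch.add(doc);
            if (batch.size() >= 1000) {
                solrClient.add(batch);   // one request for the whole batch ...
                batch.clear();           // ... instead of one request per doc
            }
        }
        if (!batch.isEmpty()) {
            solrClient.add(batch);       // flush the remainder
        }
    }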

Best,
Erick

On Thu, Jul 5, 2018 at 7:48 AM, Shawn Heisey  wrote:
> On 7/4/2018 3:32 AM, Arturas Mazeika wrote:
>>
>> Details:
>>
>> I am benchmarking solrcloud setup on a single machine (Intel 7 with 8 "cpu
>> cores", an SSD as well as a HDD) using the German Wikipedia collection. I
>> created 4 nodes, 4 shards, rep factor: 2 cluster on the same machine (and
>> managed to push the CPU or SSD to the hardware limits, i.e., ~200MB/s,
>> ~100% CPU). Now I wanted to see what happens if I push HDD to the limits.
>> Indexing the files from the SSD (I am able to scan the collection at the
>> actual rate 400-500MB/s) with 16 threads, I tried to send those to the
>> solr
>> cluster with all indexes on the HDD.
>
> 
>>
>> - 4 cores running 2gb ram
>
>
> If this is saying that the machine running Solr has 2GB of installed memory,
> that's going to be a serious problem.
>
> The default heap size that Solr starts with is 512MB.  With 4 Solr nodes
> running on the machine, each with a 512MB heap, all of your 2GB of memory is
> going to be required by the heaps.  Java requires memory beyond the heap to
> run.  Your operating system and its other processes will also require some
> memory.
>
> This means that not only are you going to have no memory left for the OS
> disk cache, you're actually going to be allocating MORE than the 2GB of
> installed memory, which means the OS is going to start swapping to
> accommodate memory allocations.
>
> When you don't have enough memory for good disk caching, Solr performance is
> absolutely terrible.  When Solr has to wait for data to be read off of disk,
> even if the disk is SSD, its performance will not be good.
>
> When the OS starts swapping, the performance of ANY software on the system
> drops SIGNIFICANTLY.
>
> You need a lot more memory than 2GB on your server.
>
> Thanks,
> Shawn
>


Re: push to the limit without going over

2018-07-05 Thread Shawn Heisey

On 7/4/2018 3:32 AM, Arturas Mazeika wrote:

Details:

I am benchmarking solrcloud setup on a single machine (Intel 7 with 8 "cpu
cores", an SSD as well as a HDD) using the German Wikipedia collection. I
created 4 nodes, 4 shards, rep factor: 2 cluster on the same machine (and
managed to push the CPU or SSD to the hardware limits, i.e., ~200MB/s,
~100% CPU). Now I wanted to see what happens if I push HDD to the limits.
Indexing the files from the SSD (I am able to scan the collection at the
actual rate 400-500MB/s) with 16 threads, I tried to send those to the solr
cluster with all indexes on the HDD.



- 4 cores running 2gb ram


If this is saying that the machine running Solr has 2GB of installed 
memory, that's going to be a serious problem.


The default heap size that Solr starts with is 512MB.  With 4 Solr nodes 
running on the machine, each with a 512MB heap, all of your 2GB of 
memory is going to be required by the heaps.  Java requires memory 
beyond the heap to run.  Your operating system and its other processes 
will also require some memory.
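
For example, the heap for each node can be set explicitly at startup,
something like the line below per node (the 1g value is only an
illustration, and it only helps if the machine has considerably more total
memory than 2GB):

    bin/solr start -cloud -p 9998 -m 1g    # -m sets both -Xms and -Xmx for this node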


This means that not only are you going to have no memory left for the OS
disk cache, you're actually going to be allocating MORE than the 2GB of
installed memory, which means the OS is going to start swapping to
accommodate memory allocations.


When you don't have enough memory for good disk caching, Solr 
performance is absolutely terrible.  When Solr has to wait for data to 
be read off of disk, even if the disk is SSD, its performance will not 
be good.


When the OS starts swapping, the performance of ANY software on the 
system drops SIGNIFICANTLY.


You need a lot more memory than 2GB on your server.

Thanks,
Shawn



Re: push to the limit without going over

2018-07-05 Thread Arturas Mazeika
> There's work afoot to separate out update thread pools from query
> thread pools so _querying_ doesn't suffer when indexing is heavy,
> but that hasn't been implemented yet. This could also address
> your cluster state fetch error.
>
> You will get significantly better throughput if you batch your
> docs and use the client.add(list_of_documents) BTW.
>
> Another possibility is to use the new metrics (since Solr 6.4). They
> provide over 200 metrics you can query, and it's quite
> possible that they'd help your clients know when to self-throttle
> but AFAIK, there's nothing built in to help you there.
>
> Best,
> Erick
>
> On Wed, Jul 4, 2018 at 2:32 AM, Arturas Mazeika  wrote:
> > Hi Solr Folk,
> >
> > I am trying to push solr to the limit and sometimes I succeed. The
> > question is how not to go over it, e.g., avoid:
> >
> > java.lang.RuntimeException: Tried fetching cluster state using the node
> > names we knew of, i.e. [192.168.56.1:9998_solr, 192.168.56.1:9997_solr,
> > 192.168.56.1:_solr, 192.168.56.1:9996_solr]. However, succeeded in
> > obtaining the cluster state from none of them.If you think your Solr
> > cluster is up and is accessible, you could try re-creating a new
> > CloudSolrClient using working solrUrl(s) or zkHost(s).
> > at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.getState(HttpClusterStateProvider.java:109)
> > at org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1113)
> > at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:845)
> > at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:818)
> > at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
> > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:173)
> > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
> > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:152)
> > at com.asc.InsertDEWikiSimple$SimpleThread.run(InsertDEWikiSimple.java:132)
> >
> >
> > Details:
> >
> > I am benchmarking solrcloud setup on a single machine (Intel 7 with 8
> "cpu
> > cores", an SSD as well as a HDD) using the German Wikipedia collection. I
> > created 4 nodes, 4 shards, rep factor: 2 cluster on the same machine (and
> > managed to push the CPU or SSD to the hardware limits, i.e., ~200MB/s,
> > ~100% CPU). Now I wanted to see what happens if I push HDD to the limits.
> > Indexing the files from the SSD (I am able to scan the collection at the
> > actual rate 400-500MB/s) with 16 threads, I tried to send those to the
> solr
> > cluster with all indexes on the HDD.
> >
> > Clearly solr needs to deal with a very slow hard drive (10-20MB/s actual
> > rate). If the cluster is not touched, solrj may start losing connections
> > after a few hours. If one checks the status of the cluster, it may happen
> > sooner. After the connection is lost, the cluster's writing calms down
> > after half a dozen minutes.
> >
> > What would be a reasonable way to push to the limit without going over?
> >
> > The exact parameters are:
> >
> > - 4 cores running 2gb ram
> > - Schema:
> >
> > [schema field and fieldType definitions stripped by the mail archive; the
> > surviving attribute fragments include positionIncrementGap="100",
> > required="true", docValues="false" and stored="false"]
> >
> > I SolrJ-connect once:
> >
> > ArrayList<String> urls = new ArrayList<>();
> > urls.add("http://localhost:/solr");
> > urls.add("http://localhost:9998/solr");
> > urls.add("http://localhost:9997/solr");
> > urls.add("http://localhost:9996/solr");
> >
> > solrClient = new CloudSolrClient.Builder(urls)
> > .withConnectionTimeout(1)
> > .withSocketTimeout(6)
> > .build();
> > solrClient.setDefaultCollection("de_wiki_man");
> >
> > and then, in 16 threads, execute the following as long as there is anything left to process:
> >
> > Path p = getJobPath();
> > String content = new String(Files.readAllBytes(p));
> > UUID id = UUID.randomUUID();
> > SolrInputDocument doc = new SolrInputDocument();
> >
> > BasicFileAttributes attr = Files.readAttributes(p, BasicFileAttributes.class);
> >
> > doc.addField("id",      id.toString());
> > doc.addField("content", content);
> > doc.addField("time",    attr.creationTime().toString());
> > doc.addField("size",    content.length());
> > doc.addField("url",     p.getFileName().toAbsolutePath().toString());
> > solrClient.add(doc);
> >
> >
> > to go through all the wiki html files.
> >
> > Cheers,
> > Arturas
>


Re: push to the limit without going over

2018-07-04 Thread Erick Erickson
First, I usually prefer to construct the CloudSolrClient by
using the Zookeeper ensemble string rather than URLs,
although that's probably not a cure for your problem.
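
For reference, a minimal sketch of that (SolrJ 7.x, where the builder
accepts the Zookeeper host list directly; "localhost:2181" stands in for
whatever ensemble your cluster was started with):

    // needs java.util.Collections, java.util.Optional,
    // org.apache.solr.client.solrj.impl.CloudSolrClient
    CloudSolrClient solrClient = new CloudSolrClient.Builder(
            Collections.singletonList("localhost:2181"), Optional.empty())
        .build();
    solrClient.setDefaultCollection("de_wiki_man");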

Here's what I _think_ is happening. If you're slamming Solr
with a lot of updates, you're doing a lot of merging. At some point,
when there are a lot of merges going on, incoming
updates block until one or more merge threads are done.

At that point, I suspect your client is timing out. And (perhaps)
if you used the Zookeeper ensemble instead of HTTP, the
cluster state fetch would go away. I suspect that another
issue would come up, but

It's also possible this would all go away if you increase your
timeouts significantly. That's still a "set it and hope" approach
rather than a totally robust solution though.
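
For example (the values here are purely illustrative), on the builder from
the original code:

    solrClient = new CloudSolrClient.Builder(urls)
            .withConnectionTimeout(15000)    // 15 s to establish a connection
            .withSocketTimeout(300000)       // 5 min before a read times out
            .build();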

Let's assume that the above works and you start getting timeouts.
You can back off the indexing rate at that point, or just go to
sleep for a while. This isn't what you'd like for a permanent solution,
but may let you get by.
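
A rough sketch of that kind of back-off around the add() call (the delays
and retry count are arbitrary; a broad catch is used because the
cluster-state error above surfaces as a RuntimeException from
CloudSolrClient):

    void addWithBackoff(SolrClient client, SolrInputDocument doc) throws InterruptedException {
        long delayMs = 1000;                            // start with a 1 s pause
        for (int attempt = 0; attempt < 8; attempt++) {
            try {
                client.add(doc);
                return;                                 // success
            } catch (Exception e) {                     // timeout / cluster-state failure
                Thread.sleep(delayMs);                  // back off before retrying
                delayMs = Math.min(delayMs * 2, 60000); // cap the pause at one minute
            }
        }
        throw new RuntimeException("Giving up after repeated failures");
    }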

There's work afoot to separate out update thread pools from query
thread pools so _querying_ doesn't suffer when indexing is heavy,
but that hasn't been implemented yet. This could also address
your cluster state fetch error.

You will get significantly better throughput if you batch your
docs and use the client.add(list_of_documents) BTW.

Another possibility is to use the new metrics (since Solr 6.4). They
provide over 200 metrics you can query, and it's quite
possible that they'd help your clients know when to self-throttle,
but AFAIK there's nothing built in to help you there.
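
A sketch of polling the Metrics API from the indexing client (which metrics
are actually worth watching is left open here; the node URL is just one of
the ones from the original post):

    // needs org.apache.solr.client.solrj.SolrRequest,
    // org.apache.solr.client.solrj.impl.HttpSolrClient,
    // org.apache.solr.client.solrj.request.GenericSolrRequest,
    // org.apache.solr.common.params.ModifiableSolrParams,
    // org.apache.solr.common.util.NamedList
    HttpSolrClient metricsClient =
            new HttpSolrClient.Builder("http://localhost:9998/solr").build();
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("group", "core");              // core-level metrics
    params.set("prefix", "INDEX,UPDATE");     // narrow the response
    GenericSolrRequest req =
            new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/metrics", params);
    NamedList<Object> metrics = metricsClient.request(req);  // throws SolrServerException/IOException
    // inspect 'metrics' and slow the producer threads down when things look hot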

Best,
Erick

On Wed, Jul 4, 2018 at 2:32 AM, Arturas Mazeika  wrote:
> Hi Solr Folk,
>
> I am trying to push solr to the limit and sometimes I succeed. The
> question is how not to go over it, e.g., avoid:
>
> java.lang.RuntimeException: Tried fetching cluster state using the node
> names we knew of, i.e. [192.168.56.1:9998_solr, 192.168.56.1:9997_solr,
> 192.168.56.1:_solr, 192.168.56.1:9996_solr]. However, succeeded in
> obtaining the cluster state from none of them.If you think your Solr
> cluster is up and is accessible, you could try re-creating a new
> CloudSolrClient using working solrUrl(s) or zkHost(s).
> at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.getState(HttpClusterStateProvider.java:109)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1113)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:845)
> at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:818)
> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:173)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:152)
> at com.asc.InsertDEWikiSimple$SimpleThread.run(InsertDEWikiSimple.java:132)
>
>
> Details:
>
> I am benchmarking solrcloud setup on a single machine (Intel 7 with 8 "cpu
> cores", an SSD as well as a HDD) using the German Wikipedia collection. I
> created 4 nodes, 4 shards, rep factor: 2 cluster on the same machine (and
> managed to push the CPU or SSD to the hardware limits, i.e., ~200MB/s,
> ~100% CPU). Now I wanted to see what happens if I push HDD to the limits.
> Indexing the files from the SSD (I am able to scan the collection at the
> actual rate 400-500MB/s) with 16 threads, I tried to send those to the solr
> cluster with all indexes on the HDD.
>
> Clearly solr needs to deal with a very slow hard drive (10-20MB/s actual
> rate). If the cluster is not touched, solrj may start losing connections
> after a few hours. If one checks the status of the cluster, it may happen
> sooner. After the connection is lost, the cluster's writing calms down
> after half a dozen minutes.
>
> What would be a reasonable way to push to the limit without going over?
>
> The exact parameters are:
>
> - 4 cores running 2gb ram
> - Schema:
>
> [schema field and fieldType definitions stripped by the mail archive; the
> surviving attribute fragments include positionIncrementGap="100" and
> docValues="false"]
>
> I SolrJ-connect once:
>
> ArrayList<String> urls = new ArrayList<>();
> urls.add("http://localhost:/solr");
> urls.add("http://localhost:9998/solr");
> urls.add("http://localhost:9997/solr");
> urls.add("http://localhost:9996/solr");
>

push to the limit without going over

2018-07-04 Thread Arturas Mazeika
Hi Solr Folk,

I am trying to push solr to the limit and sometimes I succeed. The
question is how not to go over it, e.g., avoid:

java.lang.RuntimeException: Tried fetching cluster state using the node
names we knew of, i.e. [192.168.56.1:9998_solr, 192.168.56.1:9997_solr,
192.168.56.1:_solr, 192.168.56.1:9996_solr]. However, succeeded in
obtaining the cluster state from none of them.If you think your Solr
cluster is up and is accessible, you could try re-creating a new
CloudSolrClient using working solrUrl(s) or zkHost(s).
at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.getState(HttpClusterStateProvider.java:109)
at org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1113)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:845)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:818)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:173)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:152)
at com.asc.InsertDEWikiSimple$SimpleThread.run(InsertDEWikiSimple.java:132)


Details:

I am benchmarking solrcloud setup on a single machine (Intel 7 with 8 "cpu
cores", an SSD as well as a HDD) using the German Wikipedia collection. I
created 4 nodes, 4 shards, rep factor: 2 cluster on the same machine (and
managed to push the CPU or SSD to the hardware limits, i.e., ~200MB/s,
~100% CPU). Now I wanted to see what happens if I push HDD to the limits.
Indexing the files from the SSD (I am able to scan the collection at the
actual rate 400-500MB/s) with 16 threads, I tried to send those to the solr
cluster with all indexes on the HDD.

Clearly solr needs to deal with a very slow hard drive (10-20MB/s actual
rate). If the cluster is not touched, solrj may start losing connections
after a few hours. If one checks the status of the cluster, it may happen
sooner. After the connection is lost, the cluster's writing calms down
after half a dozen minutes.

What would be a reasonable way to push to the limit without going over?

The exact parameters are:

- 4 cores running 2gb ram
- Schema:

  [schema field and fieldType definitions stripped by the mail archive]

I SolrJ-connect once:

ArrayList<String> urls = new ArrayList<>();
urls.add("http://localhost:/solr");
urls.add("http://localhost:9998/solr");
urls.add("http://localhost:9997/solr");
urls.add("http://localhost:9996/solr");

solrClient = new CloudSolrClient.Builder(urls)
.withConnectionTimeout(1)
.withSocketTimeout(6)
.build();
solrClient.setDefaultCollection("de_wiki_man");

and then, in 16 threads, execute the following as long as there is anything left to process:

Path p = getJobPath();
String content = new String(Files.readAllBytes(p));
UUID id = UUID.randomUUID();
SolrInputDocument doc = new SolrInputDocument();

BasicFileAttributes attr = Files.readAttributes(p, BasicFileAttributes.class);

doc.addField("id",      id.toString());
doc.addField("content", content);
doc.addField("time",    attr.creationTime().toString());
doc.addField("size",    content.length());
doc.addField("url",     p.getFileName().toAbsolutePath().toString());
solrClient.add(doc);


to go through all the wiki html files.

Cheers,
Arturas