Re: solrcloud Auto-commit doesn't seem reliable

2018-03-21 Thread Elaine Cario
I'm just catching up on reading solr emails, so forgive me for being late
to this dance.

I've just gone through a project to enable CDCR on our Solr, and I also
experienced a small period of time where the commits on the source server
just seemed to stop.  This was during a period of intense experimentation
where I was mucking around with configurations, turning CDCR on/off, etc.
At some point the commits stopped occurring, and it drove me nuts for a
couple of days. I tried everything: restarting Solr, reloading, turning
buffering on and off, etc. I finally threw up my hands and
rebooted the server out of desperation (it was a physical Linux box).
Commits worked fine after that.  I don't know what caused the commits to
stop, and why re-booting (and not just restarting Solr) caused them to work
fine.

Wondering if you ever found a solution to your situation?



On Fri, Feb 16, 2018 at 2:44 PM, Webster Homer 
wrote:

> I meant to get back to this sooner.
>
> When I say I issued a commit I do issue it as collection/update?commit=true
>
> The soft commit interval is set to 3000, but I don't have a problem with
> soft commits ( I think). I was responding
>
> I am concerned that some hard commits don't seem to happen, but I think
> many commits do occur. I'd like suggestions on how to diagnose this, and
> perhaps an idea of where to look. Typically I believe that issues like this
> are from our configuration.
>
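One way to diagnose whether hard commits are actually happening is to grep the Solr log for the commit lines that DirectUpdateHandler2 writes. A minimal sketch (the log line below is a sample in roughly the format Solr emits; check your own solr.log for the exact layout and path):

```shell
# Write a sample line in the style Solr's DirectUpdateHandler2 logs on a
# hard commit (abridged format, for illustration only).
printf '2018-02-16 19:44:01 INFO DirectUpdateHandler2 start commit{flags=0,optimize=false,openSearcher=true}\n' > sample_solr.log

# Count hard-commit entries; on a live node, point this at the real solr.log.
grep -c "start commit" sample_solr.log
```

Comparing the timestamps of these lines against your indexing runs shows quickly whether commits stopped at some point.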
> Our indexing job is pretty simple: we send blocks of JSON to
> /update/json. We either re-index the whole collection or
> just apply updates. Typically we reindex the data once a week and delete
> any records that are older than the last full index. This does lead to a
> fair number of deleted records in the index, especially if commits fail.
> Most of our collections are not large, between 2 and 3 million records.
>
> The collections are hosted in Google Cloud.
>
> On Mon, Feb 12, 2018 at 5:00 PM, Erick Erickson 
> wrote:
>
> > bq: But if 3 seconds is aggressive what would be a good value for soft
> > commit?
> >
> > The usual answer is "as long as you can stand". All top-level caches are
> > invalidated, autowarming is done etc. on each soft commit. That can be a
> > lot of
> > work and if your users are comfortable with docs not showing up for,
> > say, 10 minutes
> > then use 10 minutes. As always "it depends" here, the point is not to
> > do unnecessary
> > work if possible.
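For reference, the knobs being described here live in solrconfig.xml; a sketch with illustrative intervals (not recommendations, assuming the usual update handler setup):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to disk and roll the tlog, but do not open a
       new searcher; visibility is left to soft commits. -->
  <autoCommit>
    <maxTime>60000</maxTime>          <!-- 60 seconds -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit: controls visibility; "as long as you can stand". -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>         <!-- 10 minutes -->
  </autoSoftCommit>
</updateHandler>
```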
> >
> > bq: If a commit doesn't happen how would there ever be an index merge
> > that would remove the deleted documents.
> >
> > Right, it wouldn't. It's a little more subtle than that though.
> > Segments on various
> > replicas will contain different docs, thus the term/doc statistics can be
> > a bit
> > different between multiple replicas. None of the stats will change
> > until the commit
> > though. You might try turning on distributed doc/term stats.
> >
> > Your comments about PULL or TLOG replicas are well taken. However, even
> > those
> > won't be absolutely in sync since they'll replicate from the master at
> > slightly
> > different times and _could_ get slightly different segments _if_
> > there's indexing
> > going on. But let's say you stop indexing. After the next poll
> > interval all the replicas
> > will have identical characteristics and will score the docs the same.
> >
> > I don't have any significant wisdom to offer here, except this is really
> the
> > first time I've heard of this behavior. About all I can imagine is
> > that _somehow_
> > the soft commit interval is -1. When you say you "issue a commit" I'm
> > assuming
> > it's via collection/update?commit=true or some such which issues a
> > hard
> > commit with openSearcher=true. And it's on a _collection_ basis, right?
> >
> > Sorry I can't be more help
> > Erick
> >
> >
> >
> >
> > On Mon, Feb 12, 2018 at 10:44 AM, Webster Homer 
> > wrote:
> > > Erick, I am aware of the CDCR buffering problem causing tlog retention,
> > we
> > > always turn buffering off in our cdcr configurations.
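Disabling the buffer by default looks roughly like this in the source cluster's solrconfig.xml (a fragment per the CDCR docs; the replica and updateLog sections are omitted here):

```xml
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <!-- Keep CDCR from buffering updates (and retaining tlogs) indefinitely. -->
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>
```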
> > >
> > > My post was precipitated by seeing that we had uncommitted data in
> > > collections > 24 hours after it was loaded. The collections I was
> looking
> > > at are in our development environment, where we do not use CDCR.
> However
> > > I'm pretty sure that I've seen situations in production where commits
> > were
> > > also long overdue.
> > >
> > > the "autoSoftcommit" was a typo. The soft commit logic seems to be
> fine,
> > I
> > > don't see an issue with data visibility. But if 3 seconds is aggressive
> > > what would be a good value for soft commit? We have a couple of
> > > collections that are updated every minute although most of them are
> > updated
> > > much less frequently.
> > >
> > > My reason for raising this commit issue is that we see problems with
> the
> > > relevancy of solrcloud searches, and the NRT replica type. Sometimes
> 

Re: Some performance questions....

2018-03-21 Thread Deepak Goel
Deepak
"Please stop cruelty to Animals, help by becoming a Vegan"
+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

On Mon, Mar 19, 2018 at 2:40 AM, Walter Underwood 
wrote:

> > On Mar 17, 2018, at 3:23 AM, Deepak Goel  wrote:
> >
> > Sorry for being rude. But the 'results' please, not the 'road to the
> > results'.
>
> We have 15 different search collections, all different sizes and all with
> different kinds of queries. Here are the two major ones.
>
> 22 million docs
> 32 server Solr Cloud cluster, EC2 c4.8xlarge instances (36 CPU, 59 GB RAM)
> Solr 6.6.2
> 4 shards
> 24,000 requests/minute
> 95th percentile query response time 5 to 7 seconds
>
> 250,000 docs
> 4 server Solr master/slave cluster, EC2 c4.4xlarge (16 CPU, 30 GB RAM)
> Solr 4.10.4
> 60,000 requests/minute
> 95th percentile 100 ms
>
> This does not help at all. If you look at the author's question, I think
it is about a single server. You will have to post your results (25% CPU,
50% CPU, 75% CPU, 100% CPU) for a single server (how does the server scale
with an increase in load?).


> That should make everything crystal clear.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>




Re: Some performance questions....

2018-03-21 Thread Deepak Goel

On Sat, Mar 17, 2018 at 2:56 AM, Shawn Heisey  wrote:

> On 3/16/2018 2:21 PM, Deepak Goel wrote:
> > I wanted to test how many connections Solr can handle concurrently.
> > Also I would have to implement 'connection pooling' of the client-object
> > connections rather than a single connection thread.
> >
> > However a single client object with thousands of queries coming in would
> > surely become a bottleneck. I can test this scenario too.
>
> Handling thousands of simultaneous queries is NOT something you can
> expect a single Solr server to do.  It's not going to happen.  It
> wouldn't happen with ES, either.  Handling that much load requires load
> balancing to a LOT of servers.  The server would be much more of a
> bottleneck than the client.
>
> > The problem is that the max throughput I can get on the machine is
> > around 28 tps, even though I increase the load further and only 65% CPU
> > is utilised (there is still 35% which is not being used). This clearly
> > indicates the software is the bottleneck, as there are enough hardware
> > resources.
>
> If your code is creating a client object before every single query, that
> could be part of the issue.  The benchmark code should be using the same
> client for all requests.  I really don't know how long it takes to
> create HttpSolrClient objects, but I don't imagine that it's super-fast.
>
> What version of SolrJ were you using?
>
> Depending on the SolrJ version you may need to create the client with a
> custom HttpClient object in order to allow it to handle plenty of
> threads.  This is how I create client objects in my SolrJ code:
>
>   RequestConfig rc = RequestConfig.custom()
>       .setConnectTimeout(2000)        // milliseconds
>       .setSocketTimeout(60000)        // milliseconds
>       .build();
>
>   CloseableHttpClient httpClient = HttpClients.custom()
>       .setDefaultRequestConfig(rc)
>       .setMaxConnPerRoute(1024)
>       .setMaxConnTotal(4096)
>       .disableAutomaticRetries()
>       .build();
>
>   SolrClient sc = new HttpSolrClient.Builder()
>       .withBaseSolrUrl(solrUrl)
>       .withHttpClient(httpClient)
>       .build();
>
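As an aside, the client-object pooling Deepak describes can be sketched generically with a blocking queue. `ClientPool` below is a hypothetical helper, not part of SolrJ; for `HttpSolrClient` a single shared instance is usually enough, since the underlying HttpClient pools connections itself:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Minimal fixed-size pool of reusable client objects (illustrative sketch). */
class ClientPool<T> {
    private final BlockingQueue<T> idle;

    ClientPool(List<T> clients) {
        // Fair queue so waiting threads borrow in FIFO order.
        this.idle = new ArrayBlockingQueue<>(clients.size(), true, clients);
    }

    /** Blocks until a client is free, then hands it out. */
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for a client", e);
        }
    }

    /** Returns a borrowed client to the pool for reuse. */
    void release(T client) {
        idle.offer(client);
    }
}
```

Each worker thread would `borrow()`, issue its query, and `release()` in a finally block; the pool size caps concurrent clients without creating one per request.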
> I tried the above suggestion. The throughput and utilisation remain the
same (they don't increase even if I increase the load). The response time
comes down.







Software                   Throughput (/sec)   Response Time (msec)   Utilization (%CPU)
UnTuned (Windows)          27.8                1426                   65
UnTuned (Linux)            -                   -                      -
Partially Tuned (Linux)    -                   -                      -
Partially Tuned (Windows)  28.1                1.105                  60

I am going to give your suggestion a spin on Linux next (this might take a
day or two).



> Thanks,
> Shawn
>
>




Re: Get terms in solr not working

2018-03-21 Thread Joel Bernstein
Also what is the use case? What do you plan to do with terms? There may be
other approaches that will work better than the terms query.
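For context, the two common ways to pull top terms, sketched as request fragments (collection and field names are placeholders):

```
# Terms component: raw indexed terms, fast but computed per shard
/solr/<collection>/terms?terms.fl=text_field&terms.limit=10&terms.sort=count

# Field facet: distributed counts over matching docs, heavier on huge fields
/solr/<collection>/select?q=*:*&rows=0&facet=true&facet.field=text_field&facet.limit=10
```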

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Mar 21, 2018 at 9:28 AM, Erick Erickson 
wrote:

> We need a lot more information. What is the exact query you're using?
> Is 100M the number of docs? How many terms are in the field?
>
> On Tue, Mar 20, 2018 at 10:39 PM, adam rag  wrote:
> > To get the top words in my Apache Solr instance, I am using the "terms"
> > query. When I try to get the top 10 terms from 100 million documents,
> > the data comes back after a few minutes, but with 300 million documents
> > Solr does not respond. My server memory is 100 GB.
>


Re: Upgrading a Plugin from 6.6 to 7.x

2018-03-21 Thread Atita Arora
Hi Peter,


*(Sorry for the earlier incomplete email - I hit send by mistake)*

I haven't really been able to look into it completely, but my first glance
says it should be because the method signature has changed.

I am looking here:
https://lucene.apache.org/core/7_0_0/core/org/apache/lucene/search/Query.html

createWeight(IndexSearcher searcher, boolean needsScores, float boost)
Expert: Constructs an appropriate Weight implementation for this query.

While the 6.6 javadoc has:

https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/Query.html

createWeight(IndexSearcher searcher, boolean needsScores)
Expert: Constructs an appropriate Weight implementation for this query.

You would need a code change for this to make it work in Version 7.
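Side by side, the change in the javadocs linked above (in 7.0 the boost is passed directly into createWeight instead of being applied through Weight normalization):

```java
// Lucene 6.6
public Weight createWeight(IndexSearcher searcher, boolean needsScores)
        throws IOException

// Lucene 7.0 — adds the boost parameter
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost)
        throws IOException
```

A plugin overriding the two-argument form compiles against 7.x but no longer overrides anything, so the base class's default behavior surfaces instead.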

Thanks,
Atita


On Wed, Mar 21, 2018 at 6:59 PM, Atita Arora  wrote:

> Hi Peter,
>
> I haven't really been able to look into it completely, but my first
> glance says it should be because the method signature has changed.
>
> I am looking here:
> https://lucene.apache.org/core/7_0_0/core/org/apache/lucene/search/Query.html
>
> createWeight(IndexSearcher searcher, boolean needsScores, float boost)
> Expert: Constructs an appropriate Weight implementation for this query.
>
> While at :
>
>
> On Wed, Mar 21, 2018 at 4:16 PM, Peter Alexander Kopciak  > wrote:
>
>> Hi!
>>
>> I'm still pretty new to Solr and I want to use the vector Scoring plugin (
>> https://github.com/saaay71/solr-vector-scoring/network) but
>> unfortunately,
>> it does not seem to work for newer Solr versions.
>>
>> I tested it with 6.6 to verify its functionality, so it seems to be broken
>> because of the upgrade to 7.x.
>>
>> When following the installation procedure and executing the examples, I
>> ran
>> into the following error with Query 1:
>>
>> java.lang.UnsupportedOperationException: Query {! type=vp f=vector
>> vector=0.1,4.75,0.3,1.2,0.7,4.0 v=} does not implement createWeight
>>
>> Does anyone have a lead for me on how to fix/upgrade the plugin? The
>> createWeight method seems to exist, so I'm not sure where to start and
>> what the problem seems to be.
>>
>
>


Re: Upgrading a Plugin from 6.6 to 7.x

2018-03-21 Thread Atita Arora
Hi Peter,

I haven't really been able to look into it completely, but my first glance
says it should be because the method signature has changed.

I am looking here:
https://lucene.apache.org/core/7_0_0/core/org/apache/lucene/search/Query.html

createWeight(IndexSearcher searcher, boolean needsScores, float boost)
Expert: Constructs an appropriate Weight implementation for this query.

While at :


On Wed, Mar 21, 2018 at 4:16 PM, Peter Alexander Kopciak 
wrote:

> Hi!
>
> I'm still pretty new to Solr and I want to use the vector Scoring plugin (
> https://github.com/saaay71/solr-vector-scoring/network) but unfortunately,
> it does not seem to work for newer Solr versions.
>
> I tested it with 6.6 to verify its functionality, so it seems to be broken
> because of the upgrade to 7.x.
>
> When following the installation procedure and executing the examples, I ran
> into the following error with Query 1:
>
> java.lang.UnsupportedOperationException: Query {! type=vp f=vector
> vector=0.1,4.75,0.3,1.2,0.7,4.0 v=} does not implement createWeight
>
> Does anyone have a lead for me on how to fix/upgrade the plugin? The
> createWeight method seems to exist, so I'm not sure where to start and
> what the problem seems to be.
>


Re: Get terms in solr not working

2018-03-21 Thread Erick Erickson
We need a lot more information. What is the exact query you're using?
Is 100M the number of docs? How many terms are in the field?

On Tue, Mar 20, 2018 at 10:39 PM, adam rag  wrote:
> To get the top words in my Apache Solr instance, I am using the "terms"
> query. When I try to get the top 10 terms from 100 million documents, the
> data comes back after a few minutes, but with 300 million documents Solr
> does not respond. My server memory is 100 GB.


Re: Solr main replica down, another replica taking over

2018-03-21 Thread Shawn Heisey

On 3/21/2018 12:04 AM, Midas A wrote:

We want to send less traffic over virtual machines and more to physical
servers. How can we achieve this?


At the moment, I do not know of any functionality in SolrCloud to 
accomplish this goal.  As I mentioned before, there is work underway to 
make it possible, but it's not available yet.


One thing you could do is include preferLocalShards=true as a URL 
parameter and only send requests to the physical servers (unless they 
are down), but to do that, you'll have to handle load balancing yourself.
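A sketch of such a request (host and collection names are placeholders):

```
http://physical-host:8983/solr/<collection>/select?q=field:value&preferLocalShards=true
```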


Thanks,
Shawn



Upgrading a Plugin from 6.6 to 7.x

2018-03-21 Thread Peter Alexander Kopciak
Hi!

I'm still pretty new to Solr and I want to use the vector Scoring plugin (
https://github.com/saaay71/solr-vector-scoring/network) but unfortunately,
it does not seem to work for newer Solr versions.

I tested it with 6.6 to verify its functionality, so it seems to be broken
because of the upgrade to 7.x.

When following the installation procedure and executing the examples, I ran
into the following error with Query 1:

java.lang.UnsupportedOperationException: Query {! type=vp f=vector
vector=0.1,4.75,0.3,1.2,0.7,4.0 v=} does not implement createWeight

Does anyone have a lead for me on how to fix/upgrade the plugin? The
createWeight method seems to exist, so I'm not sure where to start and
what the problem seems to be.


Get terms in solr not working

2018-03-21 Thread adam rag
To get the top words in my Apache Solr instance, I am using the "terms"
query. When I try to get the top 10 terms from 100 million documents, the
data comes back after a few minutes, but with 300 million documents Solr
does not respond. My server memory is 100 GB.


Re: Solr main replica down, another replica taking over

2018-03-21 Thread Midas A
Thanks Shawn,

We want to send less traffic over virtual machines and more to physical
servers. How can we achieve this?

On Wed, Mar 21, 2018 at 11:02 AM, Shawn Heisey  wrote:

> On 3/20/2018 11:18 PM, Midas A wrote:
>
>> I have one question here
>> a) Does SolrCloud load balance requests internally (round robin or
>> anything else)?
>>
>
> Yes, SolrCloud does load balance requests across active replicas in the
> entire cloud.  I do not know what algorithm it uses for load balancing --
> whether that's round-robin or something else.
>
>> b) How can I change this behaviour? (Note: I have a SolrCloud with a
>> mix of physical and virtual machines.)
>>
>
> There is some effort underway to allow SolrCloud to prefer specific
> replica types.  Recent versions of Solr added TLOG and PULL types, to
> supplement the NRT type that all versions of SolrCloud have.  There is
> strong interest in being able to prefer one of the new types and let the
> NRT replicas handle indexing only when possible.
>
> There is already a "preferLocalShards" parameter ... but enabling this
> parameter can actually make performance *worse*, by concentrating requests
> onto a single machine and leaving the other machines in the cloud idle.
>
> Thanks,
> Shawn
>
>