Is Pivoted Grouping possible?

2015-12-21 Thread Lewin Joy (TMS)
Hi,

I am working with Solr 4.10.3, and we are trying to retrieve documents 
under categories and sub-categories.
With grouping we are able to bring back n records under each group.
Could we have a pivoted grouping where I could also bring back the results from 
sub-categories?

Example:


Apparel
  Shirts
    {id:1, Blue shirt}
    {id:2, Green shirt}
  Pants
    {id:10, Blue Pants}
    {id:20, Grey Pants}
Sports
  Basketball
    {id:45, Black Basketball}
    {id:32, Basketball hoop}


I know we could get the number of records under each sub-category using 
facet.pivot=category,sub-cat.
Grouping can also give me the records under each group.
Is there a way to combine the two to give us pivoted groups? Or is there an 
alternative way to produce these results?

Thanks,
Lewin


RE: Is Pivoted Grouping possible?

2015-12-21 Thread Lewin Joy (TMS)
If there is even a string concatenation function we could use, we could produce 
similar result sets. Is that possible?
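
For illustration, here is one shape that workaround could take, as a sketch only: a hypothetical cat_path field (populated at index time by concatenating category and sub-category into e.g. "Apparel/Shirts") is grouped on, so that each group corresponds to one category/sub-category pair. The core name is also illustrative.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PivotedGroupingSketch {
  public static void main(String[] args) throws Exception {
    // SolrJ 4.x client; the core name and the cat_path field are assumptions
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/products");
    SolrQuery q = new SolrQuery("*:*");
    q.set("group", "true");
    q.set("group.field", "cat_path"); // holds "category/sub-category", e.g. "Apparel/Shirts"
    q.set("group.limit", "2");        // n records per (category, sub-category) pair
    QueryResponse rsp = solr.query(q);
    // each returned group approximates one leaf of the pivoted layout above
    System.out.println(rsp.getGroupResponse().getValues().get(0).getValues());
    solr.shutdown();
  }
}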

-Lewin

-----Original Message-----
From: Lewin Joy (TMS) [mailto:lewin@toyota.com] 
Sent: Monday, December 21, 2015 12:16 PM
To: solr-user@lucene.apache.org
Subject: Is Pivoted Grouping possible?

Hi,

I am working with Solr 4.10.3, and we are trying to retrieve documents 
under categories and sub-categories.
With grouping we are able to bring back n records under each group.
Could we have a pivoted grouping where I could also bring back the results from 
sub-categories?

Example:


Apparel
  Shirts
    {id:1, Blue shirt}
    {id:2, Green shirt}
  Pants
    {id:10, Blue Pants}
    {id:20, Grey Pants}
Sports
  Basketball
    {id:45, Black Basketball}
    {id:32, Basketball hoop}


I know we could get the number of records under each sub-category using 
facet.pivot=category,sub-cat.
Grouping can also give me the records under each group.
Is there a way to combine the two to give us pivoted groups? Or is there an 
alternative way to produce these results?

Thanks,
Lewin


Re: Re: Re: Some problems when upload data to index in cloud environment

2015-12-21 Thread 周建二
Erick:


Thank you so much for your advice. Right now we do not index a large number of 
files, but in future we may. I will pay more attention to 
ExtractingRequestHandler. Thanks again.


Best regards,
Jianer


> -----Original Message-----
> From: "Erick Erickson" 
> Sent: Tuesday, December 22, 2015
> To: solr-user 
> Cc: 
> Subject: Re: Re: Some problems when upload data to index in cloud environment
> 
> Jianer:
> 
> Getting your head around the configs is, indeed, "exciting" at times.
> 
> I just wanted to caution you that using ExtractingRequestHandler
> puts the Tika parsing load on the Solr server, which doesn't
> scale as the same machine that's serving queries and indexing
> is _also_ parsing potentially very large files. It may not matter
> if you don't do it often, but if you're going to index a large number
> of files and/or you're going to do this continuously, you probably
> want to move the parsing off Solr. Here's an example with DB
> as well, but the DB bits can be removed easily.
> 
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> 
> Best,
> Erick
> 
> On Sun, Dec 20, 2015 at 9:29 PM, 周建二  wrote:
> > Hi Shawn, thanks for your reply. :)
> >
> >
> > It is because the /update/extract handler is not defined in my collection's 
> > solrconfig.xml file as I upload the basic_configs/conf to ZooKeeper. When I 
> > upload sample_techproducts_configs to ZooKeeper, everything goes well.
> >
> >
> > I am new to Solr. Now I am going to learn schema.xml and 
> > solrconfig.xml, and try to make my own config for my dataset based on the 
> > basic_configs.
> >
> >
> > Thanks again.
> > Jianer
> >
> >
> >> -----Original Message-----
> >> From: "Shawn Heisey" 
> >> Sent: Sunday, December 20, 2015
> >> To: solr-user@lucene.apache.org
> >> Cc:
> >> Subject: Re: Some problems when upload data to index in cloud environment
> >>
> >> On 12/18/2015 6:16 PM, 周建二 wrote:
> >> > I am building a solr cloud production environment. My solr version is 
> >> > 5.3.1. The environment consists of three nodes running CentOS 6.5. First I 
> >> > build the ZooKeeper ensemble on the three nodes, then run Solr on 
> >> > the three nodes, and at last build a collection consisting of three shards, 
> >> > each shard having two replicas. After that we can see the cloud 
> >> > structure on the Solr Admin page.
> >>
> >> 
> >>
> >> > HTTP ERROR 404
> >> >
> >> > Problem accessing /solr/cloud-test/update/extract. Reason:
> >>
> >> One of two problems is likely:  Either there is no collection named
> >> "cloud-test" on your cloud, or the /update/extract handler is not
> >> defined in that collection's solrconfig.xml file.  The active version of
> >> this file lives in zookeeper when you're running SolrCloud.
> >>
> >> If you're sure a collection with this name exists, how exactly did you
> >> create it?  Was it built with one of the sample configs or with a config
> >> that you built yourself?
> >>
> >> Of the three configsets included with the Solr download,
> >> data_driven_schema_configs and sample_techproducts_configs contain the
> >> /update/extract handler.  The configset named basic_configs does NOT
> >> contain the handler.
> >>
> >> Thanks,
> >> Shawn
> >>
> >
> >
> >





Json facet api method stream

2015-12-21 Thread Yago Riveiro
Hi,

Does the json facet API method "stream" use docValues internally to do the
aggregation on the fly?

I want to know whether using this method justifies having docValues configured
in the schema.



-
Best regards
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Json-facet-api-method-stream-tp4246520.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: TPS with Solr Cloud

2015-12-21 Thread Walter Underwood
How many documents do you have? How big is the index?

You can increase total throughput with replicas. Shards will make it slower, 
but allow more documents.

At 8000 queries/s, I assume you are running the same query over and over. If so, 
that is a terrible benchmark. Everything is served out of cache.

Test with production logs. Choose logs where the number of distinct queries is 
much larger than your cache sizes. If your caches hold 1024 entries, it would be good to 
have 100K distinct queries. That might mean a total log size of a few 
million queries.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Dec 21, 2015, at 9:47 AM, Upayavira  wrote:
> 
> 
> You add shards to reduce response times. If your responses are too slow
> for 1 shard, try it with three. Skip two for reasons stated above.
> 
> Upayavira
> 
> On Mon, Dec 21, 2015, at 04:27 PM, Erick Erickson wrote:
>> 8,000 TPS almost certainly means you're firing the same (or
>> same few) requests over and over and hitting the queryResultCache,
>> look in the adminUI>>core>>plugins/stats>>cache>>queryResultCache.
>> I bet you're seeing a hit ratio near 100%. This is what Toke means
>> when he says your tests are too lightweight.
>> 
>> 
>> As others have outlined, to increase TPS (after you straighten out
>> your test harness) you add _replicas_ rather than add _shards_.
>> Only add shards when your collections are too big to fit on a single
>> Solr instance.
>> 
>> Best,
>> Erick
>> 
>> On Mon, Dec 21, 2015 at 1:56 AM, Emir Arnautovic
>>  wrote:
>>> Hi Anshul,
>>> TPS depends on the number of concurrent requests you can run and request
>>> processing time. With sharding you reduce processing time by reducing the
>>> amount of data a single node processes, but you have the overhead of inter-shard
>>> communication and merging results from different shards. If that overhead is
>>> smaller than the time you gain by processing half of the index, you will see
>>> an increase in TPS. If you are running the same query in a loop, the first request will
>>> be processed and the others will likely be returned from cache, so response time
>>> will not vary with index size, hence the sharding overhead will cause TPS to go
>>> down.
>>> If you are happy with your response time and want more TPS, you go with
>>> replication - that will increase the number of concurrent requests you can run.
>>> 
>>> Also, make sure your tests are realistic in order to avoid having false
>>> estimates and surprises when you start running real load.
>>> 
>>> Regards,
>>> Emir
>>> 
>>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/
>>> 
>>> 
>>> 
>>> 
>>> On 21.12.2015 08:18, Anshul Sharma wrote:
 
 Hi,
 I am trying to evaluate solr for one of my projects, for which i need to
 check the scalability in terms of tps (transactions per second) for my
 application.
 I have configured solr on 1 AWS server as standalone application which is
 giving me a tps of ~8000 for my query.
 In order to test the scalability, i have done sharding of the same data
 across two AWS servers with 2.5 million records each. When i try to query
 the cluster with the same query as before it gives me a tps of ~2500.
 My understanding is the tps should have increased in a cluster as
 these are two different machines which will perform separate I/O
 operations.
 I have not configured any separate load balancer as the documentation says that
 by default solr cloud will perform load balancing in a round robin
 fashion.
 Can you please help me in understanding the issue.
 
>>> 
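
For reference, the "add replicas" advice above maps to the Collections API ADDREPLICA action; a minimal SolrJ sketch, with the zkHost, collection and shard names illustrative:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class AddReplicaSketch {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "ADDREPLICA"); // Collections API action
    params.set("collection", "mycollection");
    params.set("shard", "shard1");
    QueryRequest request = new QueryRequest(params);
    request.setPath("/admin/collections");
    System.out.println(client.request(request)); // prints the response NamedList
    client.close();
  }
}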



Re: facet component and uninverted field

2015-12-21 Thread Jamie Johnson
Thanks, the issue I'm having is that there is no equivalent to method uif
for the standard facet component.  We'll see how SOLR-8096 shakes out.

On Sun, Dec 20, 2015 at 11:29 PM, Upayavira  wrote:

>
>
> On Sun, Dec 20, 2015, at 01:32 PM, Jamie Johnson wrote:
> > For those interested I've attached an initial patch to
> > https://issues.apache.org/jira/browse/SOLR-8096 to start supporting uif
> > in
> > FacetComponent via JSON facet api.
> > On Dec 18, 2015 9:22 PM, "Jamie Johnson"  wrote:
> >
> > > I recently saw that the new JSON Facet API supports controlling the
> facet
> > > method that is used and was wondering if there was any support for
> doing
> > > the same thing in the original facet component?
> > >
> > > Also is there a plan to deprecate one of these components over the
> other
> > > or is there an expectation that both will continue to live on?
> Curious if
> > > I should bite the bullet and transition to the new JSON Facet API or
> not.
>
> facet.method specifies the method for faceting! But I suspect you've
> found that already.
>
> As to deprecation, these sort of things in my experience don't get
> deprecated as such, we just find that one gets better than the other -
> the better it gets, the more adoption it sees.
>
> Upayavira
>
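
For reference, a sketch of that parameter on the classic facet component; the field name is illustrative, and at the time of this thread the accepted values were enum, fc and fcs (uif arrived on the classic component later):

import org.apache.solr.client.solrj.SolrQuery;

public class FacetMethodSketch {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    q.addFacetField("category");  // field name is illustrative
    q.set("facet.method", "fc");  // classic component: enum | fc | fcs
    System.out.println(q);
  }
}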


Re: Permutations of entries in a multivalued field

2015-12-21 Thread Johannes Riedl

Thanks a lot for these useful hints.

Best,

Johannes

On 18.12.2015 20:59, Allison, Timothy B. wrote:

Duh, didn't realize you could set inOrder in Solr.  Y, that's the better 
solution.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, December 18, 2015 2:27 PM
To: solr-user 
Subject: Re: Permutations of entries in a multivalued field

The other thing to check is the ComplexPhraseQueryParser, see:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

It uses the Span queries to build up the query...

Best,
Erick

On Fri, Dec 18, 2015 at 11:23 AM, Allison, Timothy B.
 wrote:

Hi Johannes,
   I suspect that Scott's answer would be more efficient than the following, 
and I may be misunderstanding the problem!

  This type of search is supported at the Lucene level by a SpanNearQuery with 
inOrder set to false.

  So, how do you get a SpanQuery in Solr?  You might want to look at the 
SurroundQueryParser, and I have an alternate (LUCENE-5205/SOLR-5410) here: 
https://github.com/tballison/lucene-addons.

  If you do find an appropriate parser, make sure that your position increment gap 
is > 0 on your text field definition, and then you'd never incorrectly get a 
hit across field entries of:

[0] A B
[1] C

Best,
Tim

On Wed, Dec 16, 2015 at 8:38 AM, Johannes Riedl < 
johannes.ri...@uni-tuebingen.de> wrote:


Hello all,

we are facing the following problem: we use a multivalued string
field that contains entries of the kind A/B/C/, where A, B, C are terms.
We are now looking for a simple way to also find all permutations of
A/B/C, e.g. B/A/C. As a workaround we added a new field that
contains all entries alphabetically sorted, and we guarantee sorting on the user 
side.
However - since this is limited in some ways - is there a simple way
to either index in a way such that solely A/B/C and all its permutations
are found (using e.g. type=text is not an option, since a term could
occur in a different entry of the multivalued field) or to trigger an
alphabetical sorting of incoming queries?

Thanks a lot for your feedback, best regards

Johannes




--
Scott Stults | Founder & Solutions Architect | OpenSource Connections,
LLC
| 434.409.2780
http://www.opensourceconnections.com
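
Putting the suggestions in this thread together, a minimal sketch of the ComplexPhraseQueryParser route (the field name and terms are illustrative): inOrder=false is what allows B A C to match a search for "A B C", and since the parser builds span queries under the hood, this amounts to the SpanNearQuery approach Tim describes.

import org.apache.solr.client.solrj.SolrQuery;

public class UnorderedPhraseSketch {
  public static void main(String[] args) {
    // inOrder=false makes the phrase match its terms in any order
    SolrQuery q = new SolrQuery("{!complexphrase inOrder=false}entries:\"A B C\"");
    System.out.println(q.getQuery());
  }
}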




new data structure for some fields

2015-12-21 Thread Abhishek Mishra
Hello all

I am facing a requirement where an id p1 is associated
with some category_ids c1,c2,c3,c4, each with an integer b1,b2,b3,b4. We need
to sort Solr query results on the basis of b1/b2/b3/b4 depending on the given
category_id. Right now we map the category_ids into a multi-valued
attribute, [c1,c2,c3,c4], and we query against it. But
now we also need to find which integer b1,b2,b3... is associated with a
given category, and to sort the whole result set on it.


sorry for any typos..

Regards
Abhishek
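
One common pattern for this kind of requirement, sketched below under stated assumptions: declare a dynamic integer field in the schema (the rank_*_i name is hypothetical) and, at index time, store each category's integer in its own field (rank_c1_i=b1, rank_c2_i=b2, and so on). A query can then filter on the category and sort on that category's dedicated field:

import org.apache.solr.client.solrj.SolrQuery;

public class PerCategorySortSketch {
  public static void main(String[] args) {
    // assumes a schema dynamic field like rank_*_i, populated at index time
    String category = "c1";
    SolrQuery q = new SolrQuery("category_id:" + category);
    q.setSort("rank_" + category + "_i", SolrQuery.ORDER.asc); // sorts on b for the queried category
    System.out.println(q);
  }
}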


Solr 5.4, NGramFilterFactory highlighting

2015-12-21 Thread Bjørn Hjelle
Hi,

I have problems getting hit highlighting to work in NGram fields with
search terms longer than 8 characters.
Without the luceneMatchVersion="4.3" parameter in the field type
definition, the whole word is highlighted, not just the search term.


Here are the exact steps to reproduce the issue:

Download Solr 5.4.0:

$ wget http://archive.apache.org/dist/lucene/solr/5.4.0/solr-5.4.0.tgz
$ tar xvzf solr-5.4.0.tgz

Start solr:

$ cd solr-5.4.0
$ bin/solr start

In another command prompt, create a core:

$ bin/solr create_core -c test -d
server/solr/configsets/sample_techproducts_configs


Add to server/solr/test/conf/schema.xml:

[field type and field definitions stripped by the mail archive: an n-gram
field type built around NGramFilterFactory, carrying the luceneMatchVersion="4.3"
parameter mentioned above, plus a name_ngram field of that type]

Reload the core to pick up config changes:
$ curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=test"


Create file doc.xml with contents:

<add>
  <doc>
    <field name="id">DOC2</field>
    <field name="name_ngram">thisisalongword in the document</field>
  </doc>
</add>


Index the document:

$ bin/post -c test doc.xml


Perform a search that shows that we find the document and the search term
is highlighted:
http://localhost:8983/solr/test/select?q=name_ngram%3Athis&wt=json&indent=true&hl=true&hl.fl=name_ngram&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E

  "highlighting":{
"DOC2":{
  "name_ngram":["thisisalongword in the document"]}}}


Add more characters to the search term, we still find the document, but the
search term is now NOT highlighted:

http://localhost:8983/solr/test/select?q=name_ngram%3Athisisalong&wt=json&indent=true&hl=true&hl.fl=name_ngram&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E

  "highlighting":{
"DOC2":{
  "name_ngram":["thisisalongword in the document"]}}}


Thank you,
Bjørn Hjelle


Re: Json facet api method stream

2015-12-21 Thread Yonik Seeley
On Mon, Dec 21, 2015 at 6:56 PM, Yago Riveiro  wrote:
> Does the json facet API method "stream" use docValues internally to do the
> aggregation on the fly?
>
> I want to know whether using this method justifies having docValues configured
> in the schema.

It won't use docValues for the actual field being faceted on (because
streaming in term order means that it's most efficient to use the term
index and not docValues to find all of the docs that match a given
term).

It will use docValues for sub-facets/stats.

-Yonik
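
To make the distinction concrete, a sketch of such a request via SolrJ (the collection, field names and sub-aggregation are illustrative): the terms facet on cat with method "stream" walks the term index, while the nested avg(price) aggregation is where docValues pay off.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StreamFacetSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    // method:stream iterates the term index for "cat" in term order;
    // the avg(price) sub-aggregation reads docValues on "price"
    q.set("json.facet",
        "{cats:{type:terms, field:cat, method:stream, facet:{avg_price:\"avg(price)\"}}}");
    QueryResponse rsp = solr.query(q);
    System.out.println(rsp.getResponse().get("facets"));
    solr.close();
  }
}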


Re: solrcloud used a lot of memory and memory keep increasing during long time run

2015-12-21 Thread Erick Erickson
bq: What can we benefit from setting maxWarmingSearchers to a larger value

You really don't get _any_ value. That's in there as a safety valve to
prevent run-away resource consumption. Getting this warning in your logs
means you're mis-configuring your system. Increasing the value is almost
totally useless. It simply makes little sense to have your soft commit take
less time than your autowarming, that's a ton of wasted work for no
purpose. It's highly unlikely that your users _really_ need 1.5 second
latency, my bet is 10-15 seconds would be fine. You know best of course,
but this kind of requirement is often something that people _think_ they
need but really don't. It particularly amuses me when the time between when
a document changes and any attempt is made to send it to solr is minutes,
but the product manager insists that "Solr must show the doc within two
seconds of sending it to the index".

It's often actually acceptable for your users to know "it may take up to a
minute for the docs to be searchable". What's usually not acceptable is
unpredictability. But again that's up to your product managers.

bq: You mean if my custom SearchComponent opens a searcher, it will exceed the
limit set by maxWarmingSearchers?

Not at all. But if you don't close it properly (it's reference counted),
then more and more searchers will stay open, chewing up memory. So you may
just be failing to close them and seeing memory increase because of that.

Best,
Erick
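
One client-side way to get predictable near real time visibility without a 1.5 second soft commit is commitWithin; a minimal sketch, with the URL and the 15 second bound illustrative:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    // ask Solr to make the doc searchable within 15s; no explicit commit,
    // which leaves room for autowarming to finish between searchers
    solr.add(doc, 15000);
    solr.close();
  }
}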

On Mon, Dec 21, 2015 at 6:47 PM, zhenglingyun  wrote:

> Yes, I do have some custom "Tokenizer"s and "SearchComponent"s.
>
> Here is the screenshot:
>
>
> The number of opened searchers keeps changing. This time it’s 10.
>
> You mean if my custom SearchComponent opens a searcher, it will exceed
> the limit set by maxWarmingSearchers? I’ll check that, thanks!
>
> I have to commit at short intervals. Our application needs a near real time
> search
> service. But I’m not sure whether Solr can support NRT search in other
> ways. Can
> you give me some advice?
>
> The value of maxWarmingSearchers was copied from some example config, I
> think;
> I’ll try to set it back to 2.
>
> What can we benefit from setting maxWarmingSearchers to a larger value? I
> can't find
> the answer on Google or in the apache-solr-ref-guide.
>
>
>
>
> On Dec 22, 2015, at 00:34, Erick Erickson wrote:
>
> Do you have any custom components? Indeed, you shouldn't have
> that many searchers open. But could we see a screenshot? That's
> the best way to insure that we're talking about the same thing.
>
> Your autocommit settings are really hurting you. Your commit interval
> should be as long as you can tolerate. At that kind of commit frequency,
> your caches are of very limited usefulness anyway, so you can pretty
> much shut them off. Every 1.5 seconds, they're invalidated totally.
>
> Upping maxWarmingSearchers is almost always a mistake. That's
> a safety valve that's there in order to prevent runaway resource
> consumption and almost always means the system is mis-configured.
> I'd put it back to 2 and tune the rest of the system to avoid it rather
> than bumping it up.
>
> Best,
> Erick
>
> On Sun, Dec 20, 2015 at 11:43 PM, zhenglingyun 
> wrote:
>
> Just now, I see about 40 "Searchers@ main" displayed in Solr Web UI:
> collection -> Plugins/Stats -> CORE
>
> I think it’s abnormal!
>
> softcommit is set to 1.5s, but warmupTime needs about 3s.
> Does that lead to so many Searchers?
>
> maxWarmingSearchers is set to 4 in my solrconfig.xml;
> shouldn't that prevent Solr from creating more than 4 Searchers?
>
>
>
> On Dec 21, 2015, at 14:43, zhenglingyun wrote:
>
> Thanks Erick for pointing out that the memory changes in a sawtooth pattern.
> The problem that troubles me is that the bottom point of the sawtooth keeps
> increasing.
> And when the used capacity of the old generation exceeds the threshold set by
> CMS’s
> CMSInitiatingOccupancyFraction, GC keeps running and uses a lot of CPU
> cycles
> but the used old generation memory does not decrease.
>
> After I take Rahul’s advice, I decrease the Xms and Xmx from 16G to 8G, and
> adjust the parameters of JVM from
>   -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>   -XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70
>   -XX:+CMSParallelRemarkEnabled
> to
>   -XX:NewRatio=3
>   -XX:SurvivorRatio=4
>   -XX:TargetSurvivorRatio=90
>   -XX:MaxTenuringThreshold=8
>   -XX:+UseConcMarkSweepGC
>   -XX:+UseParNewGC
>   -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
>   -XX:+CMSScavengeBeforeRemark
>   -XX:PretenureSizeThreshold=64m
>   -XX:+UseCMSInitiatingOccupancyOnly
>   -XX:CMSInitiatingOccupancyFraction=50
>   -XX:CMSMaxAbortablePrecleanTime=6000
>   -XX:+CMSParallelRemarkEnabled
>   -XX:+ParallelRefProcEnabled
>   -XX:-CMSConcurrentMTEnabled
> which is taken from bin/solr.in.sh
> I hope this can reduce gc pause time and full gc times.
> And maybe the memory increasing problem will 

Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread elisabeth benoit
hello,

Yes, in the second case I get one document with a higher score. The relative
scoring between documents is not the same anymore.

best regards,
elisabeth

2015-12-22 4:39 GMT+01:00 Binoy Dalal :

> I have one query.
> In the second case do you get two records with the same lower scores or
> just one record with a lower score and the other with a higher one?
>
> On Mon, 21 Dec 2015, 18:45 elisabeth benoit 
> wrote:
>
> > Hello,
> >
> > I don't think the query is important in this case.
> >
> > After checking out solr's debug output, I don't think the query norm is
> > relevant either.
> >
> > I think the scoring changes because
> >
> > 1) in the first case, I have the same slop for the catchall and name fields. Both
> > match pf2 pf3. In this case, solr uses the max of both for scoring pf2 pf3 results.
> >
> > 2) In the second case, I have different slops, so solr uses the sum of the values
> > instead of the max.
> >
> >
> >
> > If anyone knows how to work around this, please let me know.
> >
> > Elisabeth
> >
> > 2015-12-21 11:22 GMT+01:00 Binoy Dalal :
> >
> > > What is your query?
> > >
> > > On Mon, 21 Dec 2015, 14:37 elisabeth benoit  >
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > I am using solr 4.10.1 and I have configured my pf2 pf3 like this
> > > >
> > > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > >
> > > > my search field (qf) is my catchall field
> > > >
> > > > I've been trying to change the slop in pf2, pf3 for catchall and synonyms
> > > (going
> > > > from 0, or default value for synonyms, to 1)
> > > >
> > > > pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > > pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > >
> > > > but some results are not ordered the same way anymore even if I get
> the
> > > > same MATCH values in debugQuery output
> > > >
> > > > For instance, for a doc matching bastill in catchall field (and
> nothing
> > > to
> > > > do with pf2, pf3!)
> > > >
> > > > with first pf2, pf3
> > > >
> > > > 0.5163083 = (MATCH) weight(catchall:bastill in 105256)
> > > [NoTFIDFSimilarity],
> > > > result of:
> > > >* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> > > > ), product of:
> > > >  * 0.5163083 = queryWeight,* product of:
> > > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > > 0.5163083 = queryNorm
> > > >   1.0 = fieldWeight in 105256, product of:
> > > > 1.0 = tf(freq=2.0), with freq of:
> > > >   2.0 = termFreq=2.0
> > > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > > 1.0 = fieldNorm(doc=105256)
> > > >   0.5163083 = (MATCH) weight(catchall:paris in 105256)
> > > > [NoTFIDFSimilarity], result of:
> > > > 0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0
> > > >
> > > > and when I change pf2 pf3 (the only change, same query, same docs)
> > > >
> > > > 0.47504464 = (MATCH) weight(catchall:paris in 105256)
> > > [NoTFIDFSimilarity],
> > > > result of:
> > > >* 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
> > > > ), product of:
> > > >  * 0.47504464 = queryWeight*, product of:
> > > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > > 0.47504464 = queryNorm
> > > >   1.0 = fieldWeight in 105256, product of:
> > > > 1.0 = tf(freq=6.0), with freq of:
> > > >   6.0 = termFreq=6.0
> > > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > > 1.0 = fieldNorm(doc=105256)
> > > >
> > > > so in the end, with the same MATCH results, in the first case I get two
> > > > documents
> > > > with the same score, and in the second case, one document has a higher score.
> > > >
> > > > This seems very strange. Does anyone have a clue what's going on?
> > > >
> > > > Thanks
> > > > Elisabeth
> > > >
> > > --
> > > Regards,
> > > Binoy Dalal
> > >
> >
> --
> Regards,
> Binoy Dalal
>


documentCache - max concurrent queries

2015-12-21 Thread Vincenzo D'Amore
Hi all,

looking at solr wiki

https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig

I found this:

"The size for the documentCache should always be greater than max_results
times the max_concurrent_queries, to ensure that Solr does not need to
refetch a document during a request."

Well, I'm not sure what max_concurrent_queries is.

I mean, is it the number of concurrent queries solr usually receives, or a
limit, the maximum number of concurrent queries solr is configured to handle?
And if it is a limit, where is it set?

Could anyone please help me?

Best regards,
Vincenzo

-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251
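
Reading the wiki's formula with concrete, purely illustrative numbers: if your handlers return at most rows=100 and the node serves around 50 queries in flight at once, the documentCache should hold more than 100 * 50 = 5,000 entries. There is no max_concurrent_queries setting in solrconfig.xml; in practice the number of in-flight queries is bounded by the servlet container's request thread pool (for example, Jetty's maxThreads).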


Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread Binoy Dalal
I have one query.
In the second case do you get two records with the same lower scores or
just one record with a lower score and the other with a higher one?

On Mon, 21 Dec 2015, 18:45 elisabeth benoit 
wrote:

> Hello,
>
> I don't think the query is important in this case.
>
> After checking out solr's debug output, I don't think the query norm is
> relevant either.
>
> I think the scoring changes because
>
> 1) in the first case, I have the same slop for the catchall and name fields. Both match
> pf2 pf3. In this case, solr uses the max of both for scoring pf2 pf3 results.
>
> 2) In the second case, I have different slops, so solr uses the sum of the values
> instead of the max.
>
>
>
> If anyone knows how to work around this, please let me know.
>
> Elisabeth
>
> 2015-12-21 11:22 GMT+01:00 Binoy Dalal :
>
> > What is your query?
> >
> > On Mon, 21 Dec 2015, 14:37 elisabeth benoit 
> > wrote:
> >
> > > Hello all,
> > >
> > > I am using solr 4.10.1 and I have configured my pf2 pf3 like this
> > >
> > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > >
> > > my search field (qf) is my catchall field
> > >
> > > I've been trying to change the slop in pf2, pf3 for catchall and synonyms
> > (going
> > > from 0, or default value for synonyms, to 1)
> > >
> > > pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > >
> > > but some results are not ordered the same way anymore even if I get the
> > > same MATCH values in debugQuery output
> > >
> > > For instance, for a doc matching bastill in catchall field (and nothing
> > to
> > > do with pf2, pf3!)
> > >
> > > with first pf2, pf3
> > >
> > > 0.5163083 = (MATCH) weight(catchall:bastill in 105256)
> > [NoTFIDFSimilarity],
> > > result of:
> > >* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> > > ), product of:
> > >  * 0.5163083 = queryWeight,* product of:
> > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > 0.5163083 = queryNorm
> > >   1.0 = fieldWeight in 105256, product of:
> > > 1.0 = tf(freq=2.0), with freq of:
> > >   2.0 = termFreq=2.0
> > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > 1.0 = fieldNorm(doc=105256)
> > >   0.5163083 = (MATCH) weight(catchall:paris in 105256)
> > > [NoTFIDFSimilarity], result of:
> > > 0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0
> > >
> > > and when I change pf2 pf3 (the only change, same query, same docs)
> > >
> > > 0.47504464 = (MATCH) weight(catchall:paris in 105256)
> > [NoTFIDFSimilarity],
> > > result of:
> > >* 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
> > > ), product of:
> > >  * 0.47504464 = queryWeight*, product of:
> > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > 0.47504464 = queryNorm
> > >   1.0 = fieldWeight in 105256, product of:
> > > 1.0 = tf(freq=6.0), with freq of:
> > >   6.0 = termFreq=6.0
> > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > 1.0 = fieldNorm(doc=105256)
> > >
> > > so in the end, with the same MATCH results, in the first case I get two
> > > documents
> > > with the same score, and in the second case, one document has a higher score.
> > >
> > > This seems very strange. Does anyone have a clue what's going on?
> > >
> > > Thanks
> > > Elisabeth
> > >
> > --
> > Regards,
> > Binoy Dalal
> >
>
-- 
Regards,
Binoy Dalal


Re: Slow query response.

2015-12-21 Thread Modassar Ather
Thanks Jack for your response.

The users of our application can enter a list of ids which the UI caps at
50k. All the ids are valid and match documents. We do faceting, grouping
etc. on the result set of up to 50k documents.
I checked and found that the query is not very resource intensive. It is
not eating up a lot of CPU, I/O or memory.

Regards,
Modassar

On Thu, Dec 17, 2015 at 8:44 PM, Jack Krupansky 
wrote:

> A single query with tens of thousands of terms is very clearly a misuse of
> Solr. If it happens to work at all, consider yourself lucky. Are you using
> a standard Solr query parser or the terms query parser that lets you write
> a raw list of terms to OR?
>
> Are your nodes CPU-bound or I/O-bound during those 50-second intervals? My
> bet is that your index does not fit fully in memory, causing lots of I/O to
> repeatedly page in portions of the index and probably additional CPU usage
> as well.
>
> How many rows are you returning on each query? Are you using all these
> terms just to filter a smaller query or to return a large bulk of
> documents?
>
>
> -- Jack Krupansky
>
> On Thu, Dec 17, 2015 at 7:01 AM, Modassar Ather 
> wrote:
>
> > Hi,
> >
> > I have a field f which is defined as follows.
> > [field definition stripped by the mail archive; it ended with omitNorms="true"/>]
> >
> > Solr-5.2.1 is used. The index is spread across 12 shards (no replica) and
> > the index size on each node is around 100 GB.
> >
> > When I search for 50 thousand values (ORed) in the field f, it takes
> > around 45 to 55 seconds.
> > Per my understanding it is too slow. Kindly share your thoughts on this
> > behavior and provide your suggestions.
> >
> > Thanks,
> > Modassar
> >
>
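
For reference, the terms query parser mentioned above (available since Solr 4.10) takes one comma-separated list of values instead of tens of thousands of OR clauses; a minimal sketch, with the field name and ids illustrative:

import org.apache.solr.client.solrj.SolrQuery;

public class TermsQuerySketch {
  public static void main(String[] args) {
    // {!terms} builds a single set-membership query rather than parsing 50k OR clauses
    String ids = String.join(",", "id1", "id2", "id3");
    SolrQuery q = new SolrQuery("{!terms f=f}" + ids);
    System.out.println(q.getQuery());
  }
}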


Re: TPS with Solr Cloud

2015-12-21 Thread Toke Eskildsen
Anshul Sharma  wrote:
> I have configured solr on 1 AWS server as standalone application which is
> giving me a tps of ~8000 for my query.

[...]

> In order to test the scalability, i have done sharding of the same data
> across two AWS servers with 2.5 milion records each .When i try to query
> the cluster with the same query as before it gives me a tps of ~2500 .

Sharding means two-phase processing and a merge of the shard-results. The 
overhead of sharding was larger than the gains, for your setup. I am afraid 
your test is too light-weight for performance-estimation at scale.

- Toke Eskildsen


Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread Binoy Dalal
What is your query?

On Mon, 21 Dec 2015, 14:37 elisabeth benoit 
wrote:

> Hello all,
>
> I am using solr 4.10.1 and I have configured my pf2 pf3 like this
>
> catchall~0^0.2 name~0^0.21 synonyms^0.2
> catchall~0^0.2 name~0^0.21 synonyms^0.2
>
> my search field (qf) is my catchall field
>
> > I've been trying to change the slop in pf2, pf3 for catchall and synonyms (going
> from 0, or default value for synonyms, to 1)
>
> pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
>
> but some results are not ordered the same way anymore even if I get the
> same MATCH values in debugQuery output
>
> For instance, for a doc matching bastill in catchall field (and nothing to
> do with pf2, pf3!)
>
> with first pf2, pf3
>
> 0.5163083 = (MATCH) weight(catchall:bastill in 105256) [NoTFIDFSimilarity],
> result of:
>* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> ), product of:
>  * 0.5163083 = queryWeight,* product of:
> 1.0 = idf(docFreq=134, maxDocs=12258543)
> 0.5163083 = queryNorm
>   1.0 = fieldWeight in 105256, product of:
> 1.0 = tf(freq=2.0), with freq of:
>   2.0 = termFreq=2.0
> 1.0 = idf(docFreq=134, maxDocs=12258543)
> 1.0 = fieldNorm(doc=105256)
>   0.5163083 = (MATCH) weight(catchall:paris in 105256)
> [NoTFIDFSimilarity], result of:
> 0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0
>
> and when I change pf2 pf3 (the only change, same query, same docs)
>
> 0.47504464 = (MATCH) weight(catchall:paris in 105256) [NoTFIDFSimilarity],
> result of:
>* 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
> ), product of:
>  * 0.47504464 = queryWeight*, product of:
> 1.0 = idf(docFreq=10958, maxDocs=12258543)
> 0.47504464 = queryNorm
>   1.0 = fieldWeight in 105256, product of:
> 1.0 = tf(freq=6.0), with freq of:
>   6.0 = termFreq=6.0
> 1.0 = idf(docFreq=10958, maxDocs=12258543)
> 1.0 = fieldNorm(doc=105256)
>
> so in the end, with the same MATCH results, in the first case I get two documents
> with the same score, and in the second case, one document has a higher score.
>
> This seems very strange. Does anyone have a clue what's going on?
>
> Thanks
> Elisabeth
>
-- 
Regards,
Binoy Dalal


Re: new data structure for some fields

2015-12-21 Thread Abhishek Mishra
hi binoy
thanks for the reply. By sort I mean sorting the result set on the basis of the
integer value given for that category.
For any document, say for an id P1,
the categories associated are c1,c2,c3,c4 (using a multivalued field).
For the new implementation,
similarly a number is associated with each category, say
c1---b1, c2---b2, c3---b3, c4---b4.
Now when we query solr for the ids which have c1 in their
categories (q=category_id:c1), I want the result of this query sorted
on the basis of the number (b) associated with it throughout the result.

The number of associations is usually less than 20 (meaning an id can't be mapped
to more than 20 category_ids)


On Mon, Dec 21, 2015 at 3:59 PM, Binoy Dalal  wrote:

> When you say sort, do you mean search on the basis of category and
> integers? Or score the docs based on their category and integer values?
>
> Also, for any given document, how many categories or integers are
> associated with it?
>
> On Mon, 21 Dec 2015, 14:43 Abhishek Mishra  wrote:
>
> > Hello all
> >
> > i am facing a requirement where an id p1 is
> associated
> > with some category_ids c1,c2,c3,c4 with some integers b1,b2,b3,b4. We
> need
> > to sort the query of solr on the basis of b1/b2/b3/b4 depending on given
> > category_id . Right now we mapped the category_ids into multi-valued
> > attribute. [c1,c2,c3,c4] something like this. we are querying into it.
> But
> > from now we also need to find which integer b1,b2,b3.. associated with
> > given category and also sort the whole query on it.
> >
> >
> > sorry for any typos..
> >
> > Regards
> > Abhishek
> >
> --
> Regards,
> Binoy Dalal
>


solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread elisabeth benoit
Hello all,

I am using solr 4.10.1 and I have configured my pf2 pf3 like this

catchall~0^0.2 name~0^0.21 synonyms^0.2
catchall~0^0.2 name~0^0.21 synonyms^0.2

my search field (qf) is my catchall field

I've been trying to change the slop in pf2, pf3 for catchall and synonyms (going
from 0, or the default value for synonyms, to 1)

pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2

but some results are not ordered the same way anymore even if I get the
same MATCH values in debugQuery output

For instance, for a doc matching bastill in catchall field (and nothing to
do with pf2, pf3!)

with first pf2, pf3

0.5163083 = (MATCH) weight(catchall:bastill in 105256) [NoTFIDFSimilarity],
result of:
   * 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
), product of:
 * 0.5163083 = queryWeight,* product of:
1.0 = idf(docFreq=134, maxDocs=12258543)
0.5163083 = queryNorm
  1.0 = fieldWeight in 105256, product of:
1.0 = tf(freq=2.0), with freq of:
  2.0 = termFreq=2.0
1.0 = idf(docFreq=134, maxDocs=12258543)
1.0 = fieldNorm(doc=105256)
  0.5163083 = (MATCH) weight(catchall:paris in 105256)
[NoTFIDFSimilarity], result of:
0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0

and when I change pf2 pf3 (the only change, same query, same docs)

0.47504464 = (MATCH) weight(catchall:paris in 105256) [NoTFIDFSimilarity],
result of:
   * 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
), product of:
 * 0.47504464 = queryWeight*, product of:
1.0 = idf(docFreq=10958, maxDocs=12258543)
0.47504464 = queryNorm
  1.0 = fieldWeight in 105256, product of:
1.0 = tf(freq=6.0), with freq of:
  6.0 = termFreq=6.0
1.0 = idf(docFreq=10958, maxDocs=12258543)
1.0 = fieldNorm(doc=105256)

so in the end, with the same MATCH results, in the first case I get two documents
with the same score, and in the second case, one document has a higher score.

This seems very strange. Does anyone have a clue what's going on?

Thanks
Elisabeth


Re: TPS with Solr Cloud

2015-12-21 Thread Emir Arnautovic

Hi Anshul,
TPS depends on the number of concurrent requests you can run and request 
processing time. With sharding you reduce processing time by reducing the 
amount of data a single node processes, but you have the overhead of inter-shard 
communication and merging results from different shards. If that 
overhead is smaller than the time you gain by processing half of the index, you 
will see an increase in TPS. If you are running the same query in a loop, the first 
request will be processed and the others will likely be returned from cache, 
so response time will not vary with index size, hence the sharding overhead 
will cause TPS to go down.
If you are happy with your response time and want more TPS, you go with 
replication - that will increase the number of concurrent requests you can run.


Also, make sure your tests are realistic in order to avoid having false 
estimates and surprises when you start running real load.


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 21.12.2015 08:18, Anshul Sharma wrote:

Hi,
I am trying to evaluate solr for one of my projects, for which i need to
check the scalability in terms of tps (transactions per second) for my
application.
I have configured solr on 1 AWS server as standalone application which is
giving me a tps of ~8000 for my query.
In order to test the scalability, i have done sharding of the same data
across two AWS servers with 2.5 million records each. When i try to query
the cluster with the same query as before it gives me a tps of ~2500.
My understanding is the tps should have increased in a cluster as
these are two different machines which will perform separate I/O operations.
I have not configured any separate load balancer as the documentation says that
by default solr cloud will perform load balancing in a round robin fashion.
Can you please help me in understanding the issue.



Re: new data structure for some fields

2015-12-21 Thread Binoy Dalal
When you say sort, do you mean search on the basis of category and
integers? Or score the docs based on their category and integer values?

Also, for any given document, how many categories or integers are
associated with it?

On Mon, 21 Dec 2015, 14:43 Abhishek Mishra  wrote:

> Hello all
>
> i am facing a requirement where an id p1 is associated
> with some category_ids c1,c2,c3,c4 with some integers b1,b2,b3,b4. We need
> to sort the query of solr on the basis of b1/b2/b3/b4 depending on given
> category_id . Right now we mapped the category_ids into multi-valued
> attribute. [c1,c2,c3,c4] something like this. we are querying into it. But
> from now we also need to find which integer b1,b2,b3.. associated with
> given category and also sort the whole query on it.
>
>
> sorry for any typos..
>
> Regards
> Abhishek
>
-- 
Regards,
Binoy Dalal


Re: Solr 6 Distributed Join

2015-12-21 Thread Akiel Ahmed
Thank you for the help. 

I am working through what I want to do with the join - will let you know 
if I hit any issues.



From:   Joel Bernstein 
To: solr-user@lucene.apache.org
Date:   17/12/2015 15:40
Subject:Re: Solr 6 Distributed Join



One thing to note about the hashJoin is that it requires the search 
results
from the hashed query to fit entirely in memory.

The innerJoin does not have this requirement as it performs a streaming
merge join.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Dec 17, 2015 at 10:33 AM, Joel Bernstein  
wrote:

> Below is an example of nested joins where the innerJoin is done in
> parallel using the parallel function. The partitionKeys parameter needs 
to
> be added to the searches when the parallel function is used to partition
> the results across worker nodes.
>
> hashJoin(
> parallel(workerCollection,
> innerJoin(
> search(users, q="*:*",
> fl="userId, full_name, hometown", sort="userId asc", zkHost="zk2:2345",
> qt="/export" partitionKeys="userId"),
> search(reviews, q="*:*",
> fl="userId, review, score", sort="userId asc", zkHost="zk1:2345",
> qt="/export" partitionKeys="userId"),
> on="userId"
> ),
>  workers="20",
>  zkHost="zk1:2345",
>  sort="userId asc"
>  ),
>hashed=search(restaurants, q="city:nyc", 
fl="restaurantId, restaurantName",
> sort="restaurantId asc", zkHost="zk1:2345", qt="/export"),
>on="restaurantId"
> )
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Dec 17, 2015 at 10:29 AM, Joel Bernstein 
> wrote:
>
>> The innerJoin joins two streams sorted by the same join keys (merge
>> join). If third stream has the same join keys you can nest innerJoins. 
But
>> all three tables need to be sorted by the same join keys to nest 
innerJoins
>> (merge joins).
>>
>> innerJoin(innerJoin(...),
>> search(...),
>> on...)
>>
>> If the third stream is joined on a different key you can nest inside a
>> hashJoin which doesn't require streams to be sorted on the join key. 
For
>> example:
>>
>> hashJoin(innerJoin(...),
>> hashed=search(...),
>> on..)
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed  
wrote:
>>
>>> Hi again,
>>>
>>> I got the join to work. A team mate pointed out that one of the search
>>> functions in the innerJoin query was missing a field in the join - adding
>>> the e1 field to the fl parameter of the second search function gave the
>>> result I expected:
>>>
>>>
>>>
>>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(
>>>   search(gettingstarted, fl="id", q=text:John, sort="id asc",
>>>     zkHost="localhost:9983", qt="/export"),
>>>   search(gettingstarted, fl="id,e1", q=text:Friends, sort="id asc",
>>>     zkHost="localhost:9983", qt="/export"),
>>>   on="id=e1")
>>>
>>> I am still interested in whether we can specify a join, using an
>>> arbitrary
>>> number of searches.
>>>
>>> Cheers
>>>
>>> Akiel
>>>
>>>
>>>
>>> From:   Akiel Ahmed/UK/IBM@IBMGB
>>> To: solr-user@lucene.apache.org
>>> Date:   16/12/2015 17:05
>>> Subject:Re: Solr 6 Distributed Join
>>>
>>>
>>>
>>> Hi Dennis,
>>>
>>> Thank you for your help. I used your explanation to construct an
>>> innerJoin
>>>
>>> query; I think I am getting further but didn't get the results I
>>> expected.
>>>
>>> The following describes what I did – is there any chance you can tell
>>> where I am going wrong:
>>>
>>> Solr 6 Developer Builds: #2738 and #2743
>>>
>>> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema so it
>>> reads:
>>>
>>> [schema XML mangled by the mail archive: it declared id as the uniqueKey,
>>> several single-valued fields with docValues="true" (including e1), a trie
>>> type with precisionStep="0" positionIncrementGap="0", and a text field type
>>> (positionIncrementGap="100") whose analyzer chain included
>>> WordDelimiterFilterFactory (generateWordParts, generateNumberParts,
>>> catenateWords, catenateNumbers, catenateAll and splitOnCaseChange all "1")
>>> and StopFilterFactory with words="lang/stopwords_en.txt"]
>>>
>>> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
>>> adding the following near the bottom of the file so it is the last request
>>> handler:
>>>
>>> [request handler XML stripped by the mail archive]

Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread elisabeth benoit
hello,

That's what I did, as I wrote in my mail yesterday. In the first case, solr
computes the max. In the second case, it sums both results.

That's why I don't get the same relative scoring between docs with the same
query.

2015-12-22 8:30 GMT+01:00 Binoy Dalal :

> Unless the content for both the docs is exactly the same it is highly
> unlikely that you will get the same score for the docs under different
> querying conditions. What you saw in the first case may have been a happy
> coincidence.
> Other than that it is very difficult to say why the scoring is different
> without getting a look at the actual query and the doc content.
>
> If you still wish to dig deeper, try to understand how solr actually scores
> documents that match your query. It takes into account a variety of factors
> to compute the cosine similarity to find the best match.
> You can find this formula and a decent explanation for it in the book Solr
> in Action, or online in the lucene docs:
>
> https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/search/Similarity.html
>
> On Tue, 22 Dec 2015, 11:10 elisabeth benoit 
> wrote:
>
> > hello,
> >
> > yes in the second case I get one document with a higher score. the
> relative
> > scoring between documents is not the same anymore.
> >
> > best regards,
> > elisabeth
> >
> > 2015-12-22 4:39 GMT+01:00 Binoy Dalal :
> >
> > > I have one query.
> > > In the second case do you get two records with the same lower scores or
> > > just one record with a lower score and the other with a higher one?
> > >
> > > On Mon, 21 Dec 2015, 18:45 elisabeth benoit  >
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I don't think the query is important in this case.
> > > >
> > > > After checking out solr's debug output, I don't think the query norm
> is
> > > > relevant either.
> > > >
> > > > I think the scoring changes because
> > > >
> > > > 1) in the first case, I have the same slop for the catchall and name fields. Both
> > > > match pf2 pf3. In this case, solr uses the max of both for scoring pf2 pf3
> > > > results.
> > > >
> > > > 2) In the second case, I have different slops, so solr uses the sum of
> > > > the values instead of the max.
> > > >
> > > >
> > > >
> > > > If anyone knows how to work around this, please let me know.
> > > >
> > > > Elisabeth
> > > >
> > > > 2015-12-21 11:22 GMT+01:00 Binoy Dalal :
> > > >
> > > > > What is your query?
> > > > >
> > > > > On Mon, 21 Dec 2015, 14:37 elisabeth benoit <
> > elisaelisael...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > I am using solr 4.10.1 and I have configured my pf2 pf3 like this
> > > > > >
> > > > > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > > > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > > > >
> > > > > > my search field (qf) is my catchall field
> > > > > >
> > > > > > I've been trying to change the slop in pf2, pf3 for catchall and
> > synonyms
> > > > > (going
> > > > > > from 0, or default value for synonyms, to 1)
> > > > > >
> > > > > > pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > > > > pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > > > >
> > > > > > but some results are not ordered the same way anymore even if I
> get
> > > the
> > > > > > same MATCH values in debugQuery output
> > > > > >
> > > > > > For instance, for a doc matching bastill in catchall field (and
> > > nothing
> > > > > to
> > > > > > do with pf2, pf3!)
> > > > > >
> > > > > > with first pf2, pf3
> > > > > >
> > > > > > 0.5163083 = (MATCH) weight(catchall:bastill in 105256)
> > > > > [NoTFIDFSimilarity],
> > > > > > result of:
> > > > > >* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> > > > > > ), product of:
> > > > > >  * 0.5163083 = queryWeight,* product of:
> > > > > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > > > > 0.5163083 = queryNorm
> > > > > >   1.0 = fieldWeight in 105256, product of:
> > > > > > 1.0 = tf(freq=2.0), with freq of:
> > > > > >   2.0 = termFreq=2.0
> > > > > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > > > > 1.0 = fieldNorm(doc=105256)
> > > > > >   0.5163083 = (MATCH) weight(catchall:paris in 105256)
> > > > > > [NoTFIDFSimilarity], result of:
> > > > > > 0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0
> > > > > >
> > > > > > and when I change pf2 pf3 (the only change, same query, same
> docs)
> > > > > >
> > > > > > 0.47504464 = (MATCH) weight(catchall:paris in 105256)
> > > > > [NoTFIDFSimilarity],
> > > > > > result of:
> > > > > >* 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
> > > > > > ), product of:
> > > > > >  * 0.47504464 = queryWeight*, product of:
> > > > > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > > > > 0.47504464 = queryNorm
> > > > > >   1.0 = fieldWeight 

Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread Binoy Dalal
Unless the content for both the docs is exactly the same it is highly
unlikely that you will get the same score for the docs under different
querying conditions. What you saw in the first case may have been a happy
coincidence.
Other than that it is very difficult to say why the scoring is different
without getting a look at the actual query and the doc content.

If you still wish to dig deeper, try to understand how solr actually scores
documents that match your query. It takes into account a variety of factors
to compute the cosine similarity to find the best match.
You can find this formula and a decent explanation for it in the book Solr
in Action, or online in the lucene docs:
https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/search/Similarity.html

On Tue, 22 Dec 2015, 11:10 elisabeth benoit 
wrote:

> hello,
>
> yes in the second case I get one document with a higher score. the relative
> scoring between documents is not the same anymore.
>
> best regards,
> elisabeth
>
> 2015-12-22 4:39 GMT+01:00 Binoy Dalal :
>
> > I have one query.
> > In the second case do you get two records with the same lower scores or
> > just one record with a lower score and the other with a higher one?
> >
> > On Mon, 21 Dec 2015, 18:45 elisabeth benoit 
> > wrote:
> >
> > > Hello,
> > >
> > > I don't think the query is important in this case.
> > >
> > > After checking out solr's debug output, I don't think the query norm is
> > > relevant either.
> > >
> > > I think the scoring changes because
> > >
> > > 1) in the first case, I have the same slop for the catchall and name fields. Both
> > > match pf2 pf3. In this case, solr uses the max of both for scoring pf2 pf3
> > > results.
> > >
> > > 2) In the second case, I have different slops, so solr uses the sum of
> > > the values instead of the max.
> > >
> > >
> > >
> > > If anyone knows how to work around this, please let me know.
> > >
> > > Elisabeth
> > >
> > > 2015-12-21 11:22 GMT+01:00 Binoy Dalal :
> > >
> > > > What is your query?
> > > >
> > > > On Mon, 21 Dec 2015, 14:37 elisabeth benoit <
> elisaelisael...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I am using solr 4.10.1 and I have configured my pf2 pf3 like this
> > > > >
> > > > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > > >
> > > > > my search field (qf) is my catchall field
> > > > >
> > > > > I've been trying to change the slop in pf2, pf3 for catchall and
> synonyms
> > > > (going
> > > > > from 0, or default value for synonyms, to 1)
> > > > >
> > > > > pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > > > pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > > >
> > > > > but some results are not ordered the same way anymore even if I get
> > the
> > > > > same MATCH values in debugQuery output
> > > > >
> > > > > For instance, for a doc matching bastill in catchall field (and
> > nothing
> > > > to
> > > > > do with pf2, pf3!)
> > > > >
> > > > > with first pf2, pf3
> > > > >
> > > > > 0.5163083 = (MATCH) weight(catchall:bastill in 105256)
> > > > [NoTFIDFSimilarity],
> > > > > result of:
> > > > >* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> > > > > ), product of:
> > > > >  * 0.5163083 = queryWeight,* product of:
> > > > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > > > 0.5163083 = queryNorm
> > > > >   1.0 = fieldWeight in 105256, product of:
> > > > > 1.0 = tf(freq=2.0), with freq of:
> > > > >   2.0 = termFreq=2.0
> > > > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > > > 1.0 = fieldNorm(doc=105256)
> > > > >   0.5163083 = (MATCH) weight(catchall:paris in 105256)
> > > > > [NoTFIDFSimilarity], result of:
> > > > > 0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0
> > > > >
> > > > > and when I change pf2 pf3 (the only change, same query, same docs)
> > > > >
> > > > > 0.47504464 = (MATCH) weight(catchall:paris in 105256)
> > > > [NoTFIDFSimilarity],
> > > > > result of:
> > > > >* 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
> > > > > ), product of:
> > > > >  * 0.47504464 = queryWeight*, product of:
> > > > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > > > 0.47504464 = queryNorm
> > > > >   1.0 = fieldWeight in 105256, product of:
> > > > > 1.0 = tf(freq=6.0), with freq of:
> > > > >   6.0 = termFreq=6.0
> > > > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > > > 1.0 = fieldNorm(doc=105256)
> > > > >
> > > > > so in the end, with the same MATCH results, in the first case I get two
> > > > > documents
> > > > > with the same score, and in the second case, one document has a higher
> > > > > score.
> > > > >
> > > > > This seems very strange. Does anyone have a clue what's going
> > > > > on?
> > > > >
> > > > > Thanks
> 

Re: new data structure for some fields

2015-12-21 Thread Binoy Dalal
Small edit:
The sort parameter in the solrconfig goes in the request handler
declaration that you're using. So if it's /select, put it in the defaults list.

On Mon, 21 Dec 2015, 17:21 Binoy Dalal  wrote:

> OK. You will only be able to sort based on the integers if the integer
> field is single-valued, i.e. only one integer is associated with one
> category id.
>
> To do this you have to use the sort parameter.
> You can either specify it in your solrconfig.xml like so:
> <str name="sort">integer asc</str>
> Field name followed by the order - asc/desc
>
> Or you can specify it along with your query by appending it like so:
> /select?q=query&sort=integer%20asc
>
> If you want to apply these sorting rules for all docs, then specify the
> sorting in your solrconfig. If you only want It for a certain subset then
> apply the parameter from code at the app level
>
> On Mon, 21 Dec 2015, 16:49 Abhishek Mishra  wrote:
>
>> hi binoy
>> thanks for the reply. By sort I mean sorting the result set on the basis of the
>> integer value given for that category.
>> For any document, say for an id P1,
>> the categories associated are c1,c2,c3,c4 (using a multivalued field).
>> For the new implementation,
>> similarly a number is associated with each category, say
>> c1---b1, c2---b2, c3---b3, c4---b4.
>> Now when we query solr for the ids which have c1 in their
>> categories (q=category_id:c1), I want the result of this query sorted
>> on the basis of the number (b) associated with it throughout the result.
>>
>> The number of associations is usually less than 20 (meaning an id can't be mapped
>> to more than 20 category_ids)
>>
>>
>> On Mon, Dec 21, 2015 at 3:59 PM, Binoy Dalal 
>> wrote:
>>
>> > When you say sort, do you mean search on the basis of category and
>> > integers? Or score the docs based on their category and integer values?
>> >
>> > Also, for any given document, how many categories or integers are
>> > associated with it?
>> >
>> > On Mon, 21 Dec 2015, 14:43 Abhishek Mishra 
>> wrote:
>> >
>> > > Hello all
>> > >
>> > > i am facing a requirement where an id p1 is
>> > associated
>> > > with some category_ids c1,c2,c3,c4 with some integers b1,b2,b3,b4. We
>> > need
>> > > to sort the query of solr on the basis of b1/b2/b3/b4 depending on
>> given
>> > > category_id . Right now we mapped the category_ids into multi-valued
>> > > attribute. [c1,c2,c3,c4] something like this. we are querying into it.
>> > But
>> > > from now we also need to find which integer b1,b2,b3.. associated with
>> > > given category and also sort the whole query on it.
>> > >
>> > >
>> > > sorry for any typos..
>> > >
>> > > Regards
>> > > Abhishek
>> > >
>> > --
>> > Regards,
>> > Binoy Dalal
>> >
>>
> --
> Regards,
> Binoy Dalal
>
-- 
Regards,
Binoy Dalal


Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread elisabeth benoit
Hello,

I don't think the query is important in this case.

After checking out Solr's debug output, I don't think the query norm is
relevant either.

I think the scoring changes because

1) In the first case, I have the same slop for the catchall and name fields.
Both match pf2 pf3. In this case, Solr uses the max of both for scoring pf2
pf3 results.

2) In the second case, I have different slops, and Solr uses the sum of the
values instead of the max.



If anyone knows how to work around this, please let me know.

Elisabeth
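
A workaround consistent with that explanation - offered here as a sketch
under the assumption that the relevance tradeoff is acceptable, not as a
tested fix - is to keep the slop uniform across all pf2/pf3 fields, so that
the max-based scoring of case 1 applies again:

    <str name="pf2">catchall~1^0.2 name~1^0.21 synonyms~1^0.2</str>
    <str name="pf3">catchall~1^0.2 name~1^0.21 synonyms~1^0.2</str>

Note that this raises the name field's slop from 0 to 1, trading its
tighter phrase matching for consistent scoring.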

2015-12-21 11:22 GMT+01:00 Binoy Dalal :

> What is your query?
>
> On Mon, 21 Dec 2015, 14:37 elisabeth benoit 
> wrote:
>
> > Hello all,
> >
> > I am using solr 4.10.1 and I have configured my pf2 pf3 like this
> >
> > <str name="pf2">catchall~0^0.2 name~0^0.21 synonyms^0.2</str>
> > <str name="pf3">catchall~0^0.2 name~0^0.21 synonyms^0.2</str>
> >
> > my search field (qf) is my catchall field
> >
> > I've been trying to change slop in pf2, pf3 for catchall and synonyms
> (going
> > from 0, or default value for synonyms, to 1)
> >
> > pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> >
> > but some results are not ordered the same way anymore even if I get the
> > same MATCH values in debugQuery output
> >
> > For instance, for a doc matching bastill in catchall field (and nothing
> to
> > do with pf2, pf3!)
> >
> > with first pf2, pf3
> >
> > 0.5163083 = (MATCH) weight(catchall:bastill in 105256)
> [NoTFIDFSimilarity],
> > result of:
> >* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> > ), product of:
> >  * 0.5163083 = queryWeight,* product of:
> > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > 0.5163083 = queryNorm
> >   1.0 = fieldWeight in 105256, product of:
> > 1.0 = tf(freq=2.0), with freq of:
> >   2.0 = termFreq=2.0
> > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > 1.0 = fieldNorm(doc=105256)
> >   0.5163083 = (MATCH) weight(catchall:paris in 105256)
> > [NoTFIDFSimilarity], result of:
> > 0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0
> >
> > and when I change pf2 pf3 (the only change, same query, same docs)
> >
> > 0.47504464 = (MATCH) weight(catchall:paris in 105256)
> [NoTFIDFSimilarity],
> > result of:
> >* 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
> > ), product of:
> >  * 0.47504464 = queryWeight*, product of:
> > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > 0.47504464 = queryNorm
> >   1.0 = fieldWeight in 105256, product of:
> > 1.0 = tf(freq=6.0), with freq of:
> >   6.0 = termFreq=6.0
> > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > 1.0 = fieldNorm(doc=105256)
> >
> > so in the end, with same MATCH results, in first case I get two documents
> > with same score, and in second case, one document has a higher score.
> >
> > This seems very, very strange. Does anyone have a clue what's going on?
> >
> > Thanks
> > Elisabeth
> >
> --
> Regards,
> Binoy Dalal
>


Re: new data structure for some fields

2015-12-21 Thread Binoy Dalal
OK. You will only be able to sort based on the integers if the integer
field is single valued, i.e. only one integer is associated with one
category id.

To do this you have to use the sort parameter.
You can either specify it in your solrconfig.xml like so:
<str name="sort">integer asc</str>
Field name followed by the order - asc/desc

Or you can specify it along with your query by appending it to your
query like so:
/select?q=query&sort=integer%20asc

If you want to apply these sorting rules for all docs, then specify the
sorting in your solrconfig. If you only want it for a certain subset then
apply the parameter from code at the app level.
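
As a concrete sketch (the field, type, and collection names below are
placeholder assumptions, not taken from this thread), the schema needs a
single-valued, indexed numeric field:

    <field name="rank" type="int" indexed="true" stored="true" multiValued="false"/>

and the sorted query then looks like:

    http://localhost:8983/solr/collection1/select?q=category_id:c1&sort=rank%20asc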

On Mon, 21 Dec 2015, 16:49 Abhishek Mishra  wrote:

> hi binoy
> thanks for the reply. What I mean by sort is to sort the result sets on
> the basis of the integer values given for that category.
> For any document, let's say for an id P1,
> the categories associated are c1,c2,c3,c4 (using a multivalued field).
> For the new implementation,
> similarly a number is associated with each category, let's say
> c1---b1, c2---b2, c3---b3, c4---b4.
> Now when we query Solr for the ids which have c1 in their
> categories (q=category_id:c1), I want the result of this query sorted
> on the basis of the number (b) associated with it throughout the result.
>
> The number of associations is usually less than 20 (meaning an id can't be
> mapped to more than 20 category_ids).
>
>
> On Mon, Dec 21, 2015 at 3:59 PM, Binoy Dalal 
> wrote:
>
> > When you say sort, do you mean search on the basis of category and
> > integers? Or score the docs based on their category and integer values?
> >
> > Also, for any given document, how many categories or integers are
> > associated with it?
> >
> > On Mon, 21 Dec 2015, 14:43 Abhishek Mishra  wrote:
> >
> > > Hello all
> > >
> > > i am facing some kind of requirement that where for an id p1 is
> > associated
> > > with some category_ids c1,c2,c3,c4 with some integers b1,b2,b3,b4. We
> > need
> > > to sort the query of solr on the basis of b1/b2/b3/b4 depending on
> given
> > > category_id . Right now we mapped the category_ids into multi-valued
> > > attribute. [c1,c2,c3,c4] something like this. we are querying into it.
> > But
> > > from now we also need to find which integer b1,b2,b3.. associated with
> > > given category and also sort the whole query on it.
> > >
> > >
> > > sorry for any typos..
> > >
> > > Regards
> > > Abhishek
> > >
> > --
> > Regards,
> > Binoy Dalal
> >
>
-- 
Regards,
Binoy Dalal


Re: new data structure for some fields

2015-12-21 Thread Abhishek Mishra
Hi binoy, it will not work, as category and integer have a one-to-one
mapping, so if category_id is multivalued the same goes for integer. You
need some kind of mechanism which will identify which integer to pick for a
given category_id in the search; only then can you implement sorting
according to it.

On Mon, Dec 21, 2015 at 5:27 PM, Binoy Dalal  wrote:

> Small edit:
> The sort parameter in the solrconfig goes in the request handler
> declaration that you're using. So if it's select, put it in the <lst name="defaults"> list.
>
> On Mon, 21 Dec 2015, 17:21 Binoy Dalal  wrote:
>
> > OK. You will only be able to sort based on the integers if the integer
> > field is single valued, I.e. only one integer is associated with one
> > category I'd.
> >
> > To do this you've to use the sort parameter.
> > You can either specify it in your solrconfig.XML like so:
> > <str name="sort">integer asc</str>
> > Field name followed by the order - asc/desc
> >
> > Or you can specify the it along with our query by appending it to your
> > query like so:
> > /select?q=query&sort=integer%20asc
> >
> > If you want to apply these sorting rules for all docs, then specify the
> > sorting in your solrconfig. If you only want It for a certain subset then
> > apply the parameter from code at the app level
> >
> > On Mon, 21 Dec 2015, 16:49 Abhishek Mishra  wrote:
> >
> >> hi binoy
> >> thanks for reply. I mean by sort is to sort the data-sets on the basis
> of
> >> integers values given for that category.
> >> For any document let say for an id P1,
> >> category associated is c1,c2,c3,c4 (using multivalued field)
> >> For new implementation
> >> similarly a number is associated with each category. let say
> >> c1---b1,c2---b2,c3---b3,c4---b4.
> >> now when we querying into solr for the ids which have c1 in their
> >> categories. (q=category_id:c1) now i want the result of this query
> sorted
> >> on the basis of number(b) associated with it throughout the result..
> >>
> >> number of association is usually less than 20 (means an id can't be
> mapped
> >> more than 20 category_ids)
> >>
> >>
> >> On Mon, Dec 21, 2015 at 3:59 PM, Binoy Dalal 
> >> wrote:
> >>
> >> > When you say sort, do you mean search on the basis of category and
> >> > integers? Or score the docs based on their category and integer
> values?
> >> >
> >> > Also, for any given document, how many categories or integers are
> >> > associated with it?
> >> >
> >> > On Mon, 21 Dec 2015, 14:43 Abhishek Mishra 
> >> wrote:
> >> >
> >> > > Hello all
> >> > >
> >> > > i am facing some kind of requirement that where for an id p1 is
> >> > associated
> >> > > with some category_ids c1,c2,c3,c4 with some integers b1,b2,b3,b4.
> We
> >> > need
> >> > > to sort the query of solr on the basis of b1/b2/b3/b4 depending on
> >> given
> >> > > category_id . Right now we mapped the category_ids into multi-valued
> >> > > attribute. [c1,c2,c3,c4] something like this. we are querying into
> it.
> >> > But
> >> > > from now we also need to find which integer b1,b2,b3.. associated
> with
> >> > > given category and also sort the whole query on it.
> >> > >
> >> > >
> >> > > sorry for any typos..
> >> > >
> >> > > Regards
> >> > > Abhishek
> >> > >
> >> > --
> >> > Regards,
> >> > Binoy Dalal
> >> >
> >>
> > --
> > Regards,
> > Binoy Dalal
> >
> --
> Regards,
> Binoy Dalal
>


Re: new data structure for some fields

2015-12-21 Thread Emir Arnautovic
Maybe I'm missing something, but if c and b are one-to-one and you are
filtering by c, how can you sort on b, since all values will be the same?


On 21.12.2015 13:10, Abhishek Mishra wrote:

> Hi binoy, it will not work, as category and integer have a one-to-one
> mapping, so if category_id is multivalued the same goes for integer. You
> need some kind of mechanism which will identify which integer to pick for a
> given category_id in the search; only then can you implement sorting
> according to it.
>
> On Mon, Dec 21, 2015 at 5:27 PM, Binoy Dalal  wrote:
>
> > Small edit:
> > The sort parameter in the solrconfig goes in the request handler
> > declaration that you're using. So if it's select, put it in the
> > <lst name="defaults"> list.
> >
> > On Mon, 21 Dec 2015, 17:21 Binoy Dalal  wrote:
> >
> > > OK. You will only be able to sort based on the integers if the integer
> > > field is single valued, i.e. only one integer is associated with one
> > > category id.
> > >
> > > To do this you have to use the sort parameter.
> > > You can either specify it in your solrconfig.xml like so:
> > > <str name="sort">integer asc</str>
> > > Field name followed by the order - asc/desc
> > >
> > > Or you can specify it along with your query by appending it to your
> > > query like so:
> > > /select?q=query&sort=integer%20asc
> > >
> > > If you want to apply these sorting rules for all docs, then specify the
> > > sorting in your solrconfig. If you only want it for a certain subset
> > > then apply the parameter from code at the app level.
> > >
> > > On Mon, 21 Dec 2015, 16:49 Abhishek Mishra  wrote:
> > >
> > > > hi binoy
> > > > thanks for the reply. What I mean by sort is to sort the result sets
> > > > on the basis of the integer values given for that category.
> > > > For any document, let's say for an id P1,
> > > > the categories associated are c1,c2,c3,c4 (using a multivalued field).
> > > > For the new implementation,
> > > > similarly a number is associated with each category, let's say
> > > > c1---b1, c2---b2, c3---b3, c4---b4.
> > > > Now when we query Solr for the ids which have c1 in their
> > > > categories (q=category_id:c1), I want the result of this query sorted
> > > > on the basis of the number (b) associated with it throughout the
> > > > result.
> > > >
> > > > The number of associations is usually less than 20 (meaning an id
> > > > can't be mapped to more than 20 category_ids).
> > > >
> > > > On Mon, Dec 21, 2015 at 3:59 PM, Binoy Dalal  wrote:
> > > >
> > > > > When you say sort, do you mean search on the basis of category and
> > > > > integers? Or score the docs based on their category and integer
> > > > > values?
> > > > >
> > > > > Also, for any given document, how many categories or integers are
> > > > > associated with it?
> > > > >
> > > > > On Mon, 21 Dec 2015, 14:43 Abhishek Mishra  wrote:
> > > > >
> > > > > > Hello all
> > > > > >
> > > > > > I am facing a requirement where an id p1 is associated with some
> > > > > > category_ids c1,c2,c3,c4 with some integers b1,b2,b3,b4. We need
> > > > > > to sort the Solr query results on the basis of b1/b2/b3/b4
> > > > > > depending on the given category_id. Right now we have mapped the
> > > > > > category_ids into a multi-valued attribute, something like
> > > > > > [c1,c2,c3,c4], and we query against it. But now we also need to
> > > > > > find which integer b1,b2,b3... is associated with a given
> > > > > > category and also sort the whole result on it.
> > > > > >
> > > > > > sorry for any typos..
> > > > > >
> > > > > > Regards
> > > > > > Abhishek
> > > > > >
> > > > > --
> > > > > Regards,
> > > > > Binoy Dalal
> > > > >
> > > --
> > > Regards,
> > > Binoy Dalal
> > >
> > --
> > Regards,
> > Binoy Dalal



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: new data structure for some fields

2015-12-21 Thread Binoy Dalal
I wasn't clear enough. What I meant was that basically your integer field
should not be multivalued. That's it.

If on the other hand your integer field is multivalued, sort will not work.
You will have to figure out some sort of a conditional boosting approach
wherein you check the integer value and then apply a boost based on some
mathematical formula to either send the doc to the top or send it to the
end of the list.
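
One way to make that concrete - a sketch of an alternative approach, not
something worked out in this thread - is to sidestep the multivalued
problem entirely with one dynamic field per category, e.g. indexing
rank_c1=b1, rank_c2=b2, and so on, and then sorting on the field that
matches the category being filtered:

    <dynamicField name="rank_*" type="int" indexed="true" stored="true"/>

    /select?q=category_id:c1&sort=rank_c1%20asc

Since an id maps to at most ~20 category_ids (per the limit mentioned
earlier in the thread), each document carries at most ~20 such fields.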

On Mon, 21 Dec 2015, 17:40 Abhishek Mishra  wrote:

> Hi binoy, it will not work, as category and integer have a one-to-one
> mapping, so if category_id is multivalued the same goes for integer. You
> need some kind of mechanism which will identify which integer to pick for a
> given category_id in the search; only then can you implement sorting
> according to it.
>
> On Mon, Dec 21, 2015 at 5:27 PM, Binoy Dalal 
> wrote:
>
> > Small edit:
> > The sort parameter in the solrconfig goes in the request handler
> > declaration that you're using. So if it's select, put it in the <lst name="defaults"> list.
> >
> > On Mon, 21 Dec 2015, 17:21 Binoy Dalal  wrote:
> >
> > > OK. You will only be able to sort based on the integers if the integer
> > > field is single valued, I.e. only one integer is associated with one
> > > category I'd.
> > >
> > > To do this you've to use the sort parameter.
> > > You can either specify it in your solrconfig.XML like so:
> > > <str name="sort">integer asc</str>
> > > Field name followed by the order - asc/desc
> > >
> > > Or you can specify the it along with our query by appending it to your
> > > query like so:
> > > /select?q=query&sort=integer%20asc
> > >
> > > If you want to apply these sorting rules for all docs, then specify the
> > > sorting in your solrconfig. If you only want It for a certain subset
> then
> > > apply the parameter from code at the app level
> > >
> > > On Mon, 21 Dec 2015, 16:49 Abhishek Mishra 
> wrote:
> > >
> > >> hi binoy
> > >> thanks for reply. I mean by sort is to sort the data-sets on the basis
> > of
> > >> integers values given for that category.
> > >> For any document let say for an id P1,
> > >> category associated is c1,c2,c3,c4 (using multivalued field)
> > >> For new implementation
> > >> similarly a number is associated with each category. let say
> > >> c1---b1,c2---b2,c3---b3,c4---b4.
> > >> now when we querying into solr for the ids which have c1 in their
> > >> categories. (q=category_id:c1) now i want the result of this query
> > sorted
> > >> on the basis of number(b) associated with it throughout the result..
> > >>
> > >> number of association is usually less than 20 (means an id can't be
> > mapped
> > >> more than 20 category_ids)
> > >>
> > >>
> > >> On Mon, Dec 21, 2015 at 3:59 PM, Binoy Dalal 
> > >> wrote:
> > >>
> > >> > When you say sort, do you mean search on the basis of category and
> > >> > integers? Or score the docs based on their category and integer
> > values?
> > >> >
> > >> > Also, for any given document, how many categories or integers are
> > >> > associated with it?
> > >> >
> > >> > On Mon, 21 Dec 2015, 14:43 Abhishek Mishra 
> > >> wrote:
> > >> >
> > >> > > Hello all
> > >> > >
> > >> > > i am facing some kind of requirement that where for an id p1 is
> > >> > associated
> > >> > > with some category_ids c1,c2,c3,c4 with some integers b1,b2,b3,b4.
> > We
> > >> > need
> > >> > > to sort the query of solr on the basis of b1/b2/b3/b4 depending on
> > >> given
> > >> > > category_id . Right now we mapped the category_ids into
> multi-valued
> > >> > > attribute. [c1,c2,c3,c4] something like this. we are querying into
> > it.
> > >> > But
> > >> > > from now we also need to find which integer b1,b2,b3.. associated
> > with
> > >> > > given category and also sort the whole query on it.
> > >> > >
> > >> > >
> > >> > > sorry for any typos..
> > >> > >
> > >> > > Regards
> > >> > > Abhishek
> > >> > >
> > >> > --
> > >> > Regards,
> > >> > Binoy Dalal
> > >> >
> > >>
> > > --
> > > Regards,
> > > Binoy Dalal
> > >
> > --
> > Regards,
> > Binoy Dalal
> >
>
-- 
Regards,
Binoy Dalal


Re: Numerous problems with SolrCloud

2015-12-21 Thread Erick Erickson
ZK isn't pushed all that heavily, although all things are possible. Still,
for maintenance putting Zk on separate machines is a good idea. They
don't have to be very beefy machines.

Look in your logs for LeaderInitiatedRecovery messages. If you find them
then _probably_ you have some issues with timeouts, often due to
excessive GC pauses, turning on GC logging can help you get
a handle on that.
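
As a sketch, GC logging can be enabled with the standard JVM flags (the log
path is a placeholder):

    -verbose:gc -Xloggc:/var/log/solr/solr_gc.log \
    -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCApplicationStoppedTime

which makes long stop-the-world pauses show up with timestamps in the log.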

Another "popular" reason for nodes going into recovery is Out Of Memory
errors, which is easy to do in a system that gets set up and
then more and more docs get added to it. You either have to move
some collections to other Solr instances, get more memory to the JVM
(but watch out for GC pauses and starving the OS's memory) etc.

But the Solr logs are the place I'd look first for any help in understanding
the root cause of nodes going into recovery.

Best,
Erick

On Mon, Dec 21, 2015 at 8:04 AM, John Smith  wrote:
> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
> response time in the current situation, which would cause the desync? Is
> this the reason for the change?
>
> John.
>
>
> On 21/12/15 16:45, Erik Hatcher wrote:
>> John - the first recommendation that pops out is to run (only) 3 zookeepers, 
>> entirely separate from Solr servers, and then as many Solr servers from 
>> there that you need to scale indexing and querying to your needs.  Sounds 
>> like 3 ZKs + 2 Solr’s is a good start, given you have 5 servers at your 
>> disposal.
>>
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com 
>>
>>
>>
>>> On Dec 21, 2015, at 10:37 AM, John Smith  wrote:
>>>
>>> This is my first experience with SolrCloud, so please bear with me.
>>>
>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>>> (20Gb) and some very small. All of them have only one shard. The bigger
>>> ones are almost constantly being updated (and of course queried at the
>>> same time).
>>>
>>> I've had a huge number of errors, many different ones. At some point the
>>> system seemed rather stable, but I've tried to add a few new collections
>>> and things went wrong again. The usual symptom is that some cores stop
>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>> it's still alive and well). When I add a core on a server, another (or
>>> several others) often goes down on that server. Even when the system is
>>> rather stable some cores are shown as recovering. When restarting a
>>> server it takes a very long time (30 min at least) to fully recover.
>>>
>>> Some of the many errors I've got (I've skipped the warnings):
>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>> for url
>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>> up to try to start recovery on replica
>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>> was found after waiting
>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>> tlog=null}
>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>> after succesful recovery
>>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>> Unable to create core
>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>> not closed!
>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>> for connection from pool
>>> - and so on...
>>>
>>> Any advice on where I should start? I've checked disk space, memory
>>> usage, max number of open files, everything seems fine there. My guess
>>> is that the configuration is rather unaltered from the defaults. I've
>>> extended timeouts in Zookeeper already.
>>>
>>> Thanks,
>>> John
>>>
>>
>


Re: TPS with Solr Cloud

2015-12-21 Thread Erick Erickson
8,000 TPS almost certainly means you're firing the same (or
same few) requests over and over and hitting the queryResultCache,
look in the adminUI>>core>>plugins/stats>>cache>>queryResultCache.
I bet you're seeing a hit ratio near 100%. This is what Toke means
when he says your tests are too lightweight.


As others have outlined, to increase TPS (after you straighten out
your test harness) you add _replicas_ rather than add _shards_.
Only add shards when your collections are too big to fit on a single
Solr instance.

Best,
Erick

On Mon, Dec 21, 2015 at 1:56 AM, Emir Arnautovic
 wrote:
> Hi Anshul,
> TPS depends on the number of concurrent requests you can run and request
> processing time. With sharding you reduce processing time by reducing the
> amount of data a single node processes, but you have the overhead of
> inter-shard communication and merging results from different shards. If
> that overhead is smaller than the time you save when processing half of
> the index, you will see an increase in TPS. If you are running the same
> query in a loop, the first request will be processed and the others will
> likely be returned from cache, so response time will not vary with index
> size, and hence the sharding overhead will cause TPS to go down.
> If you are happy with your response time and want more TPS, you go with
> replication - that will increase the number of concurrent requests you can run.
>
> Also, make sure your tests are realistic in order to avoid having false
> estimates and have surprises when start running real load.
>
> Regards,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
> On 21.12.2015 08:18, Anshul Sharma wrote:
>>
>> Hi,
>> I am trying to evaluate Solr for one of my projects, for which I need to
>> check the scalability in terms of tps (transactions per second) for my
>> application.
>> I have configured solr on 1 AWS server as standalone application which is
>> giving me a tps of ~8000 for my query.
>> In order to test the scalability, i have done sharding of the same data
>> across two AWS servers with 2.5 million records each. When I try to query
>> the cluster with the same query as before, it gives me a tps of ~2500.
>> My understanding is the tps should have been increased in a cluster as
>> these are two different machines which will perform separate I/O
>> operations.
>> I have not configured any separate load balancer as the document says that
>> by default solr cloud will perform load balancing in a round robin
>> fashion.
>> Can you please help me in understanding the issue.
>>
>


Re: solrcloud used a lot of memory and memory keep increasing during long time run

2015-12-21 Thread Erick Erickson
Do you have any custom components? Indeed, you shouldn't have
that many searchers open. But could we see a screenshot? That's
the best way to insure that we're talking about the same thing.

Your autocommit settings are really hurting you. Your commit interval
should be as long as you can tolerate. At that kind of commit frequency,
your caches are of very limited usefulness anyway, so you can pretty
much shut them off. Every 1.5 seconds, they're invalidated totally.

Upping maxWarmingSearchers is almost always a mistake. That's
a safety valve that's there in order to prevent runaway resource
consumption and almost always means the system is mis-configured.
I'd put it back to 2 and tune the rest of the system to avoid it rather
than bumping it up.
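
A sketch of that advice in solrconfig.xml (the intervals are placeholder
assumptions to be tuned to what the application can tolerate, not values
from this thread):

    <!-- under <updateHandler> -->
    <autoCommit>
      <maxTime>60000</maxTime>            <!-- hard commit every 60s, for durability -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>30000</maxTime>            <!-- soft commit every 30s, for visibility -->
    </autoSoftCommit>

    <!-- under <query> -->
    <maxWarmingSearchers>2</maxWarmingSearchers>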

Best,
Erick

On Sun, Dec 20, 2015 at 11:43 PM, zhenglingyun  wrote:
> Just now, I see about 40 "Searchers@ main" displayed in Solr Web UI: 
> collection -> Plugins/Stats -> CORE
>
> I think it’s abnormal!
>
> softCommit is set to 1.5s, but warmup takes about 3s.
> Could that lead to so many searchers?
>
> maxWarmingSearchers is set to 4 in my solrconfig.xml;
> shouldn't that prevent Solr from creating more than 4 searchers?
>
>
>
>> On Dec 21, 2015, at 14:43, zhenglingyun  wrote:
>>
>> Thanks Erick for pointing out the memory change in a sawtooth pattern.
>> The problem that troubles me is that the bottom point of the sawtooth
>> keeps increasing.
>> And when the used capacity of the old generation exceeds the threshold
>> set by CMS's CMSInitiatingOccupancyFraction, GC keeps running and uses a
>> lot of CPU cycles, but the used old-generation memory does not decrease.
>>
>> After taking Rahul's advice, I decreased Xms and Xmx from 16G to 8G, and
>> adjusted the JVM parameters from
>>-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>>-XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70
>>-XX:+CMSParallelRemarkEnabled
>> to
>>-XX:NewRatio=3
>>-XX:SurvivorRatio=4
>>-XX:TargetSurvivorRatio=90
>>-XX:MaxTenuringThreshold=8
>>-XX:+UseConcMarkSweepGC
>>-XX:+UseParNewGC
>>-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4
>>-XX:+CMSScavengeBeforeRemark
>>-XX:PretenureSizeThreshold=64m
>>-XX:+UseCMSInitiatingOccupancyOnly
>>-XX:CMSInitiatingOccupancyFraction=50
>>-XX:CMSMaxAbortablePrecleanTime=6000
>>-XX:+CMSParallelRemarkEnabled
>>-XX:+ParallelRefProcEnabled
>>-XX:-CMSConcurrentMTEnabled
>> which is taken from bin/solr.in.sh
>> I hope this can reduce gc pause time and full gc times.
>> And maybe the memory increasing problem will disappear if I’m lucky.
>>
>> After several days' running, the memory on one of my two servers
>> increased to 90% again…
>> (When Solr is started, the memory used by Solr is less than 1G.)
>>
>> Following is the output of jstat -gccause -h5 <pid> 1000:
>>
>>  S0     S1     E      O      P     YGC    YGCT     FGC    FGCT     GCT      LGCC                 GCC
>>  9.56   0.00   8.65  91.31  65.89  69379 3076.096  16563 1579.639 4655.735  Allocation Failure   No GC
>>  9.56   0.00  51.10  91.31  65.89  69379 3076.096  16563 1579.639 4655.735  Allocation Failure   No GC
>>  0.00   9.23  10.23  91.35  65.89  69380 3076.135  16563 1579.639 4655.774  Allocation Failure   No GC
>>  7.90   0.00   9.74  91.39  65.89  69381 3076.165  16564 1579.683 4655.848  CMS Final Remark     No GC
>>  7.90   0.00  67.45  91.39  65.89  69381 3076.165  16564 1579.683 4655.848  CMS Final Remark     No GC
>>  S0     S1     E      O      P     YGC    YGCT     FGC    FGCT     GCT      LGCC                 GCC
>>  0.00   7.48  16.18  91.41  65.89  69382 3076.200  16565 1579.707 4655.908  CMS Initial Mark     No GC
>>  0.00   7.48  73.77  91.41  65.89  69382 3076.200  16565 1579.707 4655.908  CMS Initial Mark     No GC
>>  8.61   0.00  29.86  91.45  65.89  69383 3076.228  16565 1579.707 4655.936  Allocation Failure   No GC
>>  8.61   0.00  90.16  91.45  65.89  69383 3076.228  16565 1579.707 4655.936  Allocation Failure   No GC
>>  0.00   7.46  47.89  91.46  65.89  69384 3076.258  16565 1579.707 4655.966  Allocation Failure   No GC
>>  S0     S1     E      O      P     YGC    YGCT     FGC    FGCT     GCT      LGCC                 GCC
>>  8.67   0.00  11.98  91.49  65.89  69385 3076.287  16565 1579.707 4655.995  Allocation Failure   No GC
>>  0.00  11.76   9.24  91.54  65.89  69386 3076.321  16566 1579.759 4656.081  CMS Final Remark     No GC
>>  0.00  11.76  64.53  91.54  65.89  69386 3076.321  16566 1579.759 4656.081  CMS Final Remark     No GC
>>  7.25   0.00  20.39  91.57  65.89  69387 3076.358  16567 1579.786 4656.144  CMS Initial Mark     No GC
>>  7.25   0.00  81.56  91.57  65.89  69387 3076.358  16567 1579.786 4656.144  CMS Initial Mark     No GC
>>  S0     S1     E      O      P     YGC    YGCT     FGC    FGCT     GCT      LGCC                 GCC
>>  0.00   8.05  34.42  91.60  65.89  69388 3076.391  16567 1579.786 4656.177

Re: Numerous problems with SolrCloud

2015-12-21 Thread John Smith
OK, great. I've eliminated OOM errors after increasing the memory
allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal
setting but this is all I can have right now on the Solr machines. I'll
look into GC logging too.

Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
java.net.SocketException: Connection reset" lines, but this isn't very
explicit. I suppose I'll have to cross-check on the concerned server(s).

Anyway, I'll have a try at the updated setting and I'll get back to the
list.

Thanks,
John.


On 21/12/15 17:21, Erick Erickson wrote:
> ZK isn't pushed all that heavily, although all things are possible. Still,
> for maintenance putting Zk on separate machines is a good idea. They
> don't have to be very beefy machines.
>
> Look in your logs for LeaderInitiatedRecovery messages. If you find them
> then _probably_ you have some issues with timeouts, often due to
> excessive GC pauses, turning on GC logging can help you get
> a handle on that.
>
> Another "popular" reason for nodes going into recovery is Out Of Memory
> errors, which is easy to do in a system that gets set up and
> then more and more docs get added to it. You either have to move
> some collections to other Solr instances, get more memory to the JVM
> (but watch out for GC pauses and starving the OS's memory) etc.
>
> But the Solr logs are the place I'd look first for any help in understanding
> the root cause of nodes going into recovery.
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:04 AM, John Smith  wrote:
>> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
>> response time in the current situation, which would cause the desync? Is
>> this the reason for the change?
>>
>> John.
>>
>>
>> On 21/12/15 16:45, Erik Hatcher wrote:
>>> John - the first recommendation that pops out is to run (only) 3 
>>> zookeepers, entirely separate from Solr servers, and then as many Solr 
>>> servers from there that you need to scale indexing and querying to your 
>>> needs.  Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 
>>> servers at your disposal.
>>>
>>>
>>> —
>>> Erik Hatcher, Senior Solutions Architect
>>> http://www.lucidworks.com 
>>>
>>>
>>>
 On Dec 21, 2015, at 10:37 AM, John Smith  wrote:

 This is my first experience with SolrCloud, so please bear with me.

 I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
 the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
 3.4.7. There's around 80 Gb of index, some collections are rather big
 (20Gb) and some very small. All of them have only one shard. The bigger
 ones are almost constantly being updated (and of course queried at the
 same time).

 I've had a huge number of errors, many different ones. At some point the
 system seemed rather stable, but I've tried to add a few new collections
 and things went wrong again. The usual symptom is that some cores stop
 synchronizing; sometimes an entire server is shown as "gone" (although
 it's still alive and well). When I add a core on a server, another (or
 several others) often goes down on that server. Even when the system is
 rather stable some cores are shown as recovering. When restarting a
 server it takes a very long time (30 min at least) to fully recover.

 Some of the many errors I've got (I've skipped the warnings):
 - org.apache.solr.common.SolrException: Error trying to proxy request
 for url
 - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
 up to try to start recovery on replica
 - org.apache.solr.common.SolrException; Error while trying to recover.
 core=[...]:org.apache.solr.common.SolrException: No registered leader
 was found after waiting
 - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
 tlog=null}
 - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
 after succesful recovery
 - org.apache.solr.common.SolrException; Could not find core to call 
 recovery
 - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
 Unable to create core
 - org.apache.solr.request.SolrRequestInfo; prev == info : false
 - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
 not closed!
 - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
 - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
 prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
 - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
 - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
 for connection from pool
 - and so on...

 Any advice on where I should start? I've checked disk space, memory
 usage, max number of open 

Re: Numerous problems with SolrCloud

2015-12-21 Thread Erick Erickson
right, do note that when you _do_ hit an OOM, you really
should restart the JVM as nothing is _really_ certain after
that.

You're right, just bumping the memory is a band-aid, but
whatever gets you by. Lucene makes heavy use of
MMapDirectory which uses OS memory rather than JVM
memory, so you're robbing Peter to pay Paul when you
allocate high percentages of the physical memory to the JVM.
See Uwe's excellent blog here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
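
As a sketch of the arithmetic on the machine described in this thread: with
20Gb of RAM, starting Solr with

    bin/solr start -c -m 8g

leaves roughly 12Gb for the OS page cache that MMapDirectory relies on,
whereas a 12Gb heap leaves only about 8Gb for it (minus whatever the OS and
other processes use).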

And yeah, your "connection reset" errors may well be GC-related
if you're getting a lot of stop-the-world GC pauses.

Sounds like you inherited a system that's getting more and more
docs added to it over time and outgrew its host, but that's a guess.

And you get to deal with it over the holidays too ;)

Best,
Erick

On Mon, Dec 21, 2015 at 8:33 AM, John Smith  wrote:
> OK, great. I've eliminated OOM errors after increasing the memory
> allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal
> setting but this is all I can have right now on the Solr machines. I'll
> look into GC logging too.
>
> Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
> java.net.SocketException: Connection reset" lines, but this isn't very
> explicit. I suppose I'll have to cross-check on the concerned server(s).
>
> Anyway, I'll have a try at the updated setting and I'll get back to the
> list.
>
> Thanks,
> John.
>
>
> On 21/12/15 17:21, Erick Erickson wrote:
>> ZK isn't pushed all that heavily, although all things are possible. Still,
>> for maintenance putting Zk on separate machines is a good idea. They
>> don't have to be very beefy machines.
>>
>> Look in your logs for LeaderInitiatedRecovery messages. If you find them
>> then _probably_ you have some issues with timeouts, often due to
>> excessive GC pauses, turning on GC logging can help you get
>> a handle on that.
>>
>> Another "popular" reason for nodes going into recovery is Out Of Memory
>> errors, which is easy to do in a system that gets set up and
>> then more and more docs get added to it. You either have to move
>> some collections to other Solr instances, get more memory to the JVM
>> (but watch out for GC pauses and starving the OS's memory) etc.
>>
>> But the Solr logs are the place I'd look first for any help in understanding
>> the root cause of nodes going into recovery.
>>
>> Best,
>> Erick
>>
>> On Mon, Dec 21, 2015 at 8:04 AM, John Smith  wrote:
>>> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
>>> response time in the current situation, which would cause the desync? Is
>>> this the reason for the change?
>>>
>>> John.
>>>
>>>
>>> On 21/12/15 16:45, Erik Hatcher wrote:
 John - the first recommendation that pops out is to run (only) 3 
 zookeepers, entirely separate from Solr servers, and then as many Solr 
 servers from there that you need to scale indexing and querying to your 
 needs.  Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 
 servers at your disposal.


 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com 



> On Dec 21, 2015, at 10:37 AM, John Smith  wrote:
>
> This is my first experience with SolrCloud, so please bear with me.
>
> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
> 3.4.7. There's around 80 Gb of index, some collections are rather big
> (20Gb) and some very small. All of them have only one shard. The bigger
> ones are almost constantly being updated (and of course queried at the
> same time).
>
> I've had a huge number of errors, many different ones. At some point the
> system seemed rather stable, but I've tried to add a few new collections
> and things went wrong again. The usual symptom is that some cores stop
> synchronizing; sometimes an entire server is shown as "gone" (although
> it's still alive and well). When I add a core on a server, another (or
> several others) often goes down on that server. Even when the system is
> rather stable some cores are shown as recovering. When restarting a
> server it takes a very long time (30 min at least) to fully recover.
>
> Some of the many errors I've got (I've skipped the warnings):
> - org.apache.solr.common.SolrException: Error trying to proxy request
> for url
> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
> up to try to start recovery on replica
> - org.apache.solr.common.SolrException; Error while trying to recover.
> core=[...]:org.apache.solr.common.SolrException: No registered leader
> was found after waiting
> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
> tlog=null}
> - 

Re: Re: Some problems when upload data to index in cloud environment

2015-12-21 Thread Erick Erickson
Jianer:

Getting your head around the configs is, indeed, "exciting" at times.

I just wanted to caution you that using ExtractingRequestHandler
puts the Tika parsing load on the Solr server, which doesn't
scale as the same machine that's serving queries and indexing
is _also_ parsing potentially very large files. It may not matter
if you don't do it often, but if you're going to index a large number
of files and/or you're going to do this continuously, you probably
want to move the parsing off Solr. Here's an example with DB
as well, but the DB bits can be removed easily.

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
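
A minimal sketch of that pattern (not code from this thread; the collection
URL, file path, and field names are placeholder assumptions): Tika extracts
the text on the client, and only the extracted fields are sent to Solr.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ClientSideIndexer {
      public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/cloud-test");
        AutoDetectParser parser = new AutoDetectParser();

        File file = new File("/path/to/document.pdf");
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(file)) {
          parser.parse(in, handler, metadata);  // Tika runs here, not on the Solr server
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getAbsolutePath());
        doc.addField("title", metadata.get("title"));
        doc.addField("text", handler.toString()); // extracted body text
        solr.add(doc);
        solr.commit();
        solr.close();
      }
    }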

Best,
Erick

On Sun, Dec 20, 2015 at 9:29 PM, 周建二  wrote:
> Hi Shawn, thanks for your reply. :)
>
>
> It is because the /update/extract handler is not defined in my collection's 
> solrconfig.xml file as I upload the basic_configs/conf to ZooKeeper. When I 
> upload sample_techproducts_configs to ZooKeeper, everything goes well.
>
>
> I am new to Solr. Now I am going to learn schema.xml and
> solrconfig.xml, and try to make my own config for my dataset based on
> basic_configs.
>
>
> Thanks again.
> Jianer
>
>
>> -Original Message-
>> From: "Shawn Heisey" 
>> Sent: Sunday, December 20, 2015
>> To: solr-user@lucene.apache.org
>> Cc:
>> Subject: Re: Some problems when upload data to index in cloud environment
>>
>> On 12/18/2015 6:16 PM, 周建二 wrote:
>> > I am building a solr cloud production environment. My solr version is 
>> > 5.3.1. The environment consists three nodes running CentOS 6.5. First I 
>> > build the zookeeper environment by the three nodes, and then run solr on 
>> > the three nodes, and at last build a collection consists of three shards 
>> > and each shard has two replicas. After that we can see that cloud 
>> > structure on the Solr Admin page.
>>
>> 
>>
>> > HTTP ERROR 404
>> >
>> > Problem accessing /solr/cloud-test/update/extract. Reason:
>>
>> One of two problems is likely:  Either there is no collection named
>> "cloud-test" on your cloud, or the /update/extract handler is not
>> defined in that collection's solrconfig.xml file.  The active version of
>> this file lives in zookeeper when you're running SolrCloud.
>>
>> If you're sure a collection with this name exists, how exactly did you
>> create it?  Was it built with one of the sample configs or with a config
>> that you built yourself?
>>
>> Of the three configsets included with the Solr download,
>> data_driven_schema_configs and sample_techproducts_configs contain the
>> /update/extract handler.  The configset named basic_configs does NOT
>> contain the handler.
>>
>> Thanks,
>> Shawn
>>
>
>
>
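
For reference, the handler that basic_configs lacks looks roughly like this
in sample_techproducts_configs' solrconfig.xml (an approximation, not a
verbatim copy; it also requires the solr-cell and Tika jars to be on the
classpath via <lib> directives):

    <requestHandler name="/update/extract"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="fmap.content">text</str>
      </lst>
    </requestHandler>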


Re: Numerous problems with SolrCloud

2015-12-21 Thread Erik Hatcher
John - the first recommendation that pops out is to run (only) 3 zookeepers, 
entirely separate from Solr servers, and then as many Solr servers from there 
that you need to scale indexing and querying to your needs.  Sounds like 3 ZKs 
+ 2 Solr’s is a good start, given you have 5 servers at your disposal.
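
As a sketch, with such an external 3-node ensemble each Solr node would be
started against the full ZooKeeper connection string (host names are
placeholders):

    bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181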


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 



> On Dec 21, 2015, at 10:37 AM, John Smith  wrote:
> 
> This is my first experience with SolrCloud, so please bear with me.
> 
> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
> 3.4.7. There's around 80 Gb of index, some collections are rather big
> (20Gb) and some very small. All of them have only one shard. The bigger
> ones are almost constantly being updated (and of course queried at the
> same time).
> 
> I've had a huge number of errors, many different ones. At some point the
> system seemed rather stable, but I've tried to add a few new collections
> and things went wrong again. The usual symptom is that some cores stop
> synchronizing; sometimes an entire server is shown as "gone" (although
> it's still alive and well). When I add a core on a server, another (or
> several others) often goes down on that server. Even when the system is
> rather stable some cores are shown as recovering. When restarting a
> server it takes a very long time (30 min at least) to fully recover.
> 
> Some of the many errors I've got (I've skipped the warnings):
> - org.apache.solr.common.SolrException: Error trying to proxy request
> for url
> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
> up to try to start recovery on replica
> - org.apache.solr.common.SolrException; Error while trying to recover.
> core=[...]:org.apache.solr.common.SolrException: No registered leader
> was found after waiting
> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
> tlog=null}
> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
> after succesful recovery
> - org.apache.solr.common.SolrException; Could not find core to call recovery
> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
> Unable to create core
> - org.apache.solr.request.SolrRequestInfo; prev == info : false
> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
> not closed!
> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
> for connection from pool
> - and so on...
> 
> Any advice on where I should start? I've checked disk space, memory
> usage, max number of open files, everything seems fine there. My guess
> is that the configuration is rather unaltered from the defaults. I've
> extended timeouts in Zookeeper already.
> 
> Thanks,
> John
> 



Numerous problems with SolrCloud

2015-12-21 Thread John Smith
This is my first experience with SolrCloud, so please bear with me.

I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
3.4.7. There's around 80 Gb of index, some collections are rather big
(20Gb) and some very small. All of them have only one shard. The bigger
ones are almost constantly being updated (and of course queried at the
same time).

I've had a huge number of errors, many different ones. At some point the
system seemed rather stable, but I've tried to add a few new collections
and things went wrong again. The usual symptom is that some cores stop
synchronizing; sometimes an entire server is shown as "gone" (although
it's still alive and well). When I add a core on a server, another (or
several others) often goes down on that server. Even when the system is
rather stable some cores are shown as recovering. When restarting a
server it takes a very long time (30 min at least) to fully recover.

Some of the many errors I've got (I've skipped the warnings):
- org.apache.solr.common.SolrException: Error trying to proxy request
for url
- org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
up to try to start recovery on replica
- org.apache.solr.common.SolrException; Error while trying to recover.
core=[...]:org.apache.solr.common.SolrException: No registered leader
was found after waiting
- update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
tlog=null}
- org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
after succesful recovery
- org.apache.solr.common.SolrException; Could not find core to call recovery
- org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
Unable to create core
- org.apache.solr.request.SolrRequestInfo; prev == info : false
- org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
not closed!
- org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
- org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
- org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
- org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
for connection from pool
- and so on...

Any advice on where I should start? I've checked disk space, memory
usage, max number of open files, everything seems fine there. My guess
is that the configuration is rather unaltered from the defaults. I've
extended timeouts in Zookeeper already.

Thanks,
John



Re: Numerous problems with SolrCloud

2015-12-21 Thread John Smith
Thanks, I'll have a try. Can the load on the Solr servers impair the zk
response time in the current situation, which would cause the desync? Is
this the reason for the change?

John.


On 21/12/15 16:45, Erik Hatcher wrote:
> John - the first recommendation that pops out is to run (only) 3 zookeepers, 
> entirely separate from Solr servers, and then as many Solr servers from there 
> that you need to scale indexing and querying to your needs.  Sounds like 3 
> ZKs + 2 Solr’s is a good start, given you have 5 servers at your disposal.
>
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com 
>
>
>
>> On Dec 21, 2015, at 10:37 AM, John Smith  wrote:
>>
>> This is my first experience with SolrCloud, so please bear with me.
>>
>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>> (20Gb) and some very small. All of them have only one shard. The bigger
>> ones are almost constantly being updated (and of course queried at the
>> same time).
>>
>> I've had a huge number of errors, many different ones. At some point the
>> system seemed rather stable, but I've tried to add a few new collections
>> and things went wrong again. The usual symptom is that some cores stop
>> synchronizing; sometimes an entire server is shown as "gone" (although
>> it's still alive and well). When I add a core on a server, another (or
>> several others) often goes down on that server. Even when the system is
>> rather stable some cores are shown as recovering. When restarting a
>> server it takes a very long time (30 min at least) to fully recover.
>>
>> Some of the many errors I've got (I've skipped the warnings):
>> - org.apache.solr.common.SolrException: Error trying to proxy request
>> for url
>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>> up to try to start recovery on replica
>> - org.apache.solr.common.SolrException; Error while trying to recover.
>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>> was found after waiting
>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>> tlog=null}
>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>> after succesful recovery
>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>> Unable to create core
>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>> not closed!
>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>> for connection from pool
>> - and so on...
>>
>> Any advice on where I should start? I've checked disk space, memory
>> usage, max number of open files, everything seems fine there. My guess
>> is that the configuration is rather unaltered from the defaults. I've
>> extended timeouts in Zookeeper already.
>>
>> Thanks,
>> John
>>
>



Re: TPS with Solr Cloud

2015-12-21 Thread Upayavira

You add shards to reduce response times. If your responses are too slow
for 1 shard, try it with three. Skip two for reasons stated above.

Upayavira

On Mon, Dec 21, 2015, at 04:27 PM, Erick Erickson wrote:
> 8,000 TPS almost certainly means you're firing the same (or
> same few) requests over and over and hitting the queryResultCache,
> look in the adminUI>>core>>plugins/stats>>cache>>queryResultCache.
> I bet you're seeing a hit ratio near 100%. This is what Toke means
> when he says your tests are too lightweight.
> 
> 
> As others have outlined, to increase TPS (after you straighten out
> your test harness) you add _replicas_ rather than add _shards_.
> Only add shards when your collections are too big to fit on a single
> Solr instance.
> 
> Best,
> Erick
> 
> On Mon, Dec 21, 2015 at 1:56 AM, Emir Arnautovic
>  wrote:
> > Hi Anshul,
> > TPS depends on the number of concurrent requests you can run and request
> > processing time. With sharding you reduce processing time by reducing the
> > amount of data a single node processes, but you have the overhead of
> > inter-shard communication and merging results from different shards. If
> > that overhead is smaller than the time you save when processing half of
> > the index, you will see an increase in TPS. If you are running the same
> > query in a loop, the first request will be processed and the others will
> > likely be returned from cache, so response time will not vary with index
> > size, and hence the sharding overhead will cause TPS to go down.
> > If you are happy with your response time and want more TPS, you go with
> > replication - that will increase the number of concurrent requests you can run.
> >
> > Also, make sure your tests are realistic in order to avoid having false
> > estimates and have surprises when start running real load.
> >
> > Regards,
> > Emir
> >
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> >
> > On 21.12.2015 08:18, Anshul Sharma wrote:
> >>
> >> Hi,
> >> I am trying to evaluate Solr for one of my projects, for which I need to
> >> check the scalability in terms of tps (transactions per second) for my
> >> application.
> >> I have configured solr on 1 AWS server as standalone application which is
> >> giving me a tps of ~8000 for my query.
> >> In order to test the scalability, i have done sharding of the same data
> >> across two AWS servers with 2.5 milion records each .When i try to query
> >> the cluster with the same query as before it gives me a tps of ~2500 .
> >> My understanding is the tps should have been increased in a cluster as
> >> these are two different machines which will perform separate I/O
> >> operations.
> >> I have not configured any separate load balancer as the document says that
> >> by default solr cloud will perform load balancing in a round robin
> >> fashion.
> >> Can you please help me in understanding the issue.
> >>
> >