Re: Trending functionality in Solr

2015-02-09 Thread S.L
Folks,

Thanks for this wealth of information. The general consensus seems to be
that one should save the queries in a separate Solr core and timestamp
them for further analysis. I will try to implement the same.

Siegfried, I looked at your JIRA issue, which is impressive but would be
overkill in my situation, so I will implement something simpler for my use
case.

Thanks again everyone for this help.

On Mon, Feb 9, 2015 at 3:14 AM, Siegfried Goeschl sgoes...@gmx.at wrote:

 Hi folks,

 I implemented something similar but never got around to contributing it -
 see https://issues.apache.org/jira/browse/SOLR-4056

 The code was initially for SOLR3 but was recently ported to SOLR4

 * capturing the most frequent search terms per core
 * supports ad-hoc queries
 * CSV export

 If you are interested we could team up and make a proper SOLR contribution
 :-)

 Cheers,

 Siegfried Goeschl


 On 08.02.15 05:26, S.L wrote:

 Folks,

 Is there a way to implement trending functionality using Solr, i.e., to
 get, via a query, say, the most searched terms in the past hour or so?
 If the most searched terms are not possible, is it at least possible to
 get the results for the last 100 terms?

 Thanks





Re: [MASSMAIL]Re: Trending functionality in Solr

2015-02-09 Thread S.L
Thanks for stating this in a simple fashion.

On Sun, Feb 8, 2015 at 6:07 PM, Jorge Luis Betancourt González 
jlbetanco...@uci.cu wrote:

 For a project I'm working on, what we do is store the user's query in a
 separate core that we also use to provide autocomplete functionality. So
 far, the frontend app is responsible for sending the query to Solr,
 meaning: 1. execute the query against our search core, and 2. send an
 update request to store the query in the separate core. We use some
 deduplication (provided by Solr) to avoid storing the same query several
 times. We don't do what you're after, but it wouldn't be too hard to tag
 each query with a timestamp field and provide analytics. Off the top of
 my head, we could wrap the logic that is currently done in the frontend
 app in a custom SearchComponent that automatically sends the search query
 to the other core for storage, abstracting all this logic from the client
 app. Keep in mind that the considerations regarding volume of data that
 Shawn mentioned still apply.
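 As a rough sketch of that flow (the core names "products" and "querylog"
 and the field name query_s are hypothetical, not from this thread), the
 frontend's two requests could be built like this:

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL and core names ("products" for searching,
# "querylog" for stored queries); adapt to your installation.
SOLR = "http://localhost:8983/solr"

def build_requests(user_query):
    """Build the search request plus the query-logging update request."""
    # 1. the ordinary search against the search core
    search_url = SOLR + "/products/select?" + urlencode(
        {"q": user_query, "wt": "json"})
    # 2. store the raw query text in the logging core; that core's update
    #    chain would add the timestamp (TimestampUpdateProcessorFactory)
    #    and dedupe repeats (SignatureUpdateProcessorFactory)
    log_url = SOLR + "/querylog/update?commitWithin=10000"
    log_body = json.dumps([{"query_s": user_query}])
    return search_url, log_url, log_body

search_url, log_url, log_body = build_requests("canon pixma printer")
print(search_url)
print(log_body)
```

 Moving this into a custom SearchComponent, as suggested above, would make
 the logging invisible to the client.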

 Hope it helps,

 - Original Message -
 From: Shawn Heisey apa...@elyograg.org
 To: solr-user@lucene.apache.org
 Sent: Sunday, February 8, 2015 11:03:33 AM
 Subject: [MASSMAIL]Re: Trending functionality in Solr

 On 2/7/2015 9:26 PM, S.L wrote:
  Is there a way to implement trending functionality using Solr, i.e., to
  get, via a query, say, the most searched terms in the past hour or so?
  If the most searched terms are not possible, is it at least possible to
  get the results for the last 100 terms?

 I'm reasonably sure that the only thing Solr has out of the box that can
 record queries is the logging feature that defaults to INFO.  That data
 is not directly available to Solr, and it's not in a good format for
 easy parsing.

 Queries are not stored anywhere else by Solr.  From what I understand,
 analysis is a relatively easy part of the equation, but the data must be
 available first, which is the hard part.  Storing it in RAM is pretty
 much a non-starter -- there are installations that see thousands of
 queries every second.

 This is an area for improvement, but the infrastructure must be written
 from scratch.  All work on this project is volunteer.  We are highly
 motivated volunteers, but extensive work like this is difficult to fit
 into donated time.

 Many people who use Solr are already recording all queries in some other
 system (like a database), so it is far easier to implement analysis on
 that data.

 Thanks,
 Shawn




Trending functionality in Solr

2015-02-08 Thread S.L
Folks,

Is there a way to implement trending functionality using Solr, i.e., to
get, via a query, say, the most searched terms in the past hour or so? If
the most searched terms are not possible, is it at least possible to get
the results for the last 100 terms?

Thanks


DocExpirationUpdateProcessorFactory not deleting records

2015-02-04 Thread S.L
I am trying to use the DocExpirationUpdateProcessorFactory in Solr version
4.10.1.

I have included the following in my solrconfig.xml

<updateRequestProcessorChain default="true">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>

  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">timestamp</str>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">ttl</str>
    <str name="expirationFieldName">expire_at</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

And I have included the following in my schema.xml

<field name="timestamp" type="date" indexed="true" stored="true"
       default="NOW" multiValued="false"/>
<field name="ttl" type="date" indexed="true" stored="true"
       default="NOW+60SECONDS" multiValued="false"/>
<field name="expire_at" type="date" indexed="true" stored="true"
       multiValued="false"/>

As you can see, I am setting the time to live to 60 seconds and checking
for deletions every 30 seconds; yet when I insert a document and check
after a minute, a couple of minutes, or even an hour, it never gets
deleted.

This is what I see in the indexed document; can you please let me know
what the issue might be? Note that the expire_at field is never generated
in the Solr document, as can be seen below.



{
  "id": "3888a8ac-fbc4-437a-8248-132384753c00",
  "timestamp": "2015-02-04T04:09:21.29Z",
  "_version_": 1492147724740460500,
  "ttl": "2015-02-04T04:10:21.29Z"
}

Re: DocExpirationUpdateProcessorFactory not deleting records

2015-02-04 Thread S.L
Thanks for giving multiple options; I'll try them both out. But the last
time I checked, having "+60SECONDS" as the default value for ttl was
giving me an invalid date format exception. I am assuming that would only
be the case if I use it with the default mechanism in schema.xml, but not
when we use solr.DefaultValueUpdateProcessorFactory?



On Wed, Feb 4, 2015 at 1:56 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : <processor
 :   class="solr.processor.DocExpirationUpdateProcessorFactory">
 :   <int name="autoDeletePeriodSeconds">30</int>
 :   <str name="ttlFieldName">ttl</str>
 :   <str name="expirationFieldName">expire_at</str>
 : </processor>

 ...

 : And I have included the following in my schema.xml
 :
 : <field name="ttl" type="date" indexed="true" stored="true"
 :        default="NOW+60SECONDS" multiValued="false"/>

 there are a couple of problems here...

 : As you can see I am setting the time to live to be 60 seconds and
 checking
 : to delete every 30 seconds, when I insert a document , and check after a
 : minute or couple or an hour it never gets deleted.

 first off: you aren't actually setting the ttl to 60 seconds, you are
 setting the ttl to be a fixed moment in time which is 60 seconds from when
 the doc is written to the index -- basically you are eliminating the need
 for having a ttl field/param at all and saying this is *exactly* when the
 document should expire.

 if that's what you want to do, just eliminate the ttlFieldName everywhere
 in your schema.xml and solrconfig.xml and set up expire_at in your
 schema.xml with a default="NOW+60SECONDS" and you'll probably be good to
 go.

 second...

 : what might be the issue here ? Please note that the expire_at field is
 : never getting generated in the Solr document as can be seen below.

 ...even if you redefined your ttl field to look like this...

    <field name="ttl" type="string" default="+60SECONDS" />

 ...the expire_at still wouldn't be populated by the processor because
 schema field default values are populated *after* the processors run --
 so when the DocExpirationUpdateProcessorFactory sees the documents being
 added, it has no idea that they all have a default ttl, so it doesn't know
 that you want it to compute an expire_at for you.

 instead of using default= in the schema, you can use the
 DefaultValueUpdateProcessorFactory to assign it *before* the
 DocExpirationUpdateProcessorFactory sees the doc...

  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">ttl</str>
    <str name="value">+60SECONDS</str>
  </processor>
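 Putting these pieces together, a full chain based on the two fixes above
 might look like the following untested sketch (it keeps the UUID, log,
 and run processors from the original post; ttl must then be declared as a
 *string* field with no default in schema.xml, and expire_at as a date
 field):

```xml
<updateRequestProcessorChain default="true">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <!-- assign ttl BEFORE the expiration processor runs, so it can
       compute expire_at from it -->
  <processor class="solr.DefaultValueUpdateProcessorFactory">
    <str name="fieldName">ttl</str>
    <str name="value">+60SECONDS</str>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">ttl</str>
    <str name="expirationFieldName">expire_at</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```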




 -Hoss
 http://www.lucidworks.com/



Re: DocExpirationUpdateProcessorFactory not deleting records

2015-02-04 Thread S.L
Great, this is the first example I have seen so far; I wish we could
include it in the Wiki. Thanks again!

On Wed, Feb 4, 2015 at 2:04 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:

 :
 : Thanks for giving multiple options , I ll try them out both ,but last
 time
 : I checked, having +60SECONDS as the default value for ttl was giving me
 : an invalid date format exception, I am assuming that would only be
 the

 that's because ttl should not be a date field -- it should be a *string*
 (as noted in my examples)

 time to live is a date math expression that the processor will evaluate
 for you -- not a date.  if you want to specify an explicit date, just set
 expire_at directly.

 i.e.: do you want to do the math yourself (set expire_at as a date field),
 or do you want the processor to do the math for you (set ttl as a string
 field)?

 :  ...even if you redefined your ttl field to look like this...
 : 
 :    <field name="ttl" type="string" default="+60SECONDS" />
 : 
 :  ...the expire_at still wouldn't be populated by the processor because
 :  schema field default values are populated *after* the processors run
 --
 :  so when the DocExpirationUpdateProcessorFactory sees the documents
 being
 :  added, it has no idea that they all have a default ttl, so it doesn't
 know
 :  that you want it to compute an expire_at for you.
 : 
 :  instead of using default= in the schema, you can use the
 :  DefaultValueUpdateProcessorFactory to assign it *before* the
 :  DocExpirationUpdateProcessorFactory sees the doc...
 : 
 :   <processor class="solr.DefaultValueUpdateProcessorFactory">
 :     <str name="fieldName">ttl</str>
 :     <str name="value">+60SECONDS</str>
 :   </processor>


 -Hoss
 http://www.lucidworks.com/



Re: distrib=false

2014-12-28 Thread S.L
Erick,

I have attached a screenshot of the topology. As you can see, I have three
nodes, and no two replicas of the same shard reside on the same node; this
was done deliberately so as not to affect availability.

The query I use to test is a general get-all query, *:*.

The behavior I notice is that even when a particular replica of a shard is
queried using distrib=false, the request goes to the other replica of the
same shard.
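For reference, this is roughly how the per-replica test queries can be
built (node and core names below are made up; distrib=false asks the
receiving core to answer from its own index instead of fanning out):

```python
from urllib.parse import urlencode

# Hypothetical replica base URLs for one shard of the collection.
replicas = [
    "http://node1:8081/solr/dyCollection1_shard1_replica1",
    "http://node2:8081/solr/dyCollection1_shard1_replica2",
]

# The same get-all test query, restricted to the core that receives it.
params = urlencode({"q": "*:*", "distrib": "false", "wt": "json"})
urls = [base + "/select?" + params for base in replicas]
for u in urls:
    print(u)
```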

Thanks.

On Sat, Dec 27, 2014 at 2:10 PM, Erick Erickson erickerick...@gmail.com
wrote:

 How are you sending the request? AFAIK, setting distrib=false
 should keep the query from being sent to any other node,
 although I'm not quite sure what happens when you host multiple
 replicas of the _same_ shard on the same node.

 So we need:
 1) your topology: how many nodes, and which replicas on each?
 2) the actual query you send.

 Best,
 Erick

 On Sat, Dec 27, 2014 at 8:14 AM, S.L simpleliving...@gmail.com wrote:
  Hi All,
 
  I have a question regarding distrib=false on a Solr query. It seems that
  distribution is restricted only across the shards when the parameter is
  set to false, meaning if I query a particular node within a shard with a
  replication factor of more than one, the request could go to another node
  within the same shard which is a replica of the node that I made the
  initial request to. Is my understanding correct?
 
  If the answer to my question is yes, then how do we make sure that the
  request goes to only the node I intend to make the request to  ?
 
  Thanks.



How to implement multi-set in a Solr schema.

2014-12-28 Thread S.L
Hi All,

I have a use case where I need to group documents that have the same value
in a field called bookName. If multiple documents share the same bookName
value and the user's input is searched via a query on bookName, I need to
group all the documents with the same bookName together, so that I can
display them as a group in the UI.

What kind of support does Solr provide for such a scenario, and how should
I look at changing my schema.xml, which has bookName as a single-valued
text field?

Thanks.
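Solr's result grouping (field collapsing) covers this kind of use case; as
a hedged sketch (the core name "books" and the example query are made up),
a grouped request could look like:

```python
from urllib.parse import urlencode

# Hypothetical core name and query; group.field collapses results into
# one group per distinct bookName value.
params = {
    "q": "bookName:solr",
    "group": "true",            # enable result grouping
    "group.field": "bookName",  # one group per distinct bookName
    "group.limit": "10",        # documents returned inside each group
    "wt": "json",
}
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(url)
```

Note that for one group per book, the field grouped on should be
single-valued and non-tokenized; a common approach is a string copyField
of the text field used for searching.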


distrib=false

2014-12-27 Thread S.L
Hi All,

I have a question regarding distrib=false on a Solr query. It seems that
distribution is restricted only across the shards when the parameter is
set to false, meaning if I query a particular node within a shard with a
replication factor of more than one, the request could go to another node
within the same shard which is a replica of the node that I made the
initial request to. Is my understanding correct?

Thanks.


Re: 'Illegal character in query' on Solr cloud 4.10.1

2014-12-25 Thread S.L
Jack,

I am using this query to test from the browser, and this occurs
consistently for 5 out of the 6 servers in the cluster. The actual API
that I use is pysolr, so from the front end the query is sent using
pysolr.

I face the same issue in both Firefox and Google Chrome. The fact that
there is an existing Jira for a similar issue made me think this is a Solr
issue, but I am still not clear how I can circumvent it.
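One sketch of a client-side workaround, assuming the client is currently
passing the raw boost string through: percent-encode the circumflex before
the parameter goes onto the URL.

```python
from urllib.parse import quote

# The boost clause from the failing query; '^' is the character at
# index 181 that the servlet container rejects when sent raw.
bq = "hasThumbnailImage:true^2.0"

# Percent-encode everything except ':'; '^' becomes %5E, which any
# container will accept and Solr will decode back to '^'.
encoded = quote(bq, safe=":")
print(encoded)  # hasThumbnailImage:true%5E2.0
```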





On Wed, Dec 24, 2014 at 4:57 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Is the problem here that the error occurs sometimes or that it doesn't
 occur all of the time? I mean, it is clearly a bug in the client if it is
 sending a raw circumflex rather than a URL-encoded circumflex.

 Also, some browsers automatically URL-encode characters as needed, but I
 have heard that some browsers don't always encode all of the characters.

 Question: You mention the URL, but how are you sending that URL to Solr -
 via a browser address box, curl, or... what?

 If using curl, you also have to cope with some characters having a shell
 meaning and needing to be escaped.

 Whether it is Tomcat or Solr that gives the error, the main point is that
 the raw circumflex shouldn't be sent to either.


 -- Jack Krupansky

 On Wed, Dec 24, 2014 at 4:32 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  OK, then I don't think it's a Solr problem. I think 5 of your Tomcats are
  configured in such a way that they consider ^ to be an illegal character.
 
  There have been recurring problems with Servlet containers being
  configured to allow/disallow various characters, and I think that's
  what's happening here. But this is totally outside Solr.
 
  Solr, when it successfully distributes a query, sends the query on to one
  replica of each shard, and I was wondering if that process wasn't
  working correctly somehow, although boosting is so common that it
  would be a huge shock since it would have broken almost every
  Tomcat installation out there. By sending the query directly to each
  node, you've bypassed any forwarding by Solr so it looks like the
  problem is before Solr even sees it.
 
  So my guess is that somehow 5 of your servers are configured to
  expect a different character than the server that works. I'm afraid
  I don't know Tomcat well enough to direct you there, but take a
  look here:
  https://wiki.apache.org/solr/SolrTomcat
 
  Sorry I can't be more help
  Erick
 
  On Wed, Dec 24, 2014 at 1:33 AM, S.L simpleliving...@gmail.com wrote:
   Erik,
  
   The scenario 1, that you have listed is what seems to be the case.
  
   When I add distrib=false to query each one of the 6 servers only 1 of
  them
   returns results (partial) and the rest of them give the illegal
 character
   error .
  
   I have not set up any special logging I do not see any info in the
   catalina.out but in a file called localhost_access_log.2014-12-24.txt
 in
   tomcat/logs directory, I see the following logging message when the
  invalid
   character error occurs.
  
   [24/Dec/2014:09:25:54 +0000] GET
   /solr/dyCollection1_shard2_replica1/?fl=*,score&q=canon+pixma+printer&sort=score+desc,productNameLength%20asc&wt=json&indent=true&rows=100&defType=edismax&qf=productName&mm=2&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true&bq=hasThumbnailImage:true^2.0&distrib=false
   HTTP/1.1 500 7781
  
   I am using Tomcat 7.0.42 and SolrCloud 4.10.1 and the Oracle JDK .
  
   java version 1.7.0_71
   Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
   Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
  
   Thanks.
  
   On Tue, Dec 23, 2014 at 11:46 AM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
   Hmmm, so you are you pinging the servers directly, right?
   Here's a couple of things to try:
   1 add distrib=false to the query and try each of the 6 servers.
   What I'm wondering is if this is happening on the sub-query sent
   out or on the primary server. Adding distrib=false will just execute
   on the node you're sending it to, and will NOT send sub-queries out
   to any other node so you'll get partial results back.
  
   If one server continues to work but the other 5 fail, then your
 servlet
   container is probably not set up with the right character sets.
 Although
   why that would manifest itself on the ^ character mystifies me.
  
   2 Let's assume that all 6 servers handle the raw query. Next thing
 that
   would be really helpful is to see the sub-queries. Take distrib=false
   off and tail the logs on all the servers. What we're looking for here
 is
   whether the sub-queries even make it to Solr or whether the problem
   is in your container.
  
   3 If the sub-queries do NOT make it to the Solr logs, what is the
 query
   that the container sees? Is it recognizable or has Solr somehow munged
   the sub-query?
  
   What is your environment like? Tomcat? Jetty? Other? What JVM
   etc?
  
   Best,
   Erick
  
   On Tue, Dec 23

Re: 'Illegal character in query' on Solr cloud 4.10.1

2014-12-24 Thread S.L
Erik,

The scenario 1, that you have listed is what seems to be the case.

When I add distrib=false to query each one of the 6 servers, only 1 of
them returns (partial) results, and the rest give the illegal character
error.

I have not set up any special logging. I do not see any info in
catalina.out, but in a file called localhost_access_log.2014-12-24.txt in
the tomcat/logs directory I see the following log entry when the invalid
character error occurs.

[24/Dec/2014:09:25:54 +0000] GET
/solr/dyCollection1_shard2_replica1/?fl=*,score&q=canon+pixma+printer&sort=score+desc,productNameLength%20asc&wt=json&indent=true&rows=100&defType=edismax&qf=productName&mm=2&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true&bq=hasThumbnailImage:true^2.0&distrib=false
HTTP/1.1 500 7781

I am using Tomcat 7.0.42 and SolrCloud 4.10.1 and the Oracle JDK .

java version 1.7.0_71
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)

Thanks.

On Tue, Dec 23, 2014 at 11:46 AM, Erick Erickson erickerick...@gmail.com
wrote:

 Hmmm, so you are pinging the servers directly, right?
 Here are a couple of things to try:
 1) add distrib=false to the query and try each of the 6 servers.
 What I'm wondering is if this is happening on the sub-query sent
 out or on the primary server. Adding distrib=false will just execute
 on the node you're sending it to, and will NOT send sub-queries out
 to any other node, so you'll get partial results back.

 If one server continues to work but the other 5 fail, then your servlet
 container is probably not set up with the right character sets. Although
 why that would manifest itself on the ^ character mystifies me.

 2) Let's assume that all 6 servers handle the raw query. The next thing
 that would be really helpful is to see the sub-queries. Take distrib=false
 off and tail the logs on all the servers. What we're looking for here is
 whether the sub-queries even make it to Solr or whether the problem
 is in your container.

 3) If the sub-queries do NOT make it to the Solr logs, what is the query
 that the container sees? Is it recognizable or has Solr somehow munged
 the sub-query?

 What is your environment like? Tomcat? Jetty? Other? What JVM
 etc?

 Best,
 Erick

 On Tue, Dec 23, 2014 at 3:23 AM, S.L simpleliving...@gmail.com wrote:
  Hi All,
 
  I am using SolrCloud 4.10.1 and I have 3 shards with a replication
  factor of 2, i.e. 6 nodes altogether.
 
  When I query server1 of the 6 nodes in the cluster with the query below,
  it works fine, but querying any other node in the cluster with the same
  query results in an *HTTP Status 500 - {msg=Illegal character in query
  at index 181:* error.

  The character at index 181 is the boost character ^. I have seen a Jira,
  SOLR-5971 https://issues.apache.org/jira/browse/SOLR-5971, for a
  similar issue; how can I overcome this?
 
  The query I use is below. Thanks in advance!

  http://xx2..com:8081/solr/dyCollection1_shard2_replica1/?q=x+x+xx&sort=score+desc&wt=json&indent=true&debugQuery=true&defType=edismax&qf=productName^1.5+productDescription&mm=1&pf=productName+productDescription&ps=1&pf2=productName+productDescription&pf3=productName+productDescription&stopwords=true&lowercaseOperators=true



'Illegal character in query' on Solr cloud 4.10.1

2014-12-23 Thread S.L
Hi All,

I am using SolrCloud 4.10.1 and I have 3 shards with a replication factor
of 2, i.e. 6 nodes altogether.

When I query server1 of the 6 nodes in the cluster with the query below,
it works fine, but querying any other node in the cluster with the same
query results in an *HTTP Status 500 - {msg=Illegal character in query at
index 181:* error.

The character at index 181 is the boost character ^. I have seen a Jira,
SOLR-5971 https://issues.apache.org/jira/browse/SOLR-5971, for a similar
issue; how can I overcome this?

The query I use is below. Thanks in advance!

http://xx2..com:8081/solr/dyCollection1_shard2_replica1/?q=x+x+xx&sort=score+desc&wt=json&indent=true&debugQuery=true&defType=edismax&qf=productName^1.5+productDescription&mm=1&pf=productName+productDescription&ps=1&pf2=productName+productDescription&pf3=productName+productDescription&stopwords=true&lowercaseOperators=true


Re: Length norm not functioning in solr queries.

2014-12-11 Thread S.L
Mikhail,

Thank you for confirming this; however, Ahmet's proposal seems simpler to
implement to me.

On Wed, Dec 10, 2014 at 5:07 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 S.L,

 I briefly skimmed Lucene50NormsConsumer.writeNormsField(); my conclusion
 is: if you supply your own similarity, which just avoids squashing the
 float into a byte in Similarity.computeNorm(FieldInvertState), you get
 exactly this value back in Similarity.decodeNormValue(long).
 You may wonder, but this is exactly what's done in
 PreciseDefaultSimilarity in TestLongNormValueSource. I think you can just
 use it.

 On Wed, Dec 10, 2014 at 12:11 PM, S.L simpleliving...@gmail.com wrote:

  Hi Ahmet,
 
  Is there already an implementation of the suggested work around ? Thanks.
 
  On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan iori...@yahoo.com.invalid
  wrote:
 
   Hi,
  
   The default length norm is not the best option for differentiating very
   short documents, like product names.
   Please see :
   http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec

   I suggest you create an additional integer field that holds the number
   of tokens. You can populate it via an update processor, and then
   penalise (using function queries) according to that field. This way you
   have more fine-grained and flexible control over it.
  
   Ahmet
  
  
  
   On Tuesday, December 9, 2014 12:22 PM, S.L simpleliving...@gmail.com
   wrote:
   Hi ,
  
   Mikhail, thanks. I looked at the explain output and this is what I see
   for the two documents in question: they have identical scores even
   though document 2 has a shorter productName field, and I do not see any
   lengthNorm-related information in the explain.
  
   Also, I am not exactly clear on what needs to be looked at in the API?
  
   *Search Query*: q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true
  
   *productName: Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
   Unlocked*
  
  
  - *100%* 10.649221 sum of the following:
 - *10.58%* 1.1270299 sum of the following:
- *2.1%* 0.22383358 productName:iphon
- *3.47%* 0.36922288 productName:4 s
- *5.01%* 0.53397346 productName:16 gb
 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
 - *27.79%* 2.959255 sum of the following:
- *10.97%* 1.1680154 productName:iphon 4 s~1
- *16.82%* 1.7912396 productName:4 s 16 gb~1
 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
  
  
   *productName Apple iPhone 4S 16GB for Net10, No Contract, White*
  
  
  - *100%* 10.649221 sum of the following:
 - *10.58%* 1.1270299 sum of the following:
- *2.1%* 0.22383358 productName:iphon
- *3.47%* 0.36922288 productName:4 s
- *5.01%* 0.53397346 productName:16 gb
 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
 - *27.79%* 2.959255 sum of the following:
- *10.97%* 1.1680154 productName:iphon 4 s~1
- *16.82%* 1.7912396 productName:4 s 16 gb~1
 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
  
  
  
  
  
   On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev 
   mkhlud...@griddynamics.com wrote:
  
 It's worth looking into explain to check particular scoring values. But
 the most suspect thing is the reduced precision when float norms are
 stored in byte vals. See the javadoc for
 DefaultSimilarity.encodeNormValue(float).
   
   
On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com
 wrote:
   
 I have two documents doc1 and doc2 and each one of those has a
 field
called
 phoneName.

 doc1:phoneName:Details about  Apple iPhone 4s - 16GB - White
  (Verizon)
 Smartphone Factory Unlocked

 doc2:phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White

 Here if I search for


   
  
 
  q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true

 Doc1 and Doc2 both have the same score, but since the field phoneName in
 doc2 is shorter I would expect doc2 to have a higher score; instead both
 have an identical score of 9.961212.

 The phoneName field is defined as follows. As you can see, nowhere am I
 specifying omitNorms=true, yet the behavior seems to be that the length
 norm is not functioning at all. Can someone let me know what the issue is
 here?

 <field name="phoneName" type="text_en_splitting" indexed="true"
        stored="true" required="true" />
 <fieldType name="text_en_splitting" class="solr.TextField"
            positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <!-- in this example, we will only use

Re: Length norm not functioning in solr queries.

2014-12-11 Thread S.L
Ahmet,

Thank you. As the configurations in SolrCloud are uploaded to ZooKeeper,
are there any special steps that need to be taken to make this work in
SolrCloud?

On Wed, Dec 10, 2014 at 4:32 AM, Ahmet Arslan iori...@yahoo.com.invalid
wrote:

 Hi,

 Or even better, you can use your new field for tie-break purposes where
 scores are identical,
 e.g. sort=score desc, wordCount asc

 Ahmet


 On Wednesday, December 10, 2014 11:29 AM, Ahmet Arslan iori...@yahoo.com
 wrote:
 Hi,

 You mean update processor factory?

 Here is augmented (wordCount field added) version of your example :

 doc1:

 phoneName:Details about  Apple iPhone 4s - 16GB - White (Verizon)
 Smartphone Factory Unlocked
 wordCount: 11

 doc2:

 phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White
 wordCount: 9


 The first task is simply to calculate the wordCount values. You can do it
 in your indexing code, or elsewhere.
 I quickly skimmed the existing update processors but couldn't find a
 stock implementation.
 CountFieldValuesUpdateProcessorFactory fooled me, but it looks like it is
 all about multivalued fields.

 I guess a simple JavaScript snippet that splits on whitespace and returns
 the size of the produced array would do the trick:
 StatelessScriptUpdateProcessorFactory
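 A minimal untested sketch of such a script, to be referenced from a
 solr.StatelessScriptUpdateProcessorFactory entry in the update chain (the
 field names phoneName and wordCount follow the examples above):

```javascript
// wordcount.js -- counts whitespace-separated tokens in phoneName and
// stores the count in wordCount on each added document.

function countTokens(text) {
  // trim first so leading/trailing whitespace doesn't create empty tokens
  var trimmed = String(text).replace(/^\s+|\s+$/g, "");
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var name = doc.getFieldValue("phoneName");
  if (name !== null) {
    doc.setField("wordCount", countTokens(name));
  }
}

// no-op hooks for the other update events
function processDelete(cmd) {}
function processCommit(cmd) {}
function finish() {}
```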



 At this point you have an int field named wordCount.
 boost=div(1,wordCount) should work, or you can come up with a more
 sophisticated formula.

 Ahmet


 On Wednesday, December 10, 2014 11:12 AM, S.L simpleliving...@gmail.com
 wrote:
 Hi Ahmet,

 Is there already an implementation of the suggested work around ? Thanks.


 On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan iori...@yahoo.com.invalid
 wrote:

  Hi,
 
  The default length norm is not the best option for differentiating very
  short documents, like product names.
  Please see :
  http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec

  I suggest you create an additional integer field that holds the number
  of tokens. You can populate it via an update processor, and then
  penalise (using function queries) according to that field. This way you
  have more fine-grained and flexible control over it.
 
  Ahmet
 
 
 
  On Tuesday, December 9, 2014 12:22 PM, S.L simpleliving...@gmail.com
  wrote:
  Hi ,
 
  Mikhail Thanks , I looked at the explain and this is what I see for the
 two
  different documents in questions, they have identical scores   even
 though
  the document 2 has a shorter productName field, I do not see any
 lenghtNorm
  related information in the explain.
 
  Also I am not exactly clear on what needs to be looked in the API ?
 
  *Search Query*: q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true
 
  *productName: Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
  Unlocked*
 
 
 - *100%* 10.649221 sum of the following:
- *10.58%* 1.1270299 sum of the following:
   - *2.1%* 0.22383358 productName:iphon
   - *3.47%* 0.36922288 productName:4 s
   - *5.01%* 0.53397346 productName:16 gb
- *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
- *27.79%* 2.959255 sum of the following:
   - *10.97%* 1.1680154 productName:iphon 4 s~1
   - *16.82%* 1.7912396 productName:4 s 16 gb~1
- *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
 
 
  *productName Apple iPhone 4S 16GB for Net10, No Contract, White*
 
 
 - *100%* 10.649221 sum of the following:
- *10.58%* 1.1270299 sum of the following:
   - *2.1%* 0.22383358 productName:iphon
   - *3.47%* 0.36922288 productName:4 s
   - *5.01%* 0.53397346 productName:16 gb
- *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
- *27.79%* 2.959255 sum of the following:
   - *10.97%* 1.1680154 productName:iphon 4 s~1
   - *16.82%* 1.7912396 productName:4 s 16 gb~1
- *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
 
 
 
 
 
  On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   It's worth to look into explain to check particular scoring values.
 But
   for most suspect is the reducing precision when float norms are stored
 in
   byte vals. See javadoc for DefaultSimilarity.encodeNormValue(float)
  
  
   On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote:
  
I have two documents doc1 and doc2 and each one of those has a field
   called
phoneName.
   
doc1:phoneName:Details about  Apple iPhone 4s - 16GB - White
 (Verizon)
Smartphone Factory Unlocked
   
doc2:phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White
   
Here if I search for
   
   
  
 
  q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true
   
 Doc1 and Doc2 both have the same score, but since the field phoneName in
 doc2 has a shorter length I would expect it to score higher; yet both
 have an identical

Re: Length norm not functioning in solr queries.

2014-12-11 Thread S.L
Yes, I understand that reindexing is necessary; however, for some reason I
was not able to invoke the JS script from the update processor, so I ended
up using a Java-only solution at index time.

Thanks.

On Thu, Dec 11, 2014 at 7:18 AM, Ahmet Arslan iori...@yahoo.com.invalid
wrote:

 Hi,

 No special steps need to be taken for the cloud setup. Please note that
 for both solutions, a re-index is mandatory.

 Ahmet



 On Thursday, December 11, 2014 12:15 PM, S.L simpleliving...@gmail.com
 wrote:
 Ahmet,

 Thank you. As the configurations in SolrCloud are uploaded to ZooKeeper,
 are there any special steps that need to be taken to make this work in
 SolrCloud?


 On Wed, Dec 10, 2014 at 4:32 AM, Ahmet Arslan iori...@yahoo.com.invalid
 wrote:
 
  Hi,
 
   Or even better, you can use your new field for tie-break purposes, where
   scores are identical,
   e.g. sort=score desc, wordCount asc
 
  Ahmet
 
 
  On Wednesday, December 10, 2014 11:29 AM, Ahmet Arslan 
 iori...@yahoo.com
  wrote:
  Hi,
 
  You mean update processor factory?
 
  Here is augmented (wordCount field added) version of your example :
 
  doc1:
 
  phoneName:Details about  Apple iPhone 4s - 16GB - White (Verizon)
  Smartphone Factory Unlocked
  wordCount: 11
 
  doc2:
 
  phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White
  wordCount: 9
 
 
  The first task is simply to calculate the wordCount values. You can do it
  in your indexing code, or in other places.
  I quickly skimmed the existing update processors but I couldn't find a
  stock implementation.
  CountFieldValuesUpdateProcessorFactory fooled me, but it looks like it is
  all about multivalued fields.

  I guess a simple JavaScript snippet that splits on whitespace and returns
  the produced array size would do the trick:
  StatelessScriptUpdateProcessorFactory

  At this point you have an int field named wordCount.
  boost=div(1,wordCount) should work. Or you can come up with a more
  sophisticated math formula.
 
  Ahmet
 
 
  On Wednesday, December 10, 2014 11:12 AM, S.L simpleliving...@gmail.com
 
  wrote:
  Hi Ahmet,
 
   Is there already an implementation of the suggested workaround? Thanks.
 
 
  On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan iori...@yahoo.com.invalid
  wrote:
 
   Hi,
  
    The default length norm is not the best option for differentiating very
    short documents, like product names.
    Please see :
    http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec

    I suggest you create an additional integer field that holds the number
    of tokens. You can populate it via an update processor, and then
    penalise (using function queries) according to that field. This way you
    have more fine-grained and flexible control over it.
  
   Ahmet
  
  
  
   On Tuesday, December 9, 2014 12:22 PM, S.L simpleliving...@gmail.com
   wrote:
   Hi ,
  
    Thanks Mikhail, I looked at the explain output and this is what I see
    for the two documents in question: they have identical scores even
    though document 2 has a shorter productName field, and I do not see any
    lengthNorm-related information in the explain output.

    Also, I am not exactly clear on what needs to be looked at in the API?
  
    *Search Query* : q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true
  
    *productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
   Unlocked *
  
  
  - *100%* 10.649221 sum of the following:
 - *10.58%* 1.1270299 sum of the following:
- *2.1%* 0.22383358 productName:iphon
- *3.47%* 0.36922288 productName:4 s
- *5.01%* 0.53397346 productName:16 gb
 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
 - *27.79%* 2.959255 sum of the following:
- *10.97%* 1.1680154 productName:iphon 4 s~1
- *16.82%* 1.7912396 productName:4 s 16 gb~1
 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
  
  
   *productName Apple iPhone 4S 16GB for Net10, No Contract, White*
  
  
  - *100%* 10.649221 sum of the following:
 - *10.58%* 1.1270299 sum of the following:
- *2.1%* 0.22383358 productName:iphon
- *3.47%* 0.36922288 productName:4 s
- *5.01%* 0.53397346 productName:16 gb
 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
 - *27.79%* 2.959255 sum of the following:
- *10.97%* 1.1680154 productName:iphon 4 s~1
- *16.82%* 1.7912396 productName:4 s 16 gb~1
 - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
  
  
  
  
  
   On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev 
   mkhlud...@griddynamics.com wrote:
  
 It's worth looking into explain to check particular scoring values. But
 the prime suspect is the reduced precision when float norms are stored as
 byte values. See the javadoc for DefaultSimilarity.encodeNormValue(float).
   
   
On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com
 wrote

Re: Length norm not functioning in solr queries.

2014-12-10 Thread S.L
Hi Ahmet,

Is there already an implementation of the suggested workaround? Thanks.

On Tue, Dec 9, 2014 at 6:41 AM, Ahmet Arslan iori...@yahoo.com.invalid
wrote:

 Hi,

 Default length norm is not best option for differentiating very short
 documents, like product names.
 Please see :
 http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec

 I suggest you to create an additional integer field, that holds number of
 tokens. You can populate it via update processor. And then penalise (using
 fuction queries) according to that field. This way you have more fine
 grained and flexible control over it.

 Ahmet



 On Tuesday, December 9, 2014 12:22 PM, S.L simpleliving...@gmail.com
 wrote:
 Hi ,

 Thanks Mikhail, I looked at the explain output and this is what I see for
 the two documents in question: they have identical scores even though
 document 2 has a shorter productName field, and I do not see any
 lengthNorm-related information in the explain output.

 Also, I am not exactly clear on what needs to be looked at in the API?

 *Search Query* : q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true

 *productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
 Unlocked *


- *100%* 10.649221 sum of the following:
   - *10.58%* 1.1270299 sum of the following:
  - *2.1%* 0.22383358 productName:iphon
  - *3.47%* 0.36922288 productName:4 s
  - *5.01%* 0.53397346 productName:16 gb
   - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
   - *27.79%* 2.959255 sum of the following:
  - *10.97%* 1.1680154 productName:iphon 4 s~1
  - *16.82%* 1.7912396 productName:4 s 16 gb~1
   - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1


 *productName Apple iPhone 4S 16GB for Net10, No Contract, White*


- *100%* 10.649221 sum of the following:
   - *10.58%* 1.1270299 sum of the following:
  - *2.1%* 0.22383358 productName:iphon
  - *3.47%* 0.36922288 productName:4 s
  - *5.01%* 0.53397346 productName:16 gb
   - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
   - *27.79%* 2.959255 sum of the following:
  - *10.97%* 1.1680154 productName:iphon 4 s~1
  - *16.82%* 1.7912396 productName:4 s 16 gb~1
   - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1





 On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

   It's worth looking into explain to check particular scoring values. But
   the prime suspect is the reduced precision when float norms are stored
   as byte values. See the javadoc for DefaultSimilarity.encodeNormValue(float).
 
 
  On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote:
 
   I have two documents doc1 and doc2 and each one of those has a field
  called
   phoneName.
  
   doc1:phoneName:Details about  Apple iPhone 4s - 16GB - White (Verizon)
   Smartphone Factory Unlocked
  
   doc2:phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White
  
   Here if I search for
  
  
 
 q=iphone+4s+16gbqf=phoneNamemm=1pf=phoneNameps=1pf2=phoneNamepf3=phoneNamestopwords=truelowercaseOperators=true
  
    Doc1 and Doc2 both have the same score, but since the field phoneName
    in doc2 has a shorter length I would expect it to score higher; yet
    both have an identical score of 9.961212.

    The phoneName field is defined as follows. As you can see, nowhere am I
    specifying omitNorms="true", yet the behavior suggests the length norm
    is not functioning at all. Can someone let me know what the issue is
    here?
  
    <field name="phoneName" type="text_en_splitting" indexed="true"
           stored="true" required="true" />
    <fieldType name="text_en_splitting" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory"
                     synonyms="index_synonyms.txt" ignoreCase="true"
                     expand="false"/> -->
        <!-- Case insensitive stop word removal. add
             enablePositionIncrements=true in both the index and query
             analyzers to leave a 'gap' for more accurate phrase queries. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="lang/stopwords_en.txt"
                enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0"
                splitOnCaseChange="1" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.KeywordMarkerFilterFactory"
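
Ahmet's StatelessScriptUpdateProcessorFactory idea from this thread could be
wired up roughly as follows (a sketch: the chain name, the script file name,
and the wordCount field name are illustrative, and an int field named
wordCount would need to exist in schema.xml):

```xml
<!-- solrconfig.xml (sketch): run a script that counts tokens at index time -->
<updateRequestProcessorChain name="wordcount">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">wordcount.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

with wordcount.js placed in the core's conf/ directory:

```javascript
// wordcount.js (sketch) -- count whitespace-separated tokens of phoneName
function processAdd(cmd) {
  var doc = cmd.solrInputDocument;
  var name = doc.getFieldValue("phoneName");
  if (name != null) {
    doc.setField("wordCount", String(name).split(/\s+/).length);
  }
}
// the factory expects these callbacks to exist, even if they do nothing
function processDelete(cmd) { }
function processMergeIndexes(cmd) { }
function processCommit(cmd) { }
function processRollback(cmd) { }
function finish() { }
```

At query time, something like boost=div(1,sqrt(wordCount)) with edismax
could then penalise longer names, per Ahmet's suggestion; re-indexing is
required for existing documents.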

Re: Length norm not functioning in solr queries.

2014-12-09 Thread S.L
Hi ,

Thanks Mikhail, I looked at the explain output and this is what I see for
the two documents in question: they have identical scores even though
document 2 has a shorter productName field, and I do not see any
lengthNorm-related information in the explain output.

Also, I am not exactly clear on what needs to be looked at in the API?

*Search Query* : q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true

*productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
Unlocked *


   - *100%* 10.649221 sum of the following:
  - *10.58%* 1.1270299 sum of the following:
 - *2.1%* 0.22383358 productName:iphon
 - *3.47%* 0.36922288 productName:4 s
 - *5.01%* 0.53397346 productName:16 gb
  - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
  - *27.79%* 2.959255 sum of the following:
 - *10.97%* 1.1680154 productName:iphon 4 s~1
 - *16.82%* 1.7912396 productName:4 s 16 gb~1
  - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1


*productName Apple iPhone 4S 16GB for Net10, No Contract, White*


   - *100%* 10.649221 sum of the following:
  - *10.58%* 1.1270299 sum of the following:
 - *2.1%* 0.22383358 productName:iphon
 - *3.47%* 0.36922288 productName:4 s
 - *5.01%* 0.53397346 productName:16 gb
  - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
  - *27.79%* 2.959255 sum of the following:
 - *10.97%* 1.1680154 productName:iphon 4 s~1
 - *16.82%* 1.7912396 productName:4 s 16 gb~1
  - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1




On Mon, Dec 8, 2014 at 10:25 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 It's worth looking into explain to check particular scoring values. But
 the prime suspect is the reduced precision when float norms are stored as
 byte values. See the javadoc for DefaultSimilarity.encodeNormValue(float).


 On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote:

  I have two documents doc1 and doc2 and each one of those has a field
 called
  phoneName.
 
  doc1:phoneName:Details about  Apple iPhone 4s - 16GB - White (Verizon)
  Smartphone Factory Unlocked
 
  doc2:phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White
 
  Here if I search for
 
 
 q=iphone+4s+16gbqf=phoneNamemm=1pf=phoneNameps=1pf2=phoneNamepf3=phoneNamestopwords=truelowercaseOperators=true
 
  Doc1 and Doc2 both have the same score, but since the field phoneName in
  doc2 has a shorter length I would expect it to score higher; yet both
  have an identical score of 9.961212.

  The phoneName field is defined as follows. As you can see, nowhere am I
  specifying omitNorms="true", yet the behavior suggests the length norm is
  not functioning at all. Can someone let me know what the issue is here?
 
   <field name="phoneName" type="text_en_splitting" indexed="true"
          stored="true" required="true" />
   <fieldType name="text_en_splitting" class="solr.TextField"
              positionIncrementGap="100" autoGeneratePhraseQueries="true">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <!-- in this example, we will only use synonyms at query time
            <filter class="solr.SynonymFilterFactory"
                    synonyms="index_synonyms.txt" ignoreCase="true"
                    expand="false"/> -->
       <!-- Case insensitive stop word removal. add
            enablePositionIncrements=true in both the index and query
            analyzers to leave a 'gap' for more accurate phrase queries. -->
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="lang/stopwords_en.txt"
               enablePositionIncrements="true" />
       <filter class="solr.WordDelimiterFilterFactory"
               generateWordParts="1" generateNumberParts="1"
               catenateWords="1" catenateNumbers="1" catenateAll="0"
               splitOnCaseChange="1" />
       <filter class="solr.LowerCaseFilterFactory" />
       <filter class="solr.KeywordMarkerFilterFactory"
               protected="protwords.txt" />
       <filter class="solr.PorterStemFilterFactory" />
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
               ignoreCase="true" expand="true" />
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="lang/stopwords_en.txt"
               enablePositionIncrements="true" />
       <filter class="solr.WordDelimiterFilterFactory"
               generateWordParts="1" generateNumberParts="1"
               catenateWords="0" catenateNumbers="0" catenateAll="0"
               splitOnCaseChange="1" />
       <filter class="solr.LowerCaseFilterFactory" />
       <filter
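
Mikhail's precision point can be made concrete with a small, self-contained
sketch. Lucene's DefaultSimilarity computes lengthNorm = 1/sqrt(numTerms)
and encodes it into a single byte via SmallFloat.floatToByte315; the method
below re-implements that encoder for illustration (it mirrors the Lucene
4.x source, but treat it as a sketch rather than the canonical code).
Because only 3 mantissa bits survive, fields of nearby lengths can collapse
onto the same stored norm:

```java
// NormPrecisionDemo.java -- illustrates why documents of similar (but
// different) lengths can end up with identical length norms in Lucene.
public class NormPrecisionDemo {

    // Encode a float into an 8-bit "small float" (3 mantissa / 5 exponent
    // bits), mirroring org.apache.lucene.util.SmallFloat.floatToByte315.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // DefaultSimilarity's length norm before encoding: 1/sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // In this range, lengths 11 through 16 all map onto one encoded
        // byte, so their stored norms decode identically.
        for (int terms = 8; terms <= 16; terms++) {
            System.out.printf("terms=%2d  norm=%.4f  encoded=%d%n",
                    terms, lengthNorm(terms),
                    floatToByte315(lengthNorm(terms)));
        }
    }
}
```

So even with norms enabled, a 9-token and an 11-token field may or may not
get distinct norms depending on where they fall in the quantization grid,
which is why the explain output can show no length-based difference.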

Length norm not functioning in solr queries.

2014-12-08 Thread S.L
I have two documents doc1 and doc2 and each one of those has a field called
phoneName.

doc1:phoneName:Details about  Apple iPhone 4s - 16GB - White (Verizon)
Smartphone Factory Unlocked

doc2:phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White

Here if I search for
q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true

Doc1 and Doc2 both have the same score, but since the field phoneName in
doc2 has a shorter length I would expect doc2 to score higher; yet both
have an identical score of 9.961212.

The phoneName field is defined as follows. As you can see, nowhere am I
specifying omitNorms="true", yet the behavior suggests the length norm is
not functioning at all. Can someone let me know what the issue is here?

<field name="phoneName" type="text_en_splitting" indexed="true"
       stored="true" required="true" />
<fieldType name="text_en_splitting" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- in this example, we will only use synonyms at query time
         <filter class="solr.SynonymFilterFactory"
                 synonyms="index_synonyms.txt" ignoreCase="true"
                 expand="false"/> -->
    <!-- Case insensitive stop word removal. add
         enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldType>


Re: Boosting the score using edismax for a non empty and non indexed field.

2014-12-08 Thread S.L
Anyone ?

On Mon, Dec 8, 2014 at 2:45 AM, S.L simpleliving...@gmail.com wrote:

 Hi All,

 I have a situation where I need to boost the score of a query if a field
 (imageURL) in the given document is non-empty. I am using edismax, so I
 know that using the bq parameter would solve the problem. However, the
 field imageURL that I am trying to boost on is not indexed, meaning
 (stored="true" and indexed="false"). Can I use the bq parameter for a
 non-indexed field, or should I be looking at re-indexing after changing
 the schema to make this an indexed field?

 Also, my use case is such that I want the documents that have an imageURL
 to be boosted so that they appear before those documents that do not have
 one when sorted by score in descending order. The field in question,
 imageURL, is sometimes present and sometimes not, which is why I am
 looking at boosting the score of those documents that have it.

 Thanks; any help and suggestions are much appreciated!





Boosting the score using edismax for a non empty and non indexed field.

2014-12-07 Thread S.L
Hi All,

I have a situation where I need to boost the score of a query if a field
(imageURL) in the given document is non-empty. I am using edismax, so I
know that using the bq parameter would solve the problem. However, the
field imageURL that I am trying to boost on is not indexed, meaning
(stored="true" and indexed="false"). Can I use the bq parameter for a
non-indexed field, or should I be looking at re-indexing after changing the
schema to make this an indexed field?

Also, my use case is such that I want the documents that have an imageURL
to be boosted so that they appear before those documents that do not have
one when sorted by score in descending order. The field in question,
imageURL, is sometimes present and sometimes not, which is why I am looking
at boosting the score of those documents that have it.

Thanks; any help and suggestions are much appreciated!
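
A note on the mechanics discussed above: bq is a regular query, so it can
only match indexed terms; with stored="true" indexed="false" it cannot see
imageURL at all. After re-indexing with indexed="true" (or after populating
a small boolean field such as hasImage via an update processor — the field
names and the boost factor below are illustrative), an edismax request
could look roughly like:

```
q=iphone+4s&defType=edismax&qf=phoneName&bq=imageURL:[* TO *]^5
```

The range query imageURL:[* TO *] matches any document where the field has
at least one indexed value, so documents carrying an image URL get the
extra boost and sort ahead when ordering by score descending.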


Re: Can we query on _version_field ?

2014-11-13 Thread S.L
Here is why I want to do this.

1. My unique key is an http URL, doctorURL.
2. If I do a lookup based on the URL, I am bound to face issues with
character escaping and the like.
3. To avoid that I was using a UUID for lookup, but in SolrCloud it
generates a unique value per replica, which is not acceptable.
4. Now I see that the mandatory _version_ field has a unique value per
document, not per replica, so I am exploring the use of _version_ to do a
lookup only and not necessarily use it as the unique key; is it doable in
that case?

On Thu, Nov 13, 2014 at 8:58 AM, Erick Erickson erickerick...@gmail.com
wrote:

 Really, I have to ask why you would want to. This is really purely an
 internal
 thing. I don't know what practical value there would be to search on this?

 Interestingly, I can search _version_:[100 TO *], but specific searches
 seem to fail.

 I wonder if there's something wonky going on with searching on large longs
 here.

 Feels like an XY problem to me though.

 Best,
 Erick

 On Thu, Nov 13, 2014 at 12:45 AM, S.L simpleliving...@gmail.com wrote:
  Hi All,
 
  We know that _version_ is a mandatory field in the SolrCloud schema.xml;
  it is expected to be of type long, and it also seems to have a unique
  value in a collection.

  However, a query of the form
  http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json
  does not seem to return any records. Can we query on the _version_ field
  in the schema.xml?
 
  Thank you.



Re: Can we query on _version_field ?

2014-11-13 Thread S.L
Erick,

1. _version_ will change on updates; shouldn't that be OK? My understanding
is that an update here means a new document will be inserted with the same
unique key (docUrl in my case), which effectively replaces the document.
This will not be an issue in my case because the initial search results,
based on doctorName, would show basic doctor data, and when that tile is
clicked upon, detail data would be displayed based on a lookup of the
_version_ id. So as long as _version_ does not change outside of updates, I
should be good. Of course there is a possibility of the document being
updated between the search results being displayed and the details being
requested, but that is unlikely in my case, because people usually request
details as soon as the initial search results are displayed.


2. Yes, I have used UUIDUpdateProcessorFactory in the following ways, but
none of them solves the issue, especially in SolrCloud.

*Case 1:*

*schema.xml*

<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false" />

This does not generate the unique id at all.

*Case 2:*

<field name="id" type="uuid" indexed="true" stored="true"
       required="true" multiValued="false" />

In this case a unique id is generated , but that is unique for every
replica and we end up with different ids for the same document in different
replicas.


In both the cases above the solrconfig.xml had the following entry.

  <updateRequestProcessorChain name="uuid">
    <processor class="solr.UUIDUpdateProcessorFactory">
      <str name="fieldName">id</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>



On Thu, Nov 13, 2014 at 11:01 AM, Erick Erickson erickerick...@gmail.com
wrote:

 _version_ will change on updates I'm pretty sure, so I doubt
 it's suitable.

 I _think_ you can use a UUIDUpdateProcessorFactory here.
 I haven't checked this personally, but the idea here is
 that the UUID cannot be assigned on the shard. But if you're
 checking this out, if the UUID is assigned _before_ the doc
 is sent to the destination shard, it should be fine.

 Have you checked that out? I'm at a conference, so I can't
 check it out too thoroughly right now...

 Best,
 Erick

 On Thu, Nov 13, 2014 at 10:18 AM, S.L simpleliving...@gmail.com wrote:
  Here is why I want to do this.

  1. My unique key is an http URL, doctorURL.
  2. If I do a lookup based on the URL, I am bound to face issues with
  character escaping and the like.
  3. To avoid that I was using a UUID for lookup, but in SolrCloud it
  generates a unique value per replica, which is not acceptable.
  4. Now I see that the mandatory _version_ field has a unique value per
  document, not per replica, so I am exploring the use of _version_ to do a
  lookup only and not necessarily use it as the unique key; is it doable in
  that case?
 
  On Thu, Nov 13, 2014 at 8:58 AM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  Really, I have to ask why you would want to. This is really purely an
  internal
  thing. I don't know what practical value there would be to search on
 this?
 
  Interestingly, I can search _version_:[100 TO *], but specific
 searches
  seem to fail.
 
  I wonder if there's something wonky going on with searching on large
 longs
  here.
 
  Feels like an XY problem to me though.
 
  Best,
  Erick
 
  On Thu, Nov 13, 2014 at 12:45 AM, S.L simpleliving...@gmail.com
 wrote:
   Hi All,
  
   We know that _version_field is a mandatory field in solrcloud
 schema.xml,
   it is expected to be of type long , it also seems to have unique value
  in a
   collection.
  
   However the query of the form
  
 
  http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json
   does not seems to return any record , can we query on the
 _version_field
  in
   the schema.xml ?
  
   Thank you.
 



Re: Can we query on _version_field ?

2014-11-13 Thread S.L
I am not sure if this is a case of an XY problem.

I have no control over the URLs to deduce an id from them; those come from
the web. I made the URL the uniqueKey so that the document gets replaced
when a new document with that URL comes in.

To do the detail lookup I can either use the same docURL as-is, or try to
generate a unique id field for each document.

For the latter option, UUID is not behaving as expected in SolrCloud, and
the _version_ field seems to be serving the need.

On Thu, Nov 13, 2014 at 11:35 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 11/12/2014 10:45 PM, S.L wrote:
  We know that _version_field is a mandatory field in solrcloud schema.xml,
  it is expected to be of type long , it also seems to have unique value
 in a
  collection.
 
  However the query of the form
 
  http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json
  does not seems to return any record , can we query on the _version_field
 in
  the schema.xml ?

 I've been watching your journey unfold on the mailing list.  The whole
 thing seems like an XY problem.

 If I'm reading everything correctly, you want to have a unique ID value
 that can serve as the uniqueKey, as well as a way to quickly look up a
 single document in Solr.

 Is there one part of the URL that serves as a unique identifier that
 doesn't contain special characters?  It seems insane that you would not
 have a unique ID value for every entity in your system that is composed
 of only regular characters.

 Assuming that such an ID exists (and is likely used as one piece of that
 doctorURL that you mentioned) ... if you can extract that ID value into
 its own field (either in your indexing code or a custom update
 processor), you could use that for both uniqueKey and single-document
 lookups.  Having that kind of information in your index seems like a
 generally good idea.

 Thanks,
 Shawn




Re: Can we query on _version_field ?

2014-11-13 Thread S.L
Garth and Erick,

I am now successfully able to auto-generate ids using the UUID
updateRequestProcessorChain, by giving the id the type string.

Thanks for your help folks.

On Thu, Nov 13, 2014 at 1:31 PM, Garth Grimm 
garthgr...@averyranchconsulting.com wrote:

 So it sounds like you’re OK with using the docURL as the unique key for
 routing in SolrCloud, but you don’t want to use it as a lookup mechanism.

 If you don’t want to hash it and use that unique value in a second unique
 field at feed time,
 and you can’t seem to find any other field that might be unique,
 and you don’t want to make your own UpdateRequestProcessorChain that would
 generate a unique field from your unique key (such as by doing an MD5 hash),
 you might look at the UpdateRequestProcessorChain named “dedupe” in the
 OOB solrconfig.xml.  It’s primarily designed to help dedupe results, but
 its technique is to concatenate multiple fields together to create a
 signature that will be unique in some way.  So instead of having to find
 one field in your data that’s unique, you could look for a couple of fields
 that, if combined, would create a unique field, and configure the “dedupe”
 Processor to handle that.
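
 For reference, the stock “dedupe” chain Garth describes looks roughly like
 the example that ships in the OOB solrconfig.xml (the fields list below is
 illustrative and would be replaced with whichever fields are unique in
 combination):

 ```xml
 <updateRequestProcessorChain name="dedupe">
   <processor class="solr.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">id</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">name,features,cat</str>
     <str name="signatureClass">solr.processor.Lookup3Signature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
 ```

 SignatureUpdateProcessorFactory concatenates the listed field values and
 hashes them into signatureField, giving a stable identifier that comes out
 identical on every replica.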


  On Nov 13, 2014, at 12:02 PM, S.L simpleliving...@gmail.com wrote:
 
  I am not sure if this a case of XY problem.
 
  I have no control over the URLs to deduce an id from them , those are
 from
  www, I made the URL the uniqueKey , that way the document gets replaced
  when a new document with that URL comes in .
 
  To do the detail look up I can either use the same docURL as it is , or
  try and generate a unique id filed for each document.
 
  For the later option UUID is not behaving as expected in SolrCloud and
  _version_ field seems to be serving the need .
 
  On Thu, Nov 13, 2014 at 11:35 AM, Shawn Heisey apa...@elyograg.org
 wrote:
 
  On 11/12/2014 10:45 PM, S.L wrote:
  We know that _version_field is a mandatory field in solrcloud
 schema.xml,
  it is expected to be of type long , it also seems to have unique value
  in a
  collection.
 
  However the query of the form
 
 
 http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*fq=%28_version_:148463254894438%29wt=json
  does not seems to return any record , can we query on the
 _version_field
  in
  the schema.xml ?
 
  I've been watching your journey unfold on the mailing list.  The whole
  thing seems like an XY problem.
 
  If I'm reading everything correctly, you want to have a unique ID value
  that can serve as the uniqueKey, as well as a way to quickly look up a
  single document in Solr.
 
  Is there one part of the URL that serves as a unique identifier that
  doesn't contain special characters?  It seems insane that you would not
  have a unique ID value for every entity in your system that is composed
  of only regular characters.
 
  Assuming that such an ID exists (and is likely used as one piece of that
  doctorURL that you mentioned) ... if you can extract that ID value into
  its own field (either in your indexing code or a custom update
  processor), you could use that for both uniqueKey and single-document
  lookups.  Having that kind of information in your index seems like a
  generally good idea.
 
  Thanks,
  Shawn
 
 




Re: Different ids for the same document in different replicas.

2014-11-12 Thread S.L
Thanks.

So the issue here is I already have a <uniqueKey>doctorId</uniqueKey>
defined in my schema.xml.

If along with that I also want the id field to be automatically generated
for each document, do I have to declare it as a uniqueKey as well? I just
tried the following setting without the uniqueKey for id, and it is only
generating blank ids for me.

*schema.xml*

<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false" />

*solrconfig.xml*

  <updateRequestProcessorChain name="uuid">
    <processor class="solr.UUIDUpdateProcessorFactory">
      <str name="fieldName">id</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
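
One thing worth checking with the chain above: defining an
updateRequestProcessorChain does not by itself run it; the chain also has
to be attached to the update handler, or selected per request. A sketch,
assuming the chain name "uuid" from the snippet:

```xml
<!-- solrconfig.xml (sketch): make the "uuid" chain the default for /update -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>
```

Alternatively, the chain can be selected per request with
&update.chain=uuid; if the chain never runs, the id field stays empty,
which would produce exactly the blank ids described above.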


On Tue, Nov 11, 2014 at 7:47 PM, Garth Grimm 
garthgr...@averyranchconsulting.com wrote:

 Looking a little deeper, I did find this about UUIDField


 http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/schema/UUIDField.html

 NOTE: Configuring a UUIDField instance with a default value of "NEW" is
 not advisable for most users when using SolrCloud (and not possible if the
 UUID value is configured as the unique key field) since the result will be
 that each replica of each document will get a unique UUID value. Using
 UUIDUpdateProcessorFactory
 http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html
 to generate UUID values when documents are added is recommended instead.”

 That might describe the behavior you saw.  And the use of
 UUIDUpdateProcessorFactory to auto generate ID’s seems to be covered well
 here:


 http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/

 Though I’ve not actually tried that process before.

 On Nov 11, 2014, at 7:39 PM, Garth Grimm 
 garthgr...@averyranchconsulting.com wrote:

 “uuid” isn’t an out of the box field type that I’m familiar with.

 Generally, I’d stick with the out of the box advice of the schema.xml
 file, which includes things like….

   <!-- Only remove the id field if you have a very good reason to. While
        not strictly required, it is highly recommended. A uniqueKey is
        present in almost all Solr installations. See the uniqueKey
        declaration below where uniqueKey is set to "id".
   -->
   <field name="id" type="string" indexed="true" stored="true"
          required="true" multiValued="false" />

 and…

 <!-- Field to use to determine and enforce document uniqueness.
      Unless this field is marked with required="false", it will be a
      required field
 -->
 <uniqueKey>id</uniqueKey>

 If you’re creating some key/value pair with uuid as the key as you feed
 documents in, and you know that the uuid values you’re creating are unique,
 just change the field name and unique key name from ‘id’ to ‘uuid’.  Or
 change the key name you send in from ‘uuid’ to ‘id’.

 On Nov 11, 2014, at 7:18 PM, S.L simpleliving...@gmail.commailto:
 simpleliving...@gmail.com wrote:

 Hi All,

 I am seeing interesting behavior on the replicas. I have a single
 shard and 6 replicas on SolrCloud 4.10.1, and only a small number of
 documents (~375) that are replicated across the six replicas.

 The interesting thing is that the same document has a different id in
 each one of those replicas.

 This is causing fq=(id:xyz) type queries to fail, depending on
 which replica the query goes to.

 I have specified the id field in the following manner in schema.xml;
 is it the right way to specify an auto-generated id in SolrCloud?

   <field name="id" type="uuid" indexed="true" stored="true"
          required="true" multiValued="false" />


 Thanks.





Re: Different ids for the same document in different replicas.

2014-11-12 Thread S.L
Just tried adding <uniqueKey>id</uniqueKey> while keeping the id type as
"string"; only blank ids are being generated. It looks like the id is
auto-generated only if the id field is of type "uuid", but in the case of
SolrCloud this id will be unique per replica.

Is there a way to generate a unique id in SolrCloud without using the
uuid type, and without ending up with a per-replica unique id?

The uuid field in question is of the following type:

<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
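One way to sidestep per-replica UUID generation entirely is to stop asking Solr to mint the id and instead derive it deterministically on the client, in the SolrJ indexing code, from a natural key of the document. This is only a sketch under the assumption that each document has such a key (the doctorId mentioned later in the thread would qualify); it is not something proposed in the thread itself:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class DeterministicId {

    // Name-based (type 3) UUID: the same natural key always maps to the
    // same UUID, so every replica ends up indexing the same id.
    static String idFor(String naturalKey) {
        return UUID.nameUUIDFromBytes(
                naturalKey.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String a = idFor("doctorId=42");
        String b = idFor("doctorId=42");
        // Deterministic: both calls produce the identical UUID string.
        System.out.println(a.equals(b)); // prints "true"
    }
}
```

The id is then set on the SolrInputDocument before calling add(), so the UUIDUpdateProcessorFactory is no longer needed and all replicas index the same id for the same document.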


On Wed, Nov 12, 2014 at 6:20 PM, S.L simpleliving...@gmail.com wrote:

 Thanks.

 So the issue here is I already have a <uniqueKey>doctorId</uniqueKey>
 defined in my schema.xml.

 If along with that I also want the id field to be automatically
 generated for each document, do I have to declare it as the uniqueKey as
 well? I just tried the following setting without the uniqueKey for
 id, and it only generates blank ids for me.

 *schema.xml*

 <field name="id" type="string" indexed="true" stored="true"
        required="true" multiValued="false" />

 *solrconfig.xml*

   <updateRequestProcessorChain name="uuid">
     <processor class="solr.UUIDUpdateProcessorFactory">
       <str name="fieldName">id</str>
     </processor>
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>


 On Tue, Nov 11, 2014 at 7:47 PM, Garth Grimm 
 garthgr...@averyranchconsulting.com wrote:

 Looking a little deeper, I did find this about UUIDField


 http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/schema/UUIDField.html

 NOTE: Configuring a UUIDField instance with a default value of NEW is
 not advisable for most users when using SolrCloud (and not possible if the
 UUID value is configured as the unique key field) since the result will be
 that each replica of each document will get a unique UUID value. Using
 UUIDUpdateProcessorFactory
 http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html
  to generate UUID values when documents are added is recommended instead.”

 That might describe the behavior you saw.  And the use of
 UUIDUpdateProcessorFactory to auto generate ID’s seems to be covered well
 here:


 http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/

 Though I’ve not actually tried that process before.

 On Nov 11, 2014, at 7:39 PM, Garth Grimm 
 garthgr...@averyranchconsulting.commailto:
 garthgr...@averyranchconsulting.com wrote:

 “uuid” isn’t an out of the box field type that I’m familiar with.

 Generally, I’d stick with the out of the box advice of the schema.xml
 file, which includes things like….

    <!-- Only remove the id field if you have a very good reason to.
         While not strictly required, it is highly recommended. A uniqueKey
         is present in almost all Solr installations. See the uniqueKey
         declaration below where uniqueKey is set to "id".
    -->
    <field name="id" type="string" indexed="true" stored="true"
           required="true" multiValued="false" />

 and…

  <!-- Field to use to determine and enforce document uniqueness.
       Unless this field is marked with required="false", it will be a
       required field
  -->
  <uniqueKey>id</uniqueKey>

 If you’re creating some key/value pair with uuid as the key as you feed
 documents in, and you know that the uuid values you’re creating are unique,
 just change the field name and unique key name from ‘id’ to ‘uuid’.  Or
 change the key name you send in from ‘uuid’ to ‘id’.

 On Nov 11, 2014, at 7:18 PM, S.L simpleliving...@gmail.commailto:
 simpleliving...@gmail.com wrote:

 Hi All,

  I am seeing interesting behavior on the replicas. I have a single
  shard and 6 replicas on SolrCloud 4.10.1, and only a small number of
  documents (~375) that are replicated across the six replicas.

  The interesting thing is that the same document has a different id in
  each one of those replicas.

  This is causing fq=(id:xyz) type queries to fail, depending on
  which replica the query goes to.

  I have specified the id field in the following manner in schema.xml;
  is it the right way to specify an auto-generated id in SolrCloud?

    <field name="id" type="uuid" indexed="true" stored="true"
           required="true" multiValued="false" />


 Thanks.






Can we query on _version_field ?

2014-11-12 Thread S.L
Hi All,

We know that _version_ is a mandatory field in a SolrCloud schema.xml; it
is expected to be of type long, and it also seems to have a unique value
within a collection.

However, a query of the form
http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json
does not seem to return any record. Can we query on the _version_ field
defined in schema.xml?

Thank you.
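For reference, the stock Solr 4.x example schema declares _version_ as an indexed, stored long, and it must be indexed for an fq on it to match at all. Note also that the value changes every time a document is updated, so a previously captured version number may simply be stale by query time. The usual declaration (from the standard example schema, not from this thread):

```
<!-- schema.xml: the standard _version_ declaration -->
<field name="_version_" type="long" indexed="true" stored="true"/>
```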


Different ids for the same document in different replicas.

2014-11-11 Thread S.L
Hi All,

I am seeing interesting behavior on the replicas , I have a single
shard and 6 replicas and on SolrCloud 4.10.1 . I  only have a small
number of documents ~375 that are replicated across the six replicas .

The interesting thing is that the same  document has a different id in
each one of those replicas .

This is causing the fq(id:xyz) type queries to fail, depending on
which replica the query goes to.

I have  specified the id field in the following manner in schema.xml,
is it the right way to specifiy an auto generated id in  SolrCloud ?

field name=id type=uuid indexed=true stored=true
required=true multiValued=false /


Thanks.


Re: Master Slave set up in Solr Cloud

2014-11-02 Thread S.L
Resending this, as I might not have been clear in my earlier query.

I want to use SolrCloud for everything except replication. Is it
possible to set up a master-slave configuration using different Solr
instances and still be able to use the sharding feature provided by
SolrCloud?

On Thu, Oct 30, 2014 at 6:18 PM, S.L simpleliving...@gmail.com wrote:
 Hi All,

 As I previously reported, due to there being no overlap in terms of the
 documents in the SolrCloud replicas of the index shards, I have turned
 off replication and basically have three shards with a replication
 factor of 1.

 This obviously will not be scalable, because the same core will be
 indexed and queried at the same time, as this is a long-running
 indexing task.

 My question is: what options do I have to set up replicas of the single
 per-shard core outside of the SolrCloud replication-factor mechanism,
 since that does not seem to work for me?


 Thanks.



Re: Missing Records

2014-10-30 Thread S.L
I am curious: how many shards do you have, and what replication factor
are you using?

On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke aj.le...@securitylabs.com wrote:

 Hi All,

 We have a SOLR cloud instance that has been humming along nicely for
 months.
 Last week we started experiencing missing records.

 Admin DIH Example:
 Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s)
 A *:* search claims that there are only 903,902; this is the first full
 index.
 Subsequent full indexes give the following counts for the *:* search
 903,805
 903,665
 826,357

 All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0,
 Processed: 903,993 (x/s) every time. ---records per second is variable


 I found an item that should be in the index but is not found in a search.

 Here are the referenced lines of the log file.

 DEBUG - 2014-10-30 15:10:51.160;
 org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE
 add{,id=750041421}
 {{params(debug=falseoptimize=trueindent=truecommit=trueclean=truewt=jsoncommand=full-importentity=adsverbose=false),defaults(config=data-config.xml)}}
 DEBUG - 2014-10-30 15:10:51.160;
 org.apache.solr.update.SolrCmdDistributor; sending update to
 http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0
 add{,id=750041421}
 params:update.distrib=TOLEADERdistrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica1%2F

 --- there are 746 lines of log between entries ---

 DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire;  
 [0x2][0xc3][0xe0]params[0xa2][0xe0].update.distrib(TOLEADER[0xe0],distrib.from?[0x17]
 http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]delByQ[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
 Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2DivisionName_Lower,recreational[0xe0]latlon042.4893,-96.3693[0xe0]*PhotoCount!8[0xe0](HasVideo[0x2][0xe0]ID)750041421[0xe0]Engine
 [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux
 City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162
 Long Track
 [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0xe0]+Description?VThis
 Bad boy will pull you through the deepest snow!With the 162 track and
 1000cc of power you can fly up any
 hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission
 [0xe0]*ModelFacet7Ski-Doo|Summit Highmark[0xe0]/DealerNameFacet9Certified
 Auto,
 Inc.|4150[0xe0])StateAbbrIA[0xe0])ClassName+Snowmobiles[0xe0](DealerID$4150[0xe0]AdCode$DX1Q[0xe0]*DealerName4Certified
 Auto,
 Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorColor+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000
 SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID12[0xe0].FuelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certified
 Auto, Inc.|Sioux City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit
 Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber000105[0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit
 highmark[\n]
 What could be the issue and how does one fix this issue?

 Thanks so much and if more information is needed I have preserved the log
 files.

 AJ



Master Slave set up in Solr Cloud

2014-10-30 Thread S.L
Hi All,

As I previously reported, due to there being no overlap in terms of the
documents in the SolrCloud replicas of the index shards, I have turned
off replication and basically have three shards with a replication
factor of 1.

This obviously will not be scalable, because the same core will be
indexed and queried at the same time, as this is a long-running
indexing task.

My question is: what options do I have to set up replicas of the single
per-shard core outside of the SolrCloud replication-factor mechanism,
since that does not seem to work for me?


Thanks.


Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread S.L
Will,

I think in one of your other emails (which I am not able to find) you had
asked if I was indexing directly from MapReduce jobs. Yes, I am indexing
directly from the map task, and that is done using SolrJ with a
CloudSolrServer initialized with the ZK ensemble URLs. Do I need to use
something like MapReduceIndexerTool, which I suppose writes to HDFS and
is then in a subsequent step moved to the Solr index? If so, why?

I don't use any soft commits, and I do an autocommit every 15 seconds; the
snippet from the configuration can be seen below.

  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
  </autoSoftCommit>

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>

I looked at the localhost_access.log file; all the GET and POST requests
have a sub-second response time.




On Tue, Oct 28, 2014 at 2:06 AM, Will Martin wmartin...@gmail.com wrote:

 The easiest, and coarsest, measure of response time [not service time in a
 distributed system] can be picked up in your localhost_access.log file.
 You're using Tomcat, right? Look up AccessLogValve in the docs and
 server.xml. You can add configuration to report the payload and time to
 service the request without touching any code.
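The AccessLogValve configuration mentioned above can look like this in Tomcat's server.xml; the %D pattern token reports processing time in milliseconds and %B the response payload in bytes. This is standard Tomcat configuration, sketched here as an illustration rather than taken from the thread:

```
<!-- server.xml, inside the <Host> element -->
<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="localhost_access" suffix=".log"
       pattern='%h %t "%r" %s %B %D' />
```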

 Queueing theory is what Otis was talking about when he said you've
 saturated your environment. In AWS people just auto-scale up and don't
 worry about where the load comes from; it's dumb if it happens more than 2
 times. Capacity planning is tough; let's hope it doesn't disappear
 altogether.

 G'luck


 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, October 27, 2014 9:25 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
 out of synch.

 Good point about ZK logs , I do see the following exceptions
 intermittently in the ZK log.

 2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
 client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
 2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
 connection from /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to
 establish new session at /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:00:06,746 [myid:1] - INFO
 [CommitProcessor:1:ZooKeeperServer@617] - Established session
 0x14949db9da40037 with negotiated timeout 1 for client
 /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
 EndOfStreamException: Unable to read additional data from client sessionid
 0x14949db9da40037, likely client has closed socket
 at
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
 at

 org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
 at java.lang.Thread.run(Thread.java:744)

 For queuing theory, I don't know of any way to see how fast the requests
 are being served by SolrCloud, or whether a queue builds up when the
 service rate is slower than the rate of requests from the incoming
 multiple threads.

 On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote:

  2 naïve comments, of course.
 
 
 
  -  Queuing theory
 
  -  Zookeeper logs.
 
 
 
  From: S.L [mailto:simpleliving...@gmail.com]
  Sent: Monday, October 27, 2014 1:42 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
  replicas out of synch.
 
 
 
  Please find the clusterstate.json attached.
 
  Also in this case atleast the Shard1 replicas are out of sync , as can
  be seen below.
 
  Shard 1 replica 1 *does not* return a result with distrib=false.
 
  Query:
  http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true
 
 
 
  Result :
 
   <response>
     <lst name="responseHeader">
       <int name="status">0</int>
       <int name="QTime">1</int>
       <lst name="params">
         <str name="q">*:*</str>
         <str name="shards.info">true</str>
         <str name="distrib">false</str>
         <str name="debug">track</str>
         <str name="wt">xml</str>
         <str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str>
       </lst>
     </lst>
     <result name="response" numFound="0" start="0"/>
     <lst name="debug"/>
   </response>
 
 
 
  Shard1 replica 2 *does* return the result with distrib=false.
 
  Query:
  http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt
  28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29wt

Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread S.L
I'm using Apache Hadoop and Solr; do I need to switch to Cloudera?

On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 We index directly from mappers using SolrJ. It does work, but you pay the
 price of having to instantiate all those sockets vs. the way
 MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer
 directly in the Reduce task.

 You don't *need* to use MapReduceIndexerTool, but it's more efficient, and
 if you don't, you then have to make sure to appropriately tune your Hadoop
 implementation to match what your Solr installation is capable of.

 On 10/28/14 12:39, S.L wrote:

 Will,

  I think in one of your other emails (which I am not able to find) you had
  asked if I was indexing directly from MapReduce jobs. Yes, I am indexing
  directly from the map task, and that is done using SolrJ with a
  CloudSolrServer initialized with the ZK ensemble URLs. Do I need to use
  something like MapReduceIndexerTool, which I suppose writes to HDFS and
  is then in a subsequent step moved to the Solr index? If so, why?

  I don't use any soft commits, and I do an autocommit every 15 seconds; the
  snippet from the configuration can be seen below.

    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    </autoSoftCommit>

    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>

  I looked at the localhost_access.log file; all the GET and POST requests
  have a sub-second response time.




 On Tue, Oct 28, 2014 at 2:06 AM, Will Martin wmartin...@gmail.com
 wrote:

   The easiest, and coarsest, measure of response time [not service time in a
  distributed system] can be picked up in your localhost_access.log file.
  You're using Tomcat, right? Look up AccessLogValve in the docs and
  server.xml. You can add configuration to report the payload and time to
  service the request without touching any code.

  Queueing theory is what Otis was talking about when he said you've
  saturated your environment. In AWS people just auto-scale up and don't
  worry about where the load comes from; it's dumb if it happens more than 2
  times. Capacity planning is tough; let's hope it doesn't disappear
  altogether.

 G'luck


 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, October 27, 2014 9:25 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
 out of synch.

 Good point about ZK logs , I do see the following exceptions
 intermittently in the ZK log.

 2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
 client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
 2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
 connection from /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to
 establish new session at /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:00:06,746 [myid:1] - INFO
 [CommitProcessor:1:ZooKeeperServer@617] - Established session
 0x14949db9da40037 with negotiated timeout 1 for client
 /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
 EndOfStreamException: Unable to read additional data from client
 sessionid
 0x14949db9da40037, likely client has closed socket
  at
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
  at

 org.apache.zookeeper.server.NIOServerCnxnFactory.run(
 NIOServerCnxnFactory.java:208)
  at java.lang.Thread.run(Thread.java:744)

  For queuing theory, I don't know of any way to see how fast the requests
  are being served by SolrCloud, or whether a queue builds up when the
  service rate is slower than the rate of requests from the incoming
  multiple threads.

 On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com
 wrote:

  2 naïve comments, of course.



 -  Queuing theory

 -  Zookeeper logs.



 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, October 27, 2014 1:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
 replicas out of synch.



 Please find the clusterstate.json attached.

 Also in this case atleast the Shard1 replicas are out of sync , as can
 be seen below.

 Shard 1 replica 1 *does not* return a result with distrib=false.

  Query:
  http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib

Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread S.L
Yeah, I get that not using the MapReduceIndexerTool could be more resource
intensive, but the way this issue is manifesting, resulting in disjoint
SolrCloud replicas, perplexes me.

While you were tuning your SolrCloud environment to cater to the Hadoop
indexing requirements, did you ever face the issue of disjoint replicas?

Is the MapReduceIndexerTool specific to the Cloudera distro? I am using
Apache Solr and Hadoop.

Thanks



On Tue, Oct 28, 2014 at 1:27 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 We index directly from mappers using SolrJ. It does work, but you pay the
 price of having to instantiate all those sockets vs. the way
 MapReduceIndexerTool works, where you're writing to an EmbeddedSolrServer
 directly in the Reduce task.

 You don't *need* to use MapReduceIndexerTool, but it's more efficient, and
 if you don't, you then have to make sure to appropriately tune your Hadoop
 implementation to match what your Solr installation is capable of.

 On 10/28/14 12:39, S.L wrote:

 Will,

  I think in one of your other emails (which I am not able to find) you had
  asked if I was indexing directly from MapReduce jobs. Yes, I am indexing
  directly from the map task, and that is done using SolrJ with a
  CloudSolrServer initialized with the ZK ensemble URLs. Do I need to use
  something like MapReduceIndexerTool, which I suppose writes to HDFS and
  is then in a subsequent step moved to the Solr index? If so, why?

  I don't use any soft commits, and I do an autocommit every 15 seconds; the
  snippet from the configuration can be seen below.

    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    </autoSoftCommit>

    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>

  I looked at the localhost_access.log file; all the GET and POST requests
  have a sub-second response time.




 On Tue, Oct 28, 2014 at 2:06 AM, Will Martin wmartin...@gmail.com
 wrote:

   The easiest, and coarsest, measure of response time [not service time in a
  distributed system] can be picked up in your localhost_access.log file.
  You're using Tomcat, right? Look up AccessLogValve in the docs and
  server.xml. You can add configuration to report the payload and time to
  service the request without touching any code.

  Queueing theory is what Otis was talking about when he said you've
  saturated your environment. In AWS people just auto-scale up and don't
  worry about where the load comes from; it's dumb if it happens more than 2
  times. Capacity planning is tough; let's hope it doesn't disappear
  altogether.

 G'luck


 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, October 27, 2014 9:25 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
 out of synch.

 Good point about ZK logs , I do see the following exceptions
 intermittently in the ZK log.

 2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
 client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
 2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
 connection from /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to
 establish new session at /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:00:06,746 [myid:1] - INFO
 [CommitProcessor:1:ZooKeeperServer@617] - Established session
 0x14949db9da40037 with negotiated timeout 1 for client
 /xxx.xxx.xxx.xxx:37336
 2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
 EndOfStreamException: Unable to read additional data from client
 sessionid
 0x14949db9da40037, likely client has closed socket
  at
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
  at

 org.apache.zookeeper.server.NIOServerCnxnFactory.run(
 NIOServerCnxnFactory.java:208)
  at java.lang.Thread.run(Thread.java:744)

  For queuing theory, I don't know of any way to see how fast the requests
  are being served by SolrCloud, or whether a queue builds up when the
  service rate is slower than the rate of requests from the incoming
  multiple threads.

 On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com
 wrote:

  2 naïve comments, of course.



 -  Queuing theory

 -  Zookeeper logs.



 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, October 27, 2014 1:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
 replicas out of synch.



 Please find the clusterstate.json attached.

 Also in this case atleast the Shard1 replicas are out of sync , as can
 be seen below.

 Shard

Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
Thanks Otis,

I have checked the logs, in my case the default catalina.out, and I don't
see any OOMs or any other exceptions.

What other metrics do you suggest?

On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 You may simply be overwhelming your cluster-nodes. Have you checked
 various metrics to see if that is the case?

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr  Elasticsearch Support * http://sematext.com/



  On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote:
 
  Folks,
 
   I have posted previously about this. I am using SolrCloud 4.10.1 and
  have a sharded collection with 6 nodes, 3 shards, and a replication
  factor of 2.
  
   I am indexing Solr using a Hadoop job. I have 15 map fetch tasks that
   can each have up to 5 threads each, so the load on the indexing side
  can get as high as 75 concurrent threads.
  
   I am facing an issue where the replicas of a particular shard(s) are
   consistently getting out of synch. Initially I thought this was
  because I was using a custom component, but I did a fresh install,
   removed the custom component, and reindexed using the Hadoop job; I
  still see the same behavior.
  
   I do not see any exceptions in my catalina.out, like OOM or any other
   exceptions. I suspect this could be because of the multi-threaded
   indexing nature of the Hadoop job. I use CloudSolrServer from my Java
  code to index, and initialize the CloudSolrServer using a 3-node ZK
  ensemble.
  
   Does anyone know of any known issues with highly multi-threaded
  indexing and SolrCloud?
  
   Can someone help? This issue has been slowing things down on my end
  for a while now.
  
   Thanks and much appreciated!



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
Markus,

I would like to ignore it too, but what's happening is that there is a
lot of discrepancy between the replicas; queries like
q=*:*&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on which
replica the request goes to.

Thank you for confirming that it is a known issue; I was thinking I was the
only one facing this due to my setup.



On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma markus.jel...@openindex.io
wrote:

 It is an ancient issue. One of the major contributors to the issue was
 resolved some versions ago but we are still seeing it sometimes too, there
 is nothing to see in the logs. We ignore it and just reindex.

 -Original message-
  From:S.L simpleliving...@gmail.com
  Sent: Monday 27th October 2014 16:25
  To: solr-user@lucene.apache.org
  Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
 out of synch.
 
  Thanks Otis,

  I have checked the logs, in my case the default catalina.out, and I don't
  see any OOMs or any other exceptions.

  What other metrics do you suggest?
 
  On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
   Hi,
  
   You may simply be overwhelming your cluster-nodes. Have you checked
   various metrics to see if that is the case?
  
   Otis
   --
   Monitoring * Alerting * Anomaly Detection * Centralized Log Management
   Solr  Elasticsearch Support * http://sematext.com/
  
  
  
On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote:
   
Folks,
   
 I have posted previously about this. I am using SolrCloud 4.10.1 and
 have a sharded collection with 6 nodes, 3 shards, and a replication
 factor of 2.
    
 I am indexing Solr using a Hadoop job. I have 15 map fetch tasks that
 can each have up to 5 threads each, so the load on the indexing side
 can get as high as 75 concurrent threads.
    
 I am facing an issue where the replicas of a particular shard(s) are
 consistently getting out of synch. Initially I thought this was
 because I was using a custom component, but I did a fresh install,
 removed the custom component, and reindexed using the Hadoop job; I
 still see the same behavior.
    
 I do not see any exceptions in my catalina.out, like OOM or any other
 exceptions. I suspect this could be because of the multi-threaded
 indexing nature of the Hadoop job. I use CloudSolrServer from my Java
 code to index, and initialize the CloudSolrServer using a 3-node ZK
 ensemble.
    
 Does anyone know of any known issues with highly multi-threaded
 indexing and SolrCloud?
    
 Can someone help? This issue has been slowing things down on my end
 for a while now.
    
 Thanks and much appreciated!
  
 



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
One is not smaller than the other: numDocs is the same for both
replicas, and essentially they seem to be disjoint sets.

Also, manually purging the replicas is not an option, because this is a
frequently indexed index and we need everything to be automated.

What other options do I have now?

1. Turn off replication completely in SolrCloud.
2. Use the traditional master-slave replication model.
3. Introduce a replica-aware field in the index, to figure out which
replica the request should go to from the client.
4. Try a distribution like Helios to see if it has any different behavior.

Just thinking out loud here...

On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi - if there is a very large discrepancy, you could consider to purge the
 smallest replica, it will then resync from the leader.


 -Original message-
  From:S.L simpleliving...@gmail.com
  Sent: Monday 27th October 2014 16:41
  To: solr-user@lucene.apache.org
  Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
 out of synch.
 
  Markus,
 
  I would like to ignore it too, but whats happening is that the there is a
  lot of discrepancy between the replicas , queries like
  q=*:*fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on
 which
  replica the request goes to, because of huge amount of discrepancy
 between
  the replicas.
 
  Thank you for confirming that it is a know issue , I was thinking I was
 the
  only one facing this due to my set up.
 
  On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma 
 markus.jel...@openindex.io
  wrote:
 
   It is an ancient issue. One of the major contributors to the issue was
   resolved some versions ago but we are still seeing it sometimes too,
 there
   is nothing to see in the logs. We ignore it and just reindex.
  
   -Original message-
From:S.L simpleliving...@gmail.com
Sent: Monday 27th October 2014 16:25
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
 replicas
   out of synch.
   
Thank Otis,
   
I have checked the logs , in my case the default catalina.out and I
 dont
see any OOMs or , any other exceptions.
   
What others metrics do you suggest ?
   
On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:
   
 Hi,

 You may simply be overwhelming your cluster-nodes. Have you checked
 various metrics to see if that is the case?

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log
 Management
  Solr & Elasticsearch Support * http://sematext.com/



  On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com
 wrote:
 
  Folks,
 
  I have posted previously about this , I am using SolrCloud
 4.10.1 and
 have
  a sharded collection with  6 nodes , 3 shards and a replication
   factor
 of 2.
 
  I am indexing Solr using a Hadoop job , I have 15 Map fetch
 tasks ,
   that
  can each have upto 5 threds each , so the load on the indexing
 side
   can
 get
  to as high as 75 concurrent threads.
 
  I am facing an issue where the replicas of a particular shard(s)
 are
  consistently getting out of synch , initially I thought this was
 beccause I
  was using a custom component , but I did a fresh install and
 removed
   the
  custom component and reindexed using the Hadoop job , I still
 see the
 same
  behavior.
 
  I do not see any exceptions in my catalina.out , like OOM , or
 any
   other
  excepitions, I suspecting thi scould be because of the
 multi-threaded
  indexing nature of the Hadoop job . I use CloudSolrServer from my
   java
 code
  to index and initialize the CloudSolrServer using a 3 node ZK
   ensemble.
 
  Does any one know of any known issues with a highly
 multi-threaded
 indexing
  and SolrCloud ?
 
  Can someone help ? This issue has been slowing things down on my
 end
   for
 a
  while now.
 
  Thanks and much appreciated!

   
  



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
Please find the clusterstate.json attached.

Also, in this case *at least* the Shard1 replicas are out of sync, as can be
seen below.


*Shard 1 replica 1 *does not* return a result with distrib=false.*
*Query :*
http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

*Result :*

<response><lst name="responseHeader"><int name="status">0</int><int
name="QTime">1</int><lst name="params"><str name="q">*:*</str><str
name="shards.info">true</str><str name="distrib">false</str><str
name="debug">track</str><str name="wt">xml</str><str
name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result
name="response" numFound="0" start="0"/><lst name="debug"/></response>


*Shard1 replica 2 *does* return the result with distrib=false.*
*Query:*
http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

*Result:*

<response><lst name="responseHeader"><int name="status">0</int><int
name="QTime">1</int><lst name="params"><str name="q">*:*</str><str
name="shards.info">true</str><str name="distrib">false</str><str
name="debug">track</str><str name="wt">xml</str><str
name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result
name="response" numFound="1" start="0"><doc><str
name="thingURL">http://www.xyz.com</str><str
name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long
name="_version_">1483135330558148608</long></doc></result><lst
name="debug"/></response>
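Spot checks like the two per-replica queries above can be scripted. A minimal sketch (the helper names are hypothetical; it assumes the JSON response writer and an id count that fits one `rows` window):

```python
import json
import urllib.request

def fetch_ids(replica_url, rows=1000000):
    """Fetch all document ids from one core, using distrib=false so only
    that core's local index is consulted (no distributed fan-out)."""
    query = ("/select?q=*:*&fl=id&sort=id+asc&wt=json"
             "&distrib=false&rows=%d" % rows)
    with urllib.request.urlopen(replica_url + query) as resp:
        body = json.load(resp)
    return [doc["id"] for doc in body["response"]["docs"]]

def diff_ids(ids_a, ids_b):
    """Return the ids present on exactly one of the two replicas;
    two in-sync replicas should yield two empty lists."""
    a, b = set(ids_a), set(ids_b)
    return sorted(a - b), sorted(b - a)
```

Calling `diff_ids(fetch_ids(replica1_core_url), fetch_ids(replica2_core_url))` against the two cores of one shard would show exactly which documents each replica is missing, rather than probing single ids by hand.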

On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Mon, Oct 27, 2014 at 9:40 PM, S.L simpleliving...@gmail.com wrote:

  One is not smaller than the other, because the numDocs is same for both
  replicas and essentially they seem to be disjoint sets.
 

 That is strange. Can we see your clusterstate.json? With that, please also
 specify the two replicas which are out of sync.

 
  Also manually purging the replicas is not option , because this is
  frequently indexed index and we need everything to be automated.
 
  What other options do I have now.
 
  1. Turn of the replication completely in SolrCloud
  2. Use traditional Master Slave replication model.
  3. Introduce a replica aware field in the index , to figure out which
  replica the request should go to from the client.
  4. Try a distribution like Helios to see if it has any different
 behavior.
 
  Just think out loud here ..
 
  On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma 
  markus.jel...@openindex.io
  wrote:
 
   Hi - if there is a very large discrepancy, you could consider to purge
  the
   smallest replica, it will then resync from the leader.
  
  
   -Original message-
From:S.L simpleliving...@gmail.com
Sent: Monday 27th October 2014 16:41
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
  replicas
   out of synch.
   
Markus,
   
I would like to ignore it too, but whats happening is that the there
  is a
lot of discrepancy between the replicas , queries like
q=*:*fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on
   which
replica the request goes to, because of huge amount of discrepancy
   between
the replicas.
   
Thank you for confirming that it is a know issue , I was thinking I
 was
   the
only one facing this due to my set up.
   
On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma 
   markus.jel...@openindex.io
wrote:
   
 It is an ancient issue. One of the major contributors to the issue
  was
 resolved some versions ago but we are still seeing it sometimes
 too,
   there
 is nothing to see in the logs. We ignore it and just reindex.

 -Original message-
  From:S.L simpleliving...@gmail.com
  Sent: Monday 27th October 2014 16:25
  To: solr-user@lucene.apache.org
  Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
   replicas
 out of synch.
 
  Thank Otis,
 
  I have checked the logs , in my case the default catalina.out
 and I
   dont
  see any OOMs or , any other exceptions.
 
  What others metrics do you suggest ?
 
  On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
   Hi,
  
   You may simply be overwhelming your cluster-nodes. Have you
  checked
   various metrics to see if that is the case?
  
   Otis
   --
   Monitoring * Alerting * Anomaly Detection * Centralized Log
   Management
    Solr & Elasticsearch Support * http://sematext.com/
  
  
  
On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com
   wrote:
   
Folks,
   
I have posted previously about this , I am using SolrCloud
   4.10.1 and
   have
a sharded collection with  6 nodes , 3 shards and a
 replication
 factor
   of 2.
   
I am indexing Solr using a Hadoop job , I have 15 Map fetch
   tasks ,
 that
can each have upto 5

Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
Good point about the ZK logs; I do see the following exceptions
intermittently in the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection
from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish
new session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO
[CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14949db9da40037 with negotiated timeout 1 for client
/xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x14949db9da40037, likely client has closed socket
at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:744)

As for queuing theory, I don't know of any way to see how fast the requests
are being served by SolrCloud, or whether a queue builds up when the
service rate is slower than the arrival rate of requests from the incoming
threads.
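Will's queuing-theory pointer can at least be sanity-checked with back-of-the-envelope arithmetic: if the offered load (arrival rate over total service rate) reaches 1.0 or more, updates queue up faster than the cluster drains them. A toy sketch with entirely hypothetical rates:

```python
def offered_load(arrival_rate, service_rate_per_node, nodes):
    """Utilization rho in an M/M/c-style model: rho >= 1.0 means the
    request queue grows without bound."""
    return arrival_rate / (service_rate_per_node * nodes)

# 75 indexing threads each sending ~2 docs/sec against 6 Solr nodes,
# each absorbing ~20 docs/sec -- all of these numbers are made up.
rho = offered_load(arrival_rate=75 * 2, service_rate_per_node=20, nodes=6)
print(rho)  # 1.25: at these rates the cluster would be oversubscribed
```

Measuring the real per-node service rate (e.g. from request-handler QTime stats) and plugging it in would show whether the 75-thread Hadoop job simply outruns the cluster.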

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote:

 2 naïve comments, of course.



 -  Queuing theory

 -  Zookeeper logs.



 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, October 27, 2014 1:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
 out of synch.



 Please find the clusterstate.json attached.

 Also in this case atleast the Shard1 replicas are out of sync , as can be
 seen below.

 Shard 1 replica 1 *does not* return a result with distrib=false.

 Query :
 http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true



 Result :

 responselst name=responseHeaderint name=status0/intint
 name=QTime1/intlst name=paramsstr name=q*:*/strstr name=
 shards.infotrue/strstr name=distribfalse/strstr
 name=debugtrack/strstr name=wtxml/strstr
 name=fq(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)/str/lst/lstresult
 name=response numFound=0 start=0/lst name=debug//response



 Shard1 replica 2 *does* return the result with distrib=false.

 Query:
 http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

 Result:

 responselst name=responseHeaderint name=status0/intint
 name=QTime1/intlst name=paramsstr name=q*:*/strstr name=
 shards.infotrue/strstr name=distribfalse/strstr
 name=debugtrack/strstr name=wtxml/strstr
 name=fq(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)/str/lst/lstresult
 name=response numFound=1 start=0docstr name=thingURL
 http://www.xyz.com/strstr
 name=id9f4748c0-fe16-4632-b74e-4fee6b80cbf5/strlong
 name=_version_1483135330558148608/long/doc/resultlst
 name=debug//response



 On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 On Mon, Oct 27, 2014 at 9:40 PM, S.L simpleliving...@gmail.com wrote:

  One is not smaller than the other, because the numDocs is same for both
  replicas and essentially they seem to be disjoint sets.
 

 That is strange. Can we see your clusterstate.json? With that, please also
 specify the two replicas which are out of sync.

 
  Also manually purging the replicas is not option , because this is
  frequently indexed index and we need everything to be automated.
 
  What other options do I have now.
 
  1. Turn of the replication completely in SolrCloud
  2. Use traditional Master Slave replication model.
  3. Introduce a replica aware field in the index , to figure out which
  replica the request should go to from the client.
  4. Try a distribution like Helios to see if it has any different
 behavior.
 
  Just think out loud here ..
 
  On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma 
  markus.jel...@openindex.io
  wrote:
 
   Hi - if there is a very large discrepancy, you could consider to purge
  the
   smallest replica, it will then resync from the leader.
  
  
   -Original message-
From:S.L simpleliving...@gmail.com
Sent: Monday 27th October 2014 16:41

Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-26 Thread S.L
Folks,

I have posted about this previously; I am using SolrCloud 4.10.1 and have
a sharded collection with 6 nodes, 3 shards, and a replication factor of 2.

I am indexing Solr using a Hadoop job with 15 Map fetch tasks that can
each have up to 5 threads, so the load on the indexing side can reach as
high as 75 concurrent threads.

I am facing an issue where the replicas of a particular shard(s) are
consistently getting out of sync. Initially I thought this was because I
was using a custom component, but I did a fresh install, removed the
custom component, and reindexed using the Hadoop job; I still see the
same behavior.

I do not see any exceptions in my catalina.out, like OOMs or anything
else; I suspect this could be because of the multi-threaded indexing
nature of the Hadoop job. I use CloudSolrServer from my Java code to
index, and initialize the CloudSolrServer with a 3-node ZK ensemble.

Does anyone know of any known issues with highly multi-threaded indexing
and SolrCloud?

Can someone help? This issue has been slowing things down on my end for a
while now.

Thanks and much appreciated!
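One mitigation for the load described above is to cap the number of in-flight update requests regardless of how many Hadoop threads produce work. A sketch of the idea (the `send_batch` callable is a stand-in for the real CloudSolrServer/HTTP call, which is not modeled here):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def index_bounded(batches, send_batch, max_in_flight=8):
    """Send every batch via send_batch while allowing at most
    max_in_flight concurrent calls; returns the peak concurrency
    actually observed, useful for verifying the cap holds."""
    gate = threading.Semaphore(max_in_flight)
    lock = threading.Lock()
    state = {"current": 0, "peak": 0}

    def worker(batch):
        with gate:  # blocks while max_in_flight calls are in progress
            with lock:
                state["current"] += 1
                state["peak"] = max(state["peak"], state["current"])
            try:
                send_batch(batch)
            finally:
                with lock:
                    state["current"] -= 1

    # Many producer threads, but the semaphore still bounds the load
    # actually hitting the search cluster.
    with ThreadPoolExecutor(max_workers=75) as pool:
        list(pool.map(worker, batches))
    return state["peak"]
```

The same pattern applies in the Java indexer: keep the 75 fetch threads, but funnel their Solr updates through a `Semaphore`-guarded section (or a smaller dedicated executor) so the cluster sees a bounded request rate.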


Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-23 Thread S.L
 new directory for
/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463
471654573 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  –
Starting download to
NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463
lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463;
maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
471834454 [zkCallback-2-thread-12] INFO
org.apache.solr.common.cloud.ZkStateReader  – A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 6)
471897454 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  –
Total time taken for download : 243 secs
471898551 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  – New
index installed. Updating index properties... index=index.2014101839463
471898932 [RecoveryThread] INFO  org.apache.solr.handler.SnapPuller  –
removing old index directory
NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index
lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
471898932 [RecoveryThread] INFO
org.apache.solr.update.DefaultSolrCoreState  – Creating new IndexWriter...
471898934 [RecoveryThread] INFO
org.apache.solr.update.DefaultSolrCoreState  – Waiting until IndexWriter is
unused... core=dyCollection1_shard2_replica1
471898934 [RecoveryThread] INFO
org.apache.solr.update.DefaultSolrCoreState  – Rollback old IndexWriter...
core=dyCollection1_shard2_replica1
471904192 [RecoveryThread] INFO  org.apache.solr.core.SolrCore  – New index
directory detected:
old=/opt/solr/home1/dyCollection1_shard2_replica1/data/index/
new=/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463
471904907 [RecoveryThread] INFO  org.apache.solr.core.SolrCore  –
SolrDeletionPolicy.onInit: commits: num=1

commit{dir=NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463
lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.2014101839463;
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_88t,generation=10685}
471904907 [RecoveryThread] INFO  org.apache.solr.core.SolrCore  – newest
commit generation = 10685

On Fri, Oct 17, 2014 at 1:12 PM, S.L simpleliving...@gmail.com wrote:

 Shawn,

 Just wondering if you have any other suggestions on what the next steps
 whould be ? Thanks.

 On Thu, Oct 16, 2014 at 11:12 PM, S.L simpleliving...@gmail.com wrote:

 Shawn ,


1. I will upgrade to 67 JVM  shortly .
2. This is  a new collection as , I was facing a similar issue in 4.7
and based on Erick's recommendation I updated to 4.10.1 and created a new
collection.
3. Yes, I am hitting the replicas of the same shard and I see the
lists are completely non overlapping.I am using CloudSolrServer to add the
documents.
4. I have a 3 physical node cluster , with each having 16GB in memory.
5. I also have a custom request handler defined in my solrconfig.xml
as below , however I am not using that and I am only using the default
select handler, but my MyCustomHandler class has been been added to the
source and included in the build , but not being used for any requests 
 yet.

   requestHandler name=/mycustomselect class=solr.MyCustomHandler
 startup=lazy
 lst name=defaults
   str name=dfsuggestAggregate/str

   str name=spellcheck.dictionarydirect/str
   !--str name=spellcheck.dictionarywordbreak/str--
   str name=spellcheckon/str
   str name=spellcheck.extendedResultstrue/str
   str name=spellcheck.count10/str
   str name=spellcheck.alternativeTermCount5/str
   str name=spellcheck.maxResultsForSuggest5/str
   str name=spellcheck.collatetrue/str
   str name=spellcheck.collateExtendedResultstrue/str
   str name=spellcheck.maxCollationTries10/str
   str name=spellcheck.maxCollations5/str
 /lst
 arr name=last-components
   strspellcheck/str
 /arr
   /requestHandler


 5. The clusterstate.json is copied below

 {dyCollection1:{
 shards:{
   shard1:{
 range:8000-d554,
 state:active,
 replicas:{
   core_node3:{
 state:active,
 core:dyCollection1_shard1_replica1,
 node_name:server3.mydomain.com:8082_solr,
 base_url:http://server3.mydomain.com:8082/solr},
   core_node4:{
 state:active,
 core:dyCollection1_shard1_replica2,
 node_name:server2.mydomain.com:8081_solr,
 base_url:http://server2.mydomain.com:8081/solr;,
 leader:true}}},
   shard2:{
 range:d555-2aa9,
 state:active,
 replicas:{
   core_node1

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-17 Thread S.L
Shawn,

Just wondering if you have any other suggestions on what the next steps
should be? Thanks.

On Thu, Oct 16, 2014 at 11:12 PM, S.L simpleliving...@gmail.com wrote:

 Shawn ,


1. I will upgrade to 67 JVM  shortly .
2. This is  a new collection as , I was facing a similar issue in 4.7
and based on Erick's recommendation I updated to 4.10.1 and created a new
collection.
3. Yes, I am hitting the replicas of the same shard and I see the
lists are completely non overlapping.I am using CloudSolrServer to add the
documents.
4. I have a 3 physical node cluster , with each having 16GB in memory.
5. I also have a custom request handler defined in my solrconfig.xml
as below , however I am not using that and I am only using the default
select handler, but my MyCustomHandler class has been been added to the
source and included in the build , but not being used for any requests yet.

   requestHandler name=/mycustomselect class=solr.MyCustomHandler
 startup=lazy
 lst name=defaults
   str name=dfsuggestAggregate/str

   str name=spellcheck.dictionarydirect/str
   !--str name=spellcheck.dictionarywordbreak/str--
   str name=spellcheckon/str
   str name=spellcheck.extendedResultstrue/str
   str name=spellcheck.count10/str
   str name=spellcheck.alternativeTermCount5/str
   str name=spellcheck.maxResultsForSuggest5/str
   str name=spellcheck.collatetrue/str
   str name=spellcheck.collateExtendedResultstrue/str
   str name=spellcheck.maxCollationTries10/str
   str name=spellcheck.maxCollations5/str
 /lst
 arr name=last-components
   strspellcheck/str
 /arr
   /requestHandler


 5. The clusterstate.json is copied below

 {dyCollection1:{
 shards:{
   shard1:{
 range:8000-d554,
 state:active,
 replicas:{
   core_node3:{
 state:active,
 core:dyCollection1_shard1_replica1,
 node_name:server3.mydomain.com:8082_solr,
 base_url:http://server3.mydomain.com:8082/solr},
   core_node4:{
 state:active,
 core:dyCollection1_shard1_replica2,
 node_name:server2.mydomain.com:8081_solr,
 base_url:http://server2.mydomain.com:8081/solr;,
 leader:true}}},
   shard2:{
 range:d555-2aa9,
 state:active,
 replicas:{
   core_node1:{
 state:active,
 core:dyCollection1_shard2_replica1,
 node_name:server1.mydomain.com:8081_solr,
 base_url:http://server1.mydomain.com:8081/solr;,
 leader:true},
   core_node6:{
 state:active,
 core:dyCollection1_shard2_replica2,
 node_name:server3.mydomain.com:8081_solr,
 base_url:http://server3.mydomain.com:8081/solr}}},
   shard3:{
 range:2aaa-7fff,
 state:active,
 replicas:{
   core_node2:{
 state:active,
 core:dyCollection1_shard3_replica2,
 node_name:server1.mydomain.com:8082_solr,
 base_url:http://server1.mydomain.com:8082/solr;,
 leader:true},
   core_node5:{
 state:active,
 core:dyCollection1_shard3_replica1,
 node_name:server2.mydomain.com:8082_solr,
 base_url:http://server2.mydomain.com:8082/solr,
 maxShardsPerNode:1,
 router:{name:compositeId},
 replicationFactor:2,
 autoAddReplicas:false}}

   Thanks!

 On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/16/2014 6:27 PM, S.L wrote:

 1. Java Version :java version 1.7.0_51
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


 I believe that build 51 is one of those that is known to have bugs
 related to Lucene.  If you can upgrade this to 67, that would be good, but
 I don't know that it's a pressing matter.  It looks like the Oracle JVM,
 which is good.

  2.OS
 CentOS Linux release 7.0.1406 (Core)

 3. Everything is 64 bit , OS , Java , and CPU.

 4. Java Args.
  -Djava.io.tmpdir=/opt/tomcat1/temp
  -Dcatalina.home=/opt/tomcat1
  -Dcatalina.base=/opt/tomcat1
  -Djava.endorsed.dirs=/opt/tomcat1/endorsed
  -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,
 server3.mydomain.com:2181
  -DzkClientTimeout=2
  -DhostContext=solr
  -Dport=8081
  -Dhost=server1.mydomain.com
  -Dsolr.solr.home=/opt/solr/home1
  -Dfile.encoding=UTF8
  -Duser.timezone=UTC
  -XX:+UseG1GC
  -XX:MaxPermSize=128m
  -XX:PermSize=64m
  -Xmx2048m
  -Xms128m
  -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
  -Djava.util.logging.config.file=/opt/tomcat1/conf/
 logging.properties


 I would not use the G1 collector myself, but with the heap

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-16 Thread S.L
Shawn,

Please find the answers to your questions.

1. Java Version :java version 1.7.0_51
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

2.OS
CentOS Linux release 7.0.1406 (Core)

3. Everything is 64 bit , OS , Java , and CPU.

4. Java Args.
-Djava.io.tmpdir=/opt/tomcat1/temp
-Dcatalina.home=/opt/tomcat1
-Dcatalina.base=/opt/tomcat1
-Djava.endorsed.dirs=/opt/tomcat1/endorsed
-DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,
server3.mydomain.com:2181
-DzkClientTimeout=2
-DhostContext=solr
-Dport=8081
-Dhost=server1.mydomain.com
-Dsolr.solr.home=/opt/solr/home1
-Dfile.encoding=UTF8
-Duser.timezone=UTC
-XX:+UseG1GC
-XX:MaxPermSize=128m
-XX:PermSize=64m
-Xmx2048m
-Xms128m
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties

5. Zookeeper ensemble has 3 zookeeper instances , which are external and
are not embedded.


6. Container : I am using Tomcat Apache Tomcat Version 7.0.42

*Additional Observations:*

I queried all docs on both replicas with distrib=false&fl=id&sort=id+asc
and then compared the two lists. Eyeballing the first few lines of ids in
each list, I can say that even though both lists have an equal number of
documents (96309 each), the document ids in them seem to be *mutually
exclusive*: I did not find even a single common id in those lists (I tried
at least 15 manually). It looks to me like the replicas are disjoint sets.

Thanks.



On Thu, Oct 16, 2014 at 1:41 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/15/2014 10:24 PM, S.L wrote:

 Yes , I tried those two queries with distrib=false , I get 0 results for
 first and 1 result  for the second query( (i.e. server 3 shard 2 replica
 2)  consistently.

 However if I run the same second query (i.e. server 3 shard 2 replica 2)
 with distrib=true, I sometimes get a result and sometimes not , should'nt
 this query always return a result when its pointing to a core that seems
 to
 have that document regardless of distrib=true or false ?

 Unfortunately I dont see anything particular in the logs to point to any
 information.

 BTW you asked me to replace the request handler , I use the select request
 handler ,so I cannot replace it with anything else , is that  a problem ?


 If you send the query with distrib=true (which is the default value in
 SolrCloud), then it treats it just as if you had sent it to
 /solr/collection instead of /solr/collection_shardN_replicaN, so it's a
 full distributed query. The distrib=false is required to turn that behavior
 off and ONLY query the index on the actual core where you sent it.

 I only said to replace those things as appropriate.  Since you are using
 /select, it's no problem that you left it that way. If I were to assume
 that you used /select, but you didn't, the URLs as I wrote them might not
 have worked.

 As discussed, this means that your replicas are truly out of sync.  It's
 difficult to know what caused it, especially if you can't see anything in
 the log when you indexed the missing documents.

 We know you're on Solr 4.10.1.  This means that your Java is a 1.7
 version, since Java7 is required.

 Here's where I ask a whole lot of questions about your setup. What is the
 precise Java version, and which vendor's Java are you using?  What
 operating system is it on?  Is everything 64-bit, or is any piece (CPU, OS,
 Java) 32-bit?  On the Solr admin UI dashboard, it lists all parameters used
 when starting Java, labelled as Args.  Can you include those?  Is
 zookeeper external, or embedded in Solr?  Is it a 3-server (or more)
 ensemble?  Are you using the example jetty, or did you provide your own
 servlet container?

 We recommend 64-bit Oracle Java, the latest 1.7 version.  OpenJDK (since
 version 1.7.x) should be pretty safe as well, but IBM's Java should be
 avoided.  IBM does very aggressive runtime optimizations.  These can make
 programs run faster, but they are known to negatively affect Lucene/Solr.

 Thanks,
 Shawn




Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-16 Thread S.L
Shawn ,


   1. I will upgrade to the build 67 JVM shortly.
   2. This is a new collection, as I was facing a similar issue in 4.7
   and, based on Erick's recommendation, updated to 4.10.1 and created a new
   collection.
   3. Yes, I am hitting the replicas of the same shard, and I see the lists
   are completely non-overlapping. I am using CloudSolrServer to add the
   documents.
   4. I have a cluster of 3 physical nodes, each having 16GB of memory.
   5. I also have a custom request handler defined in my solrconfig.xml as
   below; however, I am not using it and only use the default select
   handler. My MyCustomHandler class has been added to the source and
   included in the build, but is not being used for any requests yet.

  <requestHandler name="/mycustomselect" class="solr.MyCustomHandler"
startup="lazy">
    <lst name="defaults">
      <str name="df">suggestAggregate</str>

      <str name="spellcheck.dictionary">direct</str>
      <!--<str name="spellcheck.dictionary">wordbreak</str>-->
      <str name="spellcheck">on</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.alternativeTermCount">5</str>
      <str name="spellcheck.maxResultsForSuggest">5</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.collateExtendedResults">true</str>
      <str name="spellcheck.maxCollationTries">10</str>
      <str name="spellcheck.maxCollations">5</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>


 6. The clusterstate.json is copied below:

{"dyCollection1":{
    "shards":{
      "shard1":{
        "range":"8000-d554",
        "state":"active",
        "replicas":{
          "core_node3":{
            "state":"active",
            "core":"dyCollection1_shard1_replica1",
            "node_name":"server3.mydomain.com:8082_solr",
            "base_url":"http://server3.mydomain.com:8082/solr"},
          "core_node4":{
            "state":"active",
            "core":"dyCollection1_shard1_replica2",
            "node_name":"server2.mydomain.com:8081_solr",
            "base_url":"http://server2.mydomain.com:8081/solr",
            "leader":"true"}}},
      "shard2":{
        "range":"d555-2aa9",
        "state":"active",
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"dyCollection1_shard2_replica1",
            "node_name":"server1.mydomain.com:8081_solr",
            "base_url":"http://server1.mydomain.com:8081/solr",
            "leader":"true"},
          "core_node6":{
            "state":"active",
            "core":"dyCollection1_shard2_replica2",
            "node_name":"server3.mydomain.com:8081_solr",
            "base_url":"http://server3.mydomain.com:8081/solr"}}},
      "shard3":{
        "range":"2aaa-7fff",
        "state":"active",
        "replicas":{
          "core_node2":{
            "state":"active",
            "core":"dyCollection1_shard3_replica2",
            "node_name":"server1.mydomain.com:8082_solr",
            "base_url":"http://server1.mydomain.com:8082/solr",
            "leader":"true"},
          "core_node5":{
            "state":"active",
            "core":"dyCollection1_shard3_replica1",
            "node_name":"server2.mydomain.com:8082_solr",
            "base_url":"http://server2.mydomain.com:8082/solr"}}}},
    "maxShardsPerNode":"1",
    "router":{"name":"compositeId"},
    "replicationFactor":"2",
    "autoAddReplicas":"false"}}

  Thanks!
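For reference, the compositeId router hashes each document's uniqueKey (MurmurHash3, 32-bit) and assigns it to the shard whose range covers the hash, so the two replicas of one shard should hold identical, not disjoint, id sets. A simplified range-lookup sketch using the three ranges above, with the hash computation itself omitted and the abbreviated ranges widened to full 32-bit bounds (an assumption on my part):

```python
# Shard hash ranges as (start, end) in unsigned 32-bit space; shard2's
# range wraps around from 0xffffffff back to 0x00000000.
SHARDS = {
    "shard1": (0x80000000, 0xd554ffff),
    "shard2": (0xd5550000, 0x2aa9ffff),
    "shard3": (0x2aaa0000, 0x7fffffff),
}

def shard_for_hash(h):
    """Map an unsigned 32-bit hash value to the shard whose range
    covers it, handling the wrapped range."""
    h &= 0xFFFFFFFF
    for name, (start, end) in SHARDS.items():
        if start <= end:
            if start <= h <= end:
                return name
        elif h >= start or h <= end:  # wrapped range
            return name
    raise ValueError("ranges do not cover the hash space")
```

Every document id hashes to exactly one shard, so a given id can legitimately be absent from two of the three shards; what it cannot legitimately do is appear on only one replica of its own shard.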

On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/16/2014 6:27 PM, S.L wrote:

 1. Java Version :java version 1.7.0_51
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


 I believe that build 51 is one of those that is known to have bugs related
 to Lucene.  If you can upgrade this to 67, that would be good, but I don't
 know that it's a pressing matter.  It looks like the Oracle JVM, which is
 good.

  2.OS
 CentOS Linux release 7.0.1406 (Core)

 3. Everything is 64 bit , OS , Java , and CPU.

 4. Java Args.
  -Djava.io.tmpdir=/opt/tomcat1/temp
  -Dcatalina.home=/opt/tomcat1
  -Dcatalina.base=/opt/tomcat1
  -Djava.endorsed.dirs=/opt/tomcat1/endorsed
  -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,
 server3.mydomain.com:2181
  -DzkClientTimeout=2
  -DhostContext=solr
  -Dport=8081
  -Dhost=server1.mydomain.com
  -Dsolr.solr.home=/opt/solr/home1
  -Dfile.encoding=UTF8
  -Duser.timezone=UTC
  -XX:+UseG1GC
  -XX:MaxPermSize=128m
  -XX:PermSize=64m
  -Xmx2048m
  -Xms128m
  -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
  -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties


 I would not use the G1 collector myself, but with the heap at only 2GB, I
 don't know that it matters all that much.  Even a worst-case collection
 probably is not going to take more than a few seconds, and you've already
 increased the zookeeper client timeout.

 http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

  5

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-15 Thread S.L
.mydomain.com:8081/solr/dyCollection1_shard2_replica2/

   <str name="QTime">14</str>
   <str name="ElapsedTime">17</str>
   <str name="RequestPurpose">GET_TOP_IDS</str>
   <str name="NumFound">1</str>
   <str name="Response">{responseHeader={status=0,QTime=14,params={spellcheck=true,spellcheck.maxCollationTries=10,distrib=false,debug=[false,
track],version=2,NOW=1413398738457,shard.url=
http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/|http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/,df=suggestAggregate,fl=thingURL,score,debugQuery=false,spellcheck.count=10,fq=(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb),fsv=true,spellcheck.alternativeTermCount=5,spellcheck.maxResultsForSuggest=5,spellcheck.collateExtendedResults=true,spellcheck.extendedResults=true,spellcheck.maxCollations=5,wt=javabin,spellcheck.collate=true,requestPurpose=GET_TOP_IDS,rows=10,rid=server3.mydomain.com-dyCollection1_shard2_replica2-1413398738457-16,start=0,q=*:*,shards.info=true,spellcheck.dictionary=[direct,
wordbreak],isShard=true}},response={numFound=1,start=0,maxScore=1.0,docs=[SolrDocument{thingURL=
http://www.redacted.com/ip/Cutter-Bite-MD-Insect-Bite-Relief-.5-fl-oz/12166875,
score=1.0}]},sort_values={},debug={}}</str>
</lst>
<lst name="http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/">
   <str name="QTime">26</str>
   <str name="ElapsedTime">29</str>
   <str name="RequestPurpose">GET_TOP_IDS</str>
   <str name="NumFound">0</str>
   <str name="Response">{responseHeader={status=0,QTime=26,params={spellcheck=true,spellcheck.maxCollationTries=10,distrib=false,debug=[false,
track],version=2,NOW=1413398738457,shard.url=
http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/,df=suggestAggregate,fl=thingURL,score,debugQuery=false,spellcheck.count=10,fq=(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb),fsv=true,spellcheck.alternativeTermCount=5,spellcheck.maxResultsForSuggest=5,spellcheck.collateExtendedResults=true,spellcheck.extendedResults=true,spellcheck.maxCollations=5,wt=javabin,spellcheck.collate=true,requestPurpose=GET_TOP_IDS,rows=10,rid=server3.mydomain.com-dyCollection1_shard2_replica2-1413398738457-16,start=0,q=*:*,shards.info=true,spellcheck.dictionary=[direct,
wordbreak],isShard=true}},response={numFound=0,start=0,maxScore=0.0,docs=[]},sort_values={},debug={}}</str>
</lst>
 </lst>
 <lst name="GET_FIELDS">
<lst name="http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/|http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/">
   <str name="QTime">1</str>
   <str name="ElapsedTime">3</str>
   <str name="RequestPurpose">GET_FIELDS,GET_DEBUG</str>
   <str name="NumFound">1</str>
   <str name="Response">{responseHeader={status=0,QTime=1,params={spellcheck=false,spellcheck.maxCollationTries=10,distrib=false,debug=[track,
track],version=2,df=suggestAggregate,shard.url=
http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/|http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/,NOW=1413398738457,spellcheck.count=10,fq=(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb),spellcheck.alternativeTermCount=5,spellcheck.maxResultsForSuggest=5,spellcheck.collateExtendedResults=true,spellcheck.extendedResults=true,spellcheck.maxCollations=5,ids=http://www.redacted.com/ip/Cutter-Bite,spellcheck.collate=true,wt=javabin,requestPurpose=GET_FIELDS,GET_DEBUG,rows=10,rid=server3.mydomain.com-dyCollection1_shard2_replica2-1413398738457-16,q=*:*,shards.info=true,spellcheck.dictionary=[direct,
wordbreak],isShard=true}},response={numFound=1,start=0,docs=[SolrDocument{thingURL=
http://www.redacted.com/ip/Cutter-Bite,
id=e8995da8-7d98-4010-93b4-8ff7dffb8bfb,
_version_=1481991045188157440}]},debug={}}</str>
</lst>
 </lst>
  </lst>
   </lst>
</response>

On Tue, Oct 14, 2014 at 10:32 AM, Tim Potter tim.pot...@lucidworks.com
wrote:

 Try adding shards.info=true and debug=track to your queries ... these will
 give more detailed information about what's going on behind the scenes.
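
 These two parameters can be appended to any query; a small sketch of
 building such a URL (the host, collection, and document id are
 hypothetical, taken from the examples in this thread):

```python
from urllib.parse import urlencode

def build_debug_query_url(base_url, doc_id):
    """Build a distributed Solr query with per-shard tracking enabled.

    base_url is assumed to be a collection endpoint such as
    http://server1.mydomain.com:8081/solr/dyCollection1 (hypothetical).
    """
    params = {
        "q": "*:*",
        "fq": "(id:%s)" % doc_id,
        "wt": "xml",
        "distrib": "true",
        "shards.info": "true",  # per-shard numFound, shardAddress, timings
        "debug": "track",       # rid plus EXECUTE_QUERY / GET_FIELDS stages
    }
    return "%s/select?%s" % (base_url.rstrip("/"), urlencode(params))

url = build_debug_query_url(
    "http://server1.mydomain.com:8081/solr/dyCollection1",
    "e8995da8-7d98-4010-93b4-8ff7dffb8bfb")
print(url)
```

 The shards.info section then shows which replica of each shard actually
 answered, which is what reveals the out-of-sync replica in this thread.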

 On Mon, Oct 13, 2014 at 11:11 PM, S.L simpleliving...@gmail.com wrote:

  Erick,
 
  I have upgraded to SolrCloud 4.10.1 with the same toplogy , 3 shards and
 2
  replication factor with six cores altogether.
 
  Unfortunately , I still see the issue of intermittently no results being
  returned.I am not able to figure out whats going on here, I have included
  the logging information below.
 
  *Here's the query that I run.*
 
 
 
 http://server1.mydomain.com:8081/solr/dyCollection1/select/?q=*:*&fq=%28id:220a8dce-3b31-4d46-8386-da8405595c47%29&wt=json&distrib=true
 
 
 
  *Scenario 1: No result returned.*
 
  *Log Information for Scenario #1 .*
  92860314 [http-bio-8081-exec-103] INFO

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-15 Thread S.L
Look at the logging information I provided below; it looks like results
are only returned from this SolrCloud cluster if the request goes to one
of the two replicas of a shard.

I have verified that numDocs is the same in the replicas of a given shard,
but maxDoc and deletedDocs differ. Does this signal that the replicas are
out of sync?

Even if numDocs is the same, how do we guarantee that the docs are
identical and have the same unique keys? Is there a way to verify this?
Since numDocs is the same across the replicas, yet I only get a result
back when the request goes to one of the replicas of the shard, I suspect
the documents within the replicas of a shard are not an exact replica set
of each other.

I suspect the issue I am facing on the 4.10.1 cloud is related to
https://issues.apache.org/jira/browse/SOLR-4924 .

Can anyone please let me know how to solve this issue of intermittently
missing results for a query?



On Wed, Oct 15, 2014 at 3:15 PM, S.L simpleliving...@gmail.com wrote:

 Tim,

 Thanks for the suggestion.

 I have rerun the query after adding shards.info=true and debug=track, and
 have included the XML data for both scenarios below. This happens
 intermittently on SolrCloud 4.10.1, with a replication factor of 2 and 3
 shards (6 cores): I get a result in one execution of the query and then no
 results for the subsequent one. I am hoping someone will be able to help
 me find the root cause with this additional information; I have included
 the query output with the additional parameters for both scenarios
 below.

 Thanks for your help!

 *Scenario #1 : In this try I get no results back. Here is what the query
 returns.*

 <?xml version="1.0" encoding="UTF-8"?>
 <response>
    <lst name="responseHeader">
       <int name="status">0</int>
       <int name="QTime">29</int>
       <lst name="params">
          <str name="q">*:*</str>
          <str name="shards.info">true</str>
          <str name="distrib">true</str>
          <str name="debug">track</str>
          <str name="wt">xml</str>
          <str name="fq">(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb)</str>
       </lst>
    </lst>
    <lst name="shards.info">
       <lst name="http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/">
          <long name="numFound">0</long>
          <float name="maxScore">0.0</float>
          <str name="shardAddress">http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1</str>
          <long name="time">4</long>
       </lst>
       <lst name="http://server3.mydomain.com:8082/solr/dyCollection1_shard1_replica1/|http://server2.mydomain.com:8081/solr/dyCollection1_shard1_replica2/">
          <long name="numFound">0</long>
          <float name="maxScore">0.0</float>
          <str name="shardAddress">http://server3.mydomain.com:8082/solr/dyCollection1_shard1_replica1</str>
          <long name="time">13</long>
       </lst>
       <lst name="http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/|http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/">
          <long name="numFound">0</long>
          <float name="maxScore">0.0</float>
          <str name="shardAddress">http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1</str>
          <long name="time">26</long>
       </lst>
    </lst>
    <result name="response" numFound="0" start="0" maxScore="0.0"/>
    <lst name="spellcheck">
       <lst name="suggestions">
          <bool name="correctlySpelled">false</bool>
       </lst>
    </lst>
    <lst name="debug">
       <lst name="track">
          <str name="rid">server3.mydomain.com-dyCollection1_shard2_replica2-1413398784226-17</str>
          <lst name="EXECUTE_QUERY">
             <lst name="http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/">
                <str name="QTime">1</str>
                <str name="ElapsedTime">4</str>
                <str name="RequestPurpose">GET_TOP_IDS</str>
                <str name="NumFound">0</str>
                <str name="Response">{responseHeader={status=0,QTime=1,params={spellcheck=true,spellcheck.maxCollationTries=10,distrib=false,debug=[false,
 track],version=2,NOW=1413398784225,shard.url=
 http://server1.mydomain.com:8082/solr/dyCollection1_shard3_replica2/|http://server2.mydomain.com:8082/solr/dyCollection1_shard3_replica1/,df=suggestAggregate,fl=thingURL,score,debugQuery=false,spellcheck.count=10,fq=(id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb),fsv=true,spellcheck.alternativeTermCount=5,spellcheck.maxResultsForSuggest=5,spellcheck.collateExtendedResults=true,spellcheck.extendedResults=true,spellcheck.maxCollations=5,wt=javabin,spellcheck.collate=true,requestPurpose=GET_TOP_IDS,rows=10,rid=server3.mydomain.com-dyCollection1_shard2_replica2-1413398784226-17,start=0,q=*:*,shards.info=true,spellcheck.dictionary=[direct,
 wordbreak],isShard=true}},response={numFound=0,start=0,maxScore=0.0,docs=[]},sort_values={},debug={}}</str>
             </lst>
 lst name=
 http://server3

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-15 Thread S.L
Shawn,

Yes, I tried those two queries with distrib=false. I consistently get 0
results for the first and 1 result for the second query (i.e. server 3,
shard 2, replica 2).

However, if I run that same second query (server 3, shard 2, replica 2)
with distrib=true, I sometimes get a result and sometimes not. Shouldn't
the query always return a result when it points to a core that seems to
have that document, regardless of distrib=true or false?

Unfortunately I don't see anything particular in the logs that points to
any useful information.

BTW, you asked me to replace the request handler as appropriate; I use the
select request handler, so I cannot replace it with anything else. Is that
a problem?

Thanks.

On Thu, Oct 16, 2014 at 12:05 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/15/2014 9:26 PM, S.L wrote:

 Look at the logging information I provided below , looks like the results
 are only being returned back for this solrCloud cluster  if the request
 goes to one of the two replicas of a shard.

 I have verified that numDocs in the replicas for a given shard is same but
 there is difference in the maxDoc and deletedDocs, does this signal the
 replicas being out of sync ?

 Even if the numDocs are same , how do we guarantee that those docs are
 identical and have the same uniquekeys , is there a way to verify this ? I
 am suspecting that  as the numDocs is same across the replicas , and still
 only when the request goes to one of  the  replicas of the shard that I
 get
 a result back , the documents with in those replicas with in a shard are
 not an exact replica set of each other.

 I suspect the issue I am facing in 4.10.1 cloud is related to
 https://issues.apache.org/jira/browse/SOLR-4924  .

 Can anyone please let me know , how to solve this issue of intermittent no
 results for a query ?


 query with no results hits these cores:
 server 2 shard 3 replica1
 server 3 shard 1 replica 1
 server 1 shard 2 replica 1

 query with 1 result hits these cores:
 server 2 shard 1 replica 2
 server 3 shard 2 replica 2 (found 1)
 server 1 shard 3 replica 2

 Here's some URLs for some testing.  They are directed at specific shard
 replicas and are specifically NOT distributed queries:

 http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false

 http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false

 If you run these queries (replacing server names and the /select request
 handler as appropriate), do you get 0 results on the first one and 1 result
 on the second one?  If you do, then you've definitely got replicas out of
 sync.  If you get 1 result on both queries, then something else is
 breaking.  If by chance you have taken steps to fix this particular ID,
 pick another one that you know has a problem.

 There is no automated way to detect replicas out of sync.  You could
 request all docs on both replicas with distrib=false&fl=id&sort=id+asc,
 then compare the two lists.  Depending on how many docs you have, those
 queries could take a while to run.

 If the replicas are out of sync, are there any ERROR entries in the Solr
 log, especially at the time that the problem docs were indexed?

 Thanks,
 Shawn
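
Shawn's suggestion of dumping the ids from each replica with distrib=false
and comparing the two lists can be sketched as follows. The host and core
names are hypothetical placeholders from this thread; the fetch assumes the
JSON response writer, and the rows cap is an assumption you would size to
your index:

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

def fetch_ids(core_url, rows=1000000):
    """Fetch every document id from one replica, bypassing distributed search.

    core_url is a core endpoint such as
    http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1
    (hypothetical; adjust the host/core names and rows cap for your index).
    """
    params = urlencode({"q": "*:*", "distrib": "false", "fl": "id",
                        "sort": "id asc", "rows": rows, "wt": "json"})
    with urlopen("%s/select?%s" % (core_url.rstrip("/"), params)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return {d["id"] for d in docs}

def diff_replicas(leader_ids, follower_ids):
    """Return (only_on_leader, only_on_follower) id sets."""
    return leader_ids - follower_ids, follower_ids - leader_ids

# Canned example: equal numDocs (3 vs 3) can still hide a mismatch.
a = {"doc1", "doc2", "doc3"}
b = {"doc1", "doc2", "doc4"}
only_a, only_b = diff_replicas(a, b)
print(sorted(only_a), sorted(only_b))  # ['doc3'] ['doc4']
```

If both difference sets come back empty for the leader and follower of each
shard, the replicas hold the same unique keys even when maxDoc and
deletedDocs differ.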




Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-13 Thread S.L
 to track down, you
 just are lucky perhaps ;)...

 Erick

 On Mon, Oct 6, 2014 at 8:04 PM, S.L simpleliving...@gmail.com wrote:
  Erick,
 
  Thanks for the suggestion , I am not sure if I would be able to capture
  what went wrong , so upgrading to 4.10 seems easier even though it means
 ,
  a days work of effort :) . I will go ahead and upgrade and let me know ,
  although I am surprised that this issue never got reported for 4.7 up
 until
  now.
 
  Thanks again for your help!
 
 
 
  On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  I think there were some holes that would allow replicas and leaders to
  be out of synch that have been patched up in the last 3 releases.
 
  There shouldn't be anything you need to do to keep these in synch, so
  if you can capture what happened when things got out of synch we'll
  fix it. But a lot has changed in the last several months, so the first
  thing I'd do if possible is to upgrade to 4.10.1.
 
 
  Best,
  Erick
 
  On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote:
   Hi Erick,
  
   Before I tried your suggestion of  issung a commit=true update, I
  realized that for eaach shard there was atleast a node that had its
 index
  directory named like index.timestamp.
  
   I went ahead and deleted index directory that restarted that core and
  now the index directory got syched with the other node and is properly
  named as 'index' without any timestamp attached to it.This is now
 giving me
  consistent results for distrib=true using a load balancer.Also
  distrib=false returns expexted results for a given shard.
  
   The underlying issue appears to be that in every shard the leader and
  the replica(follower) were out of sych.
  
   How can I avoid this from happening again?
  
   Thanks for your help!
  
   Sent from my HTC
  
   - Reply message -
   From: Erick Erickson erickerick...@gmail.com
   To: solr-user@lucene.apache.org
   Subject: SolrCloud 4.7 not doing distributed search when querying
 from a
  load balancer.
   Date: Fri, Oct 3, 2014 12:56 AM
  
   H. Assuming that you aren't re-indexing the doc you're searching
  for...
  
   Try issuing http://blah blah:8983/solr/collection/update?commit=true.
   That'll force all the docs to be searchable. Does 1 still hold for
   the document in question? Because this is exactly backwards of what
   I'd expect. I'd expect, if anything, the replica (I'm trying to call
   it the follower when a distinction needs to be made since the leader
   is a replica too) would be out of sync. This is still a Bad
   Thing, but the leader gets first crack at indexing thing.
  
   bq: only the replica of the shard that has this key returns the result
   , and the leader does not ,
  
   Just to be sure we're talking about the same thing. When you say
   leader, you mean the shard leader, right? The filled-in circle on
   the graph view from the admin/cloud page.
  
   And let's see your soft and hard commit settings please.
  
   Best,
   Erick
  
   On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com
 wrote:
   Eirck,
  
   0 Load balancer is out of the picture
   .
   1When I query with *distrib=false* , I get consistent results as
  expected
   for those shards that dont have the key i.e I dont get the results
 back
  for
   those shards, however I just realized that while *distrib=false* is
  present
   in the query for the shard that is supposed to contain the key,only
 the
   replica of the shard that has this key returns the result , and the
  leader
   does not , looks like replica and the leader do not have the same
 data
  and
   replica seems to contain the key in the query for that shard.
  
   2 By indexing I mean this collection is being populated by a web
  crawler.
  
   So looks like 1 above  is pointing to leader and replica being out
 of
   synch for atleast one shard.
  
  
  
   On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
   bq: Also ,the collection is being actively indexed as I query this,
  could
   that
   be an issue too ?
  
   Not if the documents you're searching aren't being added as you
 search
   (and all your autocommit intervals have expired).
  
   I would turn off indexing for testing, it's just one more variable
   that can get in the way of understanding this.
  
   Do note that if the problem were endemic to Solr, there would
 probably
   be a _lot_ more noise out there.
  
   So to recap:
   0 we can take the load balancer out of the picture all together.
  
   1 when you query each shard individually with distrib=true, every
   replica in a particular shard returns the same count.
  
   2 when you query without distrib=true you get varying counts.
  
   This is very strange and not at all expected. Let's try it again
   without indexing going on
  
   And what do you mean by indexing anyway? How are documents being
 fed
   to your system?
  
   Best,
   Erick

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-06 Thread S.L
Hi Erick,

Before I tried your suggestion of issuing a commit=true update, I realized that
for each shard there was at least one node whose index directory was named
like index.timestamp.

I went ahead and deleted that index directory, which restarted the core; the
index directory then got synced with the other node and is properly named
'index' without any timestamp attached to it. This now gives me consistent
results for distrib=true through the load balancer. distrib=false also returns
the expected results for a given shard.

The underlying issue appears to be that in every shard the leader and the
replica (follower) were out of sync.

How can I avoid this from happening again?

Thanks for your help!

Sent from my HTC

- Reply message -
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.7 not doing distributed search when querying from a load 
balancer.
Date: Fri, Oct 3, 2014 12:56 AM

H. Assuming that you aren't re-indexing the doc you're searching for...

Try issuing http://blah blah:8983/solr/collection/update?commit=true.
That'll force all the docs to be searchable. Does 1 still hold for
the document in question? Because this is exactly backwards of what
I'd expect. I'd expect, if anything, the replica (I'm trying to call
it the follower when a distinction needs to be made since the leader
is a replica too) would be out of sync. This is still a Bad
Thing, but the leader gets first crack at indexing thing.

bq: only the replica of the shard that has this key returns the result
, and the leader does not ,

Just to be sure we're talking about the same thing. When you say
leader, you mean the shard leader, right? The filled-in circle on
the graph view from the admin/cloud page.

And let's see your soft and hard commit settings please.

Best,
Erick
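
For reference, the soft and hard commit settings Erick asks about live in the
updateHandler section of solrconfig.xml. A minimal sketch; the 15 s / 60 s
intervals are illustrative assumptions, not the poster's actual configuration:

```xml
<!-- solrconfig.xml sketch (illustrative interval values) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes segments to disk and rolls the transaction log;
       with openSearcher=false it does not make new docs searchable -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: opens a new searcher so recently indexed docs
       become visible to queries -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>
```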

On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote:
 Eirck,

 0 Load balancer is out of the picture
 .
 1When I query with *distrib=false* , I get consistent results as expected
 for those shards that dont have the key i.e I dont get the results back for
 those shards, however I just realized that while *distrib=false* is present
 in the query for the shard that is supposed to contain the key,only the
 replica of the shard that has this key returns the result , and the leader
 does not , looks like replica and the leader do not have the same data and
 replica seems to contain the key in the query for that shard.

 2 By indexing I mean this collection is being populated by a web crawler.

 So looks like 1 above  is pointing to leader and replica being out of
 synch for atleast one shard.



 On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 bq: Also ,the collection is being actively indexed as I query this, could
 that
 be an issue too ?

 Not if the documents you're searching aren't being added as you search
 (and all your autocommit intervals have expired).

 I would turn off indexing for testing, it's just one more variable
 that can get in the way of understanding this.

 Do note that if the problem were endemic to Solr, there would probably
 be a _lot_ more noise out there.

 So to recap:
 0 we can take the load balancer out of the picture all together.

 1 when you query each shard individually with distrib=true, every
 replica in a particular shard returns the same count.

 2 when you query without distrib=true you get varying counts.

 This is very strange and not at all expected. Let's try it again
 without indexing going on

 And what do you mean by indexing anyway? How are documents being fed
 to your system?

 Best,
 Erick@PuzzledAsWell

 On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote:
  Erick,
 
  I would like to add that the interesting behavior i.e point #2 that I
  mentioned in my earlier reply  happens in all the shards , if this were
 to
  be a distributed search issue this should have not manifested itself in
 the
  shard that contains the key that I am searching for , looks like the
 search
  is just failing as whole intermittently .
 
  Also ,the collection is being actively indexed as I query this, could
 that
  be an issue too ?
 
  Thanks.
 
  On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote:
 
  Erick,
 
  Thanks for your reply, I tried your suggestions.
 
  1 . When not using loadbalancer if  *I have distrib=false* I get
  consistent results across the replicas.
 
  2. However here's the insteresting part , while not using load balancer
 if
  I *dont have distrib=false* , then when I query a particular node ,I get
  the same behaviour as if I were using a loadbalancer , meaning the
  distributed search from a node works intermittently .Does this give any
  clue ?
 
 
 
  On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  Hmmm, nothing quite makes sense here
 
  Here are some experiments:
  1 avoid the load balancer and issue queries like

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-06 Thread S.L
Erick,

Thanks for the suggestion. I am not sure I would be able to capture
what went wrong, so upgrading to 4.10 seems easier even though it means
a day's worth of effort :). I will go ahead and upgrade and let you know,
although I am surprised that this issue was never reported for 4.7 until
now.

Thanks again for your help!



On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com
wrote:

 I think there were some holes that would allow replicas and leaders to
 be out of synch that have been patched up in the last 3 releases.

 There shouldn't be anything you need to do to keep these in synch, so
 if you can capture what happened when things got out of synch we'll
 fix it. But a lot has changed in the last several months, so the first
 thing I'd do if possible is to upgrade to 4.10.1.


 Best,
 Erick

 On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote:
  Hi Erick,
 
  Before I tried your suggestion of  issung a commit=true update, I
 realized that for eaach shard there was atleast a node that had its index
 directory named like index.timestamp.
 
  I went ahead and deleted index directory that restarted that core and
 now the index directory got syched with the other node and is properly
 named as 'index' without any timestamp attached to it.This is now giving me
 consistent results for distrib=true using a load balancer.Also
 distrib=false returns expexted results for a given shard.
 
  The underlying issue appears to be that in every shard the leader and
 the replica(follower) were out of sych.
 
  How can I avoid this from happening again?
 
  Thanks for your help!
 
  Sent from my HTC
 
  - Reply message -
  From: Erick Erickson erickerick...@gmail.com
  To: solr-user@lucene.apache.org
  Subject: SolrCloud 4.7 not doing distributed search when querying from a
 load balancer.
  Date: Fri, Oct 3, 2014 12:56 AM
 
  H. Assuming that you aren't re-indexing the doc you're searching
 for...
 
  Try issuing http://blah blah:8983/solr/collection/update?commit=true.
  That'll force all the docs to be searchable. Does 1 still hold for
  the document in question? Because this is exactly backwards of what
  I'd expect. I'd expect, if anything, the replica (I'm trying to call
  it the follower when a distinction needs to be made since the leader
  is a replica too) would be out of sync. This is still a Bad
  Thing, but the leader gets first crack at indexing thing.
 
  bq: only the replica of the shard that has this key returns the result
  , and the leader does not ,
 
  Just to be sure we're talking about the same thing. When you say
  leader, you mean the shard leader, right? The filled-in circle on
  the graph view from the admin/cloud page.
 
  And let's see your soft and hard commit settings please.
 
  Best,
  Erick
 
  On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote:
  Eirck,
 
  0 Load balancer is out of the picture
  .
  1When I query with *distrib=false* , I get consistent results as
 expected
  for those shards that dont have the key i.e I dont get the results back
 for
  those shards, however I just realized that while *distrib=false* is
 present
  in the query for the shard that is supposed to contain the key,only the
  replica of the shard that has this key returns the result , and the
 leader
  does not , looks like replica and the leader do not have the same data
 and
  replica seems to contain the key in the query for that shard.
 
  2 By indexing I mean this collection is being populated by a web
 crawler.
 
  So looks like 1 above  is pointing to leader and replica being out of
  synch for atleast one shard.
 
 
 
  On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  bq: Also ,the collection is being actively indexed as I query this,
 could
  that
  be an issue too ?
 
  Not if the documents you're searching aren't being added as you search
  (and all your autocommit intervals have expired).
 
  I would turn off indexing for testing, it's just one more variable
  that can get in the way of understanding this.
 
  Do note that if the problem were endemic to Solr, there would probably
  be a _lot_ more noise out there.
 
  So to recap:
  0 we can take the load balancer out of the picture all together.
 
  1 when you query each shard individually with distrib=true, every
  replica in a particular shard returns the same count.
 
  2 when you query without distrib=true you get varying counts.
 
  This is very strange and not at all expected. Let's try it again
  without indexing going on
 
  And what do you mean by indexing anyway? How are documents being fed
  to your system?
 
  Best,
  Erick@PuzzledAsWell
 
  On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote:
   Erick,
  
   I would like to add that the interesting behavior i.e point #2 that I
   mentioned in my earlier reply  happens in all the shards , if this
 were
  to
   be a distributed search issue

SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread S.L
Hi All,

I am trying to query a 6-node Solr 4.7 cluster with 3 shards and a
replication factor of 2.

I have fronted these 6 Solr nodes with a load balancer. What I notice is
that every time I do a search of the form
q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf) it gives me a result
only once in every 3 tries, telling me that the load balancer is
distributing the requests among the 3 shards and SolrCloud only returns a
result if the request goes to the core that has that id.

However, if I do a simple search like q=*:*, I consistently get the right
aggregated results back for all the documents across all the shards on
every request through the load balancer. Can someone please let me know
what this is symptomatic of?

Somehow SolrCloud seems to be doing search query distribution and
aggregation only for queries of type *:*.

Thanks.


Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread S.L
Erick,

Thanks for your reply; I tried your suggestions.

1. When not using the load balancer, if I *have distrib=false* I get
consistent results across the replicas.

2. However, here's the interesting part: while not using the load balancer,
if I *don't have distrib=false*, then when I query a particular node I get
the same behaviour as if I were using the load balancer, meaning the
distributed search from a node works only intermittently. Does this give
any clue?



On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Hmmm, nothing quite makes sense here

 Here are some experiments:
 1> avoid the load balancer and issue queries like
 http://solr_server:8983/solr/collection/q=whatever&distrib=false

 the distrib=false bit will keep SolrCloud from trying to send
 the queries anywhere; they'll be served only from the node you address
 them to.
 That'll help check whether the nodes are consistent. You should be
 getting back the same results from each replica in a shard (i.e. 2 of
 your 6 machines).

 Next, try your failing query the same way.

 Next, try your failing query from a browser, pointing it at successive
 nodes.

 Where is the first place problems show up?

 My _guess_ is that your load balancer isn't quite doing what you think, or
 your cluster isn't set up the way you think it is, but those are guesses.

 Best,
 Erick

 On Thu, Oct 2, 2014 at 2:51 PM, S.L simpleliving...@gmail.com wrote:
  Hi All,
 
  I am trying to query a 6 node Solr4.7  cluster with 3 shards and  a
  replication factor of 2 .
 
  I have fronted these 6 Solr nodes using a load balancer , what I notice
 is
  that every time I do a search of the form
  q=*:*fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf)  it gives me a result
  only once in every 3 tries , telling me that the load balancer is
  distributing the requests between the 3 shards and SolrCloud only
 returns a
  result if the request goes to the core that as that id .
 
  However if I do a simple search like q=*:* , I consistently get the right
  aggregated results back of all the documents across all the shards for
  every request from the load balancer. Can someone please let me know what
  this is symptomatic of ?
 
  Somehow Solr Cloud seems to be doing search query distribution and
  aggregation for queries of type *:* only.
 
  Thanks.
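For reference, the per-node consistency check Erick describes can be scripted. The sketch below (Python; the host names and collection name are hypothetical placeholders) only builds the distrib=false URLs, leaving the actual fetch and numFound comparison to whatever HTTP client you prefer; replicas of the same shard should report identical counts.

```python
import urllib.parse

# Hypothetical node list -- replace with your six SolrCloud nodes.
NODES = ["http://solr1:8983", "http://solr2:8983", "http://solr3:8983",
         "http://solr4:8983", "http://solr5:8983", "http://solr6:8983"]

def build_query_url(base, collection, q, distrib=False):
    """Build a /select URL; distrib=false keeps the query on the node it is sent to."""
    params = {"q": q, "wt": "json", "distrib": str(distrib).lower()}
    return "%s/solr/%s/select?%s" % (base, collection, urllib.parse.urlencode(params))

# One URL per node; fetch each (e.g. with urllib.request) and compare the
# numFound values in the JSON responses.
for node in NODES:
    print(build_query_url(node, "collection1",
                          "id:9e78c064-919f-4ef3-b236-dc66351b4acf"))
```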



Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread S.L
Erick,

I would like to add that the interesting behavior, i.e. point #2 that I
mentioned in my earlier reply, happens in all the shards. If this were a
distributed-search issue it should not have manifested itself in the shard
that contains the key I am searching for; it looks like the search is
just failing as a whole, intermittently.

Also, the collection is being actively indexed as I query this; could that
be an issue too?

Thanks.

On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote:

 Erick,

 Thanks for your reply, I tried your suggestions.

 1. When not using the load balancer, if *I have distrib=false* I get
 consistent results across the replicas.

 2. However, here's the interesting part: while not using the load balancer, if
 I *don't have distrib=false*, then when I query a particular node I get
 the same behaviour as if I were using a load balancer, meaning the
 distributed search from a node works only intermittently. Does this give any
 clue?



 On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Hmmm, nothing quite makes sense here

 Here are some experiments:
 1 avoid the load balancer and issue queries like
  http://solr_server:8983/solr/collection/select?q=whatever&distrib=false

  the distrib=false bit will keep SolrCloud from trying to send
 the queries anywhere, they'll be served only from the node you address
 them to.
 that'll help check whether the nodes are consistent. You should be
 getting back the same results from each replica in a shard (i.e. 2 of
 your 6 machines).

 Next, try your failing query the same way.

 Next, try your failing query from a browser, pointing it at successive
 nodes.

 Where is the first place problems show up?

 My _guess_ is that your load balancer isn't quite doing what you think, or
 your cluster isn't set up the way you think it is, but those are guesses.

 Best,
 Erick

 On Thu, Oct 2, 2014 at 2:51 PM, S.L simpleliving...@gmail.com wrote:
  Hi All,
 
  I am trying to query a 6 node Solr4.7  cluster with 3 shards and  a
  replication factor of 2 .
 
  I have fronted these 6 Solr nodes using a load balancer , what I notice
 is
  that every time I do a search of the form
   q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf)  it gives me a result
  only once in every 3 tries , telling me that the load balancer is
  distributing the requests between the 3 shards and SolrCloud only
 returns a
  result if the request goes to the core that has that id.
 
  However if I do a simple search like q=*:* , I consistently get the
 right
  aggregated results back of all the documents across all the shards for
  every request from the load balancer. Can someone please let me know
 what
  this is symptomatic of ?
 
  Somehow Solr Cloud seems to be doing search query distribution and
  aggregation for queries of type *:* only.
 
  Thanks.





Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread S.L
Erick,

0) The load balancer is out of the picture.

1) When I query with *distrib=false*, I get consistent results as expected
for the shards that don't have the key, i.e. I don't get results back from
those shards. However, I just realized that when *distrib=false* is present
in the query against the shard that is supposed to contain the key, only the
replica of that shard returns the result and the leader does not. It looks
like the replica and the leader do not have the same data, and the replica is
the one that contains the key for that shard.

2) By indexing I mean this collection is being populated by a web crawler.

So 1) above points to the leader and replica being out of sync for at least
one shard.



On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com
wrote:

 bq: Also ,the collection is being actively indexed as I query this, could
 that
 be an issue too ?

 Not if the documents you're searching aren't being added as you search
 (and all your autocommit intervals have expired).

 I would turn off indexing for testing, it's just one more variable
 that can get in the way of understanding this.

 Do note that if the problem were endemic to Solr, there would probably
 be a _lot_ more noise out there.

 So to recap:
 0 we can take the load balancer out of the picture all together.

 1 when you query each shard individually with distrib=false, every
 replica in a particular shard returns the same count.

 2 when you query without distrib=false you get varying counts.

 This is very strange and not at all expected. Let's try it again
 without indexing going on

 And what do you mean by indexing anyway? How are documents being fed
 to your system?

 Best,
 Erick@PuzzledAsWell

 On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote:
  Erick,
 
  I would like to add that the interesting behavior i.e point #2 that I
  mentioned in my earlier reply  happens in all the shards , if this were
 to
  be a distributed search issue this should have not manifested itself in
 the
  shard that contains the key that I am searching for , looks like the
 search
  is just failing as whole intermittently .
 
  Also ,the collection is being actively indexed as I query this, could
 that
  be an issue too ?
 
  Thanks.
 
  On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote:
 
  Erick,
 
  Thanks for your reply, I tried your suggestions.
 
  1 . When not using loadbalancer if  *I have distrib=false* I get
  consistent results across the replicas.
 
  2. However here's the insteresting part , while not using load balancer
 if
  I *dont have distrib=false* , then when I query a particular node ,I get
  the same behaviour as if I were using a loadbalancer , meaning the
  distributed search from a node works intermittently .Does this give any
  clue ?
 
 
 
  On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  Hmmm, nothing quite makes sense here
 
  Here are some experiments:
  1 avoid the load balancer and issue queries like
   http://solr_server:8983/solr/collection/select?q=whatever&distrib=false
 
   the distrib=false bit will keep SolrCloud from trying to send
  the queries anywhere, they'll be served only from the node you address
  them to.
  that'll help check whether the nodes are consistent. You should be
  getting back the same results from each replica in a shard (i.e. 2 of
  your 6 machines).
 
  Next, try your failing query the same way.
 
  Next, try your failing query from a browser, pointing it at successive
  nodes.
 
  Where is the first place problems show up?
 
  My _guess_ is that your load balancer isn't quite doing what you
 think, or
  your cluster isn't set up the way you think it is, but those are
 guesses.
 
  Best,
  Erick
 
  On Thu, Oct 2, 2014 at 2:51 PM, S.L simpleliving...@gmail.com wrote:
   Hi All,
  
   I am trying to query a 6 node Solr4.7  cluster with 3 shards and  a
   replication factor of 2 .
  
   I have fronted these 6 Solr nodes using a load balancer , what I
 notice
  is
   that every time I do a search of the form
    q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf)  it gives me a
 result
   only once in every 3 tries , telling me that the load balancer is
   distributing the requests between the 3 shards and SolrCloud only
  returns a
   result if the request goes to the core that has that id.
  
   However if I do a simple search like q=*:* , I consistently get the
  right
   aggregated results back of all the documents across all the shards
 for
   every request from the load balancer. Can someone please let me know
  what
   this is symptomatic of ?
  
   Somehow Solr Cloud seems to be doing search query distribution and
   aggregation for queries of type *:* only.
  
   Thanks.
 
 
 



pySolr and other Python client options for SolrCloud.

2014-10-01 Thread S.L
Hi All,

We recently moved from a single Solr instance to SolrCloud, and we are using
pysolr. I am wondering what options (clients) we have from Python to take
advantage of the ZooKeeper and load-balancing capabilities that SolrCloud
provides, the way a smart client like SolrJ does.

Thanks.


Re: pySolr and other Python client options for SolrCloud.

2014-10-01 Thread S.L
Right, but my question was whether there are any Python clients which
achieve the same thing as SolrJ, or what approach one should take when
using Python-based clients.

On Wed, Oct 1, 2014 at 3:57 PM, Upayavira u...@odoko.co.uk wrote:



 On Wed, Oct 1, 2014, at 08:47 PM, S.L wrote:
  Hi All,
 
  We recently moved from a single Solr instance to SolrCloud and we are
  using
  pysolr , I am wondering what options (clients)  we have from Python  to
  take advantage of Zookeeper and load balancing capabilities that
  SolrCloud
  provides if I were to use a smart client like Solrj?

 Obviously SolrJ is Java, not Python. SolrJ has integration with
 Zookeeper, so when you instantiate a CloudSolrServer instance, you tell
 it where Zookeeper is, not Solr. Your app then consults Zookeeper to
 find out which Solr instance to talk to.

 This means you can move stuff around within your infrastructure without
 needing to tell your app, and without needing to mess with load
 balancers as that is all handled for you by the SolrJ client deciding
 which node to forward your request.

 Upayavira



Re: pySolr and other Python client options for SolrCloud.

2014-10-01 Thread S.L
Shawn,

Thanks. A load balancer seems to be the preferred solution here. I have a
topology where I have 6 Solr nodes that support 3 shards with a replication
factor of 2.

It looks like it would be better to use the load balancer for querying only.
The question I have is: if I go the load-balancer route, should I be listing
all six nodes in the load balancer, or only the leaders as identified by the
SolrCloud admin console? Would the load-balancing solution also incur any
additional routing of requests between the individual nodes of SolrCloud that
would not have happened had the Python Solr client been ZooKeeper-aware?

Also, for indexing, which is not done from a Python client but is done using
SolrJ, I will avoid the load balancer and do the indexing via the ZooKeeper
route.

Thanks.

On Wed, Oct 1, 2014 at 8:42 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/1/2014 2:29 PM, S.L wrote:
  Right , but my query was to know if there are any Python clients which
  achieve the same thing as SolrJ  , or the approach one should take when
  using Python based clients.

 If the python client can support multiple hosts and failing over between
 them, then you would simply list multiple URLs.  If not, then you'll
 need a load balancer.  I use haproxy with Solr (not in Cloud mode) for
 automatic failover, and it should work equally well for SolrCloud and a
 non-java client.

 It looks like Alexandre knows a lot more about it than I do ... I know
 very little about python.

 Thanks,
 Shawn
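A minimal sketch of the multiple-URLs-with-failover approach Shawn describes. The class name and the injected `send` callable are hypothetical, not part of pysolr; injecting the transport keeps the failover logic independent of any particular HTTP or Solr library.

```python
import random

class FailoverClient:
    """Try each Solr base URL in turn until one request succeeds.

    `send` is any callable that takes a base URL and returns a response;
    it should raise on connection failure. Shuffling spreads load crudely
    across the nodes between calls.
    """

    def __init__(self, base_urls, send):
        self.base_urls = list(base_urls)
        self.send = send

    def request(self):
        urls = self.base_urls[:]
        random.shuffle(urls)
        errors = []
        for url in urls:
            try:
                return self.send(url)
            except Exception as exc:  # sketch only; narrow this in real code
                errors.append((url, exc))
        raise RuntimeError("no live Solr servers: %r" % errors)
```

With pysolr one would construct a `pysolr.Solr` instance per base URL inside `send` and run the actual query there; since any SolrCloud node can serve a distributed query, listing all six nodes is fine.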




Re: pySolr and other Python client options for SolrCloud.

2014-10-01 Thread S.L
That makes perfect sense , thanks again!

On Wed, Oct 1, 2014 at 10:09 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/1/2014 7:08 PM, S.L wrote:
  Thanks ,load balancer seems to be the preferred solution here , I have a
  topology where I have 6 Solr nodes that support 3 shards with a
 replication
  factor of 2.
 
  Looks like it would be better to use the load balancers for querying
  only.The question, that I have is if I go the load balancer route should
 I
  be listing all the six nodes in the load balancer or only the leaders as
  identified by SolrCloud admin console?Would the load balancing solution
  also incur any additional routing of requests between the individual
 nodes
  of SolrCloud that would have not happened had the python Solr client been
  zookeeper aware?
 
  Also for indexing ,which is not done from a Python client but is done
 using
  Solrj, I will avoid the load balancers and do the indexing  it via the
  Zookeeper route.

 If you were to send all your queries to just one server, it's my
 understanding that SolrCloud will load balance the actual work across
 the cloud.  I have not verified this.

 For a load balancer, the minimum requirement would be to list two of the
 servers, but it's probably better to list them all.  Leader designations
 can change, and I'm pretty sure you don't want to change your load
 balancer config just because the leader changed.

 If your 3 shards are using automatic document routing, then you can send
 updates to any machine in the cluster and they'll end up in the right
 place.  Since you're using SolrJ for updates, this is probably not
 something you need to worry about.

 Thanks,
 Shawn
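For reference, a minimal haproxy backend of the kind Shawn mentions might look like the sketch below. The host names are hypothetical, and the health check assumes Solr's ping handler is enabled in solrconfig.xml; listing all six nodes means leader changes never require a load-balancer config change.

```
# Hypothetical haproxy config for a 6-node SolrCloud cluster
frontend solr_front
    mode http
    bind *:8983
    default_backend solr_back

backend solr_back
    mode http
    balance roundrobin
    option httpchk GET /solr/collection1/admin/ping
    server solr1 solr1:8983 check
    server solr2 solr2:8983 check
    server solr3 solr3:8983 check
    server solr4 solr4:8983 check
    server solr5 solr5:8983 check
    server solr6 solr6:8983 check
```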




Intermittent error indexing SolrCloud 4.7.0

2014-08-19 Thread S.L
Hi All,

I get No Live SolrServers available to handle this request error
intermittently while indexing in a SolrCloud cluster with 3 shards and
replication factor of 2.

I am using Solr 4.7.0.

Please see the stack trace below.

org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request
at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:352)
~[DynaOCrawlerUtils.jar:?]
at 
org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:640)
~[DynaOCrawlerUtils.jar:?]
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
~[DynaOCrawlerUtils.jar:?]
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
~[DynaOCrawlerUtils.jar:?]
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146)
~[DynaOCrawlerUtils.jar:?]


Crawl-Delay in robots.txt and fetcher.threads.per.queue property in Nutch

2014-06-25 Thread S.L
Hello All

If I set the fetcher.threads.per.queue property to more than 1, I believe the
behavior would be to have that many threads per host in Nutch. In that case,
would Nutch still respect the Crawl-Delay directive in robots.txt and not crawl
at a faster pace than what is specified in robots.txt?

In short, what I am asking is whether setting fetcher.threads.per.queue to 1 is
required to be as polite as the Crawl-Delay in robots.txt expects.

Thx



Spell checker - limit on number of misspelt words in a search term.

2014-06-17 Thread S.L
Hi All,

I am using the Direct Spell checker component and I have collate =true in
my solrconfig.xml.

The issue that I noticed is that when I have a search term with up to two
words in it and both of them are misspelled, I get a collation query as
a suggestion in the spellchecker output. If I increase the search term
length to three words and spell all of them incorrectly, then I do not get a
collation query in the spellchecker suggestions.

Is there a setting in the solrconfig.xml file that controls this behavior by
restricting collation suggestions to search terms of up to two misspelt
words? If so, I would need to change that property.

Can anyone please let me know how to do so?

Thanks.

Sent from my mobile.
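For what it's worth, there is no parameter that caps collations at two misspelled words; with several misspelled terms the limiting factor is usually how many candidate combinations Solr is allowed to try. A sketch of the relevant spellcheck request parameters (the handler name and values are illustrative; the parameter names are standard Solr):

```
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <str name="spellcheck.collate">true</str>
    <!-- try more candidate combinations before giving up on a collation -->
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">3</str>
    <!-- more raw suggestions per term gives the collator more to work with -->
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```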


Re: Is it possible for solr to calculate and give back the price of a product based on its sub-products

2014-06-08 Thread S.L
I am not sure if that is doable; I think it needs to be taken care of at
indexing time.


On Sun, Jun 8, 2014 at 4:55 PM, Gharbi Mohamed 
gharbi.mohamed.e...@gmail.com wrote:

 Hi,

 I am using Solr for searching magento products in my project,
 I want to know, is it possible for solr to calculate and give back the
 price
 of a product based on its sub-products(items);

 For instance, i have a product P1 and it is the parent of items m1, m2.
 i need to get the minimal price of items and return it as a price of
 product
 P1.

 I'm wondering if that is possible ?
 I need to know if solr can do that or if there is a feature or a way to do
 it ?
 And finally i thank you!

 regards,
 Mohamed.




Re: Strange Behavior with Solr in Tomcat.

2014-06-07 Thread S.L
Thanks, Meraj, that was exactly the issue. Setting
<useColdSearcher>true</useColdSearcher> worked like a charm and the server
starts up as usual.

Thanks again!


On Fri, Jun 6, 2014 at 2:42 PM, Meraj A. Khan mera...@gmail.com wrote:

 This looks distinctly related to
 https://issues.apache.org/jira/browse/SOLR-4408 , try coldSearcher = true
 as being suggested in JIRA and let us know .


 On Fri, Jun 6, 2014 at 2:39 PM, Jean-Sebastien Vachon 
 jean-sebastien.vac...@wantedanalytics.com wrote:

  I would try a thread dump and check the output to see what's going on.
  You could also strace the process if you're running on Unix, or change the
  log level in Solr to get more information logged.
 
   -Original Message-
   From: S.L [mailto:simpleliving...@gmail.com]
   Sent: June-06-14 2:33 PM
   To: solr-user@lucene.apache.org
   Subject: Re: Strange Behavior with Solr in Tomcat.
  
   Anyone folks?
  
  
   On Wed, Jun 4, 2014 at 10:25 AM, S.L simpleliving...@gmail.com
 wrote:
  
 Hi Folks,
   
I recently started using the spellchecker in my solrconfig.xml. I am
able to build up an index in Solr.
   
But,if I ever shutdown tomcat I am not able to restart it.The server
never spits out the server startup time in seconds in the logs,nor
does it print any error messages in the catalina.out file.
   
The only way for me to get around this is by delete the data
 directory
of the index and then start the server,obviously this makes me loose
 my
   index.
   
Just wondering if anyone faced a similar issue and if they were able
to solve this.
   
Thanks.
   
   
  
   -
   Aucun virus trouvé dans ce message.
   Analyse effectuée par AVG - www.avg.fr
   Version: 2014.0.4570 / Base de données virale: 3950/7571 - Date:
   27/05/2014 La Base de données des virus a expiré.
 



Re: Strange Behavior with Solr in Tomcat.

2014-06-06 Thread S.L
Anyone folks?


On Wed, Jun 4, 2014 at 10:25 AM, S.L simpleliving...@gmail.com wrote:

  Hi Folks,

 I recently started using the spellchecker in my solrconfig.xml. I am able
 to build up an index in Solr.

 But, if I ever shut down Tomcat I am not able to restart it. The server never
 spits out the server startup time in seconds in the logs, nor does it print
 any error messages in the catalina.out file.

 The only way for me to get around this is by deleting the data directory of
 the index and then starting the server; obviously this makes me lose my index.

 Just wondering if anyone faced a similar issue and if they were able to
 solve this.

 Thanks.




Strange Behavior with Solr in Tomcat.

2014-06-04 Thread S.L
Hi Folks,

I recently started using the spellchecker in my solrconfig.xml. I am able to
build up an index in Solr.

But, if I ever shut down Tomcat I am not able to restart it. The server never
spits out the server startup time in seconds in the logs, nor does it print any
error messages in the catalina.out file.

The only way for me to get around this is by deleting the data directory of the
index and then starting the server; obviously this makes me lose my index.

Just wondering if anyone faced a similar issue and if they were able to solve 
this.

Thanks.



Re: Strange Behavior with Solr in Tomcat.

2014-06-04 Thread S.L
Hi,

This is not a case of accidental deletion; the only way I can restart
Tomcat is by deleting the data directory for the index that was created
earlier. This started happening after I started using spellcheckers in my
solrconfig.xml. As long as Tomcat is running it's fine.

Any help from anyone who faced a similar issues would be appreciated.

Thanks.



On Wed, Jun 4, 2014 at 11:08 AM, Aman Tandon antn.s...@gmail.com wrote:

 I guess if you try to copy the index and then kill the Tomcat process,
 it might help. If the index still needs to be deleted, you would have the
 backup. Next time, always make a backup.
 On Jun 4, 2014 7:55 PM, S.L simpleliving...@gmail.com wrote:

  Hi Folks,
 
  I recently started using the spellchecker in my solrconfig.xml. I am able
  to build up an index in Solr.
 
  But,if I ever shutdown tomcat I am not able to restart it.The server
 never
  spits out the server startup time in seconds in the logs,nor does it
 print
  any error messages in the catalina.out file.
 
  The only way for me to get around this is by delete the data directory of
  the index and then start the server,obviously this makes me loose my
 index.
 
  Just wondering if anyone faced a similar issue and if they were able to
  solve this.
 
  Thanks.
 
 



Re: DirectSpellChecker not returning expected suggestions.

2014-06-02 Thread S.L
Anyone ?


On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com wrote:

 Hi All,

 I have a small test index of 400 documents; it happens to have an entry
 for wrangler. When I search for wranglr, I correctly get the collation
 suggestion wrangler; however, when I search for wrangle, I do not
 get a suggestion for wrangler.

 The Levenshtein distance between wrangle -- wrangler is the same as the
 Levenshtein distance between wranglr -- wrangler, so I am just wondering why
 I do not get a suggestion for wrangle.

 Below is my Direct spell checker configuration.

 <lst name="spellchecker">
   <str name="name">direct</str>
   <str name="field">suggestAggregate</str>
   <str name="classname">solr.DirectSolrSpellChecker</str>
   <!-- the spellcheck distance measure used; the default is the
        internal levenshtein -->
   <str name="distanceMeasure">internal</str>
   <str name="comparatorClass">score</str>

   <!-- minimum accuracy needed to be considered a valid spellcheck
        suggestion -->
   <float name="accuracy">0.7</float>
   <!-- the maximum #edits we consider when enumerating terms: can be 1
        or 2 -->
   <int name="maxEdits">1</int>
   <!-- the minimum shared prefix when enumerating terms -->
   <int name="minPrefix">3</int>
   <!-- maximum number of inspections per result -->
   <int name="maxInspections">5</int>
   <!-- minimum length of a query term to be considered for correction -->
   <int name="minQueryLength">4</int>
   <!-- maximum threshold of documents a query term can appear in to be
        considered for correction -->
   <float name="maxQueryFrequency">0.01</float>
   <!-- uncomment this to require suggestions to occur in 1% of the
        documents -->
   <!--
   <float name="thresholdTokenFrequency">.01</float>
   -->
 </lst>



Re: DirectSpellChecker not returning expected suggestions.

2014-06-02 Thread S.L
I do not get any suggestion (when I search for wrangle); however, I
correctly get the suggestion wrangler when I search for wranglr. I am
using the Direct and WordBreak spellcheckers in combination; I have not
tried using anything else.

Is Solr's distance calculation different from the Levenshtein distance
calculation? I have set maxEdits to 1, assuming that this corresponds to
the maximum distance.

Thanks for your help!


On Mon, Jun 2, 2014 at 1:54 PM, david.w.smi...@gmail.com 
david.w.smi...@gmail.com wrote:

 What do you get then?  Suggestions, but not the one you’re looking for, or
 is it deemed correctly spelled?

 Have you tried another spellChecker impl, for troubleshooting purposes?

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley


 On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com wrote:

  Hi All,
 
  I have a small test index of 400 documents , it happens to have an entry
  for  wrangler, When I search for wranglr, I correctly get the
 collation
  suggestion as wrangler, however when I search for wrangle , I do not
  get a suggestion for wrangler.
 
  The Levenstien distance between wrangle -- wrangler is same as the
  Levestien distance between wranglr--wrangler , I am just wondering why I
  do not get a suggestion for wrangle.
 
  Below is my Direct spell checker configuration.
 
  lst name=spellchecker
str name=namedirect/str
str name=fieldsuggestAggregate/str
str name=classnamesolr.DirectSolrSpellChecker/str
!-- the spellcheck distance measure used, the default is the
  internal levenshtein --
str name=distanceMeasureinternal/str
str name=comparatorClassscore/str
 
!-- minimum accuracy needed to be considered a valid spellcheck
  suggestion --
float name=accuracy0.7/float
!-- the maximum #edits we consider when enumerating terms: can be
 1
  or 2 --
int name=maxEdits1/int
!-- the minimum shared prefix when enumerating terms --
int name=minPrefix3/int
!-- maximum number of inspections per result. --
int name=maxInspections5/int
!-- minimum length of a query term to be considered for correction
  --
int name=minQueryLength4/int
!-- maximum threshold of documents a query term can appear to be
  considered for correction --
float name=maxQueryFrequency0.01/float
!-- uncomment this to require suggestions to occur in 1% of the
  documents --
!--
float name=thresholdTokenFrequency.01/float
--
  /lst
 



Re: DirectSpellChecker not returning expected suggestions.

2014-06-02 Thread S.L
OK, I just realized that wrangle is a proper English word; probably that's
why I don't get a suggestion for wrangler in this case. However, in my
test index there is no wrangle present, so even though this is a proper
English word, since there is no occurrence of it in the index shouldn't
Solr suggest wrangler?


On Mon, Jun 2, 2014 at 2:00 PM, S.L simpleliving...@gmail.com wrote:

 I do not get any suggestion (when I search for wrangle) , however I
 correctly get the suggestion wrangler when I search for wranglr , I am
 using the Direct and WordBreak spellcheckers in combination, I have not
 tried using anything else.

 Is the distance calculation of Solr different than what Levestien distance
 calculation ? I have set maxEdits to 1 , assuming that this corresponds to
 the maxDistance.

 Thanks for your help!


 On Mon, Jun 2, 2014 at 1:54 PM, david.w.smi...@gmail.com 
 david.w.smi...@gmail.com wrote:

 What do you get then?  Suggestions, but not the one you’re looking for, or
 is it deemed correctly spelled?

 Have you tried another spellChecker impl, for troubleshooting purposes?

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley


 On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com wrote:

  Hi All,
 
  I have a small test index of 400 documents , it happens to have an entry
  for  wrangler, When I search for wranglr, I correctly get the
 collation
  suggestion as wrangler, however when I search for wrangle , I do not
  get a suggestion for wrangler.
 
  The Levenstien distance between wrangle -- wrangler is same as the
  Levestien distance between wranglr--wrangler , I am just wondering why
 I
  do not get a suggestion for wrangle.
 
  Below is my Direct spell checker configuration.
 
  lst name=spellchecker
str name=namedirect/str
str name=fieldsuggestAggregate/str
str name=classnamesolr.DirectSolrSpellChecker/str
!-- the spellcheck distance measure used, the default is the
  internal levenshtein --
str name=distanceMeasureinternal/str
str name=comparatorClassscore/str
 
!-- minimum accuracy needed to be considered a valid spellcheck
  suggestion --
float name=accuracy0.7/float
!-- the maximum #edits we consider when enumerating terms: can
 be 1
  or 2 --
int name=maxEdits1/int
!-- the minimum shared prefix when enumerating terms --
int name=minPrefix3/int
!-- maximum number of inspections per result. --
int name=maxInspections5/int
!-- minimum length of a query term to be considered for
 correction
  --
int name=minQueryLength4/int
!-- maximum threshold of documents a query term can appear to be
  considered for correction --
float name=maxQueryFrequency0.01/float
!-- uncomment this to require suggestions to occur in 1% of the
  documents --
!--
float name=thresholdTokenFrequency.01/float
--
  /lst
 





Re: DirectSpellChecker not returning expected suggestions.

2014-06-02 Thread S.L
Thanks. You mean wrangler has been stemmed to wrangle? If that's the
case, then why does it not return any results for wrangle?


On Mon, Jun 2, 2014 at 2:07 PM, david.w.smi...@gmail.com 
david.w.smi...@gmail.com wrote:

 It appears to be stemmed.

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley


 On Mon, Jun 2, 2014 at 2:06 PM, S.L simpleliving...@gmail.com wrote:

  OK, I just realized that wrangle is a proper english word, probably
 thats
  why I dont get a suggestion for wrangler in this case. How ever in my
  test index there is no wrangle present , so even though this is a
 proper
  english word , since there is no occurence of it in the index should'nt
  Solr suggest me wrangler ?
 
 
  On Mon, Jun 2, 2014 at 2:00 PM, S.L simpleliving...@gmail.com wrote:
 
   I do not get any suggestion (when I search for wrangle) , however I
   correctly get the suggestion wrangler when I search for wranglr , I am
   using the Direct and WordBreak spellcheckers in combination, I have not
   tried using anything else.
  
   Is the distance calculation of Solr different than what Levestien
  distance
   calculation ? I have set maxEdits to 1 , assuming that this corresponds
  to
   the maxDistance.
  
   Thanks for your help!
  
  
   On Mon, Jun 2, 2014 at 1:54 PM, david.w.smi...@gmail.com 
   david.w.smi...@gmail.com wrote:
  
   What do you get then?  Suggestions, but not the one you’re looking
 for,
  or
   is it deemed correctly spelled?
  
   Have you tried another spellChecker impl, for troubleshooting
 purposes?
  
   ~ David Smiley
   Freelance Apache Lucene/Solr Search Consultant/Developer
   http://www.linkedin.com/in/davidwsmiley
  
  
   On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com
  wrote:
  
Hi All,
   
I have a small test index of 400 documents , it happens to have an
  entry
for  wrangler, When I search for wranglr, I correctly get the
   collation
suggestion as wrangler, however when I search for wrangle , I do
  not
get a suggestion for wrangler.
   
The Levenstien distance between wrangle -- wrangler is same as the
Levestien distance between wranglr--wrangler , I am just wondering
  why
   I
do not get a suggestion for wrangle.
   
Below is my Direct spell checker configuration.
   
lst name=spellchecker
  str name=namedirect/str
  str name=fieldsuggestAggregate/str
  str name=classnamesolr.DirectSolrSpellChecker/str
  !-- the spellcheck distance measure used, the default is the
internal levenshtein --
  str name=distanceMeasureinternal/str
  str name=comparatorClassscore/str
   
  !-- minimum accuracy needed to be considered a valid
 spellcheck
suggestion --
  float name=accuracy0.7/float
  !-- the maximum #edits we consider when enumerating terms:
 can
   be 1
or 2 --
  int name=maxEdits1/int
  !-- the minimum shared prefix when enumerating terms --
  int name=minPrefix3/int
  !-- maximum number of inspections per result. --
  int name=maxInspections5/int
  !-- minimum length of a query term to be considered for
   correction
--
  int name=minQueryLength4/int
  !-- maximum threshold of documents a query term can appear to
  be
considered for correction --
  float name=maxQueryFrequency0.01/float
  !-- uncomment this to require suggestions to occur in 1% of
 the
documents --
  !--
  float name=thresholdTokenFrequency.01/float
  --
/lst
   
  
  
  
 



Re: DirectSpellChecker not returning expected suggestions.

2014-06-02 Thread S.L
James,

I get no results back and no suggestions for wrangle; however, I get
suggestions for wranglr, and wrangle is not present in my index.

I am just searching for wrangle in a field that is created by copying
other fields; as to how it is analyzed, I don't have access to it right now.

Thanks.


On Mon, Jun 2, 2014 at 2:48 PM, Dyer, James james.d...@ingramcontent.com
wrote:

 If wrangle is not in your index, and if it is within the max # of edits,
 then it should suggest it.

 Are you getting anything back from spellcheck at all?  What is the exact
 query you are using?  How is the spellcheck field analyzed?  If you're
 using stemming, then wrangle and wrangler might be stemmed to the same
 word. (by the way, you shouldn't spellcheck against a stemmed or otherwise
 heavily-analyzed field).

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, June 02, 2014 1:06 PM
 To: solr-user@lucene.apache.org
 Subject: Re: DirectSpellChecker not returning expected suggestions.

OK, I just realized that wrangle is a proper English word; probably that's
why I don't get a suggestion for wrangler in this case. However, in my
test index there is no wrangle present, so even though this is a proper
English word, since there is no occurrence of it in the index shouldn't
Solr suggest wrangler?


 On Mon, Jun 2, 2014 at 2:00 PM, S.L simpleliving...@gmail.com wrote:

  I do not get any suggestion (when I search for wrangle); however, I
  correctly get the suggestion wrangler when I search for wranglr. I am
  using the Direct and WordBreak spellcheckers in combination; I have not
  tried using anything else.

  Is Solr's distance calculation different from the Levenshtein distance
  calculation? I have set maxEdits to 1, assuming that this corresponds to
  the maxDistance.
 
  Thanks for your help!
 
 
  On Mon, Jun 2, 2014 at 1:54 PM, david.w.smi...@gmail.com 
  david.w.smi...@gmail.com wrote:
 
  What do you get then?  Suggestions, but not the one you’re looking for,
 or
  is it deemed correctly spelled?
 
  Have you tried another spellChecker impl, for troubleshooting purposes?
 
  ~ David Smiley
  Freelance Apache Lucene/Solr Search Consultant/Developer
  http://www.linkedin.com/in/davidwsmiley
 
 
  On Sat, May 31, 2014 at 12:33 AM, S.L simpleliving...@gmail.com
 wrote:
 
   Hi All,
  
    I have a small test index of 400 documents; it happens to have an entry
    for wrangler. When I search for wranglr, I correctly get the collation
    suggestion wrangler; however, when I search for wrangle, I do not
    get a suggestion for wrangler.

    The Levenshtein distance between wrangle -> wrangler is the same as the
    Levenshtein distance between wranglr -> wrangler; I am just wondering why
    I do not get a suggestion for wrangle.
  
   Below is my Direct spell checker configuration.
  
    <lst name="spellchecker">
      <str name="name">direct</str>
      <str name="field">suggestAggregate</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <!-- the spellcheck distance measure used, the default is the
           internal levenshtein -->
      <str name="distanceMeasure">internal</str>
      <str name="comparatorClass">score</str>

      <!-- minimum accuracy needed to be considered a valid spellcheck
           suggestion -->
      <float name="accuracy">0.7</float>
      <!-- the maximum #edits we consider when enumerating terms: can be 1
           or 2 -->
      <int name="maxEdits">1</int>
      <!-- the minimum shared prefix when enumerating terms -->
      <int name="minPrefix">3</int>
      <!-- maximum number of inspections per result -->
      <int name="maxInspections">5</int>
      <!-- minimum length of a query term to be considered for correction -->
      <int name="minQueryLength">4</int>
      <!-- maximum threshold of documents a query term can appear in to be
           considered for correction -->
      <float name="maxQueryFrequency">0.01</float>
      <!-- uncomment this to require suggestions to occur in 1% of the
           documents -->
      <!--
      <float name="thresholdTokenFrequency">.01</float>
      -->
    </lst>
  
 
 
 



Re: Wordbreak spellchecker excessive breaking.

2014-05-30 Thread S.L
 --

    <!-- Result Window Size

         An optimization for use with the queryResultCache.  When a search
         is requested, a superset of the requested number of document ids
         are collected.  For example, if a search for a particular query
         requests matching documents 10 through 19, and queryWindowSize is 50,
         then documents 0 through 49 will be collected and cached.  Any further
         requests in that range can be satisfied via the cache.
      -->
    <queryResultWindowSize>20</queryResultWindowSize>

    <!-- Maximum number of documents to cache for any entry in the
         queryResultCache.
      -->
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

    <!-- Query Related Event Listeners

         Various IndexSearcher related events can trigger Listeners to
         take actions.

         newSearcher - fired whenever a new searcher is being prepared
         and there is a current searcher handling requests (aka
         registered).  It can be used to prime certain caches to
         prevent long request times for certain requests.

         firstSearcher - fired whenever a new searcher is being
         prepared but there is no current registered searcher to handle
         requests or to gain autowarming data from.
      -->
    <!-- QuerySenderListener takes an array of NamedList and executes a
         local query request for each NamedList in sequence.
      -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <!--
        <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
        <lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>
        -->
      </arr>
    </listener>
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">static firstSearcher warming in solrconfig.xml</str>
        </lst>
      </arr>
    </listener>

    <!-- Use Cold Searcher

         If a search request comes in and there is no current
         registered searcher, then immediately register the still
         warming searcher and use it.  If false then all requests
         will block until the first searcher is done warming.
      -->
    <useColdSearcher>false</useColdSearcher>

    <!-- Max Warming Searchers

         Maximum number of searchers that may be warming in the
         background concurrently.  An error is returned if this limit
         is exceeded.

         Recommend values of 1-2 for read-only slaves, higher for
         masters w/o cache warming.
      -->
    <maxWarmingSearchers>2</maxWarmingSearchers>

  </query>



On Fri, May 30, 2014 at 10:20 AM, Dyer, James james.d...@ingramcontent.com
wrote:

 I am not sure why changing spellcheck parameters would prevent your server
 from restarting.  One thing to check is to see if you have warming queries
 running that involve spellcheck.  I think I remember from long ago there
 was (maybe still is) an obscure bug where sometimes it will lock up in rare
 cases when spellcheck is used in warming queries.  I do not remember
 exactly what caused this or if it was ever fixed.

 Besides that, you might want to post a stack trace or describe what
 happens when it doesn't restart.  Perhaps someone here will know what the
 problem is.

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Friday, May 30, 2014 12:36 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Wordbreak spellchecker excessive breaking.

 James,

 Thanks for clearly stating this; I was not able to find this documented
 anywhere. Yes, I am using it with another spellchecker (Direct) with
 collation on. I will try maxChanges and let you know.

 On a side note, whenever I change a spellchecker parameter I need to
 rebuild the index, and delete the Solr data directory before that, as my
 Tomcat instance would not even start. Can you let me know why?

 Thanks.




 On Tue, May 27, 2014 at 12:21 PM, Dyer, James 
 james.d...@ingramcontent.com
 wrote:

  You can do this if you set it up like in the main Solr example:
 
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">name</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
 
  The combineWords and breakWords flags let you tell it which kind of
  wordbreak correction you want.  maxChanges controls the maximum number of
  words it can break 1 word into, or the maximum number of words it can
  combine.  It is reasonable to set this to 1 or 2.

  The best way to use this is in conjunction with a regular spellchecker
  like DirectSolrSpellChecker.  When used together with the collation
  functionality, it should take a query like mob ile and, depending on what
  actually returns results from your data, suggest either mobile or perhaps
  mob lie or both.  The one thing it cannot do is fix

DirectSpellChecker not returning expected suggestions.

2014-05-30 Thread S.L
Hi All,

I have a small test index of 400 documents; it happens to have an entry
for wrangler. When I search for wranglr, I correctly get the collation
suggestion wrangler; however, when I search for wrangle, I do not
get a suggestion for wrangler.

The Levenshtein distance between wrangle -> wrangler is the same as the
Levenshtein distance between wranglr -> wrangler; I am just wondering why I
do not get a suggestion for wrangle.

Below is my Direct spell checker configuration.

<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="field">suggestAggregate</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the
       internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <str name="comparatorClass">score</str>

  <!-- minimum accuracy needed to be considered a valid spellcheck
       suggestion -->
  <float name="accuracy">0.7</float>
  <!-- the maximum #edits we consider when enumerating terms: can be 1
       or 2 -->
  <int name="maxEdits">1</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">3</int>
  <!-- maximum number of inspections per result -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">4</int>
  <!-- maximum threshold of documents a query term can appear in to be
       considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- uncomment this to require suggestions to occur in 1% of the
       documents -->
  <!--
  <float name="thresholdTokenFrequency">.01</float>
  -->
</lst>
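The equal-distance claim above can be sanity-checked with a quick sketch (plain dynamic-programming Levenshtein; the class name is illustrative, and note that Solr's DirectSolrSpellChecker turns the raw edit distance into a length-normalized accuracy score, so equal raw distances do not guarantee identical treatment):

```java
// Check that "wranglr" and "wrangle" are both one edit away from "wrangler".
public class LevenshteinCheck {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;     // deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;     // insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // delete
                                            d[i][j - 1] + 1),  // insert
                                   d[i - 1][j - 1] + cost);    // substitute
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("wranglr", "wrangler")); // 1
        System.out.println(distance("wrangle", "wrangler")); // 1
    }
}
```

So with maxEdits=1 both misspellings are within range, and any difference in behavior must come from the other filters or scoring steps in the configuration, not from the raw distance.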


Re: Wordbreak spellchecker excessive breaking.

2014-05-29 Thread S.L
James,

Thanks for clearly stating this; I was not able to find this documented
anywhere. Yes, I am using it with another spellchecker (Direct) with
collation on. I will try maxChanges and let you know.

On a side note, whenever I change a spellchecker parameter I need to
rebuild the index, and delete the Solr data directory before that, as my
Tomcat instance would not even start. Can you let me know why?

Thanks.




On Tue, May 27, 2014 at 12:21 PM, Dyer, James james.d...@ingramcontent.com
wrote:

 You can do this if you set it up like in the main Solr example:

 <lst name="spellchecker">
   <str name="name">wordbreak</str>
   <str name="classname">solr.WordBreakSolrSpellChecker</str>
   <str name="field">name</str>
   <str name="combineWords">true</str>
   <str name="breakWords">true</str>
   <int name="maxChanges">10</int>
 </lst>

 The combineWords and breakWords flags let you tell it which kind of
 wordbreak correction you want.  maxChanges controls the maximum number of
 words it can break 1 word into, or the maximum number of words it can
 combine.  It is reasonable to set this to 1 or 2.

 The best way to use this is in conjunction with a regular spellchecker
 like DirectSolrSpellChecker.  When used together with the collation
 functionality, it should take a query like mob ile and, depending on what
 actually returns results from your data, suggest either mobile or perhaps
 mob lie or both.  The one thing it cannot do is fix a transposition or
 misspelling and combine or break words in one shot.  That is, it cannot
 detect that mob lie should become mobile.

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Saturday, May 24, 2014 4:21 PM
 To: solr-user@lucene.apache.org
 Subject: Wordbreak spellchecker excessive breaking.

 I am using the Solr wordbreak spellchecker, and the issue is that when I
 search for a term like mob ile, expecting that the wordbreak spellchecker
 would actually return a suggestion for mobile, it breaks the search term
 into letters like m o b.  I have two issues with this behavior.

  1. How can I make Solr combine mob ile to mobile?
  2. Notwithstanding the fact that my search term mob ile is being broken
 incorrectly into individual letters, I realize that the wordbreak is
 needed in certain cases; how do I control the wordbreak so that it does not
 break terms into letters like m o b, which seems like excessive breaking to
 me?

 Thanks.
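Putting the advice in this thread together: a minimal sketch of a searchComponent that chains both checkers, assuming the field name suggestAggregate from the earlier messages and a text_general query analyzer (both assumptions, not confirmed config):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_general</str>

  <!-- ordinary typo corrections (wranglr -> wrangler) -->
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="field">suggestAggregate</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>

  <!-- split/combine corrections (mob ile -> mobile); a low maxChanges
       limits the excessive "m o b" style breaking discussed above -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">suggestAggregate</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">2</int>
  </lst>
</searchComponent>
```

Both checkers are then selected at query time by repeating the parameter: spellcheck.dictionary=direct&spellcheck.dictionary=wordbreak.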



Re: Wordbreak spellchecker excessive breaking.

2014-05-26 Thread S.L
Anyone ?


On Sat, May 24, 2014 at 5:21 PM, S.L simpleliving...@gmail.com wrote:


 I am using the Solr wordbreak spellchecker, and the issue is that when I
 search for a term like mob ile, expecting that the wordbreak spellchecker
 would actually return a suggestion for mobile, it breaks the search term
 into letters like m o b.  I have two issues with this behavior.

  1. How can I make Solr combine mob ile to mobile?
  2. Notwithstanding the fact that my search term mob ile is being broken
 incorrectly into individual letters, I realize that the wordbreak is
 needed in certain cases; how do I control the wordbreak so that it does
 not break terms into letters like m o b, which seems like excessive
 breaking to me?

 Thanks.




Wordbreak spellchecker excessive breaking.

2014-05-24 Thread S.L
I am using the Solr wordbreak spellchecker, and the issue is that when I
search for a term like mob ile, expecting that the wordbreak spellchecker
would actually return a suggestion for mobile, it breaks the search term
into letters like m o b.  I have two issues with this behavior.

 1. How can I make Solr combine mob ile to mobile?
 2. Notwithstanding the fact that my search term mob ile is being broken
incorrectly into individual letters, I realize that the wordbreak is
needed in certain cases; how do I control the wordbreak so that it does not
break terms into letters like m o b, which seems like excessive breaking to
me?

Thanks.


Apache Solr SpellChecker Integration with the default select request handler

2014-04-12 Thread S.L
Hello fellow Solr users,

I am using the default select request handler to search a Solr core; I
also use the eDismax query parser.

   1. I want to integrate this with the spellchecker search component, so that
      if a search request comes in, the spellchecker component also gets called
      and I get a suggestion back with the search results.

   2. If the suggestion is above a certain threshold, then I want the search to
      be made on that suggestion; otherwise the suggestion should come back along
      with the search results for the original search term.

In order to accomplish this, it seems I need to extend the
SearchHandler.java class to call the spellchecker internally and then make
a search call if the spellchecker returns a suggestion above a certain
threshold.

I would really appreciate it if there are any examples of calling the
SpellChecker component via the API in Solr that someone can share with me,
and also if you could validate my approach. Thank you.
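Rather than modifying SearchHandler.java, the usual way to get suggestions back alongside normal results is to attach the spellcheck component to the /select handler as a last-component; a sketch, assuming a searchComponent named spellcheck is already defined in solrconfig.xml:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <!-- run spellchecking after the main query on every request -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

The second requirement (automatically re-running the search when a suggestion clears a threshold) is not something the stock handler does; the client has to inspect the spellcheck section of the response and re-issue the query itself.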


Re: Apache Solr SpellChecker Integration with the default select request handler

2014-04-12 Thread S.L
Yes, I use SolrJ, but only to index the data; the querying of the data
happens using the default select query handler from a non-Java client.


On Sat, Apr 12, 2014 at 12:12 PM, Furkan KAMACI furkankam...@gmail.comwrote:

 Hi;

  Do you use SolrJ in your application? Why did you not consider solving
  this with SolrJ?

 Thanks;
 Furkan KAMACI


 2014-04-12 18:34 GMT+03:00 S.L simpleliving...@gmail.com:

  Hello fellow Solr users,
 
   I am using the default select request handler to search a Solr core; I
   also use the eDismax query parser.
 
 1.
 
 I want to integrate this with the spellchecker search component so
 that
 if a search request comes in the spellchecker component also gets
 called
 and I get a suggestion back with search results.
 2.
 
 If the suggestion is above a certain threshold then I want the search
 to
 be made on that suggestion , otherwise the suggestion should comeback
  along
 with the search results for the original search term.
 
  In order to accomplish this it seems I need to integrate the
  SearchHandler.java class to call the spellchecker internally and then
 make
  a search call if the suggestion from the spellchecker has a suggestion
 that
  is above a certain threshold.
 
  I would really appreciate if there any examples of calling the
 SpellChecker
  component via the API in Solr that someone can share with me and also if
  you could validate my approach. Thank You.
 



Re: Apache Solr SpellChecker Integration with the default select request handler

2014-04-12 Thread S.L
Furkan,

I am not sure how this could be a security concern; what I am actually
asking for is an approach to integrating the spellchecker search component
within the default request handler.

Thanks.
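For reference, the kind of request a non-Java client ends up sending once the spellcheck component is attached to the default handler can be sketched as plain URL construction (the core name mycore and the handler path are assumptions, not from this thread):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Build the query string a client would send to a /select handler that has
// the spellcheck component wired in as a last-component.
public class SpellcheckRequest {
    static String buildQuery(String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("defType", "edismax");
        params.put("spellcheck", "true");          // enable the component
        params.put("spellcheck.collate", "true");  // ask for a re-queryable suggestion
        params.put("wt", "json");
        StringBuilder sb = new StringBuilder("/solr/mycore/select?");
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.charAt(sb.length() - 1) != '?') sb.append('&');
            sb.append(e.getKey()).append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("wranglr"));
    }
}
```

The response then carries both the normal result list and a spellcheck section the client can inspect.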


On Sat, Apr 12, 2014 at 5:38 PM, Furkan KAMACI furkankam...@gmail.comwrote:

 Hi;

 I do not want to change the direction of your question, but it is really
 good, secure, and flexible to do this kind of thing at your client (a Java
 client or not). On the other hand, if you let people access your Solr
 instance directly, it causes some security issues.

 Thanks;
 Furkan KAMACI


 2014-04-12 19:26 GMT+03:00 S.L simpleliving...@gmail.com:

   Yes, I use SolrJ, but only to index the data; the querying of the data
   happens using the default select query handler from a non-Java client.
 
 
  On Sat, Apr 12, 2014 at 12:12 PM, Furkan KAMACI furkankam...@gmail.com
  wrote:
 
   Hi;
  
   Do you use Solrj at your application? Why you did not consider to use
 to
   solve this with Solrj?
  
   Thanks;
   Furkan KAMACI
  
  
   2014-04-12 18:34 GMT+03:00 S.L simpleliving...@gmail.com:
  
Hello fellow Solr users,
   
I am using the default select request handler to search a Solr core
 , I
also use the eDismaxquery parser.
   
   1.
   
   I want to integrate this with the spellchecker search component so
   that
   if a search request comes in the spellchecker component also gets
   called
   and I get a suggestion back with search results.
   2.
   
   If the suggestion is above a certain threshold then I want the
  search
   to
   be made on that suggestion , otherwise the suggestion should
  comeback
along
   with the search results for the original search term.
   
In order to accomplish this it seems I need to integrate the
SearchHandler.java class to call the spellchecker internally and then
   make
a search call if the suggestion from the spellchecker has a
 suggestion
   that
is above a certain threshold.
   
I would really appreciate if there any examples of calling the
   SpellChecker
component via the API in Solr that someone can share with me and also
  if
you could validate my approach. Thank You.
   
  
 



Combining eDismax and SpellChecker

2014-04-05 Thread S.L
Hi All,

I want to suggest the correct phrase if a typo is made while searching, and
then search it using the eDismax parser (pf, pf2, pf3); if no typo is made,
then search using the eDismax parser alone.

Is there a way I can combine these two components? I have seen examples
for eDismax and also for SpellChecker, but nothing that combines the two
together.

Can you please let me know ?

Thanks.
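One way to wire the two together (a sketch; the handler name, field names, and boost values are placeholders, not taken from this thread) is a handler whose defaults combine edismax phrase boosting with collation-checked suggestions:

```xml
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name description</str>
    <!-- phrase boosts: whole phrase > trigram shingles > bigram shingles -->
    <str name="pf">name^50</str>
    <str name="pf3">name^20</str>
    <str name="pf2">name^10</str>
    <str name="spellcheck">true</str>
    <!-- only return collations that would actually yield hits -->
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollationTries">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

If a typo is made, the collation comes back alongside the (possibly empty) edismax results, and the client decides whether to re-query with it.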


Re: eDismax parser and the mm parameter

2014-04-03 Thread S.L
Ahmet,

SpellChecker seems to be exactly what I need for fuzzy-type search. How can
I combine SpellChecker with something like the eDismax parser to make use
of parameters like pf, pf2, and pf3? Is there any resource you can point me
to for that?

Thanks.


On Wed, Apr 2, 2014 at 9:12 PM, S.L simpleliving...@gmail.com wrote:

 Thanks Ahmet, I would definitely look into this . I appreciate that.


 On Wed, Apr 2, 2014 at 7:47 PM, Ahmet Arslan iori...@yahoo.com wrote:

 Yes, it has a spellcheck.collate parameter. I mean, it has lots of
 parameters, and with the correct combination of parameters
 it can suggest White Siberian Ginseng from Whte Sberia Ginsng.

 https://cwiki.apache.org/confluence/display/solr/Spell+Checking




 On Thursday, April 3, 2014 1:57 AM, simpleliving...@gmail.com 
 simpleliving...@gmail.com wrote:
 Ahmet.

 Thanks I will look into this option . Does spellchecker support multiple
 word search terms?

 Sent from my HTC

 - Reply message -
 From: Ahmet Arslan iori...@yahoo.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Subject: eDismax parser and the mm parameter
 Date: Wed, Apr 2, 2014 10:53 AM

 Hi SL,

 Instead of fuzzy queries, can't you use spell checker? Generally Spell
 Checker (a.k.a did you mean) is a preferred tool for typos.

 Ahmet

 On Wednesday, April 2, 2014 4:13 PM, simpleliving...@gmail.com 
 simpleliving...@gmail.com wrote:

 It only works for a single-word search term and not a multi-word search
 term.

 Sent from my HTC

 - Reply message -
 From: William Bell billnb...@gmail.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Subject: eDismax parser and the mm parameter
 Date: Wed, Apr 2, 2014 12:03 AM

 Fuzzy is provided; use ~


 On Mon, Mar 31, 2014 at 11:04 PM, S.L simpleliving...@gmail.com wrote:

  Jack ,
 
  Thanks a lot , I am now using the pf ,pf2 an pf3  and have gotten rid of
  the mm parameter from my queries, however for the fuzzy phrase queries
 , I
  am not sure how I would be able to leverage the Complex Query Parser
 there
  is absolutely nothing out there that gives me any idea as to how to do
 that
  .
 
  Why is fuzzy phrase search not provided by Solr OOB ? I am surprised
 
  Thanks.
 
 
  On Mon, Mar 31, 2014 at 5:39 AM, Jack Krupansky 
 j...@basetechnology.com
  wrote:
 
   The pf, pf2, and pf3 parameters should cover cases 1 and 2. Use
 q.op=OR
   (the default) and ignore the mm parameter. Give pf the highest boost,
 and
   boost pf3 higher than pf2.
  
   You could try using the complex phrase query parser for the third
 case.
  
   -- Jack Krupansky
  
   -Original Message- From: S.L
   Sent: Monday, March 31, 2014 12:08 AM
   To: solr-user@lucene.apache.org
   Subject: Re: eDismax parser and the mm parameter
  
   Thanks Jack , my use cases are as follows.
  
  
 1. Search for Ginseng everything related to ginseng should show
 up.
 2. Search For White Siberian Ginseng results with the whole phrase
 show up first followed by 2 words from the phrase followed by a
 single
   word
 in the phrase
 3. Fuzzy Search Whte Sberia Ginsng (please note the typos here)
 documents with White Siberian Ginseng Should show up , this looks
 like
   the
 most complicated of all as Solr does not support fuzzy phrase
 searches
  .
   (I
 have no solution for this yet).
  
   Thanks again!
  
  
   On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky 
  j...@basetechnology.com
   wrote:
  
The mm parameter is really only relevant when the default operator
 is OR
   or explicit OR operators are used.
  
   Again: Please provide your use case examples and your expectations
 for
   each use case. It really doesn't make a lot of sense to prematurely
  focus
   on a solution when you haven't clearly defined your use cases.
  
   -- Jack Krupansky
  
   -Original Message- From: S.L
   Sent: Sunday, March 30, 2014 9:13 PM
   To: solr-user@lucene.apache.org
   Subject: Re: eDismax parser and the mm parameter
  
   Jack,
  
   I mis-stated the problem , I am not using the OR operator as default
   now(now that I think about it it does not make sense to use the
 default
   operator OR along with the mm parameter) , the reason I want to use
 pf
  and
   mm in conjunction is because of my understanding of the edismax
 parser
  and
   I have not looked into pf2 and pf3 parameters yet.
  
   I will state my understanding here below.
  
   Pf -  Is used to boost the result score if the complete phrase
 matches.
   mm (less than) search term length would help limit the query results
   to
   a
   certain number of better matches.
  
   With that being said would it make sense to have dynamic mm (set to
 the
   length of search term - 1)?
  
   I also have a question around using a fuzzy search along with eDismax
   parser , but I will ask that in a seperate post once I go thru that
  aspect
   of eDismax parser.
  
   Thanks again !
  
  
  
  
  
   On Sun, Mar 30, 2014 at 6

Re: eDismax parser and the mm parameter

2014-04-02 Thread S.L
Thanks Ahmet, I would definitely look into this . I appreciate that.


On Wed, Apr 2, 2014 at 7:47 PM, Ahmet Arslan iori...@yahoo.com wrote:

 Yes, it has a spellcheck.collate parameter. I mean, it has lots of
 parameters, and with the correct combination of parameters
 it can suggest White Siberian Ginseng from Whte Sberia Ginsng.

 https://cwiki.apache.org/confluence/display/solr/Spell+Checking




 On Thursday, April 3, 2014 1:57 AM, simpleliving...@gmail.com 
 simpleliving...@gmail.com wrote:
 Ahmet.

 Thanks I will look into this option . Does spellchecker support multiple
 word search terms?

 Sent from my HTC

 - Reply message -
 From: Ahmet Arslan iori...@yahoo.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Subject: eDismax parser and the mm parameter
 Date: Wed, Apr 2, 2014 10:53 AM

 Hi SL,

 Instead of fuzzy queries, can't you use spell checker? Generally Spell
 Checker (a.k.a did you mean) is a preferred tool for typos.

 Ahmet

 On Wednesday, April 2, 2014 4:13 PM, simpleliving...@gmail.com 
 simpleliving...@gmail.com wrote:

 It only works for a single word search term and not multiple word search
 term.

 Sent from my HTC

 - Reply message -
 From: William Bell billnb...@gmail.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Subject: eDismax parser and the mm parameter
 Date: Wed, Apr 2, 2014 12:03 AM

 Fuzzy is provided use ~


 On Mon, Mar 31, 2014 at 11:04 PM, S.L simpleliving...@gmail.com wrote:

  Jack ,
 
  Thanks a lot , I am now using the pf ,pf2 an pf3  and have gotten rid of
  the mm parameter from my queries, however for the fuzzy phrase queries ,
 I
  am not sure how I would be able to leverage the Complex Query Parser
 there
  is absolutely nothing out there that gives me any idea as to how to do
 that
  .
 
  Why is fuzzy phrase search not provided by Solr OOB ? I am surprised
 
  Thanks.
 
 
  On Mon, Mar 31, 2014 at 5:39 AM, Jack Krupansky j...@basetechnology.com
  wrote:
 
   The pf, pf2, and pf3 parameters should cover cases 1 and 2. Use q.op=OR
   (the default) and ignore the mm parameter. Give pf the highest boost,
 and
   boost pf3 higher than pf2.
  
   You could try using the complex phrase query parser for the third case.
  
   -- Jack Krupansky
  
   -Original Message- From: S.L
   Sent: Monday, March 31, 2014 12:08 AM
   To: solr-user@lucene.apache.org
   Subject: Re: eDismax parser and the mm parameter
  
   Thanks Jack , my use cases are as follows.
  
  
 1. Search for Ginseng everything related to ginseng should show up.
 2. Search For White Siberian Ginseng results with the whole phrase
 show up first followed by 2 words from the phrase followed by a
 single
   word
 in the phrase
 3. Fuzzy Search Whte Sberia Ginsng (please note the typos here)
 documents with White Siberian Ginseng Should show up , this looks
 like
   the
 most complicated of all as Solr does not support fuzzy phrase
 searches
  .
   (I
 have no solution for this yet).
  
   Thanks again!
  
  
   On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky 
  j...@basetechnology.com
   wrote:
  
The mm parameter is really only relevant when the default operator is
 OR
   or explicit OR operators are used.
  
   Again: Please provide your use case examples and your expectations for
   each use case. It really doesn't make a lot of sense to prematurely
  focus
   on a solution when you haven't clearly defined your use cases.
  
   -- Jack Krupansky
  
   -Original Message- From: S.L
   Sent: Sunday, March 30, 2014 9:13 PM
   To: solr-user@lucene.apache.org
   Subject: Re: eDismax parser and the mm parameter
  
   Jack,
  
   I mis-stated the problem , I am not using the OR operator as default
   now(now that I think about it it does not make sense to use the
 default
   operator OR along with the mm parameter) , the reason I want to use pf
  and
   mm in conjunction is because of my understanding of the edismax parser
  and
   I have not looked into pf2 and pf3 parameters yet.
  
   I will state my understanding here below.
  
   Pf -  Is used to boost the result score if the complete phrase
 matches.
   mm (less than) search term length would help limit the query results
   to
   a
   certain number of better matches.
  
   With that being said would it make sense to have dynamic mm (set to
 the
   length of search term - 1)?
  
   I also have a question around using a fuzzy search along with eDismax
   parser , but I will ask that in a seperate post once I go thru that
  aspect
   of eDismax parser.
  
   Thanks again !
  
  
  
  
  
   On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky 
  j...@basetechnology.com
   wrote:
  
If you use pf, pf2, and pf3 and boost appropriately, the effects of
 mm
  
   will be dwarfed.
  
   The general goal is to assure that the top documents really are the
  best,
   not to necessarily limit the total document count. Focusing on the
  latter
   could be a real waste

Re: eDismax parser and the mm parameter

2014-03-31 Thread S.L
Jack ,

Thanks a lot. I am now using pf, pf2, and pf3 and have gotten rid of
the mm parameter from my queries; however, for the fuzzy phrase queries, I
am not sure how I would be able to leverage the complex phrase query parser;
there is absolutely nothing out there that gives me any idea as to how to
do that.

Why is fuzzy phrase search not provided by Solr OOB? I am surprised.

Thanks.


On Mon, Mar 31, 2014 at 5:39 AM, Jack Krupansky j...@basetechnology.comwrote:

 The pf, pf2, and pf3 parameters should cover cases 1 and 2. Use q.op=OR
 (the default) and ignore the mm parameter. Give pf the highest boost, and
 boost pf3 higher than pf2.

 You could try using the complex phrase query parser for the third case.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Monday, March 31, 2014 12:08 AM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Thanks Jack , my use cases are as follows.


   1. Search for Ginseng everything related to ginseng should show up.
   2. Search For White Siberian Ginseng results with the whole phrase
   show up first followed by 2 words from the phrase followed by a single
 word
   in the phrase
   3. Fuzzy Search Whte Sberia Ginsng (please note the typos here)
   documents with White Siberian Ginseng Should show up , this looks like
 the
   most complicated of all as Solr does not support fuzzy phrase searches .
 (I
   have no solution for this yet).

 Thanks again!


 On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  The mm parameter is really only relevant when the default operator is OR
 or explicit OR operators are used.

 Again: Please provide your use case examples and your expectations for
 each use case. It really doesn't make a lot of sense to prematurely focus
 on a solution when you haven't clearly defined your use cases.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 9:13 PM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Jack,

 I mis-stated the problem; I am not using the OR operator as the default
 now (now that I think about it, it does not make sense to use the default
 operator OR along with the mm parameter). The reason I want to use pf and
 mm in conjunction is my understanding of the edismax parser, and I have
 not looked into the pf2 and pf3 parameters yet.

 I will state my understanding here below.

 pf - used to boost the result score if the complete phrase matches.
 mm (set to less than the search-term length) would help limit the query
 results to a certain number of better matches.

 With that being said, would it make sense to have a dynamic mm (set to the
 length of the search term - 1)?

 I also have a question around using a fuzzy search along with eDismax
 parser , but I will ask that in a seperate post once I go thru that aspect
 of eDismax parser.

 Thanks again !





 On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  If you use pf, pf2, and pf3 and boost appropriately, the effects of mm

 will be dwarfed.

 The general goal is to assure that the top documents really are the best,
 not to necessarily limit the total document count. Focusing on the latter
 could be a real waste of time.

 It's still not clear why or how you need or want to use OR as the default
 operator - you still haven't given us a use case for that.

 To repeat: Give us a full set of use cases before taking this XY Problem
 approach of pursuing a solution before the problem is understood.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 6:14 PM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Jack, thanks again!

 I am searching Chinese medicine documents. As in the example I gave
 earlier, a user can search for Ginseng, Siberian Ginseng, or Red
 Siberian Ginseng. I certainly want to use the pf parameter (which is
 not driven by the mm parameter); however, to give a higher score to
 documents that contain more of the terms, I want to use edismax. Now,
 if I set an mm of 3 and the search term is only one word long (like
 Ginseng), what does eDismax do?


 On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky j...@basetechnology.com
 
 wrote:

  It still depends on your objective - which you haven't told us yet. Show

  us some use cases and detail what your expectations are for each use
 case.

 The edismax phrase boosting is probably a lot more useful than messing
 around with mm. Take a look at pf, pf2, and pf3.

 See:
 http://wiki.apache.org/solr/ExtendedDisMax
 https://cwiki.apache.org/confluence/display/solr/The+
 Extended+DisMax+Query+Parser

 The focus on mm may indeed be a classic XY Problem - a premature focus
 on a solution without detailing the problem.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 11:18 AM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm

eDismax parser and the mm parameter

2014-03-30 Thread S.L
Hi All,

I am planning to use the eDismax query parser in Solr to give a boost to
documents that contain a phrase in their fields. There is an mm
parameter in the edismax query; since the query typed by the user can be
of any length (i.e. >= 1), I would like to set the mm value to 1. I have
the following questions about this parameter.

   1. Is it set to 1 by default?
   2. In my schema.xml the defaultOperator is set to AND; should I set it
   to OR in order for the edismax parser to be effective with an mm of 1?


Thanks in advance!


Re: eDismax parser and the mm parameter

2014-03-30 Thread S.L
Thanks Jack! I understand the intent of the mm parameter; my question is
that since the query terms being provided are not of fixed length, I do
not know what mm should be. For example, Ginseng and Siberian Ginseng
are my search terms; the first can have an mm of up to 1 and the second
an mm of up to 2.

Should I dynamically set mm based on the number of search terms in my
query?

Thanks again.


On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky j...@basetechnology.comwrote:

 1. Yes, the default for mm is 1.

 2. It depends on what you are really trying to do - you haven't told us.

 Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
 q.op=AND.

 Generally, use q.op unless you really know what you are doing.

 Generally, the intent of mm is to set the minimum number of OR/SHOULD
 clauses that must match on the top level of a query.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 2:25 AM
 To: solr-user@lucene.apache.org
 Subject: eDismax parser and the mm parameter

 Hi All,

 I am planning to use the eDismax query parser in Solr to give a boost to
 documents that contain a phrase in their fields. There is an mm
 parameter in the edismax query; since the query typed by the user can be
 of any length (i.e. >= 1), I would like to set the mm value to 1. I have
 the following questions about this parameter.

   1. Is it set to 1 by default?
   2. In my schema.xml the defaultOperator is set to AND; should I set it
   to OR in order for the edismax parser to be effective with an mm of 1?


 Thanks in advance!
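The dynamic-mm idea raised in this thread (set mm from the length of the user's query) can be sketched as a small client-side helper that counts whitespace-separated terms before the query is sent to Solr. This is only an illustration; the helper name and the (terms - 1) policy are my own, not something from the thread:

```java
public class DynamicMm {
    /** Hypothetical helper: derive an mm value of (term count - 1), floored at 1. */
    static String mmFor(String userQuery) {
        int terms = userQuery.trim().split("\\s+").length;
        return String.valueOf(Math.max(1, terms - 1));
    }

    public static void main(String[] args) {
        System.out.println(mmFor("Ginseng"));                 // single term -> mm=1
        System.out.println(mmFor("Siberian Ginseng"));        // two terms  -> mm=1
        System.out.println(mmFor("White Siberian Ginseng"));  // three terms -> mm=2
    }
}
```

The computed value would then be passed as the mm request parameter alongside the query, so a one-word query is never asked to match more clauses than it has.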



Re: eDismax parser and the mm parameter

2014-03-30 Thread S.L
Jack, thanks again!

I am searching Chinese medicine documents. As in the example I gave
earlier, a user can search for Ginseng, Siberian Ginseng, or Red
Siberian Ginseng. I certainly want to use the pf parameter (which is not
driven by the mm parameter); however, to give a higher score to
documents that contain more of the terms, I want to use edismax. Now, if
I set an mm of 3 and the search term is only one word long (like
Ginseng), what does eDismax do?


On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky j...@basetechnology.comwrote:

 It still depends on your objective - which you haven't told us yet. Show
 us some use cases and detail what your expectations are for each use case.

 The edismax phrase boosting is probably a lot more useful than messing
 around with mm. Take a look at pf, pf2, and pf3.

 See:
 http://wiki.apache.org/solr/ExtendedDisMax
 https://cwiki.apache.org/confluence/display/solr/The+
 Extended+DisMax+Query+Parser

 The focus on mm may indeed be a classic XY Problem - a premature focus
 on a solution without detailing the problem.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 11:18 AM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Thanks Jack! I understand the intent of the mm parameter; my question is
 that since the query terms being provided are not of fixed length, I do
 not know what mm should be. For example, Ginseng and Siberian Ginseng
 are my search terms; the first can have an mm of up to 1 and the second
 an mm of up to 2.

 Should I dynamically set mm based on the number of search terms in my
 query?

 Thanks again.


 On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky j...@basetechnology.com
 wrote:

  1. Yes, the default for mm is 1.

 2. It depends on what you are really trying to do - you haven't told us.

 Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
 q.op=AND.

 Generally, use q.op unless you really know what you are doing.

 Generally, the intent of mm is to set the minimum number of OR/SHOULD
 clauses that must match on the top level of a query.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 2:25 AM
 To: solr-user@lucene.apache.org
 Subject: eDismax parser and the mm parameter

 Hi All,

 I am planning to use the eDismax query parser in Solr to give a boost to
 documents that contain a phrase in their fields. There is an mm
 parameter in the edismax query; since the query typed by the user can be
 of any length (i.e. >= 1), I would like to set the mm value to 1. I have
 the following questions about this parameter.

   1. Is it set to 1 by default?
   2. In my schema.xml the defaultOperator is set to AND; should I set it
   to OR in order for the edismax parser to be effective with an mm of 1?


 Thanks in advance!





Re: eDismax parser and the mm parameter

2014-03-30 Thread S.L
Jack,

I misstated the problem; I am not using the OR operator as the default
now (now that I think about it, it does not make sense to use OR as the
default operator along with the mm parameter). The reason I want to use
pf and mm in conjunction is my current understanding of the edismax
parser; I have not looked into the pf2 and pf3 parameters yet.

My understanding is as follows:

pf - boosts the result score if the complete phrase matches.
An mm less than the search-term length would help limit the query
results to a certain number of better matches.

With that said, would it make sense to have a dynamic mm (set to the
number of search terms minus 1)?

I also have a question about using fuzzy search with the eDismax parser,
but I will ask that in a separate post once I go through that aspect of
eDismax.

Thanks again!





On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky j...@basetechnology.comwrote:

 If you use pf, pf2, and pf3 and boost appropriately, the effects of mm
 will be dwarfed.

 The general goal is to assure that the top documents really are the best,
 not to necessarily limit the total document count. Focusing on the latter
 could be a real waste of time.

 It's still not clear why or how you need or want to use OR as the default
 operator - you still haven't given us a use case for that.

 To repeat: Give us a full set of use cases before taking this XY Problem
 approach of pursuing a solution before the problem is understood.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 6:14 PM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Jack, thanks again!

 I am searching Chinese medicine documents. As in the example I gave
 earlier, a user can search for Ginseng, Siberian Ginseng, or Red
 Siberian Ginseng. I certainly want to use the pf parameter (which is not
 driven by the mm parameter); however, to give a higher score to
 documents that contain more of the terms, I want to use edismax. Now, if
 I set an mm of 3 and the search term is only one word long (like
 Ginseng), what does eDismax do?


 On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  It still depends on your objective - which you haven't told us yet. Show
 us some use cases and detail what your expectations are for each use case.

 The edismax phrase boosting is probably a lot more useful than messing
 around with mm. Take a look at pf, pf2, and pf3.

 See:
 http://wiki.apache.org/solr/ExtendedDisMax
 https://cwiki.apache.org/confluence/display/solr/The+
 Extended+DisMax+Query+Parser

 The focus on mm may indeed be a classic XY Problem - a premature focus
 on a solution without detailing the problem.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 11:18 AM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Thanks Jack! I understand the intent of the mm parameter; my question is
 that since the query terms being provided are not of fixed length, I do
 not know what mm should be. For example, Ginseng and Siberian Ginseng
 are my search terms; the first can have an mm of up to 1 and the second
 an mm of up to 2.

 Should I dynamically set mm based on the number of search terms in my
 query?

 Thanks again.


 On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky j...@basetechnology.com
 wrote:

  1. Yes, the default for mm is 1.


 2. It depends on what you are really trying to do - you haven't told us.

 Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
 q.op=AND.

 Generally, use q.op unless you really know what you are doing.

 Generally, the intent of mm is to set the minimum number of OR/SHOULD
 clauses that must match on the top level of a query.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 2:25 AM
 To: solr-user@lucene.apache.org
 Subject: eDismax parser and the mm parameter

 Hi All,

 I am planning to use the eDismax query parser in Solr to give a boost to
 documents that contain a phrase in their fields. There is an mm
 parameter in the edismax query; since the query typed by the user can be
 of any length (i.e. >= 1), I would like to set the mm value to 1. I have
 the following questions about this parameter.

   1. Is it set to 1 by default?
   2. In my schema.xml the defaultOperator is set to AND; should I set it
   to OR in order for the edismax parser to be effective with an mm of 1?


 Thanks in advance!
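Jack's suggestion to lean on pf, pf2, and pf3 rather than mm can be made concrete by looking at the request parameters such a setup would send. A minimal sketch that assembles them as a query string; the field name "text" and the boost values are made-up examples, not recommendations from the thread:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EdismaxParams {
    /** Build an illustrative edismax parameter string; "text" is an assumed field name. */
    static String buildQueryString(String userQuery) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("defType", "edismax");
        p.put("q", userQuery);
        p.put("qf", "text");
        p.put("pf", "text^10"); // boost documents matching the whole phrase
        p.put("pf2", "text^5"); // boost documents matching adjacent word pairs
        p.put("pf3", "text^7"); // boost documents matching adjacent word triples
        p.put("mm", "1");       // minimum number of SHOULD clauses that must match
        StringBuilder qs = new StringBuilder();
        for (Map.Entry<String, String> e : p.entrySet()) {
            if (qs.length() > 0) qs.append('&');
            qs.append(e.getKey()).append('=').append(e.getValue());
        }
        return qs.toString(); // values would need URL encoding in a real request
    }

    public static void main(String[] args) {
        System.out.println(buildQueryString("White Siberian Ginseng"));
    }
}
```

With boosts arranged this way, a document matching the whole phrase outranks one matching only word pairs, which outranks one matching scattered single words, so the pf family handles the ranking while mm stays at a permissive 1.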







Re: eDismax parser and the mm parameter

2014-03-30 Thread S.L
Thanks Jack, my use cases are as follows.


   1. Search for Ginseng: everything related to ginseng should show up.
   2. Search for White Siberian Ginseng: results with the whole phrase
   show up first, followed by two words from the phrase, followed by a
   single word from the phrase.
   3. Fuzzy search Whte Sberia Ginsng (please note the typos here):
   documents with White Siberian Ginseng should show up. This looks like
   the most complicated of all, as Solr does not support fuzzy phrase
   searches. (I have no solution for this yet.)

Thanks again!


On Sun, Mar 30, 2014 at 11:21 PM, Jack Krupansky j...@basetechnology.comwrote:

 The mm parameter is really only relevant when the default operator is OR
 or explicit OR operators are used.

 Again: Please provide your use case examples and your expectations for
 each use case. It really doesn't make a lot of sense to prematurely focus
 on a solution when you haven't clearly defined your use cases.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 9:13 PM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Jack,

 I misstated the problem; I am not using the OR operator as the default
 now (now that I think about it, it does not make sense to use OR as the
 default operator along with the mm parameter). The reason I want to use
 pf and mm in conjunction is my current understanding of the edismax
 parser; I have not looked into the pf2 and pf3 parameters yet.

 My understanding is as follows:

 pf - boosts the result score if the complete phrase matches.
 An mm less than the search-term length would help limit the query
 results to a certain number of better matches.

 With that said, would it make sense to have a dynamic mm (set to the
 number of search terms minus 1)?

 I also have a question about using fuzzy search with the eDismax
 parser, but I will ask that in a separate post once I go through that
 aspect of eDismax.

 Thanks again!





 On Sun, Mar 30, 2014 at 6:44 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  If you use pf, pf2, and pf3 and boost appropriately, the effects of mm
 will be dwarfed.

 The general goal is to assure that the top documents really are the best,
 not to necessarily limit the total document count. Focusing on the latter
 could be a real waste of time.

 It's still not clear why or how you need or want to use OR as the default
 operator - you still haven't given us a use case for that.

 To repeat: Give us a full set of use cases before taking this XY Problem
 approach of pursuing a solution before the problem is understood.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 6:14 PM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Jack, thanks again!

 I am searching Chinese medicine documents. As in the example I gave
 earlier, a user can search for Ginseng, Siberian Ginseng, or Red
 Siberian Ginseng. I certainly want to use the pf parameter (which is
 not driven by the mm parameter); however, to give a higher score to
 documents that contain more of the terms, I want to use edismax. Now,
 if I set an mm of 3 and the search term is only one word long (like
 Ginseng), what does eDismax do?


 On Sun, Mar 30, 2014 at 1:21 PM, Jack Krupansky j...@basetechnology.com
 wrote:

  It still depends on your objective - which you haven't told us yet. Show

 us some use cases and detail what your expectations are for each use
 case.

 The edismax phrase boosting is probably a lot more useful than messing
 around with mm. Take a look at pf, pf2, and pf3.

 See:
 http://wiki.apache.org/solr/ExtendedDisMax
 https://cwiki.apache.org/confluence/display/solr/The+
 Extended+DisMax+Query+Parser

 The focus on mm may indeed be a classic XY Problem - a premature focus
 on a solution without detailing the problem.

 -- Jack Krupansky

 -Original Message- From: S.L
 Sent: Sunday, March 30, 2014 11:18 AM
 To: solr-user@lucene.apache.org
 Subject: Re: eDismax parser and the mm parameter

 Thanks Jack! I understand the intent of the mm parameter; my question is
 that since the query terms being provided are not of fixed length, I do
 not know what mm should be. For example, Ginseng and Siberian Ginseng
 are my search terms; the first can have an mm of up to 1 and the second
 an mm of up to 2.

 Should I dynamically set mm based on the number of search terms in my
 query?

 Thanks again.


 On Sun, Mar 30, 2014 at 5:20 AM, Jack Krupansky j...@basetechnology.com
 
 wrote:

  1. Yes, the default for mm is 1.


 2. It depends on what you are really trying to do - you haven't told us.

 Generally, mm=1 is equivalent to q.op=OR, and mm=100% is equivalent to
 q.op=AND.

 Generally, use q.op unless you really know what you are doing.

 Generally, the intent of mm is to set the minimum number of OR/SHOULD
 clauses that must match on the top level of a query.

 -- Jack
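For the fuzzy use case (Whte Sberia Ginsng matching White Siberian Ginseng), one workaround, since Lucene/Solr has no fuzzy *phrase* query, is to append a fuzzy operator to each individual term and let phrase boosting handle ordering. A sketch of the term rewriting; the helper is my own illustration, not an established recipe from this thread:

```java
public class FuzzyTerms {
    /** Hypothetical helper: append ~2 (max edit distance 2) to every term. */
    static String fuzzify(String userQuery) {
        StringBuilder sb = new StringBuilder();
        for (String term : userQuery.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(term).append("~2");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each misspelled term becomes its own fuzzy clause.
        System.out.println(fuzzify("Whte Sberia Ginsng"));
    }
}
```

Caveats: fuzzy clauses bypass most query-time analysis and can be expensive, and mm still applies across the resulting clauses, so this is a starting point rather than a drop-in fix.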

SolrJ 503 Error

2013-12-21 Thread S.L
Hi All,

I am running a single Solr instance, version 4.4, with Apache Tomcat
7.0.42. I am also running a Nutch instance with about 20 threads, and
each thread commits a document to the Solr index using the SolrJ API
(the SolrJ version I use is 4.3.1). Can anyone please let me know
whether this error is occurring because I am committing documents too
fast for a single server instance, or because of some other underlying
issue?

Thanks.



org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Server at http://localhost:8081/solr returned non ok status:503,
message:Service Unavailable
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
~[DynaOCrawlerUtils.jar:?]
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
~[DynaOCrawlerUtils.jar:?]
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75)
~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
at 
com.xyz.DynaOCrawlerUtils.SolrDynaOUtils.createSolrInputDocumentAndPopulateSolrIndex(SolrDynaOUtils.java:101)
~[DynaOCrawlerUtils.jar:?]
at 
com.xyz.DynaOCrawlerUtils.SolrCallbackForNXParser.populateModelToSolrIndex(SolrCallbackForNXParser.java:216)
[DynaOCrawlerUtils.jar:?]
at 
com.xyz.DynaOCrawlerUtils.SolrCallbackForNXParser.endDocument(SolrCallbackForNXParser.java:87)
[DynaOCrawlerUtils.jar:?]
at 
com.xyz.DynaOCrawlerUtils.SolrDynaOUtils.populateSolrIndexFromCurrentURL(SolrDynaOUtils.java:250)
[DynaOCrawlerUtils.jar:?]
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:716)
[job.jar:?]


Re: SolrJ 503 Error

2013-12-21 Thread S.L
I have an 8 GB machine, and I commit after each and every document that
is added to Solr; I'm not sure if I am missing anything here. From your
response it sounds like I could use autocommit instead. In that case, do
I no longer need to call commit myself? Can you please point me to a
resource that explains this?
Thanks.


On Sat, Dec 21, 2013 at 2:48 PM, Andrea Gazzarini agazzar...@apache.orgwrote:

 Not sure if we have the same scenario, but I got the same error code
 when I was trying to do a lot of requests (updates and queries) with 10
 seconds of (hard) autocommit against a Solr instance running in a
 servlet engine (Tomcat) with few resources (if I remember correctly, no
 more than 1 GB of RAM).

 Andrea
 Hi All,

 I am running a single Solr instance, version 4.4, with Apache Tomcat
 7.0.42. I am also running a Nutch instance with about 20 threads, and
 each thread commits a document to the Solr index using the SolrJ API
 (the SolrJ version I use is 4.3.1). Can anyone please let me know
 whether this error is occurring because I am committing documents too
 fast for a single server instance, or because of some other underlying
 issue?

 Thanks.



 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 Server at http://localhost:8081/solr returned non ok status:503,
 message:Service Unavailable
 at

 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
 ~[DynaOCrawlerUtils.jar:?]
 at

 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
 ~[DynaOCrawlerUtils.jar:?]
 at

 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 ~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
 at
 org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
 ~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
 at
 org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75)
 ~[solr-solrj-3.4.0.jar:3.4.0 1167142 - mike - 2011-09-09 09:06:50]
 at

 com.xyz.DynaOCrawlerUtils.SolrDynaOUtils.createSolrInputDocumentAndPopulateSolrIndex(SolrDynaOUtils.java:101)
 ~[DynaOCrawlerUtils.jar:?]
 at

 com.xyz.DynaOCrawlerUtils.SolrCallbackForNXParser.populateModelToSolrIndex(SolrCallbackForNXParser.java:216)
 [DynaOCrawlerUtils.jar:?]
 at

 com.xyz.DynaOCrawlerUtils.SolrCallbackForNXParser.endDocument(SolrCallbackForNXParser.java:87)
 [DynaOCrawlerUtils.jar:?]
 at

 com.xyz.DynaOCrawlerUtils.SolrDynaOUtils.populateSolrIndexFromCurrentURL(SolrDynaOUtils.java:250)
 [DynaOCrawlerUtils.jar:?]
 at
 org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:716)
 [job.jar:?]
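On the autocommit question: instead of calling commit() from every Nutch thread, Solr can be left to commit on its own via the updateHandler settings in solrconfig.xml. A sketch of the relevant fragment; the interval values below are arbitrary examples to tune for your load, not recommendations from this thread:

```xml
<!-- solrconfig.xml fragment: example autocommit values -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to stable storage at most every 15 seconds -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make new documents searchable at most every 5 seconds -->
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With this in place the per-document commit() calls on the client can be dropped entirely; alternatively, SolrJ's add(doc, commitWithinMs) overload lets the client request a commit deadline per document without issuing explicit commits.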