Re: Special character and wildcard matching

2015-02-24 Thread Arun Rangarajan
Thanks, Jack.
I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154


On Tue, Feb 24, 2015 at 11:43 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Thanks. That at least verifies that the accented e is stored in the field.
 I don't see anything wrong here, so it is as if the Lucene prefix query was
 mapping the accented characters. It's not supposed to do that, but...

 Go ahead and file a Jira bug. Include all of the details that you provided
 in this thread.

 -- Jack Krupansky

 On Tue, Feb 24, 2015 at 2:35 PM, Arun Rangarajan arunrangara...@gmail.com
 wrote:

  Exact query:
  /select?q=raw_name:beyonce*&wt=json&fl=raw_name

  Response:

  {"responseHeader": {"status": 0, "QTime": 0, "params": {
      "fl": "raw_name", "q": "raw_name:beyonce*", "wt": "json"}},
   "response": {"numFound": 2, "start": 0, "docs": [
      {"raw_name": "beyoncé"}, {"raw_name": "beyoncé"}]}}

  On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky jack.krupan...@gmail.com
  wrote:

   Please post the info I requested - the exact query, and the Solr
   response.

   -- Jack Krupansky

   On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan arunrangara...@gmail.com
   wrote:

    In our case, the lower-casing is happening in a custom Java indexer code,
    via Java's String.toLowerCase() method.

    I used the analysis tool in Solr admin (with Jetty). I believe the raw
    bytes explain this.

    Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
    beyoncé in file beyonce_with_spl_chars.JPG.

    Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
    Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]

    So when you look at the bytes, it seems to explain why beyonce* matches
    beyoncé.

    I tried your approach with a KeywordTokenizer followed by a
    LowerCaseFilter, but I see the same behavior.

    On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky jack.krupan...@gmail.com
    wrote:

     But how is that lowercasing occurring? I mean, solr.StrField doesn't do
     that.

     Some containers default to automatically mapping accented characters, so
     that the accented e would then get indexed as a normal e, and then your
     wildcard would match it, and an accented e in a query would get mapped as
     well and then match the normal e in the index. What does your query
     response look like?

     This blog post explains that problem:
     http://bensch.be/tomcat-solr-and-special-characters

     Note that you could make your string field a text field with the keyword
     tokenizer and then filter it for lower case, such as when the user query
     might have a capital B. String field is most appropriate when the field
     really is 100% raw.

     -- Jack Krupansky

     On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan arunrangara...@gmail.com
     wrote:

      Yes, it is a string field and not a text field.

      <fieldType name="string" class="solr.StrField" sortMissingLast="true"
       omitNorms="true"/>
      <field name="raw_name" type="string" indexed="true" stored="true"/>

      Lower-casing done to do case-insensitive matching.

      On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky jack.krupan...@gmail.com
      wrote:

       Is it really a string field - as opposed to a text field? Show us the
       field and field type.

       Besides, if it really were a raw name, wouldn't that be a capital B?

       -- Jack Krupansky

       On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan arunrangara...@gmail.com
       wrote:

        I have a string field raw_name like this in my document:

        {"raw_name": "beyoncé"}

        (Notice that the last character is a special character.)

        When I issue this wildcard query:

        q=raw_name:beyonce*

        i.e. with the last character simply being the ASCII 'e', Solr returns me
        the above document.

        How do I prevent this?

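The raw bytes posted above show what is going on: the stored é is in
decomposed (NFD) form, a plain ASCII e (0x65) followed by the combining acute
accent U+0301 (0xcc 0x81 in UTF-8). At the term level, "beyonce" therefore
really is a prefix of "beyoncé", so the prefix query matches. A minimal Java
sketch of the effect (class and variable names are made up):

import java.text.Normalizer;

public class NfdPrefixDemo {
    public static void main(String[] args) {
        // é as a single precomposed code point (NFC form)
        String nfc = "beyonc\u00E9";
        // decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

        // In NFD, the ASCII string "beyonce" is a true prefix of "beyoncé",
        // which is exactly what the posted bytes [... 65 cc 81] indicate.
        System.out.println(nfc.startsWith("beyonce")); // false
        System.out.println(nfd.startsWith("beyonce")); // true
    }
}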


Re: Special character and wildcard matching

2015-02-24 Thread Alexandre Rafalovitch
On 24 February 2015 at 15:50, Jack Krupansky jack.krupan...@gmail.com wrote:
 It's a string field, so there shouldn't be any analysis. (read back in the
 thread for the field and field type.)

It's a multi-term expansion. There is _some_ analysis one way or another :-)
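One concrete way to take the hidden chain out of the equation is to declare
the multiterm analyzer explicitly in schema.xml and normalize consistently at
index, query, and multiterm time. A sketch (the field type name is made up,
and it assumes the ICU filter from the analysis-extras contrib is on the
classpath):

<fieldType name="name_exactish" class="solr.TextField" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
  </analyzer>
  <!-- applied to wildcard/prefix terms instead of the hidden default chain -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
  </analyzer>
</fieldType>

With composition applied on both sides, the decomposed bytes never reach the
index, so beyonce* would no longer accidentally match the accented term. (If
the opposite behavior were wanted, folding everything to ASCII with
solr.ASCIIFoldingFilterFactory in the same three analyzers would make
beyonce* match both forms intentionally.)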


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


RE: how to debug solr performance degradation

2015-02-24 Thread Toke Eskildsen
Tang, Rebecca [rebecca.t...@ucsf.edu] wrote:
[12-15 second response time instead of 0-3]
 Solr index size 183G
 Documents in index 14364201
 We just have single solr box
 It has 100G memory
 500G Harddrive
 16 cpus

The usual culprit is memory (if you are using spinning drives as your storage). 
It appears that you have enough raw memory though. Could you check how much 
memory the machine has free for disk caching? If it is a relatively small 
amount, let's say below 50GB, then please provide a breakdown of what the 
memory is used for (a very large JVM heap, for example).

 I want to pinpoint where the performance issue is coming from.  Could I have 
 some suggestions/help on how to benchmark/debug solr performance issues.

Rough checking of IOWait and CPU load is a fine starting point. If it is CPU 
load then you can turn on debug in Solr admin, which should tell you where the 
time is spent resolving the queries. If it is IOWait then ensure a lot of free 
memory for disk cache and/or improve your storage speed (SSDs instead of 
spinning drives, local storage instead of remote).
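
A quick way to get those numbers on Linux (a sketch; it assumes the usual 
procps/sysstat tools are installed):

free -g                    # the cached column is memory available for disk caching
top -b -n 1 | grep java    # resident size of the Solr JVM versus total RAM
iostat -x 5                # %iowait and device utilization while queries run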

- Toke Eskildsen, State and University Library, Denmark.


Re: Special character and wildcard matching

2015-02-24 Thread Jack Krupansky
It's a string field, so there shouldn't be any analysis. (read back in the
thread for the field and field type.)

-- Jack Krupansky

On Tue, Feb 24, 2015 at 3:19 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 What happens if the query does not have wildcard expansion (*)? If the
 behavior is correct, then the issue is somehow with the
 MultitermQueryAnalysis (a hidden automatically generated analyzer
 chain): http://wiki.apache.org/solr/MultitermQueryAnalysis

 Which would still make it a bug, but at least the cause could be narrowed
 down.

 Regards,
Alex.


 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/



Re: Special character and wildcard matching

2015-02-24 Thread Arun Rangarajan
Exact query:
/select?q=raw_name:beyonce*&wt=json&fl=raw_name

Response:

{"responseHeader": {"status": 0, "QTime": 0, "params": {
    "fl": "raw_name", "q": "raw_name:beyonce*", "wt": "json"}},
 "response": {"numFound": 2, "start": 0, "docs": [
    {"raw_name": "beyoncé"}, {"raw_name": "beyoncé"}]}}



On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Please post the info I requested - the exact query, and the Solr response.

 -- Jack Krupansky




Re: performance issues with geofilt

2015-02-24 Thread david.w.smi...@gmail.com
Hi Dirk,

The RPT field type can be used for distance sorting/boosting, but it’s a
memory pig when used as such, so don’t do it unless you have to.  You only
have to if you have a multi-valued point field.  If you have single-valued
points, use LatLonType specifically for distance sorting.

Your sample query doesn’t parse correctly for multiple reasons.  You can’t
put a query into the sort parameter as you have done it.  You have to do
sort=query($sortQuery) asc&sortQuery=…   or a slightly different equivalent
variation.  Let’s say you do that… still, I don’t recommend this syntax when
you simply want distance sort — just use geodist(), as in:  sort=geodist()
asc.

If you want to use this syntax such as to sort by recipDistance, then it
would look like this (note the filter=false hint to the spatial query
parser, which otherwise is unaware it shouldn’t bother actually
searching/filtering):
sort=query($sortQuery) desc&sortQuery={!geofilt score=recipDistance
filter=false sfield=geometry pt=51.3,12.3 d=1.0}

If you are able to use geodist() and still find it slow, there are
alternatives involving using projected data and then with simply euclidean
calculations, sqedist():
https://wiki.apache.org/solr/FunctionQuery#sqedist_-_Squared_Euclidean_Distance
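
Putting that together for your filter-plus-sort case, a full request might
look like the following (a sketch: it assumes you add a parallel
single-valued LatLonType field, here called geometry_ll, for the sort):

/select?q=*:*
  &fq=typ:strasse
  &fq={!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0}
  &sfield=geometry_ll
  &pt=51.370570625523,12.369290471603
  &sort=geodist() asc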

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, Feb 24, 2015 at 6:12 AM, dirk.thalh...@bkg.bund.de wrote:

 Hello,

 we are using solr 4.10.1. There are two cores for different use cases with
 around 20 million documents (location descriptions) per core. Each document
 has a geometry field which stores a point and a bbox field which stores a
 bounding box. Both fields are defined with:
 <fieldType name="t_geometry"
     class="solr.SpatialRecursivePrefixTreeFieldType"
     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
     geo="true" distErrPct="0.025" maxDistErr="0.9"
     units="degrees" />

 I'm currently trying to add a location search (find all documents around a
 point). My intention is to add this as filter query, so that the user is
 able to do an additional keyword search. These are the query parameters so
 far:
 q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry
 pt=51.370570625523,12.369290471603 d=1.0}
 To sort the documents by their distance to the requested point, I added
 following sort parameter:
 sort={!geofilt sort=distance sfield: geometry
 pt=51.370570625523,12.369290471603 d=1.0} asc

 Unfortunately I'm experiencing here some major performance/memory
 problems. The first distance query on a core takes over 10 seconds. In my
 first setup the same request to the second core completely blocked the
 server and caused an OutOfMemoryError. I had to increase the memory to 16
 GB and now it seems to work for the geometry field. Anyhow, the first
 request after a server restart takes some time, and when I try it with the
 bbox field after a request on the geometry field in both cores, the
 server blocks again.

 Can anyone explain why the distance needs so much memory? Can this be
 optimized?

 Kind regards,

 Dirk




how to debug solr performance degradation

2015-02-24 Thread Tang, Rebecca
Our solr index used to perform OK on our beta production box (anywhere between 
0-3 seconds to complete any query), but today I noticed that the performance is 
very bad (queries take between 12 – 15 seconds).

I haven't updated the solr index configuration (schema.xml/solrconfig.xml) 
lately.  All that's changed is the data — every month, I rebuild the solr index 
from scratch and deploy it to the box.  We will eventually go to incremental 
builds. But for now, all indexes are built from scratch.

Here are the stats:
Solr index size 183G
Documents in index 14364201
We just have single solr box
It has 100G memory
500G Harddrive
16 cpus

I don't know when the performance degradation started.  Today was the first 
time I noticed it, but I haven't used this box for a while.

I want to pinpoint where the performance issue is coming from.  Could I have 
some suggestions/help on how to benchmark/debug solr performance issues.

Thank you,
Rebecca Tang
Applications Developer, UCSF CKM
Industry Documents Digital Libraries
E: rebecca.t...@ucsf.edu



Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Michael Della Bitta
I guess the place to start is the Reference Guide:

https://cwiki.apache.org/confluence/display/solr/SolrCloud

Generally speaking, when you start Solr with any sort of Zookeeper, you've
entered cloud mode, which essentially means that Solr is now capable of
organizing cores into groups that represent shards, and groups of shards
are coordinated into collections. Additionally, Zookeeper allows multiple
Solr installations to be coordinated together to serve these collections
with high availability.

If you're just trying to gain parallelism on a single machine by using
multiple cores, you don't specifically need cloud mode or collections. You can
create multiple cores, distribute your documents manually to each core, and
then do a distributed search ala
https://wiki.apache.org/solr/DistributedSearch. The downside here is that
you're on your own in terms of distributing the documents at write time,
but on the other hand, you don't have to maintain a Zookeeper ensemble or
devote brain cells to understanding collections/shards/etc.
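
If you do go the SolrCloud route, creating the sharded collection is a single
Collections API call, for example (a sketch; the collection and configset
names are made up):

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=8&replicationFactor=1&maxShardsPerNode=8&collection.configName=myconf"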


Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/

On Tue, Feb 24, 2015 at 3:21 PM, Benson Margulies bimargul...@gmail.com
wrote:

 On Tue, Feb 24, 2015 at 1:30 PM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
  Benson:
 
  Are you trying to run independent invocations of Solr for every node?
  Otherwise, you'd just want to create a 8 shard collection with
  maxShardsPerNode set to 8 (or more I guess).

 Michael Della Bitta,

 I don't want to run multiple invocations. I just want to exploit
 hardware cores with shards. Can you point me at doc for the process
 you are referencing here? I confess to some ongoing confusion between
 cores and collections.

 --benson


 
  Michael Della Bitta
 
  Senior Software Engineer
 
  o: +1 646 532 3062
 
  appinions inc.
 
  “The Science of Influence Marketing”
 
  18 East 41st Street
 
  New York, NY 10017
 
  t: @appinions https://twitter.com/Appinions | g+:
  plus.google.com/appinions
  
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 
  w: appinions.com http://www.appinions.com/
 
  On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies bimargul...@gmail.com
 
  wrote:
 
  With so much of the site shifted to 5.0, I'm having a bit of trouble
  finding what I need, and so I'm hoping that someone can give me a push
  in the right direction.
 
  On a big multi-core machine, I want to set up a configuration with 8
  (or perhaps more) nodes treated as shards. I have some very particular
  solrconfig.xml and schema.xml that I need to use.
 
  Could some kind person point me at a relatively step-by-step layout?
  This is all on Linux, I'm happy to explicitly run Zookeeper.
 



Re: Special character and wildcard matching

2015-02-24 Thread Alexandre Rafalovitch
What happens if the query does not have wildcard expansion (*)? If the
behavior is correct, then the issue is somehow with the
MultitermQueryAnalysis (a hidden automatically generated analyzer
chain): http://wiki.apache.org/solr/MultitermQueryAnalysis

Which would still make it a bug, but at least the cause could be narrowed down.

Regards,
   Alex.



Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 February 2015 at 14:56, Arun Rangarajan arunrangara...@gmail.com wrote:
 Thanks, Jack.
 I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154





Re: how to debug solr performance degradation

2015-02-24 Thread Shawn Heisey
On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
 Our solr index used to perform OK on our beta production box (anywhere 
 between 0-3 seconds to complete any query), but today I noticed that the 
 performance is very bad (queries take between 12 – 15 seconds).

 I haven't updated the solr index configuration (schema.xml/solrconfig.xml) 
 lately.  All that's changed is the data — every month, I rebuild the solr 
 index from scratch and deploy it to the box.  We will eventually go to 
 incremental builds. But for now, all indexes are built from scratch.

 Here are the stats:
 Solr index size 183G
 Documents in index 14364201
 We just have single solr box
 It has 100G memory
 500G Harddrive
 16 cpus

The bottom line on this problem, and I'm sure it's not something you're
going to want to hear:  You don't have enough memory available to cache
your index.  I'd plan on at least 192GB of RAM for an index this size,
and 256GB would be better.

Depending on the exact index schema, the nature of your queries, and how
large your Java heap for Solr is, 100GB of RAM could be enough for good
performance on an index that size ... or it might be nowhere near
enough.  I would imagine that one of two things is true here, possibly
both:  1) Your queries are very complex and involve accessing a very
large percentage of the index data.  2) Your Java heap is enormous,
leaving very little RAM for the OS to automatically cache the index.

Adding more memory to the machine, if that's possible, might fix some of
the problems.  You can find a discussion of the problem here:

http://wiki.apache.org/solr/SolrPerformanceProblems

If you have any questions after reading that wiki article, feel free to
ask them.

Thanks,
Shawn



Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Shawn Heisey
On 2/24/2015 1:21 PM, Benson Margulies wrote:
 On Tue, Feb 24, 2015 at 1:30 PM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
 Benson:

 Are you trying to run independent invocations of Solr for every node?
 Otherwise, you'd just want to create a 8 shard collection with
 maxShardsPerNode set to 8 (or more I guess).
 Michael Della Bitta,

 I don't want to run multiple invocations. I just want to exploit
 hardware cores with shards. Can you point me at doc for the process
 you are referencing here? I confess to some ongoing confusion between
 cores and collections.

SolrCloud is designed around the idea that each machine runs one copy of
Solr.  Running multiple instances of Solr on one machine is usually a
waste of resources, and can lead to problems with SolrCloud high
availability (redundancy).

Here's a simple way of thinking about the terminology in SolrCloud: 
Collections are made up of one or more shards.  Shards have one or more
replicas.  Each replica is a core.

An important detail:  For each shard, one of the replicas is elected
leader.  SolrCloud gets rid of the master and slave concepts.

Thanks,
Shawn



Re: Special character and wildcard matching

2015-02-24 Thread Jack Krupansky
Thanks. That at least verifies that the accented e is stored in the field.
I don't see anything wrong here, so it is as if the Lucene prefix query was
mapping the accented characters. It's not supposed to do that, but...

Go ahead and file a Jira bug. Include all of the details that you provided
in this thread.

-- Jack Krupansky

On Tue, Feb 24, 2015 at 2:35 PM, Arun Rangarajan arunrangara...@gmail.com
wrote:

 Exact query:
 /select?q=raw_name:beyonce*&wt=json&fl=raw_name

 Response:

 {"responseHeader": {"status": 0, "QTime": 0, "params": {
     "fl": "raw_name", "q": "raw_name:beyonce*", "wt": "json"}},
  "response": {"numFound": 2, "start": 0, "docs": [
     {"raw_name": "beyoncé"}, {"raw_name": "beyoncé"}]}}






Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Benson Margulies
On Tue, Feb 24, 2015 at 4:27 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Unfortunately, this is all 5.1 and instructs me to run the 'start from
 : scratch' process.

  a) check out the left nav of any ref guide webpage, which has a link to
  "Older Versions of this Guide (PDF)"

 b) i'm not entirely sure i understand what you're asking, but i'm guessing
 you mean...

 * you have a fully functional individual instance of Solr, with a single
 core
 * you only want to run that one single instance of the Solr process
  * you want that single solr process to be a SolrCloud of one node, but
  replace your single core with a collection that is divided into 8
  shards.
 * presumably: you don't care about replication since you are only trying
 to run one node.

  what you want to look into (in the 4.10 ref guide) is how to bootstrap a
  SolrCloud instance from a non-SolrCloud node -- ie: startup zk, tell solr
  to take the configs from your single core and upload them to zk as a
  configset, and register that single core as a collection.

 That should give you a single instance of solrcloud, with a single
 collection, consisting of one shard (your original core)

  Then you should be able to use the SPLITSHARD command to split your
  single shard into 2 shards, and then split them again, etc... (i don't
  think you can split directly to 8 sub-shards with a single command)



 FWIW: unless you no longer have access to the original data, it would
 almost certainly be a lot easier to just start with a clean install of
 Solr in cloud mode, then create a collection with 8 shards, then re-index
 your data.

OK, now I'm good to go. Thanks.




 -Hoss
 http://www.lucidworks.com/


Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Chris Hostetter

: Unfortunately, this is all 5.1 and instructs me to run the 'start from
: scratch' process.

a) check out the left nav of any ref guide webpage, which has a link to 
"Older Versions of this Guide (PDF)"

b) i'm not entirely sure i understand what you're asking, but i'm guessing 
you mean...

* you have a fully functional individual instance of Solr, with a single 
core
* you only want to run that one single instance of the Solr process
* you want that single solr process to be a SolrCloud of one node, but 
replace your single core with a collection that is divided into 8 
shards.
* presumably: you don't care about replication since you are only trying 
to run one node.

what you want to look into (in the 4.10 ref guide) is how to bootstrap a 
SolrCloud instance from a non-SolrCloud node -- ie: startup zk, tell solr 
to take the configs from your single core and upload them to zk as a 
configset, and register that single core as a collection.

That should give you a single instance of solrcloud, with a single 
collection, consisting of one shard (your original core)

Then you should be able to use the SPLITSHARD command to split your 
single shard into 2 shards, and then split them again, etc... (i don't 
think you can split directly to 8 sub-shards with a single command)
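
The split itself is one Collections API call per shard, along the lines of 
(a sketch; the collection name is made up):

curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1"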



FWIW: unless you no longer have access to the original data, it would 
almost certainly be a lot easier to just start with a clean install of 
Solr in cloud mode, then create a collection with 8 shards, then re-index 
your data.



-Hoss
http://www.lucidworks.com/


Solr Spatial search with self-intersecting polygons

2015-02-24 Thread Prateek Sachan
Hi,

I'm using Solr 4.10.3. As field type in my schema.xml, I'm using
location_rpt as described in the documentation:

<fieldType name="location_rpt"
    class="solr.SpatialRecursivePrefixTreeFieldType" geo="true"
    distErrPct="0.025" maxDistErr="0.09" units="degrees" />

Everything works good and fine. I'm able to index points and include a
POLYGON search in my query.

However, it throws an exception in some cases where the Polygon in my query
might be intersecting at some point.

org.apache.solr.common.SolrException:
com.spatial4j.core.exception.InvalidShapeException: Self-intersection at or
near point (23.5235632532521, 77.6266564124352, NaN)

How could I clean up my polygon so that Solr doesn't throw an exception.
Also, a possible solution would be to include only the boundary (in case
the polygon intersects at some point). Is it possible to do it on the Solr
side?
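
One common client-side fix before sending the query is JTS's buffer(0)
trick, which re-nodes a self-intersecting polygon into a valid one. A sketch
(the WKT is a made-up bow-tie polygon, and the class name is made up too;
depending on your spatial4j version there may also be a validationRule
option on the factory worth investigating):

import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.io.WKTReader;

public class PolygonCleaner {
    public static void main(String[] args) throws Exception {
        // a bow-tie polygon: its two edges cross, so it is self-intersecting
        Geometry g = new WKTReader().read("POLYGON((0 0, 2 2, 2 0, 0 2, 0 0))");
        if (!g.isValid()) {
            g = g.buffer(0); // re-nodes the rings into a valid (multi)polygon
        }
        System.out.println(g); // valid WKT, safe to put into the Solr query
    }
}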


Thanks.


-- 
Prateek Sachan
Indian Institute of Technology Delhi


Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Benson Margulies
On Tue, Feb 24, 2015 at 1:30 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 Benson:

 Are you trying to run independent invocations of Solr for every node?
 Otherwise, you'd just want to create a 8 shard collection with
 maxShardsPerNode set to 8 (or more I guess).

Michael Della Bitta,

I don't want to run multiple invocations. I just want to exploit
hardware cores with shards. Can you point me at doc for the process
you are referencing here? I confess to some ongoing confusion between
cores and collections.

--benson



 Michael Della Bitta

 Senior Software Engineer

 o: +1 646 532 3062

 appinions inc.

 “The Science of Influence Marketing”

 18 East 41st Street

 New York, NY 10017

 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 w: appinions.com http://www.appinions.com/

 On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies bimargul...@gmail.com
 wrote:

 With so much of the site shifted to 5.0, I'm having a bit of trouble
 finding what I need, and so I'm hoping that someone can give me a push
 in the right direction.

 On a big multi-core machine, I want to set up a configuration with 8
 (or perhaps more) nodes treated as shards. I have some very particular
 solrconfig.xml and schema.xml that I need to use.

 Could some kind person point me at a relatively step-by-step layout?
 This is all on Linux, I'm happy to explicitly run Zookeeper.



Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Benson Margulies
On Tue, Feb 24, 2015 at 3:32 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 https://cwiki.apache.org/confluence/display/solr/SolrCloud

Unfortunately, this is all 5.1 and instructs me to run the 'start from
scratch' process.

I wish that I could take my existing one-core no-cloud config and
convert it into a cloud, 8-shard config.


Re: snapinstaller does not start newSearcher

2015-02-24 Thread Shalin Shekhar Mangar
Do you mean the snapinstaller (bash) script? Those are legacy scripts. It's
been a long time since they were tested. The ReplicationHandler is the
recommended way to setup replication. If you want to take a snapshot then
the replication handler has an HTTP based API which lets you do that.
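For example (a sketch; it assumes the handler is registered at /replication
as in the stock solrconfig.xml, and uses the core name from the log below):

curl "http://localhost:8983/solr/product/replication?command=backup&numberToKeep=2"
curl "http://localhost:8983/solr/product/replication?command=details&wt=json"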

In any case, do you have the full stack trace for that exception? There
should be another cause nested under it.

On Tue, Feb 24, 2015 at 12:47 PM, alx...@aim.com wrote:

 Hello,

 I am using the latest solr (solr trunk). I run snapinstaller, and see that it
 copies the snapshot to the index folder, but the changes are not picked up.

 The logs in the slave after running snapinstaller are:

 44302 [qtp1312571113-14] INFO  org.apache.solr.update.UpdateHandler  –
 start
 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
 44303 [qtp1312571113-14] INFO  org.apache.solr.update.UpdateHandler  – No
 uncommitted changes. Skipping IW.commit.
 44304 [qtp1312571113-14] INFO  org.apache.solr.core.SolrCore  –
 SolrIndexSearcher has not changed - not re-opening:
 org.apache.solr.search.SolrIndexSearcher
 44305 [qtp1312571113-14] INFO  org.apache.solr.update.UpdateHandler  –
 end_commit_flush
 44305 [qtp1312571113-14] INFO
 org.apache.solr.update.processor.LogUpdateProcessor  – [product]
 webapp=/solr path=/update params={} {commit=} 0 57

 Restarting solr  gives

  Error creating core [product]: Error opening new searcher
 org.apache.solr.common.SolrException: Error opening new searcher
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
 at
 org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
 at
 org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:255)
 at
 org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:249)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 Caused by: org.apache.solr.common.SolrException: Error opening new searcher
 at
 org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
 at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:845)
 ... 9 more

 Any idea what causes this issue.

 Thanks in advance.
 Alex.




-- 
Regards,
Shalin Shekhar Mangar.


Re: Geo Aggregations and Search Alerts in Solr

2015-02-24 Thread Richard Gibbs
Hi Charlie,

Thanks a lot for your response

On Tue, Feb 24, 2015 at 5:08 PM, Charlie Hull char...@flax.co.uk wrote:

 On 24/02/2015 03:03, Richard Gibbs wrote:

 Hi There,

 I am in the process of choosing a search technology for one of my projects
 and I was looking into Solr and Elasticsearch.

 Two features that I am more interested are geo aggregations (for map
 clustering) and search alerts. Elasticsearch seem to have these two
 features built-in.

  http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/geo-aggs.html
  http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

 I couldn't find relevant documentation for Solr and therefore not sure
 whether these features are readily available in Solr. Can you please let
 me
 know whether these features are available in Solr? If not, whether there
 are solutions to achieve same with Solr.


 Hi Richard,

 I don't know about geo aggregations, although I know the Heliosearch guys
 and others have been working on various facet statistics that may impinge
 on this. http://heliosearch.org/solr-facet-functions/

 For alerting, you're talking about storing queries and running them
 against any new document to see if it matches. We do this a lot for clients
 needing large scale media monitoring and auto-classification - here's the
 Lucene-based library we released:
 https://github.com/flaxsearch/luwak
  Note that this depends on a patched Lucene currently, but I'm very happy
  to say that a client is funding us to merge this back to trunk and we
  expect Luwak to be usable with a 5.x release of Lucene. More news very soon!
  There are a couple of videos on that page that will explain further. We
  suspect our approach is considerably faster than the Percolator, and it's
  on the list to benchmark the two.

 Cheers

 Charlie


 Thank you.



 --
 Charlie Hull
 Flax - Open Source Enterprise Search

 tel/fax: +44 (0)8700 118334
 mobile:  +44 (0)7767 825828
 web: www.flax.co.uk



Re: highlighting the boolean query

2015-02-24 Thread Dmitry Kan
Erick,

Our default operator is AND.

Both queries below parse the same:

a OR (b c) OR d
a OR (b AND c) OR d

The parsed query:

<str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
Contents:d</str>

So this part is consistent with our expectation.


 I'm a bit puzzled by your statement that c didn't contribute to the
 score.
What I meant was that the term c was not hit by the scorer: the explain
section does not refer to it. I'm using made-up terms here, but the
concept holds.

The code suggests that we could benefit from storing term offsets and
positions:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470

Is that a correct assumption?
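
For reference, enabling that would mean declaring the field along these
lines (a sketch; the field type name is a guess, the field name mirrors the
parsed query above):

<field name="Contents" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>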

On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Highlighting is such a pain...

 what does the parsed query look like? If the default operator is OR,
 then this seems correct as both 'd' and 'c' appear in the doc. So
 I'm a bit puzzled by your statement that c didn't contribute to the
 score.

 If the parsed query is, indeed
 a +b +c d

 then it does look like something with the highlighter. Whether other
 highlighters are better for this case.. no clue ;(

 Best,
 Erick

 On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote:
  Erick,
 
  nope, we are using std lucene qparser with some customizations, that do
 not
  affect the boolean query parsing logic.
 
  Should we try some other highlighter?
 
  On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  Are you using edismax?
 
  On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com
 wrote:
   Hello!
  
   In solr 4.3.1 there seem to be some inconsistency with the
 highlighting
  of
   the boolean query:
  
   a OR (b c) OR d
  
   This returns a proper hit, which shows that only d was included into
 the
   document score calculation.
  
   But the highlighter returns both d and c in em tags.
  
   Is this a known issue of the standard highlighter? Can it be
 mitigated?
  
  
   --
   Dmitry Kan
   Luke Toolbox: http://github.com/DmitryKey/luke
   Blog: http://dmitrykan.blogspot.com
   Twitter: http://twitter.com/dmitrykan
   SemanticAnalyzer: www.semanticanalyzer.info
 
 
 
 
  --
  Dmitry Kan
  Luke Toolbox: http://github.com/DmitryKey/luke
  Blog: http://dmitrykan.blogspot.com
  Twitter: http://twitter.com/dmitrykan
  SemanticAnalyzer: www.semanticanalyzer.info




-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Setting Up an External ZooKeeper Ensemble

2015-02-24 Thread Shalin Shekhar Mangar
Looks like the ZooKeeper server is either not running or not accepting
connection possibly because of some configuration issues. Can you look into
the ZooKeeper logs and see if there are any exceptions?
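
A couple of quick checks (a sketch; this assumes the default clientPort
2181, that nc is installed, and that zookeeper.out was written to the
directory zkServer.sh was started from):

echo ruok | nc localhost 2181   # prints imok if the server is healthy
echo stat | nc localhost 2181   # mode (standalone/leader/follower) and clients
cat zookeeper.out               # startup log, including any bind/config exceptions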

On Tue, Feb 24, 2015 at 11:30 AM, CKReddy Bhimavarapu chaitu...@gmail.com
wrote:

 Hi,
   I did follow all the steps in [

 https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
 ]
 but still I am getting this error
 Waiting to see Solr listening on port 8983 [-]  Still not seeing Solr
 listening on 8983 after 30 seconds!
 WARN  - 2015-02-24 05:50:19.161;
 org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null,
 unexpected error, closing socket connection and attempting reconnect
 java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
 at

 org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
 WARN  - 2015-02-24 05:50:20.262;
 org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null,
 unexpected error, closing socket connection and attempting reconnect


 Where am I going wrong?

 --
 ckreddybh. chaitu...@gmail.com




-- 
Regards,
Shalin Shekhar Mangar.


Solr query performace

2015-02-24 Thread Ali Nazemian
Dear all,
Hi,
I was wondering: is there any performance comparison available for different
Solr queries?
I mean, what is the cost of different Solr queries from a memory and CPU
point of view? I am looking for a report that could help me in the case of
having different alternatives for sending a single query to Solr.
Thank you very much.
Best regards.
-- 
A.Nazemian


Re: Geo Aggregations and Search Alerts in Solr

2015-02-24 Thread Charlie Hull

On 24/02/2015 03:03, Richard Gibbs wrote:

Hi There,

I am in the process of choosing a search technology for one of my projects
and I was looking into Solr and Elasticsearch.

Two features that I am more interested are geo aggregations (for map
clustering) and search alerts. Elasticsearch seem to have these two
features built-in.

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/geo-aggs.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

I couldn't find relevant documentation for Solr and therefore not sure
whether these features are readily available in Solr. Can you please let me
know whether these features are available in Solr? If not, whether there
are solutions to achieve same with Solr.


Hi Richard,

I don't know about geo aggregations, although I know the Heliosearch 
guys and others have been working on various facet statistics that may 
impinge on this. http://heliosearch.org/solr-facet-functions/


For alerting, you're talking about storing queries and running them 
against any new document to see if it matches. We do this a lot for 
clients needing large scale media monitoring and auto-classification - 
here's the Lucene-based library we released:

https://github.com/flaxsearch/luwak
Note that this depends on a patched Lucene currently, but I'm very happy 
to say that a client is funding us to merge this back to trunk and we 
expect Luwak to be usable with a 5.x release of Lucene. More news very 
soon! There are a couple of videos on that page that will explain 
further. We suspect our approach is considerably faster than the 
Percolator, and it's on the list to benchmark the two.


Cheers

Charlie



Thank you.




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Integration Tests with SOLR 5

2015-02-24 Thread Thomas Scheffler

Hi,

I noticed that not only does SOLR no longer deliver a WAR file, it also 
advises against providing a custom WAR file for deployment, as future 
versions may depend on custom jetty features.


Until 4.10 we were able to provide a WAR file with all the plug-ins we 
need for easier installs. The same WAR file was used together with a 
web application WAR to run integration tests and to check whether all 
application details still work. We used the cargo-maven2-plugin and 
different servlet containers for testing. I think this is quite a common 
thing to do with continuous integration.


Now I wonder if anyone has a similar setup and with integration tests 
running against SOLR 5.


- No artifacts can be used, so no local repository cache is present
- How to deploy your schema.xml, stopwords, solr plug-ins etc. for 
testing in an isolated environment

- What does a maven boilerplate code look like?

Any ideas would be appreciated.

Kind regards,

Thomas


Re: Is Solr best for did you mean functionality just like Google?

2015-02-24 Thread Yavar Husain
Solr is an IR system where spell correction is a topping; Google, however,
has a team dedicated just to spell corrections. "Did you mean" (a more
general term, much broader than basic spell correctors) and spell correctors
require a plethora of skills. I will just discuss spell correctors here and
not go into "did you mean":

To start with: 1) Edit distances (example: in misspelt 'cax', if 'x' is
replaced by 'r' or 't' it becomes 'car' or 'cat' respectively, which are
probable candidates for your misspelt word; since both are at edit distance
1, you can select the one which occurs more often in your Solr index).
However, you will have to handle the cases where the misspelt word is
already present in your index, say the misspelt token 'cax' occurring 100
times in your index.
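
For the record, the edit distance in point 1 is plain Levenshtein distance,
computable with a small dynamic program; a sketch (class and method names
are made up):

public class EditDistance {
    // classic Levenshtein distance via dynamic programming
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("cax", "car")); // 1
        System.out.println(distance("cax", "cat")); // 1
    }
}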

A good spell corrector requires a lot of features on top of the above

2) Phonetics (sounds of words/metaphone etc.).
3) If you have natural language queries like "The cax ran out of the
house", here 'cat' would be a much more suitable spelling correction for
'cax' than 'car'.
4) Language models play an important role. Think, what is the probability
of getting an 'm' after 'e' and how does it compare with getting a 'z'
after 'e'
5) Your search/http etc. logs will be a good source to improve spell
corrector
6) and you can list several other

You can build a physics based model by taking into account the above
features for recommending the best.

However, rather than working hard on all of the above, there is always a
smarter way out :). One example: look at the terms in your Solr index; the
ones occurring the fewest times can be analyzed for spelling errors.

Cheers,
Yavar

On Mon, Feb 23, 2015 at 9:53 PM, Nitin Solanki nitinml...@gmail.com wrote:

 Hello,
   I am in a difficult situation. I want to do spell/query
 correction functionality. I have 49 GB of indexed data to which I have applied
 the spellchecker. I want to do the same as Google - *did you mean*.
 *Example* - If a user types a question/query which might be misspelt or
 wrongly typed, I need to give them a suggestion like "Did you mean".
 Is Solr best for it?


 Warm Regards,
 Nitin Solanki



Regarding behavior of docValues.

2015-02-24 Thread Modassar Ather
Hi,

Kindly help me understand the behavior of following field.

<field name="manu_exact" type="string" indexed="true" stored="false"
docValues="true" />

For a field like the above where indexed="true" and docValues="true", is it
that:
 1) For sorting/faceting on *manu_exact* the docValues will be used.
 2) For querying on *manu_exact* the inverted index will be used.

Thanks,
Modassar


Re: Regarding behavior of docValues.

2015-02-24 Thread Modassar Ather
Thanks for your response Mikhail.

On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Both statements seem true to me.

 On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com
 wrote:

  Hi,
 
  Kindly help me understand the behavior of following field.
 
   <field name="manu_exact" type="string" indexed="true" stored="false"
   docValues="true" />
  
   For a field like the above where indexed="true" and docValues="true", is it
   that:
    1) For sorting/faceting on *manu_exact* the docValues will be used.
    2) For querying on *manu_exact* the inverted index will be used.
 
  Thanks,
  Modassar
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: How to achieve lemmatization for english words in Solr 4.10.2

2015-02-24 Thread Erlend Garåsen


If you have an English dictionary available containing words with their 
lemmas, you may use my patch:

https://issues.apache.org/jira/browse/LUCENE-6254

This lemmatizer works with Danish, German and Norwegian dictionaries 
which are available for free. I'm not sure there exists a free English 
dictionary which is compatible, but please inform me in case there is 
one. There are probably useful English dictionaries for purchase.


Erlend

On 18.02.15 16:50, dinesh naik wrote:

Hi,
Is there a way to achieve lemmatization in Solr? The stemming option is not
meeting the requirement.





performance issues with geofilt

2015-02-24 Thread dirk.thalheim
Hello,

we are using solr 4.10.1. There are two cores for different use cases with 
around 20 million documents (location descriptions) per core. Each document has 
a geometry field which stores a point and a bbox field which stores a bounding 
box. Both fields are defined with:
<fieldType name="t_geometry" class="solr.SpatialRecursivePrefixTreeFieldType"
   spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
   geo="true" distErrPct="0.025" maxDistErr="0.9"
   units="degrees" />

I'm currently trying to add a location search (find all documents around a 
point). My intention is to add this as filter query, so that the user is able 
to do an additional keyword search. These are the query parameters so far:
q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry 
pt=51.370570625523,12.369290471603 d=1.0}
To sort the documents by their distance to the requested point, I added 
following sort parameter:
sort={!geofilt sort=distance sfield: geometry 
pt=51.370570625523,12.369290471603 d=1.0} asc

Unfortunately I'm experiencing here some major performance/memory problems. The 
first distance query on a core takes over 10 seconds. In my first setup the 
same request to the second core completely blocked the server and caused an 
OutOfMemoryError. I had to increase the memory to 16 GB and now it seems to 
work for the geometry field. Anyhow, the first request after a server restart 
takes some time, and when I try it with the bbox field after a request on the 
geometry field in both cores, the server blocks again.

Can anyone explain why the distance needs so much memory? Can this be optimized?

Kind regards,

Dirk



Re: Setting Up an External ZooKeeper Ensemble

2015-02-24 Thread CKReddy Bhimavarapu
yes

 chaitanya@imart-desktop:~/solr/zookeeper-3.4.6/bin$ ./zkServer.sh start

 JMX enabled by default
 Using config: /home/chaitanya/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg
 Starting zookeeper ... STARTED
 chaitanya@imart-desktop:~/solr/zookeeper-3.4.6/bin$ ./zkServer.sh status
 JMX enabled by default
 Using config: /home/chaitanya/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg
 Error contacting service. It is probably not running

 but I don't get why it is not running
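
(For reference, a minimal standalone zoo.cfg sketch -- values are assumed
defaults, not necessarily the actual config. "Error contacting service" often
means the clientPort is wrong or, for an ensemble, the myid file in dataDir is
missing:)

# conf/zoo.cfg -- minimal standalone setup
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
# For a three-node ensemble, additionally:
# initLimit=10
# syncLimit=5
# server.1=zk1:2888:3888
# server.2=zk2:2888:3888
# server.3=zk3:2888:3888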


On Tue, Feb 24, 2015 at 1:45 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Looks like the ZooKeeper server is either not running or not accepting
 connections, possibly because of some configuration issue. Can you look into
 the ZooKeeper logs and see if there are any exceptions?

 On Tue, Feb 24, 2015 at 11:30 AM, CKReddy Bhimavarapu chaitu...@gmail.com
 
 wrote:

  Hi,
I did follow all the steps in [
 
 
 https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
  ]
  but still I am getting this error
  <b>Waiting to see Solr listening on port 8983 [-]  Still not seeing Solr
  listening on 8983 after 30 seconds!</b>
  WARN  - 2015-02-24 05:50:19.161;
  org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null,
  unexpected error, closing socket connection and attempting reconnect
  java.net.ConnectException: Connection refused
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
  at
 
 
 org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
  at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
  WARN  - 2015-02-24 05:50:20.262;
  org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null,
  unexpected error, closing socket connection and attempting reconnect
 
 
  Where am I going wrong?
 
  --
  ckreddybh. chaitu...@gmail.com
 



 --
 Regards,
 Shalin Shekhar Mangar.




-- 
ckreddybh. chaitu...@gmail.com


Re: Regarding behavior of docValues.

2015-02-24 Thread Mikhail Khludnev
Both statements seem true to me.

On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com
wrote:

 Hi,

 Kindly help me understand the behavior of following field.

 <field name="manu_exact" type="string" indexed="true" stored="false"
 docValues="true" />

 For a field like above where indexed=true and docValues=true, is it
 that:
  1) For sorting/faceting on *manu_exact* the docValues will be used.
  2) For querying on *manu_exact* the inverted index will be used.

 Thanks,
 Modassar




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Override freq field from custom field in Suggestions

2015-02-24 Thread Nitin Solanki
Hello,
 I have a scenario where I want to use my own custom field
instead of freq in the suggestions for each term. The custom field will be an
integer value and will hold a different value than freq in the suggestions.
Is it possible in Solr to use a custom field instead of freq in suggestions?
Your help is appreciated.


Thanks and Regards,
 Nitin Solanki.


Re: Sorting on multi-valued field

2015-02-24 Thread Nitin Solanki
Hi Peri,
  You cannot sort on a multi-valued field. multiValued should be set to
false.

On Tue, Feb 24, 2015 at 8:07 PM, Peri Subrahmanya 
peri.subrahma...@htcinc.com wrote:

 All,

 Is there a way sorting can work on a multi-valued field or does it always
 have to be “false” for it to work.

 Thanks
 -Peri

 *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended
 recipient, please delete without copying and kindly advise us by e-mail of
 the mistake in delivery.
 NOTE: Regardless of content, this e-mail shall not operate to bind HTC
 Global Services to any order or other contract unless pursuant to explicit
 written agreement or government initiative expressly permitting the use of
 e-mail for such purpose.





Re: [ANNOUNCE] Luke 4.10.3 released

2015-02-24 Thread Tomoko Uchida
Hi Dmitry,

Thank you for the detailed clarification!

Recently, I've created a few patches for the Pivot version (LUCENE-2562), so I'd
like to do some more work and keep it up to date.

 If you would like to work on the Pivot version, may I suggest you to fork
 the github's version? The ultimate goal is to donate this to Apache, but
at
 least we will have the common plate. :)

Yes, I love the idea of having a common code base.
I've looked at both code bases, github's (thinlet's) and Pivot's; the Pivot
version has a very different structure from github's (I think that is mainly
due to the UI framework's requirements).
So it seems difficult to directly fork github's version to develop the
Pivot version..., but I think I (or any other developers) could catch up with
changes in github's version.
There's a long way to go for the Pivot version; of course, I'd also like to
make pull requests to enhance github's version if I can.

Thanks,
Tomoko

2015-02-24 23:34 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

 Hi, Tomoko!

 Thanks for being a fan of luke!

 Current status of github's luke (https://github.com/DmitryKey/luke) is
 that
 it has releases for all the major lucene versions since 4.3.0, excluding
 4.4.0 (luke 4.5.0 should be able open indices of 4.4.0) and the latest --
 5.0.0.

 Porting the github's luke to ALv2 compliant framework (GWT or Pivot) is a
 long standing goal. With GWT I had issues related to listing and reading
 the index directory. So this effort has been parked. Most recently I have
 been approaching the Pivot. Mark Miller has done an initial port, that I
 took as the basis. I'm hoping to continue on this track as time permits.


 If you would like to work on the Pivot version, may I suggest you to fork
 the github's version? The ultimate goal is to donate this to Apache, but at
 least we will have the common plate. :)


 Thanks,
 Dmitry

 On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida 
 tomoko.uchida.1...@gmail.com
  wrote:

  Hi,
 
  I'm an user / fan of Luke, so deeply appreciate your work.
 
  I've carefully read the readme, noticed the (one of) project's goal:
  To port the thinlet UI to an ASL compliant license framework so that it
  can be contributed back to Apache Lucene. Current work is done with GWT
  2.5.1.
 
  There has been GWT based, ASL compliant Luke supporting the latest
 Lucene ?
 
  I've recently got in with LUCENE-2562. Currently, Apache Pivot based port
  is going. But I do not know so much about Luke's long (and may be
 slightly
  complex) history, so I would grateful if anybody clear the association of
  the Luke project (now on Github) and the Jira issue. Or, they can be
  independent of each other.
  https://issues.apache.org/jira/browse/LUCENE-2562
  I don't have any opinions, just want to understand current status and
 avoid
  duplicate works.
 
  Apologize for a bit annoying post.
 
  Many thanks,
  Tomoko
 
 
 
  2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com:
 
   Hello,
  
   Luke 4.10.3 has been released. Download it here:
  
   https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3
  
   The release has been tested against the solr-4.10.3 based index.
  
   Issues fixed in this release: #13
   https://github.com/DmitryKey/luke/pull/13
   Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2
  
   Thanks to respective contributors!
  
  
   P.S. waiting for lucene 5.0 artifacts to hit public maven repositories
  for
   the next major release of luke.
  
   --
   Dmitry Kan
   Luke Toolbox: http://github.com/DmitryKey/luke
   Blog: http://dmitrykan.blogspot.com
   Twitter: http://twitter.com/dmitrykan
   SemanticAnalyzer: www.semanticanalyzer.info
  
 



 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info



Re: Sorting on multi-valued field

2015-02-24 Thread Mikhail Khludnev
fwiw,

open solr jira https://issues.apache.org/jira/browse/SOLR-2522 pls vote
however, everything seems done at lucene level
https://issues.apache.org/jira/browse/LUCENE-5454


On Tue, Feb 24, 2015 at 6:11 PM, Nitin Solanki nitinml...@gmail.com wrote:

 Hi Peri,
   You cannot do sort on multi-valued field. It should be set to
 false.

 On Tue, Feb 24, 2015 at 8:07 PM, Peri Subrahmanya 
 peri.subrahma...@htcinc.com wrote:

  All,
 
  Is there a way sorting can work on a multi-valued field or does it always
  have to be “false” for it to work.
 
  Thanks
  -Peri
 
  *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended
  recipient, please delete without copying and kindly advise us by e-mail
 of
  the mistake in delivery.
  NOTE: Regardless of content, this e-mail shall not operate to bind HTC
  Global Services to any order or other contract unless pursuant to
 explicit
  written agreement or government initiative expressly permitting the use
 of
  e-mail for such purpose.
 
 
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: snapinstaller does not start newSearcher

2015-02-24 Thread alxsss
Hello,

We cannot use replication with the current architecture, so decided to use 
snapshotter with snapinstaller.

Here is the full stack trace

8937 [coreLoadExecutor-5-thread-3] INFO  
org.apache.solr.core.CachingDirectoryFactory  – Closing directory: 
/home/solr/solr-4.10.1/solr/example/solr/product/data
8938 [coreLoadExecutor-5-thread-3] ERROR org.apache.solr.core.CoreContainer  – 
Error creating core [product]: Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.init(SolrCore.java:873)
at org.apache.solr.core.SolrCore.init(SolrCore.java:646)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:255)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:249)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
at org.apache.solr.core.SolrCore.init(SolrCore.java:845)
... 9 more
Caused by: java.nio.file.NoSuchFileException: 
/home/solr/solr-4.10.1/solr/example/solr/product/data/index/segments_4
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:176)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:334)
at 
org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
at 
org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:198)
at 
org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:341)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:792)
at 
org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:77)
at 
org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
at 
org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279)
at 
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528)
... 11 more
8943 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter  – 
user.dir=/home/solr/solr-4.10.1/solr/example
8943 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter  – 
SolrDispatchFilter.init() done
8982 [main] INFO  org.eclipse.jetty.server.AbstractConnector  – Started 
SocketConnector@0.0.0.0:8983

Thanks.
Alex.

-Original Message-
From: Shalin Shekhar Mangar shalinman...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 12:13 am
Subject: Re: snapinstaller does not start newSearcher


Do you mean the snapinstaller (bash) script? Those are legacy scripts.
It's been a long time since they were tested. The ReplicationHandler is
the recommended way to set up replication. If you want to take a snapshot,
the replication handler has an HTTP-based API which lets you do that.

In any case, do you have the full stack trace for that exception?
There should be another cause nested under it.

On Tue, Feb 24, 2015 at 12:47 PM, alx...@aim.com wrote:

 Hello,

 I am using the latest solr (solr trunk). I run snapinstaller, and see that it
 copies the snapshot to the index folder, but changes are not picked up, and
 the logs in the slave after running snapinstaller are

 44302 [qtp1312571113-14] INFO  org.apache.solr.update.UpdateHandler  – start
 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
 44303 [qtp1312571113-14] INFO  org.apache.solr.update.UpdateHandler  – No
 uncommitted changes. Skipping IW.commit.
 44304 [qtp1312571113-14] INFO  org.apache.solr.core.SolrCore  –
 SolrIndexSearcher has not 

Re: Basic Multilingual search capability

2015-02-24 Thread Alexandre Rafalovitch
Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain
during indexing and route out at least known problematic languages,
such as Chinese, Japanese, Arabic into individual fields
2) Put everything else together into one field with ICUTokenizer,
maybe also ICUFoldingFilter
3) At the very end of that joint filter, stick in a LengthFilter with
some high number, e.g. 25 characters max. This will ensure that
super-long words from non-space languages and edge conditions do not
break the rest of your system. A field type sketch follows below.
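
(A minimal sketch of such a field type, assuming the analysis-extras contrib
jars for ICU are on the classpath; the name and the max length are
illustrative:)

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization across scripts -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- case folding, accent folding and Unicode normalization -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- drop pathological super-long tokens -->
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
  </analyzer>
</fieldType>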


Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:
 I understand relevancy, stemming etc becomes extremely complicated with 
 multilingual support, but our first goal is to be able to tokenize and 
 provide basic search capability for any language. Ex: When the document 
 contains hello or здравствуйте, the analyzer creates tokens and provides 
 exact match search results.


Re: Integration Tests with SOLR 5

2015-02-24 Thread Dmitry Kan
Hi Thomas,

I just downloaded solr5.0.0 tgz and found this in the directory structure:

solr-5.0.0/server/webapps$ ls
solr.war

 - How to deploy your schema.xml, stopwords, solr plug-ins etc. for
testing in an isolated environment
The cores, for example, are created in:

solr-5.0.0/server/solr/core0$ ls
conf  core.properties  data

It is the same native Solr directory layout for the core. If you need custom
libraries (plugins etc.), put them into the lib directory.
The conf/ directory is what you are used to from before: it contains
schema.xml, solrconfig.xml etc.
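
(A sketch of the per-core layout under core discovery -- the core name is
illustrative, not from this thread:)

server/solr/core0/
    core.properties    -- can be as small as a single line: name=core0
    conf/              -- schema.xml, solrconfig.xml, stopwords.txt, ...
    data/              -- index files, created by Solr itself
    lib/               -- custom plugin jars, loaded from the core's instance dir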

HTH,
Dmitry

On Tue, Feb 24, 2015 at 10:16 AM, Thomas Scheffler 
thomas.scheff...@uni-jena.de wrote:

 Hi,

 I noticed that not only SOLR does not deliver a WAR file anymore but also
 advices not to try to provide a custom WAR file that can be deployed
 anymore as future version may depend on custom jetty features.

 Until 4.10. we were able to provide a WAR file with all the plug-ins we
 need for easier installs. The same WAR file was used together with an web
 application WAR running integration tests and to check if all application
 details still work. We used the cargo-mave2-plugin and different servlet
 container for testing. I think this is quiet common thing to do with
 continuous integration.

 Now I wonder if anyone has a similar setup and with integration tests
 running against SOLR 5.

 - No artifacts can be used, so no local repository cache is present
 - How to deploy your schema.xml, stopwords, solr plug-ins etc. for testing
 in an isolated environment
 - What does a maven boilerplate code look like?

 Any ideas would be appreciated.

 Kind regards,

 Thomas




-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Integration Tests with SOLR 5

2015-02-24 Thread Shawn Heisey
On 2/24/2015 1:16 AM, Thomas Scheffler wrote:
 I noticed that not only SOLR does not deliver a WAR file anymore but
 also advices not to try to provide a custom WAR file that can be
 deployed anymore as future version may depend on custom jetty features.
 
 Until 4.10. we were able to provide a WAR file with all the plug-ins we
 need for easier installs. The same WAR file was used together with an
 web application WAR running integration tests and to check if all
 application details still work. We used the cargo-mave2-plugin and
 different servlet container for testing. I think this is quiet common
 thing to do with continuous integration.
 
 Now I wonder if anyone has a similar setup and with integration tests
 running against SOLR 5.
 
 - No artifacts can be used, so no local repository cache is present
 - How to deploy your schema.xml, stopwords, solr plug-ins etc. for
 testing in an isolated environment
 - What does a maven boilerplate code look like?

I don't know anything at all about Maven.

For now, Solr 5.x *is* still deployed as a .war file, which you can find
in the download in the server/webapps directory ... but the plan is to
eventually create a standalone application for Solr instead of running
Jetty.  Those plans are expected to happen during the 5.x timeframe,
which is why the documentation advises against relying on the .war file.
 It eventually *will* disappear, and it's important to prepare the
userbase for that well in advance of the actual implementation.

I have a custom install based on the jetty included with Solr 4.x.  When
I upgrade, I will continue to use the war as long as it is available,
and when the standalone app appears, I will either reconfigure my init
script or see about switching over to the script included with the 5.x
download.

Thanks,
Shawn



Re: apache solr - dovecot - some search fields works some dont

2015-02-24 Thread Alexandre Rafalovitch
What specifically do you mean by stall? Very slow but comes back?
Never comes back? Throws an error?

What is your field definition for body? How big is the content in it?
Do you change the fields returned if you search body and if you search
just headers?
How many rows do you request back?


One hypothesis: You are storing (stored=true) your body, it is very
large and the stall happens not during search but during reading very
large amount of text from disk to reconstitute the body to send it
back.

Regards,
   Alex.


Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 24 February 2015 at 02:06, Kevin Laurie superinterstel...@gmail.com wrote:
 For example if I were to search To and From apache solr would process it in
 its log and give me an output, however if I were to search something in the
 Body it would stall and no output.


How make Searching fast in spell checking

2015-02-24 Thread Nitin Solanki
Hello all,
             I have 49 GB of indexed data and I am working on spell checking.
I have applied ShingleFilter on both the index and query side, take 25
suggestions for each word in the query, and do not use collations.
When I search a phrase (5-6 words, e.g. barack obama is president of
America), it takes 2 to 3 seconds to process, while searching a single
term (e.g. barack) takes only 0.23 seconds, which is good.
Why is phrase checking taking so much time? Am I doing something wrong? Any help
on this?
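
(A sketch of the kind of request described, assuming a /spell handler wired to
the SpellCheckComponent; handler and core names are hypothetical:)

http://localhost:8983/solr/collection1/spell?q=barack obama is president of America
    &spellcheck=true&spellcheck.count=25&spellcheck.collate=false

With 25 suggestions requested per term, a six-word phrase fans out to many
times the work of a single-term check, which is one plausible reason for the gap.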


Re: [ANNOUNCE] Luke 4.10.3 released

2015-02-24 Thread Dmitry Kan
Hi, Tomoko!

Thanks for being a fan of luke!

Current status of github's luke (https://github.com/DmitryKey/luke) is that
it has releases for all the major lucene versions since 4.3.0, excluding
4.4.0 (luke 4.5.0 should be able open indices of 4.4.0) and the latest --
5.0.0.

Porting the github's luke to ALv2 compliant framework (GWT or Pivot) is a
long standing goal. With GWT I had issues related to listing and reading
the index directory. So this effort has been parked. Most recently I have
been approaching the Pivot. Mark Miller has done an initial port, that I
took as the basis. I'm hoping to continue on this track as time permits.


If you would like to work on the Pivot version, may I suggest you to fork
the github's version? The ultimate goal is to donate this to Apache, but at
least we will have the common plate. :)


Thanks,
Dmitry

On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com
 wrote:

 Hi,

 I'm an user / fan of Luke, so deeply appreciate your work.

 I've carefully read the readme, noticed the (one of) project's goal:
 To port the thinlet UI to an ASL compliant license framework so that it
 can be contributed back to Apache Lucene. Current work is done with GWT
 2.5.1.

 There has been GWT based, ASL compliant Luke supporting the latest Lucene ?

 I've recently got in with LUCENE-2562. Currently, Apache Pivot based port
 is going. But I do not know so much about Luke's long (and may be slightly
 complex) history, so I would grateful if anybody clear the association of
 the Luke project (now on Github) and the Jira issue. Or, they can be
 independent of each other.
 https://issues.apache.org/jira/browse/LUCENE-2562
 I don't have any opinions, just want to understand current status and avoid
 duplicate works.

 Apologize for a bit annoying post.

 Many thanks,
 Tomoko



 2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

  Hello,
 
  Luke 4.10.3 has been released. Download it here:
 
  https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3
 
  The release has been tested against the solr-4.10.3 based index.
 
  Issues fixed in this release: #13
  https://github.com/DmitryKey/luke/pull/13
  Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2
 
  Thanks to respective contributors!
 
 
  P.S. waiting for lucene 5.0 artifacts to hit public maven repositories
 for
  the next major release of luke.
 
  --
  Dmitry Kan
  Luke Toolbox: http://github.com/DmitryKey/luke
  Blog: http://dmitrykan.blogspot.com
  Twitter: http://twitter.com/dmitrykan
  SemanticAnalyzer: www.semanticanalyzer.info
 




-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Sorting on multi-valued field

2015-02-24 Thread Peri Subrahmanya
All,

Is there a way sorting can work on a multi-valued field or does it always have 
to be “false” for it to work.

Thanks
-Peri

*** DISCLAIMER *** This is a PRIVATE message. If you are not the intended 
recipient, please delete without copying and kindly advise us by e-mail of the 
mistake in delivery.
NOTE: Regardless of content, this e-mail shall not operate to bind HTC Global 
Services to any order or other contract unless pursuant to explicit written 
agreement or government initiative expressly permitting the use of e-mail for 
such purpose.




Re: [ANNOUNCE] Luke 4.10.3 released

2015-02-24 Thread Tomoko Uchida
Hi,

I'm an user / fan of Luke, so deeply appreciate your work.

I've carefully read the readme, noticed the (one of) project's goal:
To port the thinlet UI to an ASL compliant license framework so that it
can be contributed back to Apache Lucene. Current work is done with GWT
2.5.1.

There has been GWT based, ASL compliant Luke supporting the latest Lucene ?

I've recently got in with LUCENE-2562. Currently, Apache Pivot based port
is going. But I do not know so much about Luke's long (and may be slightly
complex) history, so I would grateful if anybody clear the association of
the Luke project (now on Github) and the Jira issue. Or, they can be
independent of each other.
https://issues.apache.org/jira/browse/LUCENE-2562
I don't have any opinions, just want to understand current status and avoid
duplicate works.

Apologize for a bit annoying post.

Many thanks,
Tomoko



2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

 Hello,

 Luke 4.10.3 has been released. Download it here:

 https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3

 The release has been tested against the solr-4.10.3 based index.

 Issues fixed in this release: #13
 https://github.com/DmitryKey/luke/pull/13
 Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2

 Thanks to respective contributors!


 P.S. waiting for lucene 5.0 artifacts to hit public maven repositories for
 the next major release of luke.

 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info



Dynamic boosting on a document for Solr4.10.2

2015-02-24 Thread dinesh naik
Hi,
We are looking for an option to boost a document while indexing, based on
the values of certain fields.

For example, let's say we have 10 documents with fields such as name, acc no,
status, age, address etc.

Now for documents with status 'Active' we want to boost by a value of 1000, and
if status is 'Closed' we want a negative boost, say -100. Also, if age
is between 20 and 50 we want to boost by 2000, etc.

Please let us know how we can achieve this.
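
(One common query-time approach -- a sketch, not from this thread; the field
names are from the example above, the values and handler are illustrative.
Lucene does not allow negative boosts, so the usual trick is to positively
boost everything except the unwanted value:)

q=...&defType=edismax
    &bq=status:Active^1000
    &bq=(*:* -status:Closed)^100
    &bq=age:[20 TO 50]^2000

Query-time boosting also avoids reindexing whenever the boost rules change,
which index-time document boosts would require.
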
-- 
Best Regards,
Dinesh Naik


Re: Sorting on multi-valued field

2015-02-24 Thread Alexandre Rafalovitch
The usual strategy is to have an UpdateRequestProcessor chain that
will copy the field and keep only one value from it, specifically for
sort. There is a whole collection of URPs to help you choose which
value to keep, as well as how to provide a default.

You can see the full list at:
http://www.solr-start.com/info/update-request-processors/#FieldValueSubsetUpdateProcessorFactory

Also, if you are on the recent Solr, consider enabling docValues on
that target single-value field, it's better for sorting. You can have
other flags (stored,indexed) set to false, as you will not be using
the field for anything else.
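
(A minimal sketch of such a chain; the field names categories /
categories_sort are hypothetical, and the chain still has to be referenced
from your update handler via update.chain:)

<updateRequestProcessorChain name="sort-copy">
  <!-- copy the multi-valued field into a dedicated sort field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">categories</str>
    <str name="dest">categories_sort</str>
  </processor>
  <!-- keep only the maximum value, making the copy effectively single-valued -->
  <processor class="solr.MaxFieldValueUpdateProcessorFactory">
    <str name="fieldName">categories_sort</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>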

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 February 2015 at 09:37, Peri Subrahmanya
peri.subrahma...@htcinc.com wrote:
 All,

 Is there a way sorting can work on a multi-valued field or does it always 
 have to be “false” for it to work.

 Thanks
 -Peri

 *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended 
 recipient, please delete without copying and kindly advise us by e-mail of 
 the mistake in delivery.
 NOTE: Regardless of content, this e-mail shall not operate to bind HTC Global 
 Services to any order or other contract unless pursuant to explicit written 
 agreement or government initiative expressly permitting the use of e-mail for 
 such purpose.




Re: Special character and wildcard matching

2015-02-24 Thread Jack Krupansky
Please post the info I requested - the exact query, and the Solr response.

-- Jack Krupansky

On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan arunrangara...@gmail.com
wrote:

 In our case, the lower-casing is happening in a custom Java indexer code,
 via Java's String.toLowerCase() method.

 I used the analysis tool in Solr admin (with Jetty). I believe the raw
 bytes explain this.

 Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
 beyoncé in file beyonce_with_spl_chars.JPG.

 Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
 Raw bytes for beyoncé:[62 65 79 6f 6e 63 65 cc 81]

 So when you look at the bytes, it seems to explain why beyonce* matches
 beyoncé.

 I tried your approach with a KeywordTokenizer followed by a
 LowerCaseFilter, but I see the same behavior.



 On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky jack.krupan...@gmail.com
 wrote:

 But how is that lowercasing occurring? I mean, solr.StrField doesn't do
 that.

 Some containers default to automatically mapping accented characters, so
 that the accented e would then get indexed as a normal e, and then
 your
 wildcard would match it, and an accented e in a query would get mapped
 as
 well and then match the normal e in the index. What does your query
 response look like?

 This blog post explains that problem:
 http://bensch.be/tomcat-solr-and-special-characters

 Note that you could make your string field a text field with the keyword
 tokenizer and then filter it for lower case, such as when the user query
 might have a capital B. String field is most appropriate when the field
 really is 100% raw.


 -- Jack Krupansky

 On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan 
 arunrangara...@gmail.com
 wrote:

  Yes, it is a string field and not a text field.
 
   <fieldType name="string" class="solr.StrField" sortMissingLast="true"
   omitNorms="true"/>
   <field name="raw_name" type="string" indexed="true" stored="true" />
 
  Lower-casing done to do case-insensitive matching.
 
  On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky 
 jack.krupan...@gmail.com
  wrote:
 
   Is it really a string field - as opposed to a text field? Show us the
  field
   and field type.
  
   Besides, if it really were a raw name, wouldn't that be a capital
 B?
  
   -- Jack Krupansky
  
   On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan 
  arunrangara...@gmail.com
   
   wrote:
  
I have a string field raw_name like this in my document:
   
{raw_name: beyoncé}
   
(Notice that the last character is a special character.)
   
When I issue this wildcard query:
   
q=raw_name:beyonce*
   
i.e. with the last character simply being the ASCII 'e', Solr
 returns
  me
the above document.
   
How do I prevent this?
   
  
 





Re: highlighting the boolean query

2015-02-24 Thread Erik Hatcher
BooleanQuery’s extractTerms looks like this:

public void extractTerms(Set<Term> terms) {
  for (BooleanClause clause : clauses) {
    if (clause.isProhibited() == false) {
      clause.getQuery().extractTerms(terms);
    }
  }
}
that’s generally the method called by the Highlighter for what terms should be 
highlighted.  So even if a term didn’t match the document, the query that the 
term was in matched the document and it just blindly highlights all the terms 
(minus prohibited ones).   That at least explains the behavior you’re seeing, 
but it’s not ideal.  I’ve seen specialized highlighters that convert to spans, 
which are accurate to the exact matches within the document.  Been a while 
since I dug into the HighlightComponent, so maybe there’s some other options 
available out of the box?

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com http://www.lucidworks.com/




 On Feb 24, 2015, at 3:16 AM, Dmitry Kan solrexp...@gmail.com wrote:
 
 Erick,
 
 Our default operator is AND.
 
 Both queries below parse the same:
 
 a OR (b c) OR d
 a OR (b AND c) OR d
 
 The parsed query:
 
 <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
 Contents:d</str>
 
 So this part is consistent with our expectation.
 
 
 I'm a bit puzzled by your statement that c didn't contribute to the
 score.
 what I meant was that the term c was not hit by the scorer: the explain
 section does not refer to it. I'm using the made up terms here, but the
 concept holds.
 
 The code suggests that we could benefit from storing term offsets and
 positions:
 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470
 
 Is it correct assumption?
 
 On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
 Highlighting is such a pain...
 
 what does the parsed query look like? If the default operator is OR,
 then this seems correct as both 'd' and 'c' appear in the doc. So
 I'm a bit puzzled by your statement that c didn't contribute to the
 score.
 
 If the parsed query is, indeed
 a +b +c d
 
 then it does look like something with the highlighter. Whether other
 highlighters are better for this case.. no clue ;(
 
 Best,
 Erick
 
 On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote:
 Erick,
 
 nope, we are using std lucene qparser with some customizations, that do
 not
 affect the boolean query parsing logic.
 
 Should we try some other highlighter?
 
 On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com
 
 wrote:
 
 Are you using edismax?
 
 On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com
 wrote:
 Hello!
 
 In solr 4.3.1 there seem to be some inconsistency with the
 highlighting
 of
 the boolean query:
 
 a OR (b c) OR d
 
 This returns a proper hit, which shows that only d was included into
 the
 document score calculation.
 
  But the highlighter returns both d and c in <em> tags.
 
 Is this a known issue of the standard highlighter? Can it be
 mitigated?
 
 
 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info
 
 
 
 
 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info
 
 
 
 
 -- 
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info



Re: apache solr - dovecot - some search fields works some dont

2015-02-24 Thread Kevin Laurie
Dear Alex,
Nothing comes back when I do a body search. It shows a searching
process on the client, but then it just stops and no result comes up.
I am wondering if this is a schema-related problem.

When I search a subject on the mail client I get output as below:

8025 [main] INFO  org.eclipse.jetty.server.AbstractConnector  ?
Started SocketConnector@0.0.0.0:8983
9001 [searcherExecutor-6-thread-1] INFO  org.apache.solr.core.SolrCore
 ? [collection1] Registered new searcher Searcher@7dfcb28[collection1]
main{StandardDirectoryReader(segments_4g:789:nrt _6z(4.10.2):C16672
_44(4.10.2):C6996 _56(4.10.2):C3672 _64(4.10.2):C4000
_8y(4.10.2):C3143 _7v(4.10.2):C673 _7b(4.10.2):C830 _85(4.10.2):C3754
_7k(4.10.2):C3975 _8f(4.10.2):C1516 _7n(4.10.2):C67 _9a(4.10.2):C677
_8o(4.10.2):C38 _8v(4.10.2):C40 _9l(4.10.2):C2705 _8x(4.10.2):C43
_90(4.10.2):C16 _9b(4.10.2):C22 _9d(4.10.2):C44 _9f(4.10.2):C84
_9h(4.10.2):C83 _9i(4.10.2):C356 _9j(4.10.2):C84 _9k(4.10.2):C296
_9m(4.10.2):C83 _9n(4.10.2):C57)}
155092 [qtp433527567-13] INFO  org.apache.solr.core.SolrCore  ?
[collection1] webapp=/solr path=/select
params={sort=uid+asc&fl=uid,score&q=subject:price&fq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:u...@domain.net&rows=107178}
hits=1237 status=0 QTime=1918

The content is quite large, 27,000 emails.

Could you advise what this problem could be?
How do we correct and fix it?

I might have the wrong schema installed, so the body search is not
working. Could this be it?
I might post this on the dovecot list to see if someone could answer this.

Kindly advise if you have any idea on this.

P.S. How do I check the body definition?
Thanks
Kevin

On Tue, Feb 24, 2015 at 9:36 PM, Alexandre Rafalovitch
arafa...@gmail.com wrote:
 What specifically do you mean by stall? Very slow but comes back?
 Never comes back? Throws an error?

 What is your field definition for body? How big is the content in it?
 Do you change the fields returned if you search body and if you search
 just headers?
 How many rows do you request back?


 One hypothesis: You are storing (stored=true) your body, it is very
 large and the stall happens not during search but during reading very
 large amount of text from disk to reconstitute the body to send it
 back.

 Regards,
Alex.

 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/

 On 24 February 2015 at 02:06, Kevin Laurie superinterstel...@gmail.com 
 wrote:
 For example if I were to search To and From apache solr would process it in
 its log and give me an output, however if I were to search something in the
 Body it would stall and no output.


Re: apache solr - dovecot - some search fields works some dont

2015-02-24 Thread Alexandre Rafalovitch
Look for the line like this in your log with the search matching the
body. Maybe put a nonsense string and look for that. This should tell
you what the Solr-side search looks like.

The thing that worries me here is: rows=107178 - that's most probably
what's blowing up Solr. You should be paging, not getting everything.
And that number being like that, it may mean your client makes two
requests, once to get the result count and once to get the rows
themselves. It's the second request that is most probably blowing up.

Once you get the request, you should be able to tell what fields are
being searched and check those fields in schema.xml for field type and
then field type's definition. Which is what I asked for in the
previous email.

Regards,
   Alex.

On 24 February 2015 at 11:55, Kevin Laurie superinterstel...@gmail.com wrote:
 155092 [qtp433527567-13] INFO  org.apache.solr.core.SolrCore  ?
 [collection1] webapp=/solr path=/select
 params={sort=uid+ascfl=uid,scoreq=subject:pricefq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:u...@domain.netrows=107178}
 hits=1237 status=0 QTime=1918




Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Re: apache solr - dovecot - some search fields works some dont

2015-02-24 Thread Kevin Laurie
Dear Alex,
I checked the log. When searching the fields From, To, and Subject, it
records them.
When searching Body, there is no log entry. I am assuming it is a problem
in the schema.

Will post schema.xml output in next mail.

On Wed, Feb 25, 2015 at 1:09 AM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 Look for the line like this in your log with the search matching the
 body. Maybe put a nonsense string and look for that. This should tell
 you what the Solr-side search looks like.

 The thing that worries me here is: rows=107178 - that's most probably
 what's blowing up Solr. You should be paging, not getting everything.
 And that number being like that, it may mean your client makes two
 requests, once to get the result count and once to get the rows
 themselves. It's the second request that is most probably blowing up.

 Once you get the request, you should be able to tell what fields are
 being searched and check those fields in schema.xml for field type and
 then field type's definition. Which is what I asked for in the
 previous email.

 Regards,
Alex.

 On 24 February 2015 at 11:55, Kevin Laurie superinterstel...@gmail.com
 wrote:
  155092 [qtp433527567-13] INFO  org.apache.solr.core.SolrCore  ?
  [collection1] webapp=/solr path=/select
 
 params={sort=uid+ascfl=uid,scoreq=subject:pricefq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:
 u...@domain.netrows=107178}
  hits=1237 status=0 QTime=1918



 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/



Re: apache solr - dovecot - some search fields works some dont

2015-02-24 Thread Kevin Laurie
Hi Alex,
Sorry for such a noob question,
but where does the schema file go in Solr? Is the directory below correct?
/opt/solr/solr/collection1/data
Thanks
Kevin

On Wed, Feb 25, 2015 at 1:21 AM, Kevin Laurie
superinterstel...@gmail.com wrote:
 Dear Alex,
 I checked the log. When searching the fields From , To, Subject. It records
 it
 When searching Body, there is no log showing. I am assuming it is a problem
 in the schema.

 Will post schema.xml output in next mail.

 On Wed, Feb 25, 2015 at 1:09 AM, Alexandre Rafalovitch arafa...@gmail.com
 wrote:

 Look for the line like this in your log with the search matching the
 body. Maybe put a nonsense string and look for that. This should tell
 you what the Solr-side search looks like.

 The thing that worries me here is: rows=107178 - that's most probably
 what's blowing up Solr. You should be paging, not getting everything.
 And that number being like that, it may mean your client makes two
 requests, once to get the result count and once to get the rows
 themselves. It's the second request that is most probably blowing up.

 Once you get the request, you should be able to tell what fields are
 being searched and check those fields in schema.xml for field type and
 then field type's definition. Which is what I asked for in the
 previous email.

 Regards,
Alex.

 On 24 February 2015 at 11:55, Kevin Laurie superinterstel...@gmail.com
 wrote:
  155092 [qtp433527567-13] INFO  org.apache.solr.core.SolrCore  ?
  [collection1] webapp=/solr path=/select
 
  params={sort=uid+ascfl=uid,scoreq=subject:pricefq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:u...@domain.netrows=107178}
  hits=1237 status=0 QTime=1918



 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/




Re: highlighting the boolean query

2015-02-24 Thread Erick Erickson
Hmmm, not quite sure what to say. Offsets and positions help,
particularly with FastVectorHighlighter, but the highlighting is
usually re-analyzed anyway so it _shouldn't_ matter. But what I don't
know about highlighting could fill volumes ;)..

Sorry I can't be more help here.
Erick

On Tue, Feb 24, 2015 at 12:16 AM, Dmitry Kan solrexp...@gmail.com wrote:
 Erick,

 Our default operator is AND.

 Both queries below parse the same:

 a OR (b c) OR d
 a OR (b AND c) OR d

 The parsed query:

  <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
  Contents:d</str>

 So this part is consistent with our expectation.


 I'm a bit puzzled by your statement that c didn't contribute to the
 score.
  what I meant was that the term c was not hit by the scorer: the explain
 section does not refer to it. I'm using the made up terms here, but the
 concept holds.

 The code suggests that we could benefit from storing term offsets and
 positions:
 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470

 Is it correct assumption?

 On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Highlighting is such a pain...

 what does the parsed query look like? If the default operator is OR,
 then this seems correct as both 'd' and 'c' appear in the doc. So
 I'm a bit puzzled by your statement that c didn't contribute to the
 score.

 If the parsed query is, indeed
 a +b +c d

 then it does look like something with the highlighter. Whether other
 highlighters are better for this case.. no clue ;(

 Best,
 Erick

 On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote:
  Erick,
 
  nope, we are using std lucene qparser with some customizations, that do
 not
  affect the boolean query parsing logic.
 
  Should we try some other highlighter?
 
  On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  Are you using edismax?
 
  On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com
 wrote:
   Hello!
  
   In solr 4.3.1 there seem to be some inconsistency with the
 highlighting
  of
   the boolean query:
  
   a OR (b c) OR d
  
   This returns a proper hit, which shows that only d was included into
 the
   document score calculation.
  
    But the highlighter returns both d and c in <em> tags.
  
   Is this a known issue of the standard highlighter? Can it be
 mitigated?
  
  
   --
   Dmitry Kan
   Luke Toolbox: http://github.com/DmitryKey/luke
   Blog: http://dmitrykan.blogspot.com
   Twitter: http://twitter.com/dmitrykan
   SemanticAnalyzer: www.semanticanalyzer.info
 
 
 
 
  --
  Dmitry Kan
  Luke Toolbox: http://github.com/DmitryKey/luke
  Blog: http://dmitrykan.blogspot.com
  Twitter: http://twitter.com/dmitrykan
  SemanticAnalyzer: www.semanticanalyzer.info




 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info


Re: Query: no result returned if use AND OR operators

2015-02-24 Thread Erick Erickson
You're probably hitting different request handlers. From the fragment
you posted, the one that returns 8 is going to the /browse handler
(see solrconfig.xml). The admin UI goes to either /select or /query.
These are configured totally differently in terms of what fields are
searched etc.

Attach &debug=query to the URL and look in Velocity for the debug
info and you'll see that that parsed query is significantly different.

At least that's my guess.

Best,
Erick

On Mon, Feb 23, 2015 at 11:01 PM, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
 Hi,

 My Solr is 4.10.2

 When I use the web UI to run a simple query: 1+AND+2


 1) from the log, I can see the hits=8
 7629109 [qtp1702388274-16] INFO  org.apache.solr.core.SolrCore  – [infocast] 
 webapp=/solr path=/clustering 
 params={q=1+AND+2&wt=velocity&v.template=cluster_results} hits=8 
 status=0 QTime=21

 However, from the query page, it returns
 2) 0 results found in 5 ms Page 0 of 0
   0 results found. Page 0 of 0


 3) If I use the Admin page to run the query, I can get 3 back

 {
   "responseHeader": {
     "status": 0,
     "QTime": 5,
     "params": {
       "indent": "true",
       "q": "\"1\" AND \"2\"",
       "_": "1424761089223",
       "wt": "json"
     }
   },
   "response": {
     "numFound": 3,
     "start": 0,
     "docs": [
       {
         "title": [ ….

 Very strange to me, please help!

 Regards


Re: apache solr - dovecot - some search fields works some dont

2015-02-24 Thread Kevin Laurie
Hi Alex,

Below is where my schema is stored:-

/opt/solr/solr/collection1/conf#

File name: schema.xml

Below is the relevant output, including the body field:


 <fields>
   <field name="id" type="string" indexed="true" stored="true"
required="true" />
   <field name="uid" type="slong" indexed="true" stored="true"
required="true" />
   <field name="box" type="string" indexed="true" stored="true"
required="true" />
   <field name="user" type="string" indexed="true" stored="true"
required="true" />

   <field name="hdr" type="text" indexed="true" stored="false" />
   <field name="body" type="text" indexed="true" stored="false" />

   <field name="from" type="text" indexed="true" stored="false" />
   <field name="to" type="text" indexed="true" stored="false" />
   <field name="cc" type="text" indexed="true" stored="false" />
   <field name="bcc" type="text" indexed="true" stored="false" />
   <field name="subject" type="text" indexed="true" stored="false" />
 </fields>

Anything you see that I should be concerned about?




On Wed, Feb 25, 2015 at 1:27 AM, Kevin Laurie superinterstel...@gmail.com
wrote:

 Hi Alex,
 Sorry for such noobness question.
 But where does the schema file go in Solr? Is the directory below correct?
 /opt/solr/solr/collection1/data
 Correct?
 Thanks
 Kevin

 On Wed, Feb 25, 2015 at 1:21 AM, Kevin Laurie
 superinterstel...@gmail.com wrote:
  Dear Alex,
  I checked the log. When searching the fields From , To, Subject. It
 records
  it
  When searching Body, there is no log showing. I am assuming it is a
 problem
  in the schema.
 
  Will post schema.xml output in next mail.
 
  On Wed, Feb 25, 2015 at 1:09 AM, Alexandre Rafalovitch 
 arafa...@gmail.com
  wrote:
 
  Look for the line like this in your log with the search matching the
  body. Maybe put a nonsense string and look for that. This should tell
  you what the Solr-side search looks like.
 
  The thing that worries me here is: rows=107178 - that's most probably
  what's blowing up Solr. You should be paging, not getting everything.
  And that number being like that, it may mean your client makes two
  requests, once to get the result count and once to get the rows
  themselves. It's the second request that is most probably blowing up.
 
  Once you get the request, you should be able to tell what fields are
  being searched and check those fields in schema.xml for field type and
  then field type's definition. Which is what I asked for in the
  previous email.
 
  Regards,
 Alex.
 
  On 24 February 2015 at 11:55, Kevin Laurie superinterstel...@gmail.com
 
  wrote:
   155092 [qtp433527567-13] INFO  org.apache.solr.core.SolrCore  ?
   [collection1] webapp=/solr path=/select
  
  
 params={sort=uid+ascfl=uid,scoreq=subject:pricefq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:
 u...@domain.netrows=107178}
   hits=1237 status=0 QTime=1918
 
 
 
  
  Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
  http://www.solr-start.com/
 
 



Re: Special character and wildcard matching

2015-02-24 Thread Arun Rangarajan
In our case, the lower-casing is happening in a custom Java indexer code,
via Java's String.toLowerCase() method.

I used the analysis tool in Solr admin (with Jetty). I believe the raw
bytes explain this.

Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and
beyoncé in file beyonce_with_spl_chars.JPG.

Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
Raw bytes for beyoncé:[62 65 79 6f 6e 63 65 cc 81]

So when you look at the bytes, it seems to explain why beyonce* matches
beyoncé.

I tried your approach with a KeywordTokenizer followed by a
LowerCaseFilter, but I see the same behavior.



On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 But how is that lowercasing occurring? I mean, solr.StrField doesn't do
 that.

 Some containers default to automatically mapping accented characters, so
 that the accented e would then get indexed as a normal e, and then your
 wildcard would match it, and an accented e in a query would get mapped as
 well and then match the normal e in the index. What does your query
 response look like?

 This blog post explains that problem:
 http://bensch.be/tomcat-solr-and-special-characters

 Note that you could make your string field a text field with the keyword
 tokenizer and then filter it for lower case, such as when the user query
 might have a capital B. String field is most appropriate when the field
 really is 100% raw.


 -- Jack Krupansky

 On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan arunrangara...@gmail.com
 
 wrote:

  Yes, it is a string field and not a text field.
 
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"
  omitNorms="true"/>
  <field name="raw_name" type="string" indexed="true" stored="true" />
 
  Lower-casing done to do case-insensitive matching.
 
  On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky 
 jack.krupan...@gmail.com
  wrote:
 
   Is it really a string field - as opposed to a text field? Show us the
  field
   and field type.
  
   Besides, if it really were a raw name, wouldn't that be a capital
 B?
  
   -- Jack Krupansky
  
   On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan 
  arunrangara...@gmail.com
   
   wrote:
  
I have a string field raw_name like this in my document:
   
{raw_name: beyoncé}
   
(Notice that the last character is a special character.)
   
When I issue this wildcard query:
   
q=raw_name:beyonce*
   
i.e. with the last character simply being the ASCII 'e', Solr returns
  me
the above document.
   
How do I prevent this?
   
  
 



Re: Regarding behavior of docValues.

2015-02-24 Thread Erick Erickson
Hmmm, that's not my understanding. docValues are simply a different
layout for storing
the _indexed_ values that facilitates rapid loading of the field from
disk, essentially
putting the uninverted field value in a conveniently-loadable form.

So AFAIK, the field is stored only once and used for all three:
sorting, faceting and searching.

Best,
Erick

On Tue, Feb 24, 2015 at 4:13 AM, Modassar Ather modather1...@gmail.com wrote:
 Thanks for your response Mikhail.

 On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 Both statements seem true to me.

 On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com
 wrote:

  Hi,
 
  Kindly help me understand the behavior of following field.
 
   <field name="manu_exact" type="string" indexed="true" stored="false"
   docValues="true" />
 
  For a field like above where indexed=true and docValues=true, is it
  that:
   1) For sorting/faceting on *manu_exact* the docValues will be used.
   2) For querying on *manu_exact* the inverted index will be used.
 
  Thanks,
  Modassar
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: rankquery usage bug?

2015-02-24 Thread Ryan Josal
Ticket filed, thanks!
https://issues.apache.org/jira/browse/SOLR-7152

On Fri, Feb 20, 2015 at 9:29 PM, Joel Bernstein joels...@gmail.com wrote:

 Ryan,

 This looks like a good jira ticket to me.

 Joel Bernstein
 Search Engineer at Heliosearch

 On Fri, Feb 20, 2015 at 6:40 PM, Ryan Josal rjo...@gmail.com wrote:

  Hey guys, I put a rq in defaults but I can't figure out how to override
 it
  with no rankquery.  Looks like one option might be checking for empty
  string before trying to use it in QueryComponent?  I can work around it
 in
  the prep method of an earlier searchcomponent for now.
 
  Ryan
 



Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Michael Della Bitta
Benson:

Are you trying to run independent invocations of Solr for every node?
Otherwise, you'd just want to create an 8-shard collection with
maxShardsPerNode set to 8 (or more, I guess).
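
(A sketch, assuming ZooKeeper on localhost:2181 and illustrative names; upload
your particular config first, then create the collection:)

# upload the config set (zkcli.sh ships in example/scripts/cloud-scripts)
zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir /path/to/conf -confname myconf

# create an 8-shard collection via the Collections API
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection
    &numShards=8&replicationFactor=1&maxShardsPerNode=8&collection.configName=myconf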

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/

On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies bimargul...@gmail.com
wrote:

 With so much of the site shifted to 5.0, I'm having a bit of trouble
 finding what I need, and so I'm hoping that someone can give me a push
 in the right direction.

 On a big multi-core machine, I want to set up a configuration with 8
 (or perhaps more) nodes treated as shards. I have some very particular
 solrconfig.xml and schema.xml that I need to use.

 Could some kind person point me at a relatively step-by-step layout?
 This is all on Linux, I'm happy to explicitly run Zookeeper.



Re: highlighting the boolean query

2015-02-24 Thread Michael Sokolov
There is also PostingsHighlighter -- I recommend it, if only for the 
performance improvement, which is substantial, but I'm not completely 
sure how it handles this issue.  The one drawback I *am* aware of is 
that it is insensitive to positions (so words from phrases get 
highlighted even in isolation).
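
(For reference, a sketch of enabling it in Solr 4.x -- the field type is
assumed, and the field must be reindexed with offsets stored:)

In schema.xml:
  <field name="Contents" type="text_general" indexed="true" stored="true"
         storeOffsetsWithPositions="true"/>

In solrconfig.xml:
  <searchComponent class="solr.HighlightComponent" name="highlight">
    <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
  </searchComponent>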


-Mike


On 02/24/2015 12:46 PM, Erik Hatcher wrote:

BooleanQuery’s extractTerms looks like this:

public void extractTerms(Set<Term> terms) {
   for (BooleanClause clause : clauses) {
     if (clause.isProhibited() == false) {
       clause.getQuery().extractTerms(terms);
     }
   }
}
that’s generally the method called by the Highlighter for what terms should be 
highlighted.  So even if a term didn’t match the document, the query that the 
term was in matched the document and it just blindly highlights all the terms 
(minus prohibited ones).   That at least explains the behavior you’re seeing, 
but it’s not ideal.  I’ve seen specialized highlighters that convert to spans, 
which are accurate to the exact matches within the document.  Been a while 
since I dug into the HighlightComponent, so maybe there’s some other options 
available out of the box?

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com http://www.lucidworks.com/





On Feb 24, 2015, at 3:16 AM, Dmitry Kan solrexp...@gmail.com wrote:

Erick,

Our default operator is AND.

Both queries below parse the same:

a OR (b c) OR d
a OR (b AND c) OR d

The parsed query:

<str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
Contents:d</str>

So this part is consistent with our expectation.



I'm a bit puzzled by your statement that c didn't contribute to the

score.
what I meant was that the term c was not hit by the scorer: the explain
section does not refer to it. I'm using the made up terms here, but the
concept holds.

The code suggests that we could benefit from storing term offsets and
positions:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470

Is it correct assumption?

On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com
wrote:


Highlighting is such a pain...

what does the parsed query look like? If the default operator is OR,
then this seems correct as both 'd' and 'c' appear in the doc. So
I'm a bit puzzled by your statement that c didn't contribute to the
score.

If the parsed query is, indeed
a +b +c d

then it does look like something with the highlighter. Whether other
highlighters are better for this case.. no clue ;(

Best,
Erick

On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote:

Erick,

nope, we are using the std lucene qparser with some customizations that do
not affect the boolean query parsing logic.

Should we try some other highlighter?

On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com

wrote:


Are you using edismax?

On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com

wrote:

Hello!

In solr 4.3.1 there seems to be some inconsistency with the highlighting
of the boolean query:

a OR (b c) OR d

This returns a proper hit, which shows that only d was included in the
document score calculation.

But the highlighter returns both d and c in <em> tags.

Is this a known issue of the standard highlighter? Can it be mitigated?


--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info



--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info



--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info






8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Benson Margulies
With so much of the site shifted to 5.0, I'm having a bit of trouble
finding what I need, and so I'm hoping that someone can give me a push
in the right direction.

On a big multi-core machine, I want to set up a configuration with 8
(or perhaps more) nodes treated as shards. I have some very particular
solrconfig.xml and schema.xml that I need to use.

Could some kind person point me at a relatively step-by-step layout?
This is all on Linux, I'm happy to explicitly run Zookeeper.


Re: apache solr - dovecot - some search fields works some dont

2015-02-24 Thread Alexandre Rafalovitch
The field definition looks fine. It's not storing any content
(stored="false") but it is indexed, so you should find the records but
not see the body in them.

Not seeing a log entry is more of a worry. Are you sure the request
even made it to Solr?

Can you see anything in Dovecot's logs? Or in Solr's access logs
(actually Jetty/Tomcat's access logs, which may need to be enabled
first)?

At this point, you don't have enough information to fix anything. You
need to understand what's different between request against subject
vs. the request against body. I would break the communication in
three stages:
1) What Dovecot sent
2) What Solr received
3) What Solr sent back

I don't know your skill levels or your system setup well enough to advise
specifically, but a network tracer (e.g. Wireshark) is good for 1), logs
are good for 2), and taking the query from 1) and manually running it
against Solr is good for 3).
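
For example, on Linux something like the following covers stages 1 and 3
(a sketch; the loopback interface, port, and core name are assumptions
about your setup):

# stage 1: watch what Dovecot actually sends to Solr
sudo tcpdump -A -i lo port 8983

# stage 3: replay the captured query by hand and inspect the response
curl 'http://localhost:8983/solr/dovecot/select?q=body:test&wt=json'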

Hope this helps,
   Alex.

On 24 February 2015 at 12:35, Kevin Laurie superinterstel...@gmail.com wrote:
 <field name="body" type="text" indexed="true" stored="false" />




Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Re: Sorting on multi-valued field

2015-02-24 Thread Shyamsunder Mutcha
How about creating two additional single-valued fields alongside the
multi-valued field? Grab the highest and lowest values of the multi-valued
field using their natural sort order, then store the highest value in the
first field and the lowest value in the second.

Now, based on the requested sort order on the original field, override the
sort field on the handler side before executing the query.
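
A sketch of the indexing side (the field names here are made up, and the
min/max would be computed by your indexing code):

# index the multi-valued field plus single-valued low/high copies
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/collection1/update?commit=true' \
  -d '[{"id":"1","sizes":[3,9,5],"sizes_min":3,"sizes_max":9}]'

# ascending sort uses the low copy; descending would use sizes_max
curl 'http://localhost:8983/solr/collection1/select?q=*:*&sort=sizes_min+asc'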

Thanks 
Shyamsunder
Sent from my iPhone

 On Feb 24, 2015, at 11:28 AM, Alexandre Rafalovitch arafa...@gmail.com 
 wrote:
 
 The usual strategy is to have an UpdateRequestProcessor chain that
 will copy the field and keep only one value from it, specifically for
 sort. There is a whole collection of URPs to help you choose which
 value to keep, as well as how to provide a default.
 
 You can see the full list at:
 http://www.solr-start.com/info/update-request-processors/#FieldValueSubsetUpdateProcessorFactory
 
 Also, if you are on a recent Solr, consider enabling docValues on
 that target single-valued field; it's better for sorting. You can have
 the other flags (stored, indexed) set to false, as you will not be using
 the field for anything else.
 
 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/
 
 
 On 24 February 2015 at 09:37, Peri Subrahmanya
 peri.subrahma...@htcinc.com wrote:
 All,
 
 Is there a way sorting can work on a multi-valued field, or does
 multiValued always have to be "false" for sorting to work?
 
 Thanks
 -Peri
 
 *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended 
 recipient, please delete without copying and kindly advise us by e-mail of 
 the mistake in delivery.
 NOTE: Regardless of content, this e-mail shall not operate to bind HTC 
 Global Services to any order or other contract unless pursuant to explicit 
 written agreement or government initiative expressly permitting the use of 
 e-mail for such purpose.
 
 


Re: Solrcloud performance issues

2015-02-24 Thread longsan
Why do you use 15 replicas?

The more replicas you have, the slower it gets.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solrcloud-performance-issues-tp4186035p4188738.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regarding behavior of docValues.

2015-02-24 Thread Erick Erickson
You're making it too complicated. Both a docValues field and
an indexed (not docValues) field will give you the same
functionality. For rapidly changing indexes, docValues will
load more quickly when a new searcher is opened.

Your question below is not really relevant.

Can it be
<field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
or two fields, one for sorting+faceting and one for searching, like the following?

<field name="manu_exact" type="string" indexed="true" stored="false" />
<field name="manu_exact_sort" type="string" indexed="false" stored="false" docValues="true" />

You simply cannot sort, search, or facet on any field for which
indexed=false. You can do all three on any field where
indexed=true (assuming it's not multiValued and only has one token
since sorting only really makes sense for single-valued fields).

It doesn't matter whether the field is docValues=true or not.
So if you want a rule of thumb, make it a docValues field
if you're updating your index rapidly. Otherwise whether a field is
docValues or not is largely irrelevant.
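
For instance, a single indexed docValues field can serve all three in one
request (a sketch against a hypothetical collection1):

curl 'http://localhost:8983/solr/collection1/select?q=manu_exact:Apache&sort=manu_exact+asc&facet=true&facet.field=manu_exact&wt=json'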

Best,
Erick

On Tue, Feb 24, 2015 at 9:09 PM, Modassar Ather modather1...@gmail.com wrote:
 So for a requirement where I have a field which is used for sorting,
 faceting and searching what should be the better field definition.

 Can it be
 <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
 or two fields, one for sorting+faceting and one for searching, like the following.

 <field name="manu_exact" type="string" indexed="true" stored="false" />
 <field name="manu_exact_sort" type="string" indexed="false" stored="false" docValues="true" />

 Kindly note that it will be better if can use existing field for sorting,
 faceting and add searching on it like in example one above.

 Regards,
 Modassar

 On Tue, Feb 24, 2015 at 11:15 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Hmmm, that's not my understanding. docValues are simply a different
 layout for storing
 the _indexed_ values that facilitates rapid loading of the field from
 disk, essentially
 putting the uninverted field value in a conveniently-loadable form.

 So AFAIK, the field is stored only once and used for all three,
 sorting, faceting and
 searching.

 Best,
 Erick

 On Tue, Feb 24, 2015 at 4:13 AM, Modassar Ather modather1...@gmail.com
 wrote:
  Thanks for your response Mikhail.
 
  On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
  Both statements seem true to me.
 
  On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com
 
  wrote:
 
   Hi,
  
   Kindly help me understand the behavior of following field.
  
   <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
  
   For a field like above where indexed=true and docValues=true, is
 it
   that:
1) For sorting/faceting on *manu_exact* the docValues will be used.
2) For querying on *manu_exact* the inverted index will be used.
  
   Thanks,
   Modassar
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com
 



New leader/replica solution for HDFS

2015-02-24 Thread longsan
We use HDFS as our Solr index storage and we have a really heavy update
load. We have run into many problems with the current leader/replica
solution: there is duplicate index computation on the replica side, and the
data sync between leader and replica is always a problem.

As HDFS already provides data replication at the data layer, could Solr
provide just service-layer replication?

My thought is that the leader and the replica both bind to the same index
directory. The leader builds the index for new requests, and the replica
just keeps its index version up to date with the leader (via a periodic
soft commit, perhaps?). If the leader is lost, the replica takes over
immediately.

Thanks for any suggestion of this idea.







--
View this message in context: 
http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regarding behavior of docValues.

2015-02-24 Thread Modassar Ather
Thanks Erick for your detailed response.

Sorry! I forgot to mention that I was trying to understand this in the
context of Solr 5.0.0, where the FieldCache is no longer available.

Regards,
Modassar

On Wed, Feb 25, 2015 at 11:26 AM, Erick Erickson erickerick...@gmail.com
wrote:

 You're making it too complicated. Both a docValues field and
 an indexed (not docValues) field will give you the same
 functionality. For rapidly changing indexes, docValues will
 load more quickly when a new searcher is opened.

 Your question below is not really relevant.
 
 Can it be
 <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
 or two fields, one for sorting+faceting and one for searching, like the following.

 <field name="manu_exact" type="string" indexed="true" stored="false" />
 <field name="manu_exact_sort" type="string" indexed="false" stored="false" docValues="true" />

 You simply cannot sort, search, or facet on any field for which
 indexed=false. You can do all three on any field where
 indexed=true (assuming it's not multiValued and only has one token
 since sorting only really makes sense for single-valued fields).

 It doesn't matter whether the field is docValues=true or not.
 So if you want a rule of thumb, make it a docValues field
 if you're updating your index rapidly. Otherwise whether a field is
 docValues or not is largely irrelevant.

 Best,
 Erick

 On Tue, Feb 24, 2015 at 9:09 PM, Modassar Ather modather1...@gmail.com
 wrote:
  So for a requirement where I have a field which is used for sorting,
  faceting and searching what should be the better field definition.
 
  Can it be
  <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
  or two fields, one for sorting+faceting and one for searching, like the following.

  <field name="manu_exact" type="string" indexed="true" stored="false" />
  <field name="manu_exact_sort" type="string" indexed="false" stored="false" docValues="true" />
 
  Kindly note that it will be better if can use existing field for sorting,
  faceting and add searching on it like in example one above.
 
  Regards,
  Modassar
 
  On Tue, Feb 24, 2015 at 11:15 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  Hmmm, that's not my understanding. docValues are simply a different
  layout for storing
  the _indexed_ values that facilitates rapid loading of the field from
  disk, essentially
  putting the uninverted field value in a conveniently-loadable form.
 
  So AFAIK, the field is stored only once and used for all three,
  sorting, faceting and
  searching.
 
  Best,
  Erick
 
  On Tue, Feb 24, 2015 at 4:13 AM, Modassar Ather modather1...@gmail.com
 
  wrote:
   Thanks for your response Mikhail.
  
   On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev 
   mkhlud...@griddynamics.com wrote:
  
   Both statements seem true to me.
  
   On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather 
 modather1...@gmail.com
  
   wrote:
  
Hi,
   
Kindly help me understand the behavior of following field.
   
 <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
   
For a field like above where indexed=true and docValues=true,
 is
  it
that:
 1) For sorting/faceting on *manu_exact* the docValues will be
 used.
 2) For querying on *manu_exact* the inverted index will be used.
   
Thanks,
Modassar
   
  
  
  
   --
   Sincerely yours
   Mikhail Khludnev
   Principal Engineer,
   Grid Dynamics
  
   http://www.griddynamics.com
   mkhlud...@griddynamics.com
  
 



Re: Setting Up an External ZooKeeper Ensemble

2015-02-24 Thread CKReddy Bhimavarapu
Update: everything is working fine after I downloaded a fresh ZooKeeper and
applied the same configuration.
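
For anyone hitting the same symptom, a quick sanity check is ZooKeeper's
four-letter commands (assuming the default client port 2181):

echo ruok | nc localhost 2181    # a healthy server answers imok
echo stat | nc localhost 2181    # shows mode (standalone/leader/follower) and connected clients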

thanks for helping.

On Tue, Feb 24, 2015 at 5:13 PM, CKReddy Bhimavarapu chaitu...@gmail.com
wrote:

 yes

 chaitanya@imart-desktop:~/solr/zookeeper-3.4.6/bin$ ./zkServer.sh start

 JMX enabled by default
 Using config: /home/chaitanya/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg
 Starting zookeeper ... STARTED
 chaitanya@imart-desktop:~/solr/zookeeper-3.4.6/bin$ ./zkServer.sh status
 JMX enabled by default
 Using config: /home/chaitanya/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg
 Error contacting service. It is probably not running

 but I don't get why it is not running


 On Tue, Feb 24, 2015 at 1:45 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 Looks like the ZooKeeper server is either not running or not accepting
 connection possibly because of some configuration issues. Can you look
 into
 the ZooKeeper logs and see if there are any exceptions?

 On Tue, Feb 24, 2015 at 11:30 AM, CKReddy Bhimavarapu 
 chaitu...@gmail.com
 wrote:

  Hi,
I did follow all the steps in [
 
 
 https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
  ]
  but still I am getting this error
  Waiting to see Solr listening on port 8983 [-]  Still not seeing Solr
  listening on 8983 after 30 seconds!
  WARN  - 2015-02-24 05:50:19.161;
  org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null,
  unexpected error, closing socket connection and attempting reconnect
  java.net.ConnectException: Connection refused
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
  at
 
 
 org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
  at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
  WARN  - 2015-02-24 05:50:20.262;
  org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null,
  unexpected error, closing socket connection and attempting reconnect
 
 
  Where am I going wrong?
 
  --
  ckreddybh. chaitu...@gmail.com
 



 --
 Regards,
 Shalin Shekhar Mangar.




 --
 ckreddybh. chaitu...@gmail.com




-- 
ckreddybh. chaitu...@gmail.com


Solr Document expiration with TTL

2015-02-24 Thread Makailol Charls
Hello,

We are trying to add documents to Solr with a TTL defined (the document
expiration feature); they are expected to expire at the specified time, but
they do not.

Following are the settings we have defined in solrconfig.xml and
managed-schema.

solr version : 5.0.0
*solrconfig.xml*
---
<updateRequestProcessorChain default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">time_to_live_s</str>
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

*managed-schema*
---
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="time_to_live_s" type="string" stored="true" multiValued="false" />
<field name="expire_at_dt" type="date" stored="true" multiValued="false" />

*solr query*

The following query posts a document and sets expire_at_dt explicitly. That
works perfectly: the document expires at the defined time.

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/collection1/update?commit=true' \
  -d '[{"id":"10seconds","expire_at_dt":"NOW+10SECONDS"}]'


But when posting with a TTL (the following query), the document does not
expire after the given time.

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/collection1/update?commit=true' \
  -d '[{"id":"10seconds","time_to_live_s":"+10SECONDS"}]'
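
One way to narrow this down (a sketch reusing the field names above): check
whether the processor computed expire_at_dt at index time. If it is absent
from the stored document, the update never went through the expiration chain
at all.

curl 'http://localhost:8983/solr/collection1/select?q=id:10seconds&fl=id,time_to_live_s,expire_at_dt&wt=json'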

Any help would be appreciated.

Thanks,
Makailol


Re: how to debug solr performance degradation

2015-02-24 Thread Shawn Heisey
On 2/24/2015 5:45 PM, Tang, Rebecca wrote:
 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.

 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?

I would like to know what memory numbers in which program you are
looking at, and why you believe those numbers are a problem.

The JVM has a very different view of memory than the operating system. 
Numbers in top mean different things than numbers on the dashboard of
the admin UI, or the numbers in jconsole.  If you're on Windows, then
replace top with task manager, process explorer, resource monitor, etc.
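
On Linux, for example, the two views can be compared side by side (a sketch;
the start.jar process pattern is an assumption about how Solr was launched):

free -g    # the 'cached' column is the OS page cache holding index files
top -b -n 1 -p "$(pgrep -f start.jar | head -1)"    # RES is what the JVM has resident; VIRT includes mmapped index files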

Please provide as many details as you can about the things you are
looking at.

Thanks,
Shawn



Re: [ANNOUNCE] Apache Solr 5.0.0 and Reference Guide for Solr 5.0 released

2015-02-24 Thread Sebastián Ramírez
Awesome news. Thanks.


*Sebastián Ramírez*
Diseñador de Algoritmos

 http://www.senseta.com

 Tel: (+571) 795 7950 ext: 1012
 Cel: (+57) 300 370 77 10
 Calle 73 No 7 - 06  Piso 4
 Linkedin: co.linkedin.com/in/tiangolo/
 Twitter: @tiangolo https://twitter.com/tiangolo
 Email: sebastian.rami...@senseta.com
 www.senseta.com

On Fri, Feb 20, 2015 at 3:55 PM, Anshum Gupta ans...@anshumgupta.net
wrote:

 20 February 2015, Apache Solr™ 5.0.0 and Reference Guide for Solr 5.0
 available

 The Lucene PMC is pleased to announce the release of Apache Solr 5.0.0

 Solr is the popular, blazing fast, open source NoSQL search platform
 from the Apache Lucene project. Its major features include powerful
 full-text search, hit highlighting, faceted search, dynamic
 clustering, database integration, rich document (e.g., Word, PDF)
 handling, and geospatial search.  Solr is highly scalable, providing
 fault tolerant distributed search and indexing, and powers the search
 and navigation features of many of the world's largest internet sites.

 Solr 5.0 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

 See the CHANGES.txt file included with the release for a full list of
 details.

 Solr 5.0 Release Highlights:

  * Usability improvements that include improved bin scripts and new and
 restructured examples.

  * Scripts to support installing and running Solr as a service on Linux.

  * Distributed IDF is now supported and can be enabled via the config.
 Currently, there are four supported implementations for the same:
 * LocalStatsCache: Local document stats.
 * ExactStatsCache: One time use aggregation
 * ExactSharedStatsCache: Stats shared across requests
 * LRUStatsCache: Stats shared in an LRU cache across requests

  * Solr will no longer ship a war file and instead be a downloadable
 application.

  * SolrJ now has first class support for Collections API.

 * Implicit registration of replication, get and admin handlers.

  * Config API that supports paramsets for easily configuring solr
 parameters and configuring fields. This API also supports managing of
 pre-existing request handlers and editing common solrconfig.xml via
 overlay.

  * API for managing blobs allows uploading request handler jars and
 registering them via config API.

  * BALANCESHARDUNIQUE Collection API that allows for even distribution of
 custom replica properties.

  * There's now an option to not shuffle the nodeSet provided during
 collection creation.

  * Option to configure bandwidth usage by Replication handler to prevent it
 from using up all the bandwidth.

  * Splitting of clusterstate to per-collection enables scalability
 improvement in SolrCloud. This is also the default format for new
 Collections that would be created going forward.

  * timeAllowed is now used to prematurely terminate requests during query
 expansion and SolrClient request retry.

  * pivot.facet results can now include nested stats.field results
 constrained by those pivots.

  * stats.field can be used to generate stats over the results of arbitrary
 numeric functions.
 It also allows for requesting for statistics for pivot facets using tags.

  * A new DateRangeField has been added for indexing date ranges, especially
 multi-valued ones.

 * Spatial fields that used to require units=degrees now take
distanceUnits=degrees/kilometers/miles instead.

  * MoreLikeThis query parser allows requesting for documents similar to an
 existing document and also works in SolrCloud mode.

  * Logging improvements:
 * Transaction log replay status is now logged
 * Optional logging of slow requests.

 Solr 5.0 also includes many other new features as well as numerous
 optimizations and bugfixes of the corresponding Apache Lucene release.

 Detailed change log:
 http://lucene.apache.org/solr/5_0_0/changes/Changes.html

 Also available is the *Solr Reference Guide for Solr 5.0*. This 535 page
 PDF serves as the definitive user's manual for Solr 5.0. It can be
 downloaded
 from the Apache mirror network: https://s.apache.org/Solr-Ref-Guide-PDF

 Please report any feedback to the mailing lists
 (http://lucene.apache.org/solr/discussion.html)

 Note: The Apache Software Foundation uses an extensive mirroring network
 for distributing releases.  It is possible that the mirror you are using
 may not have replicated the release yet.  If that is the case, please
 try another mirror.  This also goes for Maven access.

 --
 Anshum Gupta
 http://about.me/anshumgupta


-- 
**
*This e-mail transmission, including any attachments, is intended only for 
the named recipient(s) and may contain information that is privileged, 
confidential and/or exempt from disclosure under applicable law. If you 
have received this transmission in error, or are not the named 
recipient(s), please notify Senseta immediately by return e-mail and 
permanently delete 

Re: how to debug solr performance degradation

2015-02-24 Thread Erick Erickson
Be careful what you think is being used by Solr since Lucene uses
MMapDirectories under the covers, and this means you might be seeing
virtual memory. See Uwe's excellent blog here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
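
On Linux you can see those mappings directly (a sketch; the start.jar
process pattern is an assumption):

pmap -x "$(pgrep -f start.jar | head -1)" | tail -20    # index files show up as large mappings; Kbytes is virtual, RSS is what's actually in RAM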

Best,
Erick

On Tue, Feb 24, 2015 at 5:02 PM, Walter Underwood wun...@wunderwood.org wrote:
 The other memory is used by the OS as file buffers. All the important parts 
 of the on-disk search index are buffered in memory. When the Solr process 
 wants a block, it is already right there, no delays for disk access.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)


 On Feb 24, 2015, at 4:45 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote:

 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.

 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?

 Rebecca Tang
 Applications Developer, UCSF CKM
 Industry Documents Digital Libraries
 E: rebecca.t...@ucsf.edu

 On 2/24/15 12:44 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
 Our solr index used to perform OK on our beta production box (anywhere
 between 0-3 seconds to complete any query), but today I noticed that the
 performance is very bad (queries take between 12-15 seconds).

 I haven't updated the solr index configuration
 (schema.xml/solrconfig.xml) lately.  All that's changed is the data --
 every month, I rebuild the solr index from scratch and deploy it to the
 box.  We will eventually go to incremental builds. But for now, all
 indexes are built from scratch.

 Here are the stats:
 Solr index size 183G
 Documents in index 14364201
 We just have single solr box
 It has 100G memory
 500G Harddrive
 16 cpus

 The bottom line on this problem, and I'm sure it's not something you're
 going to want to hear:  You don't have enough memory available to cache
 your index.  I'd plan on at least 192GB of RAM for an index this size,
 and 256GB would be better.

 Depending on the exact index schema, the nature of your queries, and how
 large your Java heap for Solr is, 100GB of RAM could be enough for good
 performance on an index that size ... or it might be nowhere near
 enough.  I would imagine that one of two things is true here, possibly
 both:  1) Your queries are very complex and involve accessing a very
 large percentage of the index data.  2) Your Java heap is enormous,
 leaving very little RAM for the OS to automatically cache the index.

 Adding more memory to the machine, if that's possible, might fix some of
 the problems.  You can find a discussion of the problem here:

 http://wiki.apache.org/solr/SolrPerformanceProblems

 If you have any questions after reading that wiki article, feel free to
 ask them.

 Thanks,
 Shawn





Re: how to debug solr performance degradation

2015-02-24 Thread François Schiettecatte
Rebecca

You don't want to give all the memory to the JVM. You want to give it just
enough for it to work optimally and leave the rest of the memory for the OS to
use for caching data. Giving the JVM too much memory can result in worse
performance because of GC. There is no magic formula for figuring out the
memory allocation for the JVM; it is very dependent on the workload. In your
case I would start with 5GB, and increment by 5GB with each run.
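
With the stock Jetty launcher that would look something like this (a sketch;
adjust to however you actually start Solr):

java -Xms5g -Xmx5g -jar start.jar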

I also use these settings for the JVM

-XX:+UseG1GC -Xms1G -Xmx1G

-XX:+AggressiveOpts -XX:+OptimizeStringConcat -XX:+ParallelRefProcEnabled 
-XX:MaxGCPauseMillis=200

I got them from this list so can’t take credit for them but they work for me.


Cheers

François


 On Feb 24, 2015, at 7:45 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote:
 
 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.
 
 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?
 
 
 Rebecca Tang
 Applications Developer, UCSF CKM
 Industry Documents Digital Libraries
 E: rebecca.t...@ucsf.edu
 
 
 
 
 
 On 2/24/15 12:44 PM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
 Our solr index used to perform OK on our beta production box (anywhere
 between 0-3 seconds to complete any query), but today I noticed that the
 performance is very bad (queries take between 12-15 seconds).
 
 I haven't updated the solr index configuration
 (schema.xml/solrconfig.xml) lately.  All that's changed is the data --
 every month, I rebuild the solr index from scratch and deploy it to the
 box.  We will eventually go to incremental builds. But for now, all
 indexes are built from scratch.
 
 Here are the stats:
 Solr index size 183G
 Documents in index 14364201
 We just have single solr box
 It has 100G memory
 500G Harddrive
 16 cpus
 
 The bottom line on this problem, and I'm sure it's not something you're
 going to want to hear:  You don't have enough memory available to cache
 your index.  I'd plan on at least 192GB of RAM for an index this size,
 and 256GB would be better.
 
 Depending on the exact index schema, the nature of your queries, and how
 large your Java heap for Solr is, 100GB of RAM could be enough for good
 performance on an index that size ... or it might be nowhere near
 enough.  I would imagine that one of two things is true here, possibly
 both:  1) Your queries are very complex and involve accessing a very
 large percentage of the index data.  2) Your Java heap is enormous,
 leaving very little RAM for the OS to automatically cache the index.
 
 Adding more memory to the machine, if that's possible, might fix some of
 the problems.  You can find a discussion of the problem here:
 
 http://wiki.apache.org/solr/SolrPerformanceProblems
 
 If you have any questions after reading that wiki article, feel free to
 ask them.
 
 Thanks,
 Shawn
 
 



Re: how to debug solr performance degradation

2015-02-24 Thread Boogie Shafer

rebecca,

i would suggest making sure you have some gc logging configured so you have 
some visibility into the JVM, esp if you don't already have JMX for sflow agent 
configured to give you external visibility of those internal metrics

the options below just print out the gc activity to a log

-Xloggc:gc.log
-verbose:gc 
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-XX:+PrintClassHistogram 
-XX:+PrintHeapAtGC 
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure 
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintTLAB
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=10m




on the memory tuning side of things, as has already been mentioned, try to
leave as much memory (outside the JVM) available to your OS to cache as much of
the actual index as possible

in your case, you have a lot of RAM, so i would suggest starting with the gc 
logging options above, plus these very basic JVM memory settings
-XX:+UseG1GC
-Xms2G
-Xmx4G
-XX:+UseAdaptiveSizePolicy 
-XX:MaxGCPauseMillis=1000 
-XX:GCTimeRatio=19

in short, start by letting the JVM tune itself ;)

then start looking at the actual GC behavior (this will be visible in the gc 
logs)


---
on the OS performance monitoring, a few real time tools which i like to use on 
linux

nmon
dstat
htop

for trending start with the basics (sysstat/sar) 
and build from there (hsflowd is super easy to install and get pushing data up 
to a central console like ganglia)
you can add to that by adding the sflow JVM agent to your solr environment

enabling JMX interface on jetty will let you use tools like jconsole or 
jvisualvm





From: François Schiettecatte fschietteca...@gmail.com
Sent: Tuesday, February 24, 2015 17:06
To: solr-user@lucene.apache.org
Subject: Re: how to debug solr performance degradation

Rebecca

You don’t want to give all the memory to the JVM. You want to give it just 
enough for it to work optimally and leave the rest of the memory for the OS to 
use for caching data. Giving the JVM too much memory can result in worse 
performance because of GC. There is no magic formula to figuring out the memory 
allocation for the JVM, that is very dependent on the workload. In your case I 
would start with 5GB, and increment by 5GB with each run.

I also use these settings for the JVM

-XX:+UseG1GC -Xms1G -Xmx1G

-XX:+AggressiveOpts -XX:+OptimizeStringConcat -XX:+ParallelRefProcEnabled 
-XX:MaxGCPauseMillis=200

I got them from this list so can’t take credit for them but they work for me.


Cheers

François


 On Feb 24, 2015, at 7:45 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote:

 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.

 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?


 Rebecca Tang
 Applications Developer, UCSF CKM
 Industry Documents Digital Libraries
 E: rebecca.t...@ucsf.edu





 On 2/24/15 12:44 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
 Our solr index used to perform OK on our beta production box (anywhere
 between 0-3 seconds to complete any query), but today I noticed that the
 performance is very bad (queries take between 12-15 seconds).

 I haven't updated the solr index configuration
 (schema.xml/solrconfig.xml) lately.  All that's changed is the data --
 every month, I rebuild the solr index from scratch and deploy it to the
 box.  We will eventually go to incremental builds. But for now, all
 indexes are built from scratch.

 Here are the stats:
 Solr index size 183G
 Documents in index 14364201
 We just have single solr box
 It has 100G memory
 500G Harddrive
 16 cpus

 The bottom line on this problem, and I'm sure it's not something you're
 going to want to hear:  You don't have enough memory available to cache
 your index.  I'd plan on at least 192GB of RAM for an index this size,
 and 256GB would be better.

 Depending on the exact index schema, the nature of your queries, and how
 large your Java heap for Solr is, 100GB of RAM could be enough for good
 performance on an index that size ... or it might be nowhere near
 enough.  I would imagine that one of two things is true here, possibly
 both:  1) Your queries are very complex and involve accessing a very
 large percentage of the index data.  2) Your Java heap is enormous,
 leaving very little RAM for the OS to automatically cache the index.

 Adding more memory to the machine, if that's possible, might fix some of
 the problems.  You can find a discussion of the problem here:

 http://wiki.apache.org/solr/SolrPerformanceProblems

 If you have any questions after reading that wiki article, feel free to
 ask them.

 Thanks,
 Shawn




Re: how to debug solr performance degradation

2015-02-24 Thread Boogie Shafer

meant to type JMX or sflow agent

also should have mentioned you want to be running a very recent JDK


From: Boogie Shafer boogie.sha...@proquest.com
Sent: Tuesday, February 24, 2015 18:03
To: solr-user@lucene.apache.org
Subject: Re: how to debug solr performance degradation

rebecca,

i would suggest making sure you have some gc logging configured so you have 
some visibility into the JVM, esp if you don't already have JMX for sflow agent 
configured to give you external visibility of those internal metrics

the options below just print out the gc activity to a log

-Xloggc:gc.log
-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-XX:+PrintClassHistogram
-XX:+PrintHeapAtGC
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintTLAB
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=10m




on the memory tuning side of things, as has already been mentioned, try to
leave as much memory (outside the JVM) available to your OS to cache as much of
the actual index as possible

in your case, you have a lot of RAM, so i would suggest starting with the gc 
logging options above, plus these very basic JVM memory settings
-XX:+UseG1GC
-Xms2G
-Xmx4G
-XX:+UseAdaptiveSizePolicy
-XX:MaxGCPauseMillis=1000
-XX:GCTimeRatio=19

in short, start by letting the JVM tune itself ;)

then start looking at the actual GC behavior (this will be visible in the gc 
logs)


---
on the OS performance monitoring, a few real time tools which i like to use on 
linux

nmon
dstat
htop

for trending start with the basics (sysstat/sar)
and build from there (hsflowd is super easy to install and get pushing data up 
to a central console like ganglia)
you can add to that by adding the sflow JVM agent to your solr environment

enabling JMX interface on jetty will let you use tools like jconsole or 
jvisualvm





From: François Schiettecatte fschietteca...@gmail.com
Sent: Tuesday, February 24, 2015 17:06
To: solr-user@lucene.apache.org
Subject: Re: how to debug solr performance degradation

Rebecca

You don’t want to give all the memory to the JVM. You want to give it just 
enough for it to work optimally and leave the rest of the memory for the OS to 
use for caching data. Giving the JVM too much memory can result in worse 
performance because of GC. There is no magic formula to figuring out the memory 
allocation for the JVM, that is very dependent on the workload. In your case I 
would start with 5GB, and increment by 5GB with each run.

I also use these settings for the JVM

-XX:+UseG1GC -Xms1G -Xmx1G

-XX:+AggressiveOpts -XX:+OptimizeStringConcat -XX:+ParallelRefProcEnabled 
-XX:MaxGCPauseMillis=200

I got them from this list so can’t take credit for them but they work for me.


Cheers

François


 On Feb 24, 2015, at 7:45 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote:

 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.

 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?


 Rebecca Tang
 Applications Developer, UCSF CKM
 Industry Documents Digital Libraries
 E: rebecca.t...@ucsf.edu





 On 2/24/15 12:44 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
 Our solr index used to perform OK on our beta production box (anywhere
 between 0-3 seconds to complete any query), but today I noticed that the
 performance is very bad (queries take between 12-15 seconds).

 I haven't updated the solr index configuration
 (schema.xml/solrconfig.xml) lately.  All that's changed is the data --
 every month, I rebuild the solr index from scratch and deploy it to the
 box.  We will eventually go to incremental builds. But for now, all
 indexes are built from scratch.

 Here are the stats:
 Solr index size 183G
 Documents in index 14364201
 We just have single solr box
 It has 100G memory
 500G Harddrive
 16 cpus

 The bottom line on this problem, and I'm sure it's not something you're
 going to want to hear:  You don't have enough memory available to cache
 your index.  I'd plan on at least 192GB of RAM for an index this size,
 and 256GB would be better.

 Depending on the exact index schema, the nature of your queries, and how
 large your Java heap for Solr is, 100GB of RAM could be enough for good
 performance on an index that size ... or it might be nowhere near
 enough.  I would imagine that one of two things is true here, possibly
 both:  1) Your queries are very complex and involve accessing a very
 large percentage of the index data.  2) Your Java heap is enormous,
 leaving very little RAM for the OS to automatically cache the index.

 Adding more memory to the machine, if 

Re: how to debug solr performance degradation

2015-02-24 Thread Walter Underwood
The other memory is used by the OS as file buffers. All the important parts of 
the on-disk search index are buffered in memory. When the Solr process wants a 
block, it is already right there, no delays for disk access.
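
One way to verify that is vmtouch, a third-party tool that reports how much
of a directory is resident in the page cache (the index path here is an
assumption about your layout):

vmtouch /var/solr/data/index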

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Feb 24, 2015, at 4:45 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote:

 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.
 
 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?
 
 Rebecca Tang
 Applications Developer, UCSF CKM
 Industry Documents Digital Libraries
 E: rebecca.t...@ucsf.edu
 
 On 2/24/15 12:44 PM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
 Our solr index used to perform OK on our beta production box (anywhere
 between 0-3 seconds to complete any query), but today I noticed that the
 performance is very bad (queries take between 12-15 seconds).
 
 I haven't updated the solr index configuration
 (schema.xml/solrconfig.xml) lately.  All that's changed is the data --
 every month, I rebuild the solr index from scratch and deploy it to the
 box.  We will eventually go to incremental builds. But for now, all
 indexes are built from scratch.
 
 Here are the stats:
 Solr index size 183G
 Documents in index 14364201
 We just have single solr box
 It has 100G memory
 500G Harddrive
 16 cpus
 
 The bottom line on this problem, and I'm sure it's not something you're
 going to want to hear:  You don't have enough memory available to cache
 your index.  I'd plan on at least 192GB of RAM for an index this size,
 and 256GB would be better.
 
 Depending on the exact index schema, the nature of your queries, and how
 large your Java heap for Solr is, 100GB of RAM could be enough for good
 performance on an index that size ... or it might be nowhere near
 enough.  I would imagine that one of two things is true here, possibly
 both:  1) Your queries are very complex and involve accessing a very
 large percentage of the index data.  2) Your Java heap is enormous,
 leaving very little RAM for the OS to automatically cache the index.
 
 Adding more memory to the machine, if that's possible, might fix some of
 the problems.  You can find a discussion of the problem here:
 
 http://wiki.apache.org/solr/SolrPerformanceProblems
 
 If you have any questions after reading that wiki article, feel free to
 ask them.
 
 Thanks,
 Shawn
 
 



Re: how to debug solr performance degradation

2015-02-24 Thread Tang, Rebecca
We gave the machine 180G mem to see if it improves performance.  However,
after we increased the memory, Solr started using only 5% of the physical
memory.  It has always used 90-something%.

What could be causing solr to not grab all the physical memory (grabbing
so little of the physical memory)?


Rebecca Tang
Applications Developer, UCSF CKM
Industry Documents Digital Libraries
E: rebecca.t...@ucsf.edu





On 2/24/15 12:44 PM, Shawn Heisey apa...@elyograg.org wrote:

On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
 Our solr index used to perform OK on our beta production box (anywhere
between 0-3 seconds to complete any query), but today I noticed that the
performance is very bad (queries take between 12-15 seconds).

 I haven't updated the solr index configuration
(schema.xml/solrconfig.xml) lately.  All that's changed is the data --
every month, I rebuild the solr index from scratch and deploy it to the
box.  We will eventually go to incremental builds. But for now, all
indexes are built from scratch.

 Here are the stats:
 Solr index size 183G
 Documents in index 14364201
 We just have single solr box
 It has 100G memory
 500G Harddrive
 16 cpus

The bottom line on this problem, and I'm sure it's not something you're
going to want to hear:  You don't have enough memory available to cache
your index.  I'd plan on at least 192GB of RAM for an index this size,
and 256GB would be better.

Depending on the exact index schema, the nature of your queries, and how
large your Java heap for Solr is, 100GB of RAM could be enough for good
performance on an index that size ... or it might be nowhere near
enough.  I would imagine that one of two things is true here, possibly
both:  1) Your queries are very complex and involve accessing a very
large percentage of the index data.  2) Your Java heap is enormous,
leaving very little RAM for the OS to automatically cache the index.

Adding more memory to the machine, if that's possible, might fix some of
the problems.  You can find a discussion of the problem here:

http://wiki.apache.org/solr/SolrPerformanceProblems

If you have any questions after reading that wiki article, feel free to
ask them.

Thanks,
Shawn




Re: Regarding behavior of docValues.

2015-02-24 Thread Modassar Ather
So for a requirement where I have a field which is used for sorting,
faceting and searching what should be the better field definition.

Can it be
<field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
or two fields, one for sorting+faceting and one for searching, like the following.

<field name="manu_exact" type="string" indexed="true" stored="false" />
<field name="manu_exact_sort" type="string" indexed="false" stored="false" docValues="true" />

Kindly note that it would be better if I can use the existing field for
sorting and faceting and add searching on it, as in the first example above.

Regards,
Modassar

On Tue, Feb 24, 2015 at 11:15 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Hmmm, that's not my understanding. docValues are simply a different
 layout for storing
 the _indexed_ values that facilitates rapid loading of the field from
 disk, essentially
 putting the uninverted field value in a conveniently-loadable form.

 So AFAIK, the field is stored only once and used for all three,
 sorting, faceting and
 searching.

 Best,
 Erick

 On Tue, Feb 24, 2015 at 4:13 AM, Modassar Ather modather1...@gmail.com
 wrote:
  Thanks for your response Mikhail.
 
  On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
  Both statements seem true to me.
 
  On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com
 
  wrote:
 
   Hi,
  
   Kindly help me understand the behavior of following field.
  
   <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />
  
   For a field like above where indexed=true and docValues=true, is
 it
   that:
1) For sorting/faceting on *manu_exact* the docValues will be used.
2) For querying on *manu_exact* the inverted index will be used.
  
   Thanks,
   Modassar
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com