Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Shawn Heisey
On 2/25/2015 5:50 AM, Benson Margulies wrote:
 So, found the following line in the guide:
 
java -DzkRun -DnumShards=2
 -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -jar start.jar
 
 using a completely clean, new, solr_home.
 
 In my own bootstrap dir, I have my own solrconfig.xml and schema.xml,
 and I modified to have:
 
  -DnumShards=8 -DmaxShardsPerNode=8
 
 When I went to start loading data into this, I failed:
 
 Caused by: 
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 No registered leader was found after waiting for 4000ms , collection:
 rni slice: shard4
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
 at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
 at 
 com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53)
 
 with corresponding log traffic in the solr log.
 
 The cloud page in the Solr admin app shows the IP address in green.
 It's a bit hard to read in general, it's all squished up to the top.

The way I would do it would be to start Solr *only* with the zkHost
parameter.  If you're going to use embedded zookeeper, I guess you would
use zkRun instead.

Once I had Solr running in cloud mode, I would upload the config to
zookeeper using zkcli, and create the collection using the Collections
API, including things like numShards and maxShardsPerNode on that CREATE
call, not as startup properties.  Then I would completely reindex my
data into the new collection.  It's a whole lot cleaner than trying to
convert non-cloud to cloud and split shards.
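
A rough sketch of that sequence (the zookeeper address, paths, and the config/collection names here are just placeholders, and the exact location of zkcli.sh depends on your install layout):

    # upload the confdir (solrconfig.xml, schema.xml, ...) to zookeeper under a name
    cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
        -confdir /path/to/myconf -confname myconf

    # create the collection via the Collections API, sizing it on the CREATE call itself
    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=rni&numShards=8&replicationFactor=1&maxShardsPerNode=8&collection.configName=myconf'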

Thanks,
Shawn



Re: Problem with queries that includes NOT

2015-02-25 Thread Jack Krupansky
As a general proposition, your first stop with any query interpretation
question should be to add the debugQuery=true parameter and look at the
parsed_query in the query response, which shows how the query is really
interpreted.
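
For example (the collection name and query here are just placeholders), append the parameter to any select request and read the parsed_query entry in the debug section of the response:

    curl 'http://localhost:8983/solr/collection1/select?q=NOT+Proc:ID01+OR+FileType:PDF_TEXT&debugQuery=true&wt=json&indent=true'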

-- Jack Krupansky

On Wed, Feb 25, 2015 at 8:21 AM, david.dav...@correo.aeat.es wrote:

 Hi Shawn,

 thank you for your quick response. I will read your links and make some
 tests.

 Regards,

 David Dávila
 DIT - 915828763




 De: Shawn Heisey apa...@elyograg.org
 Para:   solr-user@lucene.apache.org,
 Fecha:  25/02/2015 13:23
 Asunto: Re: Problem with queries that includes NOT



 On 2/25/2015 4:04 AM, david.dav...@correo.aeat.es wrote:
  We have problems with some queries. All of them include the tag NOT, and

  in my opinion, the results don´t make any sense.
 
  First problem:
 
  This query  NOT Proc:ID01returns   95806 results, however this one
 
  NOT Proc:ID01 OR FileType:PDF_TEXT returns  11484 results. But it's
  impossible that adding a tag OR the query has less number of results.
 
  Second problem. Here the problem is because of the brackets and the NOT
  tag:
 
   This query:
 
  (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE
  returns 0 documents.
 
  But this query:
 
  (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE)
  returns 53 documents, which is correct. So, the problem is the position
 of
  the bracket. I have checked the same query without NOTs, and it works
 fine
  returning the same number of results in both cases.  So, I think the
  problem is the combination of the bracket positions and the NOT tag.

 For the first query, there is a difference between NOT condition1 OR
 condition2 and NOT (condition1 OR condition2) ... I can imagine the
 first one increasing the document count compared to just NOT
 condition1 ... the second one wouldn't increase it.

 Boolean queries in Solr (and very likely Lucene as well) do not always
 do what people expect.

 http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/
 https://lucidworks.com/blog/why-not-and-or-and-not/

 As mentioned in the second link above, you'll get better results if you
 use the prefix operators with explicit parentheses.  One word of
 warning, though -- the prefix operators do not work correctly if you
 change the default operator to AND.

 Thanks,
 Shawn





Re: Stop solr query

2015-02-25 Thread Mikhail Khludnev
Moshe,

if you take a thread dump while a particular query is stuck (via jstack or in
the SolrAdmin tab), it may explain where exactly it's stalled; just check the
longest stack trace.
FWIW, in 4.x timeAllowed is checked only while documents are collected, and
in 5.0 it's also checked during query expansion (see
http://lucidworks.com/blog/solr-5-0/ and
https://issues.apache.org/jira/browse/SOLR-5986, which now cuts off requests during
the query-expansion stage as well). However, I'm not sure that long
query expansion is what happens with hon-lucene-synonyms.
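
For reference, a quick sketch of both diagnostics mentioned above (the Solr pid, collection, and query are placeholders); timeAllowed is just a per-request parameter in milliseconds:

    # take a thread dump of the running Solr JVM while the query is stuck
    jstack -l <solr_pid> > solr-threads.txt

    # cap a single request at 5 seconds of (collection-phase) work
    curl 'http://localhost:8983/solr/collection1/select?q=synonym+heavy+query&timeAllowed=5000'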



On Wed, Feb 25, 2015 at 3:21 PM, Moshe Recanati mos...@kmslh.com wrote:

 Hi Shawn,
 We checked this option and it didn't solve our problem.
 We're using https://github.com/healthonnet/hon-lucene-synonyms for query
 based synonyms.
 While running query with high number of words that have high number of
 synonyms the query got stuck and solr memory is exhausted.
 We tried to use this parameter suggested by you however it didn't stop the
 query and solve the issue.

 Please let me know if there is other option to tackle it. Today it might
 be high number of words that cause the issue and tomorrow it might be other
 something wrong. We can't rely only on user input check.

 Thank you in advance.


 Regards,
 Moshe Recanati
 SVP Engineering
 Office + 972-73-2617564
 Mobile  + 972-52-6194481
 Skype:  recanati

 More at:  www.kmslh.com | LinkedIn | FB


 -Original Message-
 From: Shawn Heisey [mailto:apa...@elyograg.org]
 Sent: Monday, February 23, 2015 5:49 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Stop solr query

 On 2/23/2015 7:23 AM, Moshe Recanati wrote:
  Recently there were some scenarios in which queries that user sent to
  solr got stuck and increased our solr heap.
 
  Is there any option to kill or timeout query that wasn't returned from
  solr by external command?
 

 The best thing you can do is examine all user input and stop such queries
 before they execute, especially if they are the kind of query that will
 cause your heap to grow out of control.

 The timeAllowed parameter can abort a query that takes too long in
 certain phases of the query.  In recent months, Solr has been modified so
 that timeAllowed will take effect during more query phases.  It is not a
 perfect solution, but it can be better than nothing.

 http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed

 Be aware that sometimes legitimate queries will be slow, and using
 timeAllowed may cause those queries to fail.

 Thanks,
 Shawn




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Connect Solr with ODBC to Excel

2015-02-25 Thread Hakim Benoudjit
Hi there,

I'm looking for a library to connect Solr throught ODBC to Excel in order
to do some reporting on my Solr data?
Anybody knows a library for that?

Thanks.

-- 
Cordialement,
Best regards,
Hakim Benoudjit


Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
On Wed, Feb 25, 2015 at 8:04 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 2/25/2015 5:50 AM, Benson Margulies wrote:
 So, found the following line in the guide:

java -DzkRun -DnumShards=2
 -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -jar start.jar

 using a completely clean, new, solr_home.

 In my own bootstrap dir, I have my own solrconfig.xml and schema.xml,
 and I modified to have:

  -DnumShards=8 -DmaxShardsPerNode=8

 When I went to start loading data into this, I failed:

 Caused by: 
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 No registered leader was found after waiting for 4000ms , collection:
 rni slice: shard4
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
 at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
 at 
 com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53)

 with corresponding log traffic in the solr log.

 The cloud page in the Solr admin app shows the IP address in green.
 It's a bit hard to read in general, it's all squished up to the top.

 The way I would do it would be to start Solr *only* with the zkHost
 parameter.  If you're going to use embedded zookeeper, I guess you would
 use zkRun instead.

 Once I had Solr running in cloud mode, I would upload the config to
 zookeeper using zkcli, and create the collection using the Collections
 API, including things like numShards and maxShardsPerNode on that CREATE
 call, not as startup properties.  Then I would completely reindex my
 data into the new collection.  It's a whole lot cleaner than trying to
 convert non-cloud to cloud and split shards.

Shawn, I _am_ starting from clean. However, I didn't find a recipe for
what you suggest as a process, and  (following Hoss' suggestion) I
found the recipe above with the bootstrap_confdir scheme.

I am mostly confused as to how I supply my solrconfig.xml and
schema.xml when I follow the process you are suggesting. I know I'm
verging on vampirism here, but if you could possibly find the time to
turn your paragraph into either a pointer to a recipe or the command
lines in a bit more detail, I'd be exceedingly grateful.

Thanks,
benson




 Thanks,
 Shawn



Re: Stop solr query

2015-02-25 Thread Shawn Heisey
On 2/25/2015 5:21 AM, Moshe Recanati wrote:
 We checked this option and it didn't solve our problem.
 We're using https://github.com/healthonnet/hon-lucene-synonyms for query 
 based synonyms.
 While running query with high number of words that have high number of 
 synonyms the query got stuck and solr memory is exhausted.
 We tried to use this parameter suggested by you however it didn't stop the 
 query and solve the issue.
 
 Please let me know if there is other option to tackle it. Today it might be 
 high number of words that cause the issue and tomorrow it might be other 
 something wrong. We can't rely only on user input check.

If legitimate queries use a lot of memory, you'll either need to
increase the java heap so it can deal with the increased memory
requirements, or you'll have to take steps to decrease memory usage.

Those steps might include changes to your application code to detect
problematic queries before they happen, and/or educating your users
about how to properly use the search.

Lucene and Solr are constantly making advances in memory efficiency, so
making sure you're always on the latest version goes a long way towards
keeping Solr efficient.

Thanks,
Shawn



Re: Facet on TopDocs

2015-02-25 Thread Alvaro Cabrerizo
Hi,

The facet component works over the whole result set, so you can't get the
facets for only your top-N documents. A naive way to fulfill your
requirement is to implement it in two steps (a rough sketch follows below):

   - Request your data and recover the doc ids.
   - Create a new query with the selected ids (id:id1 OR id:id2 OR ... OR
   id:100) and facet over the result.
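
A minimal sketch of those two requests over plain HTTP (the collection name, query, and facet field are placeholders, and the id list in step 2 has to be built from the response of step 1):

    # step 1: fetch only the ids of the top-N documents
    curl 'http://localhost:8983/solr/collection1/select?q=your+query&fl=id&rows=100&wt=json'

    # step 2: facet over exactly those ids, ignoring the documents themselves
    curl 'http://localhost:8983/solr/collection1/select?q=id:(id1+OR+id2+OR+...+OR+id100)&rows=0&facet=true&facet.field=category&wt=json'

Note that with Solr's default maxBooleanClauses of 1024 this is fine for a top-100 list, but it will not scale to arbitrarily large N.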

Regards.

On Wed, Feb 25, 2015 at 10:34 AM, kakes junkkak...@gmail.com wrote:

 We are trying to limit the number of facets returned only to the top 100
 docs
 and not the complete result set..

 Is there a way of accessing topDocs in the custom Faceting component?
 or
 Can the scores of the docID's in the resultset be accessed in the Facet
 Component?




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Facet-on-TopDocs-tp4188767.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Problem with queries that includes NOT

2015-02-25 Thread david . davila
Hi Shawn,

thank you for your quick response. I will read your links and make some 
tests.

Regards,

David Dávila
DIT - 915828763




De: Shawn Heisey apa...@elyograg.org
Para:   solr-user@lucene.apache.org, 
Fecha:  25/02/2015 13:23
Asunto: Re: Problem with queries that includes NOT



On 2/25/2015 4:04 AM, david.dav...@correo.aeat.es wrote:
 We have problems with some queries. All of them include the tag NOT, and 

 in my opinion, the results don´t make any sense.
 
 First problem:
 
 This query  NOT Proc:ID01returns   95806 results, however this one 

 NOT Proc:ID01 OR FileType:PDF_TEXT returns  11484 results. But it's 
 impossible that adding a tag OR the query has less number of results.
 
 Second problem. Here the problem is because of the brackets and the NOT 
 tag:
 
  This query:
 
 (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE 
 returns 0 documents.
 
 But this query:
 
 (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE) 
 returns 53 documents, which is correct. So, the problem is the position 
of 
 the bracket. I have checked the same query without NOTs, and it works 
fine 
 returning the same number of results in both cases.  So, I think the 
 problem is the combination of the bracket positions and the NOT tag.

For the first query, there is a difference between NOT condition1 OR
condition2 and NOT (condition1 OR condition2) ... I can imagine the
first one increasing the document count compared to just NOT
condition1 ... the second one wouldn't increase it.

Boolean queries in Solr (and very likely Lucene as well) do not always
do what people expect.

http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/
https://lucidworks.com/blog/why-not-and-or-and-not/

As mentioned in the second link above, you'll get better results if you
use the prefix operators with explicit parentheses.  One word of
warning, though -- the prefix operators do not work correctly if you
change the default operator to AND.

Thanks,
Shawn




Drop obsolete KEYS files from dist site

2015-02-25 Thread Konstantin Gribov
Hi, folks.

Currently a KEYS file is present in:
- www.apache.org/dist/lucene/solr/version/KEYS
- www.apache.org/dist/lucene/solr/KEYS
- www.apache.org/dist/lucene/KEYS

The last two KEYS files are obsolete (both last modified in Feb 2014).
Some of the keys actually used in the release process aren't present in them.

I think it would be good to drop them, to avoid their being used for release
artifact verification.

-- 
Best regards,
Konstantin Gribov


RE: Stop solr query

2015-02-25 Thread Moshe Recanati
Hi Shawn,
We checked this option and it didn't solve our problem.
We're using https://github.com/healthonnet/hon-lucene-synonyms for query-based
synonyms.
While running a query with a high number of words that have a high number of synonyms,
the query got stuck and Solr's memory was exhausted.
We tried the parameter you suggested, however it didn't stop the
query or solve the issue.

Please let me know if there is another option to tackle it. Today it might be
a high number of words that causes the issue, and tomorrow it might be
something else going wrong. We can't rely only on checking the user input.

Thank you in advance.


Regards,
Moshe Recanati
SVP Engineering
Office + 972-73-2617564
Mobile  + 972-52-6194481
Skype    :  recanati

More at:  www.kmslh.com | LinkedIn | FB


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Monday, February 23, 2015 5:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Stop solr query

On 2/23/2015 7:23 AM, Moshe Recanati wrote:
 Recently there were some scenarios in which queries that user sent to 
 solr got stuck and increased our solr heap.

 Is there any option to kill or timeout query that wasn't returned from 
 solr by external command?


The best thing you can do is examine all user input and stop such queries 
before they execute, especially if they are the kind of query that will cause 
your heap to grow out of control.

The timeAllowed parameter can abort a query that takes too long in certain 
phases of the query.  In recent months, Solr has been modified so that 
timeAllowed will take effect during more query phases.  It is not a perfect 
solution, but it can be better than nothing.

http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed

Be aware that sometimes legitimate queries will be slow, and using timeAllowed 
may cause those queries to fail.

Thanks,
Shawn



solr index time boosting.

2015-02-25 Thread CKReddy Bhimavarapu
Hi all,

We are trying to deboost some documents at index time, depending on the text
they contain, with something like this:
   <doc boost="0.03">

     <field name="pns"><![CDATA[Testing product - Water Bottle. Testing product - Water Bottle.
     Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle.
     Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle.
     Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle.
     Testing product - Water Bottle. Testing product - Water Bottle. Testing product - Water Bottle.
     Testing product - Water Bottle. ]]></field>
   </doc>


 My questions:
1. Are we going about this the right way?
2. How can we deboost this kind of document so that the difference shows up
in the results? (Is there any other way to deboost?)

Thanks in advance.
-- 
ckreddybh. chaitu...@gmail.com


Re: Problem with queries that includes NOT

2015-02-25 Thread Shawn Heisey
On 2/25/2015 4:04 AM, david.dav...@correo.aeat.es wrote:
 We have problems with some queries. All of them include the tag NOT, and
 in my opinion, the results don't make any sense.

 First problem:

 This query NOT Proc:ID01 returns 95806 results; however, this one
 NOT Proc:ID01 OR FileType:PDF_TEXT returns 11484 results. But it's
 impossible that adding an OR clause makes the query return fewer results.
 
 Second problem. Here the problem is because of the brackets and the NOT 
 tag:
 
  This query:
 
 (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE 
 returns 0 documents.
 
 But this query:
 
 (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE) 
 returns 53 documents, which is correct. So, the problem is the position of 
 the bracket. I have checked the same query without NOTs, and it works fine 
 returning the same number of results in both cases.  So, I think the 
 problem is the combination of the bracket positions and the NOT tag.

For the first query, there is a difference between "NOT condition1 OR
condition2" and "NOT (condition1 OR condition2)" ... I can imagine the
first one increasing the document count compared to just "NOT
condition1" ... the second one wouldn't increase it.

Boolean queries in Solr (and very likely Lucene as well) do not always
do what people expect.

http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/
https://lucidworks.com/blog/why-not-and-or-and-not/

As mentioned in the second link above, you'll get better results if you
use the prefix operators with explicit parentheses.  One word of
warning, though -- the prefix operators do not work correctly if you
change the default operator to AND.
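
As a hedged illustration of that advice, the second problem query above can be expressed with the prefix operators and an explicit positive clause, so no sub-query is purely negative (collection name is a placeholder; the leading + is URL-encoded as %2B):

    curl 'http://localhost:8983/solr/collection1/select?q=%2Bsys_FileType:PROTOTIPE+-Proc:ID01+-FileType:PDF_TEXT&debugQuery=true'

Comparing the parsed_query of this form against the bracketed NOT versions is the quickest way to see why the counts differ.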

Thanks,
Shawn



Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
So, found the following line in the guide:

   java -DzkRun -DnumShards=2
-Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -jar start.jar

using a completely clean, new, solr_home.

In my own bootstrap dir, I have my own solrconfig.xml and schema.xml,
and I modified to have:

 -DnumShards=8 -DmaxShardsPerNode=8

When I went to start loading data into this, I failed:

Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
No registered leader was found after waiting for 4000ms , collection:
rni slice: shard4
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at 
org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
at 
org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
at 
com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53)

with corresponding log traffic in the solr log.

The cloud page in the Solr admin app shows the IP address in green.
It's a bit hard to read in general, it's all squished up to the top.




On Tue, Feb 24, 2015 at 4:33 PM, Benson Margulies bimargul...@gmail.com wrote:
 On Tue, Feb 24, 2015 at 4:27 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : Unfortunately, this is all 5.1 and instructs me to run the 'start from
 : scratch' process.

 a) checkout the left nav of any ref guide page webpage which has a link to
 Older Versions of this Guide (PDF)

 b) i'm not entirely sure i understand what you're asking, but i'm guessing
 you mean...

 * you have a fully functional individual instance of Solr, with a single
 core
 * you only want to run that one single instance of the Solr process
 * you want that single solr process to be a SolrCloud of one node, but
 replace your single core with a collection that is divided into 8
 shards.
 * presumably: you don't care about replication since you are only trying
 to run one node.

 what you want to look into (in the 4.10 ref guide) is how to bootstrap a
 SolrCloud instance from a non-SolrCloud node -- ie: startup zk, tell solr
 to take the configs from your single core and upload them to zk as a
 configset, and register that single core as a collection.

 That should give you a single instance of solrcloud, with a single
 collection, consisting of one shard (your original core)

 Then you should be able to use the SPLITSHARD command to split your
 single shard into 2 shards, and then split them again, etc... (i don't
 think you can split directly to 8-sub shards with a single command)



 FWIW: unless you no longer have access to the original data, it would
 almost certainly be a lot easier to just start with a clean install of
 Solr in cloud mode, then create a collection with 8 shards, then re-index
 your data.

 OK, now I'm good to go. Thanks.




 -Hoss
 http://www.lucidworks.com/


Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
A little more data. Note that the cloud status shows the black bubble
for a leader. See http://i.imgur.com/k2MhGPM.png.

org.apache.solr.common.SolrException: No registered leader was found
after waiting for 4000ms , collection: rni slice: shard4
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doDeleteByQuery(DistributedUpdateProcessor.java:1358)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:1226)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
at 
org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:121)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)


On Wed, Feb 25, 2015 at 9:44 AM, Benson Margulies bimargul...@gmail.com wrote:
 On Wed, Feb 25, 2015 at 8:04 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 2/25/2015 5:50 AM, Benson Margulies wrote:
 So, found the following line in the guide:

java -DzkRun -DnumShards=2
 -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -jar start.jar

 using a completely clean, new, solr_home.

 In my own bootstrap dir, I have my own solrconfig.xml and schema.xml,
 and I modified to have:

  -DnumShards=8 -DmaxShardsPerNode=8

 When I went to start loading data into this, I failed:

 Caused by: 
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 No registered leader was found after waiting for 4000ms , collection:
 rni slice: shard4
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
 at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
 at 
 com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53)

 with corresponding log traffic in the solr log.

 The cloud page in the Solr admin app shows the IP address in green.
 It's a bit hard to read in general, it's all squished up to the top.

 The way I would do it would be to start Solr *only* with the zkHost
 parameter.  If you're going to use embedded zookeeper, I guess you would
 use zkRun instead.

 Once I had Solr running in cloud mode, I would upload the config to
 zookeeper using zkcli, and create the collection using the Collections
 API, including things like numShards and maxShardsPerNode on that CREATE
 call, not as startup properties.  Then I would completely reindex my
 data into the new collection.  It's a whole lot cleaner than trying to
 convert non-cloud to cloud and split shards.

 Shawn, I _am_ starting from clean. However, I didn't find a recipe for
 what you suggest as a process, and  (following Hoss' suggestion) I
 found the recipe above with the boostrap_confdir scheme.

 I am mostly confused as to how I supply my solrconfig.xml and
 schema.xml when I follow the process you are suggesting. I know I'm
 verging on vampirism here, but if you could possibly find the time to
 turn your paragraph into either a pointer to a recipe or the command
 lines in a bit more detail, I'd be exceedingly grateful.

 Thanks,
 benson




 Thanks,
 Shawn



Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
It's the zkcli options on my mind. zkcli's usage shows me 'bootstrap',
'upconfig', and uploading a solr.xml.

When I use upconfig, it might work, but it sure is noise:

benson@ip-10-111-1-103:/data/solr+rni$ 554331
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN
org.apache.zookeeper.server.NIOServerCnxn  – caught end of stream
exception
EndOfStreamException: Unable to read additional data from client
sessionid 0x14bc16c5e660003, likely client has closed socket
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)

On Wed, Feb 25, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 2/25/2015 8:35 AM, Benson Margulies wrote:
 Do I need a zkcli bootstrap or do I start with upconfig? What port does
 zkRun put zookeeper on?

 I personally would not use bootstrap options.  They are only meant to be
 used once, when converting from non-cloud, but many people who use them
 do NOT use them only once -- they include them in their startup scripts
 and use them on every startup.  The whole thing becomes extremely
 confusing.  I would just use zkcli and the Collections API, so nothing
 ever happens that you don't explicitly request.

 I believe that the port for embedded zookeeper (zkRun) is the jetty
 listen port plus 1000, so 9983 if jetty.port is 8983 or not set.

 Thanks,
 Shawn



Re: apache solr - dovecot - some search fields works some dont

2015-02-25 Thread Kevin Laurie
Hi Alex,

I get 1 error on start up
Is the error below serious:-


2/25/2015, 11:32:30 PM ERROR SolrCore
org.apache.solr.common.SolrException: undefined field text

org.apache.solr.common.SolrException: undefined field text
at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1269)
at 
org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getWrappedAnalyzer(IndexSchema.java:434)
at 
org.apache.lucene.analysis.DelegatingAnalyzerWrapper$DelegatingReuseStrategy.getReusableComponents(DelegatingAnalyzerWrapper.java:74)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:175)
at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:207)
at 
org.apache.solr.parser.SolrQueryParserBase.newFieldQuery(SolrQueryParserBase.java:374)
at 
org.apache.solr.parser.SolrQueryParserBase.getFieldQuery(SolrQueryParserBase.java:742)
at 
org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:541)
at org.apache.solr.parser.QueryParser.Term(QueryParser.java:299)
at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:185)
at org.apache.solr.parser.QueryParser.Query(QueryParser.java:107)
at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:96)
at 
org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:151)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:50)
at org.apache.solr.search.QParser.getQuery(QParser.java:141)
at 
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:148)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at 
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:64)
at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1739)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


On Wed, Feb 25, 2015 at 3:08 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:
 The field definition looks fine. It's not storing any content
 (stored=false) but is indexing, so you should find the records but not
 see the body in them.

 Not seeing a log entry is more of a worry. Are you sure the request
 even made it to Solr?

 Can you see anything in Dovecot's logs? Or in Solr's access.logs
 (Actually Jetty/Tomcat's access logs that may need to be enabled
 first).

 At this point, you don't have enough information to fix anything. You
 need to understand what's different between request against subject
 vs. the request against body. I would break the communication in
 three stages:
 1) What Dovecote sent
 2) What Solr received
 3) What Solr sent back

 I don't know your skill levels or your system setup to advise
 specifically, but Network tracer (e.g. Wireshark) is good for 1. Logs
 are good for 2. Using the query from 1) and manually running it
 against Solr is good for 3).

 Hope this helps,
Alex.

 On 24 February 2015 at 12:35, Kevin Laurie superinterstel...@gmail.com 
 wrote:
  <field name="body" type="text" indexed="true" stored="false" />



 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


AW: performance issues with geofilt

2015-02-25 Thread dirk.thalheim
Hello David,

thanks for your answer. In the meantime I found the memory hint too, in
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4#Sorting_and_Relevancy

So maybe we will switch to LatLonType for this kind of search. But the RPT field is also
needed, as we want to support searching by arbitrary polygons.

I'm also able to use sort=geodist() asc. This works well when I modify the
parameters to:
q=*:*&fq=typ:strasse&fq={!geofilt}&sfield=geometry&pt=51.370570625523,12.369290471603&d=1.0&sort=geodist() asc

Kind regards,

Dirk


Tue, 24 Feb 2015 19:42:03 GMT, david.w.smi...@gmail.com wrote:

Hi Dirk,

The RPT field type can be used for distance sorting/boosting but it's a
memory pig when used as-such so don't do it unless you have to.  You only
have to if you have a multi-valued point field.  If you have single-valued,
use LatLonType specifically for distance sorting.

Your sample query doesn't parse correctly for multiple reasons.  You can't
put a query into the sort parameter as you have done it.  You have to do
sort=query($sortQuery) asc&sortQuery=...   or a slightly different equivalent
variation.  Let's say you do that... still, I don't recommend this syntax when
you simply want distance sort - just use geodist(), as in:  sort=geodist()
asc.

If you want to use this syntax such as to sort by recipDistance, then it
would look like this (note the filter=false hint to the spatial query
parser, which otherwise is unaware it shouldn't bother actually
searching/filtering):
sort=query($sortQuery) desc&sortQuery={!geofilt score=recipDistance
filter=false sfield=geometry pt=51.3,12.3 d=1.0}

If you are able to use geodist() and still find it slow, there are
alternatives involving using projected data and then with simply euclidean
calculations, sqedist():
https://wiki.apache.org/solr/FunctionQuery#sqedist_-_Squared_Euclidean_Distance
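
As a sketch of the plain geodist() form suggested above, using the field name and point from this thread (the core name is a placeholder; curl's -G --data-urlencode simply keeps the special characters in the local params safe):

    curl 'http://localhost:8983/solr/core1/select' -G \
         --data-urlencode 'q=*:*' \
         --data-urlencode 'fq=typ:strasse' \
         --data-urlencode 'fq={!geofilt}' \
         --data-urlencode 'sfield=geometry' \
         --data-urlencode 'pt=51.370570625523,12.369290471603' \
         --data-urlencode 'd=1.0' \
         --data-urlencode 'sort=geodist() asc'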

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, Feb 24, 2015 at 6:12 AM, dirk.thalh...@bkg.bund.de wrote:

 Hello,

 we are using solr 4.10.1. There are two cores for different use cases with
 around 20 million documents (location descriptions) per core. Each document
 has a geometry field which stores a point and a bbox field which stores a
 bounding box. Both fields are defined with:
  <fieldType name="t_geometry"
      class="solr.SpatialRecursivePrefixTreeFieldType"
      spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
      geo="true" distErrPct="0.025" maxDistErr="0.9"
      units="degrees" />

 I'm currently trying to add a location search (find all documents around a
 point). My intention is to add this as filter query, so that the user is
 able to do an additional keyword search. These are the query parameters so
 far:
  q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry
  pt=51.370570625523,12.369290471603 d=1.0}
 To sort the documents by their distance to the requested point, I added
 following sort parameter:
 sort={!geofilt sort=distance sfield: geometry
 pt=51.370570625523,12.369290471603 d=1.0} asc

 Unfortunately I'm experiencing here some major performance/memory
 problems. The first distance query on a core takes over 10 seconds. In my
 first setup the same request to the second core completely blocked the
 server and caused an OutOfMemoryError. I had to increase the memory to 16
 GB and now it seems to work for the geometry field. Anyhow, the first
 request after a server restart takes some time, and when I try it with the
 bbox field after a request on the geometry field in both cores, the
 server blocks again.

 Can anyone explain why the distance needs so much memory? Can this be
 optimized?

 Kind regards,

 Dirk





Re: Solr Document expiration with TTL

2015-02-25 Thread Alexandre Rafalovitch
Reading https://lucidworks.com/blog/document-expiration/

It seems that your Delete check interval granularity is 30 seconds,
but your TTL is 10 seconds. Have you tried setting
autoDeletePeriodSeconds to something like 2 seconds and seeing if the
problem goes away due to more frequent checking of items to delete?

Also, even with the current setup, you should be observing the record
being deleted, if not after 10 seconds then after 30 seconds. Are you seeing
it not deleted at all?
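
One quick way to separate the two cases is to check whether the update processor actually translated the TTL into expire_at_dt on the stored document (collection and id as in the quoted message below):

    curl 'http://localhost:8983/solr/collection1/select?q=id:10seconds&fl=id,time_to_live_s,expire_at_dt&wt=json&indent=true'

If expire_at_dt comes back empty, the TTL was never converted, and the deletion will never trigger regardless of autoDeletePeriodSeconds.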

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 01:51, Makailol Charls 4extrama...@gmail.com wrote:
 Hello,

 We are trying to add documents in Solr with a TTL defined (document expiration
 feature), which are expected to expire at the specified time, but they do not.

 Following are the settings we have defined in solrconfig.xml and
 managed-schema.

 solr version : 5.0.0
 *solrconfig.xml*
 ---
 <updateRequestProcessorChain default="true">
   <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
     <int name="autoDeletePeriodSeconds">30</int>
     <str name="ttlFieldName">time_to_live_s</str>
     <str name="expirationFieldName">expire_at_dt</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

 *managed-schema*
 ---
 <field name="id" type="string" indexed="true" stored="true" multiValued="false" />
 <field name="time_to_live_s" type="string" stored="true" multiValued="false" />
 <field name="expire_at_dt" type="date" stored="true" multiValued="false" />

 *solr query*
 
 Following query posts a document and sets expire_at_dt explicitly. That
 is working perfectly OK and the document expires at the defined time.

 curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/collection1/update?commit=true' \
      -d '[{"id":"10seconds","expire_at_dt":"NOW+10SECONDS"}]'


 But when trying to post with TTL (following query), document does not
 expire after given time.

 curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/collection1/update?commit=true' \
      -d '[{"id":"10seconds","time_to_live_s":"+10SECONDS"}]'

 Any help would be appreciated.

 Thanks,
 Makailol


Re: New leader/replica solution for HDFS

2015-02-25 Thread Erick Erickson
bq: And the data sync between leader/replica is always a problem

Not quite sure what you mean by this. There shouldn't need to be
any syncing in the sense of the index getting replicated; the
incoming documents should be sent to each node (and indexed
to HDFS) as they come in.

bq: There is duplicate index computing on the replica side.

Yes, that's the design of SolrCloud, explicitly to provide data safety.
If you instead rely on the leader to index and somehow pull that
indexed form to the replica, then you will lose data if the leader
goes down before sending the indexed form.

bq: My thought is that the leader and the replica all bind to the same data
index directory.

This is unsafe. They would both then try to _write_ to the same
index, which can easily corrupt indexes and/or all but the first
one to access the index would be locked out.

All that said, the HDFS triple-redundancy compounded with the
Solr leaders/replicas redundancy means a bunch of extra
storage. You can turn the HDFS replication down to 1, but that has
other implications.

Best,
Erick

On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote:
 We use HDFS as our Solr index storage and we have a really heavy update
 load. We have met many problems with the current leader/replica solution. There
 is duplicate index computing on the replica side, and the data sync between
 leader/replica is always a problem.

 As HDFS already provides data replication at the data layer, could Solr provide
 just service-layer replication?

 My thought is that the leader and the replica all bind to the same data
 index directory. The leader would build up the index for new requests; the
 replica would just keep updating its index version with the leader (such as with a
 soft commit periodically?). If the leader is lost, then the replica would take
 over the duty immediately.

 Thanks for any suggestion of this idea.







 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Gary Taylor
I can't get the FileListEntityProcessor and TikaEntityProcessor to
correctly add a Solr document for each epub file in my local directory.


I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start 
and then solr create -c hn2 to create a new core.


I want to index a load of epub files that I've got in a directory. So I 
created a data-import.xml (in solr\hn2\conf):


<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />

      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="file" name="fileName"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my 
data-import.xml:


  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-import.xml</str>
    </lst>
  </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc 
fields were setup:


  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

  <field name="fileName" type="string" indexed="true" stored="true" />
  <field name="author" type="string" indexed="true" stored="true" />
  <field name="title" type="string" indexed="true" stored="true" />

  <field name="size" type="long" indexed="true" stored="true" />
  <field name="lastModified" type="date" indexed="true" stored="true" />

  <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
  <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

<copyField source="content" dest="text"/>

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting solr then creates a new managed-schema file and 
renames schema.xml to schema.xml.back


All good so far.

Now I go to the web admin for dataimport 
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and 
execute a full import.


But the results show Requests: 0, Fetched: 58, Skipped: 0,
Processed: 1 - i.e. it only adds one document (the very first one) even
though it has iterated over 58!


No errors are reported in the logs.

I can search on the contents of that first epub document, so it's 
extracting OK in Tika, but there's a problem somewhere in my config 
that's causing only 1 document to be indexed in Solr.


Thanks for any assistance / pointers.

Regards,
Gary

--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



Re: Basic Multilingual search capability

2015-02-25 Thread Rishi Easwaran


Hi Trey,

Thanks for the detailed response and the link to the talk, it was very
informative.
Yes, looking at the current system requirements, the ICUTokenizer might be the best
bet for our use case.
The MultiTextField mentioned in the JIRA SOLR-6492 has some cool features, and I'm
definitely looking forward to trying it out once it's integrated into the main codebase.

 
Thanks,
Rishi.

 

 

-Original Message-
From: Trey Grainger solrt...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 1:40 am
Subject: Re: Basic Multilingual search capability


Hi Rishi,

I don't generally recommend a language-insensitive approach except for
really simple multilingual use cases (for most of the reasons Walter
mentioned), but the ICUTokenizer is probably the best bet you're going to
have if you really want to go that route and only need exact-match on the
tokens that are parsed. It won't work that well for all languages (CJK
languages, for example), but it will work fine for many.

It is also possible to handle multi-lingual content in a more intelligent
(i.e. per-language configuration) way in your search index, of course.
There are three primary strategies (i.e. ways that actually work in the
real world) to do this:
1) create a separate field for each language and search across all of them
at query time
2) create a separate core per language-combination and search across all of
them at query time
3) invoke multiple language-specific analyzers within a single field's
analyzer and index/query using one or more of those language's analyzers
for each document/query.

These are listed in ascending order of complexity, and each can be valid
based upon your use case. For at least the first and third cases, you can
use index-time language detection to map to the appropriate
fields/analyzers if you are otherwise unaware of the languages of the
content from your application layer. The third option requires custom code
(included in the large Multilingual Search chapter of Solr in Action
http://solrinaction.com and soon to be contributed back to Solr via
SOLR-6492 https://issues.apache.org/jira/browse/SOLR-6492), but it
enables you to index an arbitrarily large number of languages into the same
field if needed, while preserving language-specific analysis for each
language.

I presented in detail on the above strategies at Lucene/Solr Revolution
last November, so you may consider checking out the presentation and/or
slides to asses if one of these strategies will work for your use case:
http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a
separate field per language) if you can, as it is certainly the simplest of
the approaches (albeit the one that scales the least well after you add
more than a few languages to your queries). If you want to stay simple and
stick with the ICUTokenizer then it will work to a point, but some of the
problems Walter mentioned may eventually bite you if you are supporting
certain groups of languages.
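
As a minimal sketch of strategy 1 at query time (the per-language field names text_en, text_ru, and text_de are hypothetical, and any language-detected routing would happen at index time), an edismax request simply lists every language field in qf:

    curl 'http://localhost:8983/solr/collection1/select' -G \
         --data-urlencode 'q=здравствуйте' \
         --data-urlencode 'defType=edismax' \
         --data-urlencode 'qf=text_en text_ru text_de' \
         --data-urlencode 'wt=json'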

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood wun...@wunderwood.org
wrote:

 It isn’t just complicated, it can be impossible.

 Do you have content in Chinese or Japanese? Those languages (and some
 others) do not separate words with spaces. You cannot even do word search
 without a language-specific, dictionary-based parser.

 German is space separated, except many noun compounds are not
 space-separated.

 Do you have Finnish content? Entire prepositional phrases turn into word
 endings.

 Do you have Arabic content? That is even harder.

 If all your content is in space-separated languages that are not heavily
 inflected, you can kind of do OK with a language-insensitive approach. But
 it hits the wall pretty fast.

 One thing that does work pretty well is trademarked names (LaserJet, Coke,
 etc). Those are spelled the same in all languages and usually not inflected.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)

 On Feb 23, 2015, at 8:00 PM, Rishi Easwaran rishi.easwa...@aol.com
 wrote:

  Hi Alex,
 
  There is no specific language list.
  For example: the documents that needs to be indexed are emails or any
 messages for a global customer base. The messages back and forth could be
 in any language or mix of languages.
 
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and
 provide basic search capability for any language. Ex: When the document
 contains hello or здравствуйте, the analyzer creates tokens and provides
 exact match search results.
 
  Now it would be great if it had capability to tokenize email addresses
 (ex:he...@aol.com- i think standardTokenizer already does this),
 filenames (здравствуйте.pdf), but 

Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
Do I need a zkcli bootstrap or do I start with upconfig? What port does
zkRun put zookeeper on?
On Feb 25, 2015 10:15 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 2/25/2015 7:44 AM, Benson Margulies wrote:
  Shawn, I _am_ starting from clean. However, I didn't find a recipe for
  what you suggest as a process, and  (following Hoss' suggestion) I
  found the recipe above with the boostrap_confdir scheme.
 
  I am mostly confused as to how I supply my solrconfig.xml and
  schema.xml when I follow the process you are suggesting. I know I'm
  verging on vampirism here, but if you could possibly find the time to
  turn your paragraph into either a pointer to a recipe or the command
  lines in a bit more detail, I'd be exceedingly grateful.

 I'm willing to help in any way that I can.

 Normally in the conf directory for a non-cloud core you have
 solrconfig.xml and schema.xml, plus any other configs referenced by
 those files, like synomyms.txt, dih-config.xml, etc.  In cloud terms,
 the directory containing these files is a confdir.  It's best to keep
 the on-disk copy of your configs completely outside of the solr home so
 there's no confusion about what configurations are active.  On-disk
 cores for solrcloud do not need or use a conf directory.

 The cloud-scripts/zkcli.sh (or zkcli.bat) script has an upconfig
 command with -confdir and -confname options.

 When doing upconfig, the zkHost value goes on the -z option to zkcli,
 and you only need to list one of your zookeeper hosts, although it is
 perfectly happy if you list them all.  You would point -confdir at a
 directory containing the config files mentioned earlier, and -confname
 is the name that the config has in zookeeper, which you would then use
 on the collection.configName parameter for the Collections API call.
 Once the config is uploaded, here's an example call to that API for
 creating a collection:

 http://server:port/solr/admin/collections?action=CREATE&name=test&numShards=8&replicationFactor=1&collection.configName=testcfg&maxShardsPerNode=8

 If this is not enough detail, please let me know which part you need
 help with.

 Thanks,
 Shawn




RE: Stop solr query

2015-02-25 Thread Moshe Recanati
Hi Mikhail,
We're using 4.7.1. This means I can't stop the search.
I think this is a mandatory feature.


Regards,
Moshe Recanati
SVP Engineering
Office + 972-73-2617564
Mobile  + 972-52-6194481
Skype    :  recanati

More at:  www.kmslh.com | LinkedIn | FB


-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
Sent: Wednesday, February 25, 2015 3:42 PM
To: solr-user
Subject: Re: Stop solr query

Moshe,

if you take a thread dump while a particular query stuck (via jstack of in 
SolrAdmin tab), it may explain where exactly it's stalled, just check the 
longest stack trace.
FWIW, in 4.x timeallowed is checked only while documents are collected, and in 
5 it's also checked during query expansion (see 
http://lucidworks.com/blog/solr-5-0/ now cut-offs requests 
https://issues.apache.org/jira/browse/SOLR-5986 during the query-expansion 
stage as well ). however I'm not sure it has place (long query expansion) with 
hon-synonyms.



On Wed, Feb 25, 2015 at 3:21 PM, Moshe Recanati mos...@kmslh.com wrote:

 Hi Shawn,
 We checked this option and it didn't solve our problem.
 We're using https://github.com/healthonnet/hon-lucene-synonyms for 
 query based synonyms.
 While running query with high number of words that have high number of 
 synonyms the query got stuck and solr memory is exhausted.
 We tried to use this parameter suggested by you however it didn't stop 
 the query and solve the issue.

 Please let me know if there is other option to tackle it. Today it 
 might be high number of words that cause the issue and tomorrow it 
 might be other something wrong. We can't rely only on user input check.

 Thank you in advance.


 Regards,
 Moshe Recanati
 SVP Engineering
 Office + 972-73-2617564
 Mobile  + 972-52-6194481
 Skype:  recanati

 More at:  www.kmslh.com | LinkedIn | FB


 -Original Message-
 From: Shawn Heisey [mailto:apa...@elyograg.org]
 Sent: Monday, February 23, 2015 5:49 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Stop solr query

 On 2/23/2015 7:23 AM, Moshe Recanati wrote:
  Recently there were some scenarios in which queries that user sent 
  to solr got stuck and increased our solr heap.
 
  Is there any option to kill or timeout query that wasn't returned 
  from solr by external command?
 

 The best thing you can do is examine all user input and stop such 
 queries before they execute, especially if they are the kind of query 
 that will cause your heap to grow out of control.

 The timeAllowed parameter can abort a query that takes too long in 
 certain phases of the query.  In recent months, Solr has been modified 
 so that timeAllowed will take effect during more query phases.  It is 
 not a perfect solution, but it can be better than nothing.

 http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed

 Be aware that sometimes legitimate queries will be slow, and using 
 timeAllowed may cause those queries to fail.

 Thanks,
 Shawn




--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Shawn Heisey
On 2/25/2015 7:44 AM, Benson Margulies wrote:
 Shawn, I _am_ starting from clean. However, I didn't find a recipe for
 what you suggest as a process, and  (following Hoss' suggestion) I
 found the recipe above with the boostrap_confdir scheme.

 I am mostly confused as to how I supply my solrconfig.xml and
 schema.xml when I follow the process you are suggesting. I know I'm
 verging on vampirism here, but if you could possibly find the time to
 turn your paragraph into either a pointer to a recipe or the command
 lines in a bit more detail, I'd be exceedingly grateful.

I'm willing to help in any way that I can.

Normally in the conf directory for a non-cloud core you have
solrconfig.xml and schema.xml, plus any other configs referenced by
those files, like synonyms.txt, dih-config.xml, etc.  In cloud terms,
the directory containing these files is a confdir.  It's best to keep
the on-disk copy of your configs completely outside of the solr home so
there's no confusion about what configurations are active.  On-disk
cores for solrcloud do not need or use a conf directory.

The cloud-scripts/zkcli.sh (or zkcli.bat) script has an upconfig
command with -confdir and -confname options.
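
For example, a hedged sketch of that upconfig call (the zookeeper address, the path to the confdir, and the config name are placeholders; in a 4.10 install the script typically lives under example/scripts/cloud-scripts/):

    cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
        -confdir /path/to/configs/testcfg -confname testcfg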

When doing upconfig, the zkHost value goes on the -z option to zkcli,
and you only need to list one of your zookeeper hosts, although it is
perfectly happy if you list them all.  You would point -confdir at a
directory containing the config files mentioned earlier, and -confname
is the name that the config has in zookeeper, which you would then use
on the collection.configName parameter for the Collections API call. 
Once the config is uploaded, here's an example call to that API for
creating a collection:

http://server:port/solr/admin/collections?action=CREATE&name=test&numShards=8&replicationFactor=1&collection.configName=testcfg&maxShardsPerNode=8

If this is not enough detail, please let me know which part you need
help with.

Thanks,
Shawn



Re: Problem with queries that includes NOT

2015-02-25 Thread Alvaro Cabrerizo
Hi,

The edismax parser should be able to manage the query you want to ask. I've
made a test, and both of the following queries give me the right result (note the
parentheses):

   - {!edismax}(NOT id:7 AND NOT id:8 AND id:9)    (gives 1 hit: the id:9)
   - {!edismax}((NOT id:7 AND NOT id:8) AND id:9)  (gives 1 hit: the id:9)

In general, the issue appears when using the lucene query parser mixing
different boolean clauses (including NOT). Thus, as you commented, the following
queries give different results:

   - NOT id:7 AND NOT id:8 AND id:9    (gives 1 hit: the id:9)
   - (NOT id:7 AND NOT id:8) AND id:9  (gives 0 hits when expecting 1)

Since I read the chapter Limitations of prohibited clauses in sub-queries
from the Apache Solr 3 Enterprise Search Server many years ago, I always
add the all-documents query clause *:* to the negative clauses to avoid
the problem you mentioned. Thus I would recommend rewriting the queries you
showed us as:

   - (*:* AND NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND
   sys_FileType:PROTOTIPE
   - (NOT id:7 AND NOT id:8 AND *:*) AND id:9  (gives 1 hit as expected)

The first query can then be read as: give me all the documents having
PROTOTIPE, except those having ID01 or PDF_TEXT.

Regards.




On Wed, Feb 25, 2015 at 1:23 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 2/25/2015 4:04 AM, david.dav...@correo.aeat.es wrote:
  We have problems with some queries. All of them include the tag NOT, and
  in my opinion, the results don´t make any sense.
 
  First problem:
 
  This query  NOT Proc:ID01returns   95806 results, however this one
 
  NOT Proc:ID01 OR FileType:PDF_TEXT returns  11484 results. But it's
  impossible that adding a tag OR the query has less number of results.
 
  Second problem. Here the problem is because of the brackets and the NOT
  tag:
 
   This query:
 
  (NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE
  returns 0 documents.
 
  But this query:
 
  (NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE)
  returns 53 documents, which is correct. So, the problem is the position
 of
  the bracket. I have checked the same query without NOTs, and it works
 fine
  returning the same number of results in both cases.  So, I think the
  problem is the combination of the bracket positions and the NOT tag.

 For the first query, there is a difference between "NOT condition1 OR
 condition2" and "NOT (condition1 OR condition2)" ... I can imagine the
 first one increasing the document count compared to just "NOT
 condition1" ... the second one wouldn't increase it.

 Boolean queries in Solr (and very likely Lucene as well) do not always
 do what people expect.

 http://robotlibrarian.billdueber.com/2011/12/solr-and-boolean-operators/
 https://lucidworks.com/blog/why-not-and-or-and-not/

 As mentioned in the second link above, you'll get better results if you
 use the prefix operators with explicit parentheses.  One word of
 warning, though -- the prefix operators do not work correctly if you
 change the default operator to AND.

 Thanks,
 Shawn




Re: apache solr - dovecot - some search fields works some dont

2015-02-25 Thread Kevin Laurie
Hi Alex,

Below shows that Solr is not getting anything from the text search.
I will try to search From / To and see how the performance is.





select BAD Error in IMAP command INBOX: Unknown command.
. select inbox
* FLAGS (\Answered \Flagged \Deleted \Seen \Draft $Forwarded)
* OK [PERMANENTFLAGS (\Answered \Flagged \Deleted \Seen \Draft
$Forwarded \*)] Flags permitted.
* 49983 EXISTS
* 0 RECENT
* OK [UNSEEN 46791] First unseen.
* OK [UIDVALIDITY 1414214135] UIDs valid
* OK [UIDNEXT 107218] Predicted next UID
* OK [NOMODSEQ] No permanent modsequences
. OK [READ-WRITE] Select completed (0.002 secs).
search text dave
search BAD Error in IMAP command TEXT: Unknown command.
. search text dave
* OK Searched 6% of the mailbox, ETA 2:24
* OK Searched 13% of the mailbox, ETA 2:10
* OK Searched 20% of the mailbox, ETA 1:54
* OK Searched 27% of the mailbox, ETA 1:46
* OK Searched 34% of the mailbox, ETA 1:36
* OK Searched 41% of the mailbox, ETA 1:26
* OK Searched 49% of the mailbox, ETA 1:11
* OK Searched 56% of the mailbox, ETA 1:02
* OK Searched 63% of the mailbox, ETA 0:52
* OK Searched 69% of the mailbox, ETA 0:44
* OK Searched 77% of the mailbox, ETA 0:31
* OK Searched 85% of the mailbox, ETA 0:20
* OK Searched 92% of the mailbox, ETA 0:10
* OK Searched 98% of the mailbox, ETA 0:02

On Wed, Feb 25, 2015 at 11:39 PM, Kevin Laurie
superinterstel...@gmail.com wrote:
 Hi Alex,

 I get 1 error on start up
 Is the error below serious:-


 2/25/2015, 11:32:30 PM ERROR SolrCore
 org.apache.solr.common.SolrException: undefined field text

 org.apache.solr.common.SolrException: undefined field text
 at 
 org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1269)
 at 
 org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getWrappedAnalyzer(IndexSchema.java:434)
 at 
 org.apache.lucene.analysis.DelegatingAnalyzerWrapper$DelegatingReuseStrategy.getReusableComponents(DelegatingAnalyzerWrapper.java:74)
 at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:175)
 at org.apache.lucene.util.QueryBuilder.createFieldQuery(QueryBuilder.java:207)
 at 
 org.apache.solr.parser.SolrQueryParserBase.newFieldQuery(SolrQueryParserBase.java:374)
 at 
 org.apache.solr.parser.SolrQueryParserBase.getFieldQuery(SolrQueryParserBase.java:742)
 at 
 org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:541)
 at org.apache.solr.parser.QueryParser.Term(QueryParser.java:299)
 at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:185)
 at org.apache.solr.parser.QueryParser.Query(QueryParser.java:107)
 at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:96)
 at 
 org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:151)
 at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:50)
 at org.apache.solr.search.QParser.getQuery(QParser.java:141)
 at 
 org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:148)
 at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
 at 
 org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:64)
 at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1739)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 On Wed, Feb 25, 2015 at 3:08 AM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:
 The field definition looks fine. It's not storing any content
 (stored=false) but is indexing, so you should find the records but not
 see the body in them.

 Not seeing a log entry is more of a worry. Are you sure the request
 even made it to Solr?

 Can you see anything in Dovecot's logs? Or in Solr's access.logs
 (Actually Jetty/Tomcat's access logs that may need to be enabled
 first).

 At this point, you don't have enough information to fix anything. You
 need to understand what's different between request against subject
 vs. the request against body. I would break the communication in
 three stages:
 1) What Dovecot sent
 2) What Solr received
 3) What Solr sent back

 I don't know your skill levels or your system setup to advise
 specifically, but Network tracer (e.g. Wireshark) is good for 1. Logs
 are good for 2. Using the query from 1) and manually running it
 against Solr is good for 3).

 Hope this helps,
Alex.

 On 24 February 2015 at 12:35, Kevin Laurie superinterstel...@gmail.com 
 wrote:
 <field name="body" type="text" indexed="true" stored="false" />



 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


Re: Connect Solr with ODBC to Excel

2015-02-25 Thread Alexandre Rafalovitch
Which direction? Do you want to import data from Solr into Excel? One off or
repeatedly?

For a one-off Solr -> Excel, you could probably use Excel's "Open from
Web" and load data directly from Solr using the CSV output format.
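For example, a request like the following returns plain CSV that Excel can
open directly or that you can save with curl (a minimal sketch; the core
name and field list are assumptions):

curl 'http://localhost:8983/solr/collection1/select?q=*:*&wt=csv&fl=id,name,price&rows=10000' -o export.csv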

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com wrote:
 Hi there,

 I'm looking for a library to connect Solr through ODBC to Excel in order
 to do some reporting on my Solr data.
 Does anybody know of a library for that?

 Thanks.

 --
 Cordialement,
 Best regards,
 Hakim Benoudjit


Re: Connect Solr with ODBC to Excel

2015-02-25 Thread Hakim Benoudjit
Thanks for your answer.
For a one-off it seems like a nice way to import my data.
For an ODBC connection, the only solution I found is to replicate my Solr
data in Apache Hive (or Cassandra...), and then connect to that database
through ODBC.


2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com:

 Which direction? You want import data from Solr into Excel? One off or
 repeatedly?

 For one off Solr - Excel, you could probably use Excel's Open from
 Web and load data directly from Solr using CSV output format.

 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com
 wrote:
  Hi there,
 
  I'm looking for a library to connect Solr throught ODBC to Excel in order
  to do some reporting on my Solr data?
  Anybody knows a library for that?
 
  Thanks.
 
  --
  Cordialement,
  Best regards,
  Hakim Benoudjit




-- 
Cordialement,
Best regards,
Hakim Benoudjit


Re: Basic Multilingual search capability

2015-02-25 Thread Rishi Easwaran
Hi Alex,

Thanks for the suggestions. These steps will definitely help out with our use 
case.
Thanks for the idea about the lengthFilter to protect our system.

Thanks,
Rishi.

 

 

 

-Original Message-
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 8:50 am
Subject: Re: Basic Multilingual search capability


Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain
during indexing and route out at least known problematic languages,
such as Chinese, Japanese, Arabic into individual fields
2) Put everything else together into one field with ICUTokenizer,
maybe also ICUFoldingFilter
3) At the very end of that joint filter, stick in LengthFilter with
some high number, e.g. 25 characters max. This will ensure that
super-long words from non-space languages and edge conditions do not
break the rest of your system.


Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote:
 I understand relevancy, stemming etc becomes extremely complicated with 
multilingual support, but our first goal is to be able to tokenize and provide 
basic search capability for any language. Ex: When the document contains hello 
or здравствуйте, the analyzer creates tokens and provides exact match search 
results.

 


Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Shawn Heisey
On 2/25/2015 8:35 AM, Benson Margulies wrote:
 Do I need a zkcli bootstrap or do I start with upconfig? What port does
 zkRun put zookeeper on?

I personally would not use bootstrap options.  They are only meant to be
used once, when converting from non-cloud, but many people who use them
do NOT use them only once -- they include them in their startup scripts
and use them on every startup.  The whole thing becomes extremely
confusing.  I would just use zkcli and the Collections API, so nothing
ever happens that you don't explicitly request.

I believe that the port for embedded zookeeper (zkRun) is the jetty
listen port plus 1000, so 9983 if jetty.port is 8983 or not set.

Thanks,
Shawn



Re: performance issues with geofilt

2015-02-25 Thread david.w.smi...@gmail.com
Okay.  Just to re-emphasize something I said but which may not have been
clear, it isn't an either-or for filter & sort.  Filter with the spatial
field type that makes sense for filtering, sort (or boost) with the spatial
field type that makes sense for sorting.  RPT sucks for distance sorting,
LatLonType is good for it.
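In query terms that could look like the following (a minimal sketch;
geometry_ll is an assumed single-valued LatLonType copy of the point, while
the RPT field geometry keeps doing the filtering):

q=*:*&fq={!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0}
  &sfield=geometry_ll&pt=51.370570625523,12.369290471603&sort=geodist() asc

The sfield inside {!geofilt} overrides the global one for that filter, so
geodist() is free to point at the LatLonType field via the global sfield.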

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Wed, Feb 25, 2015 at 10:40 AM, dirk.thalh...@bkg.bund.de wrote:

 Hello David,

 thanks for your answer. In the meantime I found the memory hint too in
 http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4#Sorting_and_Relevancy

 Maybe we switch to LatLonType for this kind of searches. But the RPT is
 also needed as we want to support search by arbitrary polygons.

 I'm also able to use the sort=geodist() asc. This works well when I modify
 the parameters to:
 q=*:*&fq=typ:strasse&fq={!geofilt}&sfield=geometry&pt=51.370570625523,12.369290471603&d=1.0&sort=geodist()
 asc

 Kind regards,

 Dirk


 Tue, 24 Feb 2015 19:42:03 GMT, david.w.smi...@gmail.com wrote:

 Hi Dirk,

 The RPT field type can be used for distance sorting/boosting but it's a
 memory pig when used as-such so don't do it unless you have to.  You only
 have to if you have a multi-valued point field.  If you have single-valued,
 use LatLonType specifically for distance sorting.

 Your sample query doesn't parse correctly for multiple reasons.  You can't
 put a query into the sort parameter as you have done it.  You have to do
 sort=query($sortQuery) asc&sortQuery=...   or a slightly different
 equivalent
 variation.  Let's say you do that... still, I don't recommend this syntax
 when
 you simply want distance sort - just use geodist(), as in:  sort=geodist()
 asc.

 If you want to use this syntax such as to sort by recipDistance, then it
 would look like this (note the filter=false hint to the spatial query
 parser, which otherwise is unaware it shouldn't bother actually
 search/filter):
 sort=query($sortQuery) desc&sortQuery={!geofilt score=recipDistance
 filter=false sfield=geometry pt=51.3,12.3 d=1.0}

 If you are able to use geodist() and still find it slow, there are
 alternatives involving projected data and simple euclidean
 calculations, sqedist():

 https://wiki.apache.org/solr/FunctionQuery#sqedist_-_Squared_Euclidean_Distance

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley

 On Tue, Feb 24, 2015 at 6:12 AM, dirk.thalh...@bkg.bund.de wrote:

  Hello,
 
  we are using solr 4.10.1. There are two cores for different use cases
 with
  around 20 million documents (location descriptions) per core. Each
 document
  has a geometry field which stores a point and a bbox field which stores a
  bounding box. Both fields are defined with:
  <fieldType name="t_geometry"
      class="solr.SpatialRecursivePrefixTreeFieldType"
      spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
      geo="true" distErrPct="0.025" maxDistErr="0.9"
      units="degrees" />
 
  I'm currently trying to add a location search (find all documents around
 a
  point). My intention is to add this as filter query, so that the user is
  able to do an additional keyword search. These are the query parameters
 so
  far:
  q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry
  pt=51.370570625523,12.369290471603 d=1.0}
  To sort the documents by their distance to the requested point, I added
  following sort parameter:
  sort={!geofilt sort=distance sfield: geometry
  pt=51.370570625523,12.369290471603 d=1.0} asc
 
  Unfortunately I'm experiencing here some major performance/memory
  problems. The first distance query on a core takes over 10 seconds. In my
  first setup the same request to the second core completely blocked the
  server and caused an OutOfMemoryError. I had to increase the memory to 16
  GB and now it seems to work for the geometry field. Anyhow the first
  request after a server restart takes some time and when I try it with the
  bbox field after a requested on the geometry field in both cores, the
  server blocks again.
 
  Can anyone explain why the distance needs so much memory? Can this be
  optimized?
 
  Kind regards,
 
  Dirk
 
 




Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
Bingo!

Here's the recipe for the record:

 gcopts holds a ton of GC options.

First, set up shop:

DIR=$PWD
cd ../solr-4.10.3/example
java -Xmx200g $gcopts -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks \
  -Djetty.port=8983 -Dsolr.solr.home=/data/solr+rni/cloud_solr_home \
  -Dsolr.install.dir=/data/solr-4.10.3 -Duser.timezone=UTC \
  -Djava.net.preferIPv4Stack=true -DzkRun -jar start.jar &

and then:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=rni&numShards=8&replicationFactor=1&collection.configName=rni&maxShardsPerNode=8'



On Wed, Feb 25, 2015 at 11:03 AM, Benson Margulies
bimargul...@gmail.com wrote:
 It's the zkcli options on my mind. zkcli's usage shows me 'bootstrap',
 'upconfig', and uploading a solr.xml.

 When I use upconfig, it might work, but it sure is noise:

 benson@ip-10-111-1-103:/data/solr+rni$ 554331
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN
 org.apache.zookeeper.server.NIOServerCnxn  – caught end of stream
 exception
 EndOfStreamException: Unable to read additional data from client
 sessionid 0x14bc16c5e660003, likely client has closed socket
 at 
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
 at 
 org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
 at java.lang.Thread.run(Thread.java:745)

 On Wed, Feb 25, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 2/25/2015 8:35 AM, Benson Margulies wrote:
 Do I need a zkcli bootstrap or do I start with upconfig? What port does
 zkRun put zookeeper on?

 I personally would not use bootstrap options.  They are only meant to be
 used once, when converting from non-cloud, but many people who use them
 do NOT use them only once -- they include them in their startup scripts
 and use them on every startup.  The whole thing becomes extremely
 confusing.  I would just use zkcli and the Collections API, so nothing
 ever happens that you don't explicitly request.

 I believe that the port for embedded zookeeper (zkRun) is the jetty
 listen port plus 1000, so 9983 if jetty.port is 8983 or not set.

 Thanks,
 Shawn



Re: apache solr - dovecot - some search fields works some dont

2015-02-25 Thread Alexandre Rafalovitch
This is very serious. You are missing a field called text. You have
a field _type_ called text, maybe that's where the confusion came
from. Is that something you configured in Dovecot? Was it supposed to
be body or a catch-all field with copyFields into it?

I don't know Dovecot, but it is a clear mismatch between expectations
and reality. So, you need to check which one it is. One way would be
to query Solr directly and see if you have anything in your body
field. It's not stored, but you can check the indexed tokens in the
Web Admin UI under Schema Definition (or some such) by asking it to load
token values for that field. If you have content in the body field then
your indexing works, and either you need to search also against that
field or have copyField instructions (which should have come with the
Dovecot install).

Fix this first.
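If the UI is awkward, the Luke request handler shows the same thing from the
command line (a minimal sketch; the core name is an assumption):

curl 'http://localhost:8983/solr/dovecot/admin/luke?fl=body&numTerms=10&wt=json'

An empty top-terms list for body would point at the indexing side rather than
the query side.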

Regards,
   Alex.

On 25 February 2015 at 10:39, Kevin Laurie superinterstel...@gmail.com wrote:
 Hi Alex,

 I get 1 error on start up
 Is the error below serious:-


 2/25/2015, 11:32:30 PM ERROR SolrCore
 org.apache.solr.common.SolrException: undefined field text

 org.apache.solr.common.SolrException: undefined field text
 at 
 org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1269)
 at 
 org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getWrappedAnalyzer(IndexSchema.java:434)




Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Gary Taylor

Alex,

Thanks for the suggestions.  It always just indexes 1 doc, regardless of 
the first epub file it sees.  Debug / verbose don't show anything 
obvious to me.  I can include the output here if you think it would help.


I tried using the SimplePostTool first (java
-Dtype=application/epub+zip
-Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar
\Users\gt\Documents\epub\*.epub) to index the docs and check the Tika
parsing, and that works OK so I don't think it's the epubs.


I was trying to use DIH so that I could more easily specify the schema 
fields and store content in the index in preparation for trying out the 
search highlighting. Couldn't work out how to do that with post.jar 


Thanks,
Gary

On 25/02/2015 17:09, Alexandre Rafalovitch wrote:

Try removing that first epub from the directory and rerunning. If you
now index 0 documents, then there is something unexpected about them
and DIH skips. If it indexes 1 document again but a different one,
then it is definitely something about the repeat logic.

Also, try running with debug and verbose modes and see if something
specific shows up.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:

I can't get the FileListEntityProcessor and TikeEntityProcessor to correctly
add a Solr document for each epub file in my local directory.

I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start and
then solr create -c hn2 to create a new core.

I want to index a load of epub files that I've got in a directory. So I
created a data-import.xml (in solr\hn2\conf):

<dataConfig>
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                onError="skip"
                recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text"
                    dataSource="bin" onError="skip">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my
data-import.xml:

   <requestHandler name="/dataimport"
       class="org.apache.solr.handler.dataimport.DataImportHandler">
       <lst name="defaults">
           <str name="config">data-import.xml</str>
       </lst>
   </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields
were setup:

   <field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false" />
   <field name="fileName" type="string" indexed="true" stored="true" />
   <field name="author" type="string" indexed="true" stored="true" />
   <field name="title" type="string" indexed="true" stored="true" />

   <field name="size" type="long" indexed="true" stored="true" />
   <field name="lastModified" type="date" indexed="true" stored="true" />

   <field name="content" type="text_en" indexed="false" stored="true"
       multiValued="false"/>
   <field name="text" type="text_en" indexed="true" stored="false"
       multiValued="true"/>

 <copyField source="content" dest="text"/>

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting solr then creates a new managed-schema file and
renames schema.xml to schema.xml.back

All good so far.

Now I go to the web admin for dataimport
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and
execute a full import.

But, the results show Requests: 0, Fetched: 58, Skipped: 0, Processed:1 -
ie. it only adds one document (the very first one) even though it's iterated
over 58!

No errors are reported in the logs.

I can search on the contents of that first epub document, so it's extracting
OK in Tika, but there's a problem somewhere in my config that's causing only
1 document to be indexed in Solr.

Thanks for any assistance / pointers.

Regards,
Gary

--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



Facet By Distance

2015-02-25 Thread Ahmed Adel
Hello,

I'm trying to get Facet By Distance working on an index with LatLonType
fields. The schema is as follows:

<fields>
...
<field name="trip_duration" type="int" indexed="true" stored="true"/>
<field name="start_station" type="location" indexed="true" stored="true" />
<field name="end_station" type="location" indexed="true" stored="true" />
<field name="birth_year" type="int" stored="true"/>
<field name="gender" type="int" stored="true" />
...
</fields>


And the query I'm running is:

q=*:*&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange
l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()


But it returns all the documents in the index so it seems something is
missing. I'm using Solr 4.9.0.

--

A. Adel


Re: Do Multiprocessing on Solr to search?

2015-02-25 Thread Shawn Heisey
On 2/25/2015 9:31 AM, Nitin Solanki wrote:
 I want to search lakhs of queries/terms concurrently.

 Is there any technique to do multiprocessing on Solr?
 Is Solr is capable to handle this situation?
 I wrote code in Python that does multiprocessing and searches lakhs of
 queries, hitting Solr simultaneously/in parallel, but it seems
 that Solr isn't able to handle the queries all at once.
 Any help Please?

Solr is fully multi-threaded and capable of handling multiple requests
simultaneously.  Any of the common servlet containers that are typically
used to run Solr are *also* fully multi-threaded, but may require
configuration adjustment to allow more threads.  The jetty install that
comes with the Solr example server is tuned to allow 10,000 threads.

Even if you have a very well-tuned Solr install on exceptionally robust
hardware, I would not expect a single index on a single server to be
able to handle more than a few hundred requests per second.  If you need
hundreds of thousands of simultaneous queries, you're going to need a
lot of replicas on a lot of servers.  With that volume you would want a
load balancer to direct requests to those replicas.  You may also run
into problems related to TCP port exhaustion.

Thanks,
Shawn



Re: Solr Document expiration with TTL

2015-02-25 Thread Chris Hostetter

: Following query posts a document and sets expire_at_dt explicitly. That
: is working perfectly ok and the document expires at the defined time.

so the delete trigger logic is working correctly...

: But when trying to post with TTL (following query), document does not
: expire after given time.

...which suggests that the TTL-expire_at logic is not being applied 
properly.  

which is weird.

since your time_to_live_s and expire_at_dt fields are both
stored, can you confirm that an expire_at_dt field is getting populated by
the update processor by doing a simple query for your doc (ie
q=id:10seconds)?
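For example, something like this (a minimal sketch; the core name is an
assumption):

curl 'http://localhost:8983/solr/collection1/select?q=id:10seconds&fl=id,time_to_live_s,expire_at_dt&wt=json&indent=true'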

(either way: i can't explain why it's not getting deleted, but it would 
help narrow down where the problem is)


-Hoss
http://www.lucidworks.com/


RE: Do Multiprocessing on Solr to search?

2015-02-25 Thread Toke Eskildsen
Nitin Solanki [nitinml...@gmail.com] wrote:
I want to search lakhs of queries/terms concurrently.

 Is there any technique to do multiprocessing on Solr?

Each concurrent search in Solr runs in its own thread, so the answer is yes, it 
does so out of the box with concurrent searches.

 Is Solr is capable to handle this situation?

Yes and no. There is a limit to the number of concurrent connections and as far 
as I remember, it is 10.000 out of the box. If you are using SolrCloud, 
deadlocks might happen if you exceed the limit.

Anyway, I would not recommend running 10.000 concurrent searches as it leads to
congestion. You will probably get a higher throughput by queueing your requests
and processing them with 100 concurrent searches or so. Do test.
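One way to keep the client at roughly 100 requests in flight (a minimal
sketch; it assumes GNU xargs, a collection1 core, and a queries.txt file
with one URL-encoded query per line):

xargs -a queries.txt -P 100 -I{} \
  curl -s "http://localhost:8983/solr/collection1/select?q={}&rows=10&wt=json" -o /dev/null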

- Toke Eskildsen

Do Multiprocessing on Solr to search?

2015-02-25 Thread Nitin Solanki
Hello,
I want to search lakhs of queries/terms concurrently.

Is there any technique to do multiprocessing on Solr?
Is Solr is capable to handle this situation?
I wrote code in Python that does multiprocessing and searches lakhs of
queries, hitting Solr simultaneously/in parallel, but it seems
that Solr isn't able to handle the queries all at once.
Any help Please?


Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Alexandre Rafalovitch
What about recursive="true"? Do you have subdirectories that could
make a difference? Your SimplePostTool run would not look at
subdirectories (great comparison, BTW).

However, you do have lots of mapping options as well with
/update/extract handler, look at the example and documentations. There
is lots of mapping there.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 12:24, Gary Taylor g...@inovem.com wrote:
 Alex,

 Thanks for the suggestions.  It always just indexes 1 doc, regardless of the
 first epub file it sees.  Debug / verbose don't show anything obvious to me.
 I can include the output here if you think it would help.

 I tried using the SimplePostTool first ( *java -Dtype=application/epub+zip
 -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar
 \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika
 parsing and that works OK so I don't think it's the e*pubs.

 I was trying to use DIH so that I could more easily specify the schema
 fields and store content in the index in preparation for trying out the
 search highlighting. Couldn't work out how to do that with post.jar 

 Thanks,
 Gary


 On 25/02/2015 17:09, Alexandre Rafalovitch wrote:

 Try removing that first epub from the directory and rerunning. If you
 now index 0 documents, then there is something unexpected about them
 and DIH skips. If it indexes 1 document again but a different one,
 then it is definitely something about the repeat logic.

 Also, try running with debug and verbose modes and see if something
 specific shows up.

 Regards,
 Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:

 I can't get the FileListEntityProcessor and TikeEntityProcessor to
 correctly
 add a Solr document for each epub file in my local directory.

 I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start
 and
 then solr create -c hn2 to create a new core.

 I want to index a load of epub files that I've got in a directory. So I
 created a data-import.xml (in solr\hn2\conf):

 dataConfig
  dataSource type=BinFileDataSource name=bin /
  document
  entity name=files dataSource=null rootEntity=false
  processor=FileListEntityProcessor
  baseDir=c:/Users/gt/Documents/epub fileName=.*epub
  onError=skip
  recursive=true
  field column=fileAbsolutePath name=id /
  field column=fileSize name=size /
  field column=fileLastModified name=lastModified /

  entity name=documentImport
 processor=TikaEntityProcessor
  url=${files.fileAbsolutePath} format=text
 dataSource=bin onError=skip
  field column=file name=fileName/
  field column=Author name=author meta=true/
  field column=title name=title meta=true/
  field column=text name=content/
  /entity
  /entity
  /document
 /dataConfig

 In my solrconfig.xml, I added a requestHandler entry to reference my
 data-import.xml:

requestHandler name=/dataimport
 class=org.apache.solr.handler.dataimport.DataImportHandler
lst name=defaults
str name=configdata-import.xml/str
/lst
/requestHandler

 I renamed managed-schema to schema.xml, and ensured the following doc
 fields
 were setup:

field name=id type=string indexed=true stored=true
 required=true multiValued=false /
field name=fileName type=string indexed=true stored=true
 /
field name=author type=string indexed=true stored=true /
field name=title type=string indexed=true stored=true /

field name=size type=long indexed=true stored=true /
field name=lastModified type=date indexed=true
 stored=true /

field name=content type=text_en indexed=false stored=true
 multiValued=false/
field name=text type=text_en indexed=true stored=false
 multiValued=true/

  copyField source=content dest=text/

 I copied all the jars from dist and contrib\* into server\solr\lib.

 Stopping and restarting solr then creates a new managed-schema file and
 renames schema.xml to schema.xml.back

 All good so far.

 Now I go to the web admin for dataimport
 (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and
 execute a full import.

 But, the results show Requests: 0, Fetched: 58, Skipped: 0, Processed:1
 -
 ie. it only adds one document (the very first one) even though it's
 iterated
 over 58!

 No errors are reported in the logs.

 I can search on the contents of that first epub document, so it's
 extracting
 OK in Tika, but there's a problem somewhere in my config that's causing
 only
 1 document to be indexed in Solr.

 Thanks for any assistance / pointers.

 Regards,
 Gary

 --
 Gary Taylor | 

Re: Connect Solr with ODBC to Excel

2015-02-25 Thread Mikhail Khludnev
Some time ago I encountered https://github.com/kawasima/solr-jdbc but never tried
it. Anyway, it doesn't help to connect from ODBC.
Off the top of my head, there is
https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets but
it returns only JSON, not CSV; I wonder why.

Seems like a dead end so far.

On Wed, Feb 25, 2015 at 6:15 PM, Hakim Benoudjit h.benoud...@gmail.com
wrote:

 Thanks for your answer.
 For a one-off it seems like a nice way to import my data.
 For an ODBC connection, the only solution I found is to replicate my Solr
 data in Apache Hive (or Cassandra...), and then connect to that database
 through ODBC.


 2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com:

  Which direction? You want import data from Solr into Excel? One off or
  repeatedly?
 
  For one off Solr - Excel, you could probably use Excel's Open from
  Web and load data directly from Solr using CSV output format.
 
  Regards,
 Alex.
  
  Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
  http://www.solr-start.com/
 
 
  On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com
  wrote:
   Hi there,
  
   I'm looking for a library to connect Solr throught ODBC to Excel in
 order
   to do some reporting on my Solr data?
   Anybody knows a library for that?
  
   Thanks.
  
   --
   Cordialement,
   Best regards,
   Hakim Benoudjit
 



 --
 Cordialement,
 Best regards,
 Hakim Benoudjit




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Shawn Heisey
On 2/25/2015 9:03 AM, Benson Margulies wrote:
 It's the zkcli options on my mind. zkcli's usage shows me 'bootstrap',
 'upconfig', and uploading a solr.xml.

 When I use upconfig, it might work, but it sure is noise:

 benson@ip-10-111-1-103:/data/solr+rni$ 554331
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN
 org.apache.zookeeper.server.NIOServerCnxn  – caught end of stream
 exception
 EndOfStreamException: Unable to read additional data from client
 sessionid 0x14bc16c5e660003, likely client has closed socket
 at 
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
 at 
 org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
 at java.lang.Thread.run(Thread.java:745)

The upconfig command is VERY noisy.  A LOT of data is printed whether
it's successful or not, and exceptions on a successful upload would
actually not surprise me.  An issue to reduce the zkcli output to short
informational/error messages rather than the full zookeeper client
logging is something I'll do soon if someone else doesn't get to it.

I had never noticed the bootstrap option to zkcli before ... based on
the options shown, I think it's meant to convert an entire non-cloud
(and probably non-redundant) Solr installation (all cores currently
present in the solr home) to SolrCloud.  It's a conversion that would
work, but I think it would be very ugly.  There's also a bootstrap
option for Solr that does this.

Thanks,
Shawn



Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Alexandre Rafalovitch
Try removing that first epub from the directory and rerunning. If you
now index 0 documents, then there is something unexpected about them
and DIH skips. If it indexes 1 document again but a different one,
then it is definitely something about the repeat logic.

Also, try running with debug and verbose modes and see if something
specific shows up.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:
 I can't get the FileListEntityProcessor and TikeEntityProcessor to correctly
 add a Solr document for each epub file in my local directory.

 I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start and
 then solr create -c hn2 to create a new core.

 I want to index a load of epub files that I've got in a directory. So I
 created a data-import.xml (in solr\hn2\conf):

 dataConfig
 dataSource type=BinFileDataSource name=bin /
 document
 entity name=files dataSource=null rootEntity=false
 processor=FileListEntityProcessor
 baseDir=c:/Users/gt/Documents/epub fileName=.*epub
 onError=skip
 recursive=true
 field column=fileAbsolutePath name=id /
 field column=fileSize name=size /
 field column=fileLastModified name=lastModified /

 entity name=documentImport processor=TikaEntityProcessor
 url=${files.fileAbsolutePath} format=text
 dataSource=bin onError=skip
 field column=file name=fileName/
 field column=Author name=author meta=true/
 field column=title name=title meta=true/
 field column=text name=content/
 /entity
 /entity
 /document
 /dataConfig

 In my solrconfig.xml, I added a requestHandler entry to reference my
 data-import.xml:

   requestHandler name=/dataimport
 class=org.apache.solr.handler.dataimport.DataImportHandler
   lst name=defaults
   str name=configdata-import.xml/str
   /lst
   /requestHandler

 I renamed managed-schema to schema.xml, and ensured the following doc fields
 were setup:

   field name=id type=string indexed=true stored=true
 required=true multiValued=false /
   field name=fileName type=string indexed=true stored=true /
   field name=author type=string indexed=true stored=true /
   field name=title type=string indexed=true stored=true /

   field name=size type=long indexed=true stored=true /
   field name=lastModified type=date indexed=true stored=true /

   field name=content type=text_en indexed=false stored=true
 multiValued=false/
   field name=text type=text_en indexed=true stored=false
 multiValued=true/

 copyField source=content dest=text/

 I copied all the jars from dist and contrib\* into server\solr\lib.

 Stopping and restarting solr then creates a new managed-schema file and
 renames schema.xml to schema.xml.back

 All good so far.

 Now I go to the web admin for dataimport
 (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and
 execute a full import.

 But, the results show Requests: 0, Fetched: 58, Skipped: 0, Processed:1 -
 ie. it only adds one document (the very first one) even though it's iterated
 over 58!

 No errors are reported in the logs.

 I can search on the contents of that first epub document, so it's extracting
 OK in Tika, but there's a problem somewhere in my config that's causing only
 1 document to be indexed in Solr.

 Thanks for any assistance / pointers.

 Regards,
 Gary

 --
 Gary Taylor | www.inovem.com | www.kahootz.com

 INOVEM Ltd is registered in England and Wales No 4228932
 Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
 kahootz.com is a trading name of INOVEM Ltd.



Re: how to debug solr performance degradation

2015-02-25 Thread Boogie Shafer
rebecca,

you probably need to dig into your queries, but if you want to force/preload 
the index into memory you could try doing something like

cat `find /path/to/solr/index` > /dev/null


if you haven't already reviewed the following, you might take a look here
https://wiki.apache.org/solr/SolrPerformanceProblems

perhaps going back to a very vanilla/default solr configuration and building
back up from that baseline to better isolate what specific setting might be
impacting your environment


From: Tang, Rebecca rebecca.t...@ucsf.edu
Sent: Wednesday, February 25, 2015 11:44
To: solr-user@lucene.apache.org
Subject: RE: how to debug solr performance degradation

Sorry, I should have been more specific.

I was referring to the solr admin UI page. Today we started up an AWS
instance with 240 G of memory to see whether, if we fit all of our index (183G)
in memory and leave enough for the JVM, it could improve the performance.

I attached the admin UI screen shot with the email.

The top bar is "Physical Memory" and we have 240.24 GB, but only 4% (9.52
GB) is used.

The next bar is Swap Space and it's at 0.00 MB.

The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G.

My understanding is that when Solr starts up, it reserves some memory for
the JVM, and then it tries to use up as much of the remaining physical
memory as possible.  And I used to see the physical memory at anywhere
between 70% to 90+%.  Is this understanding correct?

And now, even with 240G of memory, our index is performing at 10 - 20
seconds for a query.  Granted that our queries have fq's and highlighting
and faceting, I think with a machine this powerful I should be able to get
the queries executed under 5 seconds.

This is what we send to Solr:
q=(phillip%20morris)
wt=json
start=0
rows=50
facet=true
facet.mincount=0
facet.pivot=industry,collection_facet
facet.pivot=availability_facet,availabilitystatus_facet
facet.field=dddate
fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank%
20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be
gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder
%20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she
et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page%
22%20OR%20dt%3A%22tab%20sheet%22))
facet.field=dt_facet
facet.field=brd_facet
facet.field=dg_facet
hl=true
hl.simple.pre=%3Ch1%3E
hl.simple.post=%3C%2Fh1%3E
hl.requireFieldMatch=false
hl.preserveMulti=true
hl.fl=ot,ti
f.ot.hl.fragsize=300
f.ot.hl.alternateField=ot
f.ot.hl.maxAlternateFieldLength=300
f.ti.hl.fragsize=300
f.ti.hl.alternateField=ti
f.ti.hl.maxAlternateFieldLength=300
fq={!collapse%20field=signature}
expand=true
sort=score+desc,availability_facet+asc


My guess is that it's performing so badly because it's only using 4% of
the memory? And searches require disk access.


Rebecca

From: Shawn Heisey [apa...@elyograg.org]
Sent: Tuesday, February 24, 2015 5:23 PM
To: solr-user@lucene.apache.org
Subject: Re: how to debug solr performance degradation

On 2/24/2015 5:45 PM, Tang, Rebecca wrote:
 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.

 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?

I would like to know what memory numbers in which program you are
looking at, and why you believe those numbers are a problem.

The JVM has a very different view of memory than the operating system.
Numbers in top mean different things than numbers on the dashboard of
the admin UI, or the numbers in jconsole.  If you're on Windows, then
replace top with task manager, process explorer, resource monitor, etc.

Please provide as many details as you can about the things you are
looking at.

Thanks,
Shawn



RE: how to debug solr performance degradation

2015-02-25 Thread Toke Eskildsen
Unfortunately (or luckily, depending on view), attachments do not work with
this mailing list. You'll have to upload it somewhere and provide a URL. It is
quite hard _not_ to get your whole index into disk cache, so my guess is that
it will get there eventually. Just to check: if you re-issue your queries, does
the response time change? If not, then disk caching is not the problem.

Anyway, with your new information, I would say that pivot faceting is the
culprit. Do the timing tests in
https://issues.apache.org/jira/browse/SOLR-6803 line up with the cardinalities
of your fields?

My next step would be to disable parts of the query (highlight, faceting and 
collapsing one at a time) to check which part is the heaviest.
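In practice that can be as simple as re-issuing the exact same request with
one feature switched off at a time and comparing QTime (a minimal sketch; it
assumes the request handler defaults do not turn them back on):

  ...&hl=false       (highlighting off)
  ...&facet=false    (faceting off)
  ...&expand=false   plus dropping the fq={!collapse field=signature}
                     filter (collapsing/expanding off)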

- Toke Eskildsen

From: Tang, Rebecca [rebecca.t...@ucsf.edu]
Sent: 25 February 2015 20:44
To: solr-user@lucene.apache.org
Subject: RE: how to debug solr performance degradation

Sorry, I should have been more specific.

I was referring to the solr admin UI page. Today we started up an AWS
instance with 240 G of memory to see whether, if we fit all of our index (183G)
in memory and leave enough for the JVM, it could improve the performance.

I attached the admin UI screen shot with the email.

The top bar is "Physical Memory" and we have 240.24 GB, but only 4% (9.52
GB) is used.

The next bar is Swap Space and it's at 0.00 MB.

The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G.

My understanding is that when Solr starts up, it reserves some memory for
the JVM, and then it tries to use up as much of the remaining physical
memory as possible.  And I used to see the physical memory at anywhere
between 70% to 90+%.  Is this understanding correct?

And now, even with 240G of memory, our index is performing at 10 - 20
seconds for a query.  Granted that our queries have fq's and highlighting
and faceting, I think with a machine this powerful I should be able to get
the queries executed under 5 seconds.

This is what we send to Solr:
q=(phillip%20morris)
wt=json
start=0
rows=50
facet=true
facet.mincount=0
facet.pivot=industry,collection_facet
facet.pivot=availability_facet,availabilitystatus_facet
facet.field=dddate
fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank%
20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be
gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder
%20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she
et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page%
22%20OR%20dt%3A%22tab%20sheet%22))
facet.field=dt_facet
facet.field=brd_facet
facet.field=dg_facet
hl=true
hl.simple.pre=%3Ch1%3E
hl.simple.post=%3C%2Fh1%3E
hl.requireFieldMatch=false
hl.preserveMulti=true
hl.fl=ot,ti
f.ot.hl.fragsize=300
f.ot.hl.alternateField=ot
f.ot.hl.maxAlternateFieldLength=300
f.ti.hl.fragsize=300
f.ti.hl.alternateField=ti
f.ti.hl.maxAlternateFieldLength=300
fq={!collapse%20field=signature}
expand=true
sort=score+desc,availability_facet+asc


My guess is that it's performing so badly because it's only using 4% of
the memory? And searches require disk access.


Rebecca

From: Shawn Heisey [apa...@elyograg.org]
Sent: Tuesday, February 24, 2015 5:23 PM
To: solr-user@lucene.apache.org
Subject: Re: how to debug solr performance degradation

On 2/24/2015 5:45 PM, Tang, Rebecca wrote:
 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.

 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?

I would like to know what memory numbers in which program you are
looking at, and why you believe those numbers are a problem.

The JVM has a very different view of memory than the operating system.
Numbers in top mean different things than numbers on the dashboard of
the admin UI, or the numbers in jconsole.  If you're on Windows, then
replace top with task manager, process explorer, resource monitor, etc.

Please provide as many details as you can about the things you are
looking at.

Thanks,
Shawn




Re: New leader/replica solution for HDFS

2015-02-25 Thread Joseph Obernberger
I am also confused on this.  Is adding replicas going to increase search 
performance?  I'm not sure I see the point of any replicas when using 
HDFS.  Is there one?

Thank you!

-Joe

On 2/25/2015 10:57 AM, Erick Erickson wrote:

bq: And the data sync between leader/replica is always a problem

Not quite sure what you mean by this. There shouldn't need to be
any synching in the sense that the index gets replicated, the
incoming documents should be sent to each node (and indexed
to HDFS) as they come in.

bq: There is duplicate index computing on the replica side.

Yes, that's the design of SolrCloud, explicitly to provide data safety.
If you instead rely on the leader to index and somehow pull that
indexed form to the replica, then you will lose data if the leader
goes down before sending the indexed form.

bq: My thought is that the leader and the replica all bind to the same data
index directory.

This is unsafe. They would both then try to _write_ to the same
index, which can easily corrupt indexes and/or all but the first
one to access the index would be locked out.

All that said, the HDFS triple-redundancy compounded with the
Solr leaders/replicas redundancy means a bunch of extra
storage. You can turn the HDFS replication down to 1, but that has
other implications.

Best,
Erick

On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote:

We used HDFS as our Solr index storage and we really have a heavy update
load. We have run into many problems with the current leader/replica solution. There
is duplicate index computing on the replica side. And the data sync between
leader/replica is always a problem.

As HDFS already provides data replication on data layer, could Solr provide
just service layer replication?

My thought is that the leader and the replica all bind to the same data
index directory. The leader would build up the index for new requests, and the
replica would just keep its index version up to date with the leader (such as via a
periodic soft commit?). If the leader is lost then the replica would take
over immediately.

Thanks for any suggestion of this idea.







--
View this message in context: 
http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: how to debug solr performance degradation

2015-02-25 Thread Tang, Rebecca
Sorry, I should have been more specific.

I was referring to the solr admin UI page. Today we started up an AWS
instance with 240 G of memory to see whether, if we fit all of our index (183G)
in memory and leave enough for the JVM, it could improve the performance.

I attached the admin UI screen shot with the email.

The top bar is "Physical Memory" and we have 240.24 GB, but only 4% (9.52
GB) is used.

The next bar is Swap Space and it's at 0.00 MB.

The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G.

My understanding is that when Solr starts up, it reserves some memory for
the JVM, and then it tries to use up as much of the remaining physical
memory as possible.  And I used to see the physical memory at anywhere
between 70% to 90+%.  Is this understanding correct?

And now, even with 240G of memory, our index is performing at 10 - 20
seconds for a query.  Granted that our queries have fq's and highlighting
and faceting, I think with a machine this powerful I should be able to get
the queries executed under 5 seconds.

This is what we send to Solr:
q=(phillip%20morris)
wt=json
start=0
rows=50
facet=true
facet.mincount=0
facet.pivot=industry,collection_facet
facet.pivot=availability_facet,availabilitystatus_facet
facet.field=dddate
fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank%
20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be
gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder
%20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she
et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page%
22%20OR%20dt%3A%22tab%20sheet%22))
facet.field=dt_facet
facet.field=brd_facet
facet.field=dg_facet
hl=true
hl.simple.pre=%3Ch1%3E
hl.simple.post=%3C%2Fh1%3E
hl.requireFieldMatch=false
hl.preserveMulti=true
hl.fl=ot,ti
f.ot.hl.fragsize=300
f.ot.hl.alternateField=ot
f.ot.hl.maxAlternateFieldLength=300
f.ti.hl.fragsize=300
f.ti.hl.alternateField=ti
f.ti.hl.maxAlternateFieldLength=300
fq={!collapse%20field=signature}
expand=true
sort=score+desc,availability_facet+asc


My guess is that it's performing so badly because it's only using 4% of
the memory? And searches require disk access.


Rebecca

From: Shawn Heisey [apa...@elyograg.org]
Sent: Tuesday, February 24, 2015 5:23 PM
To: solr-user@lucene.apache.org
Subject: Re: how to debug solr performance degradation

On 2/24/2015 5:45 PM, Tang, Rebecca wrote:
 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.

 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?

I would like to know what memory numbers in which program you are
looking at, and why you believe those numbers are a problem.

The JVM has a very different view of memory than the operating system.
Numbers in top mean different things than numbers on the dashboard of
the admin UI, or the numbers in jconsole.  If you're on Windows, then
replace top with task manager, process explorer, resource monitor, etc.

Please provide as many details as you can about the things you are
looking at.

Thanks,
Shawn




Re: Add fields without manually editing Schema.xml.

2015-02-25 Thread Vishal Swaroop
Thanks a lot Alex...

I thought about dynamic fields and will also explore the suggested
options...

On Wed, Feb 25, 2015 at 1:40 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 Several ways. Reading through tutorials should help to get the
 details. But in short:
 1) Map them to dynamic fields using prefixes and/or suffixes.
 2) Use dynamic schema which will guess the types and creates the
 fields based on first use

 Something like SIREn might also be of interest:
 http://siren.solutions/siren/overview/

 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 25 February 2015 at 13:26, Vishal Swaroop vishal@gmail.com wrote:
  Hi,
 
  Just wondering if there is a way to handle this use-case in SOLR without
  manually editing Schema.xml.
 
  Scenario :
  We have xml data with some elements/ attributes which we plan to index.
  As we move forward there can be addition of xml elements.
 
  Is there a way to handle this with out manually adding fields /changing
 in
  schema.xml ?
 
  Thanks
  V



Re: Add fields without manually editing Schema.xml.

2015-02-25 Thread Alexandre Rafalovitch
Several ways. Reading through tutorials should help to get the
details. But in short:
1) Map them to dynamic fields using prefixes and/or suffixes.
2) Use dynamic schema mode, which will guess the types and create the
fields based on first use.
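For option 1, for example, with the stock dynamic field rules still in the
schema (*_s, *_i, *_f and so on; that part is an assumption about your
setup), new attributes can be indexed without touching schema.xml:

curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc1","color_s":"red","weight_f":1.5}]'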

Something like SIREn might also be of interest:
http://siren.solutions/siren/overview/

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 13:26, Vishal Swaroop vishal@gmail.com wrote:
 Hi,

 Just wondering if there is a way to handle this use-case in SOLR without
 manually editing Schema.xml.

 Scenario :
 We have xml data with some elements/ attributes which we plan to index.
 As we move forward there can be addition of xml elements.

 Is there a way to handle this with out manually adding fields /changing in
 schema.xml ?

 Thanks
 V


Re: Facet By Distance

2015-02-25 Thread Ahmed Adel
Hi,
Thank you for your reply. I added a filter query to the query in two ways
as follows:

fq={!geofilt}&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange
l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()&d=0.2
-- returns 0 docs

q=*:*&fq={!geofilt}&sfield=start_station&pt=40.71754834,-74.01322069&d=0.2
-- returns 1484 docs

Not sure why the first query returns 0 documents.

On Wed, Feb 25, 2015 at 8:46 PM, david.w.smi...@gmail.com 
david.w.smi...@gmail.com wrote:

 Hi,
 This will return all the documents in the index because you did nothing
 to filter them out.  Your query is *:* (everything) and there are no filter
 queries.

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley

 On Wed, Feb 25, 2015 at 12:27 PM, Ahmed Adel ahmed.a...@badrit.com
 wrote:

  Hello,
 
  I'm trying to get Facet By Distance working on an index with LatLonType
  fields. The schema is as follows:
 
  <fields>
  ...
  <field name="trip_duration" type="int" indexed="true" stored="true"/>
  <field name="start_station" type="location" indexed="true" stored="true"/>
  <field name="end_station" type="location" indexed="true" stored="true"/>
  <field name="birth_year" type="int" stored="true"/>
  <field name="gender" type="int" stored="true"/>
  ...
  </fields>
 
 
  And the query I'm running is:
 
 
 q=*:*&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange
  l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()
 
 
  But it returns all the documents in the index so it seems something is
  missing. I'm using Solr 4.9.0.
 
  --
 
  A. Adel
 


A. Adel


Re: Stop solr query

2015-02-25 Thread Mikhail Khludnev
No. You can stop only the search (collecting results), not the query expansion.
As I said, debugQuery=true, a stack trace, or sampling can help to
understand the reason.
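
For reference, both knobs are plain request parameters; an illustrative request
(collection name, query, and the 5000 ms limit are placeholders, and as noted
timeAllowed only covers the collect phase in 4.x):

  http://localhost:8983/solr/collection1/select?q=foo&debugQuery=true&timeAllowed=5000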

On Wed, Feb 25, 2015 at 5:45 PM, Moshe Recanati mos...@kmslh.com wrote:

 HI Mikhail,
 We're using 4.7.1. This means I can't stop the search.
 I think this is a mandatory feature.


 Regards,
 Moshe Recanati
 SVP Engineering
 Office + 972-73-2617564
 Mobile  + 972-52-6194481
 Skype:  recanati

 More at:  www.kmslh.com | LinkedIn | FB


 -Original Message-
 From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
 Sent: Wednesday, February 25, 2015 3:42 PM
 To: solr-user
 Subject: Re: Stop solr query

 Moshe,

 if you take a thread dump while a particular query stuck (via jstack of in
 SolrAdmin tab), it may explain where exactly it's stalled, just check the
 longest stack trace.
 FWIW, in 4.x timeAllowed is checked only while documents are collected,
 and in 5 it's also checked during query expansion (see
 http://lucidworks.com/blog/solr-5-0/; requests are now cut off,
 https://issues.apache.org/jira/browse/SOLR-5986, during the
 query-expansion stage as well). However, I'm not sure that a long
 query expansion is what happens with hon-lucene-synonyms.



 On Wed, Feb 25, 2015 at 3:21 PM, Moshe Recanati mos...@kmslh.com wrote:

  Hi Shawn,
  We checked this option and it didn't solve our problem.
  We're using https://github.com/healthonnet/hon-lucene-synonyms for
  query based synonyms.
  While running a query with a high number of words that have a high number of
  synonyms, the query got stuck and Solr memory was exhausted.
  We tried to use this parameter suggested by you however it didn't stop
  the query and solve the issue.
 
  Please let me know if there is another option to tackle it. Today it
  might be a high number of words that causes the issue, and tomorrow it
  might be something else. We can't rely only on user input checks.
 
  Thank you in advance.
 
 
  Regards,
  Moshe Recanati
  SVP Engineering
  Office + 972-73-2617564
  Mobile  + 972-52-6194481
  Skype:  recanati
 
  More at:  www.kmslh.com | LinkedIn | FB
 
 
  -Original Message-
  From: Shawn Heisey [mailto:apa...@elyograg.org]
  Sent: Monday, February 23, 2015 5:49 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Stop solr query
 
  On 2/23/2015 7:23 AM, Moshe Recanati wrote:
   Recently there were some scenarios in which queries that user sent
   to solr got stuck and increased our solr heap.
  
   Is there any option to kill or timeout query that wasn't returned
   from solr by external command?
  
 
  The best thing you can do is examine all user input and stop such
  queries before they execute, especially if they are the kind of query
  that will cause your heap to grow out of control.
 
  The timeAllowed parameter can abort a query that takes too long in
  certain phases of the query.  In recent months, Solr has been modified
  so that timeAllowed will take effect during more query phases.  It is
  not a perfect solution, but it can be better than nothing.
 
  http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed
 
  Be aware that sometimes legitimate queries will be slow, and using
  timeAllowed may cause those queries to fail.
 
  Thanks,
  Shawn
 
 


 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Facet By Distance

2015-02-25 Thread david.w.smi...@gmail.com
Hi,
This will “return all the documents in the index” because you did nothing
to filter them out.  Your query is *:* (everything) and there are no filter
queries.

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Wed, Feb 25, 2015 at 12:27 PM, Ahmed Adel ahmed.a...@badrit.com wrote:

 Hello,

 I'm trying to get Facet By Distance working on an index with LatLonType
 fields. The schema is as follows:

 <fields>
 ...
 <field name="trip_duration" type="int" indexed="true" stored="true"/>
 <field name="start_station" type="location" indexed="true" stored="true"/>
 <field name="end_station" type="location" indexed="true" stored="true"/>
 <field name="birth_year" type="int" stored="true"/>
 <field name="gender" type="int" stored="true"/>
 ...
 </fields>


 And the query I'm running is:

 q=*:*&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange
 l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()


 But it returns all the documents in the index so it seems something is
 missing. I'm using Solr 4.9.0.

 --

 A. Adel



Add fields without manually editing Schema.xml.

2015-02-25 Thread Vishal Swaroop
Hi,

Just wondering if there is a way to handle this use-case in SOLR without
manually editing Schema.xml.

Scenario :
We have xml data with some elements/ attributes which we plan to index.
As we move forward there can be addition of xml elements.

Is there a way to handle this without manually adding fields / changing
schema.xml?

Thanks
V


Re: Connect Solr with ODBC to Excel

2015-02-25 Thread Hakim Benoudjit
Thanks for the two links.
The first one could be helpful if it works.
Regarding the second one, I think it's quite similar to using /select to
return json format.

2015-02-25 19:10 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com:

 Some time ago I encountered https://github.com/kawasima/solr-jdbc but never
 tried
 it. Anyway, it doesn't help to connect from ODBC.
 Off the top of my head, there is
 https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets but
 it returns only JSON, not CSV. I wonder why that is.

 Seems like a dead end so far.

 On Wed, Feb 25, 2015 at 6:15 PM, Hakim Benoudjit h.benoud...@gmail.com
 wrote:

  Thanks for your answer.
  For a one-off it seems like a nice way to import my data.
  For an ODBC connection, the only solution I found is to replicate my Solr
  data in Apache Hive (or Cassandra...), and then connect to that database
  through ODBC.
 
 
  2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com:
 
   Which direction? You want import data from Solr into Excel? One off or
   repeatedly?
  
   For one off Solr - Excel, you could probably use Excel's Open from
   Web and load data directly from Solr using CSV output format.
  
   Regards,
  Alex.
   
   Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
   http://www.solr-start.com/
  
  
   On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com
   wrote:
Hi there,
   
 I'm looking for a library to connect Solr through ODBC to Excel in
  order
to do some reporting on my Solr data?
Anybody knows a library for that?
   
Thanks.
   
--
Cordialement,
Best regards,
Hakim Benoudjit
  
 
 
 
  --
  Cordialement,
  Best regards,
  Hakim Benoudjit
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com




-- 
Cordialement,
Best regards,
Hakim Benoudjit


problems retrieving term vectors using RealTimeGetHandler

2015-02-25 Thread Scott C. Cote
I’m working with term vectors via solr.  
Is there a way to configure the RealTimeGetHandler to return tv info?

Here is my environment info:

Scotts-MacBook-Air-2:solr_jetty scottccote$ java -version
java version 1.8.0_31
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)


Scotts-MacBook-Air-2:solr_jetty scottccote$ uname -a
Darwin Scotts-MacBook-Air-2.local 14.1.0 Darwin Kernel Version 14.1.0: Mon Dec 
22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64


solr-4.10.3

Here is my attempted configuration

  <searchComponent name="tvComponent"
    class="org.apache.solr.handler.component.TermVectorComponent"/>
  <requestHandler name="/get" class="solr.RealTimeGetHandler">
    <lst name="defaults">
      <str name="omitHeader">true</str>
      <bool name="tv">true</bool>
    </lst>
    <arr name="last-components">
      <str>tvComponent</str>
    </arr>
  </requestHandler>

Here is my request on the Solr Admin panel

qt is set to …  /get
Raw Query Parameters are set to …  id=7&tv=true&tv.all=true

http://localhost:8983/solr/question/get?wt=json&indent=true&id=7&tv=true&tv.all=true

which generates the following response (with error)

{
  doc: {
id: 7,
classId: class1,
studentId: fdsfsd,
originalText: sing for raj,
filteredText: [
  sing for raj
],
_version_: 1493662750219436000
  },
  termVectors: [
uniqueKeyFieldName,
id
  ],
  error: {
trace: java.lang.NullPointerException\n\tat 
org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:251)\n\tat
 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)\n\tat
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat
 org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)\n\tat
 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)\n\tat
 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat
 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat
 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat
 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat
 org.eclipse.jetty.server.Server.handle(Server.java:368)\n\tat 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)\n\tat
 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)\n\tat
 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)\n\tat
 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)\n\tat
 org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)\n\tat 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)\n\tat 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)\n\tat
 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat
 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat
 java.lang.Thread.run(Thread.java:745)\n,
code: 500
  }
}

and stack trace on the server

8702 [qtp24433162-15] INFO  org.apache.solr.servlet.SolrDispatchFilter  – 
[admin] webapp=null path=/admin/info/system params={wt=json_=1424895828590} 
status=0 QTime=34 
1645307 [qtp24433162-15] ERROR org.apache.solr.core.SolrCore  – 
java.lang.NullPointerException
at 
org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:251)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
at 

Re: Facet By Distance

2015-02-25 Thread david.w.smi...@gmail.com
If ‘q’ is absent, then you always match nothing (there may be exceptions?);
so it’s sort of required, in effect.  I wish it defaulted to *:*.
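
Putting the pieces from this thread together, a request along these lines
should behave (untested sketch; note that facet=true also has to be on, either
in the request or as a handler default):

  q=*:*&fq={!geofilt}&sfield=start_station&pt=40.71754834,-74.01322069&d=0.2&facet=true
  &facet.query={!frange l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()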

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Wed, Feb 25, 2015 at 2:28 PM, Ahmed Adel ahmed.a...@badrit.com wrote:

 Hi,
 Thank you for your reply. I added a filter query to the query in two ways
 as follows:


 fq={!geofilt}&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange
 l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()&d=0.2
 -- returns 0 docs

 q=*:*&fq={!geofilt}&sfield=start_station&pt=40.71754834,-74.01322069&d=0.2
 -- returns 1484 docs

 Not sure why the first query returns 0 documents

 On Wed, Feb 25, 2015 at 8:46 PM, david.w.smi...@gmail.com 
 david.w.smi...@gmail.com wrote:

  Hi,
  This will return all the documents in the index because you did nothing
  to filter them out.  Your query is *:* (everything) and there are no
 filter
  queries.
 
  ~ David Smiley
  Freelance Apache Lucene/Solr Search Consultant/Developer
  http://www.linkedin.com/in/davidwsmiley
 
  On Wed, Feb 25, 2015 at 12:27 PM, Ahmed Adel ahmed.a...@badrit.com
  wrote:
 
   Hello,
  
   I'm trying to get Facet By Distance working on an index with LatLonType
   fields. The schema is as follows:
  
   <fields>
   ...
   <field name="trip_duration" type="int" indexed="true" stored="true"/>
   <field name="start_station" type="location" indexed="true" stored="true"/>
   <field name="end_station" type="location" indexed="true" stored="true"/>
   <field name="birth_year" type="int" stored="true"/>
   <field name="gender" type="int" stored="true"/>
   ...
   </fields>
  
  
   And the query I'm running is:
  
  
 
  q=*:*&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange
    l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()
  
  
   But it returns all the documents in the index so it seems something is
   missing. I'm using Solr 4.9.0.
  
   --
  
   A. Adel
  
 

 A. Adel



Re: Add fields without manually editing Schema.xml.

2015-02-25 Thread Jack Krupansky
Solr also now has a schema API to dynamically edit the schema without the
need to manually edit the schema file:
https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-AddaDynamicFieldRule
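
For example, a dynamic field rule can be added with a simple HTTP call
(collection and field names are illustrative; this assumes a managed schema,
and exact command names may differ slightly between Solr versions, see the
page above):

  curl http://localhost:8983/solr/collection1/schema -H 'Content-Type: application/json' -d '{
    "add-dynamic-field": {
      "name": "*_txt",
      "type": "text_general",
      "indexed": true,
      "stored": true
    }
  }'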


-- Jack Krupansky

On Wed, Feb 25, 2015 at 3:15 PM, Vishal Swaroop vishal@gmail.com
wrote:

 Thanks a lot Alex...

 I thought about dynamic fields and will also explore the suggested
 options...

 On Wed, Feb 25, 2015 at 1:40 PM, Alexandre Rafalovitch arafa...@gmail.com
 
 wrote:

  Several ways. Reading through tutorials should help to get the
  details. But in short:
  1) Map them to dynamic fields using prefixes and/or suffixes.
  2) Use dynamic schema which will guess the types and creates the
  fields based on first use
 
  Something like SIREn might also be of interest:
  http://siren.solutions/siren/overview/
 
  Regards,
 Alex.
  
  Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
  http://www.solr-start.com/
 
 
  On 25 February 2015 at 13:26, Vishal Swaroop vishal@gmail.com
 wrote:
   Hi,
  
   Just wondering if there is a way to handle this use-case in SOLR
 without
   manually editing Schema.xml.
  
   Scenario :
   We have xml data with some elements/ attributes which we plan to index.
   As we move forward there can be addition of xml elements.
  
    Is there a way to handle this without manually adding fields /changing
  in
   schema.xml ?
  
   Thanks
   V
 



Re: Connect Solr with ODBC to Excel

2015-02-25 Thread Mikhail Khludnev
On Wed, Feb 25, 2015 at 10:31 PM, Hakim Benoudjit h.benoud...@gmail.com
wrote:

 Thanks for the two links.
 The first one could be helpful if it works.
 Regarding the second one,



 I think it's quite similar to using /select to
 return json format.

not really. /export yields much more data, faster.
Also, if you are interested in a relatively short result set, you
can use /select&wt=csv; no facets in this case, sadly. Just fyi.
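
Illustrative requests for both (field names are placeholders; /export requires
the fl and sort fields to have docValues enabled):

  http://localhost:8983/solr/collection1/export?q=*:*&sort=id+asc&fl=id,price
  http://localhost:8983/solr/collection1/select?q=*:*&wt=csv&rows=1000&fl=id,price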



 2015-02-25 19:10 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com:

  Some time ago I encountered https://github.com/kawasima/solr-jdbc but never
  tried
  it. Anyway, it doesn't help to connect from ODBC.
  Off the top of my head, there is
  https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
 but
  it returns only JSON, not CSV. I wonder why that is.
 
  Seems like a dead end so far.
 
  On Wed, Feb 25, 2015 at 6:15 PM, Hakim Benoudjit h.benoud...@gmail.com
  wrote:
 
   Thanks for your answer.
   For a one-off it seems like a nice way to import my data.
   For an ODBC connection, the only solution I found is to replicate my
 Solr
   data in Apache Hive (or Cassandra...), and then connect to that
 database
   through ODBC.
  
  
   2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com:
  
Which direction? You want import data from Solr into Excel? One off
 or
repeatedly?
   
For one off Solr - Excel, you could probably use Excel's Open from
Web and load data directly from Solr using CSV output format.
   
Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
   
   
On 25 February 2015 at 08:52, Hakim Benoudjit h.benoud...@gmail.com
 
wrote:
 Hi there,

  I'm looking for a library to connect Solr through ODBC to Excel in
   order
 to do some reporting on my Solr data?
 Anybody knows a library for that?

 Thanks.

 --
 Cordialement,
 Best regards,
 Hakim Benoudjit
   
  
  
  
   --
   Cordialement,
   Best regards,
   Hakim Benoudjit
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com
 



 --
 Cordialement,
 Best regards,
 Hakim Benoudjit




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


RE: Collations are not working fine.

2015-02-25 Thread Reitzel, Charles
Hi Rajesh,

That was very helpful.   Based on your experience, I dug deeper into it and 
figured out that it does attempt to return collations for single term queries 
in my configuration as well.   However, in the test cases I have been using, 
the suggested correction never gets any hits.   Again, this is based on our use 
cases that always have at least one filter query present.   As soon as I 
dropped the filter query, sure enough, collations were returned for the single 
term.

But this still doesn't solve my original problem:  The original term is never 
included in the collation results (or validated with a query like the suggested 
corrections).   Thus, if it is a valid term, we don't want to throw it away.   
It would be great to have the collator validate it as a term (perhaps 
conditionally, based on the  exactMatchFirst component dictionary parameter).   
But, at this point, I'm happy to just consult the origFreq value in the 
extended results.
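
For anyone following along, the request parameters in play look roughly like
this (handler path and values are illustrative); with
spellcheck.extendedResults=true each suggestion block carries an origFreq,
which tells you whether the original term itself occurs in the index:

  .../select?q=brown&spellcheck=true&spellcheck.collate=true
     &spellcheck.extendedResults=true&spellcheck.alternativeTermCount=5
     &spellcheck.maxCollationTries=10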

Thanks,
Charlie

-Original Message-
From: Rajesh Hazari [mailto:rajeshhaz...@gmail.com] 
Sent: Monday, February 23, 2015 11:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Collations are not working fine.

Hi,

We have used the spellcheck component with the configs below to get the best collation
(an exact collation) when a query has either a single term or multiple terms.

As Charles mentioned above, we do have a check on getOriginalFrequency() for
each term in our service before we send the spellcheck response to the client. This may
not be the case for you; hope this helps.

<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">100</int>
    <str name="df">textSpell</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <int name="spellcheck.count">5</int>
    <str name="spellcheck.alternativeTermCount">15</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.maxCollations">100</str>
    <str name="spellcheck.collateParam.mm">100%</str>
    <str name="spellcheck.collateParam.q.op">AND</str>
    <str name="spellcheck.maxCollationTries">1000</str>
    <str name="q.op">OR</str>
    ...
  </lst>
</requestHandler>
...
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">textSpell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">false</str>
    <int name="maxChanges">5</int>
  </lst>

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">textSpell</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <!-- <str name="classname">solr.DirectSolrSpellChecker</str> -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str> -->
    <str name="accuracy">0.75</str>
    <float name="thresholdTokenFrequency">0.01</float>
    <str name="buildOnCommit">true</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
  </lst>

</searchComponent>



Rajesh.

On Fri, Feb 20, 2015 at 8:42 AM, Nitin Solanki nitinml...@gmail.com wrote:

 How to get only the best collations whose hits are more and need to 
 sort them?

 On Wed, Feb 18, 2015 at 3:53 AM, Reitzel, Charles  
 charles.reit...@tiaa-cref.org wrote:

  Hi Nitin,
 
  I was trying many different options for a couple different queries.   In
  fact, I have collations working ok now with the Suggester and WFSTLookup.
   The problem may have been due to a different dictionary and/or 
  lookup implementation and the specific options I was sending.
 
  In general, we're using spellcheck for search suggestions.   The
 Suggester
  component (vs. Suggester spellcheck implementation), doesn't handle 
  all
 of
  our cases.  But we can get things working using the spellcheck interface.
  What gives us particular troubles are the cases where a term may be 
  valid by itself, but also be the start of longer words.
 
  The specific terms are acronyms specific to our business.   But I'll
  attempt to show generic examples.
 
  E.g. a partial term like fo can expand to fox, fog, etc. and a 
  full
 term
  like brown can also expand to something like brownstone.   And, yes, the
  collation brownstone fox is nonsense.  But assume, for the sake of 
  argument, it appears in our documents somewhere.
 
  For multiple term query with a spelling error (or partially typed term):
  brown fo
 
  We get collations in order of hits, descending like ...
  brown fox,
  brown fog,
  brownstone fox.
 
  So far, so good.
 
  For a single term query, brown, we get a single suggestion, 
  brownstone
 and
  no collations.
 
  So, we don't know to keep the term brown!
 
  At this point, we need spellcheck.extendedResults=true and 

Re: how to debug solr performance degradation

2015-02-25 Thread Erick Erickson
Before diving in too deeply, try attaching debug=timing to the query.
Near the bottom of the response there'll be a list of the time taken
by each _component_. So there'll be separate entries for query,
highlighting, etc.

This may not show any surprises, you might be spending all your time
scoring. But it's worth doing as a check and might save you from going
down some dead-ends. I mean if your query winds up spending 80% of its
time in the highlighter you know where to start looking..
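
For example, just append it to the existing request (values illustrative):

  http://localhost:8983/solr/collection1/select?q=(phillip%20morris)&rows=50&debug=timing

and look at the debug/timing section of the response, which reports prepare
and process times for each search component.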

Best,
Erick


On Wed, Feb 25, 2015 at 12:01 PM, Boogie Shafer
boogie.sha...@proquest.com wrote:
 rebecca,

 you probably need to dig into your queries, but if you want to force/preload 
 the index into memory you could try doing something like

 cat `find /path/to/solr/index` > /dev/null


 if you haven't already reviewed the following, you might take a look here
 https://wiki.apache.org/solr/SolrPerformanceProblems

 perhaps going back to a very vanilla/default solr configuration and building 
 back up from that baseline to better isolate what might specific setting be 
 impacting your environment

 
 From: Tang, Rebecca rebecca.t...@ucsf.edu
 Sent: Wednesday, February 25, 2015 11:44
 To: solr-user@lucene.apache.org
 Subject: RE: how to debug solr performance degradation

 Sorry, I should have been more specific.

 I was referring to the Solr admin UI page. Today we started up an AWS
 instance with 240 G of memory to see if fitting all of our index (183G) in
 memory, with enough left over for the JVM, could improve the performance.

 I attached the admin UI screen shot with the email.

 The top bar is "Physical Memory" and we have 240.24 GB, but only 4% (9.52
 GB) is used.

 The next bar is Swap Space and it's at 0.00 MB.

 The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G.

 My understanding is that when Solr starts up, it reserves some memory for
 the JVM, and then it tries to use up as much of the remaining physical
 memory as possible.  And I used to see the physical memory at anywhere
 between 70% to 90+%.  Is this understanding correct?

 And now, even with 240G of memory, our index is performing at 10 - 20
 seconds for a query.  Granted that our queries have fq's and highlighting
 and faceting, I think with a machine this powerful I should be able to get
 the queries executed under 5 seconds.

 This is what we send to Solr:
 q=(phillip%20morris)
 wt=json
 start=0
 rows=50
 facet=true
 facet.mincount=0
 facet.pivot=industry,collection_facet
 facet.pivot=availability_facet,availabilitystatus_facet
 facet.field=dddate
 fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank%
 20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be
 gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder
 %20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she
 et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page%
 22%20OR%20dt%3A%22tab%20sheet%22))
 facet.field=dt_facet
 facet.field=brd_facet
 facet.field=dg_facet
 hl=true
 hl.simple.pre=%3Ch1%3E
 hl.simple.post=%3C%2Fh1%3E
 hl.requireFieldMatch=false
 hl.preserveMulti=true
 hl.fl=ot,ti
 f.ot.hl.fragsize=300
 f.ot.hl.alternateField=ot
 f.ot.hl.maxAlternateFieldLength=300
 f.ti.hl.fragsize=300
 f.ti.hl.alternateField=ti
 f.ti.hl.maxAlternateFieldLength=300
 fq={!collapse%20field=signature}
 expand=true
 sort=score+desc,availability_facet+asc


 My guess is that it's performing so badly because it's only using 4% of
 the memory? And searches require disk access.


 Rebecca
 
 From: Shawn Heisey [apa...@elyograg.org]
 Sent: Tuesday, February 24, 2015 5:23 PM
 To: solr-user@lucene.apache.org
 Subject: Re: how to debug solr performance degradation

 On 2/24/2015 5:45 PM, Tang, Rebecca wrote:
 We gave the machine 180G mem to see if it improves performance.  However,
 after we increased the memory, Solr started using only 5% of the physical
 memory.  It has always used 90-something%.

 What could be causing solr to not grab all the physical memory (grabbing
 so little of the physical memory)?

 I would like to know what memory numbers in which program you are
 looking at, and why you believe those numbers are a problem.

 The JVM has a very different view of memory than the operating system.
 Numbers in top mean different things than numbers on the dashboard of
 the admin UI, or the numbers in jconsole.  If you're on Windows, then
 replace top with task manager, process explorer, resource monitor, etc.

 Please provide as many details as you can about the things you are
 looking at.

 Thanks,
 Shawn



Re: Facet By Distance

2015-02-25 Thread Alexandre Rafalovitch
In the examples it used to default to *:* with default params, which
caused even more confusion.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 15:21, david.w.smi...@gmail.com
david.w.smi...@gmail.com wrote:
 If ‘q’ is absent, then you always match nothing (there may be exceptions?);
 so it’s sort of required, in effect.  I wish it defaulted to *:*.

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley

 On Wed, Feb 25, 2015 at 2:28 PM, Ahmed Adel ahmed.a...@badrit.com wrote:

 Hi,
 Thank you for your reply. I added a filter query to the query in two ways
 as follows:


 fq={!geofilt}&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange
 l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()&d=0.2
 -- returns 0 docs

 q=*:*&fq={!geofilt}&sfield=start_station&pt=40.71754834,-74.01322069&d=0.2
 -- returns 1484 docs

 Not sure why the first query returns 0 documents

 On Wed, Feb 25, 2015 at 8:46 PM, david.w.smi...@gmail.com 
 david.w.smi...@gmail.com wrote:

  Hi,
  This will return all the documents in the index because you did nothing
  to filter them out.  Your query is *:* (everything) and there are no
 filter
  queries.
 
  ~ David Smiley
  Freelance Apache Lucene/Solr Search Consultant/Developer
  http://www.linkedin.com/in/davidwsmiley
 
  On Wed, Feb 25, 2015 at 12:27 PM, Ahmed Adel ahmed.a...@badrit.com
  wrote:
 
   Hello,
  
   I'm trying to get Facet By Distance working on an index with LatLonType
   fields. The schema is as follows:
  
    <fields>
    ...
    <field name="trip_duration" type="int" indexed="true" stored="true"/>
    <field name="start_station" type="location" indexed="true" stored="true"/>
    <field name="end_station" type="location" indexed="true" stored="true"/>
    <field name="birth_year" type="int" stored="true"/>
    <field name="gender" type="int" stored="true"/>
    ...
    </fields>
  
  
   And the query I'm running is:
  
  
 
  q=*:*&sfield=start_station&pt=40.71754834,-74.01322069&facet.query={!frange
    l=0.0 u=0.1}geodist()&facet.query={!frange l=0.10001 u=0.2}geodist()
  
  
   But it returns all the documents in the index so it seems something is
   missing. I'm using Solr 4.9.0.
  
   --
  
   A. Adel
  
 

 A. Adel



Re: Basic Multilingual search capability

2015-02-25 Thread Tom Burton-West
Hi Rishi,

As others have indicated Multilingual search is very difficult to do well.

At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to
deal with having materials in 400 languages.  We also added the
CJKBigramFilter to get better precision on CJK queries.  We don't use stop
words because stop words in one language are content words in another.  For
example die in German is a stopword but it is a content word in English.
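
For reference, a minimal sketch of such an analysis chain (the field type name
is made up, and the order and extra filters are simplified here):

  <fieldType name="text_multilingual" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>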

Putting multiple languages in one index can affect word frequency
statistics which make relevance ranking less accurate.  So for example for
the English query Die Hard the word die would get a low idf score
because it occurs so frequently in German.  We realize that our  approach
does not produce the best results, but given the 400 languages, and limited
resources, we do our best to make search not suck for non-English
languages.   When we have the resources we are thinking about doing special
processing for a small fraction of the top 20 languages.  We plan to select
those languages that most need special processing and are relatively easy to
disambiguate from other languages.


If you plan on identifying languages (rather than scripts), you should be
aware that most language detection libraries don't work well on short texts
such as queries.

If you know that you have scripts for which you have content in only one
language, you can use script detection instead of language detection.


If you have German, a filter length of 25 might be too low (Because of
compounding). You might want to analyze a sample of your German text to
find a good length.

Tom

http://www.hathitrust.org/blogs/Large-scale-Search


On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran rishi.easwa...@aol.com
wrote:

 Hi Alex,

 Thanks for the suggestions. These steps will definitely help out with our
 use case.
 Thanks for the idea about the lengthFilter to protect our system.

 Thanks,
 Rishi.







 -Original Message-
 From: Alexandre Rafalovitch arafa...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tue, Feb 24, 2015 8:50 am
 Subject: Re: Basic Multilingual search capability


 Given the limited needs, I would probably do something like this:

 1) Put a language identifier in the UpdateRequestProcessor chain
 during indexing and route out at least known problematic languages,
 such as Chinese, Japanese, Arabic into individual fields
 2) Put everything else together into one field with ICUTokenizer,
 maybe also ICUFoldingFilter
 3) At the very end of that joint filter, stick in LengthFilter with
 some high number, e.g. 25 characters max. This will ensure that
 super-long words from non-space languages and edge conditions do not
 break the rest of your system.


 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org
 wrote:
  I understand relevancy, stemming etc becomes extremely complicated with
 multilingual support, but our first goal is to be able to tokenize and
 provide
 basic search capability for any language. Ex: When the document contains
 hello
 or здравствуйте, the analyzer creates tokens and provides exact match
 search
 results.





Re: New leader/replica solution for HDFS

2015-02-25 Thread Erick Erickson
bq: Is adding replicas going to increase search performance?

Absolutely, assuming you've maxed out Solr. You can scale the SOLR
query/second rate nearly linearly by adding replicas regardless of
whether it's over HDFS or not.

Having multiple replicas per shard _also_ increases fault tolerance,
so you get both. Even with HDFS, though, a single replica (just a
leader) per shard means that you don't have any redundancy if the
motherboard on that server dies even though HDFS has multiple copies
of the _data_.

Best,
Erick

On Wed, Feb 25, 2015 at 12:01 PM, Joseph Obernberger
j...@lovehorsepower.com wrote:
 I am also confused on this.  Is adding replicas going to increase search
 performance?  I'm not sure I see the point of any replicas when using HDFS.
 Is there one?
 Thank you!

 -Joe


 On 2/25/2015 10:57 AM, Erick Erickson wrote:

 bq: And the data sync between leader/replica is always a problem

 Not quite sure what you mean by this. There shouldn't need to be
 any synching in the sense that the index gets replicated, the
 incoming documents should be sent to each node (and indexed
 to HDFS) as they come in.

 bq: There is duplicate index computing on Replica side.

 Yes, that's the design of SolrCloud, explicitly to provide data safety.
 If you instead rely on the leader to index and somehow pull that
 indexed form to the replica, then you will lose data if the leader
 goes down before sending the indexed form.

 bq: My thought is that the leader and the replica all bind to the same
 data
 index directory.

 This is unsafe. They would both then try to _write_ to the same
 index, which can easily corrupt indexes and/or all but the first
 one to access the index would be locked out.

 All that said, the HDFS triple-redundancy compounded with the
 Solr leaders/replicas redundancy means a bunch of extra
 storage. You can turn the HDFS replication down to 1, but that has
 other implications.

 Best,
 Erick

 On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote:

 We used HDFS as our Solr index storage and we really have a heavy update
 load. We have met many problems with the current leader/replica solution.
 There
 is duplicate index computing on the replica side. And the data sync between
 leader/replica is always a problem.

 As HDFS already provides data replication on data layer, could Solr
 provide
 just service layer replication?

 My thought is that the leader and the replica all bind to the same data
 index directory. And the leader will build up index for new request, the
 replica will just keep update the index version with the leader(such as a
 soft commit periodically?). If the leader is lost, then the replica will
 take
 over the duty immediately.

 Thanks for any suggestion of this idea.







 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: New leader/replica solution for HDFS

2015-02-25 Thread Joseph Obernberger
Thank you!  I'm mainly concerned about facet performance.  When we have 
indexing turned on, our facet performance suffers significantly.

I will add replicas and measure the performance change.

-Joe Obernberger

On 2/25/2015 4:31 PM, Erick Erickson wrote:

bq: Is adding replicas going to increase search performance?

Absolutely, assuming you've maxed out Solr. You can scale the SOLR
query/second rate nearly linearly by adding replicas regardless of
whether it's over HDFS or not.

Having multiple replicas per shard _also_ increases fault tolerance,
so you get both. Even with HDFS, though, a single replica (just a
leader) per shard means that you don't have any redundancy if the
motherboard on that server dies even though HDFS has multiple copies
of the _data_.

Best,
Erick

On Wed, Feb 25, 2015 at 12:01 PM, Joseph Obernberger
j...@lovehorsepower.com wrote:

I am also confused on this.  Is adding replicas going to increase search
performance?  I'm not sure I see the point of any replicas when using HDFS.
Is there one?
Thank you!

-Joe


On 2/25/2015 10:57 AM, Erick Erickson wrote:

bq: And the data sync between leader/replica is always a problem

Not quite sure what you mean by this. There shouldn't need to be
any synching in the sense that the index gets replicated, the
incoming documents should be sent to each node (and indexed
to HDFS) as they come in.

bq: There is duplicate index computing on Replica side.

Yes, that's the design of SolrCloud, explicitly to provide data safety.
If you instead rely on the leader to index and somehow pull that
indexed form to the replica, then you will lose data if the leader
goes down before sending the indexed form.

bq: My thought is that the leader and the replica all bind to the same
data
index directory.

This is unsafe. They would both then try to _write_ to the same
index, which can easily corrupt indexes and/or all but the first
one to access the index would be locked out.

All that said, the HDFS triple-redundancy compounded with the
Solr leaders/replicas redundancy means a bunch of extra
storage. You can turn the HDFS replication down to 1, but that has
other implications.

Best,
Erick

On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote:

We used HDFS as our Solr index storage and we really have a heavy update
load. We have met many problems with the current leader/replica solution.
There
is duplicate index computing on the replica side. And the data sync between
leader/replica is always a problem.

As HDFS already provides data replication on data layer, could Solr
provide
just service layer replication?

My thought is that the leader and the replica all bind to the same data
index directory. And the leader will build up index for new request, the
replica will just keep update the index version with the leader(such as a
soft commit periodically?). If the leader is lost, then the replica will
take
over the duty immediately.

Thanks for any suggestion of this idea.







--
View this message in context:
http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html
Sent from the Solr - User mailing list archive at Nabble.com.






Re: how to debug solr performance degradation

2015-02-25 Thread Otis Gospodnetic
Lots of suggestions here already.  +1 for those JVM params from Boogie and
for looking at JMX.
Rebecca, try SPM http://sematext.com/spm (will look at JMX for you, among
other things), it may save you time figuring out
JVM/heap/memory/performance issues.  If you can't tell what's slow via SPM,
we can have a look at your metrics (charts are sharable) and may be able to
help you faster than guessing.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Wed, Feb 25, 2015 at 4:27 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Before diving in too deeply, try attaching debug=timing to the query.
 Near the bottom of the response there'll be a list of the time taken
 by each _component_. So there'll be separate entries for query,
 highlighting, etc.

 This may not show any surprises, you might be spending all your time
 scoring. But it's worth doing as a check and might save you from going
 down some dead-ends. I mean if your query winds up spending 80% of its
 time in the highlighter you know where to start looking..

 Best,
 Erick


 On Wed, Feb 25, 2015 at 12:01 PM, Boogie Shafer
 boogie.sha...@proquest.com wrote:
  rebecca,
 
  you probably need to dig into your queries, but if you want to
 force/preload the index into memory you could try doing something like
 
  cat `find /path/to/solr/index` > /dev/null
 
 
  if you haven't already reviewed the following, you might take a look here
  https://wiki.apache.org/solr/SolrPerformanceProblems
 
  perhaps going back to a very vanilla/default solr configuration and
 building back up from that baseline to better isolate what might specific
 setting be impacting your environment
 
  
  From: Tang, Rebecca rebecca.t...@ucsf.edu
  Sent: Wednesday, February 25, 2015 11:44
  To: solr-user@lucene.apache.org
  Subject: RE: how to debug solr performance degradation
 
  Sorry, I should have been more specific.
 
  I was referring to the Solr admin UI page. Today we started up an AWS
  instance with 240 G of memory to see if fitting all of our index (183G) in
  memory, with enough left over for the JVM, could improve the performance.
 
  I attached the admin UI screen shot with the email.
 
  The top bar is "Physical Memory" and we have 240.24 GB, but only 4% (9.52
  GB) is used.

  The next bar is Swap Space and it's at 0.00 MB.
 
  The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G.
 
  My understanding is that when Solr starts up, it reserves some memory for
  the JVM, and then it tries to use up as much of the remaining physical
  memory as possible.  And I used to see the physical memory at anywhere
  between 70% to 90+%.  Is this understanding correct?
 
  And now, even with 240G of memory, our index is performing at 10 - 20
  seconds for a query.  Granted that our queries have fq's and highlighting
  and faceting, I think with a machine this powerful I should be able to
 get
  the queries executed under 5 seconds.
 
  This is what we send to Solr:
  q=(phillip%20morris)
  wt=json
  start=0
  rows=50
  facet=true
  facet.mincount=0
  facet.pivot=industry,collection_facet
  facet.pivot=availability_facet,availabilitystatus_facet
  facet.field=dddate
 
 fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank%
 
 20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be
 
 gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder
 
 %20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she
 
 et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page%
  22%20OR%20dt%3A%22tab%20sheet%22))
  facet.field=dt_facet
  facet.field=brd_facet
  facet.field=dg_facet
  hl=true
  hl.simple.pre=%3Ch1%3E
  hl.simple.post=%3C%2Fh1%3E
  hl.requireFieldMatch=false
  hl.preserveMulti=true
  hl.fl=ot,ti
  f.ot.hl.fragsize=300
  f.ot.hl.alternateField=ot
  f.ot.hl.maxAlternateFieldLength=300
  f.ti.hl.fragsize=300
  f.ti.hl.alternateField=ti
  f.ti.hl.maxAlternateFieldLength=300
  fq={!collapse%20field=signature}
  expand=true
  sort=score+desc,availability_facet+asc
 
 
  My guess is that it's performing so badly because it's only using 4% of
  the memory? And searches require disk access.
 
 
  Rebecca
  
  From: Shawn Heisey [apa...@elyograg.org]
  Sent: Tuesday, February 24, 2015 5:23 PM
  To: solr-user@lucene.apache.org
  Subject: Re: how to debug solr performance degradation
 
  On 2/24/2015 5:45 PM, Tang, Rebecca wrote:
  We gave the machine 180G mem to see if it improves performance.
 However,
  after we increased the memory, Solr started using only 5% of the
 physical
  memory.  It has always used 90-something%.
 
  What could be causing solr to not grab all the physical memory (grabbing
  so little of the physical memory)?
 
  I would like to know what memory numbers in which program you are
  looking at, and why you 

Re: [ANNOUNCE] Luke 4.10.3 released

2015-02-25 Thread Dmitry Kan
Hi Tomoko,

Thanks for the link. Do you have build instructions somewhere? When I
executed ant with no params, I get:

BUILD FAILED
/home/dmitry/projects/svn/luke/build.xml:40:
/home/dmitry/projects/svn/luke/lib-ivy does not exist.


On Thu, Feb 26, 2015 at 2:27 AM, Tomoko Uchida tomoko.uchida.1...@gmail.com
 wrote:

 Thanks!

 Would you announce at LUCENE-2562 to me and all watchers interested in this
 issue, when the branch is ready? :)
 As you know, current pivots's version (that supports Lucene 4.10.3) is
 here.
 http://svn.apache.org/repos/asf/lucene/sandbox/luke/

 Regards,
 Tomoko

 2015-02-25 18:37 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

  Ok, sure. The plan is to make the pivot branch in the current github repo
  and update its structure accordingly.
  Once it is there, I'll let you know.
 
  Thank you,
  Dmitry
 
  On Tue, Feb 24, 2015 at 5:26 PM, Tomoko Uchida 
  tomoko.uchida.1...@gmail.com
   wrote:
 
   Hi Dmitry,
  
   Thank you for the detailed clarification!
  
    Recently, I've created a few patches to the Pivot version (LUCENE-2562), so
   I'd
    like to do some more work and keep it up to date.
  
If you would like to work on the Pivot version, may I suggest you to
  fork
the github's version? The ultimate goal is to donate this to Apache,
  but
   at
least we will have the common plate. :)
  
    Yes, I love the idea of having a common code base.
   I've looked at both codes of github's (thinlet's) and Pivot's, Pivot's
   version has very different structure from github's (I think that is
  mainly
   for UI framework's requirement.)
   So it seems to be difficult to directly fork github's version to
 develop
   Pivot's version..., but I think I (or any other developers) could catch
  up
   changes in github's version.
   There's long way to go for Pivot's version, of course, I'd like to also
   make pull requests to enhance github's version if I can.
  
   Thanks,
   Tomoko
  
   2015-02-24 23:34 GMT+09:00 Dmitry Kan solrexp...@gmail.com:
  
Hi, Tomoko!
   
Thanks for being a fan of luke!
   
Current status of github's luke (https://github.com/DmitryKey/luke)
 is
that
it has releases for all the major lucene versions since 4.3.0,
  excluding
4.4.0 (luke 4.5.0 should be able open indices of 4.4.0) and the
 latest
  --
5.0.0.
   
Porting the github's luke to ALv2 compliant framework (GWT or Pivot)
  is a
long standing goal. With GWT I had issues related to listing and
  reading
the index directory. So this effort has been parked. Most recently I
  have
been approaching the Pivot. Mark Miller has done an initial port,
 that
  I
took as the basis. I'm hoping to continue on this track as time
  permits.
   
   
If you would like to work on the Pivot version, may I suggest you to
  fork
the github's version? The ultimate goal is to donate this to Apache,
  but
   at
least we will have the common plate. :)
   
   
Thanks,
Dmitry
   
On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida 
tomoko.uchida.1...@gmail.com
 wrote:
   
 Hi,

 I'm a user / fan of Luke, so I deeply appreciate your work.

 I've carefully read the readme, noticed the (one of) project's
 goal:
 To port the thinlet UI to an ASL compliant license framework so
 that
   it
 can be contributed back to Apache Lucene. Current work is done with
  GWT
 2.5.1.

 There has been GWT based, ASL compliant Luke supporting the latest
Lucene ?

 I've recently got in with LUCENE-2562. Currently, Apache Pivot
 based
   port
 is going. But I do not know so much about Luke's long (and may be
slightly
 complex) history, so I would be grateful if anybody could clarify the
  association
   of
 the Luke project (now on Github) and the Jira issue. Or, they can
 be
 independent of each other.
 https://issues.apache.org/jira/browse/LUCENE-2562
 I don't have any opinions, just want to understand current status
 and
avoid
 duplicate works.

 Apologize for a bit annoying post.

 Many thanks,
 Tomoko



 2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

  Hello,
 
  Luke 4.10.3 has been released. Download it here:
 
  https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3
 
  The release has been tested against the solr-4.10.3 based index.
 
  Issues fixed in this release: #13
  https://github.com/DmitryKey/luke/pull/13
  Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2
 
  Thanks to respective contributors!
 
 
  P.S. waiting for lucene 5.0 artifacts to hit public maven
   repositories
 for
  the next major release of luke.
 
  --
  Dmitry Kan
  Luke Toolbox: http://github.com/DmitryKey/luke
  Blog: http://dmitrykan.blogspot.com
  Twitter: http://twitter.com/dmitrykan
  SemanticAnalyzer: www.semanticanalyzer.info
 

   
   
  

Customized search handler components and cloud

2015-02-25 Thread Benson Margulies
We have a pair of customized search components which we used
successfully with SolrCloud some releases back (4.x). In 4.10.3, I am
trying to find the point of departure in debugging why we get no
results back when querying them with a sharded index.

If I query the regular /select, all is swell. Obviously, there's a
debugger in my future, but I wonder if this rings any bells for
anyone.


Here's what we add to solrconfig.xml.

  <searchComponent name="name-indexing-query"
    class="com.basistech.rni.solr.NameIndexingQueryComponent"/>
  <searchComponent name="name-indexing-rescore"
    class="com.basistech.rni.solr.NameIndexingRescoreComponent"/>

  <requestHandler name="/RNI" class="solr.SearchHandler" default="false">
    <arr name="first-components">
      <str>name-indexing-query</str>
      <str>name-indexing-rescore</str>
    </arr>
  </requestHandler>


Re: Connect Solr with ODBC to Excel

2015-02-25 Thread Hakim Benoudjit
I'll need to use /export since I retrieve a large amount of data.
And I don't really need facets, so it won't be an issue.
Thanks again for your help.

2015-02-25 21:26 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com:

 On Wed, Feb 25, 2015 at 10:31 PM, Hakim Benoudjit h.benoud...@gmail.com
 wrote:

  Thanks for the two links.
  The first one could be helpful if it works.
  Regarding the second one,



  I think it's quite similar to using /select to
  return json format.
 
  not really. /export yields much more data, faster.
  Also, if you are interested in a relatively short result set, you
  can use /select&wt=csv; no facets in this case, sadly. Just fyi.


 
  2015-02-25 19:10 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com
 :
 
    Some time ago I encountered https://github.com/kawasima/solr-jdbc but never
    tried
    it. Anyway, it doesn't help to connect from ODBC.
    Off the top of my head, there is
    https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
   but
    it returns only JSON, not CSV. I wonder why that is.
  
   Seems like a dead end so far.
  
   On Wed, Feb 25, 2015 at 6:15 PM, Hakim Benoudjit 
 h.benoud...@gmail.com
   wrote:
  
Thanks for your answer.
For a one-off it seems like a nice way to import my data.
For an ODBC connection, the only solution I found is to replicate my
  Solr
data in Apache Hive (or Cassandra...), and then connect to that
  database
through ODBC.
   
   
2015-02-25 15:49 GMT+01:00 Alexandre Rafalovitch arafa...@gmail.com
 :
   
 Which direction? You want import data from Solr into Excel? One off
  or
 repeatedly?

 For one off Solr - Excel, you could probably use Excel's Open from
 Web and load data directly from Solr using CSV output format.

 Regards,
Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 25 February 2015 at 08:52, Hakim Benoudjit 
 h.benoud...@gmail.com
  
 wrote:
  Hi there,
 
   I'm looking for a library to connect Solr through ODBC to Excel
 in
order
  to do some reporting on my Solr data?
  Anybody knows a library for that?
 
  Thanks.
 
  --
  Cordialement,
  Best regards,
  Hakim Benoudjit

   
   
   
--
Cordialement,
Best regards,
Hakim Benoudjit
   
  
  
  
   --
   Sincerely yours
   Mikhail Khludnev
   Principal Engineer,
   Grid Dynamics
  
   http://www.griddynamics.com
   mkhlud...@griddynamics.com
  
 
 
 
  --
  Cordialement,
  Best regards,
  Hakim Benoudjit
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com




-- 
Cordialement,
Best regards,
Hakim Benoudjit


Re: [ANNOUNCE] Luke 4.10.3 released

2015-02-25 Thread Tomoko Uchida
Thanks!

Would you announce at LUCENE-2562 to me and all watchers interested in this
issue, when the branch is ready? :)
As you know, current pivots's version (that supports Lucene 4.10.3) is here.
http://svn.apache.org/repos/asf/lucene/sandbox/luke/

Regards,
Tomoko

2015-02-25 18:37 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

 Ok, sure. The plan is to make the pivot branch in the current github repo
 and update its structure accordingly.
 Once it is there, I'll let you know.

 Thank you,
 Dmitry

 On Tue, Feb 24, 2015 at 5:26 PM, Tomoko Uchida 
 tomoko.uchida.1...@gmail.com
  wrote:

  Hi Dmitry,
 
  Thank you for the detailed clarification!
 
   Recently, I've created a few patches to the Pivot version (LUCENE-2562), so
  I'd
   like to do some more work and keep it up to date.
 
   If you would like to work on the Pivot version, may I suggest you to
 fork
   the github's version? The ultimate goal is to donate this to Apache,
 but
  at
   least we will have the common plate. :)
 
  Yes, I love the idea of having a common code base.
  I've looked at both codes of github's (thinlet's) and Pivot's, Pivot's
  version has very different structure from github's (I think that is
 mainly
  for UI framework's requirement.)
  So it seems to be difficult to directly fork github's version to develop
  Pivot's version..., but I think I (or any other developers) could catch
 up
  changes in github's version.
  There's long way to go for Pivot's version, of course, I'd like to also
  make pull requests to enhance github's version if I can.
 
  Thanks,
  Tomoko
 
  2015-02-24 23:34 GMT+09:00 Dmitry Kan solrexp...@gmail.com:
 
   Hi, Tomoko!
  
   Thanks for being a fan of luke!
  
   Current status of github's luke (https://github.com/DmitryKey/luke) is
   that
   it has releases for all the major lucene versions since 4.3.0,
 excluding
   4.4.0 (luke 4.5.0 should be able open indices of 4.4.0) and the latest
 --
   5.0.0.
  
   Porting the github's luke to ALv2 compliant framework (GWT or Pivot)
 is a
   long standing goal. With GWT I had issues related to listing and
 reading
   the index directory. So this effort has been parked. Most recently I
 have
   been approaching the Pivot. Mark Miller has done an initial port, that
 I
   took as the basis. I'm hoping to continue on this track as time
 permits.
  
  
   If you would like to work on the Pivot version, may I suggest you to
 fork
   the github's version? The ultimate goal is to donate this to Apache,
 but
  at
   least we will have the common plate. :)
  
  
   Thanks,
   Dmitry
  
   On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida 
   tomoko.uchida.1...@gmail.com
wrote:
  
Hi,
   
I'm a user / fan of Luke, so I deeply appreciate your work.
   
I've carefully read the readme, noticed the (one of) project's goal:
To port the thinlet UI to an ASL compliant license framework so that
  it
can be contributed back to Apache Lucene. Current work is done with
 GWT
2.5.1.
   
There has been GWT based, ASL compliant Luke supporting the latest
   Lucene ?
   
I've recently got in with LUCENE-2562. Currently, Apache Pivot based
  port
is going. But I do not know so much about Luke's long (and may be
   slightly
complex) history, so I would be grateful if anybody could clarify the
 association
  of
the Luke project (now on Github) and the Jira issue. Or, they can be
independent of each other.
https://issues.apache.org/jira/browse/LUCENE-2562
I don't have any opinions, just want to understand current status and
   avoid
duplicate works.
   
Apologize for a bit annoying post.
   
Many thanks,
Tomoko
   
   
   
2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com:
   
 Hello,

 Luke 4.10.3 has been released. Download it here:

 https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3

 The release has been tested against the solr-4.10.3 based index.

 Issues fixed in this release: #13
 https://github.com/DmitryKey/luke/pull/13
 Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2

 Thanks to respective contributors!


 P.S. waiting for lucene 5.0 artifacts to hit public maven
  repositories
for
 the next major release of luke.

 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info

   
  
  
  
   --
   Dmitry Kan
   Luke Toolbox: http://github.com/DmitryKey/luke
   Blog: http://dmitrykan.blogspot.com
   Twitter: http://twitter.com/dmitrykan
   SemanticAnalyzer: www.semanticanalyzer.info
  
 



 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info



Solr takes time to start

2015-02-25 Thread Nitin Solanki
Hello,
 Why is Solr taking so much time to start all of its nodes/ports?


Solr resource usage/Clustering

2015-02-25 Thread Vikas Agarwal
Hi,

We have a single Solr instance serving queries to the client throughout
the day and being indexed twice a day by scheduled jobs. These jobs, which
sync databases from the data collection machines to the master database,
can make many indexing calls. Usually about 50k-100k records are synced on
each iteration, and we send them to Solr in batches of 1000 documents.

Now, during the sync process, Solr throws 503 (service not available)
errors quite frequently, and in fact it is very slow to index the
documents. I have checked the CPU and memory usage during the sync
process, and it never consumed more than 40-50% of CPU and 10-20% of RAM.

My question is how to increase indexing performance so that the sync
process speeds up.
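
(For reference: if the batches are sent from Java, a minimal SolrJ sketch of
the batching pattern described above might look like the following. The URL,
core name and field names are placeholders, not values from this thread, and
a single explicit commit is issued at the end rather than one per batch.)

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server =
                    new HttpSolrServer("http://localhost:8983/solr/collection1");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "rec-" + i);        // placeholder fields
                doc.addField("name", "record " + i);
                batch.add(doc);
                if (batch.size() == 1000) {            // one batch of 1000 docs
                    server.add(batch);                 // no commit per batch
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();                           // single commit at the end
            server.shutdown();
        }
    }

(If the 503s coincide with commits, it may also be worth checking the
autoCommit settings rather than committing on every batch.)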

-- 
Regards,
Vikas Agarwal
91 – 9928301411

InfoObjects, Inc.
Execution Matters
http://www.infoobjects.com
2041 Mission College Boulevard, #280
Santa Clara, CA 95054
+1 (408) 988-2000 Work
+1 (408) 716-2726 Fax


Re: New leader/replica solution for HDFS

2015-02-25 Thread William Bell
Use DocValues.

On Wed, Feb 25, 2015 at 3:14 PM, Joseph Obernberger j...@lovehorsepower.com
 wrote:

 Thank you!  I'm mainly concerned about facet performance.  When we have
 indexing turned on, our facet performance suffers significantly.
 I will add replicas and measure the performance change.

 -Joe Obernberger


 On 2/25/2015 4:31 PM, Erick Erickson wrote:

 bq: Is adding replicas going to increase search performance?

 Absolutely, assuming you've maxed out Solr. You can scale the SOLR
 query/second rate nearly linearly by adding replicas regardless of
 whether it's over HDFS or not.

 Having multiple replicas per shard _also_ increases fault tolerance,
 so you get both. Even with HDFS, though, a single replica (just a
 leader) per shard means that you don't have any redundancy if the
 motherboard on that server dies even though HDFS has multiple copies
 of the _data_.

 Best,
 Erick

 On Wed, Feb 25, 2015 at 12:01 PM, Joseph Obernberger
 j...@lovehorsepower.com wrote:

 I am also confused on this.  Is adding replicas going to increase search
 performance?  I'm not sure I see the point of any replicas when using
 HDFS.
 Is there one?
 Thank you!

 -Joe


 On 2/25/2015 10:57 AM, Erick Erickson wrote:

 bq: And the data sync between leader/replica is always a problem

 Not quite sure what you mean by this. There shouldn't need to be
 any synching in the sense that the index gets replicated, the
 incoming documents should be sent to each node (and indexed
 to HDFS) as they come in.

 bq: There is duplicate index computing on Replilca side.

 Yes, that's the design of SolrCloud, explicitly to provide data safety.
 If you instead rely on the leader to index and somehow pull that
 indexed form to the replica, then you will lose data if the leader
 goes down before sending the indexed form.

 bq: My thought is that the leader and the replica all bind to the same
 data
 index directory.

 This is unsafe. They would both then try to _write_ to the same
 index, which can easily corrupt indexes and/or all but the first
 one to access the index would be locked out.

 All that said, the HDFS triple-redundancy compounded with the
 Solr leaders/replicas redundancy means a bunch of extra
 storage. You can turn the HDFS replication down to 1, but that has
 other implications.

 Best,
 Erick

 On Tue, Feb 24, 2015 at 11:12 PM, longsan longsan...@sina.com wrote:

 We used HDFS as our Solr index storage and we really have a heavy
 update
 load. We had met much problems with current leader/replica solution.
 There
 is duplicate index computing on Replilca side. And the data sync
 between
 leader/replica is always a problem.

 As HDFS already provides data replication on data layer, could Solr
 provide
 just service layer replication?

 My thought is that the leader and the replica all bind to the same data
 index directory. And the leader will build up index for new request,
 the
 replica will just keep update the index version with the leader(such
 as a
 soft commit periodically? ). If the leader lost then the replica will
 take
 the duty immediately.

 Thanks for any suggestion of this idea.







 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/New-leader-replica-
 solution-for-HDFS-tp4188735.html
 Sent from the Solr - User mailing list archive at Nabble.com.






-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Collations are not working fine.

2015-02-25 Thread Nitin Solanki
Hi Rajesh,
What configuration have you set in your schema.xml?
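
(For reference: a minimal SolrJ sketch for inspecting the collations returned
by a /spell handler such as the ones configured below. The host, core name
and query text follow the URL quoted further down and are placeholders if
your setup differs; collateExtendedResults is assumed to be enabled so that
hit counts are reported.)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.SpellCheckResponse;

    public class CollationCheck {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server =
                    new HttpSolrServer("http://localhost:8983/solr/wikingram");
            SolrQuery query = new SolrQuery("gone wthh thes wint");
            query.setRequestHandler("/spell");          // route to the spellcheck handler
            query.set("spellcheck", "true");
            query.set("spellcheck.collate", "true");
            query.set("spellcheck.collateExtendedResults", "true");
            QueryResponse rsp = server.query(query);
            SpellCheckResponse spell = rsp.getSpellCheckResponse();
            if (spell != null && spell.getCollatedResults() != null) {
                // print each collation the spellchecker managed to build
                for (SpellCheckResponse.Collation c : spell.getCollatedResults()) {
                    System.out.println(c.getNumberOfHits() + " hits: "
                            + c.getCollationQueryString());
                }
            }
            server.shutdown();
        }
    }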

On Sat, Feb 14, 2015 at 2:18 AM, Rajesh Hazari rajeshhaz...@gmail.com
wrote:

 Hi Nitin,

 Can u try with the below config, we have these config seems to be working
 for us.

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">text_general</str>

    <lst name="spellchecker">
      <str name="name">wordbreak</str>
      <str name="classname">solr.WordBreakSolrSpellChecker</str>
      <str name="field">textSpell</str>
      <str name="combineWords">true</str>
      <str name="breakWords">false</str>
      <int name="maxChanges">5</int>
    </lst>

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">textSpell</str>
      <str name="classname">solr.IndexBasedSpellChecker</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="accuracy">0.75</str>
      <float name="thresholdTokenFrequency">0.01</float>
      <str name="buildOnCommit">true</str>
      <str name="spellcheck.maxResultsForSuggest">5</str>
    </lst>

  </searchComponent>



  <str name="spellcheck">true</str>
  <str name="spellcheck.dictionary">default</str>
  <str name="spellcheck.dictionary">wordbreak</str>
  <int name="spellcheck.count">5</int>
  <str name="spellcheck.alternativeTermCount">15</str>
  <str name="spellcheck.collate">true</str>
  <str name="spellcheck.onlyMorePopular">false</str>
  <str name="spellcheck.extendedResults">true</str>
  <str name="spellcheck.maxCollations">100</str>
  <str name="spellcheck.collateParam.mm">100%</str>
  <str name="spellcheck.collateParam.q.op">AND</str>
  <str name="spellcheck.maxCollationTries">1000</str>


 *Rajesh.*

 On Fri, Feb 13, 2015 at 1:01 PM, Dyer, James james.d...@ingramcontent.com
 
 wrote:

  Nitin,
 
  Can you post the full spellcheck response when you query:
 
  q=gram_ci:gone wthh thes wint&wt=json&indent=true&shards.qt=/spell
 
  James Dyer
  Ingram Content Group
 
 
  -Original Message-
  From: Nitin Solanki [mailto:nitinml...@gmail.com]
  Sent: Friday, February 13, 2015 1:05 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Collations are not working fine.
 
  Hi James Dyer,
I did the same as you told me. Used
  WordBreakSolrSpellChecker instead of shingles. But still collations are
 not
  coming or working.
  For instance, I tried to get collation of gone with the wind by
 searching
  gone wthh thes wint on field=gram_ci but didn't succeed. Even, I am
  getting the suggestions of wtth as *with*, thes as *the*, wint as *wind*.
  Also I have documents which contains gone with the wind having 167
 times
  in the documents. I don't know that I am missing something or not.
  Please check my below solr configuration:
 
  *URL: *localhost:8983/solr/wikingram/spell?q=gram_ci:gone wthh thes
  wint&wt=json&indent=true&shards.qt=/spell
 
  *solrconfig.xml:*
 
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpellCi</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">gram_ci</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="distanceMeasure">internal</str>
      <float name="accuracy">0.5</float>
      <int name="maxEdits">2</int>
      <int name="minPrefix">0</int>
      <int name="maxInspections">5</int>
      <int name="minQueryLength">2</int>
      <float name="maxQueryFrequency">0.9</float>
      <str name="comparatorClass">freq</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">wordbreak</str>
      <str name="classname">solr.WordBreakSolrSpellChecker</str>
      <str name="field">gram</str>
      <str name="combineWords">true</str>
      <str name="breakWords">true</str>
      <int name="maxChanges">5</int>
    </lst>
  </searchComponent>
 
  <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="df">gram_ci</str>
      <str name="spellcheck.dictionary">default</str>
      <str name="spellcheck">on</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">25</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.maxResultsForSuggest">1</str>
      <str name="spellcheck.alternativeTermCount">25</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.maxCollations">50</str>
      <str name="spellcheck.maxCollationTries">50</str>
      <str name="spellcheck.collateExtendedResults">true</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>
 
  *Schema.xml: *
 
  <field name="gram_ci" type="textSpellCi" indexed="true" stored="true"
         multiValued="false"/>

  <fieldType name="textSpellCi" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
 



Re: Solr resource usage/Clustering

2015-02-25 Thread Erick Erickson
How are you indexing? SolrJ? DIH? some other process?

And what, if anything, comes out in the Solr logs when this happens?

'cause this is pretty odd so I'm grasping at straws.

Best,
Erick
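
(For reference: if the answer turns out to be SolrJ, a
ConcurrentUpdateSolrServer-based sketch like the one below is a common way to
queue batched updates from a sync job. The URL, queue size and thread count
are placeholders, not values taken from this thread.)

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class QueuedIndexer {
        public static void main(String[] args) throws Exception {
            // buffers up to 10000 docs and sends them with 4 background threads
            ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                    "http://localhost:8983/solr/collection1", 10000, 4);
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "rec-" + i);     // placeholder field
                server.add(doc);                    // queued and sent asynchronously
            }
            server.blockUntilFinished();            // wait for the queue to drain
            server.commit();
            server.shutdown();
        }
    }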

On Wed, Feb 25, 2015 at 9:10 PM, Vikas Agarwal vi...@infoobjects.com wrote:
 Hi,

 We have a single solr instance serving queries to the client through out
 the day and being indexed twice a day using scheduled jobs. During the
 scheduled jobs, which actually syncs databases from data collection
 machines to the master database, it can make many indexing calls. It is
 usually about 50k-100k records that are synced on each iteration of sync
 and we make calls to solr in batch of 1000 documents.

 Now, during the sync process, solr throws 503 (service not available
 message) quite frequently and in fact it responds very slow to index the
 documents. I have checked the cpu and memory usage during the sync process
 and it never consumed more than 40-50 % of CPU and 10-20% of RAM.

 My question is how to increase the performance of indexing to increase the
 speed up the sync process.

 --
 Regards,
 Vikas Agarwal
 91 – 9928301411

 InfoObjects, Inc.
 Execution Matters
 http://www.infoobjects.com
 2041 Mission College Boulevard, #280
 Santa Clara, CA 95054
 +1 (408) 988-2000 Work
 +1 (408) 716-2726 Fax


Facet on TopDocs

2015-02-25 Thread kakes
We are trying to compute facets over only the top 100 docs rather than the
complete result set.

Is there a way of accessing the topDocs in a custom faceting component?
Or can the scores of the docIDs in the result set be accessed in the facet
component?
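
(For reference: a rough sketch of how a custom SearchComponent might read the
scored top documents out of the ResponseBuilder. It assumes the component is
registered after the query component, that at least 100 rows and scores were
requested, and it is an illustration of the access pattern rather than a
complete faceting solution.)

    import java.io.IOException;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;
    import org.apache.solr.search.DocIterator;
    import org.apache.solr.search.DocList;

    public class TopDocsFacetComponent extends SearchComponent {

        @Override
        public void prepare(ResponseBuilder rb) throws IOException {
            // nothing to prepare in this sketch
        }

        @Override
        public void process(ResponseBuilder rb) throws IOException {
            DocList docs = rb.getResults().docList;  // the current page of results
            DocIterator it = docs.iterator();
            int limit = Math.min(100, docs.size());
            for (int i = 0; i < limit && it.hasNext(); i++) {
                int docId = it.nextDoc();            // internal Lucene doc id
                float score = it.score();            // valid only if scores were collected
                // ... look up field values for docId and count them into facet buckets ...
            }
        }

        @Override
        public String getDescription() {
            return "Facets over the top-scoring documents only (sketch)";
        }

        @Override
        public String getSource() {
            return null;
        }
    }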




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Facet-on-TopDocs-tp4188767.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: [ANNOUNCE] Luke 4.10.3 released

2015-02-25 Thread Dmitry Kan
Ok, sure. The plan is to make the pivot branch in the current github repo
and update its structure accordingly.
Once it is there, I'll let you know.

Thank you,
Dmitry

On Tue, Feb 24, 2015 at 5:26 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com
 wrote:

 Hi Dmitry,

 Thank you for the detailed clarification!

 Recently, I've created a few patches to Pivot version(LUCENE-2562), so I'd
 like to some more work and keep up to date it.

  If you would like to work on the Pivot version, may I suggest you to fork
  the github's version? The ultimate goal is to donate this to Apache, but
 at
  least we will have the common plate. :)

 Yes, I love to the idea about having common code base.
 I've looked at both codes of github's (thinlet's) and Pivot's, Pivot's
 version has very different structure from github's (I think that is mainly
 for UI framework's requirement.)
 So it seems to be difficult to directly fork github's version to develop
 Pivot's version..., but I think I (or any other developers) could catch up
 changes in github's version.
 There's long way to go for Pivot's version, of course, I'd like to also
 make pull requests to enhance github's version if I can.

 Thanks,
 Tomoko

 2015-02-24 23:34 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

  Hi, Tomoko!
 
  Thanks for being a fan of luke!
 
  Current status of github's luke (https://github.com/DmitryKey/luke) is
  that
  it has releases for all the major lucene versions since 4.3.0, excluding
  4.4.0 (luke 4.5.0 should be able open indices of 4.4.0) and the latest --
  5.0.0.
 
  Porting the github's luke to ALv2 compliant framework (GWT or Pivot) is a
  long standing goal. With GWT I had issues related to listing and reading
  the index directory. So this effort has been parked. Most recently I have
  been approaching the Pivot. Mark Miller has done an initial port, that I
  took as the basis. I'm hoping to continue on this track as time permits.
 
 
  If you would like to work on the Pivot version, may I suggest you to fork
  the github's version? The ultimate goal is to donate this to Apache, but
 at
  least we will have the common plate. :)
 
 
  Thanks,
  Dmitry
 
  On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida 
  tomoko.uchida.1...@gmail.com
   wrote:
 
   Hi,
  
   I'm an user / fan of Luke, so deeply appreciate your work.
  
   I've carefully read the readme, noticed the (one of) project's goal:
   To port the thinlet UI to an ASL compliant license framework so that
 it
   can be contributed back to Apache Lucene. Current work is done with GWT
   2.5.1.
  
   There has been GWT based, ASL compliant Luke supporting the latest
  Lucene ?
  
   I've recently got in with LUCENE-2562. Currently, Apache Pivot based
 port
   is going. But I do not know so much about Luke's long (and may be
  slightly
   complex) history, so I would grateful if anybody clear the association
 of
   the Luke project (now on Github) and the Jira issue. Or, they can be
   independent of each other.
   https://issues.apache.org/jira/browse/LUCENE-2562
   I don't have any opinions, just want to understand current status and
  avoid
   duplicate works.
  
   Apologize for a bit annoying post.
  
   Many thanks,
   Tomoko
  
  
  
   2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com:
  
Hello,
   
Luke 4.10.3 has been released. Download it here:
   
https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3
   
The release has been tested against the solr-4.10.3 based index.
   
Issues fixed in this release: #13
https://github.com/DmitryKey/luke/pull/13
Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2
   
Thanks to respective contributors!
   
   
P.S. waiting for lucene 5.0 artifacts to hit public maven
 repositories
   for
the next major release of luke.
   
--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
   
  
 
 
 
  --
  Dmitry Kan
  Luke Toolbox: http://github.com/DmitryKey/luke
  Blog: http://dmitrykan.blogspot.com
  Twitter: http://twitter.com/dmitrykan
  SemanticAnalyzer: www.semanticanalyzer.info
 




-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Creating a collection/core on HDFS with SolrCloud

2015-02-25 Thread Simon Minery

Hello,

I'm trying to create a collection on HDFS with Solr 5.0.0.
I have my solrconfig.xml with the HDFS parameters, following the 
confluence guidelines.



When creating it with the bin/solr script (bin/solr create -c 
collectionHDFS -d /my/conf/), I get this error:



failure:{:org.apache.solr.client.solrj.SolrServerException:IOException 
occured when talking to server at: https://192.168.200.32:8983/solr}}



With the GUI on the SolrCloud server, I have this one:

Error CREATEing SolrCore 'collectionHDFS': Unable to create core 
[collectionHDFS] Caused by: hadoop.security.authentication set to: 
simple, not kerberos, but attempting to connect to HDFS via kerberos


In my /my/conf/solrconfig.xml, I have already double-checked that:

  <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
  <str name="solr.hdfs.security.kerberos.keytabfile">/my/conf/solr.keytab</str>
  <str name="solr.hdfs.security.kerberos.principal">solr/@CLUSTER.HADOOP</str>


and in Hadoop's core-site.xml, my hadoop.security.authentication 
parameter is set to Kerberos.

Am I missing something?
Thank you very much for your input, have a great day.
Simon M.


Re: highlighting the boolean query

2015-02-25 Thread Dmitry Kan
Erick, Eric and Mike,

Thanks for your help and ideas.

It sounds like we'd need to do a bit of revamping in the highlighter.
Perhaps PostingsHighlighter should even be taken as the baseline, since it
is faster. It uses the same extractTerms() method that Erik has shown.
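
(For reference: a minimal Lucene 4.x sketch of calling PostingsHighlighter
directly. It assumes the Contents field is stored and indexed with offsets in
the postings (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); the
index path is a placeholder.)

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
    import org.apache.lucene.store.FSDirectory;

    public class PostingsHighlightDemo {
        public static void main(String[] args) throws Exception {
            IndexReader reader =
                    DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            // a OR (b AND c) OR d, mirroring the query discussed in this thread
            BooleanQuery inner = new BooleanQuery();
            inner.add(new TermQuery(new Term("Contents", "b")), BooleanClause.Occur.MUST);
            inner.add(new TermQuery(new Term("Contents", "c")), BooleanClause.Occur.MUST);
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("Contents", "a")), BooleanClause.Occur.SHOULD);
            query.add(inner, BooleanClause.Occur.SHOULD);
            query.add(new TermQuery(new Term("Contents", "d")), BooleanClause.Occur.SHOULD);

            TopDocs topDocs = searcher.search(query, 10);
            PostingsHighlighter highlighter = new PostingsHighlighter();
            // one snippet (or null) per hit, in TopDocs order
            String[] snippets = highlighter.highlight("Contents", query, searcher, topDocs);
            for (String snippet : snippets) {
                System.out.println(snippet);
            }
            reader.close();
        }
    }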

The user story here is that, judging from the highlights, the user is led
to believe that the boolean query did not work correctly. The issue is
minor otherwise, since the search *does* work as expected.

Dmitry

On Tue, Feb 24, 2015 at 8:19 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 There is also PostingsHighlighter -- I recommend it, if only for the
 performance improvement, which is substantial, but I'm not completely sure
 how it handles this issue.  The one drawback I *am* aware of is that it is
 insensitive to positions (so words from phrases get highlighted even in
 isolation)

 -Mike



 On 02/24/2015 12:46 PM, Erik Hatcher wrote:

 BooleanQuery’s extractTerms looks like this:

 public void extractTerms(Set<Term> terms) {
for (BooleanClause clause : clauses) {
  if (clause.isProhibited() == false) {
clause.getQuery().extractTerms(terms);
  }
}
 }
 that’s generally the method called by the Highlighter for what terms
 should be highlighted.  So even if a term didn’t match the document, the
 query that the term was in matched the document and it just blindly
 highlights all the terms (minus prohibited ones).   That at least explains
 the behavior you’re seeing, but it’s not ideal.  I’ve seen specialized
 highlighters that convert to spans, which are accurate to the exact matches
 within the document.  Been a while since I dug into the HighlightComponent,
 so maybe there’s some other options available out of the box?

 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com http://www.lucidworks.com/




  On Feb 24, 2015, at 3:16 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Erick,

 Our default operator is AND.

 Both queries below parse the same:

 a OR (b c) OR d
 a OR (b AND c) OR d

 The parsed query:

  <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
  Contents:d</str>

 So this part is consistent with our expectation.


  I'm a bit puzzled by your statement that c didn't contribute to the

 score.
 what I meant was that the term c was not hit by the scorerer: the explain
 section does not refer to it. I'm using the made up terms here, but the
 concept holds.

 The code suggests that we could benefit from storing term offsets and
 positions:
 http://grepcode.com/file/repo1.maven.org/maven2/org.
 apache.solr/solr-core/4.3.1/org/apache/solr/highlight/
 DefaultSolrHighlighter.java#470

 Is it correct assumption?

 On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com
 
 wrote:

  Highlighting is such a pain...

 what does the parsed query look like? If the default operator is OR,
 then this seems correct as both 'd' and 'c' appear in the doc. So
 I'm a bit puzzled by your statement that c didn't contribute to the
 score.

 If the parsed query is, indeed
 a +b +c d

 then it does look like something with the highlighter. Whether other
 highlighters are better for this case.. no clue ;(

 Best,
 Erick

 On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com
 wrote:

 Erick,

 nope, we are using std lucene qparser with some customizations, that do

 not

 affect the boolean query parsing logic.

 Should we try some other highlighter?

 On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson 
 erickerick...@gmail.com

 wrote:

  Are you using edismax?

 On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com

 wrote:

 Hello!

 In solr 4.3.1 there seem to be some inconsistency with the

 highlighting

 of

 the boolean query:

 a OR (b c) OR d

 This returns a proper hit, which shows that only d was included into

 the

 document score calculation.

 But the highlighter returns both d and c in em tags.

 Is this a known issue of the standard highlighter? Can it be

 mitigated?


 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info



 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info



 --
 Dmitry Kan
 Luke Toolbox: http://github.com/DmitryKey/luke
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info






-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Problem with queries that includes NOT

2015-02-25 Thread david . davila
Hello,

We have problems with some queries. All of them include the NOT operator, 
and in my opinion the results don't make any sense.

First problem:

This query, NOT Proc:ID01, returns 95806 results; however, this one, 
NOT Proc:ID01 OR FileType:PDF_TEXT, returns 11484 results. But it seems 
impossible that adding an OR clause could reduce the number of results.

Second problem. Here the problem is caused by the brackets and the NOT 
operator:

 This query:

(NOT Proc:ID01 AND NOT FileType:PDF_TEXT) AND sys_FileType:PROTOTIPE 
returns 0 documents.

But this query:

(NOT Proc:ID01 AND NOT FileType:PDF_TEXT AND sys_FileType:PROTOTIPE) 
returns 53 documents, which is correct. So, the problem is the position of 
the brackets. I have checked the same query without NOTs, and it works fine, 
returning the same number of results in both cases. So, I think the 
problem is the combination of the bracket positions and the NOT operator.

This second problem is less important, but the queries come from a web 
page that I would have to change, so I need to know whether the problem is 
in Solr or not.



This is the part of the schema that applies:

<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>



Thank you very much,




David Dávila 

DIT - 915828763