RE: Backup a solr cloud collection - timeout in 180s?

2018-04-10 Thread Petersen, Robert (Contr)
Erick:

Good to know!

Thx
Robi


-Original Message-
From: Erick Erickson <erickerick...@gmail.com> 
Sent: Tuesday, April 10, 2018 12:42 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: Backup a solr cloud collection - timeout in 180s?

Robi:

Yeah, the ref guide has lots and lots and lots of info, but at 1,100 pages and growing, things can be "interesting" to find.

Do be aware of one thing: the async ID should be unique, and before 7.3 there was a bug where using the same ID twice (without waiting for completion and deleting it first) led to bewildering results.
See: https://issues.apache.org/jira/browse/SOLR-11739.

The operations would succeed, but you might not be getting the status of the 
task you think you are.
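
One easy way to sidestep the ID-reuse problem (just a sketch, not something from the ref guide) is to never hard-code the async ID and instead generate a fresh one per run, e.g. in Java:

    // a fresh async ID per backup invocation, so REQUESTSTATUS always refers
    // to the task you just submitted (the prefix here is arbitrary)
    String asyncId = "addrsearch-backup-" + java.util.UUID.randomUUID();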


Best,
Erick

On Tue, Apr 10, 2018 at 9:25 AM, Petersen, Robert (Contr) 
<robert.peters...@ftr.com> wrote:
> HI Erick,
>
>
> I *just* found that parameter in the guide... it was waaay down at the bottom 
> of the page (in proverbial small print)!
>
>
> So for other readers the steps are this:
>
> # start the backup async enabled
>
> /admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups&async=1234
>
>
> # check on the status of the async job
>
> /admin/collections?action=REQUESTSTATUS&requestid=1234
>
>
> # clear out the status when done
>
> /admin/collections?action=DELETESTATUS&requestid=1234
>
>
> Thx
>
> Robi
>
> 
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Tuesday, April 10, 2018 8:24:20 AM
> To: solr-user
> Subject: Re: Backup a solr cloud collection - timeout in 180s?
>
> 
> WARNING: External email. Please verify sender before opening attachments or 
> clicking on links.
> 
>
>
>
> Specify the "async" property, see:
> https://lucene.apache.org/solr/guide/6_6/collections-api.html
>
> There's also a way to check the status of the backup running in the 
> background.
>
> Best,
> Erick
>
> On Mon, Apr 9, 2018 at 11:05 AM, Petersen, Robert (Contr) 
> <robert.peters...@ftr.com> wrote:
>> Shouldn't this just create the backup file(s) asynchronously? Can the 
>> timeout be adjusted?
>>
>>
>> Solr 7.2.1 with five nodes and the addrsearch collection is five 
>> shards x five replicas and "numFound":38837970 docs
>>
>>
>> Thx
>>
>> Robi
>>
>>
>> http://myServer.corp.pvt:8983/solr/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups
>>
>>
>> responseHeader: {
>>     status: 500,
>>     QTime: 180211
>> },
>> error: {
>>     metadata: [
>>         "error-class",
>>         "org.apache.solr.common.SolrException",
>>         "root-error-class",
>>         "org.apache.solr.common.SolrException"
>>     ],
>>     msg: "backup the collection time out:180s",
>>     ...
>>
>>
>> From the logs:
>>
>>
>> 2018-04-09 17:47:32.667 INFO  (qtp64830413-22) [   ] o.a.s.s.HttpSolrCall 
>> [admin] webapp=null path=/admin/collections 
>> params={name=addrsearchBackup&action=BACKUP&location=/apps/logs/backups&collection=addrsearch}
>>  status=500 QTime=180211
>> 2018-04-09 17:47:32.667 ERROR (qtp64830413-22) [   ] o.a.s.s.HttpSolrCall 
>> null:org.apache.solr.common.SolrException: backup the collection time 
>> out:180s
>> at 
>> org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:314)
>> at 
>> org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)
>> at 
>> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)
>> at 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
>> at 
>> org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)
>> at 
>> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)
>> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)
>> at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
>> at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
>>
>>
>>
>> 
>>
>> This communication is confidential. Frontier only sends and receives email 
>> on the basis of the terms set out at 
>> http://www.frontier.com/email_disclaimer.


Re: Backup a solr cloud collection - timeout in 180s?

2018-04-10 Thread Petersen, Robert (Contr)
HI Erick,


I *just* found that parameter in the guide... it was waaay down at the bottom 
of the page (in proverbial small print)!


So for other readers the steps are this:

# start the backup async enabled

/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups&async=1234


# check on the status of the async job

/admin/collections?action=REQUESTSTATUS&requestid=1234


# clear out the status when done

/admin/collections?action=DELETESTATUS&requestid=1234
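
For anyone scripting this rather than pasting URLs, here is a rough SolrJ equivalent of the same three calls. Treat it as a sketch only (untested): the host, backup name, location and async id are just the values from this thread, and it drives the Collections API through GenericSolrRequest so it isn't tied to the typed helper classes, whose signatures have shifted between SolrJ versions.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.util.NamedList;

    public class AsyncBackup {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://myServer.corp.pvt:8983/solr").build()) {

                // 1) start the backup with async enabled
                ModifiableSolrParams backup = new ModifiableSolrParams();
                backup.set("action", "BACKUP");
                backup.set("name", "addrsearchBackup");
                backup.set("collection", "addrsearch");
                backup.set("location", "/apps/logs/backups");
                backup.set("async", "1234");   // should be unique per request (see SOLR-11739)
                client.request(new GenericSolrRequest(
                        SolrRequest.METHOD.GET, "/admin/collections", backup));

                // 2) check on the status of the async job
                ModifiableSolrParams status = new ModifiableSolrParams();
                status.set("action", "REQUESTSTATUS");
                status.set("requestid", "1234");
                NamedList<Object> rsp = client.request(new GenericSolrRequest(
                        SolrRequest.METHOD.GET, "/admin/collections", status));
                System.out.println(rsp);       // poll until the reported state is completed

                // 3) clear out the stored status when done
                ModifiableSolrParams delete = new ModifiableSolrParams();
                delete.set("action", "DELETESTATUS");
                delete.set("requestid", "1234");
                client.request(new GenericSolrRequest(
                        SolrRequest.METHOD.GET, "/admin/collections", delete));
            }
        }
    }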


Thx

Robi


From: Erick Erickson <erickerick...@gmail.com>
Sent: Tuesday, April 10, 2018 8:24:20 AM
To: solr-user
Subject: Re: Backup a solr cloud collection - timeout in 180s?


WARNING: External email. Please verify sender before opening attachments or 
clicking on links.




Specify the "async" property, see:
https://lucene.apache.org/solr/guide/6_6/collections-api.html

There's also a way to check the status of the backup running in the background.

Best,
Erick

On Mon, Apr 9, 2018 at 11:05 AM, Petersen, Robert (Contr)
<robert.peters...@ftr.com> wrote:
> Shouldn't this just create the backup file(s) asynchronously? Can the timeout 
> be adjusted?
>
>
> Solr 7.2.1 with five nodes and the addrsearch collection is five shards x 
> five replicas and "numFound":38837970 docs
>
>
> Thx
>
> Robi
>
>
> http://myServer.corp.pvt:8983/solr/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups
>
>
> responseHeader: {
>     status: 500,
>     QTime: 180211
> },
> error: {
>     metadata: [
>         "error-class",
>         "org.apache.solr.common.SolrException",
>         "root-error-class",
>         "org.apache.solr.common.SolrException"
>     ],
>     msg: "backup the collection time out:180s",
>     ...
>
>
> From the logs:
>
>
> 2018-04-09 17:47:32.667 INFO  (qtp64830413-22) [   ] o.a.s.s.HttpSolrCall 
> [admin] webapp=null path=/admin/collections 
> params={name=addrsearchBackup&action=BACKUP&location=/apps/logs/backups&collection=addrsearch}
>  status=500 QTime=180211
> 2018-04-09 17:47:32.667 ERROR (qtp64830413-22) [   ] o.a.s.s.HttpSolrCall 
> null:org.apache.solr.common.SolrException: backup the collection time out:180s
> at 
> org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:314)
> at 
> org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)
> at 
> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
> at 
> org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)
> at 
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
>
>
>
> 
>
> This communication is confidential. Frontier only sends and receives email on 
> the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Backup a solr cloud collection - timeout in 180s?

2018-04-09 Thread Petersen, Robert (Contr)
Shouldn't this just create the backup file(s) asynchronously? Can the timeout 
be adjusted?


Solr 7.2.1 with five nodes and the addrsearch collection is five shards x five 
replicas and "numFound":38837970 docs


Thx

Robi


http://myServer.corp.pvt:8983/solr/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups


responseHeader: {
    status: 500,
    QTime: 180211
},
error: {
    metadata: [
        "error-class",
        "org.apache.solr.common.SolrException",
        "root-error-class",
        "org.apache.solr.common.SolrException"
    ],
    msg: "backup the collection time out:180s",
    ...


From the logs:


2018-04-09 17:47:32.667 INFO  (qtp64830413-22) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/collections 
params={name=addrsearchBackup&action=BACKUP&location=/apps/logs/backups&collection=addrsearch}
 status=500 QTime=180211
2018-04-09 17:47:32.667 ERROR (qtp64830413-22) [   ] o.a.s.s.HttpSolrCall 
null:org.apache.solr.common.SolrException: backup the collection time out:180s
at 
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:314)
at 
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)
at 
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
at 
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)
at 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)





This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


CDCR - cross data center replication

2018-01-25 Thread Petersen, Robert (Contr)
Hi all,


So for an initial CDCR setup, the documentation says the bulk load should be performed first, otherwise CDCR won't keep up. By "bulk load", does that include an ETL process doing rapid atomic updates one doc at a time (with multiple threads), on the order of 4K docs per minute, assuming bandwidth between DCs is actually good?


Also, as a follow-up question: the documentation says to do the bulk load first and sync the data centers, then turn on CDCR. What is recommended for the initial sync? A Solr backup and restore?


Thanks

Robi


CDCR is unlikely to be satisfactory for bulk-load situations where the update 
rate is high, especially if the bandwidth between the Source and Target 
clusters is restricted. In this scenario, the initial bulk load should be 
performed, the Source and Target data centers synchronized and CDCR be utilized 
for incremental updates.



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: solr 5.4.1 leader issue

2018-01-08 Thread Petersen, Robert (Contr)
OK, just restarting all the Solr nodes did fix it. Since they are in production I was hesitant to do that.


From: Petersen, Robert (Contr) <robert.peters...@ftr.com>
Sent: Monday, January 8, 2018 12:34:28 PM
To: solr-user@lucene.apache.org
Subject: solr 5.4.1 leader issue

Hi, two out of my three servers think they are replicas on one shard and are getting exceptions; I'm wondering what the easiest way to fix this is. Can I just restart ZooKeeper across the servers? Here are the exceptions:


TY

Robi


ERROR
null
RecoveryStrategy
Error while trying to recover. 
core=custsearch_shard3_replica1:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://x.x.x.x:8983/solr: We are not the leader
Error while trying to recover. 
core=custsearch_shard3_replica1:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://x.x.x.x:8983/solr: We are not the leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:607)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:364)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:226)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://10.209.55.10:8983/solr: We are not the leader
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:285)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:281)
... 5 more
(and on the one everyone thinks is the leader)
Error while trying to recover. 
core=custsearch_shard3_replica3:org.apache.solr.common.SolrException: Cloud 
state still says we are leader.
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:332)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:226)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: solr 5.4.1 leader issue

2018-01-08 Thread Petersen, Robert (Contr)
Perhaps I didn't explain well: three nodes are live. Two are in recovering mode, the exception being that they can't reach the leader because the leader replies that it is not the leader. The dashboard shows that node as the leader, but it thinks it isn't. The exceptions are below... Do I have to restart the Solr instances, the ZooKeeper instances, both, or is there a better way that avoids restarting everything?
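
In case it helps anyone else hitting this: one way to see what ZooKeeper has actually recorded (rather than trusting the dashboard) is to pull CLUSTERSTATUS from the Collections API and look at the per-replica "leader" flag. A rough sketch only, written against a newer (6.x/7.x-era) SolrJ than the 5.4.1 in this thread, so the client classes may differ; the host is a placeholder and the collection name is the one from the exceptions below:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class ShowClusterStatus {
        public static void main(String[] args) throws Exception {
            // Ask any live node which core ZooKeeper currently records as the
            // leader for each shard of "custsearch".
            ModifiableSolrParams p = new ModifiableSolrParams();
            p.set("action", "CLUSTERSTATUS");
            p.set("collection", "custsearch");
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://x.x.x.x:8983/solr").build()) {
                System.out.println(client.request(new GenericSolrRequest(
                        SolrRequest.METHOD.GET, "/admin/collections", p)));
            }
        }
    }

The equivalent browser check is just /solr/admin/collections?action=CLUSTERSTATUS&collection=custsearch on any node.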


Thx

Robi


From: Petersen, Robert (Contr) <robert.peters...@ftr.com>
Sent: Monday, January 8, 2018 12:34:28 PM
To: solr-user@lucene.apache.org
Subject: solr 5.4.1 leader issue

Hi, two out of my three servers think they are replicas on one shard and are getting exceptions; I'm wondering what the easiest way to fix this is. Can I just restart ZooKeeper across the servers? Here are the exceptions:


TY

Robi


ERROR
null
RecoveryStrategy
Error while trying to recover. 
core=custsearch_shard3_replica1:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://x.x.x.x:8983/solr: We are not the leader
Error while trying to recover. 
core=custsearch_shard3_replica1:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://x.x.x.x:8983/solr: We are not the leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:607)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:364)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:226)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://10.209.55.10:8983/solr: We are not the leader
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:285)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:281)
... 5 more
(and on the one everyone thinks is the leader)
Error while trying to recover. 
core=custsearch_shard3_replica3:org.apache.solr.common.SolrException: Cloud 
state still says we are leader.
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:332)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:226)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: solr 5.4.1 leader issue

2018-01-08 Thread Petersen, Robert (Contr)
I'm on zookeeper 3.4.8


From: Petersen, Robert (Contr) <robert.peters...@ftr.com>
Sent: Monday, January 8, 2018 12:34:28 PM
To: solr-user@lucene.apache.org
Subject: solr 5.4.1 leader issue

Hi, two out of my three servers think they are replicas on one shard and are getting exceptions; I'm wondering what the easiest way to fix this is. Can I just restart ZooKeeper across the servers? Here are the exceptions:


TY

Robi


ERROR
null
RecoveryStrategy
Error while trying to recover. 
core=custsearch_shard3_replica1:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://x.x.x.x:8983/solr: We are not the leader
Error while trying to recover. 
core=custsearch_shard3_replica1:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://x.x.x.x:8983/solr: We are not the leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:607)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:364)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:226)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://10.209.55.10:8983/solr: We are not the leader
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:285)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:281)
... 5 more
(and on the one everyone thinks is the leader)
Error while trying to recover. 
core=custsearch_shard3_replica3:org.apache.solr.common.SolrException: Cloud 
state still says we are leader.
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:332)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:226)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


solr 5.4.1 leader issue

2018-01-08 Thread Petersen, Robert (Contr)
Hi, two out of my three servers think they are replicas on one shard and are getting exceptions; I'm wondering what the easiest way to fix this is. Can I just restart ZooKeeper across the servers? Here are the exceptions:


TY

Robi


ERROR
null
RecoveryStrategy
Error while trying to recover. 
core=custsearch_shard3_replica1:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://x.x.x.x:8983/solr: We are not the leader
Error while trying to recover. 
core=custsearch_shard3_replica1:java.util.concurrent.ExecutionException: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://x.x.x.x:8983/solr: We are not the leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:607)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:364)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:226)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://10.209.55.10:8983/solr: We are not the leader
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:285)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:281)
... 5 more
(and on the one everyone thinks is the leader)
Error while trying to recover. 
core=custsearch_shard3_replica3:org.apache.solr.common.SolrException: Cloud 
state still says we are leader.
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:332)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:226)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Any Insights SOLR Rank tuning tool

2017-12-14 Thread Petersen, Robert (Contr)
I remember when FAST (when it was still FAST) came to our enterprise to pitch their search when we were looking to replace our AltaVista search engine with *something*, and they demonstrated that relevance tool for the business side. While that thing was awesome, I've never seen anything close to it in the Solr world, where I ended up going instead of the soon-to-be-doomed FAST search. Also, that tool was totally manual and of limited use in a very large corpus/catalog. Sort of like just applying a band-aid to a larger problem.


Splainer will only detail the reasons things show up in one query; it won't solve a bigger relevancy problem. On the other hand, there are several ways to skin this cat. There are solutions which analyze logs for outlying cases and feed the results back into Solr to automatically improve relevancy. I don't think many of these are open source, and some are quite proprietary.


If your company could afford to assign a bizdev guy to tweaking individual searches, I'm sure they could instead get some junior devs to go over query logs, inspecting outlying cases like zero results/too many results, then look at whether it is a data issue or a query issue and recommend changes in the appropriate domain.


Thanks

Robi


From: Charlie Hull 
Sent: Thursday, December 14, 2017 1:24:42 AM
To: solr-user@lucene.apache.org
Subject: Re: Any Insights SOLR Rank tuning tool

On 13/12/2017 20:18, Sharma, Abhinav wrote:
> Hello Folks,
>
> Currently, we are running FAST ESP as a Search System & are looking to 
> migrate from FAST ESP to SOLR.
> I was just wondering if you Guys have any built-in Relevancy tool for the 
> Business Folks like what we have in FAST called SBC (Search Business Center)?
>
> Thanks, Abhi
>
I'd second Quepid as we've used it for several projects where migration
is an issue (disclaimer: we're partners with OSC and resell Quepid).

Migration is a tricky thing to get right: the business side want the new
engine to behave like the old one, but don't understand the technical
issues when you're putting in a totally different core engine; technical
folks don't necessarily understand the business drivers behind making
the transition as painless as possible for users. Developing tests (and
being able to compare both sets of search results) is essential.
Remember that you might even have to replicate some 'wrong' behaviour of
the old engine as people are used to it!

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: SOLR Rest API for monitoring

2017-12-14 Thread Petersen, Robert (Contr)
You are using Cloudera? Sounds like a question for them...


From: Abhi Basu <9000r...@gmail.com>
Sent: Thursday, December 14, 2017 1:27:23 PM
To: solr-user@lucene.apache.org
Subject: SOLR Rest API for monitoring

Hi All:

I am using CDH 5.13 with Solr 4.10. Trying to automate metrics gathering
for JVM (CPU, RAM, Storage etc.) by calling the REST APIs described here ->
https://lucene.apache.org/solr/guide/6_6/metrics-reporting.html.

Are these not supported in my version of Solr? If not, what option do I
have?

I tried calling this:
http://hadoop-nn2.esolocal.com:8983/solr/admin/metrics?wt=json&type=counter&group=core

And receive 404 - request not available.

Are there any configuration changes needed?

Thanks, Abhi

--
Abhi Basu



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Solr upgrade from 4.x to 7.1

2017-12-14 Thread Petersen, Robert (Contr)
From what I have read, you can only upgrade to the next major version number without using a tool to convert the indexes to the newer version. But that is still perilous due to deprecations etc.


So I think the best advice out there is to spin up a new farm on 7.1 (especially coming from 4.x), make a new collection there, reindex everything into it, and then switch over to the new farm. I would also ask: are you thinking of going master/slave on 7.1? Wouldn't you want to go with SolrCloud?


I started with master/slave, and yes, it is simpler, but there is that one single point of failure (the master) for indexing. That is of course easily overcome manually by repurposing a slave as the new master and repointing the remaining slaves at it; however, it is a completely manual process that you avoid in cloud mode.


I think you'd need to think this through more fully with the new possibilities 
available and how you'd want to migrate given your existing environment is so 
far behind.


Thanks

Robi


From: Drooy Drooy 
Sent: Thursday, December 14, 2017 1:27:53 PM
To: solr-user@lucene.apache.org
Subject: Solr upgrade from 4.x to 7.1

Hi All,

We have an in-house project running in Solr 4.7 with Master/Slave mode for
a few years, what is it going to take to upgrade it to SolrCloud with
TLOG/PULL replica mode ?

I read the upgrade guides, none of them talking about the jump from 4.x to
7.

Thanks much



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Can someone help? Two level nested doc... ChildDocTransformerFactory syntax...

2017-11-07 Thread Petersen, Robert (Contr)
OK, although this was talked about as possibly coming in Solr 6.x, I guess it was hearsay. From what I can tell after rereading everything I can find on the subject, as of now the child docs are only retrievable as a one-level hierarchy when using the ChildDocTransformerFactory.




From: Petersen, Robert (Contr) <robert.peters...@ftr.com>
Sent: Monday, November 6, 2017 5:05:31 PM
To: solr-user@lucene.apache.org
Subject: Can someone help? Two level nested doc... ChildDocTransformerFactory syntax...

OK, no faceting, no filtering, I just want the hierarchy to come back in the results. Can't quite get it... googled all over the place too.


Doc:

{ id : asdf, type_s:customer, firstName_s:Manny, lastName_s:Acevedo, 
address_s:"123 Fourth Street", city_s:Gotham, tn_s:1234561234,
  _childDocuments_:[
  { id : adsf_c1,
src_s : "CRM.Customer",
type_s:customerSource,
_childDocuments_:[
{
id : asdf_c1_c1,
type_s:customerSourceType,
"key_s": "id",
"value_s": "GUID"
}
]
},
  { id : adsf_c2,
"src_s": "DPI.SalesOrder",
type_s:customerSource,
_childDocuments_:[
{
id : asdf_c2_c1,
type_s:customerSourceType,
"key_s": "btn",
"value_s": "4052328908"
},
{
id : asdf_c2_c2,
type_s:customerSourceType,
"key_s": "seq",
"value_s": "5"
   },
{
id : asdf_c2_c3,
type_s:customerSourceType,
"key_s": "env",
"value_s": "MS"
}
]
}
]
}


Queries:

Gives all nested docs regardless of level as a flat set:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer]

Gives all nested child docs only:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource]

How to get nested grandchild docs at the correct level?
Nope, exception:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customerSource%20childFilter=type_s:customerSourceType]

Nope, exception:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customerSource]

Nope, but no exception; only gets children again though, like above:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customer*]

Nope, but no exception; only gets children again:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customer*%20childFilter=type_s:customerSourceType]

Nope, same again... no grandchildren:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,p:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],q:[child%20parentFilter=-type_s:customer%20parentFilter=type_s:customerSource%20childFilter=type_s:customerSourceType]

Gives all but flat, no child-to-grandchild hierarchy:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,p:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],q:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSourceType]


Thanks in advance,

Robi



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Can someone help? Two level nested doc... ChildDocTransformerFactory syntax...

2017-11-06 Thread Petersen, Robert (Contr)
OK, no faceting, no filtering, I just want the hierarchy to come back in the results. Can't quite get it... googled all over the place too.


Doc:

{ id : asdf, type_s:customer, firstName_s:Manny, lastName_s:Acevedo, 
address_s:"123 Fourth Street", city_s:Gotham, tn_s:1234561234,
  _childDocuments_:[
  { id : adsf_c1,
src_s : "CRM.Customer",
type_s:customerSource,
_childDocuments_:[
{
id : asdf_c1_c1,
type_s:customerSourceType,
"key_s": "id",
"value_s": "GUID"
}
]
},
  { id : adsf_c2,
"src_s": "DPI.SalesOrder",
type_s:customerSource,
_childDocuments_:[
{
id : asdf_c2_c1,
type_s:customerSourceType,
"key_s": "btn",
"value_s": "4052328908"
},
{
id : asdf_c2_c2,
type_s:customerSourceType,
"key_s": "seq",
"value_s": "5"
   },
{
id : asdf_c2_c3,
type_s:customerSourceType,
"key_s": "env",
"value_s": "MS"
}
]
}
]
}
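
For reference, this is roughly how a two-level doc like the above gets built with SolrJ (a sketch only, untested; trimmed to the first child branch, with the localhost/temptest values from the queries below). As far as I understand it, the parent, children and grandchildren all end up in one flat block at index time, which may explain the flat results below.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexNestedDoc {
        public static void main(String[] args) throws Exception {
            SolrInputDocument parent = new SolrInputDocument();
            parent.addField("id", "asdf");
            parent.addField("type_s", "customer");
            parent.addField("firstName_s", "Manny");

            SolrInputDocument child = new SolrInputDocument();
            child.addField("id", "adsf_c1");
            child.addField("type_s", "customerSource");
            child.addField("src_s", "CRM.Customer");

            SolrInputDocument grandchild = new SolrInputDocument();
            grandchild.addField("id", "asdf_c1_c1");
            grandchild.addField("type_s", "customerSourceType");
            grandchild.addField("key_s", "id");
            grandchild.addField("value_s", "GUID");

            child.addChildDocument(grandchild);   // grandchild nests under the child
            parent.addChildDocument(child);       // child nests under the parent

            try (SolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/temptest").build()) {
                client.add(parent);
                client.commit();
            }
        }
    }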


Queries:

Gives all nested docs regardless of level as a flat set:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer]

Gives all nested child docs only:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource]

How to get nested grandchild docs at the correct level?
Nope, exception:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customerSource%20childFilter=type_s:customerSourceType]

Nope, exception:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customerSource]

Nope, but no exception; only gets children again though, like above:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customer*]

Nope, but no exception; only gets children again:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],[child%20parentFilter=type_s:customer*%20childFilter=type_s:customerSourceType]

Nope, same again... no grandchildren:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,p:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],q:[child%20parentFilter=-type_s:customer%20parentFilter=type_s:customerSource%20childFilter=type_s:customerSourceType]

Gives all but flat, no child-to-grandchild hierarchy:
http://localhost:8983/solr/temptest/select?q=id:asdf&fl=id,p:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSource],q:[child%20parentFilter=type_s:customer%20childFilter=type_s:customerSourceType]


Thanks in advance,

Robi



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Java 9

2017-11-06 Thread Petersen, Robert (Contr)
Actually, I can't believe they're deprecating UseConcMarkSweepGC. That was the one that finally made Solr 'sing' with no OOMs!


I guess they must have found something better, have to look into that...


Robi


From: Chris Hostetter 
Sent: Monday, November 6, 2017 3:07:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Java 9



: Anyone else been noticing this this msg when starting up solr with java 9? 
(This is just an FYI and not a real question)

: Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was 
deprecated in version 9.0 and will likely be removed in a future release.
: Java HotSpot(TM) 64-Bit Server VM warning: Option UseParNewGC was deprecated 
in version 9.0 and will likely be removed in a future release.

IIRC the default GC_TUNE options for Solr still assume java8, but also
work fine with java9 -- although they do cause those deprecation warnings
and result in using the JVM defaults

You are free to customize this in your solr.in.sh if you are running java9 and
don't like the deprecation warnings ... and/or open a Jira w/suggestions
for what Solr's default GC_TUNE option should be when running in java9 (i
don't know if there is any community concensus on that yet -- but you're
welcome to try and build some)


-Hoss
http://www.lucidworks.com/



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Java 9

2017-11-06 Thread Petersen, Robert (Contr)
Hi Guys,


Anyone else been noticing this this msg when starting up solr with java 9? 
(This is just an FYI and not a real question)


Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was 
deprecated in version 9.0 and will likely be removed in a future release.
Java HotSpot(TM) 64-Bit Server VM warning: Option UseParNewGC was deprecated in 
version 9.0 and will likely be removed in a future release.


Robi



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Petersen, Robert (Contr)
Hi Walter,


OK, now that sounds really interesting. I actually just turned on logging in Jetty and yes, I did see all the intra-cluster traffic there. I'm pushing our ELK team to pick out the GET search requests across the cluster and aggregate them for me. We'll see how that looks, but that would just be for user query analysis and not for real-time analysis. Still looking for something to monitor in real time, since apparently my company has all its New Relic licenses tied up with other level-one processes and doesn't want to buy any more of them at this time...  lol


And yes when I looked directly at the Graphite data backing Grafana at my last 
position it was just scary!


Thanks

Robi


PS early adopter of InfluxDB in general or just for this use case?


From: Walter Underwood <wun...@wunderwood.org>
Sent: Monday, November 6, 2017 1:44:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

We use New Relic across the site, but it doesn’t split out traffic to different 
endpoints. It also cannot distinguish between search traffic to the cluster and 
intra-cluster traffic. With four shards, the total traffic is 4X bigger than 
the incoming traffic.

We have a bunch of business metrics (orders) and other stuff that is currently 
in Graphite. We’ll almost certainly move all that to InfluxDB and Grafana.

The Solr metrics were overloading the Graphite database, so we’re the first 
service that is trying InfluxDB.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2017, at 1:31 PM, Petersen, Robert (Contr) 
> <robert.peters...@ftr.com> wrote:
>
> Hi Walter,
>
>
> Yes, now I see it. I'm wondering about using Grafana and New Relic at the 
> same time since New Relic has a dashboard and also costs money for corporate 
> use. I guess after a reread you are using Grafana to visualize the influxDB 
> data and New Relic just for JVM right?  Did this give you more control over 
> the solr metrics you are monitoring? (PS I've never heard of influxDB)
>
>
> Thanks
>
> Robi
>
> 
> From: Walter Underwood <wun...@wunderwood.org>
> Sent: Monday, November 6, 2017 11:26:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>
> Look back down the string to my post. We use Grafana.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Nov 6, 2017, at 11:23 AM, Petersen, Robert (Contr) 
>> <robert.peters...@ftr.com> wrote:
>>
>> Interesting! Finally a Grafana user... Thanks Daniel, I will follow your 
>> links. That looks promising.
>>
>>
>> Is anyone using Grafana over Graphite?
>>
>>
>> Thanks
>>
>> Robi
>>
>> 
>> From: Daniel Ortega <danielortegauf...@gmail.com>
>> Sent: Monday, November 6, 2017 11:19:10 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>
>> Hi Robert,
>>
>> We use the following stack:
>>
>> - Prometheus to scrape metrics (https://prometheus.io/)
>> - Prometheus node exporter to export "machine metrics" (Disk, network
>> usage, etc.) (https://github.com/prometheus/node_exporter)
>> - Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
>> Response times...) (https://github.com/prometheus/jmx_exporter)
>> - Grafana to visualize all the data scrapped by Prometheus (
>> https://grafana.com/)
>>
>> Best regards
>> Daniel Ortega
>>
>> 2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
>> robert.peters...@ftr.com>:
>>
>>> PS I knew sematext would be required to chime in here!  
>>>
>>>
>>> Is there a non-expiring dev version I could experiment with? I think I did
>>> sign up for a trial years ago from a different company... I was actually
>>> wondering about hooking it up to my personal AWS based solr cloud instance.
>>>
>>>
>>> Thanks
>>>
>>> Robi
>>>
>>> 
>>> From: Emir Arnautović <emir.arnauto...@sematext.com>
>>> Sent: Thursday, November 2, 2017 2:05:10 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>>
>>> Hi Robi,
>>> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
>>> more. We use it for monitoring 

Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Petersen, Robert (Contr)
Hi Walter,


Yes, now I see it. I'm wondering about using Grafana and New Relic at the same 
time since New Relic has a dashboard and also costs money for corporate use. I 
guess after a reread you are using Grafana to visualize the influxDB data and 
New Relic just for JVM right?  Did this give you more control over the solr 
metrics you are monitoring? (PS I've never heard of influxDB)


Thanks

Robi


From: Walter Underwood <wun...@wunderwood.org>
Sent: Monday, November 6, 2017 11:26:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

Look back down the string to my post. We use Grafana.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 6, 2017, at 11:23 AM, Petersen, Robert (Contr) 
> <robert.peters...@ftr.com> wrote:
>
> Interesting! Finally a Grafana user... Thanks Daniel, I will follow your 
> links. That looks promising.
>
>
> Is anyone using Grafana over Graphite?
>
>
> Thanks
>
> Robi
>
> 
> From: Daniel Ortega <danielortegauf...@gmail.com>
> Sent: Monday, November 6, 2017 11:19:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>
> Hi Robert,
>
> We use the following stack:
>
> - Prometheus to scrape metrics (https://prometheus.io/)
> - Prometheus node exporter to export "machine metrics" (Disk, network
> usage, etc.) (https://github.com/prometheus/node_exporter)
> - Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
> Response times...) (https://github.com/prometheus/jmx_exporter)
> - Grafana to visualize all the data scrapped by Prometheus (
> https://grafana.com/)
>
> Best regards
> Daniel Ortega
>
> 2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
> robert.peters...@ftr.com>:
>
>> PS I knew sematext would be required to chime in here!  
>>
>>
>> Is there a non-expiring dev version I could experiment with? I think I did
>> sign up for a trial years ago from a different company... I was actually
>> wondering about hooking it up to my personal AWS based solr cloud instance.
>>
>>
>> Thanks
>>
>> Robi
>>
>> 
>> From: Emir Arnautović <emir.arnauto...@sematext.com>
>> Sent: Thursday, November 2, 2017 2:05:10 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>>
>> Hi Robi,
>> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
>> more. We use it for monitoring our Solr instances and for consulting.
>>
>> Disclaimer - see signature :)
>>
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 2 Nov 2017, at 19:35, Walter Underwood <wun...@wunderwood.org> wrote:
>>>
>>> We use New Relic for JVM, CPU, and disk monitoring.
>>>
>>> I tried the built-in metrics support in 6.4, but it just didn’t do what
>> we want. We want rates and percentiles for each request handler. That gives
>> us 95th percentile for textbooks suggest or for homework search results
>> page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
>> that.
>>>
>>> We built a dedicated servlet filter that goes in front of the Solr
>> webapp and reports metrics. It has some special hacks to handle some weird
>> behavior in SolrJ. A request to the “/srp” handler is sent as
>> “/select?qt=/srp”, so we normalize that.
>>>
>>> The metrics start with the cluster name, the hostname, and the
>> collection. The rest is generated like this:
>>>
>>> URL: GET /solr/textbooks/select?q=foo&qt=/auto
>>> Metric: textbooks.GET./auto
>>>
>>> URL: GET /solr/textbooks/select?q=foo
>>> Metric: textbooks.GET./select
>>>
>>> URL: GET /solr/questions/auto
>>> Metric: questions.GET./auto
>>>
>>> So a full metric for the cluster “solr-cloud” and the host “search01"
>> would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
>>>
>>> We send all that to InfluxDB. We’ve configured a template so that each
>> part of the metric name is mapped to a field, so we can write efficient
>> queries in InfluxQL.
>>>
>>> Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch
>> (for the load balancer) 

Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Petersen, Robert (Contr)
Interesting! Finally a Grafana user... Thanks Daniel, I will follow your links. 
That looks promising.


Is anyone using Grafana over Graphite?


Thanks

Robi


From: Daniel Ortega <danielortegauf...@gmail.com>
Sent: Monday, November 6, 2017 11:19:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

Hi Robert,

We use the following stack:

- Prometheus to scrape metrics (https://prometheus.io/)
- Prometheus node exporter to export "machine metrics" (Disk, network
usage, etc.) (https://github.com/prometheus/node_exporter)
- Prometheus JMX exporter to export "Solr metrics" (Cache usage, QPS,
Response times...) (https://github.com/prometheus/jmx_exporter)
- Grafana to visualize all the data scrapped by Prometheus (
https://grafana.com/)

Best regards
Daniel Ortega

2017-11-06 20:13 GMT+01:00 Petersen, Robert (Contr) <
robert.peters...@ftr.com>:

> PS I knew sematext would be required to chime in here!  
>
>
> Is there a non-expiring dev version I could experiment with? I think I did
> sign up for a trial years ago from a different company... I was actually
> wondering about hooking it up to my personal AWS based solr cloud instance.
>
>
> Thanks
>
> Robi
>
> 
> From: Emir Arnautović <emir.arnauto...@sematext.com>
> Sent: Thursday, November 2, 2017 2:05:10 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Anyone have any comments on current solr monitoring favorites?
>
> Hi Robi,
> Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and
> more. We use it for monitoring our Solr instances and for consulting.
>
> Disclaimer - see signature :)
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 2 Nov 2017, at 19:35, Walter Underwood <wun...@wunderwood.org> wrote:
> >
> > We use New Relic for JVM, CPU, and disk monitoring.
> >
> > I tried the built-in metrics support in 6.4, but it just didn’t do what
> we want. We want rates and percentiles for each request handler. That gives
> us 95th percentile for textbooks suggest or for homework search results
> page, etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do
> that.
> >
> > We built a dedicated servlet filter that goes in front of the Solr
> webapp and reports metrics. It has some special hacks to handle some weird
> behavior in SolrJ. A request to the “/srp” handler is sent as
> “/select?qt=/srp”, so we normalize that.
> >
> > The metrics start with the cluster name, the hostname, and the
> collection. The rest is generated like this:
> >
> > URL: GET /solr/textbooks/select?q=foo&qt=/auto
> > Metric: textbooks.GET./auto
> >
> > URL: GET /solr/textbooks/select?q=foo
> > Metric: textbooks.GET./select
> >
> > URL: GET /solr/questions/auto
> > Metric: questions.GET./auto
> >
> > So a full metric for the cluster “solr-cloud” and the host “search01"
> would look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
> >
> > We send all that to InfluxDB. We’ve configured a template so that each
> part of the metric name is mapped to a field, so we can write efficient
> queries in InfluxQL.
> >
> > Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch
> (for the load balancer) and InfluxDB.
> >
> > I’m still working out the kinks in some of the more complicated queries,
> but the data is all there. I also want to expand the servlet filter to
> report HTTP response codes.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> >> On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) <
> robert.peters...@ftr.com> wrote:
> >>
> >> OK I'm probably going to open a can of worms here...  lol
> >>
> >>
> >> In the old old days I used PSI probe to monitor solr running on tomcat
> which worked ok on a machine by machine basis.
> >>
> >>
> >> Later I had a grafana dashboard on top of graphite monitoring which was
> really nice looking but kind of complicated to set up.
> >>
> >>
> >> Even later I successfully just dropped in a newrelic java agent which
> had solr monitors and a dashboard right out of the box, but it costs money
> for the full tamale.
> >>
> >>
> >> For basic JVM health and Solr QPS and time percentiles, does anyone
> have any favorites or other alternative suggestions?
> >>
> >>
> >> Thanks in advance!
> >>
> >> Robi
> >>
> >> 
> >>
> >> This communication is confidential. Frontier only sends and receives
> email on the basis of the terms set out at http://www.frontier.com/email_
> disclaimer.
> >
>
>



This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Re: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Petersen, Robert (Contr)
PS I knew sematext would be required to chime in here!  


Is there a non-expiring dev version I could experiment with? I think I did sign 
up for a trial years ago from a different company... I was actually wondering 
about hooking it up to my personal AWS based solr cloud instance.


Thanks

Robi


From: Emir Arnautović <emir.arnauto...@sematext.com>
Sent: Thursday, November 2, 2017 2:05:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Anyone have any comments on current solr monitoring favorites?

Hi Robi,
Did you try Sematext’s SPM? It provides host, JVM and Solr metrics and more. We 
use it for monitoring our Solr instances and for consulting.

Disclaimer - see signature :)

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 2 Nov 2017, at 19:35, Walter Underwood <wun...@wunderwood.org> wrote:
>
> We use New Relic for JVM, CPU, and disk monitoring.
>
> I tried the built-in metrics support in 6.4, but it just didn’t do what we 
> want. We want rates and percentiles for each request handler. That gives us 
> 95th percentile for textbooks suggest or for homework search results page, 
> etc. The Solr metrics didn’t do that. The Jetty metrics didn’t do that.
>
> We built a dedicated servlet filter that goes in front of the Solr webapp and 
> reports metrics. It has some special hacks to handle some weird behavior in 
> SolrJ. A request to the “/srp” handler is sent as “/select?qt=/srp”, so we 
> normalize that.
>
> The metrics start with the cluster name, the hostname, and the collection. 
> The rest is generated like this:
>
> URL: GET /solr/textbooks/select?q=foo&qt=/auto
> Metric: textbooks.GET./auto
>
> URL: GET /solr/textbooks/select?q=foo
> Metric: textbooks.GET./select
>
> URL: GET /solr/questions/auto
> Metric: questions.GET./auto
>
> So a full metric for the cluster “solr-cloud” and the host “search01" would 
> look like “solr-cloud.search01.solr.textbooks.GET./auto.m1_rate”.
>
> We send all that to InfluxDB. We’ve configured a template so that each part 
> of the metric name is mapped to a field, so we can write efficient queries in 
> InfluxQL.
>
> Metrics are graphed in Grafana. We have dashboards that mix Cloudwatch (for 
> the load balancer) and InfluxDB.
>
> I’m still working out the kinks in some of the more complicated queries, but 
> the data is all there. I also want to expand the servlet filter to report 
> HTTP response codes.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Nov 2, 2017, at 9:30 AM, Petersen, Robert (Contr) 
>> <robert.peters...@ftr.com> wrote:
>>
>> OK I'm probably going to open a can of worms here...  lol
>>
>>
>> In the old old days I used PSI probe to monitor solr running on tomcat which 
>> worked ok on a machine by machine basis.
>>
>>
>> Later I had a grafana dashboard on top of graphite monitoring which was 
>> really nice looking but kind of complicated to set up.
>>
>>
>> Even later I successfully just dropped in a newrelic java agent which had 
>> solr monitors and a dashboard right out of the box, but it costs money for 
>> the full tamale.
>>
>>
>> For basic JVM health and Solr QPS and time percentiles, does anyone have any 
>> favorites or other alternative suggestions?
>>
>>
>> Thanks in advance!
>>
>> Robi
>>
>> 
>>
>> This communication is confidential. Frontier only sends and receives email 
>> on the basis of the terms set out at 
>> http://www.frontier.com/email_disclaimer.
>



String payloads...

2017-11-06 Thread Petersen, Robert (Contr)
Hi Guys,


I was playing with the payloads example, as I had a possible use case of alternate product titles for a product.

https://lucidworks.com/2017/09/14/solr-payloads/

bin/solr start
bin/solr create -c payloads
bin/post -c payloads -type text/csv -out yes -d $'id,vals_dpf\n1,one|1.0 
two|2.0 three|3.0\n2,weig...

I saw you could do this:

http://localhost:8983/solr/payloads/query?q=*:*&wt=csv&fl=id,p:payload(vals_dpf,three)
id,p
1,3.0
2,0.0

So I wanted to do something similar with strings, and so I loaded Solr with


./post -c payloads -type text/csv -out yes -d $'id,vals_dps\n1,one|thisisastring two|"this is a string" three|hi\n2,json|{asdf:123}'


http://localhost:8983/solr/payloads/query?q=vals_dps:json


[{"id":"2","vals_dps":"json|{asdf:123}","_version_":1583284597287813000}]


OK, so here is my question: it seems like the payload function only works against numeric payloads. Further, I can't see a way to get the payload to come out alone, without the field value attached. What I would like is something like this; is this possible in any way? I know it would be easy enough to do some post-query processing in a service layer, but... just wondering about this. It seems like I should be able to get at the payload when it is a string.


http://localhost:8983/solr/payloads/query?q=vals_dps:json&fl=id,p:payloadvalue(vals_dpf,json)


[{"id":"2","p":"{asdf:123}","_version_":1583284597287813000}]

Thanks

Robi




This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


Anyone have any comments on current solr monitoring favorites?

2017-11-02 Thread Petersen, Robert (Contr)
OK I'm probably going to open a can of worms here...  lol


In the old old days I used PSI probe to monitor solr running on tomcat which 
worked ok on a machine by machine basis.


Later I had a grafana dashboard on top of graphite monitoring which was really 
nice looking but kind of complicated to set up.


Even later I successfully just dropped in a newrelic java agent which had solr 
monitors and a dashboard right out of the box, but it costs money for the full 
tamale.


For basic JVM health and Solr QPS and time percentiles, does anyone have any 
favorites or other alternative suggestions?


Thanks in advance!

Robi





Re: Upgrade path from 5.4.1

2017-11-02 Thread Petersen, Robert (Contr)
Thanks guys! I kind of suspected this would be the best route and I'll move 
forward with a fresh start on 7.x as soon as I can get ops to give me the 
needed machines! 


Best

Robi


From: Erick Erickson 
Sent: Thursday, November 2, 2017 8:17:49 AM
To: solr-user
Subject: Re: Upgrade path from 5.4.1

Yonik:

Yeah, I was just parroting what had been reported; I have no data to
back it up personally. I just saw the JIRA that Simon indicated and it
looks like the statement "which are faster on all fronts and use less
memory" is just flat wrong when it comes to looking up individual
values.

Ya learn somethin' new every day.

On Thu, Nov 2, 2017 at 6:57 AM, simon  wrote:
> though see SOLR-11078 , which is reporting significant query slowdowns
> after converting  *Trie to *Point fields in 7.1, compared with 6.4.2
>
> On Wed, Nov 1, 2017 at 9:06 PM, Yonik Seeley  wrote:
>
>> On Wed, Nov 1, 2017 at 2:36 PM, Erick Erickson 
>> wrote:
>> > I _always_ prefer to reindex if possible. Additionally, as of Solr 7
>> > all the numeric types are deprecated in favor of points-based types
>> > which are faster on all fronts and use less memory.
>>
>> They are a good step forward in general, and faster for range queries
>> (and multiple-dimensions), but looking at the design I'd guess that
>> they may be slower for exact-match queries?
>> Has anyone tested this?
>>
>> -Yonik
>>





Upgrade path from 5.4.1

2017-11-01 Thread Petersen, Robert (Contr)
Hi Guys,


I just took over the care and feeding of three poor neglected solr 5.4.1 cloud 
clusters at my new position. While spinning up new collections and supporting 
other business initiatives I am pushing management to give me the green light 
on migrating to a newer version of solr. The last solr I worked with was 6.6.1 
and I was thinking of doing an upgrade to that (er, actually 6.6.2), as I was 
reading that an existing index only upgrades one major version number at a time.


Then I realized the existing 5.4.1 cloud clusters here were set up with 
unmanaged configs, so now I'm starting to lean toward just spinning up clean 
new 6.6.2 or 7.1 clouds on new machines, leaving the existing 5.4.1 machines in 
place, and then reindexing everything onto the new machines. The intention is to 
test, swap in the new machines, and finally destroy the old ones when the dust 
settles (they're all virtuals, so no problem just destroying the old instances 
and recovering their resources).


Thoughts?


Thanks

Robi





RE: Do I really need copyField when my app can do the copy?

2015-07-08 Thread Petersen, Robert
Perhaps some people, like those using DIH to feed their index, might not 
have that luxury, and copyField is the better way for them.  If you have an 
application layer you can do it either way; I have done both in different 
situations.
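
For anyone following along, the schema-side version is just a one-liner per source field (field names here are only examples):

<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>

Whether you do it in the schema or in the feeding application, the destination field ends up with the same content; the copyField route just keeps the logic next to the schema instead of in the client.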

Robi

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Wednesday, July 08, 2015 3:38 PM
To: solr-user@lucene.apache.org
Subject: Do I really need copyField when my app can do the copy?

Hi Everyone,

What good is the use of copyField in Solr's schema.xml if my application can do 
the copy into the designated field itself?  Having my application do so helps me simplify 
the schema.xml maintenance task, thus my motivation.

Thanks

Steve


RE: Best practice to support multi-tenant with Solr

2014-03-15 Thread Petersen, Robert
Hi 

Overall I think you are mixing up your terminology.  What used to be called a 
'core' is now called a 'collection' in solr cloud.  In the old master slave 
setup, you made separate cores and replicated them to all slaves.  Now they 
want you to think of them as collections and let the cloud manage the 
distribution over the physical machines and their cores.  
https://wiki.apache.org/solr/SolrTerminology

On the multi-tenancy front, I have one core/collection with thousands of 
tenants.  I manage the separation of concerns with dynamic fields using the 
tenant ids as prefixes.  Thus I can have one schema allowing searches across 
all tenants or restricted to one tenants data.  This is secure because I use a 
wrapper web service to present a simpler API to the web clients and the wrapper 
constructs the actual queries to solr behind the curtains, thus nobody can make 
any malicious queries.  Security: done.

On the performance front, one big index for all the tenants works fine.  It's 
probably just as good as having thousands of collections and much simpler to 
maintain.
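
A rough sketch of the prefix idea, with made-up field names and assuming the stock text_general type: one dynamic field definition covers every tenant, and the wrapper service builds the field name from the tenant id.

<dynamicField name="*_title_t" type="text_general" indexed="true" stored="true"/>

# documents for tenant 42 carry fields like tenant42_title_t, so a single-tenant search is just
q=tenant42_title_t:(blue widget)

Searching across all tenants would go against a shared, unprefixed field populated the same way.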

Hope that helps a bit,
Robi


-Original Message-
From: shushuai zhu [mailto:ss...@yahoo.com] 
Sent: Saturday, March 15, 2014 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Best practice to support multi-tenant with Solr

Hi Lajos, thanks again. 
 
Your suggestion is to support multi-tenant via collection in a Solr Cloud: 
putting small tenants in one collection and big tenants in their own 
collections. 
 
My original question was to find out which approach is better: supporting 
multi-tenancy at the collection level or the core level. Based on the links below and a 
few comments there, it seems people prefer the core level. A collection is 
logical and a core is physical. I am trying to figure out the trade-offs between 
the approaches with regard to scalability, security, performance, and 
flexibility. My understanding might be wrong; below is a rough 
comparison:
 
1) Scalability
Core is more scalable than collection by number: we can have many more cores 
than collections in one Solr Cloud? Or collection is more scalable than core by 
size: a collection could be much bigger than a core? Not sure which one is 
better: having ~1000 cores or ~1000 collections in a Solr Cloud.
 
2) Security
Core is more isolated than collection: core is physical and has its own index, 
but collection is logical so multiple collections may contain the same cores?
 
3) Performance
Core has better performance control since it has its own index? Collection 
index is bigger so performance is not as good as smaller core index?
 
4) Flexibility
Core is more flexible since it has its own schema/config, but one collection 
may have multiple cores hence multiple schemas/configs? Or it does not matter 
since we can set same schema/config for the whole collection?
 
Basically, I just want to get opinions about which approach might be better for 
the given use case.
 
Regards.
 
Shushuai



From: Lajos la...@protulae.com
To: solr-user@lucene.apache.org 
Sent: Saturday, March 15, 2014 1:19 PM
Subject: Re: Best practice to support multi-tenant with Solr


Hi Shushuai,


 ---
 Finally, I would (in general) argue for cloud-based implementations to give 
 you data redundancy ...
 ---
 Do you mean using multi-sharding to have multiple replicas of cores 
 (corresponding to tenants) across nodes?

 Shushuai




What I mean first and foremost is that using SolrCloud with replication 
ensures that your data isn't lost if you lose a node. So in a hosted 
solution, that's a good thing.

If you are using SolrCloud, then its up to you to choose whether to have 
one collection per tenant, or one collection that supports multiple 
tenants via document routing.

Obviously the former has implications on the number of shards you'll 
have. For example, if you have a 3-node cluster with replication factor 
of 2, that's 6 shards per collection. If you have 1,000 tenant 
collections, that's 6,000 shards. Hence my argument for multiple low-end 
tenants per collection, and then only give your higher-end tenants their 
own collections. Just to make things simpler for you ;)

Regards, 


Lajos



 
 From: Lajos la...@protulae.com
 To: solr-user@lucene.apache.org
 Sent: Saturday, March 15, 2014 5:37 AM
 Subject: Re: Best practice to support multi-tenant with Solr


 Hi Shushuai,

 Just a few thoughts.

 I would guess that most people would argue for implementing
 multi-tenancy within your core (via some unique filter ID) or collection
 (via document routing) because of the headache of managing individual
 cores at the scale you are talking about.

 There are disadvantages the other way too: having a core/collection
 support multiple tenants does affect scoring, since TF-IDF is calculated
 across the index, and can open up security implications that you have to
 address 

RE: network slows when solr is running - help

2014-03-04 Thread Petersen, Robert
<autoCommit> 
  <maxDocs>25</maxDocs>
  <maxTime>90</maxTime> 
</autoCommit>

-Original Message-
From: Lan [mailto:dung@gmail.com] 
Sent: Monday, March 03, 2014 1:24 PM
To: solr-user@lucene.apache.org
Subject: Re: network slows when solr is running - help

How frequently are you committing? Frequent commits can slow everything down.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/network-slows-when-solr-is-running-help-tp4120523p4120992.html
Sent from the Solr - User mailing list archive at Nabble.com.


network slows when solr is running - help

2014-02-28 Thread Petersen, Robert
Hi guys,

Got an odd thing going on right now.  Indexing into my master server (solr 
3.6.1) has slowed and it is because when solr runs ping shows latency.  When I 
stop solr though, ping returns to normal.  This has been happening 
occasionally, rebooting didn't help.  This is the first time I noticed that 
stopping solr returns ping speeds to normal.  I was thinking it was something 
with our network.   Solr is not consuming all resources on the box or anything 
like that, and normally everything works fine.  Has anyone seen this type of 
thing before?  Let me know if more info of any kind is needed.

Solr process is at 8% memory utilization and 35% cpu utilization in 'top' 
command.

Note: solr is the only thing running on the box.

C:\Users\robertpe>ping 10.12.132.101  <-- Indexing

Pinging 10.12.132.101 with 32 bytes of data:
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

Ping statistics for 10.12.132.101:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\Users\robertpe>ping 10.12.132.101  <-- Solr stopped

Pinging 10.12.132.101 with 32 bytes of data:
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

Ping statistics for 10.12.132.101:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\Users\robertpe>ping 10.12.132.101  <-- Solr started but no indexing activity

Pinging 10.12.132.101 with 32 bytes of data:
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

Ping statistics for 10.12.132.101:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms

C:\Users\robertpe>ping 10.12.132.101  <-- Solr started and indexing started

Pinging 10.12.132.101 with 32 bytes of data:
Reply from 10.12.132.101: bytes=32 time=53ms TTL=64
Reply from 10.12.132.101: bytes=32 time=51ms TTL=64
Reply from 10.12.132.101: bytes=32 time=48ms TTL=64
Reply from 10.12.132.101: bytes=32 time=51ms TTL=64

Ping statistics for 10.12.132.101:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 48ms, Maximum = 53ms, Average = 50ms

Robert (Robi) Petersen
Senior Software Engineer
Search Department





RE: network slows when solr is running - help

2014-02-28 Thread Petersen, Robert
Yes my indexer runs as a service on a different box, it has 24 threads pushing 
docs to solr atomically.  No the solr master is not virtual, it has 64 GB main 
memory and dual quad xeon cpus.  The cpu utilization is not maxed out from what 
I can see in 'top'.  Right now it says 38%.  The other thing is that this only 
happens intermittently.  I'm going to have IT update firmware on the NIC and 
then we'll open a ticket with HP for lack of anything else.

Here is some other information:

OS Name: Linux
OS Version: 2.6.18-128.el5
Total RAM: 62.92 GB
Free RAM: 44.20 GB
Committed JVM memory: 36.03 GB
Total swap: 20.00 GB
Free swap: 20.00 GB

NUMBER OF REQUESTS EACH INTERVAL: request count: 232678   error count: 5
PROCESSING TIME (MS) IN EACH INTERVAL: processing time: 15355740   max time: 79408
TRAFFIC VOLUME (BYTES) IN EACH INTERVAL: sent: 702 GB   received: 956 MB

-Original Message-
From: Josh [mailto:jwda...@gmail.com] 
Sent: Friday, February 28, 2014 1:27 PM
To: solr-user@lucene.apache.org
Subject: Re: network slows when solr is running - help

Is it indexing data from over the network? (High data throughput would increase 
latency.) Is it a virtual machine? (Other machines could be causing slowdowns.) 
Another possible option is that the network card is offloading processing 
onto the CPU, which introduces latency when the CPU is under load.


On Fri, Feb 28, 2014 at 4:11 PM, Petersen, Robert  
robert.peter...@mail.rakuten.com wrote:

 Hi guys,

 Got an odd thing going on right now.  Indexing into my master server 
 (solr
 3.6.1) has slowed and it is because when solr runs ping shows latency.
  When I stop solr though, ping returns to normal.  This has been 
 happening occasionally, rebooting didn't help.  This is the first time 
 I noticed that stopping solr returns ping speeds to normal.  I was thinking 
 it was
 something with our network.   Solr is not consuming all resources on the
 box or anything like that, and normally everything works fine.  Has 
 anyone seen this type of thing before?  Let me know if more info of 
 any kind is needed.

 Solr process is at 8% memory utilization and 35% cpu utilization in 'top'
 command.

 Note: solr is the only thing running on the box.

 C:\Users\robertpe>ping 10.12.132.101  <-- Indexing

 Pinging 10.12.132.101 with 32 bytes of data:
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

 Ping statistics for 10.12.132.101:
 Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
 Approximate round trip times in milli-seconds:
 Minimum = 0ms, Maximum = 0ms, Average = 0ms

 C:\Users\robertpe>ping 10.12.132.101  <-- Solr stopped

 Pinging 10.12.132.101 with 32 bytes of data:
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

 Ping statistics for 10.12.132.101:
 Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
 Approximate round trip times in milli-seconds:
 Minimum = 0ms, Maximum = 0ms, Average = 0ms

 C:\Users\robertpe>ping 10.12.132.101  <-- Solr started but no indexing activity

 Pinging 10.12.132.101 with 32 bytes of data:
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

 Ping statistics for 10.12.132.101:
 Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
 Approximate round trip times in milli-seconds:
 Minimum = 0ms, Maximum = 0ms, Average = 0ms

 C:\Users\robertpe>ping 10.12.132.101  <-- Solr started and indexing started

 Pinging 10.12.132.101 with 32 bytes of data:
 Reply from 10.12.132.101: bytes=32 time=53ms TTL=64
 Reply from 10.12.132.101: bytes=32 time=51ms TTL=64
 Reply from 10.12.132.101: bytes=32 time=48ms TTL=64
 Reply from 10.12.132.101: bytes=32 time=51ms TTL=64

 Ping statistics for 10.12.132.101:
 Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
 Approximate round trip times in milli-seconds:
 Minimum = 48ms, Maximum = 53ms, Average = 50ms

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department






RE: Searching with special chars

2014-02-27 Thread Petersen, Robert
I agree with Erick, but if you want the special characters to count in 
searches, you might consider not just stripping them out but replacing them 
with textual placeholders (which would also have to be done at indexing time).  
For instance, I replace C# with csharp and C++ with cplusplus during indexing 
and during searching before passing them along to my solr layer.
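
If you would rather keep that mapping inside Solr than in the application layer, a char filter in the analyzer (on both the index and query side) should do roughly the same substitution; the patterns below are just examples:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="C#" replacement="csharp"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="C\+\+" replacement="cplusplus"/>

The pattern attribute is a regex, hence the escaped plus signs.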

Hope that helps,
Robi

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, February 27, 2014 7:45 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching with special chars

Good luck! You'll need it.

Problem is this is such a sticky wicket. You can move the cleaning up to the 
PHP layer, that is strip out the parens.

You could write a Solr component that got the query _very_ early and 
transformed it. You'd have to get here before parsing.

Either way, though, you'll be endlessly trying to second-guess the query 
parsing and/or intent of the user.

I'd recommend the PHP layer if anything, it's closer to the user and you may 
have a better chance to guess right.

Best,
Erick


On Wed, Feb 26, 2014 at 10:36 PM, deniz denizdurmu...@gmail.com wrote:

 Hello,

 We are facing some kinda weird problem. So here is the scenario:

 We have a frontend and a middle-ware which is dealing with user input 
 search queries before posting to Solr.

 So when a user enters city:Frankenthal_(Pfalz) and then searches, 
 there is no result although there are fields on some documents 
 matching city:Frankenthal_(Pfalz). We are aware that we can escape 
 those chars, but the middleware which is accepting queries is running 
 on a Glassfish server, which is refusing URLs with backslashes in it, 
 hence using backslashes is not okay for posting the query.

 To make everyone clear about the system it looks like:

 (PHP) - Encoded JSON - (Glassfish App - Middleware) - Javabin - 
 Solr

 any other ideas on how to deal with queries with special chars like this one?



 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Searching-with-special-chars-tp4120
 047.html Sent from the Solr - User mailing list archive at Nabble.com.



RE: expungeDeletes vs optimize

2014-02-05 Thread Petersen, Robert
Hi Bryan,

From what I've seen it will only get rid of the deletes in the segments that 
the commit merged and there will be some residual deleted docs still in the 
index.  It doesn't do the full rewrite.  Even if you play with merge factors 
etc, you'll still have lint. In your situation I'd probably run the full 
optimize after your process finishes, then you'll have the smallest and 
fastest-to-search index possible.  
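
If memory serves, both knobs can be hit straight from the update handler; the core name here is just a placeholder:

# merge away most of the deleted docs without a full rewrite
curl 'http://localhost:8983/solr/mycore/update?commit=true&expungeDeletes=true'

# full optimize down to a single segment
curl 'http://localhost:8983/solr/mycore/update?optimize=true&maxSegments=1'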

Cheers,
Robi

-Original Message-
From: Bryan Bende [mailto:bbe...@gmail.com] 
Sent: Wednesday, February 05, 2014 6:37 AM
To: solr-user@lucene.apache.org
Subject: expungeDeletes vs optimize

Does calling commit with expungeDeletes=true result in a full rewrite of the 
index like an optimize does? or does it only merge away the documents that were 
deleted by commit?

Every two weeks or so we run a process to rebuild our index from the original 
documents resulting in a large amount of deleted docs still on disk, and 
basically doubling the amount of disk space used by the index. We are trying 
determine if it is best to just run an optimize at the end of this process, or 
if there is a better solution. This is with solr 4.3.



RE: Interesting search question! How to match documents based on the least number of fields that match all query terms?

2014-01-22 Thread Petersen, Robert
Hi Daniel,

How about trying something like this (you'll have to play with the boosts to 
tune this), search all the fields with all the terms using edismax and use the 
minimum should match parameter, but require all terms to match in the 
allMetadata field.
https://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29

Lucene query syntax below to give you the general idea, but this query would 
require all terms to be in one of the metadata fields to get the boost.

metadata1:(term1 AND ... AND termN)^2
metadata2:(term1 AND ... AND termN)^2
...
metadataN:(term1 AND ... AND termN)^2
allMetadatas :(term1 AND ... AND termN)^0.5
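
To make that concrete, with two terms and two metadata fields the query string sent to the lucene parser might look roughly like this (terms and field names are made up). The + makes the catch-all clause mandatory, while the per-field clauses only add score when a single field holds every term:

q=+allMetadatas:(red AND truck)^0.5 metadata1:(red AND truck)^2 metadata2:(red AND truck)^2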

That should do approximately what you want,
Robi

-Original Message-
From: Daniel Shane [mailto:sha...@lexum.com] 
Sent: Tuesday, January 21, 2014 8:42 AM
To: solr-user@lucene.apache.org
Subject: Interesting search question! How to match documents based on the least 
number of fields that match all query terms?

I have an interesting solr/lucene question and it's quite possible that some new 
features in solr might make this much easier than what I am about to try. If 
anyone has a clever idea on how to do this search, please let me know!

Basically, let's say that I have an index in which each document has a 
content field and several metadata fields.

Document Fields:

content
metadata1
metadata2
...
metadataN
allMetadatas (all the terms indexed in metadata1...N are concatenated in this 
field) 

Assuming that I am searching for documents that contains a certain number of 
terms (term1 to termN) in their metadata fields, I would like to build a search 
query that will return document that satisfy these requirement:

a) All search terms must be present in a metadata field. This is quite easy, we 
can simply search in the field allMetadatas and that will work fine.

b) Now for the hard part: we prefer documents in which we found the metadata in 
the *least number of different fields*. So if one document contains all the 
search terms in 10 different fields, but another document contains all the search 
terms in only 8 fields, we would like the latter to sort first. 

My first idea was to index terms in the allMetadatas using payloads. Each 
indexed term would also have the specific metadataN field from which they 
originate. Then I can write a scorer to score based on these payloads. 

However, if there is a way to do this without payloads I'm all ears!

-- 
Daniel Shane
Lexum (www.lexum.com)
sha...@lexum.com



solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Petersen, Robert
Hi solr users,

We have a new use case where we need to make a pile of data available as XML to a 
client and I was thinking we could easily put all this data into a solr 
collection and the client could just do a star search and page through all the 
results to obtain the data we need to give them.  Then I remembered we 
currently don't allow deep paging in our current search indexes as performance 
declines the deeper you go.  Is this still the case?

If so, is there another approach to make all the data in a collection easily 
available for retrieval?  The only thing I can think of is to query our DB for 
all the unique IDs of all the documents in the collection and then pull out the 
documents out in small groups with successive queries like 'UniqueIdField:(id1 
OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)' which 
doesn't seem like a very good approach because the DB might have been updated 
with new data which hasn't been indexed yet and so all the ids might not be in 
there (which may or may not matter I suppose).

Then I was thinking we could have a field with an incrementing numeric value 
which could be used to perform range queries as a substitute for paging through 
everything.  Ie queries like 'IncrementalField:[1 TO 100]' 
'IncrementalField:[101 TO 200]' but this would be difficult to maintain as we 
update the index unless we reindex the entire collection every time we update 
any docs at all.

Is this perhaps not a good use case for solr?  Should I use something else or 
is there another approach that would work here to allow a client to pull groups 
of docs in a collection through the rest api until the client has gotten them 
all?

Thanks
Robi



RE: solr as nosql - pulling all docs vs deep paging limitations

2013-12-17 Thread Petersen, Robert
My use case is basically to do a dump of all contents of the index with no 
ordering needed.  It's actually to be a product data export for third parties.  
Unique key is product sku.  I could take the min sku and range query up to the 
max sku but the skus are not contiguous because some get turned off and only 
some are valid for export so each range would return a different number of 
products (which may or may not be acceptable and I might be able to kind of 
hide that with some code).
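
Roughly what that sku walk would look like as queries (field names and values are made up): sort by the key and push the filter's lower bound forward each page.

# first page
q=*:*&sort=sku asc&rows=1000&fl=sku,name

# next page, assuming the last sku returned was 100457
q=*:*&sort=sku asc&rows=1000&fl=sku,name&fq=sku:{100457 TO *}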

-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
Sent: Tuesday, December 17, 2013 10:41 AM
To: solr-user
Subject: Re: solr as nosql - pulling all docs vs deep paging limitations

Hoss,

What about SELECT * FROM ... WHERE ... style misuse of Solr? I'm sure you've been 
asked for that many times.
What if the client doesn't need to rank results, but just requests an 
unordered filtering result like they are used to in an RDBMS?
Do you feel it will never be considered a reasonable use case for Solr, or is there 
a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : Then I remembered we currently don't allow deep paging in our 
 current
 : search indexes as performance declines the deeper you go.  Is this 
 still
 : the case?

 Coincidently, i'm working on a new cursor based API to make this much 
 more feasible as we speak..

 https://issues.apache.org/jira/browse/SOLR-5463

 I did some simple perf testing of the strawman approach and posted the 
 results last week...


 http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iterat
 ion-of-large-result-sets/

 ...current iterations on the patch are to eliminate the strawman code 
 to improve performance even more and beef up the test cases.

 : If so, is there another approach to make all the data in a 
 collection
 : easily available for retrieval?  The only thing I can think of is to
 ...
 : Then I was thinking we could have a field with an incrementing 
 numeric
 : value which could be used to perform range queries as a substitute 
 for
 : paging through everything.  Ie queries like 'IncrementalField:[1 TO
 : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
 : maintain as we update the index unless we reindex the entire 
 collection
 : every time we update any docs at all.

 As i mentioned in the blog above, as long as you have a uniqueKey 
 field that supports range queries, bulk exporting of all documents is 
 fairly trivial by sorting on your uniqueKey field and using an fq that 
 also filters on your uniqueKey field modify the fq each time to change 
 the lower bound to match the highest ID you got on the previous page.

 This approach works really well in simple cases where you wnat to 
 fetch all documents matching a query and then process/sort them by 
 some other criteria on the client -- but it's not viable if it's 
 important to you that the documents come back from solr in score order 
 before your client gets them because you want to stop fetching once 
 some criteria is met in your client.  Example: you have billions of 
 documents matching a query, you want to fetch all sorted by score desc 
 and crunch them on your client to compute some stats, and once your 
 client side stat crunching tells you you have enough results (which 
 might be after the 1000th result, or might be after the millionth result) 
 then you want to stop.

 SOLR-5463 will help even in that latter case.  The bulk of the patch 
 should be easy to use in the next day or so (having other people try out 
 and test in their applications would be *very* helpful) and hopefully 
 show up in Solr 4.7

 -Hoss
 http://www.lucidworks.com/




--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com



RE: Best implementation for multi-price store?

2013-11-21 Thread Petersen, Robert
Hi,

I'd go with (2) also but using dynamic fields so you don't have to define all 
the storeX_price fields in your schema but rather just one *_price field.  Then 
when you filter on store:store1 you'd know to sort with store1_price and so 
forth for units.  That should be pretty straightforward.
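
A minimal sketch of option 2 with dynamic fields, assuming the stock float/int field types and a multivalued field listing which stores carry each product (all names here are examples):

<dynamicField name="*_price" type="float" indexed="true" stored="true"/>
<dynamicField name="*_units" type="int" indexed="true" stored="true"/>

# filter to one store and sort by that store's own price field
q=*:*&fq=stores:store1&sort=store1_price asc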

Hope that helps,
Robi

-Original Message-
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com] 
Sent: Thursday, November 21, 2013 1:36 AM
To: solr-user@lucene.apache.org
Subject: Best implementation for multi-price store?

Hi,

I've been recently ask to implement an application to search products from 
several stores, each store having different prices and stock for the same 
product.

So I have products that have the usual fields (name, description, brand,
etc) and also number of units and price for each store. I must be able to 
filter for a given store and order by stock or price for that store. The 
application should also allow incresing the number of stores, fields depending 
of store and number of products without much work.

The numbers for the application are more or less 100 stores and 7M products.

I've been thinking of some ways of defining the index structure but I don't 
know wich one is better as I think each one has it's pros and cons.


   1. *Each product-store as a document:* Denormalizing the information so
   for every product and store I have a different document. Pros are that I
   can filter and order without problems and that adding a new store-depending
   field is very easy. Cons are that the index goes from 7M documents to 700M
   and that most of the info is redundant as most of the fields are repeated
   among stores.
   2. *Each field-store as a field:* For example for price I would have
   store1_price, store2_price,  Pros are that the index stays at 7M
   documents, and I can still filter and sort by those fields. Cons are that I
   have to add some logic so if I filter by one store I order for the
   associated price field, and that number of fields increases as number of
   store-depending fields x number of stores. I don't know if having more
   fields affects performance, but adding new store-depending fields will
   increase the number of fields even more
   3. *Join:* First time I read about solr joins thought it was the way to
   go in this case, but after reading a bit more and doing some tests I'm not
   so sure about it... Maybe I've done it wrong but I think it also
   denormalizes the info (So I will also have 700M documents) and besides I
   can't order or filter by store fields.


I must say my preferred option is number 2, so I don't duplicate information, I 
keep a relatively small number of documents and I can filter and sort by the 
store fields. However, my main concern here is I don't know if having too many 
fields in a document will be harmful to performance.

Which one do you think is the best approach for this application? Is there a 
better approach that I have missed?

Thanks in advance



--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42



RE: Sorting memory-efficiently by any numeric field (dates too?)

2013-11-12 Thread Petersen, Robert
Hi Erick,

I like your idea, FWIW please also leave room for boost by function query which 
takes many numeric fields as input but results in a single value.  I don't know 
if this counts as a really clever function but here's one that I currently use:

{!boost 
b=pow(sum(log(sum(product(boosted,9000),product(product(image,stocked),300),product(product(image,taxonomyCategoryTypeId),300),product(product(image,sales),150),product(stocked,2),product(sales,2),views)),1),3)}

Note, image is an int/bool field:  1=has image, 0=no image, hence all the 
product(product(image,...),...) terms above as they negate the boosts if there 
isn't an image!

Thanks
Robi

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, November 12, 2013 9:01 AM
To: solr-user@lucene.apache.org
Subject: Sorting memory-efficiently by any numeric field (dates too?)

Before I go and pat myself on the back, what do people think about this trick? 
The base problem is: "Is there a space-efficient way to return the top N 
documents, sorted by a numeric field?" The numeric field includes dates.

It come to me in a vision in a flash! (The Pickle Song, Arlo Guthrie). If we 
could return the numeric field in question as the score of a document it should 
work without allocating the internal arrays for holding all the timestamps.

So what about something like this?
/select?q={!boost b=manufacturedate_dt}text:*
and reverse order by
/select?q={!boost b=div(1,manufacturedate_dt)}text:*

It works on the test data. So let's assume that we're space constrained. It 
_seems_ like this would only allocate enough space for the top N documents in 
the result set which is insignificant in terms of memory consumption for a 
large number of documents in a core. Any obvious problems that people see?

I see a couple of shortcomings:

1> You only get one field. Unless you can create a really clever function
that incorporates all the values in multiple fields, this is going to be hard 
to use with more than one field.

2> The boost syntax doesn't allow for a *:*, so you have to specify an
existing field. If there happen to be documents that don't have anything in the 
field, you'll miss them.

3> I'm not sure what the performance issues are, especially in the case
where _every_ document scores better than the current top-N

Erick



RE: removing duplicates

2013-08-21 Thread Petersen, Robert
Hi

Perhaps you could query for all documents asking for the id field to be 
returned and then facet on the field you say you can key off of for duplicates. 
 Set the facet mincount to 2, then you would have to filter on each facet value 
and page through all doc IDs (except skip the first document) for each returned 
facet and delete by ID using a small app or something like that.  Spin all the 
deletes into the index and then do a commit at the end.  I think that would do 
it.

Thanks
Robi

-Original Message-
From: Ali, Saqib [mailto:docbook@gmail.com] 
Sent: Wednesday, August 21, 2013 2:15 PM
To: solr-user@lucene.apache.org
Subject: removing duplicates

hello,

We have documents that are duplicates, i.e. the ID is different but the rest of the 
fields are the same. Is there a query that can remove duplicates and just leave one 
copy of each document in solr? There is one numeric field that we can key off of 
to find duplicates.

Please advise.

Thanks



RE: removing duplicates

2013-08-21 Thread Petersen, Robert
This would describe the facet parameters we're talking about:

http://wiki.apache.org/solr/SimpleFacetParameters

Query something like this:
http://localhost:8983/solr/select?q=*:*&fl=id&rows=0&facet=true&facet.limit=-1&facet.field=<your field name>&facet.mincount=2

Then filter on each facet returned with a filter query described here: 
http://wiki.apache.org/solr/CommonQueryParameters
Example: q=*:*&fq=<your field name>:<your field value>

Then you would have to get all ids returned and delete all but the first one 
using some app...
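
Once the app has the ids to drop, the deletes themselves are just an update call; the ids and URL here are only examples:

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' \
  --data-binary '<delete><id>12345</id><id>12346</id></delete>'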

Thanks 
Robi


-Original Message-
From: Ali, Saqib [mailto:docbook@gmail.com] 
Sent: Wednesday, August 21, 2013 2:34 PM
To: solr-user@lucene.apache.org
Subject: Re: removing duplicates

Thanks Aloke and Robert. Can you please give me code/query snippets?
(newbie here)


On Wed, Aug 21, 2013 at 2:31 PM, Aloke Ghoshal alghos...@gmail.com wrote:

 Hi,

 Facet by one of the duplicate fields (probably by the numeric field 
 that you mentioned) and set facet.mincount=2.

 Regards,
 Aloke


 On Thu, Aug 22, 2013 at 2:44 AM, Ali, Saqib docbook@gmail.com wrote:

  hello,
 
  We have documents that are duplicates i.e. the ID is different, but 
  rest
 of
  the fields are same. Is there a query that can remove duplicate, and 
  just leave one copy of the document on solr? There is one numeric 
  field that
 we
  can key off for find duplicates.
 
  Please advise.
 
  Thanks
 




RE: uniqueKey: string vs. long integer

2013-08-01 Thread Petersen, Robert
Hi guys,

We have used an integer as our unique key since solr 1.3 with no problems at 
all.  We never thought of using anything else because our solr unique key is 
based upon our product sku database field, which is defined as an integer also. 
  We're on solr 3.6.1 currently.

Thanks
Robi

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Thursday, August 01, 2013 9:27 AM
To: solr-user@lucene.apache.org
Subject: Re: uniqueKey: string vs. long integer

Although I cringe at the thought of anybody using anything other than a string 
for the unique key for a document, I can't point to any part of Solr that will 
absolutely fail. I wouldn't be surprised if there weren't a few nooks and 
crannies in Solr that might depend on the type of the ID, or at least depend on 
it being able to converted to and from string. I'm not sure if SolrCloud has 
any dependence on the document ID field type.

Could you inquire as to why this third party chose to go with a non-string 
document key? Just curious if they perceived some advantage. I mean, is the key 
used in numeric calculations? Can it be negative? Is it ever sorted?

But as a Solr best practice, I'd advise against it.

-- Jack Krupansky

-Original Message-
From: Ali, Saqib
Sent: Thursday, August 01, 2013 12:02 PM
To: solr-user@lucene.apache.org
Subject: uniqueKey: string vs. long integer

We have an application that was developed by a third party. It uses uniqueKey 
that is a long integer instead of a string. Will there be any repercussions of 
using a long integer instead of string for the uniqueKey?

Thanks! :) 





RE: replication getting stuck on a file

2013-08-01 Thread Petersen, Robert
I have seen this happen before in our 3.6.1 deployment.  It seemed related to 
high JVM memory consumption on the server when our index got too big (ie we 
were close to getting OOMs).   That is probably why restarting solr sort of 
fixes it, assuming the file it is stuck on is the final file and it got 100% of 
it.

Thanks
Robi

-Original Message-
From: Rohit Harchandani [mailto:rhar...@gmail.com] 
Sent: Thursday, August 01, 2013 1:55 PM
To: solr-user@lucene.apache.org
Subject: Re: replication getting stuck on a file

I am facing this problem in solr 4.0 too. Its definitely not related to 
autowarming. It just gets stuck while downloading a file and there is no way to 
abort the replication except restarting solr.


On Wed, Jul 10, 2013 at 6:10 PM, adityab aditya_ba...@yahoo.com wrote:

 I have seen this in 4.2.1 too.
 Once replication is finished, on the Admin UI we see 100%, and the time and 
 dlspeed information goes out of whack. The same is reflected in the mbeans. But 
 what's actually happening in the background is auto-warmup of caches 
 (in my case). Maybe some minor stats bug.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/replication-getting-stuck-on-a-file
 -tp4076707p4077112.html Sent from the Solr - User mailing list archive 
 at Nabble.com.




RE: Alternative searches

2013-07-31 Thread Petersen, Robert
Hi Mark

Yes, it is something we implemented also.  We just try various subsets of the 
search terms when there are zero results.  To increase performance for all 
these searches we return only the first three results and no facets so we can 
simply display the result counts for the various subsets of the original search 
terms.  We only do this if the first search had zero results and then a double 
metaphone search (which is how we handle misspelled terms) also returned 
nothing.  We also apply various heuristics to the alternative searches being 
performed, like no one-word searches if the original search had many words, etc.
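
As a rough illustration of what those fallback queries look like for Mark's example (rows=3 and facet=false mirror what is described above; the counts come from his example):

q=red dump truck&rows=3&facet=false   -> numFound: 0
q=red truck&rows=3&facet=false        -> numFound: 500
q=dump truck&rows=3&facet=false       -> numFound: 350

Only numFound and the first few results are kept from each, so the extra round trips stay cheap.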

Thanks
Robi

-Original Message-
From: Mark [mailto:static.void@gmail.com] 
Sent: Wednesday, July 31, 2013 10:35 AM
To: solr-user@lucene.apache.org
Subject: Alternative searches

Can someone explain how one would go about providing alternative searches for a 
query... similar to Amazon.

For example say I search for Red Dump Truck

- 0 results for Red Dump Truck
- 500 results for  Red Truck
- 350 results for Dump Truck

Does this require multiple searches? 

Thanks



RE: expunging deletes

2013-07-12 Thread Petersen, Robert
OK Thanks Shawn,

 I went with this because 10 wasn't working for us and it looks like my index 
is staying under 20 GB now with numDocs : 16897524 and maxDoc : 19048053

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">5</int>
  <int name="segmentsPerTier">5</int>
  <int name="maxMergeAtOnceExplicit">15</int>
  <double name="maxMergedSegmentMB">6144.0</double>
  <double name="reclaimDeletesWeight">6.0</double>
</mergePolicy>



-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Wednesday, July 10, 2013 5:34 PM
To: solr-user@lucene.apache.org
Subject: Re: expunging deletes

On 7/10/2013 5:58 PM, Petersen, Robert wrote:
 Using solr 3.6.1 and the following settings, I am trying to run without 
 optimizes.  I used to optimize nightly, but sometimes the optimize took a 
 very long time to complete and slowed down our indexing.  We are continuously 
 indexing our new or changed data all day and night.  After a few days running 
 without an optimize, the index size has nearly doubled and maxdocs is nearly 
 twice the size of numdocs.  I understand deletes should be expunged on 
 merges, but even after trying lots of different settings for our merge policy 
 it seems this growth is somewhat unbounded.  I have tried sending an optimize 
 with numSegments = 2 which is a lot lighter weight then a regular optimize 
 and that does bring the number down but not by too much.  Does anyone have 
 any ideas for better settings for my merge policy that would help?  Here is 
 my current index snapshot too:

Your merge settings are the equivalent of the old mergeFactor set to 35, and 
based on the fact that you have the Explicit set to 105, I'm guessing your 
settings originally came from something I posted - these are the numbers that I 
use.  These settings can result in a very large number of segments on your disk.

Because you index a lot (and probably reindex existing documents often), I can 
understand why you have high merge settings, but if you want to eliminate 
optimizes, you'll need to go lower.  The default merge setting of 10 (with an 
Explicit value of 30) is probably a good starting point, but you might need to 
go even smaller.

On Solr 3.6, an optimize probably cannot take place at the same time as index 
updates -- the optimize would probably delay updates until after it's finished. 
 I remember running into problems on Solr 3.x, so I set up my indexing program 
to stop updates while the index was optimizing.

Solr 4.x should lift any restriction where optimizes and updates can't happen 
at the same time.

With an index size of 25GB, a six-drive RAID10 should be able to optimize in 
10-15 minutes, but if your I/O system is single disk, RAID1, RAID5, or RAID6, 
the write performance may cause this to take longer.
If you went with SSD, optimizes would happen VERY fast.

Thanks,
Shawn





expunging deletes

2013-07-10 Thread Petersen, Robert
Hi guys,

Using solr 3.6.1 and the following settings, I am trying to run without 
optimizes.  I used to optimize nightly, but sometimes the optimize took a very 
long time to complete and slowed down our indexing.  We are continuously 
indexing our new or changed data all day and night.  After a few days running 
without an optimize, the index size has nearly doubled and maxdocs is nearly 
twice the size of numdocs.  I understand deletes should be expunged on merges, 
but even after trying lots of different settings for our merge policy it seems 
this growth is somewhat unbounded.  I have tried sending an optimize with 
numSegments = 2 which is a lot lighter weight then a regular optimize and that 
does bring the number down but not by too much.  Does anyone have any ideas for 
better settings for my merge policy that would help?  Here is my current index 
snapshot too:

Location: /var/LucidWorks/lucidworks/solr/1/data/index
Size: 25.05 GB  (when the index is optimized it is around 15.5 GB)
searcherName : Searcher@6c3a3517 main 
caching : true 
numDocs : 16852155 
maxDoc : 24512617 
reader : 
SolrIndexReader{this=6e3b4ec8,r=ReadOnlyDirectoryReader@6e3b4ec8,refCnt=1,segments=61}
 


<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">35</int>
  <int name="segmentsPerTier">35</int>
  <int name="maxMergeAtOnceExplicit">105</int>
  <double name="maxMergedSegmentMB">6144.0</double>
  <double name="reclaimDeletesWeight">8.0</double>
</mergePolicy>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">20</int>
  <int name="maxThreadCount">3</int>
</mergeScheduler>

Thanks,

Robert (Robi) Petersen
Senior Software Engineer
Search Department


   (formerly Buy.com)
85 enterprise, suite 100
aliso viejo, ca 92656
tel 949.389.2000 x5465
fax 949.448.5415


  





replication getting stuck on a file

2013-07-09 Thread Petersen, Robert
Hi 

My solr 3.6.1 slave farm is suddenly getting stuck during replication.  It 
seems to stop on a random file on various slaves (not all) and not continue.  
I've tried stopping and restarting tomcat etc. but some slaves just can't get the 
index pulled down.  Note there is plenty of space on the hard drive.  I don't 
get it.  Everything else seems fine.  Does this ring a bell for anyone?  I have 
the slaves set for five minute polling intervals.

Here is what I see in admin page, it just stays on that one file and won't get 
past it while the speed steadily averages down to 0kbs:

Master   http://ssbuyma01:8983/solr/1/replication
Latest Index Version:null, Generation: null
Replicatable Index Version:1276893670111, Generation: 127205
Poll Interval00:05:00
Local Index  Index Version: 1276893670084, Generation: 127202
Location: /var/LucidWorks/lucidworks/solr/1/data/index
Size: 23.06 GB
Times Replicated Since Startup: 48903
Previous Replication Done At: Tue Jul 09 12:55:01 EDT 2013
Config Files Replicated At: null
Config Files Replicated: null
Times Config Files Replicated Since Startup: null
Next Replication Cycle At: Tue Jul 09 13:00:00 EDT 2013
Current Replication Status   Start Time: Tue Jul 09 12:55:00 EDT 2013
Files Downloaded: 59 / 486
Downloaded: 88.73 MB / 23.06 GB [0.0%]
Downloading File: _34mt.fnm, Downloaded: 1.35 MB / 1.35 MB [100.0%]
Time Elapsed: 691s, Estimated Time Remaining: 183204s, Speed: 131.49 KB/s


Robert (Robi) Petersen
Senior Software Engineer
Search Department

 


  




RE: replication getting stuck on a file

2013-07-09 Thread Petersen, Robert
Look at the speed and time remaining on this one, pretty funny:


Master   http://ssbuyma01:8983/solr/1/replication
Latest Index Version:null, Generation: null
Replicatable Index Version:1276893670202, Generation: 127213
Poll Interval00:05:00
Local Index  Index Version: 1276893670108, Generation: 127204
Location: /var/LucidWorks/lucidworks/solr/1/data/index
Size: 23.13 GB
Times Replicated Since Startup: 48874
Previous Replication Done At: Tue Jul 09 13:12:05 PDT 2013
Config Files Replicated At: null
Config Files Replicated: null
Times Config Files Replicated Since Startup: null
Next Replication Cycle At: Tue Jul 09 13:17:04 PDT 2013
Current Replication Status   Start Time: Tue Jul 09 13:12:04 PDT 2013
Files Downloaded: 10 / 538
Downloaded: 1.67 MB / 23.13 GB [0.0%]
Downloading File: _34n2.prx, Downloaded: 140 bytes / 140 bytes [100.0%]
Time Elapsed: 6203s, Estimated Time Remaining: 88091277s, Speed: 281 bytes/s


-Original Message-
From: Petersen, Robert [mailto:robert.peter...@mail.rakuten.com] 
Sent: Tuesday, July 09, 2013 1:22 PM
To: solr-user@lucene.apache.org
Subject: replication getting stuck on a file

Hi 

My solr 3.6.1 slave farm is suddenly getting stuck during replication.  It 
seems to stop on a random file on various slaves (not all) and not continue.  
I've tried stopping and restarting tomcat etc. but some slaves just can't get the 
index pulled down.  Note there is plenty of space on the hard drive.  I don't 
get it.  Everything else seems fine.  Does this ring a bell for anyone?  I have 
the slaves set for five minute polling intervals.

Here is what I see in admin page, it just stays on that one file and won't get 
past it while the speed steadily averages down to 0kbs:

Master   http://ssbuyma01:8983/solr/1/replication
Latest Index Version:null, Generation: null Replicatable Index 
Version:1276893670111, Generation: 127205
Poll Interval00:05:00
Local Index  Index Version: 1276893670084, Generation: 127202
Location: /var/LucidWorks/lucidworks/solr/1/data/index
Size: 23.06 GB
Times Replicated Since Startup: 48903
Previous Replication Done At: Tue Jul 09 12:55:01 EDT 2013 Config Files 
Replicated At: null Config Files Replicated: null Times Config Files Replicated 
Since Startup: null Next Replication Cycle At: Tue Jul 09 13:00:00 EDT 2013
Current Replication Status   Start Time: Tue Jul 09 12:55:00 EDT 2013
Files Downloaded: 59 / 486
Downloaded: 88.73 MB / 23.06 GB [0.0%]
Downloading File: _34mt.fnm, Downloaded: 1.35 MB / 1.35 MB [100.0%] Time 
Elapsed: 691s, Estimated Time Remaining: 183204s, Speed: 131.49 KB/s


Robert (Robi) Petersen
Senior Software Engineer
Search Department

 


  






RE: Informal poll on running Solr 4 on Java 7 with G1GC

2013-06-20 Thread Petersen, Robert
I've been trying it out on solr 3.6.1 with a 32GB heap and G1GC seems to be 
more prone to OOMEs than CMS.  I have been running it on one slave box in our 
farm and the rest of the slaves are still on CMS and three times now it has 
gone OOM on me whereas the rest of our slaves kept chugging along with no 
errors.  I even went from no other tuning params to using the ones suggested on 
Shawn's wiki page here, and that didn't help either; still got some OOMs.  I'm 
giving it a 'fail' pretty soon here.   

-XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError
-XX:+OptimizeStringConcat -XX:+UseFastAccessorMethods
-XX:+UseG1GC -XX:+UseStringCache -XX:-UseSplitVerifier
-XX:MaxGCPauseMillis=50

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Thanks
Robi

-Original Message-
From: Timothy Potter [mailto:thelabd...@gmail.com] 
Sent: Thursday, June 20, 2013 9:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Informal poll on running Solr 4 on Java 7 with G1GC

Awesome info, thanks Shawn! I'll post back my results with G1 after we've had 
some time to analyze it in production.

On Thu, Jun 20, 2013 at 11:01 AM, Shawn Heisey s...@elyograg.org wrote:
 On 6/20/2013 8:02 AM, John Nielsen wrote:

 We used to use G1, but recently went back to CMS.

 G1 gave us too-long stop-the-world events. CMS uses more resources 
 for the same work, but it is more predictable and we get better 
 worst-case performance out of it.


 This is exactly the behavior I saw.  When you take a look at the 
 overall stats and the memory graph over time, G1 looks way better. 
 Unfortunately GC with any collector does sometimes get bad, and when 
 that happens, un-tuned
 G1 is a little worse than un-tuned CMS.  Perhaps if G1 were tuned, it 
 would be really good, but I haven't been able to find any information 
 on how to tune G1.

 jHiccup or gclogviewer can give you really good insight into how your 
 GC is doing in both average and worst-case scenarios.  jHiccup is a 
 wrapper for your program and gclogviewer draws graphs from GC logs.  
 I'm not sure whether gclogviewer works with G1 logs or not, but I know 
 that jHiccup will work with G1.

 http://www.azulsystems.com/downloads/jHiccup
 http://code.google.com/p/gclogviewer/downloads/list
 http://code.google.com/p/gclogviewer/source/checkout
 http://code.google.com/p/gclogviewer/issues/detail?id=7

 Thanks,
 Shawn





RE: yet another optimize question

2013-06-19 Thread Petersen, Robert
Hi Walter,

I used to have larger settings on our caches but it seemed like I had to make 
the caches that small to reduce memory usage to keep from getting the dreaded 
OOM exceptions.  Also our search is behind Akamai with a one hour TTL.  Our 
slave farm has a load balancer in front of twelve slave servers and our index 
is being updated constantly, pretty much 24/7.  
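
For anyone tuning the same trade-off, these are the solrconfig.xml entries in question; the sizes below are only placeholders, not a recommendation:

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
<documentCache class="solr.LRUCache" size="4096" initialSize="1024"/>

Bigger sizes raise the hit rates described below, at the cost of heap headroom.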

So my question would be how do you run with such big caches without going into 
the OOM zone?  Was the Netflix index only updated based upon the release 
schedules of the studios, like once a week?  Our entertainment stores used to 
be like that before we turned into a marketplace based e-tailer, but now we get 
new listings from merchants all the time and so have a constant churn of 
additions and deletions in our index.

I feel like at 32GB our heap is really huge, but we seem to use almost all of 
it with these settings.   I am trying out the G1GC on one slave to see if that 
gets memory usage lower but while it has a different collection pattern in the 
various spaces it seems like the total memory usage peaks out at about the same 
level.

Thanks
Robi

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, June 18, 2013 6:57 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Your query cache is far too small. Most of the default caches are too small.

We run with 10K entries and get a hit rate around 0.30 across four servers. 
This rate goes up with more queries, down with less, but try a bigger cache, 
especially if you are updating the index infrequently, like once per day.

At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP cache 
in front of it. The HTTP cache had an 80% hit rate.

I'd increase your document cache, too. I usually see about 0.75 or better on 
that.

wunder

On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:

 Hi Otis, 
 
 Yes the query results cache is just about worthless.   I guess we have too 
 diverse of a set of user queries.  The business unit has decided to let bots 
 crawl our search pages too so that doesn't help either.  I turned it way down 
 but decided to keep it because my understanding was that it would still help 
 for users going from page 1 to page 2 in a search.  Is that true?
 
 Thanks
 Robi
 
 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
 Sent: Monday, June 17, 2013 6:39 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question
 
 Hi Robi,
 
 This goes against the original problem of getting OOMEs, but it looks like 
 each of your Solr caches could be a little bigger if you want to eliminate 
 evictions, with the query results one possibly not being worth keeping if you 
 can't get the hit % up enough.
 
 Otis
 --
 Solr  ElasticSearch Support -- http://sematext.com/
 
 
 On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi Otis,
 
 Right I didn't restart the JVMs except on the one slave where I was 
 experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
 made all our caches small enough to keep us from getting OOMs while still 
 having a good hit rate.Our index has about 50 fields which are mostly 
 int IDs and there are some dynamic fields also.  These dynamic fields can be 
 used for custom faceting.  We have some standard facets we always facet on 
 and other dynamic facets which are only used if the query is filtering on a 
 particular category.  There are hundreds of these fields but since they are 
 only for a small subset of the overall index they are very sparsely 
 populated with regard to the overall index.  With CMS GC we get a sawtooth 
 on the old generation (I guess every replication and commit causes it's 
 usage to drop down to 10GB or so) and it seems to be the old generation 
 which is the main space consumer.  With the G1GC, the memory map looked 
 totally different!  I was a little lost looking at memory consumption with 
 that GC.  Maybe I'll try it again now that the index is a bit smaller than 
 it was last time I tried it.  After four days without running an optimize 
 now it is 21GB.  BTW our indexing speed is mostly bound by the DB so 
 reducing the segments might be ok...
 
 Here is a quick snapshot of one slave's memory map as reported by PSI-Probe, 
 but unfortunately I guess I can't send the history graphics to the solr-user 
 list to show their changes over time:
 Name                 Used       Committed   Max         Initial     Group
 Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
 CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
 Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
 CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB

RE: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Petersen, Robert
OK thanks, will do.  Just out of curiosity, what would having that set way too 
high do?  Would the index become fragmented or what?

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, June 19, 2013 9:33 AM
To: solr-user@lucene.apache.org
Subject: Re: TieredMergePolicy reclaimDeletesWeight

The default is 2.0, and higher values will more strongly favor merging segments 
with deletes.

I think 20.0 is likely way too high ... maybe try 3-5?
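
For reference, a sketch of what the adjusted policy from this thread might look
like with a value in that suggested range (3.0 here is just a starting point to
experiment with, not a recommendation):

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <double name="reclaimDeletesWeight">3.0</double>
</mergePolicy>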


Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:
 Hi

 In continuing a previous conversation, I am attempting to not have to 
 do optimizes on our continuously updated index in solr3.6.1 and I came 
 across the mention of the reclaimDeletesWeight setting in this blog: 
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

 We do a *lot* of deletes in our index so I want to make the merges be more 
 aggressive on reclaiming deletes, but I am having trouble finding much out 
 about this setting.  Does anyone have experience with this setting?  Would 
 the below accomplish what I want ie for it to go after deletes more 
 aggressively than normal?  I got the impression 10.0 was the default from 
 looking at this code but I could be wrong:
 https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">20</int>
   <int name="segmentsPerTier">8</int>
   <double name="reclaimDeletesWeight">20.0</double>
 </mergePolicy>

 Thanks

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department





RE: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Petersen, Robert
Oh!  Thanks for the info.  I'll change that right away.

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, June 19, 2013 10:42 AM
To: solr-user@lucene.apache.org
Subject: Re: TieredMergePolicy reclaimDeletesWeight

Way too high would cause it to pick highly lopsided merges just because a few 
deletes were removed.

Highly lopsided merges (e.g. one big segment and N tiny segments) can be 
horrible because it can lead to O(N^2) merge cost over time.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 19, 2013 at 1:36 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:
 OK thanks, will do.  Just out of curiosity, what would having that set way 
 too high do?  Would the index become fragmented or what?

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Wednesday, June 19, 2013 9:33 AM
 To: solr-user@lucene.apache.org
 Subject: Re: TieredMergePolicy reclaimDeletesWeight

 The default is 2.0, and higher values will more strongly favor merging 
 segments with deletes.

 I think 20.0 is likely way too high ... maybe try 3-5?


 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi

 In continuing a previous conversation, I am attempting to not have to 
 do optimizes on our continuously updated index in solr3.6.1 and I 
 came across the mention of the reclaimDeletesWeight setting in this blog:
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

 We do a *lot* of deletes in our index so I want to make the merges be more 
 aggressive on reclaiming deletes, but I am having trouble finding much out 
 about this setting.  Does anyone have experience with this setting?  Would 
 the below accomplish what I want ie for it to go after deletes more 
 aggressively than normal?  I got the impression 10.0 was the default from 
 looking at this code but I could be wrong:
 https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">20</int>
   <int name="segmentsPerTier">8</int>
   <double name="reclaimDeletesWeight">20.0</double>
 </mergePolicy>

 Thanks

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department







RE: yet another optimize question

2013-06-19 Thread Petersen, Robert
We actually have hundreds of facet-able fields, but most are specialized and 
are only faceted upon if the user has drilled into the particular category to 
which they are applicable and so they are only indexed for products in those 
categories.  I guess it is the facets that eat up so much of our memory.  It 
was suggested that if I use facet method = enum for those particular 
specialized facets then my memory usage would go down.  I'm going to try that 
out and see how much it helps.
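
For example, the method can be overridden per field at query time with the
f.<fieldname>.facet.method parameter; the field name below is just a
placeholder for one of our sparse dynamic facet fields:

  facet=true
  facet.field=attr_scan_speed_s
  f.attr_scan_speed_s.facet.method=enum

The same override can also go into the request handler defaults in
solrconfig.xml so it doesn't have to be sent with every request.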

Thanks
Robi

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, June 19, 2013 10:50 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

I generally run with an 8GB heap for a system that does no faceting. 32GB does 
seem rather large, but you really should have room for bigger caches.

The Akamai cache will reduce your hit rate a lot. That is OK, because users are 
getting faster responses than they would from Solr. A 5% hit rate may be OK 
since you have that front end HTTP cache.

The Netflix index was updated daily. 

wunder

On Jun 19, 2013, at 10:36 AM, Petersen, Robert wrote:

 Hi Walter,
 
 I used to have larger settings on our caches but it seemed like I had to make 
 the caches that small to reduce memory usage to keep from getting the dreaded 
 OOM exceptions.  Also our search is behind Akamai with a one hour TTL.  Our 
 slave farm has a load balancer in front of twelve slave servers and our index 
 is being updated constantly, pretty much 24/7.  
 
 So my question would be how do you run with such big caches without going 
 into the OOM zone?  Was the Netflix index only updated based upon the release 
 schedules of the studios, like once a week?  Our entertainment stores used to 
 be like that before we turned into a marketplace based e-tailer, but now we 
 get new listings from merchants all the time and so have a constant churn of 
 additions and deletions in our index.
 
 I feel like at 32GB our heap is really huge, but we seem to use almost all of 
 it with these settings.   I am trying out the G1GC on one slave to see if 
 that gets memory usage lower but while it has a different collection pattern 
 in the various spaces it seems like the total memory usage peaks out at about 
 the same level.
 
 Thanks
 Robi
 
 -Original Message-
 From: Walter Underwood [mailto:wun...@wunderwood.org] 
 Sent: Tuesday, June 18, 2013 6:57 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question
 
 Your query cache is far too small. Most of the default caches are too small.
 
 We run with 10K entries and get a hit rate around 0.30 across four servers. 
 This rate goes up with more queries, down with less, but try a bigger cache, 
 especially if you are updating the index infrequently, like once per day.
 
 At Netflix, we had a 0.12 hit rate on the query cache, even with an HTTP 
 cache in front of it. The HTTP cache had an 80% hit rate.
 
 I'd increase your document cache, too. I usually see about 0.75 or better on 
 that.
 
 wunder
 
 On Jun 18, 2013, at 10:22 AM, Petersen, Robert wrote:
 
 Hi Otis, 
 
 Yes the query results cache is just about worthless.   I guess we have too 
 diverse of a set of user queries.  The business unit has decided to let bots 
 crawl our search pages too so that doesn't help either.  I turned it way 
 down but decided to keep it because my understanding was that it would still 
 help for users going from page 1 to page 2 in a search.  Is that true?
 
 Thanks
 Robi
 
 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
 Sent: Monday, June 17, 2013 6:39 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question
 
 Hi Robi,
 
 This goes against the original problem of getting OOMEs, but it looks like 
 each of your Solr caches could be a little bigger if you want to eliminate 
 evictions, with the query results one possibly not being worth keeping if 
 you can't get the hit % up enough.
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 
 
 On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi Otis,
 
 Right I didn't restart the JVMs except on the one slave where I was 
 experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
 made all our caches small enough to keep us from getting OOMs while still 
 having a good hit rate.Our index has about 50 fields which are mostly 
 int IDs and there are some dynamic fields also.  These dynamic fields can 
 be used for custom faceting.  We have some standard facets we always facet 
 on and other dynamic facets which are only used if the query is filtering 
 on a particular category.  There are hundreds of these fields but since 
 they are only for a small subset of the overall index they are very 
 sparsely populated with regard to the overall index.  With CMS GC we get a 
 sawtooth on the old generation (I guess every

RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
Hi Otis, 

Yes the query results cache is just about worthless.   I guess we have too 
diverse of a set of user queries.  The business unit has decided to let bots 
crawl our search pages too so that doesn't help either.  I turned it way down 
but decided to keep it because my understanding was that it would still help 
for users going from page 1 to page 2 in a search.  Is that true?

Thanks
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Monday, June 17, 2013 6:39 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Hi Robi,

This goes against the original problem of getting OOMEs, but it looks like each 
of your Solr caches could be a little bigger if you want to eliminate 
evictions, with the query results one possibly not being worth keeping if you 
can't get the hit % up enough.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Mon, Jun 17, 2013 at 2:21 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:
 Hi Otis,

 Right I didn't restart the JVMs except on the one slave where I was 
 experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I 
 made all our caches small enough to keep us from getting OOMs while still 
 having a good hit rate.Our index has about 50 fields which are mostly int 
 IDs and there are some dynamic fields also.  These dynamic fields can be used 
 for custom faceting.  We have some standard facets we always facet on and 
 other dynamic facets which are only used if the query is filtering on a 
 particular category.  There are hundreds of these fields but since they are 
 only for a small subset of the overall index they are very sparsely populated 
 with regard to the overall index.  With CMS GC we get a sawtooth on the old 
 generation (I guess every replication and commit causes it's usage to drop 
 down to 10GB or so) and it seems to be the old generation which is the main 
 space consumer.  With the G1GC, the memory map looked totally different!  I 
 was a little lost looking at memory consumption with that GC.  Maybe I'll try 
 it again now that the index is a bit smaller than it was last time I tried 
 it.  After four days without running an optimize now it is 21GB.  BTW our 
 indexing speed is mostly bound by the DB so reducing the segments might be 
 ok...

 Here is a quick snapshot of one slave's memory map as reported by PSI-Probe, 
 but unfortunately I guess I can't send the history graphics to the solr-user 
 list to show their changes over time:
 Name                 Used       Committed   Max         Initial     Group
 Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
 CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
 Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
 CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
 Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
 Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL

 And here's our current cache stats from a random slave:

 name:queryResultCache
 class:   org.apache.solr.search.LRUCache
 version: 1.0
 description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6, 
 regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
 stats:  lookups : 619
 hits : 36
 hitratio : 0.05
 inserts : 592
 evictions : 101
 size : 488
 warmupTime : 2949
 cumulative_lookups : 681225
 cumulative_hits : 73126
 cumulative_hitratio : 0.10
 cumulative_inserts : 602396
 cumulative_evictions : 428868


  name:   fieldCache
 class:   org.apache.solr.search.SolrFieldCacheMBean
 version: 1.0
 description: Provides introspection of the Lucene FieldCache, this is 
 **NOT** a cache that is managed by Solr.
 stats:  entries_count : 359


 name:documentCache
 class:   org.apache.solr.search.LRUCache
 version: 1.0
 description: LRU Cache(maxSize=2048, initialSize=512, autowarmCount=10, 
 regenerator=null)
 stats:  lookups : 12710
 hits : 7160
 hitratio : 0.56
 inserts : 5636
 evictions : 3588
 size : 2048
 warmupTime : 0
 cumulative_lookups : 10590054
 cumulative_hits : 6166913
 cumulative_hitratio : 0.58
 cumulative_inserts : 4423141
 cumulative_evictions : 3714653


 name:fieldValueCache
 class:   org.apache.solr.search.FastLRUCache
 version: 1.0
 description: Concurrent LRU Cache(maxSize=280, initialSize=280, 
 minSize=252, acceptableSize=266, cleanupThread=false, autowarmCount=6, 
 regenerator=org.apache.solr.search.SolrIndexSearcher$1@143eb77a)
 stats:  lookups : 1725
 hits : 1481
 hitratio : 0.85
 inserts : 122
 evictions : 0
 size : 128
 warmupTime : 4426
 cumulative_lookups : 3449712
 cumulative_hits : 3281805
 cumulative_hitratio

RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
In reading the newer solrconfig in the example conf folder, it seems to be saying 
that the setting '<mergeFactor>10</mergeFactor>' is shorthand for the block below, 
and that both are the defaults.  It says 'The default since Solr/Lucene 3.3 is 
TieredMergePolicy.'  So isn't this setting already in effect for me?

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicy>

Thanks
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Monday, June 17, 2013 6:36 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Yes, in one of the example solrconfig.xml files this is right above the merge 
factor definition.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/





On Mon, Jun 17, 2013 at 8:00 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:
 Hi Upayavira,

 You might have gotten it.  Yes we noticed maxdocs was way bigger than 
 numdocs.  There were a lot of files ending in '.del' in the index folder 
 also.  We started on 1.3 also.   I don't currently have any solr config 
 settings for MergePolicy at all.  Am I going to want to put something like 
 this into my index defaults section?

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">10</int>
   <int name="segmentsPerTier">10</int>
 </mergePolicy>

 Thanks
 Robi

 -Original Message-
 From: Upayavira [mailto:u...@odoko.co.uk]
 Sent: Monday, June 17, 2013 12:29 PM
 To: solr-user@lucene.apache.org
 Subject: Re: yet another optimize question

 The key figures are numdocs vs maxdocs. Maxdocs-numdocs is the number of 
 deleted docs in your index.

 This is a 3.6 system you say. But has it been upgraded? I've seen folks 
 who've upgraded from 1.4 or 3.0/3.1 over time, keeping the old config.
 The consequence of this is that they don't get the right config for the 
 TieredMergePolicy, and therefore don't get to use it, seeing the old 
 behaviour which does require periodic optimise.

 Upayavira

 On Mon, Jun 17, 2013, at 07:21 PM, Petersen, Robert wrote:
 Hi Otis,

 Right I didn't restart the JVMs except on the one slave where I was
 experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I
 made all our caches small enough to keep us from getting OOMs while still
 having a good hit rate.Our index has about 50 fields which are mostly
 int IDs and there are some dynamic fields also.  These dynamic fields 
 can be used for custom faceting.  We have some standard facets we 
 always facet on and other dynamic facets which are only used if the 
 query is filtering on a particular category.  There are hundreds of 
 these fields but since they are only for a small subset of the 
 overall index they are very sparsely populated with regard to the 
 overall index.  With CMS GC we get a sawtooth on the old generation 
 (I guess every replication and commit causes its usage to drop down 
 to 10GB or
 so) and it seems to be the old generation which is the main space 
 consumer.  With the G1GC, the memory map looked totally different!  I 
 was a little lost looking at memory consumption with that GC.  Maybe 
 I'll try it again now that the index is a bit smaller than it was 
 last time I tried it.  After four days without running an optimize 
 now it is 21GB.  BTW our indexing speed is mostly bound by the DB so 
 reducing the segments might be ok...

 Here is a quick snapshot of one slave's memory map as reported by 
 PSI-Probe, but unfortunately I guess I can't send the history 
 graphics to the solr-user list to show their changes over time:
 Name                 Used       Committed   Max         Initial     Group
 Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
 CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
 Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
 CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
 Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
 Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL

 And here's our current cache stats from a random slave:

 name:queryResultCache
 class:   org.apache.solr.search.LRUCache
 version: 1.0
 description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6,
 regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
 stats:  lookups : 619
 hits : 36
 hitratio : 0.05
 inserts : 592
 evictions : 101
 size : 488
 warmupTime : 2949
 cumulative_lookups : 681225
 cumulative_hits : 73126
 cumulative_hitratio : 0.10
 cumulative_inserts : 602396
 cumulative_evictions : 428868


  name:fieldCache
 class:   org.apache.solr.search.SolrFieldCacheMBean

RE: yet another optimize question

2013-06-18 Thread Petersen, Robert
Hi Andre,

Wow that is astonishing!  I will definitely also try that out!  Just set the 
facet method on a per field basis for the less used sparse facet fields eh?  
Thanks for the tip.

Thanks
Robi

-Original Message-
From: Andre Bois-Crettez [mailto:andre.b...@kelkoo.com] 
Sent: Tuesday, June 18, 2013 3:03 AM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

Recently we had steadily increasing memory usage and OOM due to facets on 
dynamic fields.
The default facet.method=fc needs to build a large array of maxDocs ints for 
each field (a fieldCache or fieldValueCache entry), whether it is sparsely 
populated or not.

Once you have reduced your number of maxDocs with the merge policy, it can be 
interesting to try facet.method=enum for all the sparsely populated dynamic 
fields.
Despite what is said in the wiki, in our case the performance was similar to 
facet.method=fc, however the JVM heap usage went down from about 20GB to 4GB.

André

On 06/17/2013 08:21 PM, Petersen, Robert wrote:
 Also some time ago I made all our caches small enough to keep us from getting 
 OOMs while still having a good hit rate.Our index has about 50 fields 
 which are mostly int IDs and there are some dynamic fields also.  These 
 dynamic fields can be used for custom faceting.  We have some standard facets 
 we always facet on and other dynamic facets which are only used if the query 
 is filtering on a particular category.  There are hundreds of these fields 
 but since they are only for a small subset of the overall index they are very 
 sparsely populated with regard to the overall index.
--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.



TieredMergePolicy reclaimDeletesWeight

2013-06-18 Thread Petersen, Robert
Hi

In continuing a previous conversation, I am attempting to not have to do 
optimizes on our continuously updated index in solr3.6.1 and I came across the 
mention of the reclaimDeletesWeight setting in this blog: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

We do a *lot* of deletes in our index so I want to make the merges be more 
aggressive on reclaiming deletes, but I am having trouble finding much out 
about this setting.  Does anyone have experience with this setting?  Would the 
below accomplish what I want ie for it to go after deletes more aggressively 
than normal?  I got the impression 10.0 was the default from looking at this 
code but I could be wrong:
https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085

mergePolicy class=org.apache.lucene.index.TieredMergePolicy
  int name=maxMergeAtOnce20/int
  int name=segmentsPerTier8/int
  double name=reclaimDeletesWeight20.0/double
/mergePolicy

Thanks

Robert (Robi) Petersen
Senior Software Engineer
Search Department



RE: yet another optimize question

2013-06-17 Thread Petersen, Robert
. 
Lucene will merge segments itself. Lower mergeFactor will force it to do it 
more often (it means slower indexing, bigger IO hit when segments are merged, 
more per-segment data that Lucene/Solr need to read from the segment for 
faceting and such, etc.) so maybe you shouldn't mess with that.  Do you know 
what your caches are like in terms of size, hit %, evictions?  We've recently 
seen people set those to a few hundred K or even higher, which can eat a lot of 
heap.  We have had luck with G1 recently, too.
Maybe you can run jstat and see which of the memory pools get filled up and 
change/increase appropriate JVM param based on that?  How many fields do you 
index, facet, or group on?
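
For example, something like the following (the pid is a placeholder for the
Solr JVM's process id) prints heap-space utilization percentages and GC counts
every five seconds:

  jstat -gcutil <solr-jvm-pid> 5000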

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Solr & ElasticSearch Support -- http://sematext.com/





On Fri, Jun 14, 2013 at 8:04 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:
 Hi guys,

 We're on solr 3.6.1 and I've read the discussions about whether to optimize 
 or not to optimize.  I decided to try not optimizing our index as was 
 recommended.  We have a little over 15 million docs in our biggest index and 
 a 32gb heap for our jvm.  So without the optimizes the index folder seemed to 
 grow in size and quantity of files.  There seemed to be an upper limit but 
 eventually it hit 300 files consuming 26gb of space and that seemed to push 
 our slave farm over the edge and we started getting the dreaded OOMs.  We 
 have continuous indexing activity, so I stopped the indexer and manually ran 
 an optimize which made the index become 9 files consuming 15gb of space and 
 our slave farm started having acceptable memory usage.  Our merge factor is 
 10, we're on java 7.  Before optimizing, I tried on one slave machine to go 
 with the latest JVM and tried switching from the CMS GC to the G1GC but it 
 hit OOM condition even faster.  So it seems like I have to continue to 
 schedule a regular optimize.  Right now it has been a couple of days since 
 running the optimize and the index is slowly growing bigger, now up to a bit 
 over 19gb.  What do you guys think?  Did I miss something that would make us 
 able to run without doing an optimize?

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department




RE: yet another optimize question

2013-06-17 Thread Petersen, Robert
Hi Upayavira,

You might have gotten it.  Yes we noticed maxdocs was way bigger than numdocs.  
There were a lot of files ending in '.del' in the index folder also.  We 
started on 1.3 also.   I don't currently have any solr config settings for 
MergePolicy at all.  Am I going to want to put something like this into my 
index defaults section?

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">10</int>
   <int name="segmentsPerTier">10</int>
</mergePolicy>

Thanks
Robi

-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk] 
Sent: Monday, June 17, 2013 12:29 PM
To: solr-user@lucene.apache.org
Subject: Re: yet another optimize question

The key figures are numdocs vs maxdocs. Maxdocs-numdocs is the number of 
deleted docs in your index.

This is a 3.6 system you say. But has it been upgraded? I've seen folks who've 
upgraded from 1.4 or 3.0/3.1 over time, keeping the old config.
The consequence of this is that they don't get the right config for the 
TieredMergePolicy, and therefore don't get to use it, seeing the old behaviour 
which does require periodic optimise.

Upayavira

On Mon, Jun 17, 2013, at 07:21 PM, Petersen, Robert wrote:
 Hi Otis,
 
 Right I didn't restart the JVMs except on the one slave where I was
 experimenting with using G1GC on the 1.7.0_21 JRE.   Also some time ago I
 made all our caches small enough to keep us from getting OOMs while still
 having a good hit rate.Our index has about 50 fields which are mostly
 int IDs and there are some dynamic fields also.  These dynamic fields 
 can be used for custom faceting.  We have some standard facets we 
 always facet on and other dynamic facets which are only used if the 
 query is filtering on a particular category.  There are hundreds of 
 these fields but since they are only for a small subset of the overall 
 index they are very sparsely populated with regard to the overall 
 index.  With CMS GC we get a sawtooth on the old generation (I guess 
 every replication and commit causes its usage to drop down to 10GB or 
 so) and it seems to be the old generation which is the main space 
 consumer.  With the G1GC, the memory map looked totally different!  I 
 was a little lost looking at memory consumption with that GC.  Maybe 
 I'll try it again now that the index is a bit smaller than it was last 
 time I tried it.  After four days without running an optimize now it 
 is 21GB.  BTW our indexing speed is mostly bound by the DB so reducing the 
 segments might be ok...
 
 Here is a quick snapshot of one slave's memory map as reported by 
 PSI-Probe, but unfortunately I guess I can't send the history graphics 
 to the solr-user list to show their changes over time:
 Name                 Used       Committed   Max         Initial     Group
 Par Survivor Space   20.02 MB   108.13 MB   108.13 MB   108.13 MB   HEAP
 CMS Perm Gen         42.29 MB   70.66 MB    82.00 MB    20.75 MB    NON_HEAP
 Code Cache           9.73 MB    9.88 MB     48.00 MB    2.44 MB     NON_HEAP
 CMS Old Gen          20.22 GB   30.94 GB    30.94 GB    30.94 GB    HEAP
 Par Eden Space       42.20 MB   865.31 MB   865.31 MB   865.31 MB   HEAP
 Total                20.33 GB   31.97 GB    32.02 GB    31.92 GB    TOTAL
 
 And here's our current cache stats from a random slave:
 
 name:queryResultCache  
 class:   org.apache.solr.search.LRUCache  
 version: 1.0  
 description: LRU Cache(maxSize=488, initialSize=6, autowarmCount=6,
 regenerator=org.apache.solr.search.SolrIndexSearcher$3@461ff4c3)
 stats:  lookups : 619
 hits : 36
 hitratio : 0.05
 inserts : 592
 evictions : 101
 size : 488
 warmupTime : 2949
 cumulative_lookups : 681225
 cumulative_hits : 73126
 cumulative_hitratio : 0.10
 cumulative_inserts : 602396
 cumulative_evictions : 428868
 
 
  name:fieldCache  
 class:   org.apache.solr.search.SolrFieldCacheMBean  
 version: 1.0  
 description: Provides introspection of the Lucene FieldCache, this is
 **NOT** a cache that is managed by Solr.  
 stats:  entries_count : 359
 
 
 name:documentCache  
 class:   org.apache.solr.search.LRUCache  
 version: 1.0  
 description: LRU Cache(maxSize=2048, initialSize=512,
 autowarmCount=10, regenerator=null)
 stats:  lookups : 12710
 hits : 7160
 hitratio : 0.56
 inserts : 5636
 evictions : 3588
 size : 2048
 warmupTime : 0
 cumulative_lookups : 10590054
 cumulative_hits : 6166913
 cumulative_hitratio : 0.58
 cumulative_inserts : 4423141
 cumulative_evictions : 3714653
 
 
 name:fieldValueCache  
 class:   org.apache.solr.search.FastLRUCache  
 version: 1.0  
 description: Concurrent LRU Cache(maxSize=280, initialSize=280,
 minSize=252, acceptableSize=266, cleanupThread=false, autowarmCount=6,
 regenerator=org.apache.solr.search.SolrIndexSearcher$1@143eb77a)
 stats:  lookups

yet another optimize question

2013-06-14 Thread Petersen, Robert
Hi guys,

We're on solr 3.6.1 and I've read the discussions about whether to optimize or 
not to optimize.  I decided to try not optimizing our index as was recommended. 
 We have a little over 15 million docs in our biggest index and a 32gb heap for 
our jvm.  So without the optimizes the index folder seemed to grow in size and 
quantity of files.  There seemed to be an upper limit but eventually it hit 300 
files consuming 26gb of space and that seemed to push our slave farm over the 
edge and we started getting the dreaded OOMs.  We have continuous indexing 
activity, so I stopped the indexer and manually ran an optimize which made the 
index become 9 files consuming 15gb of space and our slave farm started having 
acceptable memory usage.  Our merge factor is 10, we're on java 7.  Before 
optimizing, I tried on one slave machine to go with the latest JVM and tried 
switching from the CMS GC to the G1GC but it hit OOM condition even faster.  So 
it seems like I have to continue to schedule a regular optimize.  Right now it 
has been a couple of days since running the optimize and the index is slowly 
growing bigger, now up to a bit over 19gb.  What do you guys think?  Did I miss 
something that would make us able to run without doing an optimize?
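
For context, the manual optimize mentioned above can be issued against the
update handler; a sketch, with host, port and path as placeholders for your
own setup:

  curl 'http://localhost:8983/solr/update' \
       -H 'Content-Type: text/xml' \
       --data-binary '<optimize waitSearcher="false"/>'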

Robert (Robi) Petersen
Senior Software Engineer
Search Department


RE: Is payload the right solution for my problem?

2013-05-17 Thread Petersen, Robert
Hi

It will not be double the disk space at all.  You will not need to store the 
field you search, only the field being returned needs to be stored.  
Furthermore if you are not searching the XML field you will not need to index 
that field, only store it.
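
As a sketch (the field and type names are made up, and the text type is assumed
to include solr.HTMLStripCharFilterFactory in its index analyzer), the
two-field setup might look like this in schema.xml:

  <field name="content_xml"  type="string" indexed="false" stored="true"/>
  <field name="content_text" type="text"   indexed="true"  stored="false"/>
  <copyField source="content_xml" dest="content_text"/>

That way only the XML copy is stored on disk and only the stripped copy is
indexed and searched.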

Hope that helps,
Robi

-Original Message-
From: jasimop [mailto:stricker...@gmail.com] 
Sent: Friday, May 17, 2013 12:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Is payload the right solution for my problem?

I think I just found the solution.

Would the right strategy be to store the original XML content and then use a 
solr.HTMLStripCharFilterFactory when querying? I just made a quick test and it 
works; the only problem now is that it also finds the data contained in the XML 
attribute fields.

I think I will put my data into two fields, one containing only the raw data 
without XML, and one in the original format. Then I search in the raw field but 
return the original format with the response.
The only problem I see here is that I need double the amount of disk space.
Is there a better solution?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-payload-the-right-solution-for-my-problem-tp4063814p4064117.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: Solr 3.6.1: changing a field from stored to not stored

2013-04-23 Thread Petersen, Robert
Good info, Thanks Hoss!  I was going to add a more specific fl= parameter to my 
queries at the same time.  Currently I am doing fl=*,score so that will have to 
be changed.
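
Something along these lines, with placeholder field names for whatever the
front end actually renders:

  fl=id,title,price,score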


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Tuesday, April 23, 2013 4:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.6.1: changing a field from stored to not stored


: index?  I noticed I am unnecessarily storing some fields in my index and
: I'd like to stop storing them without having to 'reindex the world' and
: let the changes just naturally percolate into my index as updates come
: in the normal course of things.  Do you guys think I could get away with
: this?

Yes, you can easily get away with this type of change w/o re-indexing, however 
you won't gain any immediate index size savings until each and every existing 
doc has been reindexed and the old copies expunged from the index via segment 
merges.

the one hiccup that can affect people when doing this is what happens if you use 
something like fl=* (and likely hl=* as well) ... many places in Solr will 
try to avoid failure if a stored field is found in the index which isn't 
defined in the schema, and treat that stored value as a string (legacy behavior 
designed to make it easier for people to point Solr at old lucene indexes built 
w/o using Solr) ... so if these stored values are not strings, you might get 
some weird data in your response for these documents.


-Hoss




RE: Solr 3.6.1: changing a field from stored to not stored

2013-04-23 Thread Petersen, Robert
Hey I just want to verify one thing before I start doing this:  function 
queries only require fields to be indexed but don't require them to be stored 
right?

-Original Message-
From: Petersen, Robert [mailto:robert.peter...@mail.rakuten.com] 
Sent: Tuesday, April 23, 2013 4:39 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr 3.6.1: changing a field from stored to not stored

Good info, Thanks Hoss!  I was going to add a more specific fl= parameter to my 
queries at the same time.  Currently I am doing fl=*,score so that will have to 
be changed.


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Tuesday, April 23, 2013 4:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.6.1: changing a field from stored to not stored


: index?  I noticed I am unnecessarily storing some fields in my index and
: I'd like to stop storing them without having to 'reindex the world' and
: let the changes just naturally percolate into my index as updates come
: in the normal course of things.  Do you guys think I could get away with
: this?

Yes, you can easily get away with this type of change w/o re-indexing, however 
you won't gain any immediate index size savings until each and every existing 
doc has been reindexed and the old copies expunged from the index via segment 
merges.

the one hiccup that can affect people when doing this is what happens if you use 
something like fl=* (and likely hl=* as well) ... many places in Solr will 
try to avoid failure if a stored field is found in the index which isn't 
defined in the schema, and treat that stored value as a string (legacy behavior 
designed to make it easier for people to point Solr at old lucene indexes built 
w/o using Solr) ... so if these stored values are not strings, you might get 
some weird data in your response for these documents.


-Hoss






RE: Really bad query performance for date range queries

2013-02-05 Thread Petersen, Robert
Hi Shawn,

I've looked at the Zing JVM before but don't use it.  jHiccup looks like a 
really useful tool.  Can you tell us how you are starting it up?  Do you start 
it wrapping the app container (ie tomcat / jetty)?

Thanks
Robi

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Tuesday, February 05, 2013 1:27 PM
To: solr-user@lucene.apache.org
Subject: Re: Really bad query performance for date range queries

On 2/5/2013 12:51 PM, sausarkar wrote:
 We have a 96GB RAM machine with 16 processors. The JVM is set to use 60 GB.
 The tests that we are running are purely queries; there is no indexing going on.
 I don't see garbage collection when I attach VisualVM but see frequent 
 CPU spikes ~once every minute.

A previous message from you indicates that your index is 12GB.  I agree with 
Erick that this is not very large.  The pauses that you have described sound a 
lot like stop-the-world garbage collection.  I've seen very long pauses on an 
8GB heap ... I don't even want to think about what could happen on 60GB.

Do you really need a 60GB heap?  My dev server handles seven index shards with 
a 7GB heap and 16GB total RAM.  On 4.1 the total index size is is over 100GB.  
On 4.2-SNAPSHOT the total index size is about 83GB. 
Query performance isn't stellar, but it works perfectly.  My production servers 
(running 3.5) have tons of RAM and each one only gets half the index, but they 
only run with the heap at 8GB.  My queries are pretty low volume and not HUGELY 
complex.  Median query time is about 26 milliseconds and 95th percentile is 
about 950 milliseconds.

Looking at the GC stats in jconsole/jvisualvm, I didn't think I had a GC pause 
problem, but I was proven wrong when I started correlating all the various logs 
in my system to load balancer DOWN incidents.  I saw a pause of 12 seconds 
once in the GC log - on an 8GB heap.

I was introduced to a very cool program that tracks any kind of pause that's 
caused by factors outside the Java program, like GC pauses in the JVM or 
something happening in the OS.  This is much easier to interpret than Java's GC 
logging, and you can get a nice graph from the data.

http://www.azulsystems.com/jHiccup

Using jHiccup, I was able to do a little bit of comparison between different 
runs.  That helped me find some GC tuning parameters that have almost gotten 
rid of my GC pause problem.  I'm constantly working on those parameters.  The 
current values are:

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:NewRatio=3
-XX:MaxTenuringThreshold=8
-XX:+CMSParallelRemarkEnabled
-XX:+ParallelRefProcEnabled
-XX:+UseLargePages
-XX:+AggressiveOpts

The Zing JVM (made by the company that created jHiccup) apparently has 
extremely low GC pause characteristics even with giant heaps like yours. 
  I'm not using it, and I don't know how much it costs.

Thanks,
Shawn





RE: Really bad query performance for date range queries

2013-02-05 Thread Petersen, Robert
Hi Shawn,

I'm running solr in Tomcat on RHEL.  It looks like what you're doing is making 
jHiccup wrap around the whole JVM by doing it that way, is that right?  That's 
pretty cool if so.  I'll see if I can set it up in my dev environment tomorrow.

Thanks,
Robi

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Tuesday, February 05, 2013 2:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Really bad query performance for date range queries

On 2/5/2013 3:19 PM, Petersen, Robert wrote:
 Hi Shawn,

 I've looked at the xing JVM before but don't use it.  jHiccup looks like a 
 really useful tool.  Can you tell us how you are starting it up?  Do you 
 start it wrapping the app container (ie tomcat / jetty)?

Instead of just calling /usr/bin/java , I use this in my init script (homegrown 
for the jetty included in Solr):

/usr/local/bin/jHiccup /usr/bin/java

The jHiccup shell script is in /usr/local/bin, and the two jar files included 
in the download are in /usr/local/bin/bin.

The shell script that's included looks like it works under cygwin, so you could 
run it on Windows as long as you've got that.  The rest of the shell script 
looks too complex to easily convert to a Windows batch file.

Thanks,
Shawn





RE: field space consumption - stored vs not stored

2013-01-31 Thread Petersen, Robert
Thanks Shawn.  Actually now that I think about it,  Yonik also mentioned 
something about lucene number representation once in reply to one of my 
questions.  Here it is:
Could you also tell me what these `&#8;&#0;&#0;&#0;&#1; strings represent in the 
debug output?

That's internally how a number is encoded into a string (5 bytes, the first 
being binary 8, the next 0, etc.)  This is not representable in XML as &#0; is 
illegal, hence we leave off the '&' so it's not a true character entity.  
-Yonik

Hey I followed your link, and it had a link to this talk.  Did you see this 
example?
http://lucene.sourceforge.net/talks/pisa/

VInt Encoding Example (table was flattened during pasting):

Value     First byte   Second byte   Third byte
0         00000000
1         00000001
2         00000010
...
127       01111111
128       10000000     00000001
129       10000001     00000001
130       10000010     00000001
...
16,383    11111111     01111111
16,384    10000000     10000000      00000001
16,385    10000001     10000000      00000001
...



-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Wednesday, January 30, 2013 5:28 PM
Cc: solr-user@lucene.apache.org
Subject: Re: field space consumption - stored vs not stored

On 1/30/2013 6:24 PM, Shawn Heisey wrote:
 If I had to guess about the extra space required for storing an int 
 field, I would say it's in the neighborhood of 20 bytes per document, 
 perhaps less.  I am also interested in a definitive answer.

The answer is very likely less than 20 bytes per doc.  I was assuming a larger 
size for VInt than it is likely to use.  See the answer for this
question:

http://stackoverflow.com/questions/2752612/what-is-the-vint-in-lucene

Thanks,
Shawn





RE: Can I start solr with replication activated but disabled between master and slave

2013-01-30 Thread Petersen, Robert
Hi Jamel,

You can start solr slaves with them pointed at a master and then turn off 
replication in the admin replication page.

Hope that helps,
-Robi

Robert (Robi) Petersen
Senior Software Engineer
Search Department

 


-Original Message-
From: Jamel ESSOUSSI [mailto:jamel.essou...@gmail.com] 
Sent: Wednesday, January 30, 2013 2:45 AM
To: solr-user@lucene.apache.org
Subject: Can I start solr with replication activated but disabled between 
master and slave

Hello,

I would like to start solr with the following configuration;

Replication between master and slave activated but not enabled.

Regards



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-I-start-solr-with-replication-activated-but-disabled-between-master-and-slave-tp4037333.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: Solr Faceting with Name Values

2013-01-29 Thread Petersen, Robert
Hi O.O

1.  Yes faceting on field function_s would return all the facet values in the 
search results with their counts.
2.  You would probably have to join the names together with a special character 
and then split them later in the UI.  
3.  I'm sure there is a way to query the index for all defined fields.  The 
admin schema browser page does this exact thing.
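
For what it's worth, the Luke request handler (enabled in the example
solrconfig) lists every concrete field in the index, including dynamic fields
that have actually been populated; the host and path below are placeholders:

  http://localhost:8983/solr/admin/luke?numTerms=0&wt=json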

Resources for further exploration:
http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/SimpleFacetParameters
http://searchhub.org/2009/09/02/faceted-search-with-solr/
http://wiki.apache.org/solr/HierarchicalFaceting
http://lucidworks.lucidimagination.com/display/solr/Faceting

Have fun!
Robi


-Original Message-
From: O. Olson [mailto:olson_...@yahoo.it] 
Sent: Monday, January 28, 2013 3:11 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Faceting with Name Values

Thank you Robi. Your idea seems good but I have a few questions: 

1.  From your description, I would create a field “Function_s” with the 
value
“Scanner” and “Function_s” with the value “Printer” for my two Products.
This seems good. Is it possible for you give me a query for this dynamic field. 
For e.g., could I do something like: 

facet=true&facet.field=Function_s

I would like this to tell me how many of the products are Scanners and how many 
of the products are Printers.

2.  Many of my Attribute Names have spaces e.g. “PC Connection”, or even
brackets and slashes e.g. “Scan Speed (ppm)”. Would there be a problem putting 
these in a dynamic field name?

3.  Is it possible to query for the possible list of dynamic fieldnames? I
might need this when creating a list of attributes.


Thanks again Robi.
O. O.

--

Petersen, Robert wrote
 Hi O.O.,
 
 You don't need to add them all into the schema.  You can use the 
 wildcard fields like <dynamicField name="*_s" type="string" indexed="true" 
 stored="true" /> to hold them.  You can then have the 
 attribute name be the part of the wildcard and the attribute value be 
 the field contents. So you could have fields like Function_s:Scanner 
 etc and then you could ask for facets which are relevant based upon 
 query or category.
 
 That would be a much more straightforward approach and much easier to 
 facet on.  Hope that helps a little bit.
 
 -Robi





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Faceting-with-Name-Values-tp4036872p4036904.html
Sent from the Solr - User mailing list archive at Nabble.com.



queryResultCache *very* low hit ratio

2013-01-29 Thread Petersen, Robert
Hi solr users,

My queryResultCache hitratio has been trending down lately and is now at 0.01%, 
0.01%, and also its warmup time was almost a minute.  I have lowered the 
count dramatically since there are no hits anyway.  I also wanted to lower my 
autowarm counts across the board because I am about to expand the warmup 
queries in my newSearcher config section.  Would I be better just turning off 
this cache completely?  I don't really want to increase its size because I've 
found that keeping my cache sizes limited keeps me from getting OOM 
exceptions across my slave farm.

Thanks,

Robert (Robi) Petersen
Senior Software Engineer
Search Department



RE: queryResultCache *very* low hit ratio

2013-01-29 Thread Petersen, Robert
Thanks Yonik,

I'm cooking up some static warming queries right now, based upon our commonly 
issued queries.  I've already been noticing occasional long running queries.  
Our web farm times out a search after twenty seconds and issues an exception.  
I see a few of these every day and am trying to combat them with better warm up 
queries.  My current static warm up queries are too simple I suspect.  They 
don't replicate any of our typically issued filter queries nor function queries.

Thanks
Robi

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, January 29, 2013 2:46 PM
To: solr-user@lucene.apache.org
Subject: Re: queryResultCache *very* low hit ratio

One other thing that some auto-warming of the query result cache can achieve is 
loading FieldCache entries for sorting / function queries so real user queries 
don't experience increased latency.  If you remove all auto-warming of the 
query result cache, you may want to add static warming entries for these fields.
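
A sketch of such a static entry; the sales and price fields and the boost
function are placeholders for whatever your real queries sort and boost on:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">{!boost b=log(sum(sales,1))}*:*</str>
      <str name="sort">price asc</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>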

-Yonik
http://lucidworks.com


On Tue, Jan 29, 2013 at 3:36 PM, Petersen, Robert rober...@buy.com wrote:
 Hi solr users,

 My queryResultCache hitratio has been trending down lately and is now at 
 0.01%, and also its warmup time was almost a minute.  I have lowered the 
 autowarm count dramatically since there are no hits anyway.  I also wanted to 
 lower my autowarm counts across the board because I am about to expand the 
 warmup queries in my newSearcher config section.  Would I be better just 
 turning off this cache completely?  I don't really want to increase its size 
 because I've found that keeping my cache sizes limited keeps me from 
 getting OOM exceptions across my slave farm.

 Thanks,

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department





RE: queryResultCache *very* low hit ratio

2013-01-29 Thread Petersen, Robert
Hi Shawn,

My Solr services power product search for a large retail web site with over 
fourteen million unique products, so I suspect the main reason for the low hit 
rate is the sheer number of unique user queries.  We're expanding our product 
count and product type categories every day as fast as we can.

Thanks!
Robi

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Tuesday, January 29, 2013 2:24 PM
To: solr-user@lucene.apache.org
Subject: Re: queryResultCache *very* low hit ratio

On 1/29/2013 1:36 PM, Petersen, Robert wrote:
 My queryResultCache hitratio has been trending down lately and is now at 
 0.01%, and also its warmup time was almost a minute.  I have lowered the 
 autowarm count dramatically since there are no hits anyway.  I also wanted to 
 lower my autowarm counts across the board because I am about to expand the 
 warmup queries in my newSearcher config section.  Would I be better just 
 turning off this cache completely?  I don't really want to increase its size 
 because I've found that keeping my cache sizes limited keeps me from 
 getting OOM exceptions across my slave farm.

A low hit ratio on this cache means quite simply that most of your queries (q 
parameter) are unique.

Often this is the result of including unique identifiers within the query text, 
or using the NOW variable in queries against a date field, because NOW changes 
every millisecond.  By using rounding (NOW/HOUR,
NOW/DAY) you can fix the latter.
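
For example, a filter like the following (field name is a placeholder) produces
the same query string all day long and so stays cacheable, instead of changing
every millisecond:

  fq=lastModified:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]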

Sometimes it's caused by an unexpected and very very active query source.  If 
your developers see your Solr service as an unlimited resource, they might 
write programs that bombard the server with unique queries.  If that's what is 
happening, you might need another copy of your solr infrastructure that's for 
internal use only.

Sometimes it's just because your users are entering a lot of unique searches, 
or not visiting multiple pages of results.

If you're not seeing any value from the cache, turning it off might be sensible 
so it doesn't use memory.
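
Concretely, that would mean either shrinking the entry in solrconfig.xml to
something token, or removing/commenting out the element entirely (as far as I
know a cache that isn't configured is simply not used); the size here is only
illustrative:

  <queryResultCache class="solr.LRUCache"
                    size="128"
                    initialSize="128"
                    autowarmCount="0"/>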

Thanks,
Shawn





RE: Solr Faceting with Name Values

2013-01-28 Thread Petersen, Robert
Hi O.O.,

You don't need to add them all into the schema.  You can use the wildcard 
fields like <dynamicField name="*_s" type="string" indexed="true" stored="true" /> 
to hold them.  You can then have the attribute name be the 
part of the wildcard and the attribute value be the field contents. So you 
could have fields like Function_s:Scanner etc and then you could ask for facets 
which are relevant based upon query or category.

That would be a much more straightforward approach and much easier to facet on. 
 Hope that helps a little bit.
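
A quick sketch using the examples from your mail (the id and name fields and the
convention of replacing spaces in attribute names with underscores are
assumptions on my part):

<add>
  <doc>
    <field name="id">1001</field>
    <field name="name">Cannon Scanner</field>
    <field name="Function_s">Scanner</field>
    <field name="PC_Connection_s">USB</field>
  </doc>
</add>

and then facet on those attributes with:

  q=*:*&facet=true&facet.field=Function_s&facet.field=PC_Connection_s&facet.mincount=1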

-Robi


-Original Message-
From: O. Olson [mailto:olson_...@yahoo.it] 
Sent: Monday, January 28, 2013 1:42 PM
To: solr-user@lucene.apache.org
Subject: Solr Faceting with Name Values

Hi,

We are looking at putting our Product Catalog into Solr. Our Product 
Catalog involves a Product, and a number of [Name, Value] pairs – which 
represent attributes of a particular product. The attribute names are standard 
along a certain Product Category, but they are too numerous to put into the 
schema. I would like to add faceting queries on these attributes. 

For e.g. 

Product 1: 
Name: Cannon Scanner
Category: Office Machines
Attribute 1 Name: Function
Attribute 1 Value: Scanner
Attribute 2 Name: PC Connection
Attribute 2 Value: USB
Attribute 3 Name: Scan Speed (ppm)
Attribute 3 Value: 2

Product 2: 
Name: HP Printer
Category: Office Machines
Attribute 1 Name: Function
Attribute 1 Value: Printer
Attribute 2 Name: PC Connection
Attribute 2 Value: LAN
Attribute 3 Name: Print Speed (ppm)
Attribute 3 Value: 35

I would like to know if there would be an easy way to retrieve the Facet Counts 
related to “PC Connection”. I think this should give me the counts for LAN, 
USB, Wi-Fi etc. for the way products connect to a PC. 

If I would put “PC Connection” into a separate field in the schema in Solr, I 
can append something like the following to the end of my query:

facet=true&facet.field=PC+Connection

However, there are too many attribute names like “PC Connection”. Is there any 
way to get the facet counts without putting “PC Connection” into a separate 
field? How should I structure my schema to get these results?


Thank you all for your help.
O. O.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Faceting-with-Name-Values-tp4036872.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: firstSearcher and NewSearcher parameters

2013-01-23 Thread Petersen, Robert
Hi Otis, 

OK I guess I see how that makes sense.  If I use function queries for affecting 
the scoring of results, does it help to include those in the warm up queries or 
does the same thing go for those also?  IE is it useless to add 
<str name="q">{!boost%20b=... ?

Thanks,
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Tuesday, January 22, 2013 5:21 PM
To: solr-user@lucene.apache.org
Subject: Re: firstSearcher and NewSearcher parameters

Hi Robi,

Boosts don't do anything for warmup queries if that is what you're after...

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jan 22, 2013 8:08 PM, Petersen, Robert rober...@buy.com wrote:

 Hi guys,

 I was wondering if there was a way to pass commonly used boost values 
 in with commonly used filter queries in these solrConfig event handler 
 sections.  Could I just append the ^1.5 at the end of the fq value?  
 IE can I do this:
 <str name="fq">taxonomyCategoryTypeId:1^1.5</str>
 Or perhaps this:
 <str name="fq">(taxonomyCategoryTypeId:1)%5e1.5</str>


 Is there a more comprehensive list of possible xml query parameters we 
 can put in these config sections?  Is it just anything normally passed 
 in? So far these are the only ones I have seen used:

 <listener event="newSearcher" class="solr.QuerySenderListener">
   <arr name="queries">
     <lst>
       <str name="q">star:1</str>
       <str name="facet.field">storeId</str>
       <str name="start">0</str>
       <str name="rows">35</str>
       <str name="fq">taxonomyCategoryTypeId:0 TO 1</str>
 ...Etc etc...

 Thanks,
 Robi




RE: firstSearcher and NewSearcher parameters

2013-01-23 Thread Petersen, Robert
Thanks Hoss, Good to know!  

I have that exact situation:  a complex function based on multiple field values 
that I always run for particular types of searches including global star 
searches to aid in sorting the results appropriately.  

Robi


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Wednesday, January 23, 2013 11:40 AM
To: solr-user@lucene.apache.org
Subject: RE: firstSearcher and NewSearcher parameters


: OK I guess I see how that makes sense.  If I use function queries for
: affecting the scoring of results, does it help to include those in the
: warm up queries or does the same thing go for those also?  IE is it
: useless to add str name=q{!boost%20b=... ?

boosts on *queries* probably won't affect your warming queries (unless you are 
concerned about a particularly important/expensive query and you always want 
that exact query to be warmed), but if you typically boost on some functions of 
field values then including those functions in your warming queries can be 
helpful to ensure that the field caches for the fields used in those functions 
are warmed up.


-Hoss




firstSearcher and NewSearcher parameters

2013-01-22 Thread Petersen, Robert
Hi guys,

I was wondering if there was a way to pass commonly used boost values in with 
commonly used filter queries in these solrConfig event handler sections.  Could 
I just append the ^1.5 at the end of the fq value?  IE can I do this:
<str name="fq">taxonomyCategoryTypeId:1^1.5</str>
Or perhaps this:
<str name="fq">(taxonomyCategoryTypeId:1)%5e1.5</str>


Is there a more comprehensive list of possible xml query parameters we can put 
in these config sections?  Is it just anything normally passed in? So far these 
are the only ones I have seen used:

<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">star:1</str>
<str name="facet.field">storeId</str>
<str name="start">0</str>
<str name="rows">35</str>
<str name="fq">taxonomyCategoryTypeId:0 TO 1</str>
...Etc etc...

Thanks,
Robi


parsing debug output for readability

2013-01-10 Thread Petersen, Robert
Hi Solr Users,

Can someone give me some good parsing rules of thumb to make the debug explain 
output human readable?  I found this cool site for visualizing the output but 
our queries are too complex and break their parser:  http://explain.solr.pl

I tried adding new lines plus indenting after every 'sum of:' and 'product 
of:', adding new lines between every number = something, and adding new lines 
plus un-indenting after every comma, but that doesn't quite seem right.  Thanks 
for any input.

Here is one of our explains:
<lst name="explain">
<str name="243030948">
46.044563 = (MATCH) boost(+(mfgPartNo:canon title:canon^1.1 titleSort:canon^1.1 
taxonomyCategoryName:canon^20.0 moreWords:canon) 
+(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^105.0 
taxonomyCategoryTypeId:`#8;#0;#0;#0;#0;) +(ConstantScore(boosted:[1 TO 
*]^1000.0)^1000.0 boosted:`#8;#0;#0;#0;#0;) +(storeId:`#8;#0;#0;#0;#1;^80.0 
storeId:`#8;#0;#0;#0;#2;^5.0 storeId:`#8;#0;#0;#0;#3; storeId:`#8;#0;#0;#0;#4; 
storeId:`#8;#0;#0;#0;#5;^5.0 storeId:`#8;#0;#0;#0;#6; 
storeId:`#8;#0;#0;#0;#7;^80.0 storeId:`#8;#0;#0;#0;#8;^80.0 
ConstantScore(storeId:[9 TO *]^60.0)^60.0) +(ConstantScore(sales:[10 TO 
*]^200.0)^200.0 ConstantScore(sales:[5 TO 9]^190.0)^190.0 
ConstantScore(sales:[1 TO 4]^180.0)^180.0 sales:`#8;#0;#0;#0;#0;) 
+(ConstantScore(views:[51 TO *]^250.0)^250.0 ConstantScore(views:[10 TO 
50]^30.0)^30.0 ConstantScore(views:[1 TO 9]^10.0)^10.0 views:`#8;#0;#0;#0;#0;) 
+(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^75.0 
taxonomyCategoryTypeId:`#8;#0;#0;#0;#0;),pow(sum(log(sum(product(int(boosted),const(9000.0)),product(product(int(image),int(stocked)),const(300.0)),product(product(int(image),int(taxonomyCategoryTypeId)),const(300.0)),product(product(int(image),int(sales)),const(150.0)),product(int(stocked),const(2.0)),product(int(sales),const(2.0)),int(views))),const(1.0)),const(3.0))),
 product of: 0.36104107 = (MATCH) sum of: 0.026820535 = (MATCH) product of: 
0.04470089 = (MATCH) sum of: 0.03307638 = (MATCH) weight(mfgPartNo:canon in 
122108), product of: 0.0059631695 = queryWeight(mfgPartNo:canon), product of: 
11.093556 = idf(docFreq=672, maxDocs=16277616) 5.3753454E-4 = queryNorm 
5.546778 = (MATCH) fieldWeight(mfgPartNo:canon in 122108), product of: 1.0 = 
tf(termFreq(mfgPartNo:canon)=1) 11.093556 = idf(docFreq=672, maxDocs=16277616) 
0.5 = fieldNorm(field=mfgPartNo, doc=122108) 0.0057238983 = (MATCH) 
weight(title:canon^1.1 in 122108), product of: 0.00424859 = 
queryWeight(title:canon^1.1), product of: 1.1 = boost 7.1853147 = 
idf(docFreq=33522, maxDocs=16277616) 5.3753454E-4 = queryNorm 1.3472465 = 
(MATCH) fieldWeight(title:canon in 122108), product of: 1.0 = 
tf(termFreq(title:canon)=1) 7.1853147 = idf(docFreq=33522, maxDocs=16277616) 
0.1875 = fieldNorm(field=title, doc=122108) 0.005900612 = (MATCH) 
weight(moreWords:canon in 122108), product of: 0.0038275106 = 
queryWeight(moreWords:canon), product of: 7.1204925 = idf(docFreq=35767, 
maxDocs=16277616) 5.3753454E-4 = queryNorm 1.5416318 = (MATCH) 
fieldWeight(moreWords:canon in 122108), product of: 1.7320508 = 
tf(termFreq(moreWords:canon)=3) 7.1204925 = idf(docFreq=35767, 
maxDocs=16277616) 0.125 = fieldNorm(field=moreWords, doc=122108) 0.6 = 
coord(3/5) 0.058011983 = (MATCH) product of: 0.116023965 = (MATCH) sum of: 
0.116023965 = (MATCH) weight(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^105.0 in 
122108), product of: 0.08092295 = 
queryWeight(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^105.0), product of: 105.0 = 
boost 1.4337585 = idf(docFreq=10549013, maxDocs=16277616) 5.3753454E-4 = 
queryNorm 1.4337585 = (MATCH) 
fieldWeight(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1; in 122108), product of: 1.0 
= tf(termFreq(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;)=1) 1.4337585 = 
idf(docFreq=10549013, maxDocs=16277616) 1.0 = 
fieldNorm(field=taxonomyCategoryTypeId, doc=122108) 0.5 = coord(1/2) 
2.6876872E-4 = (MATCH) product of: 5.3753745E-4 = (MATCH) sum of: 5.3753745E-4 
= (MATCH) weight(boosted:`#8;#0;#0;#0;#0; in 122108), product of: 5.37536E-4 = 
queryWeight(boosted:`#8;#0;#0;#0;#0;), product of: 1.027 = 
idf(docFreq=16277571, maxDocs=16277616) 5.3753454E-4 = queryNorm 1.027 = 
(MATCH) fieldWeight(boosted:`#8;#0;#0;#0;#0; in 122108), product of: 1.0 = 
tf(termFreq(boosted:`#8;#0;#0;#0;#0;)=1) 1.027 = idf(docFreq=16277571, 
maxDocs=16277616) 1.0 = fieldNorm(field=boosted, doc=122108) 0.5 = coord(1/2) 
0.17403002 = (MATCH) product of: 0.7831351 = (MATCH) sum of: 0.75088304 = 
(MATCH) weight(storeId:`#8;#0;#0;#0;#8;^80.0 in 122108), product of: 0.17969431 
= queryWeight(storeId:`#8;#0;#0;#0;#8;^80.0), product of: 80.0 = boost 4.178669 
= idf(docFreq=677816, maxDocs=16277616) 5.3753454E-4 = queryNorm 4.178669 = 
(MATCH) fieldWeight(storeId:`#8;#0;#0;#0;#8; in 122108), product of: 1.0 = 
tf(termFreq(storeId:`#8;#0;#0;#0;#8;)=1) 4.178669 = idf(docFreq=677816, 
maxDocs=16277616) 1.0 = fieldNorm(field=storeId, doc=122108) 0.032252073 = 
(MATCH) ConstantScore(storeId:[9 TO *]^60.0)^60.0, product of: 60.0 

RE: parsing debug output for readability

2013-01-10 Thread Petersen, Robert
Hi Erik,

Thanks, debug.explain.structured=true helps a lot!  Could you also tell me what 
these `#8;#0;#0;#0;#1; strings represent in the debug output?  Are they some 
internal representation of the field name/value combos in the query?  They come 
out like this:  
fieldWeight(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1; in 122108)
tf(termFreq(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;)=1)

...and this:
<str name="description">
boost(+(mfgPartNo:canon title:canon^1.1 titleSort:canon^1.1 
taxonomyCategoryName:canon^20.0 moreWords:canon) 
+(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^105.0 
taxonomyCategoryTypeId:`#8;#0;#0;#0;#0;) +(ConstantScore(boosted:[1 TO 
*]^1000.0)^1000.0 boosted:`#8;#0;#0;#0;#0;) +(storeId:`#8;#0;#0;#0;#1;^80.0 
storeId:`#8;#0;#0;#0;#2;^5.0 storeId:`#8;#0;#0;#0;#3; storeId:`#8;#0;#0;#0;#4; 
storeId:`#8;#0;#0;#0;#5;^5.0 storeId:`#8;#0;#0;#0;#6; 
storeId:`#8;#0;#0;#0;#7;^80.0 storeId:`#8;#0;#0;#0;#8;^80.0 
ConstantScore(storeId:[9 TO *]^60.0)^60.0) +(ConstantScore(sales:[10 TO 
*]^200.0)^200.0 ConstantScore(sales:[5 TO 9]^190.0)^190.0 
ConstantScore(sales:[1 TO 4]^180.0)^180.0 sales:`#8;#0;#0;#0;#0;) 
+(ConstantScore(views:[51 TO *]^250.0)^250.0 ConstantScore(views:[10 TO 
50]^30.0)^30.0 ConstantScore(views:[1 TO 9]^10.0)^10.0 views:`#8;#0;#0;#0;#0;) 
+(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^75.0


Thanks again,
R


-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Thursday, January 10, 2013 12:49 PM
To: solr-user@lucene.apache.org
Subject: Re: parsing debug output for readability

Robert -

Two options here:

  - Use debug.explain.structured 
http://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured

  - Use wt=ruby&indent=on and it'll come out in an indented, browser-friendly 
manner, but even in XML it should come out with whitespace and newlines in the 
actual XML source (browsers render it ugly though)
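
A request combining the two might look like this (the host, core path and query 
are made up; the parameter names are the real ones):

http://localhost:8983/solr/select?q=canon&debugQuery=on&debug.explain.structured=true&wt=ruby&indent=on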

Erik


On Jan 10, 2013, at 15:35 , Petersen, Robert wrote:

 Hi Solr Users,
 
 Can someone give me some good parsing rules of thumb to make the debug 
 explain output human readable?  I found this cool site for visualizing the 
 output but our queries are too complex and break their parser:  
 http://explain.solr.pl
 
 I tried adding new lines plus indenting after every 'sum of:' and 'product 
 of:', adding new lines between every number = something, and adding new 
 lines plus un-indenting after every comma, but that doesn't quite seem right. 
  Thanks for any input.
 
 Here is one of our explains:
 <lst name="explain">
 <str name="243030948">
 46.044563 = (MATCH) boost(+(mfgPartNo:canon title:canon^1.1 
 titleSort:canon^1.1 taxonomyCategoryName:canon^20.0 moreWords:canon) 
 +(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^105.0 
 taxonomyCategoryTypeId:`#8;#0;#0;#0;#0;) +(ConstantScore(boosted:[1 TO 
 *]^1000.0)^1000.0 boosted:`#8;#0;#0;#0;#0;) +(storeId:`#8;#0;#0;#0;#1;^80.0 
 storeId:`#8;#0;#0;#0;#2;^5.0 storeId:`#8;#0;#0;#0;#3; 
 storeId:`#8;#0;#0;#0;#4; storeId:`#8;#0;#0;#0;#5;^5.0 
 storeId:`#8;#0;#0;#0;#6; storeId:`#8;#0;#0;#0;#7;^80.0 
 storeId:`#8;#0;#0;#0;#8;^80.0 ConstantScore(storeId:[9 TO *]^60.0)^60.0) 
 +(ConstantScore(sales:[10 TO *]^200.0)^200.0 ConstantScore(sales:[5 TO 
 9]^190.0)^190.0 ConstantScore(sales:[1 TO 4]^180.0)^180.0 
 sales:`#8;#0;#0;#0;#0;) +(ConstantScore(views:[51 TO *]^250.0)^250.0 
 ConstantScore(views:[10 TO 50]^30.0)^30.0 ConstantScore(views:[1 TO 
 9]^10.0)^10.0 views:`#8;#0;#0;#0;#0;) 
 +(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^75.0 
 taxonomyCategoryTypeId:`#8;#0;#0;#0;#0;),pow(sum(log(sum(product(int(boosted),const(9000.0)),product(product(int(image),int(stocked)),const(300.0)),product(product(int(image),int(taxonomyCategoryTypeId)),const(300.0)),product(product(int(image),int(sales)),const(150.0)),product(int(stocked),const(2.0)),product(int(sales),const(2.0)),int(views))),const(1.0)),const(3.0))),
  product of: 0.36104107 = (MATCH) sum of: 0.026820535 = (MATCH) product of: 
 0.04470089 = (MATCH) sum of: 0.03307638 = (MATCH) weight(mfgPartNo:canon in 
 122108), product of: 0.0059631695 = queryWeight(mfgPartNo:canon), product of: 
 11.093556 = idf(docFreq=672, maxDocs=16277616) 5.3753454E-4 = queryNorm 
 5.546778 = (MATCH) fieldWeight(mfgPartNo:canon in 122108), product of: 1.0 = 
 tf(termFreq(mfgPartNo:canon)=1) 11.093556 = idf(docFreq=672, 
 maxDocs=16277616) 0.5 = fieldNorm(field=mfgPartNo, doc=122108) 0.0057238983 = 
 (MATCH) weight(title:canon^1.1 in 122108), product of: 0.00424859 = 
 queryWeight(title:canon^1.1), product of: 1.1 = boost 7.1853147 = 
 idf(docFreq=33522, maxDocs=16277616) 5.3753454E-4 = queryNorm 1.3472465 = 
 (MATCH) fieldWeight(title:canon in 122108), product of: 1.0 = 
 tf(termFreq(title:canon)=1) 7.1853147 = idf(docFreq=33522, maxDocs=16277616) 
 0.1875 = fieldNorm(field=title, doc=122108) 0.005900612 = (MATCH) 
 weight(moreWords:canon in 122108), product of: 0.0038275106 = 
 queryWeight(moreWords:canon), product of: 7.1204925 = idf(docFreq=35767, 
 maxDocs=16277616

RE: parsing debug output for readability

2013-01-10 Thread Petersen, Robert
PS  the wt=ruby param is even better!  Great tips.

-Original Message-
From: Petersen, Robert [mailto:rober...@buy.com] 
Sent: Thursday, January 10, 2013 3:17 PM
To: solr-user@lucene.apache.org
Subject: RE: parsing debug output for readability

Hi Erik,

Thanks, debug.explain.structured=true helps a lot!  Could you also tell me what 
these `#8;#0;#0;#0;#1; strings represent in the debug output?  Are they some 
internal representation of the field name/value combos in the query?  They come 
out like this:  
fieldWeight(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1; in 122108)
tf(termFreq(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;)=1)

...and this:
<str name="description">
boost(+(mfgPartNo:canon title:canon^1.1 titleSort:canon^1.1 
taxonomyCategoryName:canon^20.0 moreWords:canon) 
+(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^105.0 
taxonomyCategoryTypeId:`#8;#0;#0;#0;#0;) +(ConstantScore(boosted:[1 TO 
*]^1000.0)^1000.0 boosted:`#8;#0;#0;#0;#0;) +(storeId:`#8;#0;#0;#0;#1;^80.0 
storeId:`#8;#0;#0;#0;#2;^5.0 storeId:`#8;#0;#0;#0;#3; storeId:`#8;#0;#0;#0;#4; 
storeId:`#8;#0;#0;#0;#5;^5.0 storeId:`#8;#0;#0;#0;#6; 
storeId:`#8;#0;#0;#0;#7;^80.0 storeId:`#8;#0;#0;#0;#8;^80.0 
ConstantScore(storeId:[9 TO *]^60.0)^60.0) +(ConstantScore(sales:[10 TO 
*]^200.0)^200.0 ConstantScore(sales:[5 TO 9]^190.0)^190.0 
ConstantScore(sales:[1 TO 4]^180.0)^180.0 sales:`#8;#0;#0;#0;#0;) 
+(ConstantScore(views:[51 TO *]^250.0)^250.0 ConstantScore(views:[10 TO 
50]^30.0)^30.0 ConstantScore(views:[1 TO 9]^10.0)^10.0 views:`#8;#0;#0;#0;#0;) 
+(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^75.0


Thanks again,
R


-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Thursday, January 10, 2013 12:49 PM
To: solr-user@lucene.apache.org
Subject: Re: parsing debug output for readability

Robert -

Two options here:

  - Use debug.explain.structured 
http://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured

  - Use wt=ruby&indent=on and it'll come out in an indented, browser-friendly 
manner, but even in XML it should come out with whitespace and newlines in the 
actual XML source (browsers render it ugly though)

Erik


On Jan 10, 2013, at 15:35 , Petersen, Robert wrote:

 Hi Solr Users,
 
 Can someone give me some good parsing rules of thumb to make the debug 
 explain output human readable?  I found this cool site for visualizing the 
 output but our queries are too complex and break their parser:  
 http://explain.solr.pl
 
 I tried adding new lines plus indenting after every 'sum of:' and 'product 
 of:', adding new lines between every number = something, and adding new 
 lines plus un-indenting after every comma, but that doesn't quite seem right. 
  Thanks for any input.
 
 Here is one of our explains:
 <lst name="explain">
 <str name="243030948">
 46.044563 = (MATCH) boost(+(mfgPartNo:canon title:canon^1.1 
 titleSort:canon^1.1 taxonomyCategoryName:canon^20.0 moreWords:canon) 
 +(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^105.0 
 taxonomyCategoryTypeId:`#8;#0;#0;#0;#0;) +(ConstantScore(boosted:[1 TO 
 *]^1000.0)^1000.0 boosted:`#8;#0;#0;#0;#0;) +(storeId:`#8;#0;#0;#0;#1;^80.0 
 storeId:`#8;#0;#0;#0;#2;^5.0 storeId:`#8;#0;#0;#0;#3; 
 storeId:`#8;#0;#0;#0;#4; storeId:`#8;#0;#0;#0;#5;^5.0 
 storeId:`#8;#0;#0;#0;#6; storeId:`#8;#0;#0;#0;#7;^80.0 
 storeId:`#8;#0;#0;#0;#8;^80.0 ConstantScore(storeId:[9 TO *]^60.0)^60.0) 
 +(ConstantScore(sales:[10 TO *]^200.0)^200.0 ConstantScore(sales:[5 TO 
 9]^190.0)^190.0 ConstantScore(sales:[1 TO 4]^180.0)^180.0 
 sales:`#8;#0;#0;#0;#0;) +(ConstantScore(views:[51 TO *]^250.0)^250.0 
 ConstantScore(views:[10 TO 50]^30.0)^30.0 ConstantScore(views:[1 TO 
 9]^10.0)^10.0 views:`#8;#0;#0;#0;#0;) 
 +(taxonomyCategoryTypeId:`#8;#0;#0;#0;#1;^75.0 
 taxonomyCategoryTypeId:`#8;#0;#0;#0;#0;),pow(sum(log(sum(product(int(boosted),const(9000.0)),product(product(int(image),int(stocked)),const(300.0)),product(product(int(image),int(taxonomyCategoryTypeId)),const(300.0)),product(product(int(image),int(sales)),const(150.0)),product(int(stocked),const(2.0)),product(int(sales),const(2.0)),int(views))),const(1.0)),const(3.0))),
  product of: 0.36104107 = (MATCH) sum of: 0.026820535 = (MATCH) product of: 
 0.04470089 = (MATCH) sum of: 0.03307638 = (MATCH) weight(mfgPartNo:canon in 
 122108), product of: 0.0059631695 = queryWeight(mfgPartNo:canon), product of: 
 11.093556 = idf(docFreq=672, maxDocs=16277616) 5.3753454E-4 = queryNorm 
 5.546778 = (MATCH) fieldWeight(mfgPartNo:canon in 122108), product of: 1.0 = 
 tf(termFreq(mfgPartNo:canon)=1) 11.093556 = idf(docFreq=672, 
 maxDocs=16277616) 0.5 = fieldNorm(field=mfgPartNo, doc=122108) 0.0057238983 = 
 (MATCH) weight(title:canon^1.1 in 122108), product of: 0.00424859 = 
 queryWeight(title:canon^1.1), product of: 1.1 = boost 7.1853147 = 
 idf(docFreq=33522, maxDocs=16277616) 5.3753454E-4 = queryNorm 1.3472465 = 
 (MATCH) fieldWeight(title:canon in 122108), product of: 1.0 = 
 tf(termFreq(title:canon)=1) 7.1853147 = idf(docFreq=33522

RE: theory of sets

2013-01-07 Thread Petersen, Robert
Hi Uwe,

We have hundreds of dynamic fields, but since most of our docs only use some of 
them it doesn't seem to be a performance drag.  They can be viewed as a sparse 
matrix of fields in your indexed docs.  If you make the sortinfo_for_groupx 
field an int, it can be used in a function query to perform your sorting.  See 
http://wiki.apache.org/solr/FunctionQuery
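
As a rough sketch (using the field names from your mail, with groupX standing in 
for a real group id), the request could be as simple as:

q=member_of:groupX&sort=sortinfo_for_groupX asc

or, if you would rather fold the sort value into the score, something like 
q={!boost b=sortinfo_for_groupX}member_of:groupX and then sort by score.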


Robi

-Original Message-
From: Uwe Reh [mailto:r...@hebis.uni-frankfurt.de] 
Sent: Thursday, January 03, 2013 1:10 PM
To: solr-user@lucene.apache.org
Subject: theory of sets

Hi,

I'm looking for a tricky solution of a common problem. I have to handle a lot 
of items and each could be member of several groups.
- OK, just add a field called 'member_of'

No that's not enough, because each group is sorted and each member has a 
sortstring for this group.
- OK, still easy add a dynamic field 'sortinfo_for_*' and fill this for each 
group membership.

Yes, this works, but there are thousands of different groups, and that many 
dynamic fields are probably a serious performance issue.
- Well ...

I'm looking for a smart way to answer the question "Find the members of 
group X and sort them by the sortstring for this group."

One idea I had was to fill the 'member_of' field with composed entries 
(groupname + _ + sortstring). Finding the members is easy with wildcards, but 
there seems to be no way to use the sortstring as a boost factor.

Has anybody solved this problem?
Any hints are welcome.

Uwe



RE: occasional GC crashes

2012-12-20 Thread Petersen, Robert
Hi Otis,

I thought Java 7 had a bug, which wasn't being addressed by Oracle, that made 
it unsuitable for Solr.  Did that get fixed?
http://searchhub.org/2011/07/28/dont-use-java-7-for-anything/

I did see this but it doesn't really mention the bug:  
http://opensearchnews.com/2012/04/announcing-java7-support-with-apache-solr-and-lucene/

Thanks
Robi


-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Tuesday, December 18, 2012 5:25 PM
To: solr-user@lucene.apache.org
Subject: Re: occasional GC crashes

Robert,

Step 1 is to get the latest Java 7 or if you have to remain on 6 then use the 
latest 6.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm

On Dec 18, 2012 7:54 PM, Petersen, Robert rober...@buy.com wrote:

  Hi solr user group,


 Sorry if this isn't directly a Solr question.  Seems like once in a 
 blue moon the GC crashes on a server in our Solr 3.6.1 slave farm.  
 This seems to only happen on a couple of the twelve slaves we have 
 deployed and only very rarely on those.  It seems like this doesn't 
 directly affect solr because in the logs it looks like solr keeps 
 working after the time of the exception but our external monitoring 
 tool reports that the solr service is down so our operations department 
 restarts solr on that box and alerts me.
 The solr logs show nothing unusual.  The exception does show up in the 
 catalina.out log file though.  Does this happen to anyone else?  Here is
 the basic error and I have attached the crash dump file also.   Our total
 uptime on these boxes is over a year now BTW.


 #

 # A fatal error has been detected by the Java Runtime Environment:

 #

 #  SIGSEGV (0xb) at pc=0x2b5379346612, pid=13724, 
 tid=1082353984

 #

 # JRE version: 6.0_25-b06

 # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.0-b11 mixed mode
 linux-amd64 )

 # Problematic frame:

 # V  [libjvm.so+0x3c4612]  Par_ConcMarkingClosure::trim_queue(unsigned
 long)+0x82

 #

 # An error report file with more information is saved as:

 # /var/LucidWorks/lucidworks/hs_err_pid13724.log

 #

 # If you would like to submit a bug report, please visit:

 #   http://java.sun.com/webapps/bugreport/crash.jsp

 #


 VM Arguments:

 jvm_args:
 -Djava.util.logging.config.file=/var/LucidWorks/lucidworks/tomcat/conf
 /logging.properties 
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
 -Xmx32768m -Xms32768m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
 -Dcom.sun.management.jmxremote 
 -Dcom.sun.management.jmxremote.ssl=false
 -Dcom.sun.management.jmxremote.authenticate=false
 -Dcom.sun.management.jmxremote.port=6060
 -Djava.endorsed.dirs=/var/LucidWorks/lucidworks/tomcat/endorsed
 -Dcatalina.base=/var/LucidWorks/lucidworks/tomcat
 -Dcatalina.home=/var/LucidWorks/lucidworks/tomcat
 -Djava.io.tmpdir=/var/LucidWorks/lucidworks/tomcat/temp 

 java_command: org.apache.catalina.startup.Bootstrap -server 
 -Dsolr.solr.home=lucidworks/solr start

 Launcher Type: SUN_STANDARD


 Stack: [0x,0x],  
 sp=0x40835eb0, free space=1056983k

 Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
 C=native
 code)

 V  [libjvm.so+0x3c4612]  Par_ConcMarkingClosure::trim_queue(unsigned
 long)+0x82

 V  [libjvm.so+0x3c481a]  
 CMSConcMarkingTask::do_work_steal(int)+0xfa

 V  [libjvm.so+0x3c3dcf]  CMSConcMarkingTask::work(int)+0xef

 V  [libjvm.so+0x8783dc]  YieldingFlexibleGangWorker::loop()+0xbc

 V  [libjvm.so+0x8755b4]  GangWorker::run()+0x24

 V  [libjvm.so+0x71096f]  java_start(Thread*)+0x13f


 Heap

 par new generation   total 345024K, used 180672K [0x2e12,
 0x2aaac578, 0x2aaac578)

   eden space 306688K,  53% used [0x2e12, 
 0x2aaab8243c28,
 0x2aaac0ca)

   from space 38336K,  40% used [0x2aaac321, 
 0x2aaac415c3f8,
 0x2aaac578)

   to   space 38336K,   0% used [0x2aaac0ca, 0x2aaac0ca,
 0x2aaac321)

 concurrent mark-sweep generation total 33171072K, used 12144213K 
 [0x2aaac578, 0x2ab2ae12, 0x2ab2ae12)

 concurrent-mark-sweep perm gen total 83968K, used 50650K 
 [0x2ab2ae12, 0x2ab2b332, 0x2ab2b332)


 Code Cache  [0x2b054000, 0x2b9a4000, 0x2e054000)

 total_blobs=2800 nmethods=2273 adapters=480 free_code_cache=40752512
 largest_free_block=15808


 Thanks,


 Robert (Robi) Petersen

 Senior Software Engineer

 Search Department





occasional GC crashes

2012-12-18 Thread Petersen, Robert
Hi solr user group,

Sorry if this isn't directly a Solr question.  Seems like once in a blue moon 
the GC crashes on a server in our Solr 3.6.1 slave farm.  This seems to only 
happen on a couple of the twelve slaves we have deployed and only very rarely 
on those.  It seems like this doesn't directly affect solr because in the logs 
it looks like solr keeps working after the time of the exception but our 
external monitoring tool reports that the solr service is down so our 
operations department restarts solr on that box and alerts me.  The solr logs 
show nothing unusual.  The exception does show up in the catalina.out log file 
though.  Does this happen to anyone else?  Here is the basic error and I have 
attached the crash dump file also.   Our total uptime on these boxes is over a 
year now BTW.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x2b5379346612, pid=13724, tid=1082353984
#
# JRE version: 6.0_25-b06
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.0-b11 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x3c4612]  Par_ConcMarkingClosure::trim_queue(unsigned 
long)+0x82
#
# An error report file with more information is saved as:
# /var/LucidWorks/lucidworks/hs_err_pid13724.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

VM Arguments:
jvm_args: 
-Djava.util.logging.config.file=/var/LucidWorks/lucidworks/tomcat/conf/logging.properties
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xmx32768m 
-Xms32768m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.port=6060 
-Djava.endorsed.dirs=/var/LucidWorks/lucidworks/tomcat/endorsed 
-Dcatalina.base=/var/LucidWorks/lucidworks/tomcat 
-Dcatalina.home=/var/LucidWorks/lucidworks/tomcat 
-Djava.io.tmpdir=/var/LucidWorks/lucidworks/tomcat/temp
java_command: org.apache.catalina.startup.Bootstrap -server 
-Dsolr.solr.home=lucidworks/solr start
Launcher Type: SUN_STANDARD

Stack: [0x,0x],  sp=0x40835eb0,  free 
space=1056983k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x3c4612]  Par_ConcMarkingClosure::trim_queue(unsigned long)+0x82
V  [libjvm.so+0x3c481a]  CMSConcMarkingTask::do_work_steal(int)+0xfa
V  [libjvm.so+0x3c3dcf]  CMSConcMarkingTask::work(int)+0xef
V  [libjvm.so+0x8783dc]  YieldingFlexibleGangWorker::loop()+0xbc
V  [libjvm.so+0x8755b4]  GangWorker::run()+0x24
V  [libjvm.so+0x71096f]  java_start(Thread*)+0x13f

Heap
par new generation   total 345024K, used 180672K [0x2e12, 
0x2aaac578, 0x2aaac578)
  eden space 306688K,  53% used [0x2e12, 0x2aaab8243c28, 
0x2aaac0ca)
  from space 38336K,  40% used [0x2aaac321, 0x2aaac415c3f8, 
0x2aaac578)
  to   space 38336K,   0% used [0x2aaac0ca, 0x2aaac0ca, 
0x2aaac321)
concurrent mark-sweep generation total 33171072K, used 12144213K 
[0x2aaac578, 0x2ab2ae12, 0x2ab2ae12)
concurrent-mark-sweep perm gen total 83968K, used 50650K [0x2ab2ae12, 
0x2ab2b332, 0x2ab2b332)

Code Cache  [0x2b054000, 0x2b9a4000, 0x2e054000)
total_blobs=2800 nmethods=2273 adapters=480 free_code_cache=40752512 
largest_free_block=15808



Thanks,

Robert (Robi) Petersen
Senior Software Engineer
Search Department



Re: star searches with high page number requests taking long times

2012-12-08 Thread Petersen, Robert
We have a limit in place to restrict searches to the first ten thousand pages. 
I am going to try to get that number reduced!  I'm thinking even as low as page 
fifty should be the limit. What human (with a wallet) would even go as deep as 
fifty pages?  :)

Sent from my iGizmo


On Dec 8, 2012, at 10:21 AM, Otis Gospodnetic otis.gospodne...@gmail.com 
wrote:

 It is common practise not to allow drilling deep in search results.
 
 Otis
 --
 SOLR Performance Monitoring - http://sematext.com/spm
 On Dec 8, 2012 10:27 AM, Jack Krupansky j...@basetechnology.com wrote:
 
 What exactly is the common practice - is there a free, downloadable search
 component that does that or at least a blueprint for recommended best
 practice? What limit is common? (I know Google limits you to the top 1,000
 results.)
 
 -- Jack Krupansky
 
 -Original Message- From: Otis Gospodnetic
 Sent: Saturday, December 08, 2012 7:25 AM
 To: solr-user@lucene.apache.org
 Subject: Re: star searches with high page number requests taking long times
 
 Hi Robert,
 
 You should just prevent deep paging. Humans with wallets don't do that, so
 you will not lose anything by doing that. It's common practice.
 
 Otis
 --
 SOLR Performance Monitoring - http://sematext.com/spm
 On Dec 7, 2012 8:10 PM, Petersen, Robert rober...@buy.com wrote:
 
 Hi guys,
 
 
 Sometimes we get a bot crawling our search function on our retail web
 site.  The ebay crawler loves to do this (Request.UserAgent: Terapeakbot).
 They just do a star search and then iterate through page after page. I've
 noticed that when they get to higher page numbers like page 9000, the
 searches are taking more than 20 seconds.  Is this expected behavior?
 We're requesting standard facets with the search as well as incorporating
 boosting by function query.  Our index is almost 15 million docs now and
 we're on Solr 3.6.1, this isn't causing any errors to occur at the solr
 layer but our web layer times out the search after 20 seconds and logs the
 exception.
 
 
 
 Thanks
 
 Robi
 



star searches with high page number requests taking long times

2012-12-07 Thread Petersen, Robert
Hi guys,


Sometimes we get a bot crawling our search function on our retail web site.  
The ebay crawler loves to do this (Request.UserAgent: Terapeakbot).  They just 
do a star search and then iterate through page after page.  I've noticed that 
when they get to higher page numbers like page 9000, the searches are taking 
more than 20 seconds.  Is this expected behavior?  We're requesting standard 
facets with the search as well as incorporating boosting by function query.  
Our index is almost 15 million docs now and we're on Solr 3.6.1, this isn't 
causing any errors to occur at the solr layer but our web layer times out the 
search after 20 seconds and logs the exception.



Thanks

Robi



RE: anyone have any clues about this exception

2012-10-12 Thread Petersen, Robert
Hi Erick,

After reading the discussion you guys were having about renaming optimize to 
forceMerge I realized I was guilty of over-optimizing like you guys were 
worried about!  We have about 15 million docs indexed now and we spin about 
50-300 adds per second 24/7, most of them being updates to existing documents 
whose data has changed since the last time it was indexed (which we keep track 
of in a DB table).  There are some new documents being added in the mix and 
some deletes as well too.

I understand now how the merge policy caps the number of segments.  I used to 
think they would grow unbounded and thus optimize was required.  How does the 
large number of updates of existing documents affect the need to optimize, by 
causing a large number of deletes with a 're-add'?  And so I suppose that means 
the index size tends to grow with the deleted docs hanging around in the 
background, as it were.

So in our situation, what frequency of optimize would you recommend?  We're on 
3.6.1 btw...

Thanks,
Robi

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, October 11, 2012 5:29 AM
To: solr-user@lucene.apache.org
Subject: Re: anyone have any clues about this exception

Well, you'll actually be able to optimize, it's just called forceMerge.

But the point is that optimize seems like something that _of course_ you want 
to do, when in reality it's not something you usually should do at all. 
Optimize does two things:
1> merges all the segments into one (usually)
2> removes all of the info associated with deleted documents.

Of the two, point 2 is the one that really counts and that's done whenever 
segment merging is done anyway. So unless you have a very large number of 
deletes (or updates of the same document), optimize buys you very little. You 
can tell this by the difference between numDocs and maxDoc in the admin page.

So what happens if you just don't bother to optimize? As an alternative, take a 
look at merge policy to help control how merging happens.
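
For reference, on 3.x that's configured in solrconfig.xml; a minimal sketch 
(the values are illustrative only, not a recommendation) looks like:

<indexDefaults>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- how many segments are merged at once and how many are allowed per tier -->
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexDefaults>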

Best
Erick

On Wed, Oct 10, 2012 at 3:04 PM, Petersen, Robert rober...@buy.com wrote:
 You could be right.  Going back in the logs, I noticed it used to happen less 
 frequently and always towards the end of an optimize operation.  It is 
 probably my indexer timing out waiting for updates to occur during optimizes. 
  The errors grew recently due to my upping the indexer threadcount to 22 
 threads, so there's a lot more timeouts occurring now.  Also our index has 
 grown to double the old size so the optimize operation has started taking a 
 lot longer, also contributing to what I'm seeing.   I have just changed my 
 optimize frequency from three times a day to one time a day after reading the 
 following:

 Here they are talking about completely deprecating the optimize 
 command in the next version of solr... 
 https://issues.apache.org/jira/browse/SOLR-3141


 -Original Message-
 From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
 Sent: Wednesday, October 10, 2012 11:10 AM
 To: solr-user@lucene.apache.org
 Subject: Re: anyone have any clues about this exception

 Something timed out, the other end closed the connection. This end tried to 
 write to closed pipe and died, something tried to catch that exception and 
 write its own and died even worse? Just making it up really, but sounds good 
 (plus a 3-year Java tech-support hunch).

 If it happens often enough, see if you can run WireShark on that machine's 
 network interface and catch the whole network conversation in action. Often, 
 there is enough clues there by looking at tcp packets and/or stuff 
 transmitted. WireShark is a power-tool, so takes a little while the first 
 time, but the learning will pay for itself over and over again.

 Regards,
Alex.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all 
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)


 On Wed, Oct 10, 2012 at 11:31 PM, Petersen, Robert rober...@buy.com wrote:
 Tomcat localhost log (not the catalina log) for my  solr 3.6.1 (master) 
 instance contains lots of these exceptions but solr itself seems to be doing 
 fine... any ideas?  I'm not seeing these exceptions being logged on my slave 
 servers btw, just the master where we do our indexing only.



 Oct 9, 2012 5:34:11 PM org.apache.catalina.core.StandardWrapperValve
 invoke
 SEVERE: Servlet.service() for servlet default threw exception 
 java.lang.IllegalStateException
 at 
 org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:407)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:389)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:291

RE: anyone have any clues about this exception

2012-10-10 Thread Petersen, Robert
You could be right.  Going back in the logs, I noticed it used to happen less 
frequently and always towards the end of an optimize operation.  It is probably 
my indexer timing out waiting for updates to occur during optimizes.  The 
errors grew recently due to my upping the indexer threadcount to 22 threads, so 
there's a lot more timeouts occurring now.  Also our index has grown to double 
the old size so the optimize operation has started taking a lot longer, also 
contributing to what I'm seeing.   I have just changed my optimize frequency 
from three times a day to one time a day after reading the following:

Here they are talking about completely deprecating the optimize command in the 
next version of solr…
https://issues.apache.org/jira/browse/SOLR-3141


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Wednesday, October 10, 2012 11:10 AM
To: solr-user@lucene.apache.org
Subject: Re: anyone have any clues about this exception

Something timed out, the other end closed the connection. This end tried to 
write to closed pipe and died, something tried to catch that exception and 
write its own and died even worse? Just making it up really, but sounds good 
(plus a 3-year Java tech-support hunch).

If it happens often enough, see if you can run WireShark on that machine's 
network interface and catch the whole network conversation in action. Often, 
there is enough clues there by looking at tcp packets and/or stuff transmitted. 
WireShark is a power-tool, so takes a little while the first time, but the 
learning will pay for itself over and over again.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. 
Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Oct 10, 2012 at 11:31 PM, Petersen, Robert rober...@buy.com wrote:
 Tomcat localhost log (not the catalina log) for my  solr 3.6.1 (master) 
 instance contains lots of these exceptions but solr itself seems to be doing 
 fine... any ideas?  I'm not seeing these exceptions being logged on my slave 
 servers btw, just the master where we do our indexing only.



 Oct 9, 2012 5:34:11 PM org.apache.catalina.core.StandardWrapperValve 
 invoke
 SEVERE: Servlet.service() for servlet default threw exception 
 java.lang.IllegalStateException
 at 
 org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:407)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:389)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:291)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
 at 
 com.googlecode.psiprobe.Tomcat60AgentValve.invoke(Tomcat60AgentValve.java:30)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
 at 
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
 at java.lang.Thread.run(Unknown Source)



RE: Faceted search question (Tokenizing)

2012-10-10 Thread Petersen, Robert
What do you want the results to be, persons?  And the facets should be 
interests or subinterests?  Why are there two layers of interests anyway?  Can 
there be many subinterests under one interest?  Is one of those two a name of 
the interest which would look nice as a facet?

Anyway, have you read these pages yet?  These should get you started in the 
right direction.
http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/HierarchicalFaceting

Hope that helps,
Robi

-Original Message-
From: Grapes [mailto:mkloub...@gmail.com] 
Sent: Wednesday, October 10, 2012 8:52 AM
To: solr-user@lucene.apache.org
Subject: Faceted search question (Tokenizing)

Hey There, 

We have the following data structure: 


- Person
-- Interest 1
--- Subinterest 1
--- Subinterest 1 Description
--- Subinterest 1 ID
-- Interest 2
--- Subinterest 2
--- Subinterest 2 Description
--- Subinterest 2 ID
. 
-- Interest 99
--- Subinterest 99
--- Subinterest 99 Description
--- Subinterest 99 ID 

Interest, Subinterest, Subinterest Description and Subinterest IDs are all 
multivalued fields. A person can have any number of subinterests, descriptions 
and IDs. 

How could we facet/search this based on this data structure? Right now we 
tokenized everything in a separate multivalued column in the following fashion: 


|Interest='Interest 1',Subinterest='Subinterest 1',Subinterest='Another Subinterest 1',Description='Interest 1 Description',ID='Interest 1 ID'| 
|Interest='Interest 2',Description='Interest 2 Description',ID='Interest 2 ID'| 

I have a feeling this is the wrong approach to this problem.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Faceted-search-question-Tokenizing-tp4012948.html
Sent from the Solr - User mailing list archive at Nabble.com.




some general solr 4.0 questions

2012-09-20 Thread Petersen, Robert
Hello solr user group,

I am evaluating the new Solr 4.0 beta with an eye to how to fit it into our 
current solr setup.  Our current setup is running on solr 3.6.1 and uses 12 
slaves behind a load balancer and a master which we index into, and they all 
have three cores (now referred to as collections in 4.0 eh?) for three 
disparate types of indexes.  All machines are configured with dual quad xeon 
cpus and 64GB main memory.  We've worked hard to keep our index sizes small 
despite holding millions of documents, so we have no need to shard any of the 
indexes.  Everything is working very well at this time.

So to move to solr 4.0, I imagine we'd set -DnumShards=1 and spin up 11 
replicas, but I'm worried about the statement "For production, it's recommended 
that you run an external zookeeper ensemble rather than having Solr run 
embedded zookeeper servers."  That means we'd need at least three more machines 
dedicated to just running zookeeper.  So here are my questions:


1. Could the zookeeper servers be smaller commodity servers?  IE they 
wouldn't need 64GB of memory and huge CPUs, right?

2. Is the overhead of running embedded zookeeper really great enough to 
require the external ensemble?  Our configuration will be pretty static; I 
don't anticipate having to change the zookeeper cluster once it is set up 
unless a machine completely dies or something.

3. Can we still use our external load balancer hardware to distribute 
queries to the solr 4.0 replicas as we do now with our slave farm?

4. Can solr 4.0 still run in a master-slave configuration if we don't want 
to use zookeeper or some of the other cloud features?
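
(For context, the sort of startup I'm picturing, based on the SolrCloud wiki 
examples and with made-up ZooKeeper host names, would be roughly:

java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=zk1:2181,zk2:2181,zk3:2181 -DnumShards=1 -jar start.jar

for the first node, and just

java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar

for the replicas.)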


Thanks,

Robert (Robi) Petersen
Senior Software Engineer
Site Search Specialist




RE: some general solr 4.0 questions

2012-09-20 Thread Petersen, Robert
That is a great idea to run the updates thru the LB also!  I like it!

Thanks for the replies guys


-Original Message-
From: jimtronic [mailto:jimtro...@gmail.com] 
Sent: Thursday, September 20, 2012 1:46 PM
To: solr-user@lucene.apache.org
Subject: Re: some general solr 4.0 questions

I've got a setup like yours -- lots of cores and replicas, but no need for 
shards -- and here's what I've found so far:

1. Zookeeper is tiny. I would think network I/O is going to be the biggest 
concern.

2. I think this is more about high availability than performance. I've been 
experimenting with taking down parts of my setup to see what happens. When 
zookeeper goes down, the solr instances still serve requests. It appears, 
however, that updating and replication stop. I want to make frequent updates so 
this is a big concern for me.

3. On ec2, I launch a server which is configured to register itself with my 
zookeeper box upon launch. When they are ready I add them to my load balancer. 
Theoretically, zookeeper would help further balance them, but right now I find 
those queries to be too slow. Since the load balancer is already distributing 
the load, I'm adding the parameter distrib=false to my queries. This forces 
the request to stay on the box the load balancer chose.

4. This is interesting. I started down this path of wanting to maintain a 
master, but I've moved towards a system where all of my update requests go 
through my load balancer. Since zookeeper dynamically elects a leader, no 
matter which box gets the update the leader gets it anyway. This is very nice 
for me because I want all my solr instances to be identical.

Since there's not a lot of documentation on this yet, I hope other people share 
their findings, too.
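
To make point 3 above concrete, the kind of request I mean is just a normal 
query with distrib=false tacked on, sent straight to whichever node the load 
balancer picked (the host and collection names here are placeholders):

http://solr-node1:8983/solr/collection1/select?q=*:*&distrib=false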





--
View this message in context: 
http://lucene.472066.n3.nabble.com/some-general-solr-4-0-questions-tp4009267p4009286.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: broken links in solr wiki

2012-09-18 Thread Petersen, Robert
OK I made a login and corrected the links.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Monday, September 17, 2012 5:07 PM
To: solr-user@lucene.apache.org
Subject: Re: broken links in solr wiki

Hi Robert,

Anyone can edit wiki, you just need to create user.

Regarding URLs

http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/stemdict.txt

http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/collection1/conf/protwords.txt

--- On Tue, 9/18/12, Petersen, Robert rober...@buy.com wrote:

 From: Petersen, Robert rober...@buy.com
 Subject: broken links in solr wiki
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Date: Tuesday, September 18, 2012, 2:58 AM Hi group,
 
 On this wiki page these two links below are broken as they are also on 
 lucidworks' version, can someone point me at the correct locations 
 please?  I googled around and came up with possible good links.
 
 Thanks
 Robi
 
 http://wiki.apache.org/solr/LanguageAnalysis#Other_Tips
 http://lucidworks.lucidimagination.com/display/solr/Language+Analysis
 
 solr.KeywordMarkerFilterFactory
 
  A sample Solr protwords.txt with comments 
  http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/protwords.txt 
  can be found in the Source Repository.
 
 Is this it?  
 http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/col
 lection1/conf/protwords.txt
 
 
 
 solr.StemmerOverrideFilterFactory
 
  A sample stemdict.txt with comments 
  http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/stemdict.txt
 can be found in the Source Repository.
 
 Is this it?  
 https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-f
 iles/solr/conf/stemdict.txt?p=1227271
 (needs the ?p= parameter???)
 
 




broken links in solr wiki

2012-09-17 Thread Petersen, Robert
Hi group,

On this wiki page these two links below are broken as they are also on 
lucidworks' version, can someone point me at the correct locations please?  I 
googled around and came up with possible good links.

Thanks
Robi

http://wiki.apache.org/solr/LanguageAnalysis#Other_Tips
http://lucidworks.lucidimagination.com/display/solr/Language+Analysis

solr.KeywordMarkerFilterFactory

A sample Solr protwords.txt with comments 
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/protwords.txt 
can be found in the Source Repository.

Is this it?  
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/collection1/conf/protwords.txt



solr.StemmerOverrideFilterFactory

A sample stemdict.txt with comments 
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/stemdict.txt 
can be found in the Source Repository.

Is this it?  
https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/stemdict.txt?p=1227271
  (needs the ?p= parameter???)



RE: Solr grouping / facet query

2012-07-20 Thread Petersen, Robert
Why not just index one title per document, each having author and specialty 
fields included?  Then you could search titles with a user query and also 
filter/facet on the author and specialties at the same time.   The author bio 
and other data could be looked up on the fly from a DB if you didn't want to 
store that all in each document.  If the user's query is for the titles though, 
I don't really see the point of indexing authors with no titles, but you could 
include them with an empty title field if you wanted them to show up in facets, 
or use a title placeholder for them which says 'No Titles Available' perhaps.
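
As a rough sketch (the field names are hypothetical), with one document per 
title a single request could do the search, the filtering and the facets 
together:

q=title:(some user query)&fq=specialty:political&facet=true&facet.field=author&facet.field=specialty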

Just a thought
Robi


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, July 20, 2012 5:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr grouping / facet query

You might try two queries. The first would get your authors, the second would 
use the returned authors as a filter query and search your titles, grouped by 
author, then combine the two lists. I don't know how big your corpus is, but two 
queries may well be fast enough.

Best
Erick

On Thu, Jul 19, 2012 at 10:28 AM, s215903406 s...@s215903406.onlinehome.us 
wrote:
 Thanks for the reply.

 To clarify, the idea is to search for authors with certain specialties (eg.
 political, horror, etc.) and if they have any published titles 
 relevant to the user's query, then display those titles next to the author's 
 name.

 At first, I thought it would be great to have all the author's data 
 (name, location, bio, titles with descriptions, etc) all in one 
 document. Each title and description being a multivalued field, 
 however, I have no idea how the relevant titles based on the user's 
 query as described above can be quickly picked from within the document and 
 displayed.

 The only solution I see is to have a doc per title and include the 
 name, location, bio, etc in each one. As for the author's with no 
 published titles, simply add their bio data to a document with no 
 title or description and when I do the grouping check to see if the 
 title is blank, then display no titles found.

 This could work, though I'm concerned if having all that duplicate bio 
 data will affect the relevancy of the results or speed/performance of solr?

 Thank you.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-grouping-facet-query-tp3995787
 p3995974.html Sent from the Solr - User mailing list archive at 
 Nabble.com.




RE: how do I search the archives for solr-user

2012-07-03 Thread Petersen, Robert
This site is pretty cool also, just filter on solr-user like this:
http://markmail.org/search/?q=list%3Aorg.apache.lucene.solr-user


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Monday, July 02, 2012 5:34 PM
To: solr-user@lucene.apache.org
Subject: Re: how do I search the archives for solr-user



http://lucene.apache.org/solr/discussion.html#mail-archives


-Hoss




RE: Broken pipe error

2012-07-03 Thread Petersen, Robert
I also had this problem on solr/tomcat and finally saw the errors were coming 
from my application side disconnecting from solr after a timeout.  This was 
happening when solr was busy doing an optimize and thus not responding quickly 
enough.  Initially when I saw this in the logs, I was quite worried until I 
realized the source of the problem.

Robi

-Original Message-
From: alx...@aim.com [mailto:alx...@aim.com] 
Sent: Tuesday, July 03, 2012 10:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Broken pipe error

I had the same problem with jetty. It turned out that broken pipe happens when 
the application disconnects from jetty. In my case I was using a php client and 
it had a 10 sec restriction in the curl request. When solr took more than 10 sec 
to respond, curl automatically disconnected from jetty.

Hope this can help.

Alex.



-Original Message-
From: Jason hialo...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jul 2, 2012 7:41 pm
Subject: Broken pipe error


Hi, all

We're independently running three search servers.
One of the three servers has a bigger index size and more connected users than 
the others.
Other than that, all configurations are the same.
The problem is that this server sometimes throws a broken pipe error.
But I don't know what the problem is.
Please give some ideas.
Thanks in advance.
Jason


error message below...
===
2012-07-03 10:42:56,753 [http-8080-exec-3677] ERROR 
org.apache.solr.servlet.SolrDispatchFilter - null:ClientAbortException: 
java.io.IOException: Broken pipe
at
org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:358)
at
org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:432)
at
org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:309)
at
org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288)
at
org.apache.catalina.connector.CoyoteOutputStream.flush(CoyoteOutputStream.java:98)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:278)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
at org.apache.solr.util.FastWriter.flush(FastWriter.java:115)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:402)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:279)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:470)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
at
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:732)
at
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2262)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:69)
at sun.nio.ch.IOUtil.write(IOUtil.java:40)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
at org.apache.tomcat.util.net.NioChannel.write(NioChannel.java:116)
at
org.apache.tomcat.util.net.NioBlockingSelector.write(NioBlockingSelector.java:93)
at
org.apache.tomcat.util.net.NioSelectorPool.write(NioSelectorPool.java:156)
at
org.apache.coyote.http11.InternalNioOutputBuffer.writeToSocket(InternalNioOutputBuffer.java:460)
at
org.apache.coyote.http11.InternalNioOutputBuffer.flushBuffer(InternalNioOutputBuffer.java:804)
at
org.apache.coyote.http11.InternalNioOutputBuffer.addToBB(InternalNioOutputBuffer.java:644)
at
org.apache.coyote.http11.InternalNioOutputBuffer.access$000(InternalNioOutputBuffer.java:46)
at