Re: Solr hangs on distributed updates

2014-12-16 Thread Peter Keegan
 A distributed update is streamed to all available replicas in parallel.

Hmm, that's not what I'm seeing with 4.6.1, as I tail the logs on leader
and replicas. Mark Miller commented on this last May:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201404.mbox/%3CetPan.534d8d6d.74b0dc51.13a79@airmetal.local%3E

On Mon, Dec 15, 2014 at 8:11 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Mon, Dec 15, 2014 at 8:41 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
 
  If a timeout occurs, does the distributed update then go to the next
  replica?
 

 A distributed update is streamed to all available replicas in parallel.


 
  On Fri, Dec 12, 2014 at 3:42 PM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
  
   Sorry I should have specified. These timeouts go inside the solrcloud
   section and apply for inter-shard update requests only. The socket and
   connection timeout inside the shardHandlerFactory section apply for
   inter-shard search requests.
  
   On Fri, Dec 12, 2014 at 8:38 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
   Btw, are the following timeouts still supported in solr.xml, and do they
   only apply to distributed search?

     <shardHandlerFactory name="shardHandlerFactory"
                          class="HttpShardHandlerFactory">
       <int name="socketTimeout">${socketTimeout:0}</int>
       <int name="connTimeout">${connTimeout:0}</int>
     </shardHandlerFactory>

   Thanks,
   Peter
   
 On Fri, Dec 12, 2014 at 3:14 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  No, I wasn't aware of these. I will give that a try. If I stop the Solr
  jetty service manually, things recover fine, but the hang occurs when I
  'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a
  15-sec timeout from the stopped node, and expires the session, but the
  Solr leader never gets notified. This seems like a bug in ZK.

  Thanks,
  Peter


  On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:

  Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to
  reasonable values in your solr.xml? These are the timeouts used for
  inter-shard update requests.

  On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com
  wrote:

   We are running SolrCloud in AWS and using their auto scaling groups to
   spin up new Solr replicas when CPU utilization exceeds a threshold for a
   period of time. All is well until the replicas are terminated when CPU
   utilization falls below another threshold. What happens is that index
   updates sent to the Solr leader hang forever in both the Solr leader and
   the SolrJ client app. Searches work fine.  Here are 2 thread stack
   traces from the Solr leader and 2 from the client app:

   1) Solr-leader thread doing a distributed commit:
 
   [quoted stack trace trimmed; the full thread dump is in the original
   2014-12-12 post elsewhere in this digest]

Re: Solr hangs on distributed updates

2014-12-16 Thread Peter Keegan
 As of 4.10, commits/optimize etc are executed in parallel.
Excellent - thanks.

On Tue, Dec 16, 2014 at 6:51 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Tue, Dec 16, 2014 at 11:34 AM, Peter Keegan peterlkee...@gmail.com
 wrote:
 
   A distributed update is streamed to all available replicas in parallel.
 
  Hmm, that's not what I'm seeing with 4.6.1, as I tail the logs on leader
   and replicas. Mark Miller commented on this last May:
 
 
 
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201404.mbox/%3CetPan.534d8d6d.74b0dc51.13a79@airmetal.local%3E
 
 
 Yes, sorry I didn't notice that you are on 4.6.1. This was changed in 4.10
 with https://issues.apache.org/jira/browse/SOLR-6264

 As of 4.10, commits/optimize etc are executed in parallel.


  On Mon, Dec 15, 2014 at 8:11 PM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
  
   On Mon, Dec 15, 2014 at 8:41 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
   
If a timeout occurs, does the distributed update then go to the next
replica?
   
  
   A distributed update is streamed to all available replicas in parallel.
  
  
   
On Fri, Dec 12, 2014 at 3:42 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Sorry I should have specified. These timeouts go inside the
  solrcloud
 section and apply for inter-shard update requests only. The socket
  and
 connection timeout inside the shardHandlerFactory section apply for
 inter-shard search requests.

 On Fri, Dec 12, 2014 at 8:38 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

   Btw, are the following timeouts still supported in solr.xml, and do they
   only apply to distributed search?

     <shardHandlerFactory name="shardHandlerFactory"
                          class="HttpShardHandlerFactory">
       <int name="socketTimeout">${socketTimeout:0}</int>
       <int name="connTimeout">${connTimeout:0}</int>
     </shardHandlerFactory>

   Thanks,
   Peter
 
   On Fri, Dec 12, 2014 at 3:14 PM, Peter Keegan peterlkee...@gmail.com
   wrote:

    No, I wasn't aware of these. I will give that a try. If I stop the Solr
    jetty service manually, things recover fine, but the hang occurs when I
    'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a
    15-sec timeout from the stopped node, and expires the session, but the
    Solr leader never gets notified. This seems like a bug in ZK.

    Thanks,
    Peter


    On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar 
    shalinman...@gmail.com wrote:

    Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to
    reasonable values in your solr.xml? These are the timeouts used for
    inter-shard update requests.

    On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com
    wrote:

     We are running SolrCloud in AWS and using their auto scaling groups to
     spin up new Solr replicas when CPU utilization exceeds a threshold for
     a period of time. All is well until the replicas are terminated when
     CPU utilization falls below another threshold. What happens is that
     index updates sent to the Solr leader hang forever in both the Solr
     leader and the SolrJ client app. Searches work fine.  Here are 2
     thread stack traces from the Solr leader and 2 from the client app:

     1) Solr-leader thread doing a distributed commit:
   
     [quoted stack trace trimmed; the full thread dump is in the original
     2014-12-12 post elsewhere in this digest]

Re: Solr hangs on distributed updates

2014-12-15 Thread Peter Keegan
I added distribUpdateConnTimeout and distribUpdateSoTimeout to solr.xml and
the commit did time out. (Btw, is there any way to view solr.xml in the admin
console?)
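
For reference, a minimal sketch of the <solrcloud> section of solr.xml with
those two timeouts set (Solr 4.x format; the millisecond values here are only
illustrative, not recommendations):

  <solrcloud>
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:120000}</int>
  </solrcloud>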

Also, although we do have an init.d start/stop script for Solr, the 'stop'
command was not executed during shutdown because there was no lock file for
the script in '/var/lock/subsys'. I didn't know about this until I googled
around and found
http://www.redhat.com/magazine/008jun05/departments/tips_tricks. When I
added the lock file, both the AWS 'stop' and 'terminate' actions resulted
in an orderly shutdown of the replica, which caused the Solr leader to get
an exception and update live_nodes gracefully.

So now, the timeouts should only play a backup role.

Thanks for the help,
Peter


On Fri, Dec 12, 2014 at 5:21 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : No, I wasn't aware of these. I will give that a try. If I stop the Solr
 : jetty service manually, things recover fine, but the hang occurs when I
 : 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a

 I don't know squat about AWS Auto-Scaling, (and barely anything about AWS)
 but what you describe makes it sound like maybe your machine (ie AMI?)
 isn't really configured very well?

 Do you have some init.d/systemd type scripts to ensure a clean shutdown of
 Solr when the machine is shut down/rebooted?  That seems like a pretty good
 idea in general (independent of whether you are using Auto-Scaling) and --
 assuming AWS auto-scaling does clean OS shutdowns when terminating
 instances -- would probably solve your problem.  It would help ensure you
 would never have to wait on the timeouts -- the nodes will each explicitly
 tell ZK they are going bye-bye.

 if you do have things set up so that *manually* shutting down your
 instances executes a clean shutdown of solr, but AWS Auto-Scaling is
 actually totally brutal and doesn't even do a clean shutdown of your
 virtual machines -- just yanks the virtual power cord -- perhaps you could
 implement one of these LifecycleHook options that popped up when I did
 some googling for AWS Auto-Scale termination to explicitly do a clean
 shutdown of the Solr process before the machine vanishes into thin air?



 -Hoss
 http://www.lucidworks.com/



Re: Solr hangs on distributed updates

2014-12-15 Thread Peter Keegan
If a timeout occurs, does the distributed update then go to the next
replica?

On Fri, Dec 12, 2014 at 3:42 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Sorry I should have specified. These timeouts go inside the solrcloud
 section and apply for inter-shard update requests only. The socket and
 connection timeout inside the shardHandlerFactory section apply for
 inter-shard search requests.

 On Fri, Dec 12, 2014 at 8:38 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Btw, are the following timeouts still supported in solr.xml, and do they
  only apply to distributed search?

    <shardHandlerFactory name="shardHandlerFactory"
                         class="HttpShardHandlerFactory">
      <int name="socketTimeout">${socketTimeout:0}</int>
      <int name="connTimeout">${connTimeout:0}</int>
    </shardHandlerFactory>

  Thanks,
  Peter
 
  On Fri, Dec 12, 2014 at 3:14 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
 
   No, I wasn't aware of these. I will give that a try. If I stop the Solr
   jetty service manually, things recover fine, but the hang occurs when I
   'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a
   15-sec timeout from the stopped node, and expires the session, but the
  Solr
   leader never gets notified. This seems like a bug in ZK.
  
   Thanks,
   Peter
  
  
   On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar 
   shalinman...@gmail.com wrote:
  
   Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to
   reasonable values in your solr.xml? These are the timeouts used for
   inter-shard update requests.
  
    On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com
    wrote:

     We are running SolrCloud in AWS and using their auto scaling groups to
     spin up new Solr replicas when CPU utilization exceeds a threshold for
     a period of time. All is well until the replicas are terminated when
     CPU utilization falls below another threshold. What happens is that
     index updates sent to the Solr leader hang forever in both the Solr
     leader and the SolrJ client app. Searches work fine.  Here are 2
     thread stack traces from the Solr leader and 2 from the client app:

     1) Solr-leader thread doing a distributed commit:
   
     [quoted stack trace trimmed; the full thread dump is in the original
     2014-12-12 post elsewhere in this digest]

Solr hangs on distributed updates

2014-12-12 Thread Peter Keegan
We are running SolrCloud in AWS and using their auto scaling groups to spin
up new Solr replicas when CPU utilization exceeds a threshold for a period
of time. All is well until the replicas are terminated when CPU utilization
falls below another threshold. What happens is that index updates sent to
the Solr leader hang forever in both the Solr leader and the SolrJ client
app. Searches work fine.  Here are 2 thread stack traces from the Solr
leader and 2 from the client app:

1) Solr-leader thread doing a distributed commit:

Thread 23527: (state = IN_NATIVE)
 - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[],
int, int, int) @bci=0 (Compiled frame; information may be imprecise)
 - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150
(Compiled frame)
 - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121
(Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71,
line=166 (Compiled frame)
 - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90
(Compiled frame)
 -
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer)
@bci=137, line=281 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer)
@bci=16, line=92 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer)
@bci=2, line=61 (Compiled frame)
 - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254
(Compiled frame)
 -
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader()
@bci=8, line=289 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader()
@bci=1, line=252 (Compiled frame)
 -
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader()
@bci=6, line=191 (Compiled frame)
 -
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest,
org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext)
@bci=62, line=300 (Compiled frame)
 -
org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest,
org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext)
@bci=60, line=127 (Compiled frame)
 -
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest,
org.apache.http.protocol.HttpContext) @bci=198, line=715 (Compiled frame)
 -
org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost,
org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext)
@bci=574, line=520 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost,
org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext)
@bci=344, line=906 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest,
org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest)
@bci=6, line=784 (Compiled frame)
 -
org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest,
org.apache.solr.client.solrj.ResponseParser) @bci=1175, line=395
(Interpreted frame)
 -
org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest)
@bci=17, line=199 (Interpreted frame)
 -
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(org.apache.solr.client.solrj.SolrRequest)
@bci=101, line=293 (Compiled frame)
 -
org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.update.SolrCmdDistributor$Req)
@bci=127, line=226 (Interpreted frame)
 -
org.apache.solr.update.SolrCmdDistributor.distribCommit(org.apache.solr.update.CommitUpdateCommand,
java.util.List, org.apache.solr.common.params.ModifiableSolrParams)
@bci=112, line=195 (Interpreted frame)
 -
org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(org.apache.solr.update.CommitUpdateCommand)
@bci=174, line=1250 (Interpreted frame)
 -
org.apache.solr.update.processor.LogUpdateProcessor.processCommit(org.apache.solr.update.CommitUpdateCommand)
@bci=61, line=157 (Interpreted frame)
 -
org.apache.solr.handler.RequestHandlerUtils.handleCommit(org.apache.solr.request.SolrQueryRequest,
org.apache.solr.update.processor.UpdateRequestProcessor,
org.apache.solr.common.params.SolrParams, boolean) @bci=100, line=69
(Interpreted frame)
 -
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(org.apache.solr.request.SolrQueryRequest,
org.apache.solr.response.SolrQueryResponse) @bci=60, line=68 (Compiled
frame)
 -
org.apache.solr.handler.RequestHandlerBase.handleRequest(org.apache.solr.request.SolrQueryRequest,
org.apache.solr.response.SolrQueryResponse) @bci=43, line=135 (Compiled
frame)

Re: Solr hangs on distributed updates

2014-12-12 Thread Peter Keegan
No, I wasn't aware of these. I will give that a try. If I stop the Solr
jetty service manually, things recover fine, but the hang occurs when I
'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a
15-sec timeout from the stopped node, and expires the session, but the Solr
leader never gets notified. This seems like a bug in ZK.

Thanks,
Peter


On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to
 reasonable values in your solr.xml? These are the timeouts used for
 inter-shard update requests.

 On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

   We are running SolrCloud in AWS and using their auto scaling groups to
   spin up new Solr replicas when CPU utilization exceeds a threshold for a
   period of time. All is well until the replicas are terminated when CPU
   utilization falls below another threshold. What happens is that index
   updates sent to the Solr leader hang forever in both the Solr leader and
   the SolrJ client app. Searches work fine.  Here are 2 thread stack traces
   from the Solr leader and 2 from the client app:

   1) Solr-leader thread doing a distributed commit:
 
   [quoted stack trace trimmed; the full thread dump is in the original
   2014-12-12 post elsewhere in this digest]

Re: Solr hangs on distributed updates

2014-12-12 Thread Peter Keegan
Btw, are the following timeouts still supported in solr.xml, and do they
only apply to distributed search?

  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>

Thanks,
Peter

On Fri, Dec 12, 2014 at 3:14 PM, Peter Keegan peterlkee...@gmail.com
wrote:

 No, I wasn't aware of these. I will give that a try. If I stop the Solr
 jetty service manually, things recover fine, but the hang occurs when I
 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a
 15-sec timeout from the stopped node, and expires the session, but the Solr
 leader never gets notified. This seems like a bug in ZK.

 Thanks,
 Peter


 On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to
 reasonable values in your solr.xml? These are the timeouts used for
 inter-shard update requests.

 On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

   We are running SolrCloud in AWS and using their auto scaling groups to
   spin up new Solr replicas when CPU utilization exceeds a threshold for a
   period of time. All is well until the replicas are terminated when CPU
   utilization falls below another threshold. What happens is that index
   updates sent to the Solr leader hang forever in both the Solr leader and
   the SolrJ client app. Searches work fine.  Here are 2 thread stack traces
   from the Solr leader and 2 from the client app:

   1) Solr-leader thread doing a distributed commit:
 
   [quoted stack trace trimmed; the full thread dump is in the original
   2014-12-12 post elsewhere in this digest]

Re: Solr hangs on distributed updates

2014-12-12 Thread Peter Keegan
 The Solr leader should stop sending requests to the stopped replica once
 that replica's live node is removed from ZK (after session expiry).

Fwiw, here's the Zookeeper log entry for a graceful shutdown of the Solr
replica:

2014-12-12 15:04:21,304 [myid:2] - INFO  [ProcessThread(sid:2
cport:8181)::PrepRequestProcessor@476] - Processed session termination for
sessionid: 0x34a1701a1df0037

And here's the Zookeeper log entry for a non-graceful shutdown via EC2 stop
or terminate of the replica:

2014-12-12 14:19:22,000 [myid:2] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x14a1700c19c003f, timeout of 15000ms exceeded
2014-12-12 14:19:22,001 [myid:2] - INFO  [ProcessThread(sid:2
cport:8181)::PrepRequestProcessor@476] - Processed session termination for
sessionid: 0x14a1700c19c003f

There was no hang in the graceful shutdown.
I'm running ZK version 3.4.5 and Solr 4.6.1.

Peter
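
The 15000ms in that expiry message lines up with the ZK session timeout that
Solr registers, which is set by zkClientTimeout in the <solrcloud> section of
solr.xml. A minimal sketch (15000 here is simply the value implied by the log
above, not a recommendation):

  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
  </solrcloud>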




On Fri, Dec 12, 2014 at 3:39 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Okay, that should solve the hung threads on the leader.

 When you stop the jetty service then it is a graceful shutdown where
 existing requests finish before the searcher thread pool is shut down
 completely. An EC2 terminate probably just kills the processes and leader
 threads just wait due to a lack of read/connection timeouts.

 The Solr leader should stop sending requests to the stopped replica once
 that replica's live node is removed from ZK (after session expiry). I think
 most of these issues are because of the lack of timeouts. Just add them and
 if there are more problems, we can discuss more.

 On Fri, Dec 12, 2014 at 8:14 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  No, I wasn't aware of these. I will give that a try. If I stop the Solr
  jetty service manually, things recover fine, but the hang occurs when I
  'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a
  15-sec timeout from the stopped node, and expires the session, but the
 Solr
  leader never gets notified. This seems like a bug in ZK.
 
  Thanks,
  Peter
 
 
  On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
   Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to
   reasonable values in your solr.xml? These are the timeouts used for
   inter-shard update requests.
  
   On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
     We are running SolrCloud in AWS and using their auto scaling groups to
     spin up new Solr replicas when CPU utilization exceeds a threshold for
     a period of time. All is well until the replicas are terminated when
     CPU utilization falls below another threshold. What happens is that
     index updates sent to the Solr leader hang forever in both the Solr
     leader and the SolrJ client app. Searches work fine.  Here are 2
     thread stack traces from the Solr leader and 2 from the client app:

     1) Solr-leader thread doing a distributed commit:
   
     [quoted stack trace trimmed; the full thread dump is in the original
     2014-12-12 post elsewhere in this digest]

Re: Solr hangs on distributed updates

2014-12-12 Thread Peter Keegan
The AMIs are Red Hat (not Amazon's) and the instances are properly sized
for the environment (t1.micro for ZK, m3.xlarge for Solr). I do plan to add
hooks for a clean shutdown of Solr when the VM is shut down, but if Solr
takes too long, AWS may clobber it anyway. One frustrating part of auto
scaling shutdown is that you can't log into the 'vanishing machine' to view
the logs.

Peter

On Fri, Dec 12, 2014 at 5:21 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : No, I wasn't aware of these. I will give that a try. If I stop the Solr
 : jetty service manually, things recover fine, but the hang occurs when I
 : 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a

 I don't know squat about AWS Auto-Scaling, (and barely anything about AWS)
 but what you describe makes it sound like maybe your machine (ie AMI?)
 isn't really configured very well?

 Do you have some init.d/systemd type scripts to ensure a clean shutdown of
 Solr when the machine is shut down/rebooted?  That seems like a pretty good
 idea in general (independent of whether you are using Auto-Scaling) and --
 assuming AWS auto-scaling does clean OS shutdowns when terminating
 instances -- would probably solve your problem.  It would help ensure you
 would never have to wait on the timeouts -- the nodes will each explicitly
 tell ZK they are going bye-bye.

 if you do have things set up so that *manually* shutting down your
 instances executes a clean shutdown of solr, but AWS Auto-Scaling is
 actually totally brutal and doesn't even do a clean shutdown of your
 virtual machines -- just yanks the virtual power cord -- perhaps you could
 implement one of these LifecycleHook options that popped up when I did
 some googling for AWS Auto-Scale termination to explicitly do a clean
 shutdown of the Solr process before the machine vanishes into thin air?



 -Hoss
 http://www.lucidworks.com/



Solr exceptions during batch indexing

2014-11-07 Thread Peter Keegan
How are folks handling Solr exceptions that occur during batch indexing?
Solr stops parsing the docs stream when an error occurs (e.g. a doc with a
missing mandatory field), and stops indexing the batch. The bad document is
not identified, so it would be hard for the client to recover by skipping
over it.

Peter


Re: Solr exceptions during batch indexing

2014-11-07 Thread Peter Keegan
I'm seeing 9X throughput with 1000 docs/batch vs 1 doc/batch, with a single
thread, so it's certainly worth it.

Thanks,
Peter


On Fri, Nov 7, 2014 at 2:18 PM, Erick Erickson erickerick...@gmail.com
wrote:

 And Walter has also been around for a _long_ time ;)

 (sorry, couldn't resist)

 Erick

 On Fri, Nov 7, 2014 at 11:12 AM, Walter Underwood wun...@wunderwood.org
 wrote:
  Yes, I implemented exactly that fallback for Solr 1.2 at Netflix.
 
  It isn’t too hard if the code is structured for it; retry with a batch
 size of 1.
 
  wunder
 
  On Nov 7, 2014, at 11:01 AM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Yeah, this has been an ongoing issue for a _long_ time. Basically,
  you can't. So far, people have essentially written fallback logic to
  index the docs of a failing packet one at a time and report it.
 
  I'd really like better reporting back, but we haven't gotten there yet.
 
  Best,
  Erick
 
  On Fri, Nov 7, 2014 at 8:25 AM, Peter Keegan peterlkee...@gmail.com
 wrote:
  How are folks handling Solr exceptions that occur during batch
 indexing?
  Solr stops parsing the docs stream when an error occurs (e.g. a doc
 with a
  missing mandatory field), and stops indexing the batch. The bad
 document is
  not identified, so it would be hard for the client to recover by
 skipping
  over it.
 
  Peter
 



Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Peter Keegan
Regarding batch indexing:
When I send batches of 1000 docs to a standalone Solr server, the log file
reports (1000 adds) in LogUpdateProcessor. But when I send them to the
leader of a replicated index, the leader log file reports much smaller
numbers, usually (12 adds). Why do the batches appear to be broken up?

Peter

On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson erickerick...@gmail.com
wrote:

 NP, just making sure.

 I suspect you'll get lots more bang for the buck, and
 results much more closely matching your expectations if

 1 you batch up a bunch of docs at once rather than
 sending them one at a time. That's probably the easiest
 thing to try. Sending docs one at a time is something of
 an anti-pattern. I usually start with batches of 1,000.

 And just to check.. You're not issuing any commits from the
 client, right? Performance will be terrible if you issue commits
 after every doc, that's totally an anti-pattern. Doubly so for
 optimizes. Since you showed us your solrconfig & autocommit
 settings I'm assuming not but want to be sure.

 2 use a leader-aware client. I'm totally unfamiliar with Go,
 so I have no suggestions whatsoever to offer there But you'll
 want to batch in this case too.

 On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com wrote:
  Hi Erick -
 
  Thanks for the detailed response and apologies for my confusing
  terminology.  I should have said WPS (writes per second) instead of QPS
  but I didn't want to introduce a weird new acronym since QPS is well
  known.  Clearly a bad decision on my part.  To clarify: I am doing
  *only* writes
  (document adds).  Whenever I wrote QPS I was referring to writes.
 
  It seems clear at this point that I should wrap up the code to do smart
  routing rather than choose Solr nodes randomly.  And then see if that
  changes things.  I must admit that although I understand that random node
  selection will impose a performance hit, theoretically it seems to me
 that
  the system should still scale up as you add more nodes (albeit at lower
  absolute level of performance than if you used a smart router).
  Nonetheless, I'm just theorycrafting here so the better thing to do is
 just
  try it experimentally.  I hope to have that working today - will report
  back on my findings.
 
  Cheers,
  - Ian
 
  p.s. To clarify why we are rolling our own smart router code, we use Go
  over here rather than Java.  Although if we still get bad performance
 with
  our custom Go router I may try a pure Java load client using
  CloudSolrServer to eliminate the possibility of bugs in our
 implementation.
 
 
  On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  I'm really confused:
 
  bq: I am not issuing any queries, only writes (document inserts)
 
  bq: It's clear that once the load test client has ~40 simulated users
 
  bq: A cluster of 3 shards over 3 Solr nodes *should* support
  a higher QPS than 2 shards over 2 Solr nodes, right
 
  QPS is usually used to mean Queries Per Second, which is different
 from
  the statement that I am not issuing any queries. And what do the
  number of users have to do with inserting documents?
 
  You also state:  In many cases, CPU on the solr servers is quite low as
  well
 
  So let's talk about indexing first. Indexing should scale nearly
  linearly as long as
  1 you are routing your docs to the correct leader, which happens with
  SolrJ
  and the CloudSolrServer automatically. Rather than rolling your own, I
  strongly
  suggest you try this out.
  2 you have enough clients feeding the cluster to push CPU utilization
  on them all.
  Very often slow indexing, or in your case lack of scaling is a
  result of document
  acquisition or, in your case, your doc generator is spending all its
  time waiting for
  the individual documents to get to Solr and come back.
 
  bq: chooses a random solr server for each ADD request (with 1 doc per
 add
  request)
 
  Probably your culprit right there. Each and every document requires that
  you
  have to cross the network (and forward that doc to the correct leader).
 So
  given
  that you're not seeing high CPU utilization, I suspect that you're not
  sending
  enough docs to SolrCloud fast enough to see scaling. You need to batch
 up
  multiple docs, I generally send 1,000 docs at a time.
 
  But even if you do solve this, the inter-node routing will prevent
  linear scaling.
  When a doc (or a batch of docs) goes to a random Solr node, here's what
  happens:
  1 the docs are re-packaged into groups based on which shard they're
  destined for
  2 the sub-packets are forwarded to the leader for each shard
  3 the responses are gathered back and returned to the client.
 
  This set of operations will eventually degrade the scaling.
 
  bq:  A cluster of 3 shards over 3 Solr nodes *should* support
  a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole
 idea
  behind sharding.
 
  If we're talking search 

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Peter Keegan
Yes, I was inadvertently sending them to a replica. When I sent them to the
leader, the leader reported (1000 adds) and the replica reported only 1 add
per document. So, it looks like the leader forwards the batched jobs
individually to the replicas.
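
For reference, a minimal SolrJ (4.x era) sketch of the leader-aware, batched
indexing Erick recommends in the quoted reply below, using CloudSolrServer so
each document is routed to its shard leader. The ZooKeeper hosts, collection
name and batch size are placeholders:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class CloudIndexer {
    public static void main(String[] args) throws Exception {
      CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
      server.setDefaultCollection("collection1");

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 10000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title", "test doc " + i);
        batch.add(doc);
        if (batch.size() == 1000) {   // send ~1,000 docs per request
          server.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        server.add(batch);
      }
      server.commit();                // rely on autoCommit instead in production
      server.shutdown();
    }
  }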

On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Internally, the docs are batched up into smaller buckets (10 as I
 remember) and forwarded to the correct shard leader. I suspect that's
 what you're seeing.

 Erick

 On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
  Regarding batch indexing:
  When I send batches of 1000 docs to a standalone Solr server, the log
 file
  reports (1000 adds) in LogUpdateProcessor. But when I send them to the
  leader of a replicated index, the leader log file reports much smaller
  numbers, usually (12 adds). Why do the batches appear to be broken up?
 
  Peter
 
  On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  NP, just making sure.
 
  I suspect you'll get lots more bang for the buck, and
  results much more closely matching your expectations if
 
  1 you batch up a bunch of docs at once rather than
  sending them one at a time. That's probably the easiest
  thing to try. Sending docs one at a time is something of
  an anti-pattern. I usually start with batches of 1,000.
 
  And just to check.. You're not issuing any commits from the
  client, right? Performance will be terrible if you issue commits
  after every doc, that's totally an anti-pattern. Doubly so for
   optimizes. Since you showed us your solrconfig & autocommit
  settings I'm assuming not but want to be sure.
 
  2 use a leader-aware client. I'm totally unfamiliar with Go,
  so I have no suggestions whatsoever to offer there But you'll
  want to batch in this case too.
 
  On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com
 wrote:
   Hi Erick -
  
   Thanks for the detailed response and apologies for my confusing
   terminology.  I should have said WPS (writes per second) instead of
 QPS
   but I didn't want to introduce a weird new acronym since QPS is well
   known.  Clearly a bad decision on my part.  To clarify: I am doing
   *only* writes
   (document adds).  Whenever I wrote QPS I was referring to writes.
  
   It seems clear at this point that I should wrap up the code to do
 smart
   routing rather than choose Solr nodes randomly.  And then see if that
   changes things.  I must admit that although I understand that random
 node
   selection will impose a performance hit, theoretically it seems to me
  that
   the system should still scale up as you add more nodes (albeit at
 lower
   absolute level of performance than if you used a smart router).
   Nonetheless, I'm just theorycrafting here so the better thing to do is
  just
   try it experimentally.  I hope to have that working today - will
 report
   back on my findings.
  
   Cheers,
   - Ian
  
   p.s. To clarify why we are rolling our own smart router code, we use
 Go
   over here rather than Java.  Although if we still get bad performance
  with
   our custom Go router I may try a pure Java load client using
   CloudSolrServer to eliminate the possibility of bugs in our
  implementation.
  
  
   On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson 
 erickerick...@gmail.com
  
   wrote:
  
   I'm really confused:
  
   bq: I am not issuing any queries, only writes (document inserts)
  
   bq: It's clear that once the load test client has ~40 simulated users
  
   bq: A cluster of 3 shards over 3 Solr nodes *should* support
   a higher QPS than 2 shards over 2 Solr nodes, right
  
   QPS is usually used to mean Queries Per Second, which is different
  from
   the statement that I am not issuing any queries. And what do
 the
   number of users have to do with inserting documents?
  
   You also state:  In many cases, CPU on the solr servers is quite
 low as
   well
  
   So let's talk about indexing first. Indexing should scale nearly
   linearly as long as
   1 you are routing your docs to the correct leader, which happens
 with
   SolrJ
    and the CloudSolrServer automatically. Rather than rolling your own, I
   strongly
   suggest you try this out.
   2 you have enough clients feeding the cluster to push CPU
 utilization
   on them all.
   Very often slow indexing, or in your case lack of scaling is a
   result of document
    acquisition or, in your case, your doc generator is spending all its
   time waiting for
   the individual documents to get to Solr and come back.
  
   bq: chooses a random solr server for each ADD request (with 1 doc
 per
  add
   request)
  
   Probably your culprit right there. Each and every document requires
 that
   you
   have to cross the network (and forward that doc to the correct
 leader).
  So
   given
   that you're not seeing high CPU utilization, I suspect that you're
 not
   sending
   enough docs to SolrCloud fast enough to see scaling

Re: QParserPlugin question

2014-10-24 Thread Peter Keegan
Thanks for the advice. I've moved this query rewriting logic (not really
business logic) to a SearchComponent and will leave the custom query parser
to deal with the keyword (q=) related aspects of the query. In my case, the
latter is mostly dealing with the presence of wildcard characters.

Peter
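
For anyone trying the same thing, a rough sketch of what such a
SearchComponent can look like in Solr 4.x: prepare() checks for the *:* case,
rewrites the request params, and reports what it did in its own response
section, as Hoss suggests below. The filter name and sort are hypothetical
placeholders, not the actual component:

  import java.io.IOException;
  import org.apache.solr.common.params.CommonParams;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;

  public class MatchAllRewriteComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
      String q = rb.req.getParams().get(CommonParams.Q);
      if ("*:*".equals(q)) {
        ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
        // drop the expensive post filter and change the sort (names are made up)
        params.remove("fq", "{!myPostFilter}");
        params.set(CommonParams.SORT, "date desc");
        rb.req.setParams(params);
        // tell the client what was rewritten, in a separate response section
        rb.rsp.add("queryRewrite", "matchAllDocs: removed post filter, sort=date desc");
      }
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      // nothing to do here; the rewrite happens in prepare()
    }

    @Override
    public String getDescription() { return "Rewrites *:* requests"; }

    @Override
    public String getSource() { return null; }
  }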


On Wed, Oct 22, 2014 at 6:35 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : It's for an optimization. If the keyword is 'match all docs', I want to
 : remove a custom PostFilter from the query and change the sort parameters
 : (so the app doesn't have to do it).  It looks like the responseHeader is
 : displaying the 'originalParams', which are immutable.

 that is in fact the point of including the params in the header - to make
 it clear what exactly the request handler got as input.

 echoParams can be used to control whether you get all the params
 (including those added as defaults/appends in configuration) or just the
 explicit params included in the request -- but there's no way for a
 QParserPlugin to change what the raw query param strings are -- the query
 it produces might not even have a meaningful toString.

 the params in the header are there for the very explicit reason of showing
 you exactly what input was used to produce this request -- if plugins
 could change them, they would be meaningless since the modified params
 might not produce the same request.

 if you want to have a custom plugin that applies business logic to change
 the behavior internally and reports back info for hte client to use in
 future requests, i would suggest doing that as a SearchComponent and
 including your own section in the response with details about what the
 client should do moving forward.


 (for example: i had a search component once upon a time that applied
 QueryElevationComponent type checking against the query string & filters,
 and based on what it found would set the sort & add some filters unless an
 explicit sort / filter params were provided by the client -- the sort &
 filters that were added were included along with some additional metadata
 about what rule was matched in a new section of the response.)


 -Hoss
 http://www.lucidworks.com/



QParserPlugin question

2014-10-22 Thread Peter Keegan
I have a custom query parser that modifies the filter query list based on
the keyword query. This works, but the 'fq' list in the responseHeader
contains the original filter list. The debugQuery output does display the
modified filter list. Is there a way to change the responseHeader? I could
probably do this in a custom QueryComponent, but the query parser seems
like a reasonable place to do this.

Thanks,
Peter


Re: QParserPlugin question

2014-10-22 Thread Peter Keegan
It's for an optimization. If the keyword is 'match all docs', I want to
remove a custom PostFilter from the query and change the sort parameters
(so the app doesn't have to do it).  It looks like the responseHeader is
displaying the 'originalParams', which are immutable.

On Wed, Oct 22, 2014 at 2:10 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com
wrote:

 I don't know why you need to change it ? you can use omitHeader=true on
 the
 URL to remove header if you want.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/QParserPlugin-question-tp4165368p4165373.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: QParserPlugin question

2014-10-22 Thread Peter Keegan
I meant to say: If the keyword is *:* (MatchAllDocsQuery)...

On Wed, Oct 22, 2014 at 2:17 PM, Peter Keegan peterlkee...@gmail.com
wrote:

 It's for an optimization. If the keyword is 'match all docs', I want to
 remove a custom PostFilter from the query and change the sort parameters
 (so the app doesn't have to do it).  It looks like the responseHeader is
 displaying the 'originalParams', which are immutable.

 On Wed, Oct 22, 2014 at 2:10 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com
 wrote:

 I don't know why you need to change it ? you can use omitHeader=true on
 the
 URL to remove header if you want.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/QParserPlugin-question-tp4165368p4165373.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: Does Solr support this?

2014-10-16 Thread Peter Keegan
I'm doing something similar with a custom search component. See SOLR-6502
https://issues.apache.org/jira/browse/SOLR-6502

On Thu, Oct 16, 2014 at 8:14 AM, Upayavira u...@odoko.co.uk wrote:

 Nope, not yet.

 Someone did propose a JavascriptRequestHandler or such, which would
 allow you to code such things in Javascript (obviously), but I don't
 believe that has been accepted or completed yet.

 Upayavira

 On Thu, Oct 16, 2014, at 03:48 AM, Aaron Lewis wrote:
  Hi,
 
  I'm trying to do an 'if the first query is empty, then do a second query', e.g.
 
  if this returns no rows:
  title:XX AND subject:YY
 
  Then do a
  title:XX
 
  I can do that with two queries. But I'm wondering if I can merge them
  into a single one?
 
  --
  Best Regards,
  Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
  Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33



Question about filter cache size

2014-10-03 Thread Peter Keegan
Say I have a boolean field named 'hidden', and less than 1% of the
documents in the index have hidden=true.
Do both these filter queries use the same docset cache size? :
fq=hidden:false
fq=!hidden:true

Peter


Re: Question about filter cache size

2014-10-03 Thread Peter Keegan
 it will be cached as hidden:true and then inverted
Inverted at query time, so for best query performance use fq=hidden:false,
right?

On Fri, Oct 3, 2014 at 3:57 PM, Yonik Seeley yo...@heliosearch.com wrote:

 On Fri, Oct 3, 2014 at 3:42 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
  Say I have a boolean field named 'hidden', and less than 1% of the
  documents in the index have hidden=true.
  Do both these filter queries use the same docset cache size? :
  fq=hidden:false
  fq=!hidden:true

 Nope... !hidden:true will be smaller in the cache (it will be cached
 as hidden:true and then inverted)
 The downside is that you'll pay the cost of that inversion.

 -Yonik
 http://heliosearch.org - native code faceting, facet functions,
 sub-facets, off-heap data



Re: MaxScore

2014-09-17 Thread Peter Keegan
See if SOLR-5831 https://issues.apache.org/jira/browse/SOLR-5831 helps.

Peter

On Tue, Sep 16, 2014 at 11:32 PM, William Bell billnb...@gmail.com wrote:

 What we need is a function like scale(field,min,max) but one that only
 operates on the documents that come back in the search results.

 scale() takes the min, max from the field in the index, not necessarily
 those in the results.

 I cannot think of a solution. max() only looks at one field, not across
 fields in the results.

 I tried a query() but cannot think of a way to get the max value of a field
 ONLY in the results...

 Ideas?


 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Re: Edismax mm and efficiency

2014-09-10 Thread Peter Keegan
I implemented a custom QueryComponent that issues the edismax query with
mm=100%, and if no results are found, it reissues the query with mm=1. This
doubled our query throughput (compared to mm=1 always), as we do some
expensive RankQuery processing. For your very long student queries, mm=100%
would obviously be too high, so you'd have to experiment.
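
Roughly, the unsharded version looks like the sketch below (simplified; the
real component also has to handle the sharded merge, and the 'response'
cleanup is needed because the first pass already added an empty result):

import java.io.IOException;

import org.apache.solr.common.params.DisMaxParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

// Sketch of the two-pass idea for a single core: parse and run with mm=100%,
// and only if nothing matched, re-parse and run again with mm=1.
public class StrictThenLooseQueryComponent extends QueryComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    ModifiableSolrParams strict = new ModifiableSolrParams(rb.req.getParams());
    strict.set(DisMaxParams.MM, "100%");
    rb.req.setParams(strict);
    super.prepare(rb);                       // q is parsed here with mm=100%
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    super.process(rb);                       // first pass: conjunction
    if (rb.getResults() != null && rb.getResults().docList.matches() == 0) {
      rb.rsp.getValues().remove("response"); // drop the empty first-pass result
      ModifiableSolrParams loose = new ModifiableSolrParams(rb.req.getParams());
      loose.set(DisMaxParams.MM, "1");
      rb.req.setParams(loose);
      super.prepare(rb);                     // re-parse so mm=1 takes effect
      super.process(rb);                     // second pass: disjunction
    }
  }
}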

On Fri, Sep 5, 2014 at 1:34 PM, Walter Underwood wun...@wunderwood.org
wrote:

 Great!

 We have some very long queries, where students paste entire homework
 problems. One of them was 1051 words. Many of them are over 100 words. This
 could help.

 In the Jira discussion, I saw some comments about handling the most sparse
 lists first. We did something like that in the Infoseek Ultra engine about
 twenty years ago. Short termlists (documents matching a term) were
 processed first, which kept the in-memory lists of matching docs small. It
 also allowed early short-circuiting for no-hits queries.

 What would be a high mm value, 75%?

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/


 On Sep 4, 2014, at 11:52 PM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

  indeed https://issues.apache.org/jira/browse/LUCENE-4571
  my feeling is it gives a significant gain for high mm values.
 
 
 
  On Fri, Sep 5, 2014 at 3:01 AM, Walter Underwood wun...@wunderwood.org
  wrote:
 
  Are there any speed advantages to using “mm”? I can imagine pruning the
  set of matching documents early, which could help, but is that (or
  something else) done?
 
  wunder
  Walter Underwood
  wun...@wunderwood.org
  http://observer.wunderwood.org/
 
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com




Re: Edismax mm and efficiency

2014-09-10 Thread Peter Keegan
Sure. I created SOLR-6502. The tricky part was handling the behavior in a
sharded index. When the index is sharded, the response from each shard will
contain a parameter that indicates if the search results are from the
conjunction of all keywords (mm=100%), or from disjunction (mm=1). If the
shards contain both types, then only return the results from the
conjunction. This is necessary in order to get the same results independent
of the number of shards.
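
For the merge step, the check amounts to something like the helper below
('strictPass' is a made-up flag name; each shard would attach it to its
response to say which pass produced its hits):

import java.util.List;

import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ShardResponse;

// Sketch of the merge-time inspection: if any shard answered from the strict
// (mm=100%) pass, results coming from loose-pass shards are dropped so the
// final list is independent of the shard count.
public class ShardPassInspector {

  public static boolean anyStrictPass(List<ShardResponse> responses) {
    for (ShardResponse srsp : responses) {
      NamedList<Object> nl = srsp.getSolrResponse().getResponse();
      if (Boolean.TRUE.equals(nl.get("strictPass"))) {   // hypothetical flag
        return true;
      }
    }
    return false;
  }
}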

Peter

On Wed, Sep 10, 2014 at 11:07 AM, Walter Underwood wun...@wunderwood.org
wrote:

 We do that strict/loose query sequence, but on the client side with two
 requests. Would you consider contributing the QueryComponent?

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/


 On Sep 10, 2014, at 3:47 AM, Peter Keegan peterlkee...@gmail.com wrote:

  I implemented a custom QueryComponent that issues the edismax query with
  mm=100%, and if no results are found, it reissues the query with mm=1.
 This
  doubled our query throughput (compared to mm=1 always), as we do some
  expensive RankQuery processing. For your very long student queries,
 mm=100%
  would obviously be too high, so you'd have to experiment.
 
  On Fri, Sep 5, 2014 at 1:34 PM, Walter Underwood wun...@wunderwood.org
  wrote:
 
  Great!
 
  We have some very long queries, where students paste entire homework
  problems. One of them was 1051 words. Many of them are over 100 words.
 This
  could help.
 
  In the Jira discussion, I saw some comments about handling the most
 sparse
  lists first. We did something like that in the Infoseek Ultra engine
 about
  twenty years ago. Short termlists (documents matching a term) were
  processed first, which kept the in-memory lists of matching docs small.
 It
  also allowed early short-circuiting for no-hits queries.
 
  What would be a high mm value, 75%?
 
  wunder
  Walter Underwood
  wun...@wunderwood.org
  http://observer.wunderwood.org/
 
 
  On Sep 4, 2014, at 11:52 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:
 
  indeed https://issues.apache.org/jira/browse/LUCENE-4571
  my feeling is it gives a significant gain for high mm values.
 
 
 
  On Fri, Sep 5, 2014 at 3:01 AM, Walter Underwood 
 wun...@wunderwood.org
  wrote:
 
  Are there any speed advantages to using “mm”? I can imagine pruning
 the
  set of matching documents early, which could help, but is that (or
  something else) done?
 
  wunder
  Walter Underwood
  wun...@wunderwood.org
  http://observer.wunderwood.org/
 
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com
 
 




Re: ExternalFileFieldReloader and commit

2014-08-06 Thread Peter Keegan
I entered SOLR-6326 https://issues.apache.org/jira/browse/SOLR-6326

thanks,
Peter


On Tue, Aug 5, 2014 at 6:50 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 Hi Peter,

 It seems like a bug to me, too. Please file a JIRA ticket if you can
 so that someone can take it.

 Koji
 --
 http://soleami.com/blog/comparing-document-classification-functions-of-
 lucene-and-mahout.html


 (2014/08/05 22:34), Peter Keegan wrote:

 When there are multiple 'external file field' files available, Solr will
 reload the last one (lexicographically) with a commit, but only if changes
 were made to the index. Otherwise, it skips the reload and logs: No
 uncommitted changes. Skipping IW.commit.  Has anyone else noticed this?
 It
 seems like a bug to me. (yes, I do have firstSearcher and newSearcher
 event
 listeners in solrconfig.xml)

 Peter







Re: ExternalFileFieldReloader and commit

2014-08-06 Thread Peter Keegan
The use case is:

1. A SolrJ client updates the main index (and replicas) and issues a commit
at regular intervals.
2. Another component updates the external files at other intervals.

Usually, the commits result in a new searcher which triggers the
org.apache.solr.schema.ExternalFileFieldReloader, but only if there were
changes to the main index.

Using ReloadCacheRequestHandler in (2) above would result in the loss of
index/replica synchronization provided by the commit in (1), and reloading
the core is slow and overkill. I think it would be easier to have the SolrJ
client in (1) always update a dummy document during each commit interval to
force a new searcher.
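
Something along these lines on the SolrJ side (a sketch; the sentinel id is
made up and would need to be excluded from queries, e.g. with a filter):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: re-add a sentinel document before each commit so there is always an
// uncommitted change, which forces a new searcher (and the EFF reload).
public class DummyDocCommitter {

  public static void commitWithDummy(SolrServer server) throws Exception {
    SolrInputDocument dummy = new SolrInputDocument();
    dummy.addField("id", "__commit_sentinel__");   // assumes uniqueKey is "id"
    server.add(dummy);
    server.commit();
  }

  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    commitWithDummy(server);
    server.shutdown();
  }
}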

Thanks,
Peter


On Wed, Aug 6, 2014 at 8:43 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Peter,

 Given that SOLR-6326 is about a bug in ExternalFileFieldReloader, I'm asking
 here:
 Did you try to use
 org.apache.solr.search.function.FileFloatSource.ReloadCacheRequestHandler ?
 Let me know if you need help with it.
 As a workaround you can reload the core via REST or click a button at
 SolrAdmin, your questions are welcome.



 On Wed, Aug 6, 2014 at 4:02 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  I entered SOLR-6326 https://issues.apache.org/jira/browse/SOLR-6326
 
  thanks,
  Peter
 
 
  On Tue, Aug 5, 2014 at 6:50 PM, Koji Sekiguchi k...@r.email.ne.jp
 wrote:
 
   Hi Peter,
  
   It seems like a bug to me, too. Please file a JIRA ticket if you can
   so that someone can take it.
  
   Koji
   --
  
 http://soleami.com/blog/comparing-document-classification-functions-of-
   lucene-and-mahout.html
  
  
   (2014/08/05 22:34), Peter Keegan wrote:
  
   When there are multiple 'external file field' files available, Solr
 will
   reload the last one (lexicographically) with a commit, but only if
  changes
   were made to the index. Otherwise, it skips the reload and logs: No
   uncommitted changes. Skipping IW.commit.  Has anyone else noticed
 this?
   It
   seems like a bug to me. (yes, I do have firstSearcher and newSearcher
   event
   listeners in solrconfig.xml)
  
   Peter
  
  
  
  
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



ExternalFileFieldReloader and commit

2014-08-05 Thread Peter Keegan
When there are multiple 'external file field' files available, Solr will
reload the last one (lexicographically) with a commit, but only if changes
were made to the index. Otherwise, it skips the reload and logs: No
uncommitted changes. Skipping IW.commit.  Has anyone else noticed this? It
seems like a bug to me. (yes, I do have firstSearcher and newSearcher event
listeners in solrconfig.xml)

Peter


Question about ReRankQuery

2014-07-23 Thread Peter Keegan
I'm looking at how 'ReRankQuery' works. If the main query has a Sort
criteria, it is only used to sort the first pass results. The QueryScorer
used in the second pass only reorders the ScoreDocs based on score and
docid, but doesn't use the original Sort fields. If the Sort criteria is
'score desc, myfield asc', I would expect 'myfield' to break score ties
from the second pass after rescoring.

Is this a bug or the intended behavior?

Thanks,
Peter


Re: Question about ReRankQuery

2014-07-23 Thread Peter Keegan
See http://heliosearch.org/solrs-new-re-ranking-feature/


On Wed, Jul 23, 2014 at 11:27 AM, Erick Erickson erickerick...@gmail.com
wrote:

 I'm having a little trouble understanding the use-case here. Why use
 re-ranking?
 Isn't this just combining the original query with the second query with an
 AND
 and using the original sort?

 At the end, you have your original list in it's original order, with
 (potentially) some
 documents removed that don't satisfy the secondary query.

 Or I'm missing the boat entirely.

 Best,
 Erick


 On Wed, Jul 23, 2014 at 6:31 AM, Peter Keegan peterlkee...@gmail.com
 wrote:

  I'm looking at how 'ReRankQuery' works. If the main query has a Sort
  criteria, it is only used to sort the first pass results. The QueryScorer
  used in the second pass only reorders the ScoreDocs based on score and
  docid, but doesn't use the original Sort fields. If the Sort criteria is
  'score desc, myfield asc', I would expect 'myfield' to break score ties
  from the second pass after rescoring.
 
  Is this a bug or the intended behavior?
 
  Thanks,
  Peter
 



Re: Question about ReRankQuery

2014-07-23 Thread Peter Keegan
 The ReRankingQParserPlugin uses the Lucene QueryRescorer, which only uses
the score from the re-rank query when re-ranking the top N documents.

Understood, but if the re-rank scores produce new ties, wouldn't you want
to resort them with the FieldSortedHitQueue?

Anyway, I was looking to reimplement the ScaleScoreQParser PostFilter
plugin with RankQuery, and would need to implement the behavior of the
DelegateCollector there for handling multiple sort fields.

Peter

On Wednesday, July 23, 2014, Joel Bernstein joels...@gmail.com wrote:

 The ReRankingQParserPlugin uses the Lucene QueryRescorer, which only uses
 the score from the re-rank query when re-ranking the top N documents.

 The ReRankingQParserPlugin is built as a RankQuery plugin so you can swap
 in your own implementation. Patches are also welcome for the existing
 implementation.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Wed, Jul 23, 2014 at 11:37 AM, Peter Keegan peterlkee...@gmail.com
 javascript:;
 wrote:

  See http://heliosearch.org/solrs-new-re-ranking-feature/
 
 
  On Wed, Jul 23, 2014 at 11:27 AM, Erick Erickson 
 erickerick...@gmail.com javascript:;
  wrote:
 
   I'm having a little trouble understanding the use-case here. Why use
   re-ranking?
   Isn't this just combining the original query with the second query with
  an
   AND
   and using the original sort?
  
   At the end, you have your original list in it's original order, with
   (potentially) some
   documents removed that don't satisfy the secondary query.
  
   Or I'm missing the boat entirely.
  
   Best,
   Erick
  
  
   On Wed, Jul 23, 2014 at 6:31 AM, Peter Keegan peterlkee...@gmail.com
 javascript:;
   wrote:
  
I'm looking at how 'ReRankQuery' works. If the main query has a Sort
criteria, it is only used to sort the first pass results. The
  QueryScorer
used in the second pass only reorders the ScoreDocs based on score
 and
docid, but doesn't use the original Sort fields. If the Sort criteria
  is
'score desc, myfield asc', I would expect 'myfield' to break score
 ties
from the second pass after rescoring.
   
Is this a bug or the intended behavior?
   
Thanks,
Peter
   
  
 



Question about solrcloud recovery process

2014-07-03 Thread Peter Keegan
I bring up a new Solr node with no index and watch the index being
replicated from the leader. The index size is 12G and the replication takes
about 6 minutes, according to the replica log (from 'Starting recovery
process' to 'Finished recovery process). However, shortly after the
replication begins, while the index files are being copied, I am able to
query the index on the replica and see q=*:* find all of the documents.
But, from the core admin screen, numDocs = 0, and in the cloud screen the
replica is in 'recovering' mode. How can this be?

Peter


Re: Question about solrcloud recovery process

2014-07-03 Thread Peter Keegan
No, we're not doing NRT. The search clients aren't using CloudSolrServer
and they are behind an AWS load balancer, which calls the Solr ping handler
(implemented with ClusterStateAwarePingRequestHandler) to determine when
the node is active. This ping handler also responds during the index copy,
which doesn't seem right. I'll have to figure out why it does this before
the replica is really active.

Peter


On Thu, Jul 3, 2014 at 9:36 AM, Mark Miller markrmil...@gmail.com wrote:

 I don’t know offhand about the num docs issue - are you doing NRT?

 As far as being able to query the replica, I’m not sure anyone ever got to
 making that fail if you directly query a node that is not active. It
 certainly came up, but I have no memory of anyone tackling it. Of course in
 many other cases, information is being pulled from zookeeper and recovering
 nodes are ignored. If this is the issue I think it is, it should only be an
 issue when you directly query recovery node.

 The CloudSolrServer client works around this issue as well.

 --
 Mark Miller
 about.me/markrmiller

 On July 3, 2014 at 8:42:48 AM, Peter Keegan (peterlkee...@gmail.com)
 wrote:

 I bring up a new Solr node with no index and watch the index being
 replicated from the leader. The index size is 12G and the replication takes
 about 6 minutes, according to the replica log (from 'Starting recovery
 process' to 'Finished recovery process'). However, shortly after the
 replication begins, while the index files are being copied, I am able to
 query the index on the replica and see q=*:* find all of the documents.
 But, from the core admin screen, numDocs = 0, and in the cloud screen the
 replica is in 'recovering' mode. How can this be?

 Peter



Re: Question about solrcloud recovery process

2014-07-03 Thread Peter Keegan
Aha, you are right wrdrvr! The query is forwarded to any of the active
shards (I saw the query alternate between both of mine). Nice feature.
Also, looking at 'ClusterStateAwarePingRequestHandler' (which I downloaded
from www.manning.com/SolrinAction), it is checking zookeeper to see if the
logical shard is active, not the specific 'this' replica, which is in
'recovering' state. I'll post a patch once I figure out the zookeeper api.
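
Roughly the check I have in mind (a sketch against the 4.6 cloud APIs, not
the final patch): look up this core's own replica entry in the cluster state
instead of the logical shard.

import org.apache.solr.cloud.CloudDescriptor;
import org.apache.solr.cloud.ZkController;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;
import org.apache.solr.core.SolrCore;

public class ReplicaStateCheck {

  /** True only if *this* core's replica is marked active in the cluster state. */
  public static boolean thisReplicaIsActive(SolrCore core) {
    CloudDescriptor cloud = core.getCoreDescriptor().getCloudDescriptor();
    if (cloud == null) return true;   // not running in SolrCloud mode
    ZkController zk = core.getCoreDescriptor().getCoreContainer().getZkController();
    ClusterState state = zk.getClusterState();
    Slice slice = state.getSlice(cloud.getCollectionName(), cloud.getShardId());
    if (slice == null) return false;
    for (Replica replica : slice.getReplicas()) {
      if (core.getName().equals(replica.getStr(ZkStateReader.CORE_NAME_PROP))
          && zk.getNodeName().equals(replica.getNodeName())) {
        return ZkStateReader.ACTIVE.equals(replica.getStr(ZkStateReader.STATE_PROP));
      }
    }
    return false;
  }
}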

Thanks,
Peter


On Thu, Jul 3, 2014 at 12:03 PM, wrdrvr wrd...@gmail.com wrote:

 Try querying the recovering core with distrib=false, you should get the
 count
 of docs in it.

 Most likely, since the replica is recovering it is forwarding all queries
 to
 the active replica, this can be verified in the core logs.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Question-about-solrcloud-recovery-process-tp4145450p4145491.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Custom QueryComponent to rewrite dismax query

2014-06-10 Thread Peter Keegan
We are using the 'edismax' query parser for its many benefits over the
standard Lucene parser. For queries with more than 5 or 6 keywords (which
is a lot for our typical user), the recall can be very high (sometimes
matching 75% or more of the documents). This high recall, when coupled with
some custom PostFilter scoring, is hurting the query performance.  I tried
varying the 'mm' (minimum match) parameter, but at values less than 100%,
the response time didn't improve much, and at 100%, there were often no
results, which is unacceptable.

So, I wrote a custom QueryComponent which rewrites the DisMax query.
Initially, the MinShouldMatch value is set to 100%. If the search returns 0
results, MinShouldMatch is set to 1 and the search is retried. This
improved the QPS throughput by about 2.5X. However, this only worked with
an unsharded index. With a sharded index, each shard returned only the
results from the first search (mm=100%). In the debugger, I could see 2
'response/ResultContext' NV-Pairs in the SolrQueryResponse object, so I
added code to remove the first pair if there were 2 pairs present, which
fixed this problem. My question: is removing the extra ResultContext a
reasonable solution to this problem? It just seems a little brittle to me.
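
For reference, the cleanup itself is small (a sketch; it just drops the first
'response' entry when the retry has produced a second one):

import org.apache.solr.common.util.NamedList;
import org.apache.solr.response.SolrQueryResponse;

// Sketch of the fix described above: if two "response" entries are present,
// remove the first (the empty mm=100% pass) so only the retry results are
// returned from this shard.
public class ResponseCleanup {

  public static void dropStaleResponse(SolrQueryResponse rsp) {
    NamedList<Object> values = rsp.getValues();
    if (values.getAll("response").size() > 1) {
      values.remove("response");   // NamedList.remove() drops the first occurrence
    }
  }
}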

Thanks,
Peter


Autoscaling Solr instances in AWS

2014-05-20 Thread Peter Keegan
We are running Solr 4.6.1 in AWS:
- 2 Solr instances (1 shard, 1 leader, 1 replica)
- 1 CloudSolrServer SolrJ client updating the index.
- 3 Zookeepers

The Solr instances are behind a load balanceer and also in an auto scaling
group. The ScaleUpPolicy will add up to 9 additional instances (replicas),
1 per minute. Later, the 9 replicas are terminated with the ScaleDownPolicy.

Problem: during the ScaleUpPolicy, when the Solr Leader is under heavy
query load, the SolrJ indexing client issues a commit which hangs and never
returns. Note that the index schema contains 3 ExternalFileFields which slow
down the commit process. Here's the stack trace:

Thread 1959: (state = IN_NATIVE)
 - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[],
int, int, int) @bci=0 (Compiled frame; information may be imprecise)
 - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150
(Compiled frame)
 - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121
(Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71,
line=166 (Compiled frame)
 - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90
(Compiled frame)
 -
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer)
@bci=137, line=281 (Compiled frame)
 -
org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer)
@bci=5, line=115 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer)
@bci=16, line=92 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer)
@bci=2, line=62 (Compiled frame)
 - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254
(Compiled frame)
 -
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader()
@bci=8, line=289 (Compiled frame)
 -
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader()
@bci=1, line=252 (Compiled frame)
 -
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader()
@bci=6, line=191 (Compiled frame)
 -
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest,
org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext)
@bci=62, line=300 (Compiled frame)
 -
org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest,
org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext)
@bci=60, line=127 (Compiled frame)
 -
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest,
org.apache.http.protocol.HttpContext) @bci=198, line=717 (Compiled frame)
 -
org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost,
org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext)
@bci=597, line=522 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost,
org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext)
@bci=344, line=906 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest,
org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame)
 -
org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest)
@bci=6, line=784 (Compiled frame)
 -
org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest,
org.apache.solr.client.solrj.ResponseParser) @bci=1175, line=395 (Compiled
frame)
 -
org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest)
@bci=17, line=199 (Compiled frame)
 -
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(org.apache.solr.client.solrj.impl.LBHttpSolrServer$Req)
@bci=132, line=285 (Compiled frame)
 -
org.apache.solr.client.solrj.impl.CloudSolrServer.request(org.apache.solr.client.solrj.SolrRequest)
@bci=838, line=640 (Compiled frame)
 -
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(org.apache.solr.client.solrj.SolrServer)
@bci=17, line=117 (Compiled frame)
 - org.apache.solr.client.solrj.SolrServer.commit(boolean, boolean)
@bci=16, line=168 (Interpreted frame)
 - org.apache.solr.client.solrj.SolrServer.commit() @bci=3, line=146
(Interpreted frame)

 The Solr leader log shows many connection timeout exceptions from the
other Solr replicas during this period. Some of these timeouts may have
been caused by replicas disappearing from the ScaleDownPolicy. From the
search client application's point of view, everything looked fine, but
indexing stopped until I restarted the SolrJ client.

 Does this look like a case where a timeout value needs to be increased
somewhere? If so, which one?

 Thanks,
 Peter


Re: Distributed commits in CloudSolrServer

2014-04-16 Thread Peter Keegan
Are distributed commits also done in parallel across shards?

Peter


On Tue, Apr 15, 2014 at 3:50 PM, Mark Miller markrmil...@gmail.com wrote:

 Inline responses below.
 --
 Mark Miller
 about.me/markrmiller

 On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com)
 wrote:

 I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
 ZKs. The Solr indexes are behind a load balancer. There is one
 CloudSolrServer client updating the indexes. The index schema includes 3
 ExternalFileFields. When the CloudSolrServer client issues a hard commit,
 I
 observe that the commits occur sequentially, not in parallel, on the
 leader
 and replica. The duration of each commit is about a minute. Most of this
 time is spent reloading the 3 ExternalFileField files. Because of the
 sequential commits, there is a period of time (1 minute+) when the index
 searchers will return different results, which can cause a bad user
 experience. This will get worse as replicas are added to handle
 auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
 queries.

 My questions:

 1. Is there a reason that the distributed commits are done in sequence,
 not
 in parallel? Is there a way to change this behavior?


 The reason is that updates are currently done this way - it’s the only
 safe way to do it without solving some more problems. I don’t think you can
 easily change this. I think we should probably file a JIRA issue to track a
 better solution for commit handling. I think there are some complications
 because of how commits can be added on update requests, but its something
 we probably want to try and solve before tackling *all* updates to replicas
 in parallel with the leader.



 2. If instead, the commits were done in parallel by a separate client via
 a
 GET to each Solr instance, how would this client get the host/port values
 for each Solr instance from zookeeper? Are there any downsides to doing
 commits this way?

 Not really, other than the extra management.





 Thanks,
 Peter



Re: Distributed commits in CloudSolrServer

2014-04-16 Thread Peter Keegan
Are distributed commits also done in parallel across shards?
I meant 'sequentially' across shards.


On Wed, Apr 16, 2014 at 9:08 AM, Peter Keegan peterlkee...@gmail.comwrote:

 Are distributed commits also done in parallel across shards?

 Peter


 On Tue, Apr 15, 2014 at 3:50 PM, Mark Miller markrmil...@gmail.comwrote:

 Inline responses below.
 --
 Mark Miller
 about.me/markrmiller

 On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com)
 wrote:

 I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
 ZKs. The Solr indexes are behind a load balancer. There is one
 CloudSolrServer client updating the indexes. The index schema includes 3
 ExternalFileFields. When the CloudSolrServer client issues a hard commit,
 I
 observe that the commits occur sequentially, not in parallel, on the
 leader
 and replica. The duration of each commit is about a minute. Most of this
 time is spent reloading the 3 ExternalFileField files. Because of the
 sequential commits, there is a period of time (1 minute+) when the index
 searchers will return different results, which can cause a bad user
 experience. This will get worse as replicas are added to handle
 auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
 queries.

 My questions:

 1. Is there a reason that the distributed commits are done in sequence,
 not
 in parallel? Is there a way to change this behavior?


 The reason is that updates are currently done this way - it’s the only
 safe way to do it without solving some more problems. I don’t think you can
 easily change this. I think we should probably file a JIRA issue to track a
 better solution for commit handling. I think there are some complications
 because of how commits can be added on update requests, but its something
 we probably want to try and solve before tackling *all* updates to replicas
 in parallel with the leader.



 2. If instead, the commits were done in parallel by a separate client via
 a
 GET to each Solr instance, how would this client get the host/port values
 for each Solr instance from zookeeper? Are there any downsides to doing
 commits this way?

 Not really, other than the extra management.





 Thanks,
 Peter





Distributed commits in CloudSolrServer

2014-04-15 Thread Peter Keegan
I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
ZKs. The Solr indexes are behind a load balancer. There is one
CloudSolrServer client updating the indexes. The index schema includes 3
ExternalFileFields. When the CloudSolrServer client issues a hard commit, I
observe that the commits occur sequentially, not in parallel, on the leader
and replica. The duration of each commit is about a minute. Most of this
time is spent reloading the 3 ExternalFileField files. Because of the
sequential commits, there is a period of time (1 minute+) when the index
searchers will return different results, which can cause a bad user
experience. This will get worse as replicas are added to handle
auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
queries.

My questions:

1. Is there a reason that the distributed commits are done in sequence, not
in parallel? Is there a way to change this behavior?

2. If instead, the commits were done in parallel by a separate client via a
GET to each Solr instance, how would this client get the host/port values
for each Solr instance from zookeeper? Are there any downsides to doing
commits this way?
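
For question 2, the host/port lookup could piggyback on the cluster state the
CloudSolrServer already watches; a sketch (sequential here, but each commit
could be handed to a thread pool):

import java.util.Collection;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

// Sketch: walk the collection's slices/replicas from ZooKeeper and send an
// explicit commit to every core URL.
public class ParallelCommitSketch {

  public static void commitAllReplicas(CloudSolrServer cloud, String collection) throws Exception {
    cloud.connect();
    ClusterState state = cloud.getZkStateReader().getClusterState();
    Collection<Slice> slices = state.getSlices(collection);
    if (slices == null) return;
    for (Slice slice : slices) {
      for (Replica replica : slice.getReplicas()) {
        String coreUrl = replica.getStr(ZkStateReader.BASE_URL_PROP) + "/"
            + replica.getStr(ZkStateReader.CORE_NAME_PROP);
        HttpSolrServer server = new HttpSolrServer(coreUrl);
        try {
          server.commit();   // issue these from a thread pool to run them in parallel
        } finally {
          server.shutdown();
        }
      }
    }
  }
}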

Thanks,
Peter


Re: Configurable collectors for custom ranking

2014-03-07 Thread Peter Keegan
Hi Joel,

Although I solved this issue with a custom CollectorFactory, I also have a
solution that uses a PostFilter and and optional ValueSource.
Could you take a look at SOLR-5831 and see if I've got this right?

Thanks,
Peter



On Mon, Dec 23, 2013 at 6:37 PM, Joel Bernstein joels...@gmail.com wrote:

 Peter,

 You actually only need the current score being collected to be in the
 request context. So you don't need a map, you just need an object wrapper
 around a mutable float.

 If you have a page size of X, only the top X scores need to be held onto,
 because all the other scores wouldn't have made it into that page anyway so
 they might as well be 0. Because the QueryResultCache caches a larger
 window than the page size you should keep enough scores so the cached
 docList is correct. But if you're only dealing with 150K of results you
 could just keep all the scores in a FloatArrayList and not worry about the
 keeping the top X scores in a priority queue.

 During the collect hang onto the docIds and scores and build your scaling
 info.

 During the finish iterate your docIds and scale the scores as you go.

 Set your scaled score into the object wrapper that is in the request
 context before you collect each document.

 When you call collect on the delegate collectors they will call the custom
 value source for each document to perform the sort. Your custom value
 source will return whatever the float value is in the request context at
 that time.

 If you're also going to run this postfilter when you're doing a standard
 rank by score you'll also need to send down a dummy scorer to the delegate
 collectors. Spend some time with the CollapsingQParserPlugin in trunk to
 see how the dummy scorer works.

 I'll be adding value source collapse criteria to the
 CollapsingQParserPlugin this week and it will have a similar interaction
 between a PostFilter and value source. So you may want to watch SOLR-5536
 to see an example of this.

 Joel












 Joel Bernstein
 Search Engineer at Heliosearch


 On Mon, Dec 23, 2013 at 4:03 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Hi Joel,
 
  Could you clarify what would be in the <key,value> Map added to the
  SearchRequest context? It seems that all the docId/score tuples need to
 be
  there, including the ones not in the 'top N ScoreDocs' PriorityQueue
  (score=0). If so would the Map be something like:
  <scaled_scores, Map<Integer,Float>> ?
 
  Also, what is the reason for passing score=0 for documents that aren't in
  the PriorityQueue? Will these docs get filtered out before a normal sort
 by
  score?
 
  Thanks,
  Peter
 
 
  On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com
  wrote:
 
   The sorting is going to happen in the lower level collectors. You need
 a
   value source that returns the score of the document being collected.
  
   Here is how you can make this happen:
  
   1) Create an object in your PostFilter that simply holds the current
  score.
   Place this object in the SearchRequest context map. Update object.score
  as
   you pass the docs and scores to the lower collectors.
  
   2) Create a values source that checks the SearchRequest context for the
   object that's holding the current score. Use this object to return the
   current score when called. For example if you give the value source a
   handle called score a compound function call will look like this:
   sum(score(), field(x))
  
   Joel
  
  
  
  
  
  
  
  
  
  
   On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
Regarding my original goal, which is to perform a math function using
  the
scaled score and a field value, and sort on the result, how does this
  fit
in? Must I implement another custom PostFilter with a higher cost
 than
   the
scale PostFilter?
   
Thanks,
Peter
   
   
On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan 
 peterlkee...@gmail.com
wrote:
   
 Thanks very much for the guidance. I'd be happy to donate a working
 solution.

 Peter


 On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein 
 joels...@gmail.com
wrote:

 SOLR-5020 has the commit info, it's mainly changes to
   SolrIndexSearcher
I
 believe. They might apply to 4.3.
 I think as long you have the finish method that's all you'll need.
  If
you
 can get this working it would be excellent if you could donate
 back
   the
 Scale PostFilter.


 On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  This is what I was looking for, but the DelegatingCollector
  'finish'
 method
  doesn't exist in 4.3.0 :(   Can this be patched in and are there
  any
 other
  PostFilter dependencies on 4.5?
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein 
  joels...@gmail.com
   
  wrote:
 
   Here is one approach to use in a postfilter

Getting index schema in SolrCloud mode

2014-02-03 Thread Peter Keegan
I'm indexing data with a SolrJ client via SolrServer. Currently, I parse
the schema returned from a HttpGet on:
localhost:8983/solr/collection/schema/fields

What is the recommended way to read the schema with CloudSolrServer? Can it
be done with a single HttpGet to a ZK server?
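
One thing I can see: the schema file itself is stored in ZooKeeper under the
collection's config set, so a single ZK read (not an HttpGet) would return the
raw file. A sketch, with 'myconf' standing in for the real config name:

import org.apache.solr.common.cloud.SolrZkClient;

// Sketch: read the raw schema.xml for a config set straight out of ZooKeeper.
public class SchemaFromZk {

  public static String readSchema(String zkHost) throws Exception {
    SolrZkClient zk = new SolrZkClient(zkHost, 30000);
    try {
      byte[] data = zk.getData("/configs/myconf/schema.xml", null, null, true);
      return new String(data, "UTF-8");
    } finally {
      zk.close();
    }
  }
}

That returns the raw file rather than the parsed field list, so the parsing
would be different from the /schema/fields JSON.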

Thanks,
Peter


Re: How to override rollback behavior in DIH

2014-01-17 Thread Peter Keegan
Following up on this a bit - my main index is updated by a SolrJ client in
another process. If the DIH fails, the SolrJ client is never informed of
the index rollback, and any pending updates are lost. For now, I've made
sure that the DIH processor never throws an exception, but this makes it a
bit harder to detect the failure via the admin interface.

Thanks,
Peter


On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan peterlkee...@gmail.comwrote:

 I have a custom data import handler that creates an ExternalFileField from
 a source that is different from the main index. If the import fails (in my
 case, a connection refused in URLDataSource), I don't want to roll back any
 uncommitted changes to the main index. However, this seems to be the
 default behavior. Is there a way to override the IndexWriter rollback?

 Thanks,
 Peter



Re: How to override rollback behavior in DIH

2014-01-17 Thread Peter Keegan
I'm actually doing the 'skip' on every successful call to 'nextRow' with
this trick:
  row.put("$externalfield", null); // DocBuilder.addFields will skip fields
starting with '$'
because I'm only creating ExternalFieldFields. However, an error could also
occur in the 'init' call, so exceptions have to be caught there, too.
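
The overall shape is something like the sketch below (the row-fetching method
is a stub standing in for the real URLDataSource logic):

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

// Sketch of the "never throw" approach: remember failures from init() and
// nextRow() and end the entity cleanly instead of letting DIH roll back.
public class NoRollbackEntityProcessor extends EntityProcessorBase {

  private boolean failed = false;

  @Override
  public void init(Context context) {
    try {
      super.init(context);
      // ... set up the URLDataSource, etc. (omitted)
    } catch (Exception e) {
      failed = true;              // note it, but don't propagate
    }
  }

  @Override
  public Map<String, Object> nextRow() {
    if (failed) return null;      // returning null ends this entity quietly
    try {
      Map<String, Object> row = fetchNextRow();
      if (row != null) {
        row.put("$externalfield", null);  // '$'-prefixed keys are skipped by DocBuilder.addFields
      }
      return row;
    } catch (Exception e) {
      failed = true;
      return null;
    }
  }

  /** Placeholder for however the real processor produces a row (e.g. from a URLDataSource). */
  private Map<String, Object> fetchNextRow() {
    return null;
  }
}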

Thanks,
Peter


On Fri, Jan 17, 2014 at 10:19 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Can you try using onError=skip on your entities which use this data source?

 It's been some time since I looked at the code so I don't know if this
 works with data source. Worth a try I guess.

 On Fri, Jan 17, 2014 at 7:20 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
  Following up on this a bit - my main index is updated by a SolrJ client
 in
  another process. If the DIH fails, the SolrJ client is never informed of
  the index rollback, and any pending updates are lost. For now, I've made
  sure that the DIH processor never throws an exception, but this makes it
 a
  bit harder to detect the failure via the admin interface.
 
  Thanks,
  Peter
 
 
  On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan peterlkee...@gmail.com
 wrote:
 
  I have a custom data import handler that creates an ExternalFileField
 from
  a source that is different from the main index. If the import fails (in
 my
  case, a connection refused in URLDataSource), I don't want to roll back
 any
  uncommitted changes to the main index. However, this seems to be the
  default behavior. Is there a way to override the IndexWriter rollback?
 
  Thanks,
  Peter
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: How to override rollback behavior in DIH

2014-01-17 Thread Peter Keegan
Hmm, this does get a bit complicated, and I'm not even doing any writes
with the DIH SolrWriter. In retrospect, using a DIH to create only EFFs
doesn't buy much except for the integration into the Solr Admin UI.  Thanks
for the pointer to 3671, James.

Peter


On Fri, Jan 17, 2014 at 10:59 AM, Dyer, James
james.d...@ingramcontent.comwrote:

 Peter,

 I think you can override org.apache.solr.handler.dataimport.SolrWriter to
 have a custom (no-op) rollback method.  Your new writer should implement
 org.apache.solr.handler.dataimport.DIHWriter.  You can specify the
 writerImpl request parameter to specify the new class.

 Unfortunately, it isn't actually this easy because your new writer is
 going to have to know what to do for all the other methods.  That is, there
 is no easy way to tell it how to write/commit/etc to Solr.  The default
 SolrWriter has a lot of hardcoded parameters it gets sent on construction
 in DataImportHandler#handleRequestBody.  You would have to somehow
 duplicate this construction on your own custom class.  See SOLR-3671 for an
 explanation of this dilemma.

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: pkeegan01...@gmail.com [mailto:pkeegan01...@gmail.com] On Behalf Of
 Peter Keegan
 Sent: Friday, January 17, 2014 7:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to override rollback behavior in DIH

 Following up on this a bit - my main index is updated by a SolrJ client in
 another process. If the DIH fails, the SolrJ client is never informed of
 the index rollback, and any pending updates are lost. For now, I've made
 sure that the DIH processor never throws an exception, but this makes it a
 bit harder to detect the failure via the admin interface.

 Thanks,
 Peter


 On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan peterlkee...@gmail.com
 wrote:

  I have a custom data import handler that creates an ExternalFileField
 from
  a source that is different from the main index. If the import fails (in
 my
  case, a connection refused in URLDataSource), I don't want to roll back
 any
  uncommitted changes to the main index. However, this seems to be the
  default behavior. Is there a way to override the IndexWriter rollback?
 
  Thanks,
  Peter
 




How to override rollback behavior in DIH

2014-01-14 Thread Peter Keegan
I have a custom data import handler that creates an ExternalFileField from
a source that is different from the main index. If the import fails (in my
case, a connection refused in URLDataSource), I don't want to roll back any
uncommitted changes to the main index. However, this seems to be the
default behavior. Is there a way to override the IndexWriter rollback?

Thanks,
Peter


Re: leading wildcard characters

2014-01-14 Thread Peter Keegan
I created SOLR-5630.
Although WildcardQuery is much much faster now with AutomatonQuery, it can
still result in slow queries when used in multiple keywords. From my
testing, I think I will need to disable all WildcardQuerys and only allow
PrefixQuery.
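
At the Lucene level the idea looks like the sketch below; wiring the same
thing into edismax is the harder part, since ExtendedDismaxQParser builds its
own parser internally:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Sketch: prefix queries (a plain trailing '*') never reach getWildcardQuery,
// so rejecting everything here leaves PrefixQuery as the only wildcard form.
public class PrefixOnlyQueryParser extends QueryParser {

  public PrefixOnlyQueryParser(String defaultField) {
    super(Version.LUCENE_46, defaultField, new StandardAnalyzer(Version.LUCENE_46));
    setAllowLeadingWildcard(false);
  }

  @Override
  protected Query getWildcardQuery(String field, String termStr) throws ParseException {
    throw new ParseException("wildcard queries are disabled: " + termStr);
  }
}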

Peter


On Sat, Jan 11, 2014 at 4:17 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Peter,

 Yes you are correct. There is no way to disable it.

 Weird thing is javadoc says default is false but it is enabled by default
 in SolrQueryParserBase.
 boolean allowLeadingWildcard = true;



 http://search-lucene.com/jd/solr/solr-core/org/apache/solr/parser/SolrQueryParserBase.html#setAllowLeadingWildcard(boolean)


 There is an effort for making such (allowLeadingWilcard,fuzzyMinSim,
 fuzzyPrefixLength) properties configurable :
 https://issues.apache.org/jira/browse/SOLR-218

 But this one is somehow old. Since its description is stale, do you want
 to open a new one?

 Ahmet


 On Friday, January 10, 2014 6:12 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
 Removing ReversedWildcardFilterFactory  had no effect.



 On Fri, Jan 10, 2014 at 10:48 AM, Ahmet Arslan iori...@yahoo.com wrote:

  Hi Peter,
 
  Can you remove any occurrence of ReversedWildcardFilterFactory in
  schema.xml? (even if you don't use it)
 
  Ahmet
 
 
 
  On Friday, January 10, 2014 3:34 PM, Peter Keegan 
 peterlkee...@gmail.com
  wrote:
  How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard
  method is there in the parser, but nothing references the getter. Also,
 the
  Edismax parser always enables it and provides no way to override.
 
  Thanks,
  Peter
 
 




leading wildcard characters

2014-01-10 Thread Peter Keegan
How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard
method is there in the parser, but nothing references the getter. Also, the
Edismax parser always enables it and provides no way to override.

Thanks,
Peter


Re: leading wildcard characters

2014-01-10 Thread Peter Keegan
Removing ReversedWildcardFilterFactory  had no effect.


On Fri, Jan 10, 2014 at 10:48 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Peter,

 Can you remove any occurrence of ReversedWildcardFilterFactory in
 schema.xml? (even if you don't use it)

 Ahmet



 On Friday, January 10, 2014 3:34 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
 How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard
 method is there in the parser, but nothing references the getter. Also, the
 Edismax parser always enables it and provides no way to override.

 Thanks,
 Peter




Re: Zookeeper as Service

2014-01-09 Thread Peter Keegan
There's also: http://www.tanukisoftware.com/


On Thu, Jan 9, 2014 at 11:18 AM, Nazik Huq nazik...@yahoo.com wrote:



 From your email I gather your main concern is starting zookeeper on server
 startups.

 You may want to look at these non-native service oriented options too:
 - Create a script (cmd or bat) to start ZK on server bootup. This method
 may not restart ZK if ZK crashes (not the server).
 - Create a C# command line program that starts on server bootup (see above) that
 uses the .Net System.Diagnostics.Process.Start method to start ZK on
 server start and monitor the ZK process via a loop. Restart when the ZK process
 crashes or hangs. I prefer this method. There might be a Java equivalent of
 this. There are many examples available on the web.
 Cheers,
 @nazik_huq



 On Thursday, January 9, 2014 10:07 AM, Charlie Hull char...@flax.co.uk
 wrote:

 On 09/01/2014 09:44, Karthikeyan.Kannappan wrote:

  I am hosting in windows OS
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Zookeeper-as-Service-tp4110396p4110413.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 

 There are various ways to 'servicify' (yes that may not be an actual
 word) executable applications on Windows. The venerable SrvAny is one
 such option, as is the newer nssm.exe (Non-Sucking Service Manager).

 Bear in mind that a Windows Service doesn't operate quite the same way
 with regard to stdout and stderr which may mean any error messages end
 up in a black hole, with you simply getting an unhelpful 'service
 failed to start' error message from Windows itself if something goes
 wrong. The 'working directory' is another thing that needs careful
 setting up.

 Cheers

 Charlie

 --
 Charlie Hull
 Flax - Open Source Enterprise Search

 tel/fax: +44 (0)8700 118334
 mobile:  +44 (0)7767 825828
 web: www.flax.co.uk



Re: Function query matching

2014-01-06 Thread Peter Keegan
: The bottom line for Peter is still the same: using scale() wrapped around
: a function/query does involve computing the results for every document,
: and that is going to scale linearly as the size of the index grows -- but
: it is *only* because of the scale function.

Another problem with this approach is that the scale() function will likely
generate incorrect values because it occurs before any filters. If the
filters drop high scoring docs, the scaled values will never include the
'maxTarget' value (and may not include the 'minTarget' value, either).

Peter


On Sat, Dec 7, 2013 at 2:30 PM, Chris Hostetter hossman_luc...@fucit.orgwrote:


 (This is why i shouldn't send emails just before going to bed.)

 I woke up this morning realizing that of course I was completely wrong
 when I said this...

 : I want to be clear for 99% of the people reading this, if you find
: yourself writing a query structure like this...
 :
 :   q={!func}..functions involving wrapping $qq ...
 ...
 : ...Try to restructure the match you want to do into the form of a
 : multiplier
 ...
: Because the latter case is much more efficient and Solr will only compute
: the function values for the docs it needs to (that match the wrapped $qq
 : query)

 The reason i was wrong...

 Even though function queries do by default match all documents, and even
 if the main query is a function query (ie: q={!func}...), if there is
 an fq that filters down the set of documents, then the (main) function
 query will only be calculated for the documents that match the filter.

 It was trivial to amend the test I mentioned last night to show this (and
 I feel silly for not doing that last night and stopping myself from saying
 something foolish)...

   https://svn.apache.org/viewvc?view=revisionrevision=r1548955

 The bottom line for Peter is still the same: using scale() wrapped around
 a function/query does involve computing the results for every document,
 and that is going to scale linearly as the size of the index grows -- but
 it is *only* because of the scale function.



 -Hoss
 http://www.lucidworks.com/



Re: how to include result ordinal in response

2014-01-04 Thread Peter Keegan
Thank you both. The DocTransformer solution was very simple:

import java.io.IOException;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformerFactory;

public class PositionAugmenterFactory extends TransformerFactory {

  @Override
  public DocTransformer create(String field, SolrParams params, SolrQueryRequest req) {
    return new PositionAugmenter(field);
  }

  class PositionAugmenter extends DocTransformer {
    final String name;   // the label requested in fl, e.g. [position]
    int position;        // 1-based ordinal of the doc within this response

    public PositionAugmenter(String display) {
      this.name = display;
      this.position = 1;
    }

    @Override
    public String getName() {
      return name;
    }

    @Override
    public void transform(SolrDocument doc, int docid) throws IOException {
      // docs are transformed in result order, so a simple counter is enough
      doc.setField(name, position++);
    }
  }
}

@Jack: fl=[docid] is similar to using the uniqueKey, but still hard to
compare visually (for me).

The fields are not returned in the same order as specified in the 'fl'
parameter. Can the order be overridden?

Thanks,
Peter




On Fri, Jan 3, 2014 at 6:58 PM, Jack Krupansky j...@basetechnology.comwrote:

 Or just use the internal document ID: fl=*,[docid]

 Granted, the docID may change if a segment merge occurs and earlier
 documents have been deleted, but it may be sufficient for your purposes.

 -- Jack Krupansky

 -Original Message- From: Upayavira
 Sent: Friday, January 03, 2014 5:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: how to include result ordinal in response


 On Fri, Jan 3, 2014, at 10:00 PM, Peter Keegan wrote:

 Is there a simple way to output the result number (ordinal) with each
 returned document using the 'fl' parameter? This would be useful when
 visually comparing the results from 2 queries.


 I'm not aware of a simple way.

 If you're competent in Java, this could be a neat new DocTransformer
 component. You'd say:

 fl=*,[position]

 and you'd get a new field in your search results.

 Cruder ways would be to use XSLT to add it to an XML output, or a
 velocity template, but the DocTransformer approach would create
 something that could be of ongoing use.

 Upayavira



how to include result ordinal in response

2014-01-03 Thread Peter Keegan
Is there a simple way to output the result number (ordinal) with each
returned document using the 'fl' parameter? This would be useful when
visually comparing the results from 2 queries.

Thanks,
Peter


Re: Configurable collectors for custom ranking

2013-12-26 Thread Peter Keegan
In my case, the final function call looks something like this:
sum(product($k1,score()),product($k2,field(x)))
This means that all the scores would have to be scaled and passed down, not
just the top N because even a low score could be offset by a high value in
'field(x)'.
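
For reference, the two pieces Joel described would look roughly like this
(the context key is made up, and the score() handle still needs a
ValueSourceParser registered in solrconfig.xml):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;
import org.apache.solr.request.SolrRequestInfo;

// Sketch: the PostFilter puts a ScoreHolder into the request context and sets
// holder.score to the scaled score just before delegating each document; this
// ValueSource simply reads whatever value is there at collect time.
public class CurrentScoreValueSource extends ValueSource {

  public static final String CONTEXT_KEY = "scaledScoreHolder";  // made-up key

  public static class ScoreHolder { public float score; }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext) throws IOException {
    SolrRequestInfo info = SolrRequestInfo.getRequestInfo();
    final ScoreHolder holder = (info == null) ? null
        : (ScoreHolder) info.getReq().getContext().get(CONTEXT_KEY);
    return new FloatDocValues(this) {
      @Override
      public float floatVal(int doc) {
        return holder == null ? 0f : holder.score;
      }
    };
  }

  @Override public String description() { return "score()"; }
  @Override public boolean equals(Object o) { return o instanceof CurrentScoreValueSource; }
  @Override public int hashCode() { return CurrentScoreValueSource.class.hashCode(); }
}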

Thanks,
Peter


On Mon, Dec 23, 2013 at 6:37 PM, Joel Bernstein joels...@gmail.com wrote:

 Peter,

 You actually only need the current score being collected to be in the
 request context. So you don't need a map, you just need an object wrapper
 around a mutable float.

 If you have a page size of X, only the top X scores need to be held onto,
 because all the other scores wouldn't have made it into that page anyway so
 they might as well be 0. Because the QueryResultCache caches a larger
 window than the page size you should keep enough scores so the cached
 docList is correct. But if you're only dealing with 150K of results you
 could just keep all the scores in a FloatArrayList and not worry about the
 keeping the top X scores in a priority queue.

 During the collect hang onto the docIds and scores and build your scaling
 info.

 During the finish iterate your docIds and scale the scores as you go.

 Set your scaled score into the object wrapper that is in the request
 context before you collect each document.

 When you call collect on the delegate collectors they will call the custom
 value source for each document to perform the sort. Your custom value
 source will return whatever the float value is in the request context at
 that time.

 If you're also going to run this postfilter when you're doing a standard
 rank by score you'll also need to send down a dummy scorer to the delegate
 collectors. Spend some time with the CollapsingQParserPlugin in trunk to
 see how the dummy scorer works.

 I'll be adding value source collapse criteria to the
 CollapsingQParserPlugin this week and it will have a similar interaction
 between a PostFilter and value source. So you may want to watch SOLR-5536
 to see an example of this.

 Joel












 Joel Bernstein
 Search Engineer at Heliosearch


 On Mon, Dec 23, 2013 at 4:03 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Hi Joel,
 
  Could you clarify what would be in the <key,value> Map added to the
  SearchRequest context? It seems that all the docId/score tuples need to
 be
  there, including the ones not in the 'top N ScoreDocs' PriorityQueue
  (score=0). If so would the Map be something like:
  <scaled_scores, Map<Integer,Float>> ?
 
  Also, what is the reason for passing score=0 for documents that aren't in
  the PriorityQueue? Will these docs get filtered out before a normal sort
 by
  score?
 
  Thanks,
  Peter
 
 
  On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com
  wrote:
 
   The sorting is going to happen in the lower level collectors. You need
 a
   value source that returns the score of the document being collected.
  
   Here is how you can make this happen:
  
   1) Create an object in your PostFilter that simply holds the current
  score.
   Place this object in the SearchRequest context map. Update object.score
  as
   you pass the docs and scores to the lower collectors.
  
   2) Create a values source that checks the SearchRequest context for the
   object that's holding the current score. Use this object to return the
   current score when called. For example if you give the value source a
   handle called score a compound function call will look like this:
   sum(score(), field(x))
  
   Joel
  
  
  
  
  
  
  
  
  
  
   On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
Regarding my original goal, which is to perform a math function using
  the
scaled score and a field value, and sort on the result, how does this
  fit
in? Must I implement another custom PostFilter with a higher cost
 than
   the
scale PostFilter?
   
Thanks,
Peter
   
   
On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan 
 peterlkee...@gmail.com
wrote:
   
 Thanks very much for the guidance. I'd be happy to donate a working
 solution.

 Peter


 On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein 
 joels...@gmail.com
wrote:

 SOLR-5020 has the commit info, it's mainly changes to
   SolrIndexSearcher
I
 believe. They might apply to 4.3.
 I think as long you have the finish method that's all you'll need.
  If
you
 can get this working it would be excellent if you could donate
 back
   the
 Scale PostFilter.


 On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  This is what I was looking for, but the DelegatingCollector
  'finish'
 method
  doesn't exist in 4.3.0 :(   Can this be patched in and are there
  any
 other
  PostFilter dependencies on 4.5?
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein 
  joels...@gmail.com
   
  wrote

Re: Configurable collectors for custom ranking

2013-12-23 Thread Peter Keegan
Hi Joel,

Could you clarify what would be in the <key,value> Map added to the
SearchRequest context? It seems that all the docId/score tuples need to be
there, including the ones not in the 'top N ScoreDocs' PriorityQueue
(score=0). If so would the Map be something like:
<scaled_scores, Map<Integer,Float>> ?

Also, what is the reason for passing score=0 for documents that aren't in
the PriorityQueue? Will these docs get filtered out before a normal sort by
score?

Thanks,
Peter


On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com wrote:

 The sorting is going to happen in the lower level collectors. You need a
 value source that returns the score of the document being collected.

 Here is how you can make this happen:

 1) Create an object in your PostFilter that simply holds the current score.
 Place this object in the SearchRequest context map. Update object.score as
 you pass the docs and scores to the lower collectors.

 2) Create a values source that checks the SearchRequest context for the
 object that's holding the current score. Use this object to return the
 current score when called. For example if you give the value source a
 handle called score a compound function call will look like this:
 sum(score(), field(x))

 Joel










 On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Regarding my original goal, which is to perform a math function using the
  scaled score and a field value, and sort on the result, how does this fit
  in? Must I implement another custom PostFilter with a higher cost than
 the
  scale PostFilter?
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
 
   Thanks very much for the guidance. I'd be happy to donate a working
   solution.
  
   Peter
  
  
   On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com
  wrote:
  
   SOLR-5020 has the commit info, it's mainly changes to
 SolrIndexSearcher
  I
   believe. They might apply to 4.3.
   I think as long you have the finish method that's all you'll need. If
  you
   can get this working it would be excellent if you could donate back
 the
   Scale PostFilter.
  
  
   On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
This is what I was looking for, but the DelegatingCollector 'finish'
   method
doesn't exist in 4.3.0 :(   Can this be patched in and are there any
   other
PostFilter dependencies on 4.5?
   
Thanks,
Peter
   
   
On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com
 
wrote:
   
 Here is one approach to use in a postfilter

 1) In the collect() method call score for each doc. Use the scores
  to
 create your scaleInfo.
 2) Keep a bitset of the hits and a priorityQueue of your top X
   ScoreDocs.
 3) Don't delegate any documents to lower collectors in the
 collect()
 method.
 4) In the finish method create a score mapping (use the hppc
 IntFloatOpenHashMap) with your top X docIds pointing to their
 score,
using
 the priorityQueue created in step 2. Then iterate the bitset (also
created
 in step 2) sending down each doc to the lower collectors,
 retrieving
   and
 scaling the score from the score map. If the document is not in
 the
   score
 map then send down 0.

  You'll have to set up a dummy scorer to feed to lower collectors. The
 CollapsingQParserPlugin has an example of how to do this.




 On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  Hi Joel,
 
  I thought about using a PostFilter, but the problem is that the
   'scale'
  function must be done after all matching docs have been scored
 but
before
  adding them to the PriorityQueue that sorts just the rows to be
returned.
  Doing the 'scale' function wrapped in a 'query' is proving to be
  too
slow
  when it visits every document in the index.
 
  In the Collector, I can see how to get the field values like
 this:
 
 

   
  
 
 indexSearcher.getSchema().getField(field(myfield).getType().getValueSource(SchemaField,
  QParser).getValues()
 
  But, 'getValueSource' needs a QParser, which isn't available.
  And I can't create a QParser without a SolrQueryRequest, which
  isn't
  available.
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein 
  joels...@gmail.com
   
  wrote:
 
   Peter,
  
   It sounds like you could achieve what you want to do in a
   PostFilter
  rather
   then extending the TopDocsCollector. Is there a reason why a
PostFilter
   won't work for you?
  
   Joel
  
  
   On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan 
peterlkee...@gmail.com
   wrote:
  
Quick question:
In the context of a custom collector, how does one get

Re: Configurable collectors for custom ranking

2013-12-19 Thread Peter Keegan
In order to size the PriorityQueue, the result window size for the query is
needed. This has already been computed in the SolrIndexSearcher and is available
in QueryCommand.getSupersetMaxDoc(), but it doesn't seem to be reachable from the
PostFilter via either the SolrParams or the SolrQueryRequest. Is there a way to
get this precomputed value, or do I have to duplicate the logic from
SolrIndexSearcher?
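If I do end up duplicating it, I assume the window logic amounts to roughly this
(a sketch, not the actual SolrIndexSearcher code; start/rows are the standard
request params):

  SolrParams params = req.getParams();                  // req = the SolrQueryRequest
  int start = params.getInt(CommonParams.START, 0);
  int rows = params.getInt(CommonParams.ROWS, 10);
  int maxDocRequested = start + rows;
  int windowSize = req.getCore().getSolrConfig().queryResultWindowSize;
  // round up to a multiple of queryResultWindowSize, as the searcher does for cached results
  int supersetMaxDoc = ((maxDocRequested + windowSize - 1) / windowSize) * windowSize;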

Thanks,
Peter


On Thu, Dec 12, 2013 at 1:53 PM, Joel Bernstein joels...@gmail.com wrote:

 Thanks, I agree this is powerful stuff. One of the reasons that I haven't
 gotten back to pluggable collectors is that I've been using PostFilters
 instead.

 When you start doing stuff with scores in postfilters you'll run into the
 bug in SOLR-5416. This will affect you when you use facets in combination
 with the QueryResultCache or tag and exclude faceting.

 The patch in SOLR-5416 resolves this issue. You'll just need your
 PostFilter to implement ScoreFilter and the SolrIndexSearcher will know how
 to handle things.

 The DelegatingCollector.finish() method is so new, these kinds of bugs are
 still being cleaned out of the system. SOLR-5416 should be in Solr 4.7.
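So the filter class itself ends up declared roughly like this (a sketch;
ScaleCollector is a hypothetical DelegatingCollector, and ScoreFilter is the
marker interface from SOLR-5416; imports from org.apache.solr.search and
org.apache.lucene.search omitted):

  public class ScalePostFilter extends ExtendedQueryBase implements PostFilter, ScoreFilter {

    @Override
    public boolean getCache() { return false; }   // post filters are not cached

    @Override
    public int getCost() {
      return Math.max(super.getCost(), 100);      // cost >= 100 marks it as a post filter
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
      return new ScaleCollector();                // hypothetical collector that scales in finish()
    }
  }

It is then referenced from the request as an fq with cost=100 or higher, under
whatever name the plugin is registered as.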









 On Thu, Dec 12, 2013 at 12:54 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  This is pretty cool, and worthy of adding to Solr in Action (v2) and the
  other books. With function queries, flexible filter processing and
 caching,
  custom collectors, and post filters, there's a lot of flexibility here.
 
  Btw, the query times using a custom collector to scale/recompute scores
 is
  excellent (will have to see how it compares to your outlined solution).
 
  Thanks,
  Peter
 
 
  On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com
  wrote:
 
   The sorting is going to happen in the lower level collectors. You need
 a
   value source that returns the score of the document being collected.
  
   Here is how you can make this happen:
  
   1) Create an object in your PostFilter that simply holds the current
  score.
   Place this object in the SearchRequest context map. Update object.score
  as
   you pass the docs and scores to the lower collectors.
  
   2) Create a values source that checks the SearchRequest context for the
   object that's holding the current score. Use this object to return the
   current score when called. For example if you give the value source a
   handle called score a compound function call will look like this:
   sum(score(), field(x))
  
   Joel
  
  
  
  
  
  
  
  
  
  
   On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
Regarding my original goal, which is to perform a math function using
  the
scaled score and a field value, and sort on the result, how does this
  fit
in? Must I implement another custom PostFilter with a higher cost
 than
   the
scale PostFilter?
   
Thanks,
Peter
   
   
On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan 
 peterlkee...@gmail.com
wrote:
   
 Thanks very much for the guidance. I'd be happy to donate a working
 solution.

 Peter


 On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein 
 joels...@gmail.com
wrote:

 SOLR-5020 has the commit info, it's mainly changes to
   SolrIndexSearcher
I
 believe. They might apply to 4.3.
 I think as long you have the finish method that's all you'll need.
  If
you
 can get this working it would be excellent if you could donate
 back
   the
 Scale PostFilter.


 On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  This is what I was looking for, but the DelegatingCollector
  'finish'
 method
  doesn't exist in 4.3.0 :(   Can this be patched in and are there
  any
 other
  PostFilter dependencies on 4.5?
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein 
  joels...@gmail.com
   
  wrote:
 
   Here is one approach to use in a postfilter
  
   1) In the collect() method call score for each doc. Use the
  scores
to
   create your scaleInfo.
   2) Keep a bitset of the hits and a priorityQueue of your top X
 ScoreDocs.
   3) Don't delegate any documents to lower collectors in the
   collect()
   method.
   4) In the finish method create a score mapping (use the hppc
   IntFloatOpenHashMap) with your top X docIds pointing to their
   score,
  using
   the priorityQueue created in step 2. Then iterate the bitset
  (also
  created
   in step 2) sending down each doc to the lower collectors,
   retrieving
 and
   scaling the score from the score map. If the document is not
 in
   the
 score
   map then send down 0.
  
   You'll have setup a dummy scorer to feed to lower collectors.
  The
   CollapsingQParserPlugin has an example of how to do this.
  
  
  
  
   On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan

Re: Configurable collectors for custom ranking

2013-12-19 Thread Peter Keegan
I implemented the PostFilter approach described by Joel. Just iterating
over the OpenBitSet, even without the scaling or the HashMap lookup, added
30ms to the query time, which kinda surprised me. There were about 150K hits
out of a total of 500K docs. Is OpenBitSet the best way to do this?
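The loop itself is essentially this (a sketch; 'hits' is the OpenBitSet filled in
collect() and 'delegate' is the lower collector):

  for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
    // scaled-score lookup and per-segment docBase handling omitted here
    delegate.collect(doc);
  }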

Thanks,
Peter


On Thu, Dec 19, 2013 at 9:51 AM, Peter Keegan peterlkee...@gmail.comwrote:

 In order to size the PriorityQueue, the result window size for the query
 is needed. This has been computed in the SolrIndexSearcher and available
 in: QueryCommand.getSupersetMaxDoc(), but doesn't seem to be available for
 the PostFilter in either the SolrParams or SolrQueryRequest. Is there a way
 to get this precomputed value or do I have to duplicate the logic from
 SolrIndexSearcher?

 Thanks,
 Peter


 On Thu, Dec 12, 2013 at 1:53 PM, Joel Bernstein joels...@gmail.comwrote:

 Thanks, I agree this powerful stuff. One of the reasons that I haven't
 gotten back to pluggable collectors is that I've been using PostFilters
 instead.

 When you start doing stuff with scores in postfilters you'll run into the
 bug in SOLR-5416. This will effect you when you use facets in combination
 with the QueryResultCache or tag and exclude faceting.

 The patch in SOLR-5416 resolves this issue. You'll just need your
 PostFilter to implement ScoreFilter and the SolrIndexSearcher will know
 how
 to handle things.

 The DelegatingCollector.finish() method is so new, these kinds of bugs are
 still being cleaned out of the system. SOLR-5416 should be in Solr 4.7.









 On Thu, Dec 12, 2013 at 12:54 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  This is pretty cool, and worthy of adding to Solr in Action (v2) and the
  other books. With function queries, flexible filter processing and
 caching,
  custom collectors, and post filters, there's a lot of flexibility here.
 
  Btw, the query times using a custom collector to scale/recompute scores
 is
  excellent (will have to see how it compares to your outlined solution).
 
  Thanks,
  Peter
 
 
  On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com
  wrote:
 
   The sorting is going to happen in the lower level collectors. You
 need a
   value source that returns the score of the document being collected.
  
   Here is how you can make this happen:
  
   1) Create an object in your PostFilter that simply holds the current
  score.
   Place this object in the SearchRequest context map. Update
 object.score
  as
   you pass the docs and scores to the lower collectors.
  
   2) Create a values source that checks the SearchRequest context for
 the
   object that's holding the current score. Use this object to return the
   current score when called. For example if you give the value source a
   handle called score a compound function call will look like this:
   sum(score(), field(x))
  
   Joel
  
  
  
  
  
  
  
  
  
  
   On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
Regarding my original goal, which is to perform a math function
 using
  the
scaled score and a field value, and sort on the result, how does
 this
  fit
in? Must I implement another custom PostFilter with a higher cost
 than
   the
scale PostFilter?
   
Thanks,
Peter
   
   
On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan 
 peterlkee...@gmail.com
wrote:
   
 Thanks very much for the guidance. I'd be happy to donate a
 working
 solution.

 Peter


 On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein 
 joels...@gmail.com
wrote:

 SOLR-5020 has the commit info, it's mainly changes to
   SolrIndexSearcher
I
 believe. They might apply to 4.3.
 I think as long you have the finish method that's all you'll
 need.
  If
you
 can get this working it would be excellent if you could donate
 back
   the
 Scale PostFilter.


 On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  This is what I was looking for, but the DelegatingCollector
  'finish'
 method
  doesn't exist in 4.3.0 :(   Can this be patched in and are
 there
  any
 other
  PostFilter dependencies on 4.5?
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein 
  joels...@gmail.com
   
  wrote:
 
   Here is one approach to use in a postfilter
  
   1) In the collect() method call score for each doc. Use the
  scores
to
   create your scaleInfo.
   2) Keep a bitset of the hits and a priorityQueue of your top
 X
 ScoreDocs.
   3) Don't delegate any documents to lower collectors in the
   collect()
   method.
   4) In the finish method create a score mapping (use the hppc
   IntFloatOpenHashMap) with your top X docIds pointing to their
   score,
  using
   the priorityQueue created in step 2. Then iterate the bitset
  (also
  created
   in step 2) sending down each doc

Re: Configurable collectors for custom ranking

2013-12-12 Thread Peter Keegan
Regarding my original goal, which is to perform a math function using the
scaled score and a field value, and sort on the result, how does this fit
in? Must I implement another custom PostFilter with a higher cost than the
scale PostFilter?

Thanks,
Peter


On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan peterlkee...@gmail.comwrote:

 Thanks very much for the guidance. I'd be happy to donate a working
 solution.

 Peter


 On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.comwrote:

 SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I
 believe. They might apply to 4.3.
 I think as long you have the finish method that's all you'll need. If you
 can get this working it would be excellent if you could donate back the
 Scale PostFilter.


 On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  This is what I was looking for, but the DelegatingCollector 'finish'
 method
  doesn't exist in 4.3.0 :(   Can this be patched in and are there any
 other
  PostFilter dependencies on 4.5?
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com
  wrote:
 
   Here is one approach to use in a postfilter
  
   1) In the collect() method call score for each doc. Use the scores to
   create your scaleInfo.
   2) Keep a bitset of the hits and a priorityQueue of your top X
 ScoreDocs.
   3) Don't delegate any documents to lower collectors in the collect()
   method.
   4) In the finish method create a score mapping (use the hppc
   IntFloatOpenHashMap) with your top X docIds pointing to their score,
  using
   the priorityQueue created in step 2. Then iterate the bitset (also
  created
   in step 2) sending down each doc to the lower collectors, retrieving
 and
   scaling the score from the score map. If the document is not in the
 score
   map then send down 0.
  
   You'll have setup a dummy scorer to feed to lower collectors. The
   CollapsingQParserPlugin has an example of how to do this.
  
  
  
  
   On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
Hi Joel,
   
I thought about using a PostFilter, but the problem is that the
 'scale'
function must be done after all matching docs have been scored but
  before
adding them to the PriorityQueue that sorts just the rows to be
  returned.
Doing the 'scale' function wrapped in a 'query' is proving to be too
  slow
when it visits every document in the index.
   
In the Collector, I can see how to get the field values like this:
   
   
  
 
 indexSearcher.getSchema().getField(field(myfield).getType().getValueSource(SchemaField, QParser).getValues()
   
But, 'getValueSource' needs a QParser, which isn't available.
And I can't create a QParser without a SolrQueryRequest, which isn't
available.
   
Thanks,
Peter
   
   
On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein joels...@gmail.com
 
wrote:
   
 Peter,

 It sounds like you could achieve what you want to do in a
 PostFilter
rather
 then extending the TopDocsCollector. Is there a reason why a
  PostFilter
 won't work for you?

 Joel


 On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  Quick question:
  In the context of a custom collector, how does one get the
 values
  of
   a
  field of type 'ExternalFileField'?
 
  Thanks,
  Peter
 
 
  On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan 
   peterlkee...@gmail.com
  wrote:
 
   Hi Joel,
  
   This is related to another thread on function query matching (
  
 

   
  
 
 http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513
  ).
   The patch in SOLR-4465 will allow me to extend
 TopDocsCollector
  and
  perform
   the 'scale' function on only the documents matching the main
  dismax
  query.
   As you mention, it is a slightly intrusive design and requires
   that I
   manage my own PriorityQueue (and a local duplicate of
 HitQueue),
   but
  should
   work. I think a better design would hide the PQ from the
 plugin.
  
   Thanks,
   Peter
  
  
   On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein 
  joels...@gmail.com
   
  wrote:
  
   Hi Peter,
  
   I've been meaning to revisit configurable ranking collectors,
  but
   I
   haven't
   yet had a chance. It's on the shortlist of things I'd like to
   tackle
   though.
  
  
  
   On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan 
peterlkee...@gmail.com
   wrote:
  
I looked at SOLR-4465 and SOLR-5045, where it appears that
  there
is
 a
   goal
to be able to do custom sorting and ranking in a
 PostFilter.
  So
far,
  it
looks like only custom aggregation can be implemented in
PostFilter
   (5045

Re: Configurable collectors for custom ranking

2013-12-12 Thread Peter Keegan
This is pretty cool, and worthy of adding to Solr in Action (v2) and the
other books. With function queries, flexible filter processing and caching,
custom collectors, and post filters, there's a lot of flexibility here.

Btw, the query times using a custom collector to scale/recompute scores is
excellent (will have to see how it compares to your outlined solution).

Thanks,
Peter


On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com wrote:

 The sorting is going to happen in the lower level collectors. You need a
 value source that returns the score of the document being collected.

 Here is how you can make this happen:

 1) Create an object in your PostFilter that simply holds the current score.
 Place this object in the SearchRequest context map. Update object.score as
 you pass the docs and scores to the lower collectors.

 2) Create a values source that checks the SearchRequest context for the
 object that's holding the current score. Use this object to return the
 current score when called. For example if you give the value source a
 handle called score a compound function call will look like this:
 sum(score(), field(x))

 Joel










 On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Regarding my original goal, which is to perform a math function using the
  scaled score and a field value, and sort on the result, how does this fit
  in? Must I implement another custom PostFilter with a higher cost than
 the
  scale PostFilter?
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
 
   Thanks very much for the guidance. I'd be happy to donate a working
   solution.
  
   Peter
  
  
   On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com
  wrote:
  
   SOLR-5020 has the commit info, it's mainly changes to
 SolrIndexSearcher
  I
   believe. They might apply to 4.3.
   I think as long you have the finish method that's all you'll need. If
  you
   can get this working it would be excellent if you could donate back
 the
   Scale PostFilter.
  
  
   On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
This is what I was looking for, but the DelegatingCollector 'finish'
   method
doesn't exist in 4.3.0 :(   Can this be patched in and are there any
   other
PostFilter dependencies on 4.5?
   
Thanks,
Peter
   
   
On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com
 
wrote:
   
 Here is one approach to use in a postfilter

 1) In the collect() method call score for each doc. Use the scores
  to
 create your scaleInfo.
 2) Keep a bitset of the hits and a priorityQueue of your top X
   ScoreDocs.
 3) Don't delegate any documents to lower collectors in the
 collect()
 method.
 4) In the finish method create a score mapping (use the hppc
 IntFloatOpenHashMap) with your top X docIds pointing to their
 score,
using
 the priorityQueue created in step 2. Then iterate the bitset (also
created
 in step 2) sending down each doc to the lower collectors,
 retrieving
   and
 scaling the score from the score map. If the document is not in
 the
   score
 map then send down 0.

 You'll have setup a dummy scorer to feed to lower collectors. The
 CollapsingQParserPlugin has an example of how to do this.




 On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  Hi Joel,
 
  I thought about using a PostFilter, but the problem is that the
   'scale'
  function must be done after all matching docs have been scored
 but
before
  adding them to the PriorityQueue that sorts just the rows to be
returned.
  Doing the 'scale' function wrapped in a 'query' is proving to be
  too
slow
  when it visits every document in the index.
 
  In the Collector, I can see how to get the field values like
 this:
 
 

   
  
 
 indexSearcher.getSchema().getField(field(myfield).getType().getValueSource(SchemaField, QParser).getValues()
 
  But, 'getValueSource' needs a QParser, which isn't available.
  And I can't create a QParser without a SolrQueryRequest, which
  isn't
  available.
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein 
  joels...@gmail.com
   
  wrote:
 
   Peter,
  
   It sounds like you could achieve what you want to do in a
   PostFilter
  rather
   then extending the TopDocsCollector. Is there a reason why a
PostFilter
   won't work for you?
  
   Joel
  
  
   On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan 
peterlkee...@gmail.com
   wrote:
  
Quick question:
In the context of a custom collector, how does one get the
   values
of
 a
field of type 'ExternalFileField'?
   
Thanks

Re: Configurable collectors for custom ranking

2013-12-11 Thread Peter Keegan
Hi Joel,

I thought about using a PostFilter, but the problem is that the 'scale'
function must be done after all matching docs have been scored but before
adding them to the PriorityQueue that sorts just the rows to be returned.
Doing the 'scale' function wrapped in a 'query' is proving to be too slow
when it visits every document in the index.

In the Collector, I can see how to get the field values like this:
indexSearcher.getSchema().getField(field(myfield).getType().getValueSource(SchemaField, QParser).getValues()

But, 'getValueSource' needs a QParser, which isn't available.
And I can't create a QParser without a SolrQueryRequest, which isn't
available.
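One possible workaround (untested, and not something confirmed in this thread): on
the request thread the SolrQueryRequest can usually be reached through
SolrRequestInfo, which is enough to build a QParser:

  SolrQueryRequest req = SolrRequestInfo.getRequestInfo().getReq();  // may be null off the request thread
  QParser parser = QParser.getParser("*:*", "lucene", req);          // any parser instance will do here
  SchemaField sf = req.getSchema().getField("myfield");              // "myfield" is just the example field
  ValueSource vs = sf.getType().getValueSource(sf, parser);
  // readerContext = the AtomicReaderContext of the current segment, e.g. in setNextReader
  FunctionValues vals = vs.getValues(new java.util.HashMap(), readerContext);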

Thanks,
Peter


On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein joels...@gmail.com wrote:

 Peter,

 It sounds like you could achieve what you want to do in a PostFilter rather
 than extending the TopDocsCollector. Is there a reason why a PostFilter
 won't work for you?

 Joel


 On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Quick question:
  In the context of a custom collector, how does one get the values of a
  field of type 'ExternalFileField'?
 
  Thanks,
  Peter
 
 
  On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
 
   Hi Joel,
  
   This is related to another thread on function query matching (
  
 
 http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513
  ).
   The patch in SOLR-4465 will allow me to extend TopDocsCollector and
  perform
   the 'scale' function on only the documents matching the main dismax
  query.
   As you mention, it is a slightly intrusive design and requires that I
   manage my own PriorityQueue (and a local duplicate of HitQueue), but
  should
   work. I think a better design would hide the PQ from the plugin.
  
   Thanks,
   Peter
  
  
   On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein joels...@gmail.com
  wrote:
  
   Hi Peter,
  
   I've been meaning to revisit configurable ranking collectors, but I
   haven't
   yet had a chance. It's on the shortlist of things I'd like to tackle
   though.
  
  
  
   On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
I looked at SOLR-4465 and SOLR-5045, where it appears that there is
 a
   goal
to be able to do custom sorting and ranking in a PostFilter. So far,
  it
looks like only custom aggregation can be implemented in PostFilter
   (5045).
Custom sorting/ranking can be done in a pluggable collector (4465),
  but
this patch is no longer in dev.
   
Is there any other dev. being done on adding custom sorting (after
collection) via a plugin?
   
Thanks,
Peter
   
  
  
  
   --
   Joel Bernstein
   Search Engineer at Heliosearch
  
  
  
 



 --
 Joel Bernstein
 Search Engineer at Heliosearch



Re: Configurable collectors for custom ranking

2013-12-11 Thread Peter Keegan
From the Collector context, I suppose I can access the FileFloatSource
directly like this, although it's not generic:

SchemaField field = indexSearcher.getSchema().getField(fieldName);
String dataDir = indexSearcher.getSchema().getResourceLoader().getDataDir();
ExternalFileField eff = (ExternalFileField) field.getType();
FileFloatSource fieldValues = eff.getFileFloatSource(field, dataDir);

And then read the values in 'setNextReader'
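Roughly like this (a sketch, assuming 'fieldValues' is the FileFloatSource obtained
above and 'context' is the AtomicReaderContext passed to setNextReader):

  // in setNextReader(AtomicReaderContext context):
  FunctionValues vals = fieldValues.getValues(new java.util.HashMap(), context);

  // in collect(int doc):
  float v = vals.floatVal(doc);   // doc is segment-relative at this point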

Peter


On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan peterlkee...@gmail.comwrote:

 Hi Joel,

 I thought about using a PostFilter, but the problem is that the 'scale'
 function must be done after all matching docs have been scored but before
 adding them to the PriorityQueue that sorts just the rows to be returned.
 Doing the 'scale' function wrapped in a 'query' is proving to be too slow
 when it visits every document in the index.

 In the Collector, I can see how to get the field values like this:
 indexSearcher.getSchema().getField(field(myfield).getType().getValueSource(SchemaField,
 QParser).getValues()

 But, 'getValueSource' needs a QParser, which isn't available.
 And I can't create a QParser without a SolrQueryRequest, which isn't
 available.

 Thanks,
 Peter


 On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein joels...@gmail.comwrote:

 Peter,

 It sounds like you could achieve what you want to do in a PostFilter
 rather
 then extending the TopDocsCollector. Is there a reason why a PostFilter
 won't work for you?

 Joel


 On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Quick question:
  In the context of a custom collector, how does one get the values of a
  field of type 'ExternalFileField'?
 
  Thanks,
  Peter
 
 
  On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
 
   Hi Joel,
  
   This is related to another thread on function query matching (
  
 
 http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513
  ).
   The patch in SOLR-4465 will allow me to extend TopDocsCollector and
  perform
   the 'scale' function on only the documents matching the main dismax
  query.
   As you mention, it is a slightly intrusive design and requires that I
   manage my own PriorityQueue (and a local duplicate of HitQueue), but
  should
   work. I think a better design would hide the PQ from the plugin.
  
   Thanks,
   Peter
  
  
   On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein joels...@gmail.com
  wrote:
  
   Hi Peter,
  
   I've been meaning to revisit configurable ranking collectors, but I
   haven't
   yet had a chance. It's on the shortlist of things I'd like to tackle
   though.
  
  
  
   On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan peterlkee...@gmail.com
 
   wrote:
  
I looked at SOLR-4465 and SOLR-5045, where it appears that there
 is a
   goal
to be able to do custom sorting and ranking in a PostFilter. So
 far,
  it
looks like only custom aggregation can be implemented in PostFilter
   (5045).
Custom sorting/ranking can be done in a pluggable collector (4465),
  but
this patch is no longer in dev.
   
Is there any other dev. being done on adding custom sorting (after
collection) via a plugin?
   
Thanks,
Peter
   
  
  
  
   --
   Joel Bernstein
   Search Engineer at Heliosearch
  
  
  
 



 --
 Joel Bernstein
 Search Engineer at Heliosearch





Re: Configurable collectors for custom ranking

2013-12-11 Thread Peter Keegan
This is what I was looking for, but the DelegatingCollector 'finish' method
doesn't exist in 4.3.0 :(   Can this be patched in and are there any other
PostFilter dependencies on 4.5?

Thanks,
Peter


On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com wrote:

 Here is one approach to use in a postfilter

 1) In the collect() method call score for each doc. Use the scores to
 create your scaleInfo.
 2) Keep a bitset of the hits and a priorityQueue of your top X ScoreDocs.
 3) Don't delegate any documents to lower collectors in the collect()
 method.
 4) In the finish method create a score mapping (use the hppc
 IntFloatOpenHashMap) with your top X docIds pointing to their score, using
 the priorityQueue created in step 2. Then iterate the bitset (also created
 in step 2) sending down each doc to the lower collectors, retrieving and
 scaling the score from the score map. If the document is not in the score
 map then send down 0.

 You'll have to set up a dummy scorer to feed to lower collectors. The
 CollapsingQParserPlugin has an example of how to do this.
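The dummy scorer itself is tiny; a sketch along the lines of the one in
CollapsingQParserPlugin (names here are invented) would be:

  // Lets the lower collectors call score(); it never iterates on its own.
  static class DummyScorer extends Scorer {
    float score;
    int docId;

    DummyScorer() { super(null); }

    @Override public float score() { return score; }
    @Override public int freq() { return 1; }
    @Override public int docID() { return docId; }
    @Override public int nextDoc() { return DocIdSetIterator.NO_MORE_DOCS; }
    @Override public int advance(int target) { return DocIdSetIterator.NO_MORE_DOCS; }
    @Override public long cost() { return 0; }
  }

In finish() the PostFilter sets this scorer on the delegate via setScorer(), updates
score/docId before each delegate.collect(doc) call, and the lower collectors are
none the wiser.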




 On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  Hi Joel,
 
  I thought about using a PostFilter, but the problem is that the 'scale'
  function must be done after all matching docs have been scored but before
  adding them to the PriorityQueue that sorts just the rows to be returned.
  Doing the 'scale' function wrapped in a 'query' is proving to be too slow
  when it visits every document in the index.
 
  In the Collector, I can see how to get the field values like this:
 
 
  indexSearcher.getSchema().getField(field(myfield).getType().getValueSource(SchemaField, QParser).getValues()
 
  But, 'getValueSource' needs a QParser, which isn't available.
  And I can't create a QParser without a SolrQueryRequest, which isn't
  available.
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein joels...@gmail.com
  wrote:
 
   Peter,
  
   It sounds like you could achieve what you want to do in a PostFilter
  rather
   then extending the TopDocsCollector. Is there a reason why a PostFilter
   won't work for you?
  
   Joel
  
  
   On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
Quick question:
In the context of a custom collector, how does one get the values of
 a
field of type 'ExternalFileField'?
   
Thanks,
Peter
   
   
On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan 
 peterlkee...@gmail.com
wrote:
   
 Hi Joel,

 This is related to another thread on function query matching (

   
  
 
 http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513
).
 The patch in SOLR-4465 will allow me to extend TopDocsCollector and
perform
 the 'scale' function on only the documents matching the main dismax
query.
 As you mention, it is a slightly intrusive design and requires
 that I
 manage my own PriorityQueue (and a local duplicate of HitQueue),
 but
should
 work. I think a better design would hide the PQ from the plugin.

 Thanks,
 Peter


 On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein joels...@gmail.com
 
wrote:

 Hi Peter,

 I've been meaning to revisit configurable ranking collectors, but
 I
 haven't
 yet had a chance. It's on the shortlist of things I'd like to
 tackle
 though.



 On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  I looked at SOLR-4465 and SOLR-5045, where it appears that there
  is
   a
 goal
  to be able to do custom sorting and ranking in a PostFilter. So
  far,
it
  looks like only custom aggregation can be implemented in
  PostFilter
 (5045).
  Custom sorting/ranking can be done in a pluggable collector
  (4465),
but
  this patch is no longer in dev.
 
  Is there any other dev. being done on adding custom sorting
 (after
  collection) via a plugin?
 
  Thanks,
  Peter
 



 --
 Joel Bernstein
 Search Engineer at Heliosearch



   
  
  
  
   --
   Joel Bernstein
   Search Engineer at Heliosearch
  
 



 --
 Joel Bernstein
 Search Engineer at Heliosearch



Re: Configurable collectors for custom ranking

2013-12-11 Thread Peter Keegan
Thanks very much for the guidance. I'd be happy to donate a working
solution.

Peter


On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com wrote:

 SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I
 believe. They might apply to 4.3.
 I think as long you have the finish method that's all you'll need. If you
 can get this working it would be excellent if you could donate back the
 Scale PostFilter.


 On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  This is what I was looking for, but the DelegatingCollector 'finish'
 method
  doesn't exist in 4.3.0 :(   Can this be patched in and are there any
 other
  PostFilter dependencies on 4.5?
 
  Thanks,
  Peter
 
 
  On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com
  wrote:
 
   Here is one approach to use in a postfilter
  
   1) In the collect() method call score for each doc. Use the scores to
   create your scaleInfo.
   2) Keep a bitset of the hits and a priorityQueue of your top X
 ScoreDocs.
   3) Don't delegate any documents to lower collectors in the collect()
   method.
   4) In the finish method create a score mapping (use the hppc
   IntFloatOpenHashMap) with your top X docIds pointing to their score,
  using
   the priorityQueue created in step 2. Then iterate the bitset (also
  created
   in step 2) sending down each doc to the lower collectors, retrieving
 and
   scaling the score from the score map. If the document is not in the
 score
   map then send down 0.
  
   You'll have setup a dummy scorer to feed to lower collectors. The
   CollapsingQParserPlugin has an example of how to do this.
  
  
  
  
   On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
Hi Joel,
   
I thought about using a PostFilter, but the problem is that the
 'scale'
function must be done after all matching docs have been scored but
  before
adding them to the PriorityQueue that sorts just the rows to be
  returned.
Doing the 'scale' function wrapped in a 'query' is proving to be too
  slow
when it visits every document in the index.
   
In the Collector, I can see how to get the field values like this:
   
   
  
 
  indexSearcher.getSchema().getField(field(myfield).getType().getValueSource(SchemaField, QParser).getValues()
   
But, 'getValueSource' needs a QParser, which isn't available.
And I can't create a QParser without a SolrQueryRequest, which isn't
available.
   
Thanks,
Peter
   
   
On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein joels...@gmail.com
wrote:
   
 Peter,

 It sounds like you could achieve what you want to do in a
 PostFilter
rather
 then extending the TopDocsCollector. Is there a reason why a
  PostFilter
 won't work for you?

 Joel


 On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan 
  peterlkee...@gmail.com
 wrote:

  Quick question:
  In the context of a custom collector, how does one get the values
  of
   a
  field of type 'ExternalFileField'?
 
  Thanks,
  Peter
 
 
  On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan 
   peterlkee...@gmail.com
  wrote:
 
   Hi Joel,
  
   This is related to another thread on function query matching (
  
 

   
  
 
 http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513
  ).
   The patch in SOLR-4465 will allow me to extend TopDocsCollector
  and
  perform
   the 'scale' function on only the documents matching the main
  dismax
  query.
   As you mention, it is a slightly intrusive design and requires
   that I
   manage my own PriorityQueue (and a local duplicate of
 HitQueue),
   but
  should
   work. I think a better design would hide the PQ from the
 plugin.
  
   Thanks,
   Peter
  
  
   On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein 
  joels...@gmail.com
   
  wrote:
  
   Hi Peter,
  
   I've been meaning to revisit configurable ranking collectors,
  but
   I
   haven't
   yet had a chance. It's on the shortlist of things I'd like to
   tackle
   though.
  
  
  
   On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan 
peterlkee...@gmail.com
   wrote:
  
I looked at SOLR-4465 and SOLR-5045, where it appears that
  there
is
 a
   goal
to be able to do custom sorting and ranking in a PostFilter.
  So
far,
  it
looks like only custom aggregation can be implemented in
PostFilter
   (5045).
Custom sorting/ranking can be done in a pluggable collector
(4465),
  but
this patch is no longer in dev.
   
Is there any other dev. being done on adding custom sorting
   (after
collection) via a plugin?
   
Thanks,
Peter
   
  
  
  
   --
   Joel Bernstein

Re: Configurable collectors for custom ranking

2013-12-10 Thread Peter Keegan
Hi Joel,

This is related to another thread on function query matching (
http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513).
The patch in SOLR-4465 will allow me to extend TopDocsCollector and perform
the 'scale' function on only the documents matching the main dismax query.
As you mention, it is a slightly intrusive design and requires that I
manage my own PriorityQueue (and a local duplicate of HitQueue), but should
work. I think a better design would hide the PQ from the plugin.
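Since Lucene's HitQueue is package-private, the local duplicate would basically be
just this (sketch):

  class LocalHitQueue extends org.apache.lucene.util.PriorityQueue<ScoreDoc> {
    LocalHitQueue(int size) { super(size); }

    @Override
    protected boolean lessThan(ScoreDoc a, ScoreDoc b) {
      // head is the weakest hit; ties broken by docid, as in Lucene's own HitQueue
      if (a.score == b.score) return a.doc > b.doc;
      return a.score < b.score;
    }
  }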

Thanks,
Peter


On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein joels...@gmail.com wrote:

 Hi Peter,

 I've been meaning to revisit configurable ranking collectors, but I haven't
 yet had a chance. It's on the shortlist of things I'd like to tackle
 though.



 On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  I looked at SOLR-4465 and SOLR-5045, where it appears that there is a
 goal
  to be able to do custom sorting and ranking in a PostFilter. So far, it
  looks like only custom aggregation can be implemented in PostFilter
 (5045).
  Custom sorting/ranking can be done in a pluggable collector (4465), but
  this patch is no longer in dev.
 
  Is there any other dev. being done on adding custom sorting (after
  collection) via a plugin?
 
  Thanks,
  Peter
 



 --
 Joel Bernstein
 Search Engineer at Heliosearch



Re: Configurable collectors for custom ranking

2013-12-10 Thread Peter Keegan
Quick question:
In the context of a custom collector, how does one get the values of a
field of type 'ExternalFileField'?

Thanks,
Peter


On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan peterlkee...@gmail.comwrote:

 Hi Joel,

 This is related to another thread on function query matching (
 http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513).
 The patch in SOLR-4465 will allow me to extend TopDocsCollector and perform
 the 'scale' function on only the documents matching the main dismax query.
 As you mention, it is a slightly intrusive design and requires that I
 manage my own PriorityQueue (and a local duplicate of HitQueue), but should
 work. I think a better design would hide the PQ from the plugin.

 Thanks,
 Peter


 On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein joels...@gmail.com wrote:

 Hi Peter,

 I've been meaning to revisit configurable ranking collectors, but I
 haven't
 yet had a chance. It's on the shortlist of things I'd like to tackle
 though.



 On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  I looked at SOLR-4465 and SOLR-5045, where it appears that there is a
 goal
  to be able to do custom sorting and ranking in a PostFilter. So far, it
  looks like only custom aggregation can be implemented in PostFilter
 (5045).
  Custom sorting/ranking can be done in a pluggable collector (4465), but
  this patch is no longer in dev.
 
  Is there any other dev. being done on adding custom sorting (after
  collection) via a plugin?
 
  Thanks,
  Peter
 



 --
 Joel Bernstein
 Search Engineer at Heliosearch





Re: Function query matching

2013-12-07 Thread Peter Keegan
  But for your specific goal Peter: Yes, if the whole point of a function
  you have is to wrap generated a scaled score of your base $qq, ...

Thanks for the confirmation, Chris. So, to do this efficiently, I think I
need to implement a custom Collector that performs the scaling (and other
math) after collecting the matching dismax query docs. I started a separate
thread asking about the state of configurable collectors.

Thanks,
Peter


On Sat, Dec 7, 2013 at 1:45 AM, Chris Hostetter hossman_luc...@fucit.orgwrote:


 I had to do a double take when i read this sentence...

 : Even with any improvements to 'scale', all function queries will add a
 : linear increase to the Qtime as index size increases, since they match
 all
 : docs.

 ...because that smelled like either a bug in your methodology, or a bug in
 Solr.  To convince myself there wasn't a bug in Solr, i wrote a test case
 (i'll commit tomorrow, bunch of churn in svn right now making ant
 precommit unhappy) to prove that when wrapping boost functions around
 queries, Solr will only evaluate the functions for docs matching the
 wrapped query -- so there is no linear increase as the index size
 increases, just the (necessary) linear increase as the number of
 *matching* docs grows. (for most functions anyway -- as mentioned scale
 is special).

 BUT! ... then i remembered how this thread started, and your goal of
 scaling the scores from a wrapped query.

 I want to be clear for 99% of the people reading this: if you find
 yourself writing a query structure like this...

   q={!func}..functions involving wrapping $qq ...
  qq={!edismax ...lots of stuff but still only matching subset of the
 index...}
  fq={!query v=$qq}

 ...Try to restructure the math you want to do into the form of a
 multiplier

   q={!boost b=$b v=$qq}
   b=...functions producing a score multiplier...
  qq={!edismax ...lots of stuff but still only matching subset of the
 index...}

 Because the latter case is much more efficient and Solr will only compute
 the function values for the docs it needs to (that match the wrapped $qq
 query)
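Applied to the example from this thread, the multiplier form would look something
like the following (a sketch only, and note it drops the scale()/sum() weighting,
since the boost has to be a pure multiplier):

    q={!boost b=$b v=$qq}
    b=field(myfield)
   qq={!edismax v='news' qf='title^2 body'}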

 But for your specific goal Peter: Yes, if the whole point of a function
 you have is to wrap generated a scaled score of your base $qq, then the
 function (wrapping the scale(), wrapping the query()) is going to have to
 be evaluated for every doc -- that will definitely be linear based on the
 size of the index.



 -Hoss
 http://www.lucidworks.com/



Re: Function query matching

2013-12-06 Thread Peter Keegan
I added some timing logging to IndexSearcher and ScaleFloatFunction and
compared a simple DisMax query with a DisMax query wrapped in the scale
function. The index size was 500K docs, 61K docs match the DisMax query.
The simple DisMax query took 33 ms, the function query took 89 ms. What I
found was:

1. The scale query only normalized the scores once (in
ScaleInfo.createScaleInfo) and added 33 ms to the Qtime. Subsequent calls
to ScaleFloatFunction.getValues bypassed 'createScaleInfo' and added ~0 time.

2. The FunctionQuery 'nextDoc' iterations added 16 ms over the DisMax
'nextDoc' iterations.

Here's the breakdown:

Simple DisMax query:
weight.scorer: 3 ms (get term enum)
scorer.score: 23 ms (nextDoc iterations)
other: 3 ms
Total: 33 ms

DisMax wrapped in ScaleFloatFunction:
weight.scorer: 39 ms (get scaled values)
scorer.score: 39 ms (nextDoc iterations)
other: 11 ms
Total: 89 ms

Even with any improvements to 'scale', all function queries will add a
linear increase to the Qtime as index size increases, since they match all
docs.

Trey: I'd be happy to test any patch that you find improves the speed.



On Mon, Dec 2, 2013 at 11:21 PM, Trey Grainger solrt...@gmail.com wrote:

 We're working on the same problem with the combination of the
 scale(query(...)) combination, so I'd like to share a bit more information
 that may be useful.

 *On the scale function:*
 Even though the scale query has to calculate the scores for all documents,
 it is actually doing this work twice for each ValueSource (once to
 calculate the min and max values, and then again when actually scoring the
 documents), which is inefficient.

 To solve the problem, we're in the process of putting a cache inside the
 scale function to remember the values for each document when they are
 initially computed (to find the min and max) so that the second pass can
 just use the previously computed values for each document.  Our theory is
 that most of the extra time due to the scale function is really just the
 result of doing duplicate work.

 No promises this won't be overly costly in terms of memory utilization, but
 we'll see what we get in terms of speed improvements and will share the
 code if it works out well.  Alternate implementation suggestions (or
 criticism of a cache like this) are also welcomed.


 *On the NoOp product function: scale(prod(1, query(...))):*
 We do the same thing, which ultimately is just an unnecessary waste of a
 loop through all documents to do an extra multiplication step.  I just
 debugged the code and uncovered the problem.  There is a Map (called
 context) that is passed through to each value source to store intermediate
 state, and both the query and scale functions are passing the ValueSource
 for the query function in as the KEY to this Map (as opposed to using some
 composite key that makes sense in the current context).  Essentially, these
 lines are overwriting each other:

 Inside ScaleFloatFunction: context.put(this.source, scaleInfo);
  //this.source refers to the QueryValueSource, and the scaleInfo refers to
 a ScaleInfo object
 Inside QueryValueSource: context.put(this, w); //this refers to the same
 QueryValueSource from above, and the w refers to a Weight object

 As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
 the context Map, it unexpectedly pulls the Weight object out instead and
 thus the invalid cast exception occurs.  The NoOp multiplication works
 because it puts a different ValueSource between the query and the
 ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this
 (in QueryValueSource).

 This should be an easy fix.  I'll create a JIRA ticket to use better key
 names in these functions and push up a patch.  This will eliminate the need
 for the extra NoOp function.

 -Trey
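Condensed, the clash Trey describes looks like this (illustrative stand-in objects
only, not the actual Solr classes):

  Object queryValueSource = new Object();   // stands in for the shared QueryValueSource instance
  Object scaleInfo = new Object();          // what ScaleFloatFunction wants to cache
  Object weight = new Object();             // what QueryValueSource stores for its Weight
  java.util.Map<Object, Object> context = new java.util.IdentityHashMap<Object, Object>();
  context.put(queryValueSource, scaleInfo); // ScaleFloatFunction keys on the wrapped query source
  context.put(queryValueSource, weight);    // QueryValueSource uses the same key and clobbers it
  // ScaleFloatFunction later reads the entry back expecting its ScaleInfo, finds the Weight,
  // and the cast exception follows unless a no-op function separates the two keys.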


 On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  I'm persuing this possible PostFilter solution, I can see how to collect
  all the hits and recompute the scores in a PostFilter, after all the hits
  have been collected (for scaling). Now, I can't see how to get the custom
  doc/score values back into the main query's HitQueue. Any advice?
 
  Thanks,
  Peter
 
 
  On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan peterlkee...@gmail.com
  wrote:
 
   Instead of using a function query, could I use the edismax query (plus
   some low cost filters not shown in the example) and implement the
   scale/sum/product computation in a PostFilter? Is the query's maxScore
   available there?
  
   Thanks,
   Peter
  
  
   On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
  
   Although the 'scale' is a big part of it, here's a closer breakdown.
  Here
   are 4 queries with increasing functions, and theei response times
  (caching
   turned off in solrconfig):
  
   100 msec:
   select?q={!edismax v='news' qf='title^2 body'}
  
   135 msec:
   select?qq={!edismax v='news' qf='title^2
   body'}q={!func}product(field

Re: Function query matching

2013-12-06 Thread Peter Keegan
In my previous posting, I said:

  Subsequent calls to ScaleFloatFunction.getValues bypassed
'createScaleInfo' and added ~0 time.

These subsequent calls are for the remaining segments in the index reader
(21 segments).

Peter



On Fri, Dec 6, 2013 at 2:10 PM, Peter Keegan peterlkee...@gmail.com wrote:

 I added some timing logging to IndexSearcher and ScaleFloatFunction and
 compared a simple DisMax query with a DisMax query wrapped in the scale
 function. The index size was 500K docs, 61K docs match the DisMax query.
 The simple DisMax query took 33 ms, the function query took 89 ms. What I
 found was:

 1. The scale query only normalized the scores once (in
 ScaleInfo.createScaleInfo) and added 33 ms to the Qtime.  Subsequent calls
 to ScaleFloatFuntion.getValues bypassed 'createScaleInfo and  added ~0 time.

 2. The FunctionQuery 'nextDoc' iterations added 16 ms over the DisMax
 'nextDoc' iterations.

 Here's the breakdown:

 Simple DisMax query:
 weight.scorer: 3 ms (get term enum)
 scorer.score: 23 ms (nextDoc iterations)
 other: 3 ms
 Total: 33 ms

 DisMax wrapped in ScaleFloatFunction:
 weight.scorer: 39 ms (get scaled values)
 scorer.score: 39 ms (nextDoc iterations)
 other: 11 ms
 Total: 89 ms

 Even with any improvements to 'scale', all function queries will add a
 linear increase to the Qtime as index size increases, since they match all
 docs.

 Trey: I'd be happy to test any patch that you find improves the speed.



 On Mon, Dec 2, 2013 at 11:21 PM, Trey Grainger solrt...@gmail.com wrote:

 We're working on the same problem with the combination of the
 scale(query(...)) combination, so I'd like to share a bit more information
 that may be useful.

 *On the scale function:*
 Even thought the scale query has to calculate the scores for all
 documents,
 it is actually doing this work twice for each ValueSource (once to
 calculate the min and max values, and then again when actually scoring the
 documents), which is inefficient.

 To solve the problem, we're in the process of putting a cache inside the
 scale function to remember the values for each document when they are
 initially computed (to find the min and max) so that the second pass can
 just use the previously computed values for each document.  Our theory is
 that most of the extra time due to the scale function is really just the
 result of doing duplicate work.

 No promises this won't be overly costly in terms of memory utilization,
 but
 we'll see what we get in terms of speed improvements and will share the
 code if it works out well.  Alternate implementation suggestions (or
 criticism of a cache like this) are also welcomed.


 *On the NoOp product function: scale(prod(1, query(...))):*
 We do the same thing, which ultimately is just an unnecessary waste of a
 loop through all documents to do an extra multiplication step.  I just
 debugged the code and uncovered the problem.  There is a Map (called
 context) that is passed through to each value source to store intermediate
 state, and both the query and scale functions are passing the ValueSource
 for the query function in as the KEY to this Map (as opposed to using some
 composite key that makes sense in the current context).  Essentially,
 these
 lines are overwriting each other:

 Inside ScaleFloatFunction: context.put(this.source, scaleInfo);
  //this.source refers to the QueryValueSource, and the scaleInfo refers to
 a ScaleInfo object
 Inside QueryValueSource: context.put(this, w); //this refers to the same
 QueryValueSource from above, and the w refers to a Weight object

 As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
 the context Map, it unexpectedly pulls the Weight object out instead and
 thus the invalid case exception occurs.  The NoOp multiplication works
 because it puts an different ValueSource between the query and the
 ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this
 (in QueryValueSource).

 This should be an easy fix.  I'll create a JIRA ticket to use better key
 names in these functions and push up a patch.  This will eliminate the
 need
 for the extra NoOp function.

 -Trey


 On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  I'm persuing this possible PostFilter solution, I can see how to collect
  all the hits and recompute the scores in a PostFilter, after all the
 hits
  have been collected (for scaling). Now, I can't see how to get the
 custom
  doc/score values back into the main query's HitQueue. Any advice?
 
  Thanks,
  Peter
 
 
  On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan peterlkee...@gmail.com
  wrote:
 
   Instead of using a function query, could I use the edismax query (plus
   some low cost filters not shown in the example) and implement the
   scale/sum/product computation in a PostFilter? Is the query's maxScore
   available there?
  
   Thanks,
   Peter
  
  
   On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
  
   Although

Configurable collectors for custom ranking

2013-12-06 Thread Peter Keegan
I looked at SOLR-4465 and SOLR-5045, where it appears that there is a goal
to be able to do custom sorting and ranking in a PostFilter. So far, it
looks like only custom aggregation can be implemented in PostFilter (5045).
Custom sorting/ranking can be done in a pluggable collector (4465), but
this patch is no longer in dev.

Is there any other dev. being done on adding custom sorting (after
collection) via a plugin?

Thanks,
Peter


Re: Function query matching

2013-12-02 Thread Peter Keegan
I'm pursuing this possible PostFilter solution: I can see how to collect
all the hits and recompute the scores in a PostFilter, after all the hits
have been collected (for scaling). However, I can't see how to get the custom
doc/score values back into the main query's HitQueue. Any advice?

Thanks,
Peter


On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan peterlkee...@gmail.comwrote:

 Instead of using a function query, could I use the edismax query (plus
 some low cost filters not shown in the example) and implement the
 scale/sum/product computation in a PostFilter? Is the query's maxScore
 available there?

 Thanks,
 Peter


 On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan peterlkee...@gmail.comwrote:

 Although the 'scale' is a big part of it, here's a closer breakdown. Here
 are 4 queries with increasing functions, and theei response times (caching
 turned off in solrconfig):

 100 msec:
 select?q={!edismax v='news' qf='title^2 body'}

 135 msec:
 select?qq={!edismax v='news' qf='title^2
 body'}q={!func}product(field(myfield),query($qq)fq={!query v=$qq}

 200 msec:
 select?qq={!edismax v='news' qf='title^2
 body'}q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfieldfq={!query
 v=$qq}

 320 msec:
  select?qq={!edismax v='news' qf='title^2
 body'}scaledQ=scale(product(query($qq),1),0,1)q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))fq={!query
 v=$qq}

 Btw, that no-op product is necessary, else you get this exception:

 org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to 
 org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo

 thanks,

 peter



 On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter 
 hossman_luc...@fucit.org wrote:


 : So, this query does just what I want, but it's typically 3 times slower
 : than the edismax query  without the functions:

 that's because the scale() function is inherently slow (it has to
 compute the min & max value for every document in order to know how to
 scale them)

 what you are seeing is the price you have to pay to get that query with a
 normalized 0-1 value.

 (you might be able to save a little bit of time by eliminating that
 no-Op multiply by 1: product(query($qq),1) ... but i doubt you'll even
 notice much of a change given that scale function.

 : Is there any way to speed this up? Would writing a custom function
 query
 : that compiled all the function queries together be any faster?

 If you can find a faster implementation for scale() then by all means let
 us know, and we can fold it back into Solr.


 -Hoss






Re: Function query matching

2013-11-29 Thread Peter Keegan
Instead of using a function query, could I use the edismax query (plus some
low cost filters not shown in the example) and implement the
scale/sum/product computation in a PostFilter? Is the query's maxScore
available there?

Thanks,
Peter


On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan peterlkee...@gmail.comwrote:

 Although the 'scale' is a big part of it, here's a closer breakdown. Here
 are 4 queries with increasing functions, and theei response times (caching
 turned off in solrconfig):

 100 msec:
 select?q={!edismax v='news' qf='title^2 body'}

 135 msec:
 select?qq={!edismax v='news' qf='title^2
 body'}q={!func}product(field(myfield),query($qq)fq={!query v=$qq}

 200 msec:
 select?qq={!edismax v='news' qf='title^2
 body'}q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfieldfq={!query
 v=$qq}

 320 msec:
 select?qq={!edismax v='news' qf='title^2
 body'}scaledQ=scale(product(query($qq),1),0,1)q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))fq={!query
 v=$qq}

 Btw, that no-op product is necessary, else you get this exception:

 org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to 
 org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo

 thanks,

 peter



 On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter hossman_luc...@fucit.org
  wrote:


 : So, this query does just what I want, but it's typically 3 times slower
 : than the edismax query  without the functions:

 that's because the scale() function is inhernetly slow (it has to
 compute the min  max value for every document in order to know how to
 scale them)

 what you are seeing is the price you have to pay to get that query with a
 normalized 0-1 value.

 (you might be able to save a little bit of time by eliminating that
 no-Op multiply by 1: product(query($qq),1) ... but i doubt you'll even
 notice much of a chnage given that scale function.

 : Is there any way to speed this up? Would writing a custom function query
 : that compiled all the function queries together be any faster?

 If you can find a faster implementation for scale() then by all means let
 us konw, and we can fold it back into Solr.


 -Hoss





Re: Function query matching

2013-11-27 Thread Peter Keegan
Hi,

So, this query does just what I want, but it's typically 3 times slower
than the edismax query  without the functions:

select?qq={!edismax v='news' qf='title^2 body'}scaledQ=scale(product(
query($qq),1),0,1)q={!func}sum(product(0.75,$scaledQ),
product(0.25,field(myfield)))fq={!query v=$qq}

Is there any way to speed this up? Would writing a custom function query
that compiled all the function queries together be any faster?

Thanks,
Peter


On Mon, Nov 11, 2013 at 1:31 PM, Peter Keegan peterlkee...@gmail.comwrote:

 Thanks


 On Mon, Nov 11, 2013 at 11:46 AM, Yonik Seeley yo...@heliosearch.comwrote:

 On Mon, Nov 11, 2013 at 11:39 AM, Peter Keegan peterlkee...@gmail.com
 wrote:
  fq=$qq
 
  What is the proper syntax?

 fq={!query v=$qq}

 -Yonik
 http://heliosearch.com -- making solr shine





Re: Function query matching

2013-11-27 Thread Peter Keegan
Although the 'scale' is a big part of it, here's a closer breakdown. Here
are 4 queries with increasing functions, and their response times (caching
turned off in solrconfig):

100 msec:
select?q={!edismax v='news' qf='title^2 body'}

135 msec:
select?qq={!edismax v='news' qf='title^2
body'}&q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}

200 msec:
select?qq={!edismax v='news' qf='title^2
body'}&q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))&fq={!query
v=$qq}

320 msec:
select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query
v=$qq}

Btw, that no-op product is necessary, else you get this exception:

org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to
org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo

thanks,

peter



On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : So, this query does just what I want, but it's typically 3 times slower
 : than the edismax query  without the functions:

 that's because the scale() function is inherently slow (it has to
 compute the min & max value for every document in order to know how to
 scale them)

 what you are seeing is the price you have to pay to get that query with a
 normalized 0-1 value.

 (you might be able to save a little bit of time by eliminating that
 no-Op multiply by 1: product(query($qq),1) ... but i doubt you'll even
 notice much of a change given that scale function.

 : Is there any way to speed this up? Would writing a custom function query
 : that compiled all the function queries together be any faster?

 If you can find a faster implementation for scale() then by all means let
 us know, and we can fold it back into Solr.


 -Hoss



Re: Function query matching

2013-11-11 Thread Peter Keegan
I replaced the frange filter with the following filter and got the correct
no. of results and it was 3X faster:

select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!edismax
v='news' qf='title^2 body'}

Then, I tried to simplify the query with parameter substitution, but 'fq'
didn't parse correctly:

select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq=$qq

What is the proper syntax?

Thanks,
Peter


On Thu, Nov 7, 2013 at 2:16 PM, Peter Keegan peterlkee...@gmail.com wrote:

 I'm trying to use a normalized score in a query as I described in a
 recent thread titled Re: How to get similarity score between 0 and 1 not
 relative score

 I'm using this query:
 select?qq={!edismax v='news' qf='title^2
 body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!frange
 l=0.001}$q

 Is there another way to accomplish this using dismax boosting?



 On Thu, Nov 7, 2013 at 12:55 PM, Jason Hellman 
 jhell...@innoventsolutions.com wrote:

 You can, of course, use a function range query:

 select?q=text:news&fq={!frange l=0 u=100}sum(x,y)


 http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html

 This will give you a bit more flexibility to meet your goal.

 On Nov 7, 2013, at 7:26 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

  Function queries score (all) documents, but don't filter them.  All
 documents effectively match a function query.
 
Erik
 
  On Nov 7, 2013, at 1:48 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
 
  Why does this function query return docs that don't match the embedded
  query?
   select?qq=text:news&q={!func}sum(query($qq),0)
 





Re: Function query matching

2013-11-11 Thread Peter Keegan
Thanks


On Mon, Nov 11, 2013 at 11:46 AM, Yonik Seeley yo...@heliosearch.comwrote:

 On Mon, Nov 11, 2013 at 11:39 AM, Peter Keegan peterlkee...@gmail.com
 wrote:
  fq=$qq
 
  What is the proper syntax?

 fq={!query v=$qq}

 -Yonik
 http://heliosearch.com -- making solr shine



Function query matching

2013-11-07 Thread Peter Keegan
Why does this function query return docs that don't match the embedded
query?
select?qq=text:news&q={!func}sum(query($qq),0)


Re: Function query matching

2013-11-07 Thread Peter Keegan
I'm trying to use a normalized score in a query as I described in a recent
thread titled Re: How to get similarity score between 0 and 1 not relative
score

I'm using this query:
select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!frange
l=0.001}$q

Is there another way to accomplish this using dismax boosting?



On Thu, Nov 7, 2013 at 12:55 PM, Jason Hellman 
jhell...@innoventsolutions.com wrote:

 You can, of course, use a function range query:

 select?q=text:news&fq={!frange l=0 u=100}sum(x,y)


 http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html

 This will give you a bit more flexibility to meet your goal.

 On Nov 7, 2013, at 7:26 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

  Function queries score (all) documents, but don't filter them.  All
 documents effectively match a function query.
 
Erik
 
  On Nov 7, 2013, at 1:48 PM, Peter Keegan peterlkee...@gmail.com wrote:
 
  Why does this function query return docs that don't match the embedded
  query?
   select?qq=text:news&q={!func}sum(query($qq),0)
 




Re: Data Import Handler

2013-11-06 Thread Peter Keegan
I've done this by adding an attribute to the entity element (e.g.
myconfig="myconfig.xml"), and reading it in the 'init' method with
context.getResolvedEntityAttribute("myconfig").

Peter
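
A rough sketch of what that can look like, built on DataImportHandler's Context API; the class name, the extra entity attribute and the property keys are illustrative assumptions, not part of Peter's original code:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

// Hypothetical entity processor that reads an extra entity attribute,
// e.g. <entity name="item" myconfig="myconfig.xml" ...>, and loads
// db_url/uname/password from that properties file during init.
public class MyConfigEntityProcessor extends SqlEntityProcessor {
  private final Properties dbProps = new Properties();

  @Override
  public void init(Context context) {
    String cfg = context.getResolvedEntityAttribute("myconfig");
    if (cfg != null) {
      try (InputStream in = new FileInputStream(cfg)) {
        dbProps.load(in);               // e.g. db_url, uname, password
      } catch (IOException e) {
        throw new RuntimeException("Could not load " + cfg, e);
      }
    }
    super.init(context);
  }
}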


On Wed, Nov 6, 2013 at 8:25 AM, Ramesh ramesh.po...@vensaiinc.com wrote:

 Hi Folks,



  Can anyone suggest how I can customize the dataconfig.xml file?

  I want to provide database details like (db_url, uname, password) from my
  own
  properties file instead of the dataconfig.xml file.




Re: How to get similarity score between 0 and 1 not relative score

2013-11-01 Thread Peter Keegan
There's another use case for scaling the score. Suppose I want to compute a
custom score based on the weighted sum of:

- product(0.75, relevance score)
- product(0.25, value from another field)

For this to work, both fields must have values between 0-1, for example.
Toby's example using the scale function seems to work, but you have to use
fq to eliminate results with score=0. It seems this is somewhat expensive,
since the scaling can't be done until all results have been collected to
get the max score. Then, are the results resorted? I haven't looked
closely, yet.

Peter






On Thu, Oct 31, 2013 at 7:48 PM, Toby Lazar tla...@capitaltg.com wrote:

 I think you are looking for something like this, though you can omit the fq
 section:



  http://localhost:8983/solr/collection/select?abc=text:bob&q={!func}scale(product(query($abc),1),0,1)&fq={!frange l=0.9}$q

 Also, I don't understand all the fuss about normalized scores.  In the
 linked example, I can see an interest in searching for apple bannana,
 zzz yyy xxx qqq kkk ttt rrr 111, etc. and wanting only close matches for
 that point in time.  Would this be a good use for this approach?  I
 understand that the results can change if the documents in the index
 change.

 Thanks,

 Toby



 On Thu, Oct 31, 2013 at 12:56 AM, Anshum Gupta ans...@anshumgupta.net
 wrote:

  Hi Susheel,
 
  Have a look at this:
  http://wiki.apache.org/lucene-java/ScoresAsPercentages
 
  You may really want to reconsider doing that.
 
 
 
 
  On Thu, Oct 31, 2013 at 9:41 AM, sushil sharma sushil2...@yahoo.co.in
  wrote:
 
   Hi,
  
    We have a requirement where the user would like to see a score (between 0
  to 1) which can tell how close the input search string is to the result
   string.
    So if the input was very close but not an exact match, the score could be .90
  etc.
   
    I do understand that we can get the score from solr & divide by the highest
  score,
    but that will always show 1 even if the match was not exact.
  
   Regards,
   Susheel
 
 
 
 
  --
 
  Anshum Gupta
  http://www.anshumgupta.net
 



How to reinitialize a solrcloud replica

2013-10-25 Thread Peter Keegan
I'm running 4.3 in solrcloud mode and trying to test index recovery, but
it's failing.
I have one shard, 2 replicas:
Leader: 10.159.8.105
Replica: 10.159.6.73

To test, I stopped the replica, deleted the 'data' directory and restarted
solr. Here is the replica's logging:

INFO  - 2013-10-25 12:19:40.773; org.apache.solr.cloud.ZkController; We are
http://10.159.6.73:8983/solr/collection/ and leader is
http://10.159.8.105:8983/solr/collection/
INFO  - 2013-10-25 12:19:40.774; org.apache.solr.cloud.ZkController; No
LogReplay needed for core=collection baseURL=http://10.159.6.73:8983/solr
INFO  - 2013-10-25 12:19:40.774; org.apache.solr.cloud.ZkController; Core
needs to recover:collection
INFO  - 2013-10-25 12:19:40.774;
org.apache.solr.update.DefaultSolrCoreState; Running recovery - first
canceling any ongoing recovery
INFO  - 2013-10-25 12:19:40.778; org.apache.solr.cloud.RecoveryStrategy;
Starting recovery process.  core=collection recoveringAfterStartup=true
...
ERROR - 2013-10-25 12:20:25.281; org.apache.solr.common.SolrException;
Error while trying to recover.
core=collection:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
I was asked to wait on state recovering for 10.159.6.73:8983_solr but I
still do not see the requested state. I see state: down live:true
...
ERROR - 2013-10-25 12:20:25.281; org.apache.solr.cloud.RecoveryStrategy;
Recovery failed - trying again... (5) core=collection
ERROR - 2013-10-25 12:20:25.281; org.apache.solr.common.SolrException;
Recovery failed - interrupted. core=collection
ERROR - 2013-10-25 12:20:25.282; org.apache.solr.common.SolrException;
Recovery failed - I give up. core=collection
INFO  - 2013-10-25 12:20:25.282; org.apache.solr.cloud.ZkController;
publishing core=collection state=recovery_failed

Here is the Leader's logging:

INFO  - 2013-10-25 12:19:40.883;
org.apache.solr.handler.admin.CoreAdminHandler; Going to wait for
coreNodeName: 10.159.6.73:8983_solr_collection, state: recovering,
checkLive: true, onlyIfLeader: true
INFO  - 2013-10-25 12:19:55.886;
org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from
ZooKeeper...
ERROR - 2013-10-25 12:20:25.277; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: I was asked to wait on state
recovering for 10.159.6.73:8983_solr but I still do not see the requested
state. I see state: down live:true
(repeats every minute)

Is it valid to simply delete the 'data' directory, or does a znode have to
be modified, too?
What's the right way to reinitialize and re-synch a core?

Peter


Re: Solr timeout after reboot

2013-10-21 Thread Peter Keegan
Have you tried this old trick to warm the FS cache?
cat .../core/data/index/* > /dev/null

Peter


On Mon, Oct 21, 2013 at 5:31 AM, michael.boom my_sky...@yahoo.com wrote:

 Thank you, Otis!

 I've integrated the SPM on my Solr instances and now I have access to
 monitoring data.
 Could you give me some hints on which metrics should I watch?

 Below I've added my query configs. Is there anything I could tweak here?

  <query>
  <maxBooleanClauses>1024</maxBooleanClauses>

  <filterCache class="solr.FastLRUCache"
   size="1000"
   initialSize="1000"
   autowarmCount="0"/>

  <queryResultCache class="solr.LRUCache"
   size="1000"
   initialSize="1000"
   autowarmCount="0"/>

  <documentCache class="solr.LRUCache"
     size="1000"
     initialSize="1000"
     autowarmCount="0"/>


  <fieldValueCache class="solr.FastLRUCache"
     size="1000"
     initialSize="1000"
     autowarmCount="0" />


  <enableLazyFieldLoading>true</enableLazyFieldLoading>

  <queryResultWindowSize>20</queryResultWindowSize>

  <queryResultMaxDocsCached>100</queryResultMaxDocsCached>

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">active:true</str>
      </lst>
    </arr>
  </listener>

  <useColdSearcher>false</useColdSearcher>

  <maxWarmingSearchers>10</maxWarmingSearchers>

  </query>



 -
 Thanks,
 Michael
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408p4096780.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr timeout after reboot

2013-10-21 Thread Peter Keegan
I found this warming to be especially necessary after starting an instance
of those m3.xlarge servers, else the response times for the first minutes
were terrible.

Peter


On Mon, Oct 21, 2013 at 8:39 AM, François Schiettecatte 
fschietteca...@gmail.com wrote:

 To put the file data into file system cache which would make for faster
 access.

 François


 On Oct 21, 2013, at 8:33 AM, michael.boom my_sky...@yahoo.com wrote:

  Hmm, no, I haven't...
 
  What would be the effect of this ?
 
 
 
  -
  Thanks,
  Michael
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408p4096809.html
  Sent from the Solr - User mailing list archive at Nabble.com.




Re: limiting deep pagination

2013-10-17 Thread Peter Keegan
Yes, right now this constraint could be implemented in either the web app
or Solr. I see now that many of the QTimes on these queries are 10 ms
(probably due to caching), so I'm a bit less concerned.
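
For the Solr-side option, a rough sketch of a custom SearchComponent that enforces such a cap is below (hypothetical names and limit, roughly Solr 4.x-era APIs; on some versions you may also need the getSource() method shown):

import java.io.IOException;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Hypothetical component that rejects requests which page too deep.
public class PaginationLimitComponent extends SearchComponent {
  private static final int MAX_WINDOW = 10000;   // assumed limit

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    int start = params.getInt(CommonParams.START, 0);
    int rows = params.getInt(CommonParams.ROWS, 10);
    if (start + rows > MAX_WINDOW) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "start + rows must not exceed " + MAX_WINDOW);
    }
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do at process time; the check happens in prepare()
  }

  @Override
  public String getDescription() {
    return "Rejects requests that page too deep";
  }

  // abstract in some 4.x releases, removed in later Solr versions
  public String getSource() {
    return null;
  }
}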


On Wed, Oct 16, 2013 at 2:13 AM, Furkan KAMACI furkankam...@gmail.comwrote:

 I just wonder: don't you implement a custom API that interacts with
 Solr and limits such kinds of requests? (I know that you are asking about
 how to do that in Solr, but I handle such situations in my custom search
 APIs and want to learn what fellows do.)


 On Wednesday, October 9, 2013, Michael Sokolov 
 msoko...@safaribooksonline.com wrote:
  On 10/8/13 6:51 PM, Peter Keegan wrote:
 
  Is there a way to configure Solr 'defaults/appends/invariants' such that
  the product of the 'start' and 'rows' parameters doesn't exceed a given
  value? This would be to prevent deep pagination.  Or would this require
 a
  custom requestHandler?
 
  Peter
 
  Just wondering -- isn't it the sum that you should be concerned about
 rather than the product?  Actually I think what we usually do is limit both
 independently, with slightly different concerns, since e.g. start=1,
 rows=1000 causes memory problems if you have large fields in your results,
 whereas start=1000, rows=1 may not actually be a problem
 
  -Mike
 



limiting deep pagination

2013-10-08 Thread Peter Keegan
Is there a way to configure Solr 'defaults/appends/invariants' such that
the product of the 'start' and 'rows' parameters doesn't exceed a given
value? This would be to prevent deep pagination.  Or would this require a
custom requestHandler?

Peter


Re: How to get values of external file field(s) in Solr query?

2013-10-03 Thread Peter Keegan
In 4.3, frange query using an external file works for both q and fq. The
Solr wiki and SIA both state that ExternalFileField does not support
searching. Was the search/filter capability added recently, or is it not
supported?

Thanks,
Peter



On Wed, Jun 26, 2013 at 4:59 PM, Upayavira u...@odoko.co.uk wrote:

 The only way is using a frange (function range) query:

 q={!frange l=0 u=10}my_external_field

 Will pull out documents that have your external field with a value
 between zero and 10.

 Upayavira

 On Wed, Jun 26, 2013, at 09:02 PM, Arun Rangarajan wrote:
 
 http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes
  says
  this about external file fields:
  They can be used only for function queries or display.
  I understand how to use them in function queries, but how do I retrieve
  the
  values for display?
 
  If I want to fetch only the values of a single external file field for a
  set of primary keys, I can do:
   q=_val_:EXT_FILE_FIELD&fq=id:(doc1 doc2 doc3)&fl=id,score
  For this query, the score is the value of the external file field.
 
  But how to get the values for docs that match some arbitrary query? Is
  there a syntax trick that will work where the value of the ext file field
  does not affect the score of the main query, but I can still retrieve its
  value?
 
  Also is it possible to retrieve the values of more than one external file
  field in a single query?



Re: Cross index join query performance

2013-09-30 Thread Peter Keegan
Ah, got it now - thanks for the explanation.


On Sat, Sep 28, 2013 at 3:33 AM, Upayavira u...@odoko.co.uk wrote:

 The thing here is to understand how a join works.

 Effectively, it does the inner query first, which results in a list of
 terms. It then effectively does a multi-term query with those values.

 q=size:large {!join fromIndex=other from=someid
 to=someotherid}type:shirt

 Imagine the inner join returned values A,B,C. Your inner query is, on
 core 'other', q=type:shirt&fl=someid.

 Then your outer query becomes size:large someotherid:(A B C)

 Your inner query returns 25k values. You're having to do a multi-term
 query for 25k terms. That is *bound* to be slow.

 The pseudo-joins in Solr 4.x are intended for a small to medium number
 of values returned by the inner query, otherwise performance degrades as
 you are seeing.

 Is there a way you can reduce the number of values returned by the inner
 query?

 As Joel mentions, those other joins are attempts to find other ways to
 work with this limitation.

 Upayavira

 On Fri, Sep 27, 2013, at 09:44 PM, Peter Keegan wrote:
  Hi Joel,
 
  I tried this patch and it is quite a bit faster. Using the same query on
  a
  larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
  QTime was 100 msec! This was true for large and small result sets.
 
  A few notes: the patch didn't compile with 4.3 because of the
  SolrCore.getLatestSchema call (which I worked around), and the package
  name
  should be:
  <queryParser name="hjoin"
  class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>
 
  Unfortunately, I just learned that our uniqueKey may have to be an
  alphanumeric string instead of an int, so I'm not out of the woods yet.
 
  Good stuff - thanks.
 
  Peter
 
 
  On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein joels...@gmail.com
  wrote:
 
   It looks like you are using int join keys so you may want to check out
   SOLR-4787, specifically the hjoin and bjoin.
  
   These perform well when you have a large number of results from the
   fromIndex. If you have a small number of results in the fromIndex the
   standard join will be faster.
  
  
   On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan peterlkee...@gmail.com
   wrote:
  
I forgot to mention - this is Solr 4.3
   
Peter
   
   
   
On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan 
 peterlkee...@gmail.com
wrote:
   
 I'm doing a cross-core join query and the join query is 30X slower
 than
 each of the 2 individual queries. Here are the queries:

 Main query:
 http://localhost:8983/solr/mainindex/select?q=title:java
 QTime: 5 msec
 hit count: 1000

 Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO
0.3]
 QTime: 4 msec
 hit count: 25K

 Join query:

   
  
 http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1
  TO 0.3]
 QTime: 160 msec
 hit count: 205

 Here are the index spec's:

 mainindex size: 117K docs, 1 segment
 mainindex schema:
 <field name="docid" type="int" indexed="true" stored="true"
  required="true" multiValued="false" />
 <field name="title" type="text_en_splitting" indexed="true"
  stored="true" multiValued="false" />
 <uniqueKey>docid</uniqueKey>

  subindex size: 117K docs, 1 segment
  subindex schema:
 <field name="docid" type="int" indexed="true" stored="true"
  required="true" multiValued="false" />
 <field name="fld1" type="float" indexed="true" stored="true"
  required="false" multiValued="false" />
 <uniqueKey>docid</uniqueKey>

 With debugQuery=true I see:
   debug:{
 join:{
   {!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO
   0.3]:{
 time:155,
 fromSetSize:24742,
 toSetSize:24742,
 fromTermCount:117810,
 fromTermTotalDf:117810,
 fromTermDirectCount:117810,
 fromTermHits:24742,
 fromTermHitsTotalDf:24742,
 toTermHits:24742,
 toTermHitsTotalDf:24742,
 toTermDirectCount:24627,
 smallSetsDeferred:115,
 toSetDocsAdded:24742}},

 Via profiler and debugger, I see 150 msec spent in the outer
 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This
 seems
like a
 lot of time to join the bitsets. Does this seem right?

 Peter


   
  
  
  
   --
   Joel Bernstein
   Professional Services LucidWorks
  



Re: Cross index join query performance

2013-09-27 Thread Peter Keegan
Hi Joel,

I tried this patch and it is quite a bit faster. Using the same query on a
larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
QTime was 100 msec! This was true for large and small result sets.

A few notes: the patch didn't compile with 4.3 because of the
SolrCore.getLatestSchema call (which I worked around), and the package name
should be:
<queryParser name="hjoin"
class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>

Unfortunately, I just learned that our uniqueKey may have to be an
alphanumeric string instead of an int, so I'm not out of the woods yet.

Good stuff - thanks.

Peter


On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein joels...@gmail.com wrote:

 It looks like you are using int join keys so you may want to check out
 SOLR-4787, specifically the hjoin and bjoin.

 These perform well when you have a large number of results from the
 fromIndex. If you have a small number of results in the fromIndex the
 standard join will be faster.


 On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan peterlkee...@gmail.com
 wrote:

  I forgot to mention - this is Solr 4.3
 
  Peter
 
 
 
  On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
 
   I'm doing a cross-core join query and the join query is 30X slower than
   each of the 2 individual queries. Here are the queries:
  
   Main query: http://localhost:8983/solr/mainindex/select?q=title:java
   QTime: 5 msec
   hit count: 1000
  
   Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO
  0.3]
   QTime: 4 msec
   hit count: 25K
  
   Join query:
  
 
 http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid
  to=docid}fld1:[0.1 TO 0.3]
   QTime: 160 msec
   hit count: 205
  
   Here are the index spec's:
  
   mainindex size: 117K docs, 1 segment
   mainindex schema:
  <field name="docid" type="int" indexed="true" stored="true"
   required="true" multiValued="false" />
  <field name="title" type="text_en_splitting" indexed="true"
   stored="true" multiValued="false" />
  <uniqueKey>docid</uniqueKey>
  
   subindex size: 117K docs, 1 segment
   subindex schema:
  <field name="docid" type="int" indexed="true" stored="true"
   required="true" multiValued="false" />
  <field name="fld1" type="float" indexed="true" stored="true"
   required="false" multiValued="false" />
  <uniqueKey>docid</uniqueKey>
  
   With debugQuery=true I see:
 debug:{
   join:{
 {!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO
 0.3]:{
   time:155,
   fromSetSize:24742,
   toSetSize:24742,
   fromTermCount:117810,
   fromTermTotalDf:117810,
   fromTermDirectCount:117810,
   fromTermHits:24742,
   fromTermHitsTotalDf:24742,
   toTermHits:24742,
   toTermHitsTotalDf:24742,
   toTermDirectCount:24627,
   smallSetsDeferred:115,
   toSetDocsAdded:24742}},
  
   Via profiler and debugger, I see 150 msec spent in the outer
   'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems
  like a
   lot of time to join the bitsets. Does this seem right?
  
   Peter
  
  
 



 --
 Joel Bernstein
 Professional Services LucidWorks



Cross index join query performance

2013-09-25 Thread Peter Keegan
I'm doing a cross-core join query and the join query is 30X slower than
each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java
QTime: 5 msec
hit count: 1000

Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
QTime: 4 msec
hit count: 25K

Join query:
http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex
toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]
QTime: 160 msec
hit count: 205

Here are the index spec's:

mainindex size: 117K docs, 1 segment
mainindex schema:
   <field name="docid" type="int" indexed="true" stored="true"
required="true" multiValued="false" />
   <field name="title" type="text_en_splitting" indexed="true"
stored="true" multiValued="false" />
   <uniqueKey>docid</uniqueKey>

subindex size: 117K docs, 1 segment
subindex schema:
   <field name="docid" type="int" indexed="true" stored="true"
required="true" multiValued="false" />
   <field name="fld1" type="float" indexed="true" stored="true"
required="false" multiValued="false" />
   <uniqueKey>docid</uniqueKey>

With debugQuery=true I see:
  debug:{
join:{
  {!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]:{
time:155,
fromSetSize:24742,
toSetSize:24742,
fromTermCount:117810,
fromTermTotalDf:117810,
fromTermDirectCount:117810,
fromTermHits:24742,
fromTermHitsTotalDf:24742,
toTermHits:24742,
toTermHitsTotalDf:24742,
toTermDirectCount:24627,
smallSetsDeferred:115,
toSetDocsAdded:24742}},

Via profiler and debugger, I see 150 msec spent in the outer
'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems like a
lot of time to join the bitsets. Does this seem right?

Peter


Re: A question about attaching shards to load balancers

2013-01-30 Thread Peter Keegan
Aren't you concerned about having a single point of failure with this setup?

On Wed, Jan 30, 2013 at 10:38 AM, Michael Ryan mr...@moreover.com wrote:

 From a performance point of view, I can't imagine it mattering. In our
 setup, we have a dedicated Solr server that is not a shard that takes
 incoming requests (we call it the coordinator). This server is very
 lightweight and practically has no load at all.

 My gut feeling is that having a separate dedicated server might be a
 slightly better approach, as it will have totally different performance
 characteristics than the shards, and so you can tune it for this.

 -Michael



Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-25 Thread Peter Keegan
Yes #5 is the same thing (sorry, I didn't read them all thoroughly). Your
description of the phrases being 'tags' suggests that you don't need term
positions for matching, and as you noted, you would get unwanted partial
matches. And, the TermQuerys would be much faster.

Peter


On Wed, Oct 24, 2012 at 8:33 PM, Aaron Daubman daub...@gmail.com wrote:

 Hi Peter,

 Thanks for the recommendation - I believe we are thinking along the
 same lines, but wanted to check to make sure. Are you suggesting
 something different than my #5 (below) or are we essentially
 suggesting the same thing?

 On Wed, Oct 24, 2012 at 1:20 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
  Could you index your 'phrase tags' as single tokens? Then your phrase
  queries become simple TermQuerys.

 
  5) *This is my current favorite*: stop tokenizing/analyzing these
  terms and just use KeywordTokenizer. Most of these phrases are
  pre-vetted, and it may be possible to clean/process any others before
  creating the docs. My main worry here is that, currently, if I
  understand correctly, a document with the phrase brazilian pop would
  still be returned as a match to a seed document containing only the
  phrase brazilian (not the other way around, but that is not
  necessary), however, with KeywordTokenizer, this would no longer be
  the case. If I switched from the current dubious tokenize/stem/etc...
  and just used Keyword, would this allow queries like this used to be
  a long phrase query to match documents that have this used to be a
  long phrase query as one of the multivalued values in the field
  without having to pull term positions? (and thus significantly speed
  up performance).
 

 Thanks again,
  Aaron



Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Peter Keegan
Could you index your 'phrase tags' as single tokens? Then your phrase
queries become simple TermQuerys.
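
A hedged sketch of the idea, using pre-Lucene 5 query classes; the field name and the second tag value below are made up for illustration ("brazilian pop" is from the thread). Once each multi-word tag is indexed as a single token, the MoreLikeThis-style second query is just an OR of TermQuerys and needs no term positions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Illustrative only: with KeywordTokenizer each multi-word tag is one term.
public class TagQueryExample {
  public static Query buildTagQuery() {
    BooleanQuery bq = new BooleanQuery();   // pre-Lucene 5 API
    bq.add(new TermQuery(new Term("tags", "brazilian pop")), BooleanClause.Occur.SHOULD);
    bq.add(new TermQuery(new Term("tags", "female vocalists")), BooleanClause.Occur.SHOULD);
    return bq;
  }
}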

On Wed, Oct 24, 2012 at 12:26 PM, Robert Muir rcm...@gmail.com wrote:

 On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman daub...@gmail.com wrote:
  Greetings,
 
  We have a solr instance in use that gets some perhaps atypical queries
  and suffers from poor (2 second) QTimes.
 
  Documents (~2,350,000) in this instance are mainly comprised of
  various descriptive fields, such as multi-word (phrase) tags - an
  average document contains 200-400 phrases like this across several
  different multi-valued field types.
 
  A custom QueryComponent has been built that functions somewhat like a
  very specific MoreLikeThis. A seed document is specified via the
  incoming query, its terms are retrieved, boosted both by query
  parameters as well as fields within the document that specify term
  weighting, sorted by this custom boosting, and then a second query is
  crafted by taking the top 200 (sorted by the custom boosting)
  resulting field values paired with their fields and searching for
  documents matching these 200 values.

 a few more ideas:
 * use shingles e.g. to turn two-word phrases into single terms (how
 long is your average phrase?).
 * in addition to the above, maybe for phrases with > 2 terms, consider
 just a boolean conjunction of the shingled phrases instead of a real
 phrase query: e.g. more like this -> (more_like AND like_this). This
 would have some false positives.
 * use a more aggressive stopwords list for your MorePhrasesLikeThis.
 * reduce this number 200, and instead work harder to prune out which
 phrases are the most descriptive from the seed document, e.g. based
 on some heuristics like their frequency or location within that seed
 document, so your query isnt so massive.



Re: Anyone using mmseg analyzer in solr multi core?

2012-10-09 Thread Peter Keegan
We're using MMSeg with Lucene, but not Solr. Since each SolrCore is
independent, I'm not sure how you can avoid each having a copy of the
dictionary, unless you modified MMSeg to use shared memory. Or, maybe I'm
missing something.

On Mon, Oct 8, 2012 at 3:37 AM, liyun liyun2...@corp.netease.com wrote:

 Hi all,
 Is anybody using the mmseg analyzer for Chinese word analysis? When we use this
 in solr multi-core, I find it will load the dictionary per core and each
 core costs about 50MB of memory. I think this is a big waste when our JVM has
 only 1GB of memory… Does anyone have a good idea for handling this trouble?

 2012-10-08



 Li Yun
 Software Engineer @ Netease
 Mail: liyun2...@corp.netease.com
 MSN: rockiee...@gmail.com


Re: How to plug a new ANTLR grammar

2011-09-14 Thread Peter Keegan
Also, a question for Peter, at which stage do you use lucene analyzers
on the query? After it was parsed into the tree, or before we start
processing the query string?

I do the analysis before creating the tree. I'm pretty sure Lucene
QueryParser does this, too.

Peter

On Wed, Sep 14, 2011 at 5:15 AM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi Peter,

 Yes, with the tree it is pretty straightforward. I'd prefer to do it
 that way, but what is the purpose of the new qParser then? Is it just
 that the qParser was built with a different paradigms in mind where
 the parse tree was not in the equation? Anybody knows if there is any
 advantage?

 I looked bit more into the contrib

 org.apache.lucene.queryParser.standard.StandardQueryParser.java
 org.apache.lucene.queryParser.standard.QueryParserWrapper.java

 And some things there (like setting default fuzzy value) are in my
 case set directly in the grammar. So the query builder is still
 somehow involved in parsing (IMHO not good).

 But if someone knows some reasons to keep using the qParser, please
 let me know.

 Also, a question for Peter, at which stage do you use lucene analyzers
 on the query? After it was parsed into the tree, or before we start
 processing the query string?

 Thanks!

  Roman





 On Tue, Sep 13, 2011 at 10:14 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
  Roman,
 
  I'm not familiar with the contrib, but you can write your own Java code
 to
  create Query objects from the tree produced by your lexer and parser
  something like this:
 
  StandardLuceneGrammarLexer lexer = new StandardLuceneGrammarLexer(new
  ANTLRReaderStream(new StringReader(queryString)));
  CommonTokenStream tokens = new CommonTokenStream(lexer);
  StandardLuceneGrammarParser parser = new
  StandardLuceneGrammarParser(tokens);
  StandardLuceneGrammarParser.query_return ret = parser.mainQ();
  CommonTree t = (CommonTree) ret.getTree();
  parseTree(t);
 
  parseTree (Tree t) {
 
  // recursively parse the Tree, visit each node
 
visit (node);
 
  }
 
  visit (Tree node) {
 
  switch (node.getType()) {
  case StandardLuceneGrammarParser.AND:
  // Create BooleanQuery, push onto stack
  ...
  }
  }
 
  I use the stack to build up the final Query from the queries produced in
 the
  tree parsing.
 
  Hope this helps.
  Peter
 
 
  On Tue, Sep 13, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote:
 
  I'd love to see the progress on this.
 
  On Tue, Sep 13, 2011 at 10:34 AM, Roman Chyla roman.ch...@gmail.com
  wrote:
 
   Hi,
  
   The standard lucene/solr parsing is nice but not really flexible. I
   saw questions and discussion about ANTLR, but unfortunately never a
   working grammar, so... maybe you find this useful:
  
  
 
 https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr
  
   In the grammar, the parsing is completely abstracted from the Lucene
   objects, and the parser is not mixed with Java code. At first it
   produces structures like this:
  
  
 
 https://svnweb.cern.ch/trac/rcarepo/raw-attachment/wiki/MontySolrQueryParser/index.html
  
   But now I have a problem. I don't know if I should use query parsing
   framework in contrib.
  
   It seems that the qParser in contrib can use different parser
   generators (the default JavaCC, but also ANTLR). But I am confused and
   I don't understand this new queryParser from contrib. It is really
   very confusing to me. Is there any benefit in trying to plug the ANTLR
   tree into it? Because looking at the AST pictures, it seems that with
   a relatively simple tree walker we could build the same queries as the
   current standard lucene query parser. And it would be much simpler and
   flexible. Does it bring something new? I have a feeling I miss
   something...
  
   Many thanks for help,
  
Roman
  
 
 
 
  --
  - sent from my mobile
  6176064373
 
 



Re: How to plug a new ANTLR grammar

2011-09-13 Thread Peter Keegan
Roman,

I'm not familiar with the contrib, but you can write your own Java code to
create Query objects from the tree produced by your lexer and parser
something like this:

StandardLuceneGrammarLexer lexer = new StandardLuceneGrammarLexer(new
ANTLRReaderStream(new StringReader(queryString)));
CommonTokenStream tokens = new CommonTokenStream(lexer);
StandardLuceneGrammarParser parser = new
StandardLuceneGrammarParser(tokens);
StandardLuceneGrammarParser.query_return ret = parser.mainQ();
CommonTree t = (CommonTree) ret.getTree();
parseTree(t);

parseTree (Tree t) {

// recursively parse the Tree, visit each node

   visit (node);

}

visit (Tree node) {

switch (node.getType()) {
case StandardLuceneGrammarParser.AND:
// Create BooleanQuery, push onto stack
...
}
}

I use the stack to build up the final Query from the queries produced in the
tree parsing.
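
A rough sketch of that stack-based assembly, illustrative only: it uses the pre-Lucene 5 BooleanQuery API, and the node-type constants are whatever the generated StandardLuceneGrammarParser defines (assumed to be on the classpath in the same package).

import java.util.ArrayDeque;
import java.util.Deque;

import org.antlr.runtime.tree.Tree;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// Illustrative walker: children push their Query objects, parent nodes pop
// and wrap them, so the root of the tree leaves one Query on the stack.
public class QueryTreeBuilder {
  private final Deque<Query> stack = new ArrayDeque<Query>();

  public Query build(Tree root) {
    visit(root);
    return stack.pop();
  }

  private void visit(Tree node) {
    switch (node.getType()) {
      case StandardLuceneGrammarParser.AND: {
        for (int i = 0; i < node.getChildCount(); i++) {
          visit(node.getChild(i));              // each child pushes a Query
        }
        BooleanQuery bq = new BooleanQuery();   // pre-Lucene 5 API
        for (int i = 0; i < node.getChildCount(); i++) {
          bq.add(stack.pop(), BooleanClause.Occur.MUST);
        }
        stack.push(bq);
        break;
      }
      // ... other node types: OR, NOT, TERM (push a TermQuery), etc.
      default:
        break;
    }
  }
}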

Hope this helps.
Peter


On Tue, Sep 13, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote:

 I'd love to see the progress on this.

 On Tue, Sep 13, 2011 at 10:34 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hi,
 
  The standard lucene/solr parsing is nice but not really flexible. I
  saw questions and discussion about ANTLR, but unfortunately never a
  working grammar, so... maybe you find this useful:
 
 
 https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr
 
  In the grammar, the parsing is completely abstracted from the Lucene
  objects, and the parser is not mixed with Java code. At first it
  produces structures like this:
 
 
 https://svnweb.cern.ch/trac/rcarepo/raw-attachment/wiki/MontySolrQueryParser/index.html
 
  But now I have a problem. I don't know if I should use query parsing
  framework in contrib.
 
  It seems that the qParser in contrib can use different parser
  generators (the default JavaCC, but also ANTLR). But I am confused and
  I don't understand this new queryParser from contrib. It is really
  very confusing to me. Is there any benefit in trying to plug the ANTLR
  tree into it? Because looking at the AST pictures, it seems that with
  a relatively simple tree walker we could build the same queries as the
  current standard lucene query parser. And it would be much simpler and
  flexible. Does it bring something new? I have a feeling I miss
  something...
 
  Many thanks for help,
 
   Roman
 



 --
 - sent from my mobile
 6176064373



Re: performance crossover between single index and sharding

2011-08-04 Thread Peter Keegan
We have 16 shards on 4 physical servers. Shard size was determined by
measuring query response times as a function of doc count. Multiple shards
per server provides parallelism. In a VM environment, I would lean towards 1
shard per VM (with 1/4 the RAM). We implemented our own distributed search
(pre-Solr) and the extra sort/merge processing is not a performance issue.

Peter


On Tue, Aug 2, 2011 at 2:35 PM, Burton-West, Tom tburt...@umich.edu wrote:

 Hi Jonothan and Markus,

 Why 3 shards on one machine instead of one larger shard per machine?

 Good question!

 We made this architectural decision several years ago and I'm not
 remembering the rationale at the moment. I believe we originally made the
 decision due to some tests showing a sweetspot for I/O performance for
 shards with 500,000-600,000 documents, but those tests were made before we
 implemented CommonGrams and when we were still using attached storage.  I
 think we also might have had concerns about Java OOM errors with a really
 large shard/index, but we now know that we can keep memory usage under
 control by tweaking the amount of the terms index that gets read into
 memory.

 We should probably do some tests and revisit the question.

 The reason we don't have 12 shards on 12 machines is that current
 performance is good enough that we can't justify buying 8 more machines:)

 Tom

 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Tuesday, August 02, 2011 2:12 PM
 To: solr-user@lucene.apache.org
 Subject: Re: performance crossover between single index and sharding

 Hi Tom,

 Very interesting indeed! But i keep wondering why some engineers choose to
 store multiple shards of the same index on the same machine, there must be
 significant overhead. The only reason i can think of is ease of maintenance
 in
 moving shards to a separate physical machine.
 I know that rearranging the shard topology can be a real pain in a large
 existing cluster (e.g. consistent hashing is not consistent anymore and
 having
 to shuffle docs to their new shard), is this the reason you choose this
 approach?

 Cheers,
 bble.com.



Re: Localized alphabetical order

2011-04-22 Thread Peter Keegan
On Fri, Apr 22, 2011 at 12:33 PM, Ben Preece preec...@umn.edu wrote:

 As someone who's new to Solr/Lucene, I'm having trouble finding information
 on sorting results in localized alphabetical order. I've ineffectively
 searched the wiki and the mail archives.

 I'm thinking for example about Hawai'ian, where mīka (with an i-macron)
 comes after mika (i without the macron) but before miki (also without the
 macron), or about Welsh, where the digraphs (ch, dd, etc.) are treated as
 single letters, or about Ojibwe, where the apostrophe ' is a letter which
 sorts between h and i.

 How do non-English languages typically handle this?

 -Ben



Re: Info about Debugging SOLR in Eclipse

2011-03-17 Thread Peter Keegan
Can you use jetty?
http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse

On Thu, Mar 17, 2011 at 12:17 PM, Geeta Subramanian 
gsubraman...@commvault.com wrote:

 Hi,

 Can someone please let me know the steps for how I can debug the solr code in
 my eclipse?

 I tried to compile the source, use the jars and place them in tomcat where I am
 running solr, and do remote debugging, but it did not stop at any break
 point.
 I also tried to write a sample standalone java class to push the document.
 But I stopped at SolrJ classes and not Solr server classes.


 Please let me know if I am making any mistake.

 Regards,
 Geeta
















Re: Info about Debugging SOLR in Eclipse

2011-03-17 Thread Peter Keegan
The instructions refer to the 'Run configuration' menu. Did you try 'Debug
configurations'?


On Thu, Mar 17, 2011 at 3:27 PM, Peter Keegan peterlkee...@gmail.comwrote:

 Can you use jetty?


 http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse

 On Thu, Mar 17, 2011 at 12:17 PM, Geeta Subramanian 
 gsubraman...@commvault.com wrote:

 Hi,

  Can someone please let me know the steps for how I can debug the solr code in
  my eclipse?

  I tried to compile the source, use the jars and place them in tomcat where I am
  running solr, and do remote debugging, but it did not stop at any break
  point.
  I also tried to write a sample standalone java class to push the document.
  But I stopped at SolrJ classes and not Solr server classes.


 Please let me know if I am making any mistake.

 Regards,
 Geeta


















CapitalizationFilter

2010-12-29 Thread Peter Keegan
I was looking at 'CapitalizationFilter' and noticed that the
'incrementToken' method splits words at ' ' (space) and '.' (period). I'm
curious as to why the period is treated as a word separator? This could
cause unexpected results, for example:

Hello There My Name Is Dr. Watson --- Hello There My Name Is Dr. watson


Peter


Re: Does anyone notice this site?

2010-10-25 Thread Peter Keegan
fwiw, our proxy server has blocked this site for malicious content.

Peter

On Mon, Oct 25, 2010 at 1:25 PM, Grant Ingersoll gsing...@apache.orgwrote:


 On Oct 25, 2010, at 12:54 PM, scott chu wrote:

  I happened to bump into this site: http://www.solr.biz/
 
  They said they are also developing a search engine? Does this have any
 connection to open source Solr?


 No, there is no connection, and they likely should not be using the name
 that way, as Solr is a TM of the ASF.




LuceneRevolution - NoSQL: A comparison

2010-10-11 Thread Peter Keegan
I listened with great interest to Grant's presentation of the NoSQL
comparisons/alternatives to Solr/Lucene. It sounds like the jury is still
out on much of this. Here's a use case that might favor using a NoSQL
alternative for storing 'stored fields' outside of Lucene.

When Solr does a distributed search across shards, it does this in 2 phases
(correct me if I'm wrong):

1. 1st query to get the docIds and facet counts
2. 2nd query to retrieve the stored fields of the top hits

The problem here is that the index could change between (1) and (2), so it's
not an atomic transaction. If the stored fields were kept outside of Lucene,
only the first query would be necessary. However, this would mean that the
external NoSQL data store would have to be synchronized with the Lucene
index, which might present its own problems. (I'm just throwing this out for
discussion)

Peter


Re: Range queries

2009-06-16 Thread Peter Keegan
How about this: x:[5 TO 8] AND x:{0 TO 8}

On Tue, Jun 16, 2009 at 1:16 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:


 Hi,

 I think the square brackets/curly braces need to be balanced, so this is
 currently not doable with existing query parsers.

  Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
  From: gwk g...@eyefi.nl
  To: solr-user@lucene.apache.org
  Sent: Tuesday, June 16, 2009 11:52:12 AM
  Subject: Range queries
 
  Hi,
 
  When doing range queries it seems the query is either x:[5 TO 8] which
 means 5
   <= x <= 8 or x:{5 TO 8} which means 5 < x < 8. But how do you get one
  half
   exclusive, the other inclusive for double fields, the following: 5 <= x <
 8? Is
  this possible?
 
  Regards,
 
  gwk




Re: new faceting algorithm

2008-12-05 Thread Peter Keegan
Hi Yonik,

May I ask in which class(es) this improvement was made? I've been using the
DocSet, DocList, BitDocSet, HashDocSet from Solr from a few years ago with a
Lucene based app. to do faceting.

Thanks,
Peter


On Mon, Nov 24, 2008 at 11:12 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 A new faceting algorithm has been committed to the development version
 of Solr, and should be available in the next nightly test build (will
 be dated 11-25).  This change should generally improve field faceting
 where the field has many unique values but relatively few values per
 document.  This new algorithm is now the default for multi-valued
 fields (including tokenized fields) so you shouldn't have to do
 anything to enable it.  We'd love some feedback on how it works to
 ensure that it actually is a win for the majority and should be the
 default.

 -Yonik


