Re: Solr hangs on distributed updates
A distributed update is streamed to all available replicas in parallel.

Hmm, that's not what I'm seeing with 4.6.1, as I tail the logs on leader and replicas. Mark Miller commented on this last May: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201404.mbox/%3CetPan.534d8d6d.74b0dc51.13a79@airmetal.local%3E

On Mon, Dec 15, 2014 at 8:11 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Mon, Dec 15, 2014 at 8:41 PM, Peter Keegan peterlkee...@gmail.com wrote: If a timeout occurs, does the distributed update then go to the next replica?

A distributed update is streamed to all available replicas in parallel.

On Fri, Dec 12, 2014 at 3:42 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Sorry, I should have specified. These timeouts go inside the solrcloud section and apply to inter-shard update requests only. The socket and connection timeouts inside the shardHandlerFactory section apply to inter-shard search requests.

On Fri, Dec 12, 2014 at 8:38 PM, Peter Keegan peterlkee...@gmail.com wrote: Btw, are the following timeouts still supported in solr.xml, and do they only apply to distributed search?

<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
  <int name="socketTimeout">${socketTimeout:0}</int>
  <int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>

Thanks, Peter

On Fri, Dec 12, 2014 at 3:14 PM, Peter Keegan peterlkee...@gmail.com wrote: No, I wasn't aware of these. I will give that a try. If I stop the Solr jetty service manually, things recover fine, but the hang occurs when I 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a 15-sec timeout from the stopped node, and expires the session, but the Solr leader never gets notified. This seems like a bug in ZK. Thanks, Peter

On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to reasonable values in your solr.xml? These are the timeouts used for inter-shard update requests.

On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com wrote: We are running SolrCloud in AWS and using their auto scaling groups to spin up new Solr replicas when CPU utilization exceeds a threshold for a period of time. All is well until the replicas are terminated when CPU utilization falls below another threshold. What happens is that index updates sent to the Solr leader hang forever in both the Solr leader and the SolrJ client app. Searches work fine.
Re: Solr hangs on distributed updates
As of 4.10, commits/optimize etc. are executed in parallel.

Excellent - thanks.

On Tue, Dec 16, 2014 at 6:51 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Tue, Dec 16, 2014 at 11:34 AM, Peter Keegan peterlkee...@gmail.com wrote: A distributed update is streamed to all available replicas in parallel. Hmm, that's not what I'm seeing with 4.6.1, as I tail the logs on leader and replicas. Mark Miller commented on this last May: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201404.mbox/%3CetPan.534d8d6d.74b0dc51.13a79@airmetal.local%3E

Yes, sorry I didn't notice that you are on 4.6.1. This was changed in 4.10 with https://issues.apache.org/jira/browse/SOLR-6264 As of 4.10, commits/optimize etc. are executed in parallel.
Re: Solr hangs on distributed updates
I added distribUpdateConnTimeout and distribUpdateSoTimeout to solr.xml and the commit did time out. (Btw, is there any way to view solr.xml in the admin console?) Also, although we do have an init.d start/stop script for Solr, the 'stop' command was not executed during shutdown because there was no lock file for the script in '/var/lock/subsys'. I didn't know about this until I googled around and found http://www.redhat.com/magazine/008jun05/departments/tips_tricks. When I added the lock file, both the AWS 'stop' and 'terminate' actions resulted in an orderly shutdown of the replica, which caused the Solr leader to get an exception and update live_nodes gracefully. So now, the timeouts should only play a backup role. Thanks for the help, Peter

On Fri, Dec 12, 2014 at 5:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: No, I wasn't aware of these. I will give that a try. If I stop the Solr
: jetty service manually, things recover fine, but the hang occurs when I
: 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a

I don't know squat about AWS Auto-Scaling (and barely anything about AWS), but what you describe makes it sound like maybe your machine (i.e. AMI?) isn't really configured very well. Do you have some init.d/systemd type scripts to ensure a clean shutdown of Solr when the machine is shut down/rebooted? That seems like a pretty good idea in general (independent of whether you are using Auto-Scaling) and -- assuming AWS auto-scaling does clean OS shutdowns when terminating instances -- would probably solve your problem. It would help ensure you would never have to wait on the timeouts -- the nodes will each explicitly tell ZK they are going bye-bye.

If you do have things set up so that *manually* shutting down your instances executes a clean shutdown of Solr, but AWS Auto-Scaling is actually totally brutal and doesn't even do a clean shutdown of your virtual machines -- just yanks the virtual power cord -- perhaps you could implement one of these LifecycleHook options that popped up when I did some googling for AWS Auto-Scale termination, to explicitly do a clean shutdown of the Solr process before the machine vanishes into thin air?

-Hoss http://www.lucidworks.com/
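For anyone else setting these, here is a minimal sketch of the relevant solr.xml section (new-style solr.xml; the timeout values and fallback defaults after the colons are illustrative, not recommendations):

<solr>
  <solrcloud>
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
  </solrcloud>
</solr>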
Re: Solr hangs on distributed updates
If a timeout occurs, does the distributed update then go to the next replica?

On Fri, Dec 12, 2014 at 3:42 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Sorry, I should have specified. These timeouts go inside the solrcloud section and apply to inter-shard update requests only. The socket and connection timeouts inside the shardHandlerFactory section apply to inter-shard search requests.

On Fri, Dec 12, 2014 at 8:38 PM, Peter Keegan peterlkee...@gmail.com wrote: Btw, are the following timeouts still supported in solr.xml, and do they only apply to distributed search?

<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
  <int name="socketTimeout">${socketTimeout:0}</int>
  <int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>

Thanks, Peter
Solr hangs on distributed updates
We are running SolrCloud in AWS and using their auto scaling groups to spin up new Solr replicas when CPU utilization exceeds a threshold for a period of time. All is well until the replicas are terminated when CPU utilization falls below another threshold. What happens is that index updates sent to the Solr leader hang forever in both the Solr leader and the SolrJ client app. Searches work fine.

Here are 2 thread stack traces from the Solr leader and 2 from the client app:

1) Solr-leader thread doing a distributed commit:

Thread 23527: (state = IN_NATIVE)
 - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
 - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150 (Compiled frame)
 - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121 (Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71, line=166 (Compiled frame)
 - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90 (Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=137, line=281 (Compiled frame)
 - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=16, line=92 (Compiled frame)
 - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=2, line=61 (Compiled frame)
 - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254 (Compiled frame)
 - org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader() @bci=8, line=289 (Compiled frame)
 - org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader() @bci=1, line=252 (Compiled frame)
 - org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader() @bci=6, line=191 (Compiled frame)
 - org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=62, line=300 (Compiled frame)
 - org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=60, line=127 (Compiled frame)
 - org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest, org.apache.http.protocol.HttpContext) @bci=198, line=715 (Compiled frame)
 - org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=574, line=520 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=344, line=906 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest, org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest) @bci=6, line=784 (Compiled frame)
 - org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest, org.apache.solr.client.solrj.ResponseParser) @bci=1175, line=395 (Interpreted frame)
 - org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest) @bci=17, line=199 (Interpreted frame)
 - org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(org.apache.solr.client.solrj.SolrRequest) @bci=101, line=293 (Compiled frame)
 - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.update.SolrCmdDistributor$Req) @bci=127, line=226 (Interpreted frame)
 - org.apache.solr.update.SolrCmdDistributor.distribCommit(org.apache.solr.update.CommitUpdateCommand, java.util.List, org.apache.solr.common.params.ModifiableSolrParams) @bci=112, line=195 (Interpreted frame)
 - org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(org.apache.solr.update.CommitUpdateCommand) @bci=174, line=1250 (Interpreted frame)
 - org.apache.solr.update.processor.LogUpdateProcessor.processCommit(org.apache.solr.update.CommitUpdateCommand) @bci=61, line=157 (Interpreted frame)
 - org.apache.solr.handler.RequestHandlerUtils.handleCommit(org.apache.solr.request.SolrQueryRequest, org.apache.solr.update.processor.UpdateRequestProcessor, org.apache.solr.common.params.SolrParams, boolean) @bci=100, line=69 (Interpreted frame)
 - org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(org.apache.solr.request.SolrQueryRequest, org.apache.solr.response.SolrQueryResponse) @bci=60, line=68 (Compiled frame)
 - org.apache.solr.handler.RequestHandlerBase.handleRequest(org.apache.solr.request.SolrQueryRequest, org.apache.solr.response.SolrQueryResponse) @bci=43, line=135 (Compiled frame)
Re: Solr hangs on distributed updates
No, I wasn't aware of these. I will give that a try. If I stop the Solr jetty service manually, things recover fine, but the hang occurs when I 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a 15-sec timeout from the stopped node, and expires the session, but the Solr leader never gets notified. This seems like a bug in ZK. Thanks, Peter

On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to reasonable values in your solr.xml? These are the timeouts used for inter-shard update requests.

On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com wrote: We are running SolrCloud in AWS and using their auto scaling groups to spin up new Solr replicas when CPU utilization exceeds a threshold for a period of time. All is well until the replicas are terminated when CPU utilization falls below another threshold. What happens is that index updates sent to the Solr leader hang forever in both the Solr leader and the SolrJ client app. Searches work fine.
Re: Solr hangs on distributed updates
Btw, are the following timeouts still supported in solr.xml, and do they only apply to distributed search?

<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
  <int name="socketTimeout">${socketTimeout:0}</int>
  <int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>

Thanks, Peter

On Fri, Dec 12, 2014 at 3:14 PM, Peter Keegan peterlkee...@gmail.com wrote: No, I wasn't aware of these. I will give that a try. If I stop the Solr jetty service manually, things recover fine, but the hang occurs when I 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a 15-sec timeout from the stopped node, and expires the session, but the Solr leader never gets notified. This seems like a bug in ZK. Thanks, Peter

On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to reasonable values in your solr.xml? These are the timeouts used for inter-shard update requests.

On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com wrote: We are running SolrCloud in AWS and using their auto scaling groups to spin up new Solr replicas when CPU utilization exceeds a threshold for a period of time. All is well until the replicas are terminated when CPU utilization falls below another threshold. What happens is that index updates sent to the Solr leader hang forever in both the Solr leader and the SolrJ client app. Searches work fine.
Re: Solr hangs on distributed updates
The Solr leader should stop sending requests to the stopped replica once that replica's live node is removed from ZK (after session expiry).

Fwiw, here's the Zookeeper log entry for a graceful shutdown of the Solr replica:

2014-12-12 15:04:21,304 [myid:2] - INFO [ProcessThread(sid:2 cport:8181)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x34a1701a1df0037

And here are the Zookeeper log entries for a non-graceful shutdown via EC2 stop or terminate of the replica:

2014-12-12 14:19:22,000 [myid:2] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x14a1700c19c003f, timeout of 15000ms exceeded
2014-12-12 14:19:22,001 [myid:2] - INFO [ProcessThread(sid:2 cport:8181)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x14a1700c19c003f

There was no hang in the graceful shutdown. I'm running ZK version 3.4.5 and Solr 4.6.1. Peter

On Fri, Dec 12, 2014 at 3:39 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Okay, that should solve the hung threads on the leader. When you stop the jetty service, it is a graceful shutdown where existing requests finish before the searcher thread pool is shut down completely. An EC2 terminate probably just kills the processes, and the leader threads just wait due to a lack of read/connection timeouts. The Solr leader should stop sending requests to the stopped replica once that replica's live node is removed from ZK (after session expiry). I think most of these issues are because of the lack of timeouts. Just add them, and if there are more problems, we can discuss more.

On Fri, Dec 12, 2014 at 8:14 PM, Peter Keegan peterlkee...@gmail.com wrote: No, I wasn't aware of these. I will give that a try. If I stop the Solr jetty service manually, things recover fine, but the hang occurs when I 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a 15-sec timeout from the stopped node, and expires the session, but the Solr leader never gets notified. This seems like a bug in ZK. Thanks, Peter

On Fri, Dec 12, 2014 at 2:43 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Do you have distribUpdateConnTimeout and distribUpdateSoTimeout set to reasonable values in your solr.xml? These are the timeouts used for inter-shard update requests.

On Fri, Dec 12, 2014 at 2:20 PM, Peter Keegan peterlkee...@gmail.com wrote: We are running SolrCloud in AWS and using their auto scaling groups to spin up new Solr replicas when CPU utilization exceeds a threshold for a period of time. All is well until the replicas are terminated when CPU utilization falls below another threshold. What happens is that index updates sent to the Solr leader hang forever in both the Solr leader and the SolrJ client app. Searches work fine.
Re: Solr hangs on distributed updates
The AMIs are Red Hat (not Amazon's) and the instances are properly sized for the environment (t1.micro for ZK, m3.xlarge for Solr). I do plan to add hooks for a clean shutdown of Solr when the VM is shut down, but if Solr takes too long, AWS may clobber it anyway. One frustrating part of auto scaling shutdown is that you can't log into the 'vanishing machine' to view the logs. Peter

On Fri, Dec 12, 2014 at 5:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: No, I wasn't aware of these. I will give that a try. If I stop the Solr
: jetty service manually, things recover fine, but the hang occurs when I
: 'stop' or 'terminate' the EC2 instance. The Zookeeper leader reports a

I don't know squat about AWS Auto-Scaling (and barely anything about AWS), but what you describe makes it sound like maybe your machine (i.e. AMI?) isn't really configured very well. Do you have some init.d/systemd type scripts to ensure a clean shutdown of Solr when the machine is shut down/rebooted? That seems like a pretty good idea in general (independent of whether you are using Auto-Scaling) and -- assuming AWS auto-scaling does clean OS shutdowns when terminating instances -- would probably solve your problem. It would help ensure you would never have to wait on the timeouts -- the nodes will each explicitly tell ZK they are going bye-bye.

If you do have things set up so that *manually* shutting down your instances executes a clean shutdown of Solr, but AWS Auto-Scaling is actually totally brutal and doesn't even do a clean shutdown of your virtual machines -- just yanks the virtual power cord -- perhaps you could implement one of these LifecycleHook options that popped up when I did some googling for AWS Auto-Scale termination, to explicitly do a clean shutdown of the Solr process before the machine vanishes into thin air?

-Hoss http://www.lucidworks.com/
Solr exceptions during batch indexing
How are folks handling Solr exceptions that occur during batch indexing? Solr stops parsing the docs stream when an error occurs (e.g. a doc with a missing mandatory field), and stops indexing the batch. The bad document is not identified, so it would be hard for the client to recover by skipping over it. Peter
Re: Solr exceptions during batch indexing
I'm seeing 9X throughput with 1000 docs/batch vs 1 doc/batch, with a single thread, so it's certainly worth it. Thanks, Peter

On Fri, Nov 7, 2014 at 2:18 PM, Erick Erickson erickerick...@gmail.com wrote: And Walter has also been around for a _long_ time ;) (sorry, couldn't resist) Erick

On Fri, Nov 7, 2014 at 11:12 AM, Walter Underwood wun...@wunderwood.org wrote: Yes, I implemented exactly that fallback for Solr 1.2 at Netflix. It isn't too hard if the code is structured for it: retry with a batch size of 1. wunder

On Nov 7, 2014, at 11:01 AM, Erick Erickson erickerick...@gmail.com wrote: Yeah, this has been an ongoing issue for a _long_ time. Basically, you can't. So far, people have essentially written fallback logic to index the docs of a failing packet one at a time and report it. I'd really like better reporting back, but we haven't gotten there yet. Best, Erick

On Fri, Nov 7, 2014 at 8:25 AM, Peter Keegan peterlkee...@gmail.com wrote: How are folks handling Solr exceptions that occur during batch indexing? Solr stops parsing the docs stream when an error occurs (e.g. a doc with a missing mandatory field), and stops indexing the batch. The bad document is not identified, so it would be hard for the client to recover by skipping over it. Peter
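A rough sketch of the fallback logic Erick and Walter describe, against the SolrJ 4.x API (the SolrServer instance and the "id" field are assumed; error handling is simplified):

import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FallbackIndexer {

    private final SolrServer server;

    public FallbackIndexer(SolrServer server) {
        this.server = server;
    }

    // Try the whole batch first; if Solr rejects it, retry one doc at a
    // time so that only the genuinely bad documents are skipped.
    public void addBatch(List<SolrInputDocument> batch) throws Exception {
        try {
            server.add(batch);
        } catch (Exception batchFailed) {
            // Solr doesn't say which doc broke the batch, so isolate it here
            for (SolrInputDocument doc : batch) {
                try {
                    server.add(doc);
                } catch (Exception docFailed) {
                    System.err.println("Skipping bad doc "
                            + doc.getFieldValue("id") + ": " + docFailed.getMessage());
                }
            }
        }
    }
}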
Re: Ideas for debugging poor SolrCloud scalability
Regarding batch indexing: When I send batches of 1000 docs to a standalone Solr server, the log file reports (1000 adds) in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually (12 adds). Why do the batches appear to be broken up? Peter

On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson erickerick...@gmail.com wrote: NP, just making sure. I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations, if (1) you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. Sending docs one at a time is something of an anti-pattern. I usually start with batches of 1,000. And just to check: you're not issuing any commits from the client, right? Performance will be terrible if you issue commits after every doc; that's totally an anti-pattern. Doubly so for optimizes. Since you showed us your solrconfig autocommit settings I'm assuming not, but want to be sure. (2) Use a leader-aware client. I'm totally unfamiliar with Go, so I have no suggestions whatsoever to offer there. But you'll want to batch in this case too.

On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com wrote: Hi Erick - Thanks for the detailed response and apologies for my confusing terminology. I should have said WPS (writes per second) instead of QPS, but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote QPS I was referring to writes. It seems clear at this point that I should wrap up the code to do smart routing rather than choose Solr nodes randomly, and then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at a lower absolute level of performance than if you used a smart router). Nonetheless, I'm just theorycrafting here, so the better thing to do is just try it experimentally. I hope to have that working today - will report back on my findings. Cheers, - Ian

p.s. To clarify why we are rolling our own smart router code: we use Go over here rather than Java. Although if we still get bad performance with our custom Go router, I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation.

On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com wrote: I'm really confused:

bq: I am not issuing any queries, only writes (document inserts)
bq: It's clear that once the load test client has ~40 simulated users
bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right

QPS is usually used to mean Queries Per Second, which is different from the statement "I am not issuing any queries." And what do the number of users have to do with inserting documents? You also state: "In many cases, CPU on the solr servers is quite low as well."

So let's talk about indexing first. Indexing should scale nearly linearly as long as (1) you are routing your docs to the correct leader, which happens with SolrJ and the CloudSolrServer automatically (rather than rolling your own, I strongly suggest you try this out), and (2) you have enough clients feeding the cluster to push CPU utilization on them all.
Very often slow indexing, or in your case lack of scaling, is a result of document acquisition; in your case, your doc generator is spending all its time waiting for the individual documents to get to Solr and come back.

bq: chooses a random solr server for each ADD request (with 1 doc per add request)

Probably your culprit right there. Each and every document requires that you cross the network (and forward that doc to the correct leader). So given that you're not seeing high CPU utilization, I suspect that you're not sending enough docs to SolrCloud fast enough to see scaling. You need to batch up multiple docs; I generally send 1,000 docs at a time. But even if you do solve this, the inter-node routing will prevent linear scaling. When a doc (or a batch of docs) goes to a random Solr node, here's what happens: (1) the docs are re-packaged into groups based on which shard they're destined for; (2) the sub-packets are forwarded to the leader for each shard; (3) the responses are gathered back and returned to the client. This set of operations will eventually degrade the scaling.

bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right?

That's the whole idea behind sharding. If we're talking search
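To make Erick's two suggestions concrete, here is a minimal SolrJ 4.x sketch that batches 1,000 docs at a time through the leader-aware CloudSolrServer; the ZooKeeper hosts, collection name, and field values are made up:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        // CloudSolrServer routes each doc to the correct shard leader
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            batch.add(doc);
            if (batch.size() == 1000) {  // batch size per Erick's suggestion
                server.add(batch);       // no per-batch commit: rely on autoCommit
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.shutdown();
    }
}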
Re: Ideas for debugging poor SolrCloud scalability
Yes, I was inadvertently sending them to a replica. When I sent them to the leader, the leader reported (1000 adds) and the replica reported only 1 add per document. So, it looks like the leader forwards the batched jobs individually to the replicas.

On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson erickerick...@gmail.com wrote: Internally, the docs are batched up into smaller buckets (10 as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing. Erick

On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com wrote: Regarding batch indexing: When I send batches of 1000 docs to a standalone Solr server, the log file reports (1000 adds) in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually (12 adds). Why do the batches appear to be broken up? Peter
Re: QParserPlugin question
Thanks for the advice. I've moved this query rewriting logic (not really business logic) to a SearchComponent and will leave the custom query parser to deal with the keyword (q=) related aspects of the query. In my case, the latter is mostly dealing with the presence of wildcard characters. Peter

On Wed, Oct 22, 2014 at 6:35 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: It's for an optimization. If the keyword is 'match all docs', I want to
: remove a custom PostFilter from the query and change the sort parameters
: (so the app doesn't have to do it). It looks like the responseHeader is
: displaying the 'originalParams', which are immutable.

That is in fact the point of including the params in the header - to make it clear what exactly the request handler got as input. echoParams can be used to control whether you get all the params (including those added as defaults/appends in configuration) or just the explicit params included in the request -- but there's no way for a QParserPlugin to change what the raw query param strings are -- the query it produces might not even have a meaningful toString. The params in the header are there for the very explicit reason of showing you exactly what input was used to produce this request -- if plugins could change them, they would be meaningless, since the modified params might not produce the same request.

If you want to have a custom plugin that applies business logic to change the behavior internally and reports back info for the client to use in future requests, I would suggest doing that as a SearchComponent and including your own section in the response with details about what the client should do moving forward. (For example: I had a search component once upon a time that applied QueryElevationComponent-type checking against the query string and filters, and based on what it found would set the sort and add some filters unless explicit sort/filter params were provided by the client -- the sort and filters that were added were included, along with some additional metadata about what rule was matched, in a new section of the response.)

-Hoss http://www.lucidworks.com/
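A skeletal version of the SearchComponent approach Hoss describes, against the 4.x plugin API; the component name, the match-all test, and the params it rewrites are illustrative, not a drop-in implementation:

import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class MatchAllRewriteComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) {
        SolrParams params = rb.req.getParams();
        if ("*:*".equals(params.get(CommonParams.Q))) {
            ModifiableSolrParams rewritten = new ModifiableSolrParams(params);
            rewritten.set(CommonParams.SORT, "myfield asc"); // hypothetical cheaper sort
            rewritten.remove(CommonParams.FQ);               // drop the custom PostFilter (illustrative)
            rb.req.setParams(rewritten);
            // report what was done in a new response section, per Hoss's suggestion
            rb.rsp.add("queryRewrite", "match-all: dropped PostFilter, changed sort");
        }
    }

    @Override
    public void process(ResponseBuilder rb) {
        // all the work happens in prepare()
    }

    @Override
    public String getDescription() {
        return "Rewrites match-all-docs queries";
    }

    @Override
    public String getSource() {
        return "";
    }
}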
QParserPlugin question
I have a custom query parser that modifies the filter query list based on the keyword query. This works, but the 'fq' list in the responseHeader contains the original filter list. The debugQuery output does display the modified filter list. Is there a way to change the responseHeader? I could probably do this in a custom QueryComponent, but the query parser seems like a reasonable place to do this. Thanks, Peter
Re: QParserPlugin question
It's for an optimization. If the keyword is 'match all docs', I want to remove a custom PostFilter from the query and change the sort parameters (so the app doesn't have to do it). It looks like the responseHeader is displaying the 'originalParams', which are immutable.

On Wed, Oct 22, 2014 at 2:10 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com wrote: I don't know why you need to change it? You can use omitHeader=true on the URL to remove the header if you want.
Re: QParserPlugin question
I meant to say: If the keyword is *:* (MatchAllDocsQuery)...

On Wed, Oct 22, 2014 at 2:17 PM, Peter Keegan peterlkee...@gmail.com wrote: It's for an optimization. If the keyword is 'match all docs', I want to remove a custom PostFilter from the query and change the sort parameters (so the app doesn't have to do it). It looks like the responseHeader is displaying the 'originalParams', which are immutable.

On Wed, Oct 22, 2014 at 2:10 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com wrote: I don't know why you need to change it? You can use omitHeader=true on the URL to remove the header if you want.
Re: Does Solr support this?
I'm doing something similar with a custom search component. See SOLR-6502 https://issues.apache.org/jira/browse/SOLR-6502

On Thu, Oct 16, 2014 at 8:14 AM, Upayavira u...@odoko.co.uk wrote: Nope, not yet. Someone did propose a JavascriptRequestHandler or such, which would allow you to code such things in Javascript (obviously), but I don't believe that has been accepted or completed yet. Upayavira

On Thu, Oct 16, 2014, at 03:48 AM, Aaron Lewis wrote: Hi, I'm trying to do an "if the first query is empty then do a second query", e.g. if this returns no rows: title:XX AND subject:YY then do title:XX. I can do that with two queries, but I'm wondering if I can merge them into a single one? -- Best Regards, Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/ Finger Print: 9F67 391B B770 8FF6 99DC D92D 87F6 2602 1371 4D33
Question about filter cache size
Say I have a boolean field named 'hidden', and less than 1% of the documents in the index have hidden=true. Do both these filter queries use the same docset cache size?

fq=hidden:false
fq=!hidden:true

Peter
Re: Question about filter cache size
"it will be cached as hidden:true and then inverted"

Inverted at query time, so for best query performance use fq=hidden:false, right?

On Fri, Oct 3, 2014 at 3:57 PM, Yonik Seeley yo...@heliosearch.com wrote:

On Fri, Oct 3, 2014 at 3:42 PM, Peter Keegan peterlkee...@gmail.com wrote: Say I have a boolean field named 'hidden', and less than 1% of the documents in the index have hidden=true. Do both these filter queries use the same docset cache size? fq=hidden:false fq=!hidden:true

Nope... !hidden:true will be smaller in the cache (it will be cached as hidden:true and then inverted). The downside is that you'll pay the cost of that inversion. -Yonik http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
Re: MaxScore
See if SOLR-5831 https://issues.apache.org/jira/browse/SOLR-5831 helps. Peter

On Tue, Sep 16, 2014 at 11:32 PM, William Bell billnb...@gmail.com wrote: What we need is a function like scale(field,min,max) that only operates on the results that come back from the search. scale() takes the min and max from the field in the index, not necessarily those in the results. I cannot think of a solution. max() only looks at one field, not across fields in the results. I tried a query() but cannot think of a way to get the max value of a field ONLY in the results... Ideas? -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: Edismax mm and efficiency
I implemented a custom QueryComponent that issues the edismax query with mm=100%, and if no results are found, it reissues the query with mm=1. This doubled our query throughput (compared to mm=1 always), as we do some expensive RankQuery processing. For your very long student queries, mm=100% would obviously be too high, so you'd have to experiment.

On Fri, Sep 5, 2014 at 1:34 PM, Walter Underwood wun...@wunderwood.org wrote: Great! We have some very long queries, where students paste entire homework problems. One of them was 1051 words. Many of them are over 100 words. This could help. In the Jira discussion, I saw some comments about handling the most sparse lists first. We did something like that in the Infoseek Ultra engine about twenty years ago. Short termlists (documents matching a term) were processed first, which kept the in-memory lists of matching docs small. It also allowed early short-circuiting for no-hits queries. What would be a high mm value, 75%? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/

On Sep 4, 2014, at 11:52 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Indeed, https://issues.apache.org/jira/browse/LUCENE-4571 My feeling is it gives a significant gain at high mm values.

On Fri, Sep 5, 2014 at 3:01 AM, Walter Underwood wun...@wunderwood.org wrote: Are there any speed advantages to using "mm"? I can imagine pruning the set of matching documents early, which could help, but is that (or something else) done? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Edismax mm and efficiency
Sure. I created SOLR-6502. The tricky part was handling the behavior in a sharded index. When the index is sharded, the response from each shard will contain a parameter that indicates whether the search results are from the conjunction of all keywords (mm=100%) or from the disjunction (mm=1). If the shards return both types, then only the results from the conjunction are returned. This is necessary in order to get the same results independent of the number of shards. Peter

On Wed, Sep 10, 2014 at 11:07 AM, Walter Underwood wun...@wunderwood.org wrote: We do that strict/loose query sequence, but on the client side with two requests. Would you consider contributing the QueryComponent? wunder

On Sep 10, 2014, at 3:47 AM, Peter Keegan peterlkee...@gmail.com wrote: I implemented a custom QueryComponent that issues the edismax query with mm=100%, and if no results are found, it reissues the query with mm=1. This doubled our query throughput (compared to mm=1 always), as we do some expensive RankQuery processing. For your very long student queries, mm=100% would obviously be too high, so you'd have to experiment.
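Walter's client-side variant is easy to sketch in SolrJ (4.x API; the server instance and edismax handler defaults are assumed to be configured elsewhere):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TwoPassQuery {
    // First pass is strict (mm=100%); fall back to loose (mm=1) only on zero hits.
    public static QueryResponse search(SolrServer server, String userQuery) throws Exception {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        q.set("mm", "100%");
        QueryResponse rsp = server.query(q);
        if (rsp.getResults().getNumFound() == 0) {
            q.set("mm", "1");
            rsp = server.query(q);
        }
        return rsp;
    }
}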
Re: ExternalFileFieldReloader and commit
I entered SOLR-6326 https://issues.apache.org/jira/browse/SOLR-6326 thanks, Peter On Tue, Aug 5, 2014 at 6:50 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi Peter, It seems like a bug to me, too. Please file a JIRA ticket if you can so that someone can take it. Koji -- http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html (2014/08/05 22:34), Peter Keegan wrote: When there are multiple 'external file field' files available, Solr will reload the last one (lexicographically) with a commit, but only if changes were made to the index. Otherwise, it skips the reload and logs: No uncommitted changes. Skipping IW.commit. Has anyone else noticed this? It seems like a bug to me. (yes, I do have firstSearcher and newSearcher event listeners in solrconfig.xml) Peter
Re: ExternalFileFieldReloader and commit
The use case is: 1. A SolrJ client updates the main index (and replicas) and issues a commit at regular intervals. 2. Another component updates the external files at other intervals. Usually, the commits result in a new searcher which triggers the org.apache.solr.schema.ExternalFileFieldReloader, but only if there were changes to the main index. Using ReloadCacheRequestHandler in (2) above would result in the loss of index/replica synchronization provided by the commit in (1), and reloading the core is slow and overkill. I think it would be easier to have the SolrJ client in (1) always update a dummy document during each commit interval to force a new searcher. Thanks, Peter On Wed, Aug 6, 2014 at 8:43 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Peter, Provided that SOLR-6326 is about a bug in ExternalFileFieldReloader, I'm asking here: Did you try to use org.apache.solr.search.function.FileFloatSource.ReloadCacheRequestHandler? Let me know if you need help with it. As a workaround, you can reload the core via REST or click a button in the Solr Admin UI; your questions are welcome. On Wed, Aug 6, 2014 at 4:02 PM, Peter Keegan peterlkee...@gmail.com wrote: I entered SOLR-6326 https://issues.apache.org/jira/browse/SOLR-6326 thanks, Peter On Tue, Aug 5, 2014 at 6:50 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi Peter, It seems like a bug to me, too. Please file a JIRA ticket if you can so that someone can take it. Koji -- http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html (2014/08/05 22:34), Peter Keegan wrote: When there are multiple 'external file field' files available, Solr will reload the last one (lexicographically) with a commit, but only if changes were made to the index. Otherwise, it skips the reload and logs: No uncommitted changes. Skipping IW.commit. Has anyone else noticed this? It seems like a bug to me. (yes, I do have firstSearcher and newSearcher event listeners in solrconfig.xml) Peter -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
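A minimal SolrJ sketch of that dummy-document workaround (zkHost, collection, and field names hypothetical):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DummyDocCommitter {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    // Touching a reserved dummy document makes every commit a real index
    // change, so a new searcher opens and ExternalFileFieldReloader runs.
    SolrInputDocument dummy = new SolrInputDocument();
    dummy.addField("id", "dummy-doc");                         // hypothetical reserved id
    dummy.addField("timestamp_l", System.currentTimeMillis()); // hypothetical field
    server.add(dummy);
    server.commit();
    server.shutdown();
  }
}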
ExternalFileFieldReloader and commit
When there are multiple 'external file field' files available, Solr will reload the last one (lexicographically) with a commit, but only if changes were made to the index. Otherwise, it skips the reload and logs: No uncommitted changes. Skipping IW.commit. Has anyone else noticed this? It seems like a bug to me. (yes, I do have firstSearcher and newSearcher event listeners in solrconfig.xml) Peter
Question about ReRankQuery
I'm looking at how 'ReRankQuery' works. If the main query has a Sort criteria, it is only used to sort the first pass results. The QueryScorer used in the second pass only reorders the ScoreDocs based on score and docid, but doesn't use the original Sort fields. If the Sort criteria is 'score desc, myfield asc', I would expect 'myfield' to break score ties from the second pass after rescoring. Is this a bug or the intended behavior? Thanks, Peter
Re: Question about ReRankQuery
See http://heliosearch.org/solrs-new-re-ranking-feature/ On Wed, Jul 23, 2014 at 11:27 AM, Erick Erickson erickerick...@gmail.com wrote: I'm having a little trouble understanding the use-case here. Why use re-ranking? Isn't this just combining the original query with the second query with an AND and using the original sort? At the end, you have your original list in its original order, with (potentially) some documents removed that don't satisfy the secondary query. Or I'm missing the boat entirely. Best, Erick On Wed, Jul 23, 2014 at 6:31 AM, Peter Keegan peterlkee...@gmail.com wrote: I'm looking at how 'ReRankQuery' works. If the main query has a Sort criteria, it is only used to sort the first pass results. The QueryScorer used in the second pass only reorders the ScoreDocs based on score and docid, but doesn't use the original Sort fields. If the Sort criteria is 'score desc, myfield asc', I would expect 'myfield' to break score ties from the second pass after rescoring. Is this a bug or the intended behavior? Thanks, Peter
Re: Question about ReRankQuery
The ReRankingQParserPlugin uses the Lucene QueryRescorer, which only uses the score from the re-rank query when re-ranking the top N documents. Understood, but if the re-rank scores produce new ties, wouldn't you want to resort them with the FieldSortedHitQueue? Anyway, I was looking to reimplement the ScaleScoreQParser PostFilter plugin with RankQuery, and would need to implement the behavior of the DelegateCollector there for handling multiple sort fields. Peter On Wednesday, July 23, 2014, Joel Bernstein joels...@gmail.com wrote: The ReRankingQParserPlugin uses the Lucene QueryRescorer, which only uses the score from the re-rank query when re-ranking the top N documents. The ReRankingQParserPlugin is built as a RankQuery plugin so you can swap in your own implementation. Patches are also welcome for the existing implementation. Joel Bernstein Search Engineer at Heliosearch On Wed, Jul 23, 2014 at 11:37 AM, Peter Keegan peterlkee...@gmail.com wrote: See http://heliosearch.org/solrs-new-re-ranking-feature/ On Wed, Jul 23, 2014 at 11:27 AM, Erick Erickson erickerick...@gmail.com wrote: I'm having a little trouble understanding the use-case here. Why use re-ranking? Isn't this just combining the original query with the second query with an AND and using the original sort? At the end, you have your original list in its original order, with (potentially) some documents removed that don't satisfy the secondary query. Or I'm missing the boat entirely. Best, Erick On Wed, Jul 23, 2014 at 6:31 AM, Peter Keegan peterlkee...@gmail.com wrote: I'm looking at how 'ReRankQuery' works. If the main query has a Sort criteria, it is only used to sort the first pass results. The QueryScorer used in the second pass only reorders the ScoreDocs based on score and docid, but doesn't use the original Sort fields. If the Sort criteria is 'score desc, myfield asc', I would expect 'myfield' to break score ties from the second pass after rescoring. Is this a bug or the intended behavior? Thanks, Peter
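For readers landing here from the archives, a re-rank request of the kind being discussed looks roughly like this (parameter values hypothetical):

q=some keywords&rq={!rerank reRankQuery=$rqq reRankDocs=200 reRankWeight=3}&rqq=category:featured&sort=score desc, myfield asc

Per the exchange above, the sort is applied in the first pass only; the re-rank pass reorders the top reRankDocs by score alone, which is exactly the tie-breaking behavior being questioned.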
Question about solrcloud recovery process
I bring up a new Solr node with no index and watch the index being replicated from the leader. The index size is 12G and the replication takes about 6 minutes, according to the replica log (from 'Starting recovery process' to 'Finished recovery process'). However, shortly after the replication begins, while the index files are being copied, I am able to query the index on the replica and see q=*:* find all of the documents. But, from the core admin screen, numDocs = 0, and in the cloud screen the replica is in 'recovering' mode. How can this be? Peter
Re: Question about solrcloud recovery process
No, we're not doing NRT. The search clients aren't using CloudSolrServer and they are behind an AWS load balancer, which calls the Solr ping handler (implemented with ClusterStateAwarePingRequestHandler) to determine when the node is active. This ping handler also responds during the index copy, which doesn't seem right. I'll have to figure out why it does this before the replica is really active. Peter On Thu, Jul 3, 2014 at 9:36 AM, Mark Miller markrmil...@gmail.com wrote: I don’t know offhand about the num docs issue - are you doing NRT? As far as being able to query the replica, I’m not sure anyone ever got to making that fail if you directly query a node that is not active. It certainly came up, but I have no memory of anyone tackling it. Of course in many other cases, information is being pulled from zookeeper and recovering nodes are ignored. If this is the issue I think it is, it should only be an issue when you directly query recovery node. The CloudSolrServer client works around this issue as well. -- Mark Miller about.me/markrmiller On July 3, 2014 at 8:42:48 AM, Peter Keegan (peterlkee...@gmail.com) wrote: I bring up a new Solr node with no index and watch the index being replicated from the leader. The index size is 12G and the replication takes about 6 minutes, according to the replica log (from 'Starting recovery process' to 'Finished recovery process). However, shortly after the replication begins, while the index files are being copied, I am able to query the index on the replica and see q=*:* find all of the documents. But, from the core admin screen, numDocs = 0, and in the cloud screen the replica is in 'recovering' mode. How can this be? Peter
Re: Question about solrcloud recovery process
Aha, you are right wrdrvr! The query is forwarded to any of the active shards (I saw the query alternate between both of mine). Nice feature. Also, looking at 'ClusterStateAwarePingRequestHandler' (which I downloaded from www.manning.com/SolrinAction), it is checking zookeeper to see if the logical shard is active, not the specific 'this' replica, which is in 'recovering' state. I'll post a patch once I figure out the zookeeper api. Thanks, Peter On Thu, Jul 3, 2014 at 12:03 PM, wrdrvr wrd...@gmail.com wrote: Try querying the recovering core with distrib=false, you should get the count of docs in it. Most likely, since the replica is recovering it is forwarding all queries to the active replica, this can be verified in the core logs.
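For anyone else debugging this, the non-forwarded check wrdrvr describes is just (host and core names hypothetical):

http://replica-host:8983/solr/collection1/select?q=*:*&rows=0&distrib=false

With distrib=false the recovering core answers from its own partial index instead of forwarding to an active replica, so numFound reflects what has actually been copied so far.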
Custom QueryComponent to rewrite dismax query
We are using the 'edismax' query parser for its many benefits over the standard Lucene parser. For queries with more than 5 or 6 keywords (which is a lot for our typical user), the recall can be very high (sometimes matching 75% or more of the documents). This high recall, when coupled with some custom PostFilter scoring, is hurting the query performance. I tried varying the 'mm' (minimum match) parameter, but at values less than 100%, the response time didn't improve much, and at 100%, there were often no results, which is unacceptable. So, I wrote a custom QueryComponent which rewrites the DisMax query. Initially, the MinShouldMatch value is set to 100%. If the search returns 0 results, MinShouldMatch is set to 1 and the search is retried. This improved the QPS throughput by about 2.5X. However, this only worked with an unsharded index. With a sharded index, each shard returned only the results from the first search (mm=100%). In the debugger, I could see 2 'response/ResultContext' NV-Pairs in the SolrQueryResponse object, so I added code to remove the first pair if there were 2 pairs present, which fixed this problem. My question: is removing the extra ResultContext a reasonable solution to this problem? It just seems a little brittle to me. Thanks, Peter
Autoscaling Solr instances in AWS
We are running Solr 4.6.1 in AWS: - 2 Solr instances (1 shard, 1 leader, 1 replica) - 1 CloudSolrServer SolrJ client updating the index. - 3 Zookeepers The Solr instances are behind a load balancer and also in an auto scaling group. The ScaleUpPolicy will add up to 9 additional instances (replicas), 1 per minute. Later, the 9 replicas are terminated with the ScaleDownPolicy. Problem: during the ScaleUpPolicy, when the Solr Leader is under heavy query load, the SolrJ indexing client issues a commit which hangs and never returns. Note that the index schema contains 3 ExternalFileFields which slow down the commit process. Here's the stack trace: Thread 1959: (state = IN_NATIVE) - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise) - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150 (Compiled frame) - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121 (Compiled frame) - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71, line=166 (Compiled frame) - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90 (Compiled frame) - org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=137, line=281 (Compiled frame) - org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=5, line=115 (Compiled frame) - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=16, line=92 (Compiled frame) - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=2, line=62 (Compiled frame) - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254 (Compiled frame) - org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader() @bci=8, line=289 (Compiled frame) - org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader() @bci=1, line=252 (Compiled frame) - org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader() @bci=6, line=191 (Compiled frame) - org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=62, line=300 (Compiled frame) - org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=60, line=127 (Compiled frame) - org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest, org.apache.http.protocol.HttpContext) @bci=198, line=717 (Compiled frame) - org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=597, line=522 (Compiled frame) - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=344, line=906 (Compiled frame) - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest, org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame) - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest) @bci=6, line=784 (Compiled frame) - org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest,
org.apache.solr.client.solrj.ResponseParser) @bci=1175, line=395 (Compiled frame) - org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest) @bci=17, line=199 (Compiled frame) - org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(org.apache.solr.client.solrj.impl.LBHttpSolrServer$Req) @bci=132, line=285 (Compiled frame) - org.apache.solr.client.solrj.impl.CloudSolrServer.request(org.apache.solr.client.solrj.SolrRequest) @bci=838, line=640 (Compiled frame) - org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(org.apache.solr.client.solrj.SolrServer) @bci=17, line=117 (Compiled frame) - org.apache.solr.client.solrj.SolrServer.commit(boolean, boolean) @bci=16, line=168 (Interpreted frame) - org.apache.solr.client.solrj.SolrServer.commit() @bci=3, line=146 (Interpreted frame) The Solr leader log shows many connection timeout exceptions from the other Solr replicas during this period. Some of these timeouts may have been caused by replicas disappearing from the ScaleDownPolicy. From the search client application's point of view, everything looked fine, but indexing stopped until I restarted the SolrJ client. Does this look like a case where a timeout value needs to be increased somewhere? If so, which one? Thanks, Peter
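If it is the inter-replica update timeouts, a minimal sketch of where they live in a new-style solr.xml (values illustrative, not recommendations):

<solr>
  <solrcloud>
    <!-- connect/read timeouts (ms) for distributed *update* requests -->
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
    <!-- required solrcloud settings (host, hostPort, zkHost, ...) omitted -->
  </solrcloud>
</solr>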
Re: Distributed commits in CloudSolrServer
Are distributed commits also done in parallel across shards? Peter On Tue, Apr 15, 2014 at 3:50 PM, Mark Miller markrmil...@gmail.com wrote: Inline responses below. -- Mark Miller about.me/markrmiller On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com) wrote: I have a SolrCloud index, 1 shard, with a leader and one replica, and 3 ZKs. The Solr indexes are behind a load balancer. There is one CloudSolrServer client updating the indexes. The index schema includes 3 ExternalFileFields. When the CloudSolrServer client issues a hard commit, I observe that the commits occur sequentially, not in parallel, on the leader and replica. The duration of each commit is about a minute. Most of this time is spent reloading the 3 ExternalFileField files. Because of the sequential commits, there is a period of time (1 minute+) when the index searchers will return different results, which can cause a bad user experience. This will get worse as replicas are added to handle auto-scaling. The goal is to keep all replicas in sync w.r.t. the user queries. My questions: 1. Is there a reason that the distributed commits are done in sequence, not in parallel? Is there a way to change this behavior? The reason is that updates are currently done this way - it’s the only safe way to do it without solving some more problems. I don’t think you can easily change this. I think we should probably file a JIRA issue to track a better solution for commit handling. I think there are some complications because of how commits can be added on update requests, but its something we probably want to try and solve before tackling *all* updates to replicas in parallel with the leader. 2. If instead, the commits were done in parallel by a separate client via a GET to each Solr instance, how would this client get the host/port values for each Solr instance from zookeeper? Are there any downsides to doing commits this way? Not really, other than the extra management. Thanks, Peter
Re: Distributed commits in CloudSolrServer
Are distributed commits also done in parallel across shards? I meant 'sequentially' across shards. On Wed, Apr 16, 2014 at 9:08 AM, Peter Keegan peterlkee...@gmail.com wrote: Are distributed commits also done in parallel across shards? Peter On Tue, Apr 15, 2014 at 3:50 PM, Mark Miller markrmil...@gmail.com wrote: Inline responses below. -- Mark Miller about.me/markrmiller On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com) wrote: I have a SolrCloud index, 1 shard, with a leader and one replica, and 3 ZKs. The Solr indexes are behind a load balancer. There is one CloudSolrServer client updating the indexes. The index schema includes 3 ExternalFileFields. When the CloudSolrServer client issues a hard commit, I observe that the commits occur sequentially, not in parallel, on the leader and replica. The duration of each commit is about a minute. Most of this time is spent reloading the 3 ExternalFileField files. Because of the sequential commits, there is a period of time (1 minute+) when the index searchers will return different results, which can cause a bad user experience. This will get worse as replicas are added to handle auto-scaling. The goal is to keep all replicas in sync w.r.t. the user queries. My questions: 1. Is there a reason that the distributed commits are done in sequence, not in parallel? Is there a way to change this behavior? The reason is that updates are currently done this way - it’s the only safe way to do it without solving some more problems. I don’t think you can easily change this. I think we should probably file a JIRA issue to track a better solution for commit handling. I think there are some complications because of how commits can be added on update requests, but its something we probably want to try and solve before tackling *all* updates to replicas in parallel with the leader. 2. If instead, the commits were done in parallel by a separate client via a GET to each Solr instance, how would this client get the host/port values for each Solr instance from zookeeper? Are there any downsides to doing commits this way? Not really, other than the extra management. Thanks, Peter
Distributed commits in CloudSolrServer
I have a SolrCloud index, 1 shard, with a leader and one replica, and 3 ZKs. The Solr indexes are behind a load balancer. There is one CloudSolrServer client updating the indexes. The index schema includes 3 ExternalFileFields. When the CloudSolrServer client issues a hard commit, I observe that the commits occur sequentially, not in parallel, on the leader and replica. The duration of each commit is about a minute. Most of this time is spent reloading the 3 ExternalFileField files. Because of the sequential commits, there is a period of time (1 minute+) when the index searchers will return different results, which can cause a bad user experience. This will get worse as replicas are added to handle auto-scaling. The goal is to keep all replicas in sync w.r.t. the user queries. My questions: 1. Is there a reason that the distributed commits are done in sequence, not in parallel? Is there a way to change this behavior? 2. If instead, the commits were done in parallel by a separate client via a GET to each Solr instance, how would this client get the host/port values for each Solr instance from zookeeper? Are there any downsides to doing commits this way? Thanks, Peter
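On question 2, a sketch of reading each replica's base URL out of ZooKeeper with the 4.x SolrJ API (zkHost and collection name hypothetical):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class ReplicaLister {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.connect();
    ClusterState state = server.getZkStateReader().getClusterState();
    for (Slice slice : state.getSlices("collection1")) {
      for (Replica replica : slice.getReplicas()) {
        // e.g. http://host:8983/solr; a per-instance commit could then be
        // sent to baseUrl + "/collection1/update?commit=true".
        String baseUrl = replica.getStr(ZkStateReader.BASE_URL_PROP);
        System.out.println(baseUrl);
      }
    }
    server.shutdown();
  }
}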
Re: Configurable collectors for custom ranking
Hi Joel, Although I solved this issue with a custom CollectorFactory, I also have a solution that uses a PostFilter and an optional ValueSource. Could you take a look at SOLR-5831 and see if I've got this right? Thanks, Peter On Mon, Dec 23, 2013 at 6:37 PM, Joel Bernstein joels...@gmail.com wrote: Peter, You actually only need the current score being collected to be in the request context. So you don't need a map, you just need an object wrapper around a mutable float. If you have a page size of X, only the top X scores need to be held onto, because all the other scores wouldn't have made it into that page anyway so they might as well be 0. Because the QueryResultCache caches a larger window than the page size you should keep enough scores so the cached docList is correct. But if you're only dealing with 150K of results you could just keep all the scores in a FloatArrayList and not worry about keeping the top X scores in a priority queue. During the collect hang onto the docIds and scores and build your scaling info. During the finish iterate your docIds and scale the scores as you go. Set your scaled score into the object wrapper that is in the request context before you collect each document. When you call collect on the delegate collectors they will call the custom value source for each document to perform the sort. Your custom value source will return whatever the float value is in the request context at that time. If you're also going to run this postfilter when you're doing a standard rank by score you'll also need to send down a dummy scorer to the delegate collectors. Spend some time with the CollapsingQParserPlugin in trunk to see how the dummy scorer works. I'll be adding value source collapse criteria to the CollapsingQParserPlugin this week and it will have a similar interaction between a PostFilter and value source. So you may want to watch SOLR-5536 to see an example of this. Joel Joel Bernstein Search Engineer at Heliosearch On Mon, Dec 23, 2013 at 4:03 PM, Peter Keegan peterlkee...@gmail.com wrote: Hi Joel, Could you clarify what would be in the key,value Map added to the SearchRequest context? It seems that all the docId/score tuples need to be there, including the ones not in the 'top N ScoreDocs' PriorityQueue (score=0). If so would the Map be something like: <"scaled_scores", Map<Integer,Float>>? Also, what is the reason for passing score=0 for documents that aren't in the PriorityQueue? Will these docs get filtered out before a normal sort by score? Thanks, Peter On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com wrote: The sorting is going to happen in the lower level collectors. You need a value source that returns the score of the document being collected. Here is how you can make this happen: 1) Create an object in your PostFilter that simply holds the current score. Place this object in the SearchRequest context map. Update object.score as you pass the docs and scores to the lower collectors. 2) Create a value source that checks the SearchRequest context for the object that's holding the current score. Use this object to return the current score when called. For example if you give the value source a handle called score a compound function call will look like this: sum(score(), field(x)) Joel On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com wrote: Regarding my original goal, which is to perform a math function using the scaled score and a field value, and sort on the result, how does this fit in?
Must I implement another custom PostFilter with a higher cost than the scale PostFilter? Thanks, Peter On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan peterlkee...@gmail.com wrote: Thanks very much for the guidance. I'd be happy to donate a working solution. Peter On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I believe. They might apply to 4.3. I think as long you have the finish method that's all you'll need. If you can get this working it would be excellent if you could donate back the Scale PostFilter. On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com wrote: This is what I was looking for, but the DelegatingCollector 'finish' method doesn't exist in 4.3.0 :( Can this be patched in and are there any other PostFilter dependencies on 4.5? Thanks, Peter On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com wrote: Here is one approach to use in a postfilter
Getting index schema in SolrCloud mode
I'm indexing data with a SolrJ client via SolrServer. Currently, I parse the schema returned from a HttpGet on: localhost:8983/solr/collection/schema/fields What is the recommended way to read the schema with CloudSolrServer? Can it be done with a single HttpGet to a ZK server? Thanks, Peter
Re: How to override rollback behavior in DIH
Following up on this a bit - my main index is updated by a SolrJ client in another process. If the DIH fails, the SolrJ client is never informed of the index rollback, and any pending updates are lost. For now, I've made sure that the DIH processor never throws an exception, but this makes it a bit harder to detect the failure via the admin interface. Thanks, Peter On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan peterlkee...@gmail.com wrote: I have a custom data import handler that creates an ExternalFileField from a source that is different from the main index. If the import fails (in my case, a connection refused in URLDataSource), I don't want to roll back any uncommitted changes to the main index. However, this seems to be the default behavior. Is there a way to override the IndexWriter rollback? Thanks, Peter
Re: How to override rollback behavior in DIH
I'm actually doing the 'skip' on every successful call to 'nextRow' with this trick: row.put("$externalfield", null); // DocBuilder.addFields will skip fields starting with '$' because I'm only creating ExternalFileFields. However, an error could also occur in the 'init' call, so exceptions have to be caught there, too. Thanks, Peter On Fri, Jan 17, 2014 at 10:19 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Can you try using onError=skip on your entities which use this data source? It's been some time since I looked at the code so I don't know if this works with data source. Worth a try I guess. On Fri, Jan 17, 2014 at 7:20 PM, Peter Keegan peterlkee...@gmail.com wrote: Following up on this a bit - my main index is updated by a SolrJ client in another process. If the DIH fails, the SolrJ client is never informed of the index rollback, and any pending updates are lost. For now, I've made sure that the DIH processor never throws an exception, but this makes it a bit harder to detect the failure via the admin interface. Thanks, Peter On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan peterlkee...@gmail.com wrote: I have a custom data import handler that creates an ExternalFileField from a source that is different from the main index. If the import fails (in my case, a connection refused in URLDataSource), I don't want to roll back any uncommitted changes to the main index. However, this seems to be the default behavior. Is there a way to override the IndexWriter rollback? Thanks, Peter -- Regards, Shalin Shekhar Mangar.
Re: How to override rollback behavior in DIH
Hmm, this does get a bit complicated, and I'm not even doing any writes with the DIH SolrWriter. In retrospect, using a DIH to create only EFFs doesn't buy much except for the integration into the Solr Admin UI. Thanks for the pointer to 3671, James. Peter On Fri, Jan 17, 2014 at 10:59 AM, Dyer, James james.d...@ingramcontent.com wrote: Peter, I think you can override org.apache.solr.handler.dataimport.SolrWriter to have a custom (no-op) rollback method. Your new writer should implement org.apache.solr.handler.dataimport.DIHWriter. You can specify the writerImpl request parameter to specify the new class. Unfortunately, it isn't actually this easy because your new writer is going to have to know what to do for all the other methods. That is, there is no easy way to tell it how to write/commit/etc to Solr. The default SolrWriter has a lot of hardcoded parameters it gets sent on construction in DataImportHandler#handleRequestBody. You would have to somehow duplicate this construction on your own custom class. See SOLR-3671 for an explanation of this dilemma. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: pkeegan01...@gmail.com [mailto:pkeegan01...@gmail.com] On Behalf Of Peter Keegan Sent: Friday, January 17, 2014 7:51 AM To: solr-user@lucene.apache.org Subject: Re: How to override rollback behavior in DIH Following up on this a bit - my main index is updated by a SolrJ client in another process. If the DIH fails, the SolrJ client is never informed of the index rollback, and any pending updates are lost. For now, I've made sure that the DIH processor never throws an exception, but this makes it a bit harder to detect the failure via the admin interface. Thanks, Peter On Tue, Jan 14, 2014 at 11:12 AM, Peter Keegan peterlkee...@gmail.com wrote: I have a custom data import handler that creates an ExternalFileField from a source that is different from the main index. If the import fails (in my case, a connection refused in URLDataSource), I don't want to roll back any uncommitted changes to the main index. However, this seems to be the default behavior. Is there a way to override the IndexWriter rollback? Thanks, Peter
How to override rollback behavior in DIH
I have a custom data import handler that creates an ExternalFileField from a source that is different from the main index. If the import fails (in my case, a connection refused in URLDataSource), I don't want to roll back any uncommitted changes to the main index. However, this seems to be the default behavior. Is there a way to override the IndexWriter rollback? Thanks, Peter
Re: leading wildcard characters
I created SOLR-5630. Although WildcardQuery is much much faster now with AutomatonQuery, it can still result in slow queries when used in multiple keywords. From my testing, I think I will need to disable all WildcardQuerys and only allow PrefixQuery. Peter On Sat, Jan 11, 2014 at 4:17 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Peter, Yes you are correct. There is no way to disable it. Weird thing is javadoc says default is false but it is enabled by default in SolrQueryParserBase. boolean allowLeadingWildcard = true; http://search-lucene.com/jd/solr/solr-core/org/apache/solr/parser/SolrQueryParserBase.html#setAllowLeadingWildcard(boolean) There is an effort for making such (allowLeadingWilcard,fuzzyMinSim, fuzzyPrefixLength) properties configurable : https://issues.apache.org/jira/browse/SOLR-218 But this one is somehow old. Since its description is stale, do you want to open a new one? Ahmet On Friday, January 10, 2014 6:12 PM, Peter Keegan peterlkee...@gmail.com wrote: Removing ReversedWildcardFilterFactory had no effect. On Fri, Jan 10, 2014 at 10:48 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Peter, Can you remove any occurrence of ReversedWildcardFilterFactory in schema.xml? (even if you don't use it) Ahmet On Friday, January 10, 2014 3:34 PM, Peter Keegan peterlkee...@gmail.com wrote: How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard method is there in the parser, but nothing references the getter. Also, the Edismax parser always enables it and provides no way to override. Thanks, Peter
leading wildcard characters
How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard method is there in the parser, but nothing references the getter. Also, the Edismax parser always enables it and provides no way to override. Thanks, Peter
Re: leading wildcard characters
Removing ReversedWildcardFilterFactory had no effect. On Fri, Jan 10, 2014 at 10:48 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Peter, Can you remove any occurrence of ReversedWildcardFilterFactory in schema.xml? (even if you don't use it) Ahmet On Friday, January 10, 2014 3:34 PM, Peter Keegan peterlkee...@gmail.com wrote: How do you disable leading wildcards in 4.X? The setAllowLeadingWildcard method is there in the parser, but nothing references the getter. Also, the Edismax parser always enables it and provides no way to override. Thanks, Peter
Re: Zookeeper as Service
There's also: http://www.tanukisoftware.com/ On Thu, Jan 9, 2014 at 11:18 AM, Nazik Huq nazik...@yahoo.com wrote: From your email I gather your main concern is starting zookeeper on server startup. You may want to look at these non-native service oriented options too: Create a script (cmd or bat) to start ZK on server bootup. This method may not restart Zk if Zk crashes (not the server). Create a C# command line program that starts on server bootup (see above) and uses the .Net System.Diagnostics.Process.Start method to start Zk on server start and monitor the Zk process via a loop. Restart when the Zk process crashes or hangs. I prefer this method. There might be a Java equivalent of this. There are many examples available on the web. Cheers, @nazik_huq On Thursday, January 9, 2014 10:07 AM, Charlie Hull char...@flax.co.uk wrote: On 09/01/2014 09:44, Karthikeyan.Kannappan wrote: I am hosting in windows OS There are various ways to 'servicify' (yes that may not be an actual word) executable applications on Windows. The venerable SrvAny is one such option, as is the newer nssm.exe (Non-Sucking Service Manager). Bear in mind that a Windows Service doesn't operate quite the same way with regard to stdout and stderr, which may mean any error messages end up in a black hole, with you simply getting an unhelpful 'service failed to start' error message from Windows itself if something goes wrong. The 'working directory' is another thing that needs careful setting up. Cheers Charlie -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: Function query matching
: The bottom line for Peter is still the same: using scale() wrapped around : a function/query does involve computing the results for every document, : and that is going to scale linearly as the size of the index grows -- but : it is *only* because of the scale function. Another problem with this approach is that the scale() function will likely generate incorrect values because it occurs before any filters. If the filters drop high scoring docs, the scaled values will never include the 'maxTarget' value (and may not include the 'minTarget' value, either). Peter On Sat, Dec 7, 2013 at 2:30 PM, Chris Hostetter hossman_luc...@fucit.org wrote: (This is why I shouldn't send emails just before going to bed.) I woke up this morning realizing that of course I was completely wrong when I said this... : I want to be clear for 99% of the people reading this, if you find : yourself writing a query structure like this... : : q={!func}..functions involving wrapping $qq ... ... : ...Try to restructure the match you want to do into the form of a : multiplier ... : Because the latter case is much more efficient and Solr will only compute : the function values for the docs it needs to (that match the wrapped $qq : query) The reason I was wrong... Even though function queries do by default match all documents, and even if the main query is a function query (ie: q={!func}...), if there is an fq that filters down the set of documents, then the (main) function query will only be calculated for the documents that match the filter. It was trivial to amend the test I mentioned last night to show this (and I feel silly for not doing that last night and stopping myself from saying something foolish)... https://svn.apache.org/viewvc?view=revision&revision=r1548955 The bottom line for Peter is still the same: using scale() wrapped around a function/query does involve computing the results for every document, and that is going to scale linearly as the size of the index grows -- but it is *only* because of the scale function. -Hoss http://www.lucidworks.com/
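For concreteness, the scale-wrapped structure being dissected here looks like this (field names and weights hypothetical):

q={!func}sum(product(0.8,scale(query($qq),0,1)),product(0.2,field(myfield)))&qq={!edismax qf='title body'}some keywords&fq=type:job

Peter's caveat applies to the scale() bounds: they are computed before the fq is applied, so the filtered result set may never contain documents at the minTarget/maxTarget endpoints.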
Re: how to include result ordinal in response
Thank you both. The DocTransformer solution was very simple:

import java.io.IOException;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformerFactory;

public class PositionAugmenterFactory extends TransformerFactory {

  @Override
  public DocTransformer create(String field, SolrParams params, SolrQueryRequest req) {
    return new PositionAugmenter(field);
  }

  class PositionAugmenter extends DocTransformer {
    final String name;
    int position;

    public PositionAugmenter(String display) {
      this.name = display;
      this.position = 1;
    }

    @Override
    public String getName() {
      return name;
    }

    @Override
    public void transform(SolrDocument doc, int docid) throws IOException {
      doc.setField(name, position++);
    }
  }
}

@Jack: fl=[docid] is similar to using the uniqueKey, but still hard to compare visually (for me). The fields are not returned in the same order as specified in the 'fl' parameter. Can the order be overridden? Thanks, Peter On Fri, Jan 3, 2014 at 6:58 PM, Jack Krupansky j...@basetechnology.com wrote: Or just use the internal document ID: fl=*,[docid] Granted, the docID may change if a segment merge occurs and earlier documents have been deleted, but it may be sufficient for your purposes. -- Jack Krupansky -Original Message- From: Upayavira Sent: Friday, January 03, 2014 5:58 PM To: solr-user@lucene.apache.org Subject: Re: how to include result ordinal in response On Fri, Jan 3, 2014, at 10:00 PM, Peter Keegan wrote: Is there a simple way to output the result number (ordinal) with each returned document using the 'fl' parameter? This would be useful when visually comparing the results from 2 queries. I'm not aware of a simple way. If you're competent in Java, this could be a neat new DocTransformer component. You'd say: fl=*,[position] and you'd get a new field in your search results. Cruder ways would be to use XSLT to add it to an XML output, or a velocity template, but the DocTransformer approach would create something that could be of ongoing use. Upayavira
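Hooking it up is just a transformer registration in solrconfig.xml plus the fl syntax Upayavira suggested (package name hypothetical):

<transformer name="position" class="com.example.PositionAugmenterFactory"/>

and then request it with fl=*,[position]. Note the counter starts at 1 in each transformer instance created per request, so the ordinals are only meaningful within a single response.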
how to include result ordinal in response
Is there a simple way to output the result number (ordinal) with each returned document using the 'fl' parameter? This would be useful when visually comparing the results from 2 queries. Thanks, Peter
Re: Configurable collectors for custom ranking
In my case, the final function call looks something like this: sum(product($k1,score()),product($k2,field(x))) This means that all the scores would have to be scaled and passed down, not just the top N because even a low score could be offset by a high value in 'field(x)'. Thanks, Peter On Mon, Dec 23, 2013 at 6:37 PM, Joel Bernstein joels...@gmail.com wrote: Peter, You actually only need the current score being collected to be in the request context. So you don't need a map, you just need an object wrapper around a mutable float. If you have a page size of X, only the top X scores need to be held onto, because all the other scores wouldn't have made it into that page anyway so they might as well be 0. Because the QueryResultCache caches a larger window than the page size you should keep enough scores so the cached docList is correct. But if you're only dealing with 150K of results you could just keep all the scores in a FloatArrayList and not worry about keeping the top X scores in a priority queue. During the collect hang onto the docIds and scores and build your scaling info. During the finish iterate your docIds and scale the scores as you go. Set your scaled score into the object wrapper that is in the request context before you collect each document. When you call collect on the delegate collectors they will call the custom value source for each document to perform the sort. Your custom value source will return whatever the float value is in the request context at that time. If you're also going to run this postfilter when you're doing a standard rank by score you'll also need to send down a dummy scorer to the delegate collectors. Spend some time with the CollapsingQParserPlugin in trunk to see how the dummy scorer works. I'll be adding value source collapse criteria to the CollapsingQParserPlugin this week and it will have a similar interaction between a PostFilter and value source. So you may want to watch SOLR-5536 to see an example of this. Joel Joel Bernstein Search Engineer at Heliosearch On Mon, Dec 23, 2013 at 4:03 PM, Peter Keegan peterlkee...@gmail.com wrote: Hi Joel, Could you clarify what would be in the key,value Map added to the SearchRequest context? It seems that all the docId/score tuples need to be there, including the ones not in the 'top N ScoreDocs' PriorityQueue (score=0). If so would the Map be something like: <"scaled_scores", Map<Integer,Float>>? Also, what is the reason for passing score=0 for documents that aren't in the PriorityQueue? Will these docs get filtered out before a normal sort by score? Thanks, Peter On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com wrote: The sorting is going to happen in the lower level collectors. You need a value source that returns the score of the document being collected. Here is how you can make this happen: 1) Create an object in your PostFilter that simply holds the current score. Place this object in the SearchRequest context map. Update object.score as you pass the docs and scores to the lower collectors. 2) Create a value source that checks the SearchRequest context for the object that's holding the current score. Use this object to return the current score when called.
For example if you give the value source a handle called score a compound function call will look like this: sum(score(), field(x)) Joel On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com wrote: Regarding my original goal, which is to perform a math function using the scaled score and a field value, and sort on the result, how does this fit in? Must I implement another custom PostFilter with a higher cost than the scale PostFilter? Thanks, Peter On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan peterlkee...@gmail.com wrote: Thanks very much for the guidance. I'd be happy to donate a working solution. Peter On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I believe. They might apply to 4.3. I think as long you have the finish method that's all you'll need. If you can get this working it would be excellent if you could donate back the Scale PostFilter. On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com wrote: This is what I was looking for, but the DelegatingCollector 'finish' method doesn't exist in 4.3.0 :( Can this be patched in and are there any other PostFilter dependencies on 4.5? Thanks, Peter On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com wrote
Re: Configurable collectors for custom ranking
Hi Joel, Could you clarify what would be in the key,value Map added to the SearchRequest context? It seems that all the docId/score tuples need to be there, including the ones not in the 'top N ScoreDocs' PriorityQueue (score=0). If so would the Map be something like: <"scaled_scores", Map<Integer,Float>>? Also, what is the reason for passing score=0 for documents that aren't in the PriorityQueue? Will these docs get filtered out before a normal sort by score? Thanks, Peter On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com wrote: The sorting is going to happen in the lower level collectors. You need a value source that returns the score of the document being collected. Here is how you can make this happen: 1) Create an object in your PostFilter that simply holds the current score. Place this object in the SearchRequest context map. Update object.score as you pass the docs and scores to the lower collectors. 2) Create a value source that checks the SearchRequest context for the object that's holding the current score. Use this object to return the current score when called. For example if you give the value source a handle called score a compound function call will look like this: sum(score(), field(x)) Joel On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com wrote: Regarding my original goal, which is to perform a math function using the scaled score and a field value, and sort on the result, how does this fit in? Must I implement another custom PostFilter with a higher cost than the scale PostFilter? Thanks, Peter On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan peterlkee...@gmail.com wrote: Thanks very much for the guidance. I'd be happy to donate a working solution. Peter On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I believe. They might apply to 4.3. I think as long as you have the finish method that's all you'll need. If you can get this working it would be excellent if you could donate back the Scale PostFilter. On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com wrote: This is what I was looking for, but the DelegatingCollector 'finish' method doesn't exist in 4.3.0 :( Can this be patched in and are there any other PostFilter dependencies on 4.5? Thanks, Peter On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com wrote: Here is one approach to use in a postfilter 1) In the collect() method call score for each doc. Use the scores to create your scaleInfo. 2) Keep a bitset of the hits and a priorityQueue of your top X ScoreDocs. 3) Don't delegate any documents to lower collectors in the collect() method. 4) In the finish method create a score mapping (use the hppc IntFloatOpenHashMap) with your top X docIds pointing to their score, using the priorityQueue created in step 2. Then iterate the bitset (also created in step 2) sending down each doc to the lower collectors, retrieving and scaling the score from the score map. If the document is not in the score map then send down 0. You'll have to set up a dummy scorer to feed to lower collectors. The CollapsingQParserPlugin has an example of how to do this. On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan peterlkee...@gmail.com wrote: Hi Joel, I thought about using a PostFilter, but the problem is that the 'scale' function must be done after all matching docs have been scored but before adding them to the PriorityQueue that sorts just the rows to be returned.
Doing the 'scale' function wrapped in a 'query' is proving to be too slow when it visits every document in the index. In the Collector, I can see how to get the field values like this: indexSearcher.getSchema().getField("myfield").getType().getValueSource(schemaField, qparser).getValues() But, 'getValueSource' needs a QParser, which isn't available. And I can't create a QParser without a SolrQueryRequest, which isn't available. Thanks, Peter On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein joels...@gmail.com wrote: Peter, It sounds like you could achieve what you want to do in a PostFilter rather than extending the TopDocsCollector. Is there a reason why a PostFilter won't work for you? Joel On Tue, Dec 10, 2013 at 3:24 PM, Peter Keegan peterlkee...@gmail.com wrote: Quick question: In the context of a custom collector, how does one get the values of a field of type 'ExternalFileField'? Thanks, Peter On Tue, Dec 10, 2013 at 1:18 PM, Peter Keegan peterlkee...@gmail.com wrote: Hi Joel, This is related to another thread on function query matching ( http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513 ). The patch in SOLR-4465 will allow me to extend TopDocsCollector and perform the 'scale' function on only the documents matching the main dismax query.
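A compilable sketch of the value-source half of Joel's recipe (class names hypothetical; the PostFilter would create the ScoreHolder, stash it in the request context, and update it before each delegated collect()):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;

// Hypothetical "score()" value source: returns whatever score the
// PostFilter staged in the shared holder for the current document.
public class ContextScoreValueSource extends ValueSource {

  public static final class ScoreHolder {
    public float value;
  }

  private final ScoreHolder holder;

  public ContextScoreValueSource(ScoreHolder holder) {
    this.holder = holder;
  }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext) throws IOException {
    return new FloatDocValues(this) {
      @Override
      public float floatVal(int doc) {
        return holder.value; // the scaled score staged for this doc
      }
    };
  }

  @Override
  public String description() {
    return "score()";
  }

  @Override
  public boolean equals(Object o) {
    return this == o;
  }

  @Override
  public int hashCode() {
    return System.identityHashCode(this);
  }
}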
Re: Configurable collectors for custom ranking
In order to size the PriorityQueue, the result window size for the query is needed. This has been computed in the SolrIndexSearcher and available in: QueryCommand.getSupersetMaxDoc(), but doesn't seem to be available for the PostFilter in either the SolrParams or SolrQueryRequest. Is there a way to get this precomputed value or do I have to duplicate the logic from SolrIndexSearcher? Thanks, Peter On Thu, Dec 12, 2013 at 1:53 PM, Joel Bernstein joels...@gmail.com wrote: Thanks, I agree, this is powerful stuff. One of the reasons that I haven't gotten back to pluggable collectors is that I've been using PostFilters instead. When you start doing stuff with scores in postfilters you'll run into the bug in SOLR-5416. This will affect you when you use facets in combination with the QueryResultCache or tag and exclude faceting. The patch in SOLR-5416 resolves this issue. You'll just need your PostFilter to implement ScoreFilter and the SolrIndexSearcher will know how to handle things. The DelegatingCollector.finish() method is so new, these kinds of bugs are still being cleaned out of the system. SOLR-5416 should be in Solr 4.7. On Thu, Dec 12, 2013 at 12:54 PM, Peter Keegan peterlkee...@gmail.com wrote: This is pretty cool, and worthy of adding to Solr in Action (v2) and the other books. With function queries, flexible filter processing and caching, custom collectors, and post filters, there's a lot of flexibility here. Btw, the query times using a custom collector to scale/recompute scores are excellent (will have to see how it compares to your outlined solution). Thanks, Peter On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com wrote: The sorting is going to happen in the lower level collectors. You need a value source that returns the score of the document being collected. Here is how you can make this happen: 1) Create an object in your PostFilter that simply holds the current score. Place this object in the SearchRequest context map. Update object.score as you pass the docs and scores to the lower collectors. 2) Create a value source that checks the SearchRequest context for the object that's holding the current score. Use this object to return the current score when called. For example if you give the value source a handle called score a compound function call will look like this: sum(score(), field(x)) Joel On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com wrote: Regarding my original goal, which is to perform a math function using the scaled score and a field value, and sort on the result, how does this fit in? Must I implement another custom PostFilter with a higher cost than the scale PostFilter? Thanks, Peter On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan peterlkee...@gmail.com wrote: Thanks very much for the guidance. I'd be happy to donate a working solution. Peter On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I believe. They might apply to 4.3. I think as long as you have the finish method that's all you'll need. If you can get this working it would be excellent if you could donate back the Scale PostFilter. On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com wrote: This is what I was looking for, but the DelegatingCollector 'finish' method doesn't exist in 4.3.0 :( Can this be patched in and are there any other PostFilter dependencies on 4.5?
Thanks, Peter On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com wrote: Here is one approach to use in a postfilter 1) In the collect() method call score for each doc. Use the scores to create your scaleInfo. 2) Keep a bitset of the hits and a priorityQueue of your top X ScoreDocs. 3) Don't delegate any documents to lower collectors in the collect() method. 4) In the finish method create a score mapping (use the hppc IntFloatOpenHashMap) with your top X docIds pointing to their score, using the priorityQueue created in step 2. Then iterate the bitset (also created in step 2) sending down each doc to the lower collectors, retrieving and scaling the score from the score map. If the document is not in the score map then send down 0. You'll have to set up a dummy scorer to feed to lower collectors. The CollapsingQParserPlugin has an example of how to do this. On Wed, Dec 11, 2013 at 2:05 PM, Peter Keegan
Re: Configurable collectors for custom ranking
I implemented the PostFilter approach described by Joel. Just iterating over the OpenBitSet, even without the scaling or the HashMap lookup, added 30ms to a query time, which kinda surprised me. There were about 150K hits out of a total of 500K. Is OpenBitSet the best way to do this? Thanks, Peter On Thu, Dec 19, 2013 at 9:51 AM, Peter Keegan peterlkee...@gmail.com wrote: In order to size the PriorityQueue, the result window size for the query is needed. This has been computed in the SolrIndexSearcher and available in: QueryCommand.getSupersetMaxDoc(), but doesn't seem to be available for the PostFilter in either the SolrParams or SolrQueryRequest. Is there a way to get this precomputed value or do I have to duplicate the logic from SolrIndexSearcher? Thanks, Peter On Thu, Dec 12, 2013 at 1:53 PM, Joel Bernstein joels...@gmail.com wrote: Thanks, I agree, this is powerful stuff. One of the reasons that I haven't gotten back to pluggable collectors is that I've been using PostFilters instead. When you start doing stuff with scores in postfilters you'll run into the bug in SOLR-5416. This will affect you when you use facets in combination with the QueryResultCache or tag and exclude faceting. The patch in SOLR-5416 resolves this issue. You'll just need your PostFilter to implement ScoreFilter and the SolrIndexSearcher will know how to handle things. The DelegatingCollector.finish() method is so new, these kinds of bugs are still being cleaned out of the system. SOLR-5416 should be in Solr 4.7. On Thu, Dec 12, 2013 at 12:54 PM, Peter Keegan peterlkee...@gmail.com wrote: This is pretty cool, and worthy of adding to Solr in Action (v2) and the other books. With function queries, flexible filter processing and caching, custom collectors, and post filters, there's a lot of flexibility here. Btw, the query times using a custom collector to scale/recompute scores are excellent (will have to see how it compares to your outlined solution). Thanks, Peter On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com wrote: The sorting is going to happen in the lower level collectors. You need a value source that returns the score of the document being collected. Here is how you can make this happen: 1) Create an object in your PostFilter that simply holds the current score. Place this object in the SearchRequest context map. Update object.score as you pass the docs and scores to the lower collectors. 2) Create a value source that checks the SearchRequest context for the object that's holding the current score. Use this object to return the current score when called. For example if you give the value source a handle called score a compound function call will look like this: sum(score(), field(x)) Joel On Thu, Dec 12, 2013 at 9:58 AM, Peter Keegan peterlkee...@gmail.com wrote: Regarding my original goal, which is to perform a math function using the scaled score and a field value, and sort on the result, how does this fit in? Must I implement another custom PostFilter with a higher cost than the scale PostFilter? Thanks, Peter On Wed, Dec 11, 2013 at 4:01 PM, Peter Keegan peterlkee...@gmail.com wrote: Thanks very much for the guidance. I'd be happy to donate a working solution. Peter On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5020 has the commit info, it's mainly changes to SolrIndexSearcher I believe. They might apply to 4.3. I think as long as you have the finish method that's all you'll need.
If you can get this working it would be excellent if you could donate back the Scale PostFilter. On Wed, Dec 11, 2013 at 3:36 PM, Peter Keegan peterlkee...@gmail.com wrote: This is what I was looking for, but the DelegatingCollector 'finish' method doesn't exist in 4.3.0 :( Can this be patched in and are there any other PostFilter dependencies on 4.5? Thanks, Peter On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com wrote: Here is one approach to use in a postfilter 1) In the collect() method call score for each doc. Use the scores to create your scaleInfo. 2) Keep a bitset of the hits and a priorityQueue of your top X ScoreDocs. 3) Don't delegate any documents to lower collectors in the collect() method. 4) In the finish method create a score mapping (use the hppc IntFloatOpenHashMap) with your top X docIds pointing to their score, using the priorityQueue created in step 2. Then iterate the bitset (also created in step 2) sending down each doc
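For what it's worth, the standard way to walk an OpenBitSet is the nextSetBit loop below (a sketch using the 'hits' bitset from Joel's recipe; FixedBitSet in later 4.x releases offers the same pattern). The scan itself is usually cheap, so if the 30 ms holds up it is worth checking what the loop body does per hit.

// Visit every set bit; ~150K iterations for 150K hits.
for (int doc = hits.nextSetBit(0); doc != -1; doc = hits.nextSetBit(doc + 1)) {
  // per-hit work (delegate.collect, score lookup, ...) usually dominates the cost
}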
Re: Configurable collectors for custom ranking
Regarding my original goal, which is to perform a math function using the scaled score and a field value, and sort on the result, how does this fit in? Must I implement another custom PostFilter with a higher cost than the scale PostFilter? Thanks, Peter
Re: Configurable collectors for custom ranking
This is pretty cool, and worthy of adding to Solr in Action (v2) and the other books. With function queries, flexible filter processing and caching, custom collectors, and post filters, there's a lot of flexibility here. Btw, the query times using a custom collector to scale/recompute scores are excellent (will have to see how it compares to your outlined solution). Thanks, Peter On Thu, Dec 12, 2013 at 11:13 AM, Joel Bernstein joels...@gmail.com wrote: The sorting is going to happen in the lower level collectors. You need a value source that returns the score of the document being collected. Here is how you can make this happen: 1) Create an object in your PostFilter that simply holds the current score. Place this object in the SearchRequest context map. Update object.score as you pass the docs and scores to the lower collectors. 2) Create a value source that checks the SearchRequest context for the object that's holding the current score. Use this object to return the current score when called. For example, if you give the value source a handle called score, a compound function call will look like this: sum(score(), field(x)) Joel
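A minimal sketch of the value source Joel describes is below, assuming the Lucene 4.x ValueSource API. The ScoreHolder class and all names are invented for illustration; the PostFilter would put the holder into the SolrQueryRequest context (req.getContext()) and a custom ValueSourceParser would fetch it from there when it builds this value source.

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;

public class CurrentScoreValueSource extends ValueSource {

  /** Hypothetical shared holder; the PostFilter updates 'score' as it replays docs. */
  public static class ScoreHolder { public volatile float score; }

  private final ScoreHolder holder;

  public CurrentScoreValueSource(ScoreHolder holder) { this.holder = holder; }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext) throws IOException {
    return new FloatDocValues(this) {
      @Override
      public float floatVal(int doc) {
        return holder.score;  // whatever the PostFilter set for the doc being collected
      }
    };
  }

  @Override public boolean equals(Object o) { return this == o; }
  @Override public int hashCode() { return System.identityHashCode(this); }
  @Override public String description() { return "score()"; }
}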
Re: Configurable collectors for custom ranking
Hi Joel, I thought about using a PostFilter, but the problem is that the 'scale' function must be done after all matching docs have been scored but before adding them to the PriorityQueue that sorts just the rows to be returned. Doing the 'scale' function wrapped in a 'query' is proving to be too slow when it visits every document in the index. In the Collector, I can see how to get the field values like this: indexSearcher.getSchema().getField("myfield").getType().getValueSource(schemaField, qparser).getValues() But 'getValueSource' needs a QParser, which isn't available. And I can't create a QParser without a SolrQueryRequest, which isn't available. Thanks, Peter On Wed, Dec 11, 2013 at 1:48 PM, Joel Bernstein joels...@gmail.com wrote: Peter, It sounds like you could achieve what you want to do in a PostFilter rather than extending the TopDocsCollector. Is there a reason why a PostFilter won't work for you? Joel
Re: Configurable collectors for custom ranking
From the Collector context, I suppose I can access the FileFloatSource directly like this, although it's not generic:

SchemaField field = indexSearcher.getSchema().getField(fieldName);
String dataDir = indexSearcher.getSchema().getResourceLoader().getDataDir();
ExternalFileField eff = (ExternalFileField) field.getType();
FileFloatSource fieldValues = eff.getFileFloatSource(field, dataDir);

And then read the values in 'setNextReader'. Peter
Re: Configurable collectors for custom ranking
This is what I was looking for, but the DelegatingCollector 'finish' method doesn't exist in 4.3.0 :( Can this be patched in, and are there any other PostFilter dependencies on 4.5? Thanks, Peter On Wed, Dec 11, 2013 at 3:16 PM, Joel Bernstein joels...@gmail.com wrote: Here is one approach to use in a postfilter: 1) In the collect() method call score() for each doc. Use the scores to create your scaleInfo. 2) Keep a bitset of the hits and a priorityQueue of your top X ScoreDocs. 3) Don't delegate any documents to lower collectors in the collect() method. 4) In the finish method create a score mapping (use the hppc IntFloatOpenHashMap) with your top X docIds pointing to their score, using the priorityQueue created in step 2. Then iterate the bitset (also created in step 2), sending down each doc to the lower collectors, retrieving and scaling the score from the score map. If the document is not in the score map then send down 0. You'll have to set up a dummy scorer to feed to the lower collectors. The CollapsingQParserPlugin has an example of how to do this.
Re: Configurable collectors for custom ranking
Thanks very much for the guidance. I'd be happy to donate a working solution. Peter On Wed, Dec 11, 2013 at 3:53 PM, Joel Bernstein joels...@gmail.com wrote: SOLR-5020 has the commit info; it's mainly changes to SolrIndexSearcher, I believe. They might apply to 4.3. I think as long as you have the finish method, that's all you'll need. If you can get this working it would be excellent if you could donate back the Scale PostFilter.
Re: Configurable collectors for custom ranking
Hi Joel, This is related to another thread on function query matching ( http://lucene.472066.n3.nabble.com/Function-query-matching-td4099807.html#a4105513). The patch in SOLR-4465 will allow me to extend TopDocsCollector and perform the 'scale' function on only the documents matching the main dismax query. As you mention, it is a slightly intrusive design and requires that I manage my own PriorityQueue (and a local duplicate of HitQueue), but should work. I think a better design would hide the PQ from the plugin. Thanks, Peter On Sun, Dec 8, 2013 at 5:32 PM, Joel Bernstein joels...@gmail.com wrote: Hi Peter, I've been meaning to revisit configurable ranking collectors, but I haven't yet had a chance. It's on the shortlist of things I'd like to tackle though. On Fri, Dec 6, 2013 at 4:17 PM, Peter Keegan peterlkee...@gmail.com wrote: I looked at SOLR-4465 and SOLR-5045, where it appears that there is a goal to be able to do custom sorting and ranking in a PostFilter. So far, it looks like only custom aggregation can be implemented in PostFilter (5045). Custom sorting/ranking can be done in a pluggable collector (4465), but this patch is no longer in dev. Is there any other dev. being done on adding custom sorting (after collection) via a plugin? Thanks, Peter -- Joel Bernstein Search Engineer at Heliosearch
Re: Configurable collectors for custom ranking
Quick question: In the context of a custom collector, how does one get the values of a field of type 'ExternalFileField'? Thanks, Peter
Re: Function query matching
But for your specific goal Peter: Yes, if the whole point of the function you have is to generate a scaled score of your base $qq, ... Thanks for the confirmation, Chris. So, to do this efficiently, I think I need to implement a custom Collector that performs the scaling (and other math) after collecting the matching dismax query docs. I started a separate thread asking about the state of configurable collectors. Thanks, Peter On Sat, Dec 7, 2013 at 1:45 AM, Chris Hostetter hossman_luc...@fucit.org wrote: I had to do a double take when I read this sentence... : Even with any improvements to 'scale', all function queries will add a : linear increase to the Qtime as index size increases, since they match all : docs. ...because that smelled like either a bug in your methodology, or a bug in Solr. To convince myself there wasn't a bug in Solr, I wrote a test case (I'll commit tomorrow; a bunch of churn in svn right now is making ant precommit unhappy) to prove that when wrapping boost functions around queries, Solr will only evaluate the functions for docs matching the wrapped query -- so there is no linear increase as the index size increases, just the (necessary) linear increase as the number of *matching* docs grows. (For most functions, anyway -- as mentioned, scale is special.) BUT! ... then I remembered how this thread started, and your goal of scaling the scores from a wrapped query. I want to be clear, for the 99% of people reading this: if you find yourself writing a query structure like this... q={!func}..functions involving wrapping $qq ... qq={!edismax ...lots of stuff but still only matching subset of the index...} fq={!query v=$qq} ...try to restructure the math you want to do into the form of a multiplier: q={!boost b=$b v=$qq} b=...functions producing a score multiplier... qq={!edismax ...lots of stuff but still only matching subset of the index...} Because the latter case is much more efficient and Solr will only compute the function values for the docs it needs to (that match the wrapped $qq query). But for your specific goal Peter: Yes, if the whole point of the function you have is to generate a scaled score of your base $qq, then the function (wrapping the scale(), wrapping the query()) is going to have to be evaluated for every doc -- that will definitely be linear based on the size of the index. -Hoss http://www.lucidworks.com/
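To make Hoss's multiplier form concrete with the parameters used earlier in this thread, it would look something like the line below. The b function here is an arbitrary example of a score multiplier; it is not the scaled-sum ranking Peter is after, which is exactly why the faster form doesn't apply to his case:

select?qq={!edismax v='news' qf='title^2 body'}&b=product(0.25,field(myfield))&q={!boost b=$b v=$qq}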
Re: Function query matching
I added some timing logging to IndexSearcher and ScaleFloatFunction and compared a simple DisMax query with a DisMax query wrapped in the scale function. The index size was 500K docs; 61K docs match the DisMax query. The simple DisMax query took 33 ms, the function query took 89 ms. What I found was: 1. The scale query only normalized the scores once (in ScaleInfo.createScaleInfo) and added 33 ms to the Qtime. Subsequent calls to ScaleFloatFunction.getValues bypassed 'createScaleInfo' and added ~0 time. 2. The FunctionQuery 'nextDoc' iterations added 16 ms over the DisMax 'nextDoc' iterations. Here's the breakdown:

Simple DisMax query:
  weight.scorer: 3 ms (get term enum)
  scorer.score: 23 ms (nextDoc iterations)
  other: 3 ms
  Total: 33 ms

DisMax wrapped in ScaleFloatFunction:
  weight.scorer: 39 ms (get scaled values)
  scorer.score: 39 ms (nextDoc iterations)
  other: 11 ms
  Total: 89 ms

Even with any improvements to 'scale', all function queries will add a linear increase to the Qtime as index size increases, since they match all docs. Trey: I'd be happy to test any patch that you find improves the speed. On Mon, Dec 2, 2013 at 11:21 PM, Trey Grainger solrt...@gmail.com wrote: We're working on the same problem with the scale(query(...)) combination, so I'd like to share a bit more information that may be useful. *On the scale function:* Even though the scale query has to calculate the scores for all documents, it is actually doing this work twice for each ValueSource (once to calculate the min and max values, and then again when actually scoring the documents), which is inefficient. To solve the problem, we're in the process of putting a cache inside the scale function to remember the values for each document when they are initially computed (to find the min and max) so that the second pass can just use the previously computed values for each document. Our theory is that most of the extra time due to the scale function is really just the result of doing duplicate work. No promises this won't be overly costly in terms of memory utilization, but we'll see what we get in terms of speed improvements and will share the code if it works out well. Alternate implementation suggestions (or criticism of a cache like this) are also welcomed. *On the NoOp product function: scale(prod(1, query(...))):* We do the same thing, which ultimately is just an unnecessary waste of a loop through all documents to do an extra multiplication step. I just debugged the code and uncovered the problem. There is a Map (called context) that is passed through to each value source to store intermediate state, and both the query and scale functions are passing the ValueSource for the query function in as the KEY to this Map (as opposed to using some composite key that makes sense in the current context). Essentially, these lines are overwriting each other: Inside ScaleFloatFunction: context.put(this.source, scaleInfo); //this.source refers to the QueryValueSource, and the scaleInfo refers to a ScaleInfo object Inside QueryValueSource: context.put(this, w); //this refers to the same QueryValueSource from above, and the w refers to a Weight object As such, when the ScaleFloatFunction later goes to read the ScaleInfo from the context Map, it unexpectedly pulls the Weight object out instead, and thus the invalid cast exception occurs.
The NoOp multiplication works because it puts a different ValueSource between the query and the ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this (in QueryValueSource). This should be an easy fix. I'll create a JIRA ticket to use better key names in these functions and push up a patch. This will eliminate the need for the extra NoOp function. -Trey On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan peterlkee...@gmail.com wrote: I'm pursuing this possible PostFilter solution. I can see how to collect all the hits and recompute the scores in a PostFilter, after all the hits have been collected (for scaling). Now, I can't see how to get the custom doc/score values back into the main query's HitQueue. Any advice? Thanks, Peter On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan peterlkee...@gmail.com wrote: Instead of using a function query, could I use the edismax query (plus some low cost filters not shown in the example) and implement the scale/sum/product computation in a PostFilter? Is the query's maxScore available there? Thanks, Peter
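Trey's diagnosis is easy to reproduce with a plain HashMap. The demo below is purely illustrative (it is not the actual SOLR patch): it shows the clobbering he describes and one way a composite key avoids it.

import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;

public class ContextKeyDemo {
  public static void main(String[] args) {
    Map<Object, Object> context = new HashMap<Object, Object>();
    Object querySource = new Object();  // stands in for the shared QueryValueSource

    // What Trey describes: both functions key their state on the same object,
    // so the second put clobbers the first.
    context.put(querySource, "scaleInfo");
    context.put(querySource, "weight");
    System.out.println(context.get(querySource));  // prints "weight" -- scaleInfo is gone

    // One illustrative fix: namespace the key so the entries can coexist.
    context.put(new AbstractMap.SimpleEntry<Object, String>(querySource, "scaleInfo"), "scaleInfo");
    context.put(new AbstractMap.SimpleEntry<Object, String>(querySource, "weight"), "weight");
    System.out.println(context.get(new AbstractMap.SimpleEntry<Object, String>(querySource, "scaleInfo")));
  }
}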
Re: Function query matching
In my previous posting, I said: Subsequent calls to ScaleFloatFunction.getValues bypassed 'createScaleInfo' and added ~0 time. These subsequent calls are for the remaining segments in the index reader (21 segments). Peter
Configurable collectors for custom ranking
I looked at SOLR-4465 and SOLR-5045, where it appears that there is a goal to be able to do custom sorting and ranking in a PostFilter. So far, it looks like only custom aggregation can be implemented in PostFilter (5045). Custom sorting/ranking can be done in a pluggable collector (4465), but this patch is no longer in dev. Is there any other dev. being done on adding custom sorting (after collection) via a plugin? Thanks, Peter
Re: Function query matching
I'm pursuing this possible PostFilter solution. I can see how to collect all the hits and recompute the scores in a PostFilter, after all the hits have been collected (for scaling). Now, I can't see how to get the custom doc/score values back into the main query's HitQueue. Any advice? Thanks, Peter
Re: Function query matching
Instead of using a function query, could I use the edismax query (plus some low cost filters not shown in the example) and implement the scale/sum/product computation in a PostFilter? Is the query's maxScore available there? Thanks, Peter
Re: Function query matching
Hi, So, this query does just what I want, but it's typically 3 times slower than the edismax query without the functions: select?qq={!edismax v='news' qf='title^2 body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query v=$qq} Is there any way to speed this up? Would writing a custom function query that compiled all the function queries together be any faster? Thanks, Peter
Re: Function query matching
Although the 'scale' is a big part of it, here's a closer breakdown. Here are 4 queries with increasing functions, and their response times (caching turned off in solrconfig):

100 msec: select?q={!edismax v='news' qf='title^2 body'}
135 msec: select?qq={!edismax v='news' qf='title^2 body'}&q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}
200 msec: select?qq={!edismax v='news' qf='title^2 body'}&q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))&fq={!query v=$qq}
320 msec: select?qq={!edismax v='news' qf='title^2 body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query v=$qq}

Btw, that no-op product is necessary, else you get this exception: org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo thanks, peter On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : So, this query does just what I want, but it's typically 3 times slower : than the edismax query without the functions: that's because the scale() function is inherently slow (it has to compute the min/max value for every document in order to know how to scale them). What you are seeing is the price you have to pay to get that query with a normalized 0-1 value. (You might be able to save a little bit of time by eliminating that no-op multiply by 1: product(query($qq),1) ... but I doubt you'll even notice much of a change given that scale function.) : Is there any way to speed this up? Would writing a custom function query : that compiled all the function queries together be any faster? If you can find a faster implementation for scale() then by all means let us know, and we can fold it back into Solr. -Hoss
Re: Function query matching
I replaced the frange filter with the following filter and got the correct no. of results, and it was 3X faster: select?qq={!edismax v='news' qf='title^2 body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!edismax v='news' qf='title^2 body'} Then, I tried to simplify the query with parameter substitution, but 'fq' didn't parse correctly: select?qq={!edismax v='news' qf='title^2 body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq=$qq What is the proper syntax? Thanks, Peter
Re: Function query matching
Thanks On Mon, Nov 11, 2013 at 11:46 AM, Yonik Seeley yo...@heliosearch.com wrote: On Mon, Nov 11, 2013 at 11:39 AM, Peter Keegan peterlkee...@gmail.com wrote: fq=$qq What is the proper syntax? fq={!query v=$qq} -Yonik http://heliosearch.com -- making solr shine
Function query matching
Why does this function query return docs that don't match the embedded query? select?qq=text:news&q={!func}sum(query($qq),0)
Re: Function query matching
I'm trying to use a normalized score in a query, as I described in a recent thread titled Re: How to get similarity score between 0 and 1 not relative score. I'm using this query: select?qq={!edismax v='news' qf='title^2 body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!frange l=0.001}$q Is there another way to accomplish this using dismax boosting? On Thu, Nov 7, 2013 at 12:55 PM, Jason Hellman jhell...@innoventsolutions.com wrote: You can, of course, use a function range query: select?q=text:news&fq={!frange l=0 u=100}sum(x,y) http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/search/FunctionRangeQParserPlugin.html This will give you a bit more flexibility to meet your goal. On Nov 7, 2013, at 7:26 AM, Erik Hatcher erik.hatc...@gmail.com wrote: Function queries score (all) documents, but don't filter them. All documents effectively match a function query. Erik On Nov 7, 2013, at 1:48 PM, Peter Keegan peterlkee...@gmail.com wrote: Why does this function query return docs that don't match the embedded query? select?qq=text:news&q={!func}sum(query($qq),0)
Re: Data Import Handler
I've done this by adding an attribute to the entity element (e.g. myconfig=myconfig.xml) and reading it in the 'init' method with context.getResolvedEntityAttribute("myconfig"). Peter On Wed, Nov 6, 2013 at 8:25 AM, Ramesh ramesh.po...@vensaiinc.com wrote: Hi Folks, Can anyone suggest how I can customize the dataconfig.xml file? I want to provide database details (db_url, uname, password) from my own properties file instead of the dataconfig.xml file.
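One way to wire this up end to end is a custom data source along the lines of the sketch below. This assumes the Solr 4.x DataImportHandler API; the class name, the myconfig attribute, and the property names (db_url, uname, password) are taken from this thread or invented for illustration, so treat it as a starting point rather than a drop-in.

import java.io.FileInputStream;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.JdbcDataSource;

public class PropsFileJdbcDataSource extends JdbcDataSource {

  @Override
  public void init(Context context, Properties initProps) {
    // 'myconfig' is the custom entity attribute, e.g. <entity myconfig="myconfig.xml" ...>
    String propsFile = context.getResolvedEntityAttribute("myconfig");
    if (propsFile != null) {
      Properties db = new Properties();
      try (FileInputStream in = new FileInputStream(propsFile)) {
        db.load(in);
      } catch (Exception e) {
        throw new RuntimeException("Could not read " + propsFile, e);
      }
      // Map our property names onto the connection settings JdbcDataSource expects.
      initProps.setProperty("url", db.getProperty("db_url"));
      initProps.setProperty("user", db.getProperty("uname"));
      initProps.setProperty("password", db.getProperty("password"));
    }
    super.init(context, initProps);
  }
}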
Re: How to get similarity score between 0 and 1 not relative score
There's another use case for scaling the score. Suppose I want to compute a custom score based on the weighted sum of: - product(0.75, relevance score) - product(0.25, value from another field) For this to work, both fields must have values between 0-1, for example. Toby's example using the scale function seems to work, but you have to use fq to eliminate results with score=0. It seems this is somewhat expensive, since the scaling can't be done until all results have been collected to get the max score. Then, are the results re-sorted? I haven't looked closely yet. Peter On Thu, Oct 31, 2013 at 7:48 PM, Toby Lazar tla...@capitaltg.com wrote: I think you are looking for something like this, though you can omit the fq section: http://localhost:8983/solr/collection/select?abc=text:bob&q={!func}scale(product(query($abc),1),0,1)&fq={!frange l=0.9}$q Also, I don't understand all the fuss about normalized scores. In the linked example, I can see an interest in searching for apple banana, zzz yyy xxx qqq kkk ttt rrr 111, etc. and wanting only close matches for that point in time. Would this be a good use for this approach? I understand that the results can change if the documents in the index change. Thanks, Toby On Thu, Oct 31, 2013 at 12:56 AM, Anshum Gupta ans...@anshumgupta.net wrote: Hi Susheel, Have a look at this: http://wiki.apache.org/lucene-java/ScoresAsPercentages You may really want to reconsider doing that. On Thu, Oct 31, 2013 at 9:41 AM, sushil sharma sushil2...@yahoo.co.in wrote: Hi, We have a requirement where the user would like to see a score (between 0 and 1) which can tell how close the input search string is to the result string. So if the input was very close but not an exact match, the score could be .90 etc. I do understand that we can take the score from Solr divided by the highest score, but that will always show 1 even if the match was not exact. Regards, Susheel -- Anshum Gupta http://www.anshumgupta.net
How to reinitialize a solrcloud replica
I'm running 4.3 in solrcloud mode and trying to test index recovery, but it's failing. I have one shard, 2 replicas: Leader: 10.159.8.105 Replica: 10.159.6.73 To test, I stopped the replica, deleted the 'data' directory and restarted solr. Here is the replica's logging: INFO - 2013-10-25 12:19:40.773; org.apache.solr.cloud.ZkController; We are http://10.159.6.73:8983/solr/collection/ and leader is http://10.159.8.105:8983/solr/collection/ INFO - 2013-10-25 12:19:40.774; org.apache.solr.cloud.ZkController; No LogReplay needed for core=collection baseURL=http://10.159.6.73:8983/solr INFO - 2013-10-25 12:19:40.774; org.apache.solr.cloud.ZkController; Core needs to recover:collection INFO - 2013-10-25 12:19:40.774; org.apache.solr.update.DefaultSolrCoreState; Running recovery - first canceling any ongoing recovery INFO - 2013-10-25 12:19:40.778; org.apache.solr.cloud.RecoveryStrategy; Starting recovery process. core=collection recoveringAfterStartup=true ... ERROR - 2013-10-25 12:20:25.281; org.apache.solr.common.SolrException; Error while trying to recover. core=collection:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state recovering for 10.159.6.73:8983_solr but I still do not see the requested state. I see state: down live:true ... ERROR - 2013-10-25 12:20:25.281; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (5) core=collection ERROR - 2013-10-25 12:20:25.281; org.apache.solr.common.SolrException; Recovery failed - interrupted. core=collection ERROR - 2013-10-25 12:20:25.282; org.apache.solr.common.SolrException; Recovery failed - I give up. core=collection INFO - 2013-10-25 12:20:25.282; org.apache.solr.cloud.ZkController; publishing core=collection state=recovery_failed Here is the Leader's logging: INFO - 2013-10-25 12:19:40.883; org.apache.solr.handler.admin.CoreAdminHandler; Going to wait for coreNodeName: 10.159.6.73:8983_solr_collection, state: recovering, checkLive: true, onlyIfLeader: true INFO - 2013-10-25 12:19:55.886; org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from ZooKeeper... ERROR - 2013-10-25 12:20:25.277; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: I was asked to wait on state recovering for 10.159.6.73:8983_solr but I still do not see the requested state. I see state: down live:true (repeats every minute) Is it valid to simply delete the 'data' directory, or does a znode have to be modified, too? What's the right way to reinitialize and re-synch a core? Peter
Re: Solr timeout after reboot
Have you tried this old trick to warm the FS cache? cat .../core/data/index/* > /dev/null Peter On Mon, Oct 21, 2013 at 5:31 AM, michael.boom my_sky...@yahoo.com wrote: Thank you, Otis! I've integrated the SPM on my Solr instances and now I have access to monitoring data. Could you give me some hints on which metrics I should watch? Below I've added my query configs. Is there anything I could tweak here?

<query>
  <maxBooleanClauses>1024</maxBooleanClauses>
  <filterCache class="solr.FastLRUCache" size="1000" initialSize="1000" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="1000" initialSize="1000" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="1000" initialSize="1000" autowarmCount="0"/>
  <fieldValueCache class="solr.FastLRUCache" size="1000" initialSize="1000" autowarmCount="0"/>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <queryResultWindowSize>20</queryResultWindowSize>
  <queryResultMaxDocsCached>100</queryResultMaxDocsCached>
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">active:true</str></lst>
    </arr>
  </listener>
  <useColdSearcher>false</useColdSearcher>
  <maxWarmingSearchers>10</maxWarmingSearchers>
</query>

Thanks, Michael
Re: Solr timeout after reboot
I found this warming to be especially necessary after starting an instance of those m3.xlarge servers, else the response times for the first few minutes were terrible. Peter On Mon, Oct 21, 2013 at 8:39 AM, François Schiettecatte fschietteca...@gmail.com wrote: To put the file data into the file system cache, which would make for faster access. François On Oct 21, 2013, at 8:33 AM, michael.boom my_sky...@yahoo.com wrote: Hmm, no, I haven't... What would be the effect of this? Thanks, Michael
Re: limiting deep pagination
Yes, right now this constraint could be implemented in either the web app or Solr. I see now that many of the QTimes on these queries are 10 ms (probably due to caching), so I'm a bit less concerned. (A sketch of the Solr-side option follows below.)

On Wed, Oct 16, 2013 at 2:13 AM, Furkan KAMACI furkankam...@gmail.com wrote: I just wonder: don't you implement a custom API that interacts with Solr and limits such kinds of requests? (I know that you are asking about how to do that in Solr, but I handle such situations in my custom search APIs and want to learn what fellows do)

On Wednesday, October 9, 2013, Michael Sokolov msoko...@safaribooksonline.com wrote: On 10/8/13 6:51 PM, Peter Keegan wrote: Is there a way to configure Solr 'defaults/appends/invariants' such that the product of the 'start' and 'rows' parameters doesn't exceed a given value? This would be to prevent deep pagination. Or would this require a custom requestHandler? Peter

Just wondering -- isn't it the sum that you should be concerned about rather than the product? Actually, I think what we usually do is limit both independently, with slightly different concerns: e.g. start=1, rows=1000 causes memory problems if you have large fields in your results, whereas start=1000, rows=1 may not actually be a problem. -Mike
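[Editor's note: there is no stock param that caps the combination, so the Solr-side option would be a small custom component. A minimal sketch of a SearchComponent registered ahead of the query component, following Mike's point by capping start and rows independently; the class name and limits are hypothetical, and the APIs are the 4.x-era ones:]

import java.io.IOException;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

/** Hypothetical guard component: rejects deep paging before any search runs. */
public class DeepPagingGuard extends SearchComponent {
  private static final int MAX_START = 10000; // assumed limits, tune to taste
  private static final int MAX_ROWS = 1000;

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    int start = params.getInt(CommonParams.START, 0);
    int rows = params.getInt(CommonParams.ROWS, 10);
    if (start > MAX_START || rows > MAX_ROWS) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "start may not exceed " + MAX_START + " and rows may not exceed " + MAX_ROWS);
    }
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // no-op: the check in prepare() is the whole component
  }

  @Override
  public String getDescription() {
    return "Rejects deep pagination requests";
  }

  @Override
  public String getSource() {
    return null; // required by the 4.x SolrInfoMBean interface
  }
}

[Registered first in a handler's components list, this fails fast without touching the index.]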
limiting deep pagination
Is there a way to configure Solr 'defaults/appends/invariants' such that the product of the 'start' and 'rows' parameters doesn't exceed a given value? This would be to prevent deep pagination. Or would this require a custom requestHandler? Peter
Re: How to get values of external file field(s) in Solr query?
In 4.3, a frange query using an external file field works for both q and fq. The Solr wiki and SIA both state that ExternalFileField does not support searching. Was the search/filter capability added recently, or is it not supported? Thanks, Peter

On Wed, Jun 26, 2013 at 4:59 PM, Upayavira u...@odoko.co.uk wrote: The only way is using a frange (function range) query: q={!frange l=0 u=10}my_external_field will pull out documents that have your external field with a value between zero and 10. Upayavira

On Wed, Jun 26, 2013, at 09:02 PM, Arun Rangarajan wrote: http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: "They can be used only for function queries or display." I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELD&fq=id:(doc1 doc2 doc3)&fl=id,score For this query, the score is the value of the external file field. But how do I get the values for docs that match some arbitrary query? Is there a syntax trick that will work where the value of the ext file field does not affect the score of the main query, but I can still retrieve its value? Also, is it possible to retrieve the values of more than one external file field in a single query?
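[Editor's note: Arun's display question can likely be answered without scoring tricks on Solr 4.x, where function queries are allowed as pseudo-fields in fl. A SolrJ sketch, assuming the field() pseudo-field support (Solr 4.0+) applies to external file fields as it does to ordinary ones; field names and URL are hypothetical:]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

/** Fetches one or more external file field values as pseudo-fields,
 *  leaving the main query's score untouched. */
public class EffDisplay {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("title:java"); // any arbitrary main query
    q.setFields("id", "eff1:field(ext_field_one)", "eff2:field(ext_field_two)");
    QueryResponse rsp = solr.query(q);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.get("id") + " -> " + doc.get("eff1") + ", " + doc.get("eff2"));
    }
  }
}

[This also answers the multi-field question: each field() alias in fl is independent.]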
Re: Cross index join query performance
Ah, got it now - thanks for the explanation.

On Sat, Sep 28, 2013 at 3:33 AM, Upayavira u...@odoko.co.uk wrote: The thing here is to understand how a join works. Effectively, it does the inner query first, which results in a list of terms. It then effectively does a multi-term query with those values. q=size:large {!join fromIndex=other from=someid to=someotherid}type:shirt Imagine the inner join returned values A,B,C. Your inner query is, on core 'other', q=type:shirt&fl=someid. Then your outer query becomes size:large someotherid:(A B C). Your inner query returns 25k values. You're having to do a multi-term query for 25k terms. That is *bound* to be slow. The pseudo-joins in Solr 4.x are intended for a small to medium number of values returned by the inner query; otherwise performance degrades, as you are seeing. Is there a way you can reduce the number of values returned by the inner query? As Joel mentions, those other joins are attempts to find other ways to work with this limitation. Upayavira

On Fri, Sep 27, 2013, at 09:44 PM, Peter Keegan wrote: Hi Joel, I tried this patch and it is quite a bit faster. Using the same query on a larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin' QTime was 100 msec! This was true for large and small result sets. A few notes: the patch didn't compile with 4.3 because of the SolrCore.getLatestSchema call (which I worked around), and the package name should be: <queryParser name="hjoin" class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/> Unfortunately, I just learned that our uniqueKey may have to be an alphanumeric string instead of an int, so I'm not out of the woods yet. Good stuff - thanks. Peter

On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein joels...@gmail.com wrote: It looks like you are using int join keys, so you may want to check out SOLR-4787, specifically the hjoin and bjoin. These perform well when you have a large number of results from the fromIndex. If you have a small number of results in the fromIndex, the standard join will be faster.

On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan peterlkee...@gmail.com wrote: I forgot to mention - this is Solr 4.3. Peter

On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan peterlkee...@gmail.com wrote: I'm doing a cross-core join query and the join query is 30X slower than each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java QTime: 5 msec, hit count: 1000
Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3] QTime: 4 msec, hit count: 25K
Join query: http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3] QTime: 160 msec, hit count: 205

Here are the index specs:

mainindex size: 117K docs, 1 segment
mainindex schema:
<field name="docid" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="text_en_splitting" indexed="true" stored="true" multiValued="false"/>
<uniqueKey>docid</uniqueKey>

subindex size: 117K docs, 1 segment
subindex schema:
<field name="docid" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="fld1" type="float" indexed="true" stored="true" required="false" multiValued="false"/>
<uniqueKey>docid</uniqueKey>

With debugQuery=true I see:

"debug": {
  "join": {
    "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]": {
      "time": 155, "fromSetSize": 24742, "toSetSize": 24742,
      "fromTermCount": 117810, "fromTermTotalDf": 117810, "fromTermDirectCount": 117810,
      "fromTermHits": 24742, "fromTermHitsTotalDf": 24742,
      "toTermHits": 24742, "toTermHitsTotalDf": 24742, "toTermDirectCount": 24627,
      "smallSetsDeferred": 115, "toSetDocsAdded": 24742
    }
  }
}

Via profiler and debugger, I see 150 msec spent in the outer 'while(term!=null)' loop in JoinQueryWeight.getDocSet(). This seems like a lot of time to join the bitsets. Does this seem right? Peter

-- Joel Bernstein Professional Services LucidWorks
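[Editor's note: to make Upayavira's cost model concrete, here is a hedged sketch in raw Lucene of what the pseudo-join effectively computes. This is not how Solr's JoinQParserPlugin is implemented (it walks the term index rather than stored fields, per the JoinQueryWeight trace above), but it shows why 25k join keys make the query expensive; the method names and the stored-key assumption are mine:]

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

/** Conceptual model of {!join}: gather the join keys of the inner result,
 *  then match them against the outer index as one big disjunction. */
public class NaiveJoinSketch {
  static Query join(IndexSearcher fromSearcher, Query innerQuery,
                    String fromField, String toField, int maxHits) throws IOException {
    TopDocs inner = fromSearcher.search(innerQuery, maxHits);
    Set<String> keys = new HashSet<String>();
    for (ScoreDoc sd : inner.scoreDocs) {
      // Assumes the join key is stored; Solr's real join walks the term index instead.
      String key = fromSearcher.doc(sd.doc).get(fromField);
      if (key != null) keys.add(key);
    }
    BooleanQuery outer = new BooleanQuery(); // mutable in Lucene 4.x
    for (String key : keys) {
      outer.add(new TermQuery(new Term(toField, key)), BooleanClause.Occur.SHOULD);
    }
    return outer; // 25k distinct keys => a 25k-clause query: bound to be slow
  }
}

[With the default 1024-clause limit this disjunction would not even run without raising BooleanQuery.setMaxClauseCount; Solr's term-index walk avoids that ceiling, but not the underlying per-term cost.]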
Re: Cross index join query performance
Hi Joel, I tried this patch and it is quite a bit faster. Using the same query on a larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin' QTime was 100 msec! This was true for large and small result sets. A few notes: the patch didn't compile with 4.3 because of the SolrCore.getLatestSchema call (which I worked around), and the package name should be: <queryParser name="hjoin" class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/> Unfortunately, I just learned that our uniqueKey may have to be an alphanumeric string instead of an int, so I'm not out of the woods yet. Good stuff - thanks. Peter

On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein joels...@gmail.com wrote: It looks like you are using int join keys, so you may want to check out SOLR-4787, specifically the hjoin and bjoin. These perform well when you have a large number of results from the fromIndex. If you have a small number of results in the fromIndex, the standard join will be faster.

On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan peterlkee...@gmail.com wrote: I forgot to mention - this is Solr 4.3. Peter

On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan peterlkee...@gmail.com wrote: I'm doing a cross-core join query and the join query is 30X slower than each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java QTime: 5 msec, hit count: 1000
Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3] QTime: 4 msec, hit count: 25K
Join query: http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3] QTime: 160 msec, hit count: 205

Here are the index specs:

mainindex size: 117K docs, 1 segment
mainindex schema:
<field name="docid" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="text_en_splitting" indexed="true" stored="true" multiValued="false"/>
<uniqueKey>docid</uniqueKey>

subindex size: 117K docs, 1 segment
subindex schema:
<field name="docid" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="fld1" type="float" indexed="true" stored="true" required="false" multiValued="false"/>
<uniqueKey>docid</uniqueKey>

With debugQuery=true I see:

"debug": {
  "join": {
    "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]": {
      "time": 155, "fromSetSize": 24742, "toSetSize": 24742,
      "fromTermCount": 117810, "fromTermTotalDf": 117810, "fromTermDirectCount": 117810,
      "fromTermHits": 24742, "fromTermHitsTotalDf": 24742,
      "toTermHits": 24742, "toTermHitsTotalDf": 24742, "toTermDirectCount": 24627,
      "smallSetsDeferred": 115, "toSetDocsAdded": 24742
    }
  }
}

Via profiler and debugger, I see 150 msec spent in the outer 'while(term!=null)' loop in JoinQueryWeight.getDocSet(). This seems like a lot of time to join the bitsets. Does this seem right? Peter

-- Joel Bernstein Professional Services LucidWorks
Cross index join query performance
I'm doing a cross-core join query and the join query is 30X slower than each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java QTime: 5 msec, hit count: 1000
Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3] QTime: 4 msec, hit count: 25K
Join query: http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3] QTime: 160 msec, hit count: 205

Here are the index specs:

mainindex size: 117K docs, 1 segment
mainindex schema:
<field name="docid" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="text_en_splitting" indexed="true" stored="true" multiValued="false"/>
<uniqueKey>docid</uniqueKey>

subindex size: 117K docs, 1 segment
subindex schema:
<field name="docid" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="fld1" type="float" indexed="true" stored="true" required="false" multiValued="false"/>
<uniqueKey>docid</uniqueKey>

With debugQuery=true I see:

"debug": {
  "join": {
    "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]": {
      "time": 155, "fromSetSize": 24742, "toSetSize": 24742,
      "fromTermCount": 117810, "fromTermTotalDf": 117810, "fromTermDirectCount": 117810,
      "fromTermHits": 24742, "fromTermHitsTotalDf": 24742,
      "toTermHits": 24742, "toTermHitsTotalDf": 24742, "toTermDirectCount": 24627,
      "smallSetsDeferred": 115, "toSetDocsAdded": 24742
    }
  }
}

Via profiler and debugger, I see 150 msec spent in the outer 'while(term!=null)' loop in JoinQueryWeight.getDocSet(). This seems like a lot of time to join the bitsets. Does this seem right? Peter
Re: A question about attaching shards to load balancers
Aren't you concerned about having a single point of failure with this setup? On Wed, Jan 30, 2013 at 10:38 AM, Michael Ryan mr...@moreover.com wrote: From a performance point of view, I can't imagine it mattering. In our setup, we have a dedicated Solr server that is not a shard that takes incoming requests (we call it the coordinator). This server is very lightweight and practically has no load at all. My gut feeling is that having a separate dedicated server might be a slightly better approach, as it will have totally different performance characteristics than the shards, and so you can tune it for this. -Michael
Re: Improving performance for use-case where large (200) number of phrase queries are used?
Yes, #5 is the same thing (sorry, I didn't read them all thoroughly). Your description of the phrases being 'tags' suggests that you don't need term positions for matching, and as you noted, you would get unwanted partial matches. And the TermQuerys would be much faster. Peter

On Wed, Oct 24, 2012 at 8:33 PM, Aaron Daubman daub...@gmail.com wrote: Hi Peter, Thanks for the recommendation - I believe we are thinking along the same lines, but wanted to check to make sure. Are you suggesting something different than my #5 (below), or are we essentially suggesting the same thing?

On Wed, Oct 24, 2012 at 1:20 PM, Peter Keegan peterlkee...@gmail.com wrote: Could you index your 'phrase tags' as single tokens? Then your phrase queries become simple TermQuerys.

5) *This is my current favorite*: stop tokenizing/analyzing these terms and just use KeywordTokenizer. Most of these phrases are pre-vetted, and it may be possible to clean/process any others before creating the docs. My main worry here is that, currently, if I understand correctly, a document with the phrase "brazilian pop" would still be returned as a match to a seed document containing only the phrase "brazilian" (not the other way around, but that is not necessary); however, with KeywordTokenizer, this would no longer be the case. If I switched from the current dubious tokenize/stem/etc. and just used Keyword, would this allow queries like "this used to be a long phrase query" to match documents that have "this used to be a long phrase query" as one of the multivalued values in the field, without having to pull term positions (and thus significantly speed up performance)? Thanks again, Aaron
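[Editor's note: a minimal Lucene-level sketch of the single-token idea, the analogue of Solr's KeywordTokenizer; the field name is hypothetical, and the exact-match trade-off is the one discussed above:]

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

/** Each pre-vetted phrase tag is indexed as one untokenized term, so a
 *  200-phrase query becomes 200 cheap TermQuerys with no positions needed. */
public class PhraseTagSketch {
  static Document docWithTags(String... tags) {
    Document doc = new Document();
    for (String tag : tags) {
      doc.add(new StringField("tag", tag, Field.Store.NO)); // one term per phrase
    }
    return doc;
  }

  static TermQuery tagQuery(String phrase) {
    // Exact match only: "brazilian pop" no longer matches a seed
    // containing just "brazilian" (the trade-off Aaron notes above).
    return new TermQuery(new Term("tag", phrase));
  }
}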
Re: Improving performance for use-case where large (200) number of phrase queries are used?
Could you index your 'phrase tags' as single tokens? Then your phrase queries become simple TermQuerys.

On Wed, Oct 24, 2012 at 12:26 PM, Robert Muir rcm...@gmail.com wrote: On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman daub...@gmail.com wrote: Greetings, We have a solr instance in use that gets some perhaps atypical queries and suffers from poor (2 second) QTimes. Documents (~2,350,000) in this instance are mainly comprised of various descriptive fields, such as multi-word (phrase) tags - an average document contains 200-400 phrases like this across several different multi-valued field types. A custom QueryComponent has been built that functions somewhat like a very specific MoreLikeThis. A seed document is specified via the incoming query, its terms are retrieved, boosted both by query parameters as well as fields within the document that specify term weighting, sorted by this custom boosting, and then a second query is crafted by taking the top 200 (sorted by the custom boosting) resulting field values paired with their fields and searching for documents matching these 200 values.

A few more ideas (a shingle sketch follows below):
* Use shingles, e.g. to turn two-word phrases into single terms (how long is your average phrase?).
* In addition to the above, maybe for phrases with 2 terms, consider just a boolean conjunction of the shingled phrases instead of a real phrase query: e.g. "more like this" -> (more_like AND like_this). This would have some false positives.
* Use a more aggressive stopwords list for your MorePhrasesLikeThis.
* Reduce this number 200, and instead work harder to prune out which phrases are the most descriptive from the seed document, e.g. based on some heuristics like their frequency or location within that seed document, so your query isn't so massive.
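[Editor's note: a hedged sketch of the shingle idea using Lucene 4.x-era APIs; field name and input text are illustrative:]

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/** Shingling turns adjacent word pairs into single terms, so "more like this"
 *  can match as a conjunction of bigram terms instead of a positional phrase query. */
public class ShingleSketch {
  public static void main(String[] args) throws IOException {
    Analyzer shingler = new ShingleAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_45), 2, 2);
    try (TokenStream ts = shingler.tokenStream("tags", new StringReader("more like this"))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString()); // "more like", "like this" (plus unigrams unless disabled)
      }
      ts.end();
    }
  }
}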
Re: Anyone using mmseg analyzer in solr multi core?
We're using MMSeg with Lucene, but not Solr. Since each SolrCore is independent, I'm not sure how you can avoid each having a copy of the dictionary, unless you modified MMSeg to use shared memory. Or maybe I'm missing something. (A rough sketch of the sharing idea follows below.)

On Mon, Oct 8, 2012 at 3:37 AM, liyun liyun2...@corp.netease.com wrote: Hi all, Is anybody using the mmseg analyzer for Chinese word analysis? When we use this in Solr multi-core, I find it will load the dictionary per core, and each core costs about 50MB of memory. I think this is a big waste when our JVM has only 1GB of memory... Does anyone have a good idea for handling this trouble? 2012-10-08 Li Yun Software Engineer @ Netease Mail: liyun2...@corp.netease.com MSN: rockiee...@gmail.com
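[Editor's note: one modification along these lines, sketched with a hypothetical stand-in type since nothing here is mmseg4j's actual API: hold the dictionary in a JVM-wide singleton so N cores pay the ~50MB once. This only helps if the analyzer jar lives in a shared classloader (e.g. Solr's sharedLib), since per-core lib directories get separate classloaders and separate statics:]

/** Hypothetical JVM-wide dictionary holder: whichever core asks first loads
 *  it, every later core reuses the same instance. */
public final class SharedDictionary {
  private static volatile Dictionary instance;

  private SharedDictionary() {}

  public static Dictionary get(String dictPath) {
    Dictionary local = instance;
    if (local == null) {
      synchronized (SharedDictionary.class) {
        local = instance;
        if (local == null) {
          local = Dictionary.load(dictPath); // assumed loader; substitute your MMSeg call
          instance = local;
        }
      }
    }
    return local;
  }
}

/** Stand-in for the real MMSeg dictionary type (hypothetical). */
final class Dictionary {
  static Dictionary load(String path) { return new Dictionary(); } // stub loader
}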
Re: How to plug a new ANTLR grammar
"Also, a question for Peter, at which stage do you use lucene analyzers on the query? After it was parsed into the tree, or before we start processing the query string?"

I do the analysis before creating the tree. I'm pretty sure Lucene QueryParser does this, too. Peter

On Wed, Sep 14, 2011 at 5:15 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi Peter, Yes, with the tree it is pretty straightforward. I'd prefer to do it that way, but what is the purpose of the new qParser then? Is it just that the qParser was built with different paradigms in mind, where the parse tree was not in the equation? Does anybody know if there is any advantage? I looked a bit more into the contrib: org.apache.lucene.queryParser.standard.StandardQueryParser.java org.apache.lucene.queryParser.standard.QueryParserWrapper.java And some things there (like setting the default fuzzy value) are in my case set directly in the grammar. So the query builder is still somehow involved in parsing (IMHO not good). But if someone knows some reasons to keep using the qParser, please let me know. Also, a question for Peter: at which stage do you use lucene analyzers on the query? After it was parsed into the tree, or before we start processing the query string? Thanks! Roman

On Tue, Sep 13, 2011 at 10:14 PM, Peter Keegan peterlkee...@gmail.com wrote: Roman, I'm not familiar with the contrib, but you can write your own Java code to create Query objects from the tree produced by your lexer and parser, something like this:

StandardLuceneGrammarLexer lexer =
    new StandardLuceneGrammarLexer(new ANTLRReaderStream(new StringReader(queryString)));
CommonTokenStream tokens = new CommonTokenStream(lexer);
StandardLuceneGrammarParser parser = new StandardLuceneGrammarParser(tokens);
StandardLuceneGrammarParser.query_return ret = parser.mainQ();
CommonTree t = (CommonTree) ret.getTree();
parseTree(t);

void parseTree(Tree t) {
  // recursively walk the tree, visiting each node
  visit(t);
}

void visit(Tree node) {
  switch (node.getType()) {
    case StandardLuceneGrammarParser.AND:
      // Create BooleanQuery, push onto stack
      ...
  }
}

I use the stack to build up the final Query from the queries produced in the tree parsing. Hope this helps. Peter

On Tue, Sep 13, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote: I'd love to see the progress on this.

On Tue, Sep 13, 2011 at 10:34 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi, The standard lucene/solr parsing is nice but not really flexible. I saw questions and discussions about ANTLR, but unfortunately never a working grammar, so... maybe you'll find this useful: https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr In the grammar, the parsing is completely abstracted from the Lucene objects, and the parser is not mixed with Java code. At first it produces structures like this: https://svnweb.cern.ch/trac/rcarepo/raw-attachment/wiki/MontySolrQueryParser/index.html But now I have a problem. I don't know if I should use the query parsing framework in contrib. It seems that the qParser in contrib can use different parser generators (the default JavaCC, but also ANTLR). But I am confused and I don't understand this new queryParser from contrib. It is really very confusing to me. Is there any benefit in trying to plug the ANTLR tree into it? Because looking at the AST pictures, it seems that with a relatively simple tree walker we could build the same queries as the current standard lucene query parser. And it would be much simpler and more flexible. Does it bring something new? I have a feeling I miss something...
Many thanks for help, Roman -- - sent from my mobile 6176064373
Re: How to plug a new ANTLR grammar
Roman, I'm not familiar with the contrib, but you can write your own Java code to create Query objects from the tree produced by your lexer and parser, something like this:

StandardLuceneGrammarLexer lexer =
    new StandardLuceneGrammarLexer(new ANTLRReaderStream(new StringReader(queryString)));
CommonTokenStream tokens = new CommonTokenStream(lexer);
StandardLuceneGrammarParser parser = new StandardLuceneGrammarParser(tokens);
StandardLuceneGrammarParser.query_return ret = parser.mainQ();
CommonTree t = (CommonTree) ret.getTree();
parseTree(t);

void parseTree(Tree t) {
  // recursively walk the tree, visiting each node
  visit(t);
}

void visit(Tree node) {
  switch (node.getType()) {
    case StandardLuceneGrammarParser.AND:
      // Create BooleanQuery, push onto stack
      ...
  }
}

I use the stack to build up the final Query from the queries produced in the tree parsing. Hope this helps. Peter

On Tue, Sep 13, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote: I'd love to see the progress on this.

On Tue, Sep 13, 2011 at 10:34 AM, Roman Chyla roman.ch...@gmail.com wrote: Hi, The standard lucene/solr parsing is nice but not really flexible. I saw questions and discussions about ANTLR, but unfortunately never a working grammar, so... maybe you'll find this useful: https://github.com/romanchyla/montysolr/tree/master/src/java/org/apache/lucene/queryParser/iqp/antlr In the grammar, the parsing is completely abstracted from the Lucene objects, and the parser is not mixed with Java code. At first it produces structures like this: https://svnweb.cern.ch/trac/rcarepo/raw-attachment/wiki/MontySolrQueryParser/index.html But now I have a problem. I don't know if I should use the query parsing framework in contrib. It seems that the qParser in contrib can use different parser generators (the default JavaCC, but also ANTLR). But I am confused and I don't understand this new queryParser from contrib. It is really very confusing to me. Is there any benefit in trying to plug the ANTLR tree into it? Because looking at the AST pictures, it seems that with a relatively simple tree walker we could build the same queries as the current standard lucene query parser. And it would be much simpler and more flexible. Does it bring something new? I have a feeling I miss something... Many thanks for help, Roman -- - sent from my mobile 6176064373
Re: performance crossover between single index and sharding
We have 16 shards on 4 physical servers. Shard size was determined by measuring query response times as a function of doc count. Multiple shards per server provide parallelism. In a VM environment, I would lean towards 1 shard per VM (with 1/4 the RAM). We implemented our own distributed search (pre-Solr) and the extra sort/merge processing is not a performance issue. Peter

On Tue, Aug 2, 2011 at 2:35 PM, Burton-West, Tom tburt...@umich.edu wrote: Hi Jonothan and Markus, "Why 3 shards on one machine instead of one larger shard per machine?" Good question! We made this architectural decision several years ago and I'm not remembering the rationale at the moment. I believe we originally made the decision due to some tests showing a sweet spot for I/O performance for shards with 500,000-600,000 documents, but those tests were made before we implemented CommonGrams and when we were still using attached storage. I think we also might have had concerns about Java OOM errors with a really large shard/index, but we now know that we can keep memory usage under control by tweaking the amount of the terms index that gets read into memory. We should probably do some tests and revisit the question. The reason we don't have 12 shards on 12 machines is that current performance is good enough that we can't justify buying 8 more machines :) Tom

-----Original Message----- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, August 02, 2011 2:12 PM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding Hi Tom, Very interesting indeed! But I keep wondering why some engineers choose to store multiple shards of the same index on the same machine; there must be significant overhead. The only reason I can think of is ease of maintenance in moving shards to a separate physical machine. I know that rearranging the shard topology can be a real pain in a large existing cluster (e.g. consistent hashing is not consistent anymore and having to shuffle docs to their new shard), is this the reason you chose this approach? Cheers,
Re: Localized alphabetical order
On Fri, Apr 22, 2011 at 12:33 PM, Ben Preece preec...@umn.edu wrote: As someone who's new to Solr/Lucene, I'm having trouble finding information on sorting results in localized alphabetical order. I've searched the wiki and the mail archives without success. I'm thinking, for example, about Hawai'ian, where mīka (with an i-macron) comes after mika (i without the macron) but before miki (also without the macron); or about Welsh, where the digraphs (ch, dd, etc.) are treated as single letters; or about Ojibwe, where the apostrophe ' is a letter which sorts between h and i. How do non-English languages typically handle this? -Ben
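[Editor's note: no answer appears in this thread, but the usual route is locale-aware collation; Solr ships CollationField (and ICUCollationField in analysis-extras) for exactly this, backed by the same collator machinery as this plain-JDK sketch. The Hawai'ian example is Ben's; treating the macron as a secondary, accent-level difference is what should give mika < mīka < miki:]

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

/** A decomposing collator treats the i-macron as a secondary (accent-level)
 *  difference, so it sorts right after the bare i at the same position. */
public class LocaleSortSketch {
  public static void main(String[] args) {
    Collator c = Collator.getInstance(Locale.ROOT);
    c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
    String[] words = { "miki", "m\u012Bka", "mika" }; // mīka, with U+012B
    Arrays.sort(words, c);
    System.out.println(Arrays.toString(words)); // expect [mika, mīka, miki]
  }
}

[The Welsh digraphs and the Ojibwe apostrophe need a tailored RuleBasedCollator or an ICU locale rather than the root rules.]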
Re: Info about Debugging SOLR in Eclipse
Can you use jetty? http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse

On Thu, Mar 17, 2011 at 12:17 PM, Geeta Subramanian gsubraman...@commvault.com wrote: Hi, Can someone please let me know the steps for debugging the Solr code in my Eclipse? I tried to compile the source, use the jars, and place them in the Tomcat where I am running Solr, then do remote debugging, but it did not stop at any breakpoint. I also tried to write a sample standalone Java class to push a document, but I only stopped in SolrJ classes, not Solr server classes. Please let me know if I am making any mistake. Regards, Geeta **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you.
Re: Info about Debugging SOLR in Eclipse
The instructions refer to the 'Run configuration' menu. Did you try 'Debug configurations'?

On Thu, Mar 17, 2011 at 3:27 PM, Peter Keegan peterlkee...@gmail.com wrote: Can you use jetty? http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse

On Thu, Mar 17, 2011 at 12:17 PM, Geeta Subramanian gsubraman...@commvault.com wrote: Hi, Can someone please let me know the steps for debugging the Solr code in my Eclipse? I tried to compile the source, use the jars, and place them in the Tomcat where I am running Solr, then do remote debugging, but it did not stop at any breakpoint. I also tried to write a sample standalone Java class to push a document, but I only stopped in SolrJ classes, not Solr server classes. Please let me know if I am making any mistake. Regards, Geeta **Legal Disclaimer*** This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message in error, please advise the sender by reply email and delete the message. Thank you.
CapitalizationFilter
I was looking at CapitalizationFilter and noticed that the incrementToken method splits words at ' ' (space) and '.' (period). I'm curious why the period is treated as a word separator; this could cause unexpected results. For example: "Hello There My Name Is Dr. Watson" becomes "Hello There My Name Is Dr. watson". Peter
Re: Does anyone notice this site?
fwiw, our proxy server has blocked this site for malicious content. Peter On Mon, Oct 25, 2010 at 1:25 PM, Grant Ingersoll gsing...@apache.orgwrote: On Oct 25, 2010, at 12:54 PM, scott chu wrote: I happen to bump into this site: http://www.solr.biz/ They said they are also developing a search engine? Is this any connection to open source Solr? No, it is not a connection and they likely should not be using the name that way, as Solr is a TM of the ASF.
LuceneRevolution - NoSQL: A comparison
I listened with great interest to Grant's presentation of the NoSQL comparisons/alternatives to Solr/Lucene. It sounds like the jury is still out on much of this. Here's a use case that might favor using a NoSQL alternative for storing 'stored fields' outside of Lucene. When Solr does a distributed search across shards, it does this in 2 phases (correct me if I'm wrong):

1. A first query to get the docIds and facet counts
2. A second query to retrieve the stored fields of the top hits

The problem here is that the index could change between (1) and (2), so it's not an atomic transaction. If the stored fields were kept outside of Lucene, only the first query would be necessary. However, this would mean that the external NoSQL data store would have to be synchronized with the Lucene index, which might present its own problems. (I'm just throwing this out for discussion; a small illustration of the window follows below.) Peter
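[Editor's note: a hedged SolrJ illustration of the two-phase pattern and the non-atomic window it opens; this mimics the distributed flow with two client-side queries rather than Solr's internal shard requests, and all names are hypothetical:]

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

/** Phase 1 fetches ids; phase 2 fetches stored fields for those ids.
 *  Any commit that lands between the two can delete or reorder hits. */
public class TwoPhaseSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrQuery phase1 = new SolrQuery("title:java");
    phase1.setFields("id");
    List<String> ids = new ArrayList<String>();
    for (SolrDocument d : solr.query(phase1).getResults()) {
      ids.add((String) d.getFieldValue("id"));
    }

    // <-- an update committed here makes phase 2 see a different index

    SolrQuery phase2 = new SolrQuery("id:(" + join(ids, " OR ") + ")");
    phase2.setFields("id", "title", "body");
    System.out.println(solr.query(phase2).getResults());
  }

  static String join(List<String> parts, String sep) {
    StringBuilder sb = new StringBuilder();
    for (String p : parts) {
      if (sb.length() > 0) sb.append(sep);
      sb.append(p);
    }
    return sb.toString();
  }
}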
Re: Range queries
How about this: x:[5 TO 8] AND x:{0 TO 8}? The inclusive range gives 5 <= x <= 8 and the exclusive one gives 0 < x < 8, so their intersection is exactly 5 <= x < 8.

On Tue, Jun 16, 2009 at 1:16 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, I think the square brackets/curly braces need to be balanced, so this is currently not doable with the existing query parsers. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----- From: gwk g...@eyefi.nl To: solr-user@lucene.apache.org Sent: Tuesday, June 16, 2009 11:52:12 AM Subject: Range queries Hi, When doing range queries it seems the query is either x:[5 TO 8], which means 5 <= x <= 8, or x:{5 TO 8}, which means 5 < x < 8. But how do you get one half exclusive, the other inclusive, e.g. for double fields: 5 <= x < 8? Is this possible? Regards, gwk
Re: new faceting algorithm
Hi Yonik, May I ask in which class(es) this improvement was made? I've been using the DocSet, DocList, BitDocSet, HashDocSet from Solr from a few years ago with a Lucene based app. to do faceting. Thanks, Peter On Mon, Nov 24, 2008 at 11:12 PM, Yonik Seeley [EMAIL PROTECTED] wrote: A new faceting algorithm has been committed to the development version of Solr, and should be available in the next nightly test build (will be dated 11-25). This change should generally improve field faceting where the field has many unique values but relatively few values per document. This new algorithm is now the default for multi-valued fields (including tokenized fields) so you shouldn't have to do anything to enable it. We'd love some feedback on how it works to ensure that it actually is a win for the majority and should be the default. -Yonik