Re: Processing a lot of results in Solr
Mikhail,

Yes, +1. This question comes up a few times a year. Grant created a JIRA issue for this many moons ago:

https://issues.apache.org/jira/browse/LUCENE-2127
https://issues.apache.org/jira/browse/SOLR-1726

Otis
--
Solr ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm

On Wed, Jul 24, 2013 at 9:58 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
> [...]
Re: Processing a lot of results in Solr
Roman,

Can you disclose how that streaming writer works? What does it stream, docList or docSet?

Thanks

On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla roman.ch...@gmail.com wrote:
> [...]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Processing a lot of results in Solr
Mikhail,

It is a slightly hacked JSONWriter - actually, while poking around, I have discovered that dumping big hitsets would be possible. The main hurdle right now is that the writer expects to receive documents with fields loaded, but if it received something that loads docs lazily, you could stream thousands and thousands of recs just as it is done with the normal response - standard operation. Well, people may cry this is not how SOLR is meant to operate ;-)

roman

On Wed, Jul 24, 2013 at 5:28 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
> [...]
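Roman's lazy-loading point can be sketched in a few lines (Python here for illustration, not the actual JSONWriter; `load_doc` is a hypothetical stand-in for whatever fetches a document's stored fields): if the writer pulls documents through a generator, only one document's fields need to be in memory at a time, so arbitrarily large hitsets can be serialized.

```python
import json

def lazy_docs(doc_ids, load_doc):
    """Yield documents one at a time; fields are loaded only when the
    writer actually consumes each document."""
    for doc_id in doc_ids:
        yield load_doc(doc_id)

def write_response(out, doc_ids, load_doc):
    """Stream a JSON array of documents to 'out' without ever holding
    the whole result set in memory."""
    out.write("[")
    for i, doc in enumerate(lazy_docs(doc_ids, load_doc)):
        if i:
            out.write(",")
        out.write(json.dumps(doc))
    out.write("]")
```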
Re: Processing a lot of results in Solr
On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote:

> That sounds like a satisfactory solution for the time being - I am
> assuming you dump the data from Solr in a csv format?

JSON

> How did you implement the streaming processor? (what tool did you use
> for this? Not familiar with that)

this is what dumps the docs:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java

it is called by one of our batch processors, which can pass it a bitset of recs:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java

as far as streaming is concerned, we were all very nicely surprised - a few GB file (on a local network) took a ridiculously short time; in fact, a colleague of mine assumed it was not working until we looked into the downloaded file ;-). you may want to look at line 463:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java

roman

> You say it takes a few minutes only to dump the data - how long does it
> take to stream it back in? Is performance acceptable (~ within minutes)?
>
> Thanks,
> Matt
>
> [...]
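The bitset-of-recs handoff Roman describes can be sketched as follows (a Python simplification; a plain list of booleans stands in for Lucene's bitset, and `load_doc` is a hypothetical field loader): the batch provider first collects matches into a bitset, then the dumper walks the set bits and streams one JSON object per line.

```python
import json

def dump_docs(bitset, load_doc, out):
    """Walk the set bits (matching internal doc ids) and stream each
    document to 'out' as newline-delimited JSON, so the dump size is
    bounded by disk, not by heap."""
    for doc_id, selected in enumerate(bitset):
        if selected:
            out.write(json.dumps(load_doc(doc_id)) + "\n")
```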
Re: Processing a lot of results in Solr
fwiw, i did some prototype with the following differences:

- it streams straight to the socket output stream
- it streams on-going during collecting, without the necessity to store a bitset

It might have some limited extreme usage. Is there anyone interested?

On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla roman.ch...@gmail.com wrote:
> [...]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
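The difference Mikhail describes - writing hits out while collecting, instead of first accumulating a bitset and dumping it afterwards - can be sketched like this (Python for illustration; a real implementation would live in a Lucene Collector writing to the socket's output stream, and `load_doc` is a hypothetical field loader):

```python
import json

class StreamingCollector:
    """Writes each hit to the output stream the moment it is collected,
    so no bitset of matches is ever materialized."""

    def __init__(self, out, load_doc):
        self.out = out
        self.load_doc = load_doc

    def collect(self, doc_id):
        self.out.write(json.dumps(self.load_doc(doc_id)) + "\n")

def run_search(matching_ids, collector):
    """Stand-in for the search loop that invokes collect() per match."""
    for doc_id in matching_ids:
        collector.collect(doc_id)
```

One trade-off: the response starts flowing before the search finishes, so a mid-search failure cannot be reported cleanly in the response status - presumably one reason Mikhail calls the usage "limited".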
Re: Processing a lot of results in Solr
: Subject: Processing a lot of results in Solr
: Message-ID: d57c2b719b792f428beca7b0096c88e22c0...@mail1.impetus.co.in
: In-Reply-To: 1374612243070-4079869.p...@n3.nabble.com

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, so your question is hidden in that thread and gets less attention. It also makes following discussions in the mailing list archives particularly difficult.

-Hoss
Re: Processing a lot of results in Solr
Hi Matt,

This feature is commonly known as deep paging, and Lucene and Solr have issues with it ... take a look at http://solr.pl/en/2011/07/18/deep-paging-problem/ as a potential starting point, using filters to bucketize a result set into sub result sets.

Cheers,
Tim

On Tue, Jul 23, 2013 at 3:04 PM, Matt Lieber mlie...@impetus.com wrote:
> Hello Solr users,
>
> Question regarding processing a lot of docs returned from a query; I
> potentially have millions of documents returned back from a query. What
> is the common design to deal with this?
>
> 2 ideas I have are:
> - create a client service that is multithreaded to handle this
> - use the Solr pagination to retrieve a batch of rows at a time (start,
> rows in the Solr Admin console)
>
> Any other ideas that I may be missing?
>
> Thanks,
> Matt
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee
> that the integrity of this communication has been maintained nor that
> the communication is free of errors, virus, interception or interference.
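One way to read the "filters to bucketize" suggestion (a sketch under assumed names; the field and bounds are examples only): instead of paging with ever-larger start offsets - which forces Solr to collect and discard all preceding documents on every request - partition the result set with range filter queries and fetch each bucket shallowly from start=0.

```python
def bucket_filters(field, bounds):
    """Build fq range filters that partition a result set into buckets,
    each of which can then be paged from start=0."""
    return [
        "%s:[%d TO %d]" % (field, lo, hi - 1)
        for lo, hi in zip(bounds, bounds[1:])
    ]
```

For example, bucket_filters("id", [0, 1000, 2000]) produces the filters id:[0 TO 999] and id:[1000 TO 1999]; each is sent as an fq alongside the unchanged q.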
Re: Processing a lot of results in Solr
Hello Matt,

You can consider writing a batch processing handler, which receives a query and, instead of sending results back, writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in a few minutes - your query + streaming writer can go a very long way :)

roman

On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:
> [...]
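The handler Roman describes can be sketched end to end (Python, with hypothetical names; `execute()` stands in for running the query and yielding documents lazily): run the query, stream results into a file keyed by a fresh UUID, return the UUID, and let the client download the file in chunks later.

```python
import json
import os
import uuid

def run_batch(query, execute, dump_dir):
    """Run the query and stream its results into a UUID-named file;
    the returned id is the client's handle for later download."""
    job_id = str(uuid.uuid4())
    with open(os.path.join(dump_dir, job_id + ".json"), "w") as out:
        for doc in execute(query):  # execute() yields docs lazily
            out.write(json.dumps(doc) + "\n")
    return job_id

def stream_result(job_id, dump_dir, chunk_size=65536):
    """Yield the dumped file in fixed-size chunks for streaming back."""
    with open(os.path.join(dump_dir, job_id + ".json")) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
```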
Re: Processing a lot of results in Solr
That sounds like a satisfactory solution for the time being - I am assuming you dump the data from Solr in a csv format?

How did you implement the streaming processor? (what tool did you use for this? Not familiar with that)

You say it takes a few minutes only to dump the data - how long does it take to stream it back in? Is performance acceptable (~ within minutes)?

Thanks,
Matt

On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote:
> [...]