Re: Processing a lot of results in Solr

2013-07-25 Thread Otis Gospodnetic
Mikhail,

Yes, +1.
This question comes up a few times a year. Grant created JIRA issues
for this many moons ago:

https://issues.apache.org/jira/browse/LUCENE-2127
https://issues.apache.org/jira/browse/SOLR-1726

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Wed, Jul 24, 2013 at 9:58 PM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 fwiw,
 I did a prototype with the following differences:
 - it streams straight to the socket output stream
 - it streams on the fly during collection, without the need to store a
 bitset.
 It might have some limited, extreme use cases. Is there anyone interested?


 On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla roman.ch...@gmail.com wrote:

 On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote:

  That sounds like a satisfactory solution for the time being -
  I am assuming you dump the data from Solr in CSV format?
 

 JSON


  How did you implement the streaming processor? (What tool did you use for
  this? I'm not familiar with that.)
 

 this is what dumps the docs:

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java

 it is called by one of our batch processors, which can pass it a bitset of
 recs

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java

 as far as streaming is concerned, we were all very pleasantly surprised: a
 few-GB file (on a local network) took a ridiculously short time - in fact, a
 colleague of mine assumed it was not working until we looked into the
 downloaded file ;-). You may want to look at line 463:

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java

 roman


  You say it takes only a few minutes to dump the data - how long does it
  take to stream it back in? Is performance acceptable (~ within minutes)?
 
  Thanks,
  Matt
 
  On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote:
 
  Hello Matt,
  
  You can consider writing a batch processing handler, which receives a
  query and, instead of sending results back, writes them into a file which
  is then available for streaming (it has its own UUID). I am dumping many
  GBs of data from Solr in a few minutes - your query + a streaming writer
  can go a very long way :)
  
  roman
  
  
  On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com
 wrote:
  
   Hello Solr users,
  
    Question regarding processing a lot of docs returned from a query; I
    potentially have millions of documents returned from a query. What is
    the common design to deal with this?

    2 ideas I have are:
    - create a client service that is multithreaded to handle this
    - use the Solr pagination to retrieve a batch of rows at a time (start,
    rows in the Solr Admin console)

    Any other ideas that I may be missing?

    Thanks,
    Matt
 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com


Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
Roman,

Can you disclose how that streaming writer works? What does it stream,
a docList or a docSet?

Thanks


On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello Matt,

 You can consider writing a batch processing handler, which receives a
 query and, instead of sending results back, writes them into a file which
 is then available for streaming (it has its own UUID). I am dumping many
 GBs of data from Solr in a few minutes - your query + a streaming writer
 can go a very long way :)

 roman


 On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:

  Hello Solr users,
 
  Question regarding processing a lot of docs returned from a query; I
  potentially have millions of documents returned from a query. What is
  the common design to deal with this?

  2 ideas I have are:
  - create a client service that is multithreaded to handle this
  - use the Solr pagination to retrieve a batch of rows at a time (start,
  rows in the Solr Admin console)

  Any other ideas that I may be missing?

  Thanks,
  Matt
-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
Mikhail,
It is a slightly hacked JSONWriter - actually, while poking around, I
discovered that dumping big hitsets would be possible. The main hurdle
right now is that the writer expects to receive documents with their fields
loaded, but if it received something that loads docs lazily, you could
stream thousands and thousands of recs just as is done with the normal
response - standard operation. Well, people may cry that this is not how
Solr is meant to operate ;-)
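
To sketch the lazy-loading idea (illustrative only - this is not the actual
JSONDumper code, and the field handling is deliberately simplified; it
assumes Solr's SolrIndexSearcher.doc(int) and a plain java.io.Writer):

// Hypothetical sketch: stream a large result set without materializing it.
// Rather than handing the writer a fully loaded DocList, iterate doc ids
// and load each document lazily, writing it out immediately.
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

import org.apache.lucene.document.Document;
import org.apache.solr.search.SolrIndexSearcher;

public class LazyJsonStreamer {

  /** Writes one JSON object per line (the "id" field only, for brevity). */
  public void stream(SolrIndexSearcher searcher,
                     Iterator<Integer> docIds,
                     Writer out) throws IOException {
    while (docIds.hasNext()) {
      int docId = docIds.next();
      Document doc = searcher.doc(docId);  // loaded lazily, one at a time
      out.write("{\"id\":\"" + doc.get("id") + "\"}\n");
    }
    out.flush();  // nothing was buffered beyond the container's own buffer
  }
}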

roman


On Wed, Jul 24, 2013 at 5:28 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Roman,

 Can you disclose how that streaming writer works? What does it stream,
 a docList or a docSet?

 Thanks


 On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla roman.ch...@gmail.com
 wrote:

  Hello Matt,
 
  You can consider writing a batch processing handler, which receives a
  query and, instead of sending results back, writes them into a file which
  is then available for streaming (it has its own UUID). I am dumping many
  GBs of data from Solr in a few minutes - your query + a streaming writer
  can go a very long way :)
 
  roman
 
 
  On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com
 wrote:
 
   Hello Solr users,
  
   Question regarding processing a lot of docs returned from a query; I
   potentially have millions of documents returned from a query. What is
   the common design to deal with this?

   2 ideas I have are:
   - create a client service that is multithreaded to handle this
   - use the Solr pagination to retrieve a batch of rows at a time (start,
   rows in the Solr Admin console)

   Any other ideas that I may be missing?

   Thanks,
   Matt
 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote:

 That sounds like a satisfactory solution for the time being -
 I am assuming you dump the data from Solr in CSV format?


JSON


 How did you implement the streaming processor? (What tool did you use for
 this? I'm not familiar with that.)


this is what dumps the docs:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java

it is called by one of our batch processors, which can pass it a bitset of
recs
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java

as far as streaming is concerned, we were all very pleasantly surprised: a
few-GB file (on a local network) took a ridiculously short time - in fact, a
colleague of mine assumed it was not working until we looked into the
downloaded file ;-). You may want to look at line 463:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java
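
if it helps to picture the bitset-driven dump, here is a rough illustration
(not the actual BatchProviderDumpIndex code; java.util.BitSet stands in for
Lucene's bitset classes, and only the "id" field is written):

// Rough sketch: the batch provider hands the dumper a bitset of internal
// Lucene doc ids; the dumper walks the set bits and streams each matching
// document straight to the output.
import java.io.IOException;
import java.io.Writer;
import java.util.BitSet;

import org.apache.lucene.document.Document;
import org.apache.solr.search.SolrIndexSearcher;

public class BitsetDumper {

  public void dump(SolrIndexSearcher searcher, BitSet hits, Writer out)
      throws IOException {
    // nextSetBit returns -1 once there are no more set bits
    for (int docId = hits.nextSetBit(0); docId >= 0;
         docId = hits.nextSetBit(docId + 1)) {
      Document doc = searcher.doc(docId);  // load one doc at a time
      out.write("{\"id\":\"" + doc.get("id") + "\"}\n");
    }
    out.flush();
  }
}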

roman


 You say it takes only a few minutes to dump the data - how long does it
 take to stream it back in? Is performance acceptable (~ within minutes)?

 Thanks,
 Matt

 On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Hello Matt,
 
 You can consider writing a batch processing handler, which receives a
 query and, instead of sending results back, writes them into a file which
 is then available for streaming (it has its own UUID). I am dumping many
 GBs of data from Solr in a few minutes - your query + a streaming writer
 can go a very long way :)
 
 roman
 
 
 On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:
 
  Hello Solr users,
 
  Question regarding processing a lot of docs returned from a query; I
  potentially have millions of documents returned from a query. What is
  the common design to deal with this?

  2 ideas I have are:
  - create a client service that is multithreaded to handle this
  - use the Solr pagination to retrieve a batch of rows at a time (start,
  rows in the Solr Admin console)

  Any other ideas that I may be missing?

  Thanks,
  Matt



Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
fwiw,
I did a prototype with the following differences:
- it streams straight to the socket output stream
- it streams on the fly during collection, without the need to store a
bitset.
It might have some limited, extreme use cases. Is there anyone interested?
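
To give a flavour of "streaming during collection", a minimal sketch
(illustrative only, not the prototype itself; it assumes the Lucene 4.x
Collector API and skips scoring and error handling):

// A Collector that writes each hit straight to the output as it is
// collected, so no bitset of results is ever materialized.
import java.io.IOException;
import java.io.Writer;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class StreamingCollector extends Collector {
  private final Writer out;  // e.g. wrapped around the socket output stream
  private int docBase;

  public StreamingCollector(Writer out) {
    this.out = out;
  }

  @Override
  public void setScorer(Scorer scorer) {
    // scores are not needed for a plain dump
  }

  @Override
  public void setNextReader(AtomicReaderContext context) {
    docBase = context.docBase;  // remember the segment's doc id offset
  }

  @Override
  public void collect(int doc) throws IOException {
    // write the global doc id immediately instead of storing it
    out.write(Integer.toString(docBase + doc));
    out.write('\n');
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;  // ordering does not matter for a dump
  }
}

searcher.search(query, new StreamingCollector(writer)) then pushes hits out
as they are found.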


On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla roman.ch...@gmail.com wrote:

 On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote:

  That sounds like a satisfactory solution for the time being -
  I am assuming you dump the data from Solr in CSV format?
 

 JSON


  How did you implement the streaming processor? (What tool did you use for
  this? I'm not familiar with that.)
 

 this is what dumps the docs:

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java

 it is called by one of our batch processors, which can pass it a bitset of
 recs

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java

 as far as streaming is concerned, we were all very pleasantly surprised: a
 few-GB file (on a local network) took a ridiculously short time - in fact, a
 colleague of mine assumed it was not working until we looked into the
 downloaded file ;-). You may want to look at line 463:

 https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java

 roman


  You say it takes only a few minutes to dump the data - how long does it
  take to stream it back in? Is performance acceptable (~ within minutes)?
 
  Thanks,
  Matt
 
  On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote:
 
  Hello Matt,
  
  You can consider writing a batch processing handler, which receives a
  query and, instead of sending results back, writes them into a file which
  is then available for streaming (it has its own UUID). I am dumping many
  GBs of data from Solr in a few minutes - your query + a streaming writer
  can go a very long way :)
  
  roman
  
  
  On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com
 wrote:
  
   Hello Solr users,
  
    Question regarding processing a lot of docs returned from a query; I
    potentially have millions of documents returned from a query. What is
    the common design to deal with this?

    2 ideas I have are:
    - create a client service that is multithreaded to handle this
    - use the Solr pagination to retrieve a batch of rows at a time (start,
    rows in the Solr Admin console)

    Any other ideas that I may be missing?

    Thanks,
    Matt
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Processing a lot of results in Solr

2013-07-24 Thread Chris Hostetter

: Subject: Processing a lot of results in Solr
: Message-ID: d57c2b719b792f428beca7b0096c88e22c0...@mail1.impetus.co.in
: In-Reply-To: 1374612243070-4079869.p...@n3.nabble.com

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message; instead, start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to, so your question is hidden in that thread and gets less
attention. It also makes following discussions in the mailing list
archives particularly difficult.



-Hoss


Re: Processing a lot of results in Solr

2013-07-23 Thread Timothy Potter
Hi Matt,

This feature is commonly known as deep paging, and Lucene and Solr have
issues with it ... take a look at
http://solr.pl/en/2011/07/18/deep-paging-problem/ as a potential
starting point; it uses filter queries to bucketize a result set into
smaller sub-result sets.
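
To make the bucketing idea concrete, here is a rough SolrJ 4.x sketch (the
field name, ranges and sizes are made up for illustration; the point is
that start stays small inside each filter-query bucket):

// Instead of paging deeply with a huge start offset, split the result set
// into id-range buckets via filter queries and page shallowly within each.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class BucketedExport {
  public static void main(String[] args) throws SolrServerException {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    int bucketSize = 100000;  // assumed width of each numeric-id bucket
    int rows = 1000;          // page size within a bucket

    for (int lower = 0; lower < 1000000; lower += bucketSize) {
      int upper = lower + bucketSize - 1;
      for (int start = 0; ; start += rows) {
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("id:[" + lower + " TO " + upper + "]");
        q.setStart(start);    // stays small, so paging stays cheap
        q.setRows(rows);
        QueryResponse rsp = server.query(q);
        if (rsp.getResults().isEmpty()) break;
        for (SolrDocument doc : rsp.getResults()) {
          System.out.println(doc.getFieldValue("id"));
        }
      }
    }
  }
}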

Cheers,
Tim

On Tue, Jul 23, 2013 at 3:04 PM, Matt Lieber mlie...@impetus.com wrote:
 Hello Solr users,

 Question regarding processing a lot of docs returned from a query; I
 potentially have millions of documents returned from a query. What is
 the common design to deal with this?

 2 ideas I have are:
 - create a client service that is multithreaded to handle this
 - use the Solr pagination to retrieve a batch of rows at a time (start,
 rows in the Solr Admin console)

 Any other ideas that I may be missing?

 Thanks,
 Matt


Re: Processing a lot of results in Solr

2013-07-23 Thread Roman Chyla
Hello Matt,

You can consider writing a batch processing handler, which receives a query
and, instead of sending results back, writes them into a file which is
then available for streaming (it has its own UUID). I am dumping many GBs
of data from Solr in a few minutes - your query + a streaming writer can go
a very long way :)
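
A minimal sketch of such a handler (purely illustrative - not our actual
BatchHandler; it assumes Solr 4.x's RequestHandlerBase, a hard-coded dump
directory, and leaves the query execution and JSON writing as a comment):

// The real thing would run the query asynchronously; this only shows the
// "write to a UUID-named file, return the UUID" contract.
import java.io.File;
import java.io.FileWriter;
import java.io.Writer;
import java.util.UUID;

import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

public class BatchDumpHandler extends RequestHandlerBase {

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    String jobId = UUID.randomUUID().toString();
    File dump = new File("/tmp/solr-dumps", jobId + ".json");  // assumed dir

    try (Writer out = new FileWriter(dump)) {
      // ... execute req.getParams().get("q") here and stream each matching
      // document to 'out', e.g. with a lazily loading writer ...
    }

    // the client fetches the finished file later, using this id
    rsp.add("jobid", jobId);
  }

  @Override
  public String getDescription() {
    return "Dumps query results to a file for later streaming (sketch)";
  }

  @Override
  public String getSource() {
    return "";
  }
}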

roman


On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:

 Hello Solr users,

 Question regarding processing a lot of docs returned from a query; I
 potentially have millions of documents returned from a query. What is
 the common design to deal with this?

 2 ideas I have are:
 - create a client service that is multithreaded to handle this
 - use the Solr pagination to retrieve a batch of rows at a time (start,
 rows in the Solr Admin console)

 Any other ideas that I may be missing?

 Thanks,
 Matt



Re: Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
That sounds like a satisfactory solution for the time being -
I am assuming you dump the data from Solr in CSV format?
How did you implement the streaming processor? (What tool did you use for
this? I'm not familiar with that.)
You say it takes only a few minutes to dump the data - how long does it
take to stream it back in? Is performance acceptable (~ within minutes)?

Thanks,
Matt

On 7/23/13 6:57 PM, Roman Chyla roman.ch...@gmail.com wrote:

Hello Matt,

You can consider writing a batch processing handler, which receives a
query and, instead of sending results back, writes them into a file which
is then available for streaming (it has its own UUID). I am dumping many
GBs of data from Solr in a few minutes - your query + a streaming writer
can go a very long way :)

roman


On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber mlie...@impetus.com wrote:

 Hello Solr users,

 Question regarding processing a lot of docs returned from a query; I
 potentially have millions of documents returned from a query. What is
 the common design to deal with this?

 2 ideas I have are:
 - create a client service that is multithreaded to handle this
 - use the Solr pagination to retrieve a batch of rows at a time (start,
 rows in the Solr Admin console)

 Any other ideas that I may be missing?

 Thanks,
 Matt