Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-28 Thread Erick Erickson
Shawn had an interesting idea on another thread. It depends on having basically an identity field (which I see how to do manually, but don't see how to make work as a new field type in a distributed environment). And it's brilliantly simple: just a range query, identity:{ TO *], with sort=identity
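[Editorial note: a rough SolrJ sketch of that "range query on an identity field" idea, for illustration only. The field name "identity", the page size, and the client setup are assumptions, not details from the thread; values containing whitespace or query syntax would also need escaping.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class IdentityWalk {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    String last = null;                                // highest identity value seen so far
    while (true) {
      SolrQuery q = new SolrQuery("*:*");
      if (last != null) {
        // strictly-greater-than range, so the last doc of the previous page is not repeated
        q.addFilterQuery("identity:{" + last + " TO *]");
      }
      q.addSort("identity", SolrQuery.ORDER.asc);
      q.setRows(1000);
      SolrDocumentList page = server.query(q).getResults();
      if (page.isEmpty()) break;
      last = String.valueOf(page.get(page.size() - 1).getFieldValue("identity"));
      // hand the page off for processing here
    }
  }
}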

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Otis, You gave links to 'deep paging' when I asked about response streaming. Let me understand. From my POV, deep paging is a special case of regular search scenarios. We definitely need it in Solr. However, if we are talking about data-analytics-like problems, where we need to select an endless

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Mikhail, If your solution gives lazy loading of solr docs (and thus streaming of huge result lists) it should be a big YES! Roman On 27 Jul 2013 07:55, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Otis, You gave links to 'deep paging' when I asked about response streaming. Let me

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Roman, Let me briefly explain the design: a special RequestParser stores the servlet output stream into the context (https://github.com/m-khl/solr-patches/compare/streaming#L7R22), then a special component injects a special PostFilter/DelegatingCollector which writes right into the output
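[Editorial note: not Mikhail's actual patch (that lives in the linked branch), just a minimal sketch of the shape of such a component against the Solr 4.x-era APIs the thread is about. How the servlet writer reaches the filter, and the PostFilter cost handling (it must be >= 100 in practice), are left out as assumptions.]

import java.io.IOException;
import java.io.PrintWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

// Sketch only: a PostFilter whose DelegatingCollector writes each matching doc id
// straight to an output writer during collection, instead of buffering a DocList.
public class StreamingPostFilter extends ExtendedQueryBase implements PostFilter {

  private final PrintWriter out;   // assumed: the servlet response writer, stashed elsewhere

  public StreamingPostFilter(PrintWriter out) {
    this.out = out;
  }

  @Override
  public boolean getCache() {
    return false;                  // post filters must not be cached
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        out.println(docBase + doc);  // emit the global doc id as soon as it matches
        super.collect(doc);          // then let normal collection continue
      }
    };
  }
}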

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Hi Mikhail, I can see it is lazy-loading, but I can't judge how complex it becomes (presumably, the filter dispatching mechanism is also doing other things; it is there not only for streaming). Let me just explain better what I found when I dug inside solr: documents (results of the query)

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 4:30 PM, Roman Chyla roman.ch...@gmail.com wrote: Let me just explain better what I found when I dug inside solr: documents (results of the query) are loaded before they are passed into a writer - so the writers are expecting to encounter the solr documents, but these

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Hello, Please find below: Let me just explain better what I found when I dug inside solr: documents (results of the query) are loaded before they are passed into a writer - so the writers are expecting to encounter the solr documents, but these documents were loaded by one of the components

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 5:05 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: anyway, even if the writer pulls docs one by one, it doesn't allow streaming a billion of them. Solr writes out a DocList, which is really problematic even in deep-paging scenarios. Which part is problematic... the

Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley yo...@lucidworks.com wrote: Which part is problematic... the creation of the DocList (the search)? Literally, DocList is a copy of TopDocs. Creating TopDocs is not a search, but ranking. And the ranking cost is log(rows+start), besides numFound, which
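[Editorial note: for illustration, roughly what that ranking step amounts to with the Lucene 4.x-era API (a sketch, not Solr's actual code path): the collector keeps a heap of start+rows entries, so deep pages grow both the heap and the per-hit log cost, and everything before the offset is thrown away at the end.]

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;

class DeepPagingCost {
  // Return `rows` hits at offset `start`: the heap is sized start + rows,
  // and every competitive hit pays a log(start + rows) heap update.
  static ScoreDoc[] page(IndexSearcher searcher, Query query, int start, int rows) throws Exception {
    TopScoreDocCollector collector = TopScoreDocCollector.create(start + rows, true);
    searcher.search(query, collector);
    return collector.topDocs(start, rows).scoreDocs;   // the first `start` entries are discarded
  }
}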

Re: Processing a lot of results in Solr

2013-07-25 Thread Otis Gospodnetic
Mikhail, Yes, +1. This question comes up a few times a year. Grant created a JIRA issue for this many moons ago. https://issues.apache.org/jira/browse/LUCENE-2127 https://issues.apache.org/jira/browse/SOLR-1726 Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring

Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
Roman, Can you disclose how that streaming writer works? What does it stream, docList or docSet? Thanks On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla roman.ch...@gmail.com wrote: Hello Matt, You can consider writing a batch processing handler, which receives a query and instead of sending

Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
Mikhail, It is a slightly hacked JSONWriter - actually, while poking around, I have discovered that dumping big hitsets would be possible - the main hurdle right now is that the writer is expecting to receive documents with fields loaded, but if it received something that loads docs lazily, you could
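[Editorial note: not the hacked JSONWriter itself, but a bare-bones sketch of a writer that pulls documents from the searcher one at a time while writing, against the Solr 4.x-era APIs. The "id" field and the plain-text output are placeholders.]

import java.io.IOException;
import java.io.Writer;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.QueryResponseWriter;
import org.apache.solr.response.ResultContext;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.SolrIndexSearcher;

// Sketch: iterate the DocList and load each stored document lazily while writing,
// so no more than one document's fields are held in memory at a time.
public class LazyDumpWriter implements QueryResponseWriter {

  @Override
  public void write(Writer out, SolrQueryRequest req, SolrQueryResponse rsp) throws IOException {
    ResultContext ctx = (ResultContext) rsp.getValues().get("response");
    SolrIndexSearcher searcher = req.getSearcher();
    DocIterator it = ctx.docs.iterator();               // Solr 4.x-era field; later versions use getDocList()
    while (it.hasNext()) {
      out.write(searcher.doc(it.nextDoc()).get("id"));  // fetch one doc, write one value
      out.write('\n');
    }
  }

  @Override
  public String getContentType(SolrQueryRequest req, SolrQueryResponse rsp) {
    return "text/plain";
  }

  @Override
  public void init(NamedList args) { }
}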

Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber mlie...@impetus.com wrote: That sounds like a satisfactory solution for the time being - I am assuming you dump the data from Solr in a CSV format? JSON. How did you implement the streaming processor? (what tool did you use for this? Not

Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
fwiw, I did a prototype with the following differences: it streams straight to the socket output stream, and it streams on the fly during collecting, without the need to store a bitset. It might have some limited, extreme uses. Is there anyone interested? On Wed, Jul 24, 2013 at 7:19 PM, Roman

Re: Processing a lot of results in Solr

2013-07-24 Thread Chris Hostetter
: Subject: Processing a lot of results in Solr : Message-ID: d57c2b719b792f428beca7b0096c88e22c0...@mail1.impetus.co.in : In-Reply-To: 1374612243070-4079869.p...@n3.nabble.com https://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a

Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
Hello Solr users, A question regarding processing a lot of docs returned from a query: I potentially have millions of documents returned back from a query. What is the common design to deal with this? 2 ideas I have are: - create a client service that is multithreaded to handle this - Use the

Re: Processing a lot of results in Solr

2013-07-23 Thread Timothy Potter
Hi Matt, This feature is commonly known as deep paging and Lucene and Solr have issues with it ... take a look at http://solr.pl/en/2011/07/18/deep-paging-problem/ as a potential starting point, using filters to bucketize a result set into sub result sets. Cheers, Tim On Tue, Jul 23,
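[Editorial note: for illustration, the bucketizing idea in SolrJ terms - walk the collection in fixed ranges of some field via fq instead of ever-deeper start= offsets. The field name "id", the bucket width, and the upper bound are assumptions.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BucketWalk {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    long bucket = 1000000L;                 // documents are split into ranges of this width
    long maxId = 50000000L;                 // assumed upper bound of the numeric id field
    for (long lower = 0; lower < maxId; lower += bucket) {
      SolrQuery q = new SolrQuery("*:*");
      q.addFilterQuery("id:[" + lower + " TO " + (lower + bucket - 1) + "]");
      q.setRows((int) bucket);              // each bucket is assumed small enough to fetch in one go
      QueryResponse rsp = server.query(q);
      // process rsp.getResults() here; start= never grows beyond a single bucket
    }
  }
}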

Re: Processing a lot of results in Solr

2013-07-23 Thread Roman Chyla
Hello Matt, You can consider writing a batch processing handler, which receives a query and, instead of sending results back, writes them into a file which is then available for streaming (it has its own UUID). I am dumping many GBs of data from solr in a few minutes - your query + streaming
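[Editorial note: a sketch of just the dumping step behind such a handler, not Roman's implementation. The RequestHandler plumbing, real JSON formatting, and the choice of a DocSet (unsorted matches), output directory, and "id" field are all assumptions.]

import java.io.File;
import java.io.FileWriter;
import java.io.Writer;
import java.util.UUID;
import org.apache.lucene.search.Query;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.QParser;
import org.apache.solr.search.SolrIndexSearcher;

class BatchDump {
  static String dump(SolrQueryRequest req, File dir) throws Exception {
    String jobId = UUID.randomUUID().toString();             // ticket the client can stream from later
    Query q = QParser.getParser(req.getParams().get("q"), null, req).getQuery();
    SolrIndexSearcher searcher = req.getSearcher();
    DocSet hits = searcher.getDocSet(q);                     // every match, no ranking, no DocList
    try (Writer out = new FileWriter(new File(dir, jobId + ".json"))) {
      DocIterator it = hits.iterator();
      while (it.hasNext()) {
        out.write(searcher.doc(it.nextDoc()).get("id"));     // load one stored doc at a time
        out.write('\n');
      }
    }
    return jobId;                                            // returned to the client instead of the results
  }
}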

Re: Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
That sounds like a satisfactory solution for the time being - I am assuming you dump the data from Solr in a CSV format? How did you implement the streaming processor? (what tool did you use for this? Not familiar with that) You say it takes only a few minutes to dump the data - how long does it