Re: Parallelize Cursor approach

2016-11-05 Thread Erick Erickson
Hmmm, export is supposed to handle result sets in the tens of millions. I
know of a situation where the Streaming Aggregation functionality,
back-ported to Solr 4.10, processes results on that scale. So do you have
any clue what exactly is failing? Is there anything in the Solr logs?

_How_ are you using /export: through Streaming Aggregation (SolrJ) or
just the raw xport handler? If you're not using SolrJ, it might be worth
trying this from there; it should be a very quick program to write just
to test, we're talking 100 lines max.
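
Something like this minimal sketch would do as a smoke test (the ZooKeeper
address, collection, and field names here are placeholders, and it assumes
the Solr 5.5-era SolrJ streaming API, where CloudSolrStream takes a Map of
params; /export also requires that the fl and sort fields have docValues):

import java.util.HashMap
import org.apache.solr.client.solrj.io.Tuple
import org.apache.solr.client.solrj.io.stream.CloudSolrStream

object ExportSmokeTest {
  def main(args: Array[String]): Unit = {
    val props = new HashMap[String, String]()
    props.put("q", "*:*")
    props.put("fl", "id")       // must be a docValues field for /export
    props.put("sort", "id asc") // /export requires an explicit sort
    props.put("qt", "/export")
    // Placeholder ZK ensemble and collection name; substitute your own.
    val stream = new CloudSolrStream("zkhost1:2181", "my_collection", props)
    try {
      stream.open()
      var count = 0L
      var tuple: Tuple = stream.read()
      while (!tuple.EOF) { // the EOF tuple marks the end of the stream
        count += 1
        tuple = stream.read()
      }
      println(s"Read $count tuples")
    } finally {
      stream.close()
    }
  }
}

If that little program hits the same parse exception, at least you've
isolated the problem from your client code.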

You could always roll your own cursor mark stuff by partitioning the
data amongst N threads/processes if you have any reasonable
expectation that you could form filter queries that partition the
result set anywhere near evenly.

For example, let's say you have a field with random numbers between 0
and 100. You could spin off 10 cursorMark-aware processes each with
its own fq clause like

fq=partition_field:[0 TO 10}
fq=partition_field:[10 TO 20}
...
fq=partition_field:[90 TO 100]

Note the use of inclusive (square bracket) and exclusive (curly brace)
endpoints so that adjacent ranges don't overlap.

Each one would be totally independent of all the others, with no
overlapping documents. And since the fq's would presumably be cached, you
should be able to go as fast as you can drive your cluster. Of course you
lose query-wide sorting and the like; if that's important you'd need to
figure something out there.
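
A minimal sketch of one such worker (assumptions: a CloudSolrClient pointed
at a placeholder ZK host, a uniqueKey field named id, and the hypothetical
partition_field above; each worker gets its own fq range):

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.params.CursorMarkParams

object PartitionWorker {
  def main(args: Array[String]): Unit = {
    val client = new CloudSolrClient("zkhost1:2181") // placeholder ZK address
    client.setDefaultCollection("my_collection")
    val q = new SolrQuery("*:*")
    q.setRows(100000)
    q.addSort("id", SolrQuery.ORDER.asc) // cursorMark requires a sort on the uniqueKey
    q.addFilterQuery("partition_field:[0 TO 10}") // this worker's slice
    var cursorMark = CursorMarkParams.CURSOR_MARK_START
    var done = false
    while (!done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
      val rsp = client.query(q)
      val it = rsp.getResults.iterator()
      while (it.hasNext) {
        val doc = it.next() // process each SolrDocument here
      }
      val next = rsp.getNextCursorMark
      done = next == cursorMark // an unchanged mark means the last page was read
      cursorMark = next
    }
    client.close()
  }
}

Run ten of these, one per fq range, and you have your parallel cursor.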

Do be aware of a potential issue. When regular stored fields are returned,
a 16K block of data must be decompressed for each document to get at the
stored field data. Streaming Aggregation (/xport) reads docValues entries,
which are held in MMapDirectory space, so it will be much, much faster. As
of Solr 5.5 you can bypass the decompression for fields that are both
stored and docValues; see:
https://issues.apache.org/jira/browse/SOLR-8220
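
For reference, here is roughly what such a field definition looks like in
schema.xml (a sketch; the field and type names are placeholders, and
useDocValuesAsStored is the attribute I believe SOLR-8220 introduced):

<field name="partition_field" type="tint" indexed="true" stored="true"
       docValues="true" useDocValuesAsStored="true"/>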

Best,
Erick

On Sat, Nov 5, 2016 at 6:41 PM, Chetas Joshi  wrote:
> Thanks Yonik for the explanation.
>
> Hi Erick,
> I was using the /xport functionality, but it hasn't been stable (Solr
> 5.5.0). I started running into runtime exceptions (JSON parsing
> exceptions) while reading the stream of Tuples. This started happening as
> the size of my collection increased threefold and I started running queries
> that return millions of documents (>10MM). I don't know if it is the query
> result size or the actual data size (the total number of docs in the
> collection) that is causing the instability.
>
> org.noggit.JSONParser$ParseException: Expected ',' or '}':
> char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
> I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
> 0lG99sHT8P5e'
>
> I won't be able to move to Solr 6.0 due to some constraints in our
> production environment, and hence am moving back to the cursor approach. Do you
> have any other suggestions for me?
>
> Thanks,
> Chetas.
>
> On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson 
> wrote:
>
>> Have you considered the /xport functionality?
>>
>> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley  wrote:
>> > No, you can't get cursor-marks ahead of time.
>> > They are the serialized representation of the last sort values
>> > encountered (hence not known ahead of time).
>> >
>> > -Yonik
>> >
>> >
>> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi 
>> wrote:
>> >> Hi,
>> >>
>> >> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
>> >> my queries return millions of results. Is there a way I can read the pages
>> >> in parallel? Is there a way I can get all the cursors well in advance?
>> >>
>> >> Let's say my query returns 2M documents and I have set rows=100,000.
>> >> Can I have multiple threads iterating over different pages like
>> >> Thread1 -> docs 1 to 100K
>> >> Thread2 -> docs 101K to 200K
>> >> ..
>> >> ..
>> >>
>> >> For this to happen, can I get all the cursorMarks for a given query so that
>> >> I can leverage the following code in parallel?
>> >>
>> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> >> val rsp: QueryResponse = c.query(cursorQ)
>> >>
>> >> Thank you,
>> >> Chetas.
>>


Re: Parallelize Cursor approach

2016-11-05 Thread Chetas Joshi
Thanks Yonik for the explanation.

Hi Erick,
I was using the /xport functionality, but it hasn't been stable (Solr
5.5.0). I started running into runtime exceptions (JSON parsing
exceptions) while reading the stream of Tuples. This started happening as
the size of my collection increased threefold and I started running queries
that return millions of documents (>10MM). I don't know if it is the query
result size or the actual data size (the total number of docs in the
collection) that is causing the instability.

org.noggit.JSONParser$ParseException: Expected ',' or '}':
char=5,position=110938 BEFORE='uuid":"0lG99s8vyaKB2I/
I","space":"uuid","timestamp":1 5' AFTER='DB6 474294954},{"uuid":"
0lG99sHT8P5e'

I won't be able to move to Solr 6.0 due to some constraints in our
production environment, and hence am moving back to the cursor approach. Do you
have any other suggestions for me?

Thanks,
Chetas.

On Fri, Nov 4, 2016 at 10:17 PM, Erick Erickson 
wrote:

> Have you considered the /xport functionality?
>
> On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley  wrote:
> > No, you can't get cursor-marks ahead of time.
> > They are the serialized representation of the last sort values
> > encountered (hence not known ahead of time).
> >
> > -Yonik
> >
> >
> > On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi 
> wrote:
> >> Hi,
> >>
> >> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
> >> my queries return millions of results. Is there a way I can read the pages
> >> in parallel? Is there a way I can get all the cursors well in advance?
> >>
> >> Let's say my query returns 2M documents and I have set rows=100,000.
> >> Can I have multiple threads iterating over different pages like
> >> Thread1 -> docs 1 to 100K
> >> Thread2 -> docs 101K to 200K
> >> ..
> >> ..
> >>
> >> For this to happen, can I get all the cursorMarks for a given query so that
> >> I can leverage the following code in parallel?
> >>
> >> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> >> val rsp: QueryResponse = c.query(cursorQ)
> >>
> >> Thank you,
> >> Chetas.
>


Re: Facets based on sampling

2016-11-05 Thread Mikhail Khludnev
Hello, John!

You can try to do that manually by applying a filter on a random field.
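
For example (a sketch, using a hypothetical rand_i field: index an integer
drawn uniformly from 0-999 into each document at index time), a roughly 10%
sample would be:

fq=rand_i:[0 TO 99]

Facet on top of that filter and multiply the counts by 10 to extrapolate to
the full result set.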

On Fri, Nov 4, 2016 at 10:02 PM, John Davis 
wrote:

> Hi,
> I am trying to improve the performance of queries with facets. I understand
> that for queries with high facet cardinality and a large number of results,
> the current facet computation algorithms can be slow, as they loop across
> all docs and facet values.
>
> Does there exist an option to compute facets by just looking at the top-n
> results instead of all of them, or at a sample of results based on some query
> parameters? I couldn't find one, and if it does not exist, has this come up
> before? This would definitely not be a precise facet count, but with
> reasonable sampling algorithms we should be able to extrapolate well.
>
> Thank you in advance for any advice!
>
> John
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Facets based on sampling

2016-11-05 Thread Toke Eskildsen
From: John Davis  wrote:
> Does there exist an option to compute facets by just looking at the top-n
> results instead of all of them, or at a sample of results based on some query
> parameters?

Doing it for the top-n results does not play well with the current query flow
in Solr (I might be wrong here, as I am not too familiar with that part of the
code). It also seems to collide somewhat with the fact that documents are
(often) sorted by score while facets are (often) sorted by count. So the
result would be something like facet values that are present in the
highest-scoring documents and also appear in many documents? It might work in
some situations, but be confusing in others.

Sampling-based faceting seems like a more straightforward concept to me.

> I couldn't find one, and if it does not exist, has this come up
> before? This would definitely not be a precise facet count, but with
> reasonable sampling algorithms we should be able to extrapolate well.

I implemented something like that 2 years ago for Solr 4.10. There is a 
write-up at
https://sbdevel.wordpress.com/2015/06/19/dubious-guesses-counted-correctly/

Interestingly enough, it is possible to get precise counts with sampling. Then 
the "only" downside is a possibility that the terms guessed to be in the top-X 
are not the correct ones.

(and yes, I do plan on porting to Solr 6.x. Hopefully spring 2017, but no 
promises)

- Toke Eskildsen


Re: Query formulation help

2016-11-05 Thread Erick Erickson
Assuming these are numerics, use function queries I should think; see:
https://cwiki.apache.org/confluence/display/solr/Function+Queries#FunctionQueries-AvailableFunctions.
You'll see lt, gt, etc.
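
One way to express var_1 > var_2 directly is a function range query (a
sketch, assuming var_1 and var_2 are numeric fields; frange filters on the
value of a function):

fq={!frange l=0 incl=false}sub(var_1,var_2)

sub(var_1,var_2) must then be strictly greater than 0, which is exactly
var_1 > var_2.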

Best,
Erick

On Fri, Nov 4, 2016 at 11:23 PM, Prasanna S. Dhakephalkar
 wrote:
> Hi John,
>
> I need to formulate a query where both query variables are from the document,
> like: get me all documents where var_1 > var_2 (var_1 and var_2 are both in
> the document).
>
> Thanks and Regards,
>
> Prasanna.
>
>
> -Original Message-
> From: John Bickerstaff [mailto:j...@johnbickerstaff.com]
> Sent: Wednesday, October 26, 2016 9:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query formulation help
>
> For what it's worth - you can do some complex stuff, including using document
> fields as "variables". I did it on a Solr query endpoint (like
> /search) because I had stuff that was constant for every query. The syntax
> is challenging, but it can be done.
>
> I won't confuse the issue more unless you need something like that - let me 
> know if you do.
>
> On Wed, Oct 26, 2016 at 9:52 AM, Tom Evans  wrote:
>
>> On Wed, Oct 26, 2016 at 4:00 PM, Prasanna S. Dhakephalkar
>>  wrote:
>> > Hi,
>> >
>> > Thanks for reply, I did
>> >
>> > "q": "cost:[2 TO (2+5000)]"
>> >
>> > Got
>> >
>> >   "error": {
>> > "msg": "org.apache.solr.search.SyntaxError: Cannot parse
>> 'cost:[2 to (2+5000)]': Encountered \"  \"(2+5000)
>> \"\" at line 1, column 18.\nWas expecting one of:\n\"]\" ...\n\"}\"
>> ...\n",
>> >   }
>> >
>> > I want solr to do the addition.
>> > I tried
>> > "q": "cost:[2 TO (2+5000)]"
>> > "q": "cost:[2 TO sum(2,5000)]"
>> >
>> > It has not worked. I am missing something. I do not know what. Maybe it's
>> > how to invoke functions.
>> >
>> > Regards,
>> >
>> > Prasanna.
>>
>> Sorry, I was unclear - do the maths before constructing the query!
>>
>> You might be able to do this with function queries, but why bother? If
>> the number is fixed, then fix it in the query; if it varies, then there
>> must be some code executing on your client that can be used to do a
>> simple addition.
>>
>> Cheers
>>
>> Tom
>>
>


RE: Query formulation help

2016-11-05 Thread Prasanna S. Dhakephalkar
Hi John,

I need to formulate a query where both query variables are from the document,
like: get me all documents where var_1 > var_2 (var_1 and var_2 are both in
the document).

Thanks and Regards,

Prasanna.


-Original Message-
From: John Bickerstaff [mailto:j...@johnbickerstaff.com] 
Sent: Wednesday, October 26, 2016 9:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Query formulation help

For what it's worth - you can do some complex stuff, including using document
fields as "variables". I did it on a Solr query endpoint (like
/search) because I had stuff that was constant for every query. The syntax is
challenging, but it can be done.

I won't confuse the issue more unless you need something like that - let me 
know if you do.

On Wed, Oct 26, 2016 at 9:52 AM, Tom Evans  wrote:

> On Wed, Oct 26, 2016 at 4:00 PM, Prasanna S. Dhakephalkar 
>  wrote:
> > Hi,
> >
> > Thanks for reply, I did
> >
> > "q": "cost:[2 TO (2+5000)]"
> >
> > Got
> >
> >   "error": {
> > "msg": "org.apache.solr.search.SyntaxError: Cannot parse
> 'cost:[2 to (2+5000)]': Encountered \"  \"(2+5000)
> \"\" at line 1, column 18.\nWas expecting one of:\n\"]\" ...\n\"}\"
> ...\n",
> >   }
> >
> > I want solr to do the addition.
> > I tried
> > "q": "cost:[2 TO (2+5000)]"
> > "q": "cost:[2 TO sum(2,5000)]"
> >
> > It has not worked. I am missing something. I do not know what. Maybe it's
> > how to invoke functions.
> >
> > Regards,
> >
> > Prasanna.
>
> Sorry, I was unclear - do the maths before constructing the query!
>
> You might be able to do this with function queries, but why bother? If 
> the number is fixed, then fix it in the query, if it varies then there 
> must be some code executing on your client that can be used to do a 
> simple addition.
>
> Cheers
>
> Tom
>