Wrong link to the parser, should be:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java


On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

> Hello @,
>
> This thread 'kicked' me into finishing a long-overdue task of
> sending/receiving a large boolean (bitset) filter. We have been using bitsets
> with Solr before, but now I sat down and wrote it as a qparser. The use
> cases, as you have discussed, are:
>
>  - necessity to send loooong list of ids as a query (where it is not
> possible to do it the 'normal' way)
>  - or filtering ACLs
>
>
> It works in the following way:
>
>   - an external application constructs a bitset and sends it as a query to Solr
> (q or fq, depending on your needs)
>   - Solr unpacks the bitset (translating bits into Lucene ids, if
> necessary) and wraps it in a query, which then has the easy job of
> 'filtering' wanted/unwanted items
>
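
The two steps above can be sketched as a round trip: bitset -> byte array -> GZIP -> base64 on the client, and the reverse on the Solr side. This is only a minimal sketch using the plain JDK, not the code from the plugin. The class name BitSetCodec and its methods are hypothetical, and the bit order is whatever java.util.BitSet.toByteArray() produces (little-endian), which may differ from what the actual parser expects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.BitSet;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Solr or montysolr.
final class BitSetCodec {

    // Client side: BitSet -> byte array -> GZIP -> base64 string.
    static String encode(BitSet bits) throws IOException {
        byte[] raw = bits.toByteArray();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        } // closing the stream writes the GZIP trailer
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    // Server side: base64 string -> GZIP -> byte array -> BitSet.
    static BitSet decode(String encoded) throws IOException {
        byte[] gzipped = Base64.getDecoder().decode(encoded);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return BitSet.valueOf(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        BitSet ids = new BitSet();
        ids.set(1);
        ids.set(2);
        ids.set(2000);
        String payload = encode(ids);     // what would be sent as q or fq
        BitSet roundTrip = decode(payload);
        System.out.println(roundTrip.equals(ids)); // true
    }
}
```

Whether GZIP is worth it depends on the data: for a randomly populated bitset it barely shrinks the payload, while a sparse or clustered bitset should compress much better.
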
> Therefore it is good only if you can search against something that is
> indexed as an integer (ids often are).
>
> A simple benchmark shows acceptable performance: sending the bitset
> (randomly populated, 10M bits, with 4M bits set) takes 110ms (25+64+20).
>
> Decoding this string (resulting size 1.5MB!) takes ~90ms
> (5+14+68ms).
>
> I haven't tested the latency of sending it over the network or the query
> performance, but since the query is very similar to MatchAllDocs, it is
> probably very fast (and I know that sending many MBs to Solr is fast as
> well).
>
> I know this is not exactly a 'standard' solution, and it is probably not
> something you want to see with hundreds of millions of docs, but people
> seem to be doing 'not the right thing' all the time ;)
> So if you think this is something useful for the community, please let me
> know. If somebody is willing to test it, I can file a JIRA ticket.
>
> Thanks!
>
> Roman
>
>
> The code, if no JIRA is needed, can be found here:
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
>
> 839ms.  run
> 154ms.  Building random bitset indexSize=10000000 fill=0.5 -- Size=15054208,cardinality=3934477 highestBit=9999999
> 25ms.  Converting bitset to byte array -- resulting array length=1250000
> 20ms.  Encoding byte array into base64 -- resulting array length=1666668 ratio=1.3333344
> 62ms.  Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816
> 20ms.  Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432
> 5ms.  Decoding gzipped byte array from base64
> 14ms.  Uncompressing decoded byte array
> 68ms.  Converting from byte array to bitset
> 743ms.  running
>
>
> On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> Not necessarily. If the auth tokens are available on some
>> other system (DB, LDAP, whatever), one could get them
>> in the PostFilter and cache them somewhere since,
>> presumably, they wouldn't be changing all that often. Or
>> use a UserCache and get notified whenever a new searcher
>> was opened and regenerate or purge the cache.
>>
>> Of course you're right if the post filter does NOT have
>> access to the source of truth for the user's privileges.
>>
>> FWIW,
>> Erick
>>
>> On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
>> <otis.gospodne...@gmail.com> wrote:
>> > Hi,
>> >
>> > The unfortunate thing about this is that you still have to *pass* that
>> > filter from the client to the server every time you want to use that
>> > filter.  If that filter is big/long, passing that in all the time has
>> > some price that could be eliminated by using "server-side named
>> > filters".
>> >
>> > Otis
>> > --
>> > Solr & ElasticSearch Support
>> > http://sematext.com/
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> You might consider "post filters". The idea
>> >> is to write a custom filter that gets applied
>> >> after all other filters etc. One use-case
>> >> here is exactly ACL lists, and can be quite
>> >> helpful if you're not doing *:* type queries.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
>> >> <otis.gospodne...@gmail.com> wrote:
>> >>> Btw. ElasticSearch has a nice feature here.  Not sure what it's
>> >>> called, but I call it "named filter".
>> >>>
>> >>> http://www.elasticsearch.org/blog/terms-filter-lookup/
>> >>>
>> >>> Maybe that's what OP was after?
>> >>>
>> >>> Otis
>> >>> --
>> >>> Solr & ElasticSearch Support
>> >>> http://sematext.com/
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
>> >>> <arafa...@gmail.com> wrote:
>> >>>> On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov <ivkus...@gmail.com> wrote:
>> >>>>> So I'm using a query like
>> >>>>>
>> >>>>> http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29
>> >>>>
>> >>>> If the IDs are purely numeric, I wonder if the better way is to send a
>> >>>> bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000
>> >>>> is included. Even using URL-encoding rules, there are at least 65
>> >>>> usable characters, so you can fit 6 sequential ID flags per character,
>> >>>> and I am sure there are more efficient encoding schemes for long empty
>> >>>> sequences.
>> >>>>
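
To make the bitset idea concrete: with a URL-safe base64 alphabet, each character carries 6 bits, so each character can hold 6 ID flags. Below is a rough sketch with the plain JDK. The class IdFlags and its methods are made up for illustration, and no existing Solr parser is assumed to understand this format.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.BitSet;

// Hypothetical helper for illustration only.
final class IdFlags {

    // Bit N is on if and only if ID N is included.
    static String toUrlSafe(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        // URL-safe alphabet (A-Z a-z 0-9 - _), no '=' padding.
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(bits.toByteArray());
    }

    // Reverse direction: recover the sorted list of IDs from the string.
    static int[] fromUrlSafe(String encoded) {
        byte[] raw = Base64.getUrlDecoder().decode(encoded);
        return BitSet.valueOf(raw).stream().toArray();
    }

    public static void main(String[] args) {
        String param = toUrlSafe(1, 2, 3);
        System.out.println(param);
        System.out.println(Arrays.toString(fromUrlSafe(param))); // [1, 2, 3]
    }
}
```

The string length grows with the highest ID rather than with the number of IDs, so this pays off for dense ID sets and loses to a plain list when there are only a few large IDs.
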
>> >>>> Regards,
>> >>>>    Alex.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Personal website: http://www.outerthoughts.com/
>> >>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> >>>> - Time is the quality of nature that keeps events from happening all
>> >>>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> >>>> book)
>>
>
>
