Hello @,

This thread 'kicked' me into finishing a long-overdue task: sending and
receiving a large boolean (bitset) filter. We have been using bitsets
with Solr before, but now I sat down and wrote it as a qparser. The use
cases, as you have discussed, are:

 - the need to send a very long list of ids as a query (where it is not
possible to do it the 'normal' way)
 - or ACL filtering


It works in the following way:

  - an external application constructs a bitset and sends it as a query to
Solr (q or fq, depending on your needs)
  - Solr unpacks the bitset (translating bits into Lucene ids, if
necessary) and wraps it in a query, which then has the easy job of
'filtering' wanted/unwanted items
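
The round trip described above can be sketched as follows. This is only
an illustration and not the code from the linked repository: the class
name BitSetCodec and its methods are hypothetical, the byte layout is
whatever java.util.BitSet.toByteArray() produces (little-endian), and
the steps (byte array, GZIP, base64) are taken from the benchmark
output at the end of this mail.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.BitSet;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Solr or of the linked code.
final class BitSetCodec {

    /** Client side: bitset -> byte array -> GZIP -> base64 string. */
    static String encode(BitSet bits) {
        // Little-endian layout: bit 0 is the lowest bit of byte 0.
        byte[] raw = bits.toByteArray();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The GZIP stream is closed here, so the buffer is complete.
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    /** Server side: base64 string -> GZIP -> byte array -> bitset. */
    static BitSet decode(String encoded) {
        byte[] gzipped = Base64.getDecoder().decode(encoded);
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                raw.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return BitSet.valueOf(raw.toByteArray());
    }

    public static void main(String[] args) {
        BitSet wanted = new BitSet();
        wanted.set(1);      // document with id 1 is wanted
        wanted.set(2000);   // document with id 2000 is wanted

        BitSet received = decode(encode(wanted));

        // Walk the set bits; each position is treated as an integer id.
        for (int id = received.nextSetBit(0); id >= 0;
                 id = received.nextSetBit(id + 1)) {
            System.out.println("id:" + id);
        }
    }
}
```

Standard base64 output contains '+', '/' and '=', so it would need
URL-encoding if passed in a query string; Base64.getUrlEncoder() (with
the matching getUrlDecoder() on the other side) is one way around that.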

Therefore it is only useful if you can search against something that is
indexed as an integer (ids often are).

A simple benchmark shows acceptable performance: preparing the bitset for
sending (randomly populated, 10M bits, about 4M of them set) takes about
110ms (25+64+20ms).

Decoding the resulting string (about 1.5MB!) takes ~90ms
(5+14+68ms).
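
The sizes reported in the benchmark output at the end of this mail
follow from simple arithmetic: one byte per 8 bits, and 4 base64
characters per 3 bytes (rounded up because of padding). A small sketch
to reproduce them; the class name PayloadSize is made up for this
example:

```java
// Hypothetical helper for estimating payload sizes.
final class PayloadSize {

    /** Bytes needed to hold the given number of bits. */
    static long bytesForBits(long bits) {
        return (bits + 7) / 8;
    }

    /** Length of padded base64 text for the given number of bytes. */
    static long base64Length(long bytes) {
        return 4 * ((bytes + 2) / 3);
    }

    public static void main(String[] args) {
        System.out.println(bytesForBits(10_000_000L)); // 1250000
        System.out.println(base64Length(1_250_000L));  // 1666668 (uncompressed)
        System.out.println(base64Length(1_218_602L));  // 1624804 (gzipped)
    }
}
```

In this run GZIP only shrinks the randomly filled bitset to a ratio of
about 0.97, so most of the final size is the fixed 4/3 overhead of
base64. Sparse or clustered bitsets would probably compress much better,
but I have not measured that.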

I haven't tested the latency of sending it over the network or the query
performance, but since the query is very similar to MatchAllDocs, it is
probably very fast (and I know that sending many MBs to Solr is fast as
well).

I know this is not exactly a 'standard' solution, and it is probably not
something you want to see with hundreds of millions of docs, but people
seem to be doing 'not the right thing' all the time ;)
So if you think this is something useful for the community, please let me
know. If somebody is willing to test it, I can file a JIRA ticket.

Thanks!

Roman


The code, if no JIRA is needed, can be found here:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java

839ms.  run
154ms.  Building random bitset indexSize=10000000 fill=0.5 -- Size=15054208,cardinality=3934477 highestBit=9999999
 25ms.  Converting bitset to byte array -- resulting array length=1250000
 20ms.  Encoding byte array into base64 -- resulting array length=1666668 ratio=1.3333344
 62ms.  Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816
 20ms.  Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432
  5ms.  Decoding gzipped byte array from base64
 14ms.  Uncompressing decoded byte array
 68ms.  Converting from byte array to bitset
743ms.  running


On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Not necessarily. If the auth tokens are available on some
> other system (DB, LDAP, whatever), one could get them
> in the PostFilter and cache them somewhere since,
> presumably, they wouldn't be changing all that often. Or
> use a UserCache and get notified whenever a new searcher
> was opened and regenerate or purge the cache.
>
> Of course you're right if the post filter does NOT have
> access to the source of truth for the user's privileges.
>
> FWIW,
> Erick
>
> On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
> <otis.gospodne...@gmail.com> wrote:
> > Hi,
> >
> > The unfortunate thing about this is what you still have to *pass* that
> > filter from the client to the server every time you want to use that
> > filter.  If that filter is big/long, passing that in all the time has
> > some price that could be eliminated by using "server-side named
> > filters".
> >
> > Otis
> > --
> > Solr & ElasticSearch Support
> > http://sematext.com/
> >
> >
> >
> >
> >
> > On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> You might consider "post filters". The idea
> >> is to write a custom filter that gets applied
> >> after all other filters etc. One use-case
> >> here is exactly ACL lists, and can be quite
> >> helpful if you're not doing *:* type queries.
> >>
> >> Best
> >> Erick
> >>
> >> On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
> >> <otis.gospodne...@gmail.com> wrote:
> >>> Btw. ElasticSearch has a nice feature here.  Not sure what it's
> >>> called, but I call it "named filter".
> >>>
> >>> http://www.elasticsearch.org/blog/terms-filter-lookup/
> >>>
> >>> Maybe that's what OP was after?
> >>>
> >>> Otis
> >>> --
> >>> Solr & ElasticSearch Support
> >>> http://sematext.com/
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
> >>> <arafa...@gmail.com> wrote:
> >>>> On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov <ivkus...@gmail.com> wrote:
> >>>>> So I'm using query like
> >>>>>
> >>>>> http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29
> >>>>
> >>>> If the IDs are purely numeric, I wonder if the better way is to send a
> >>>> bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000
> >>>> is included. Even using URL-encoding rules, you can fit at least 65
> >>>> sequential ID flags per character and I am sure there are more
> >>>> efficient encoding schemes for long empty sequences.
> >>>>
> >>>> Regards,
> >>>>    Alex.
> >>>>
> >>>>
> >>>>
> >>>> Personal website: http://www.outerthoughts.com/
> >>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> >>>> - Time is the quality of nature that keeps events from happening all
> >>>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> >>>> book)
>
