Re: Bloom filters

aa mm Mon, 24 Jul 2017 06:07:20 -0700

Hi guys! I've open-sourced the bloom filter server.

https://github.com/assafmo/BloomREST


Hope this will help some of you.

I also discovered that CouchDB cannot listen for view changes directly; you
can only listen for DB changes and then query the relevant view. This is far
from ideal for updating the bloom filter online, because the view may take
some time to index a DB change. Simply querying the view whenever a DB change
arrives is not good enough, so some extra logic is needed around the whole
operation.
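
One possible shape for that extra logic, offered only as a sketch and not as what BloomREST does: read the changed documents from the changes feed itself (the feed accepts include_docs=true) and derive the view keys in client code, so the filter never waits on the view index. This assumes the view's map function is simple enough to mirror in Python. The names example_map, keys_from_change and apply_changes are hypothetical, and the filter is anything with an add() method.

```python
from typing import Callable, Iterable, Iterator, Optional


def example_map(doc: dict) -> Iterator[str]:
    """Hypothetical Python mirror of a view's map function: emit doc["user"]."""
    if "user" in doc:
        yield doc["user"]


def keys_from_change(change: dict, map_fn: Callable[[dict], Iterable[str]]) -> Iterator[str]:
    """Yield the keys a single changes-feed row would add to the view."""
    # A bloom filter cannot forget keys, so deletions are simply skipped.
    if change.get("deleted"):
        return
    doc = change.get("doc")  # present only when the feed was read with include_docs=true
    if doc is None:
        return
    yield from map_fn(doc)


def apply_changes(changes: Iterable[dict], map_fn, bloom) -> Optional[str]:
    """Add the keys of every change to the filter; return the last seq seen."""
    last_seq = None
    for change in changes:
        for key in keys_from_change(change, map_fn):
            bloom.add(key)
        # Persist this value to resume the feed with since=<last_seq> later.
        last_seq = change.get("seq")
    return last_seq
```

The trade-off is that the map logic now lives in two places, so any change to the view has to be repeated in the mirror.
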

I'd love to hear your thoughts about this. ☺️

On 16 July 2017 at 23:22, "aa mm" <[email protected]> wrote:

> Hi guys.
>
> For a couple of months now I've been using the bulk API to query a lot of
> data from my databases. I have some databases with hundreds of millions of
> documents and a few with billions of documents. All in all, about 10TB of
> hard disk is used.
>
> I'm on CouchDB 2.0 in single-node mode.
>
> Sometimes querying for 1000-2000 keys at once can take up to 150 seconds.
> Especially with reduce=true, group=true and include_docs=true. I found that
> ~80% of the query keys are unknown to my databases.
>
> What I've discovered is that using bloom filters I can reduce query times
> in these situations to ~2-3 seconds!
>
> The general flow of my setup is as follows:
> 1. Get all the keys of a view (e.g. curl "$view_url" -G -d reduce=false |
> awk -F '"' '{print $6}' > keys)
> 2. Build a bloom filter for this view. This can be very large. In some of
> my views I use this configuration:
> https://hur.st/bloomfilter?n=20000000000&p=1e-7
> which cannot cheaply be stored in memory. This is why I used this library,
> https://github.com/axiak/pybloomfiltermmap, which uses mmap and is memory
> efficient. (I should probably use p=1e-4 or p=1e-3, because a false
> positive is okay here.)
> 3. When a query of multiple keys comes along, use CouchDB bulk API only on
> the keys that can be found in the bloom filter.
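
The three steps above can be sketched roughly as follows. To stay self-contained, this uses a tiny in-memory filter (TinyBloom, a hypothetical stand-in written for this example) in place of pybloomfiltermmap, and prefilter is likewise a made-up helper name. Only the membership test before the bulk query is the point here.

```python
import hashlib
import math


class TinyBloom:
    """Minimal in-memory bloom filter; a stand-in for an mmap-backed one."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k bit positions from two 64-bit halves of SHA-256.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def prefilter(keys, bloom):
    """Keep only keys the filter may contain; definite misses are dropped."""
    return [key for key in keys if key in bloom]


# Steps 1 and 2: populate the filter with every key emitted by the view.
bloom = TinyBloom(capacity=1000, error_rate=1e-3)
for view_key in ["alice", "bob", "carol"]:
    bloom.add(view_key)

# Step 3: send only the surviving keys to the view as the "keys" list of the bulk query.
candidates = prefilter(["alice", "mallory", "carol", "trent"], bloom)
```

A false positive only costs one wasted key in the bulk request, while a bloom filter never produces a false negative, so no existing key is dropped.
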
>
> This has worked pretty well for me, but the downside is obviously step 1 -
> getting all the keys - which takes a lot of time. A more efficient
> solution would be to use the changes API. This is my next plan.
>
> It would be great if this were part of CouchDB (checking whether a key
> exists in a bloom filter before querying the database), but in the meantime
> I'm just sharing my experience. Maybe someone will find it useful.
>
> I wrote an HTTP REST wrapper for https://github.com/axiak/pybloomfiltermmap,
> so that it is independent of my business logic code and I can query it
> remotely. I also wrote an efficient command line tool to create and
> populate a bloom filter using https://github.com/axiak/pybloomfiltermmap.
>
> I'll open source my code in the near future.
>
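
On the sizing remark in step 2 of the quoted message: the textbook bloom filter formulas give a quick way to compare the candidate error rates. This is a sketch using those formulas only; the function name bloom_size is made up for the example, and a real library may round differently. With n = 20000000000 it comes to roughly 84 GB at p=1e-7 against roughly 36 GB at p=1e-3.

```python
import math


def bloom_size(n: int, p: float):
    """Return (bits, hash_count) for n keys at false-positive rate p."""
    bits = -n * math.log(p) / (math.log(2) ** 2)  # m = -n ln(p) / (ln 2)^2
    hashes = bits / n * math.log(2)               # k = (m / n) ln 2
    return math.ceil(bits), round(hashes)


# Compare the rate from the quoted configuration with the two looser ones.
for p in (1e-7, 1e-4, 1e-3):
    bits, k = bloom_size(20_000_000_000, p)
    print(f"p={p:g}: {bits / 8 / 1e9:.1f} GB, {k} hash functions")
```

Loosening p also lowers the number of hash functions, so each lookup touches fewer pages of the mmap'd file.
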
