You are right. We can use counting bloom filters or cuckoo filters for
deletions.
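
To make the counting variant concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from my setup: the class name, the 8-bit counters and the SHA-256 double hashing are all assumptions chosen for brevity.

```python
import hashlib
import math


class CountingBloomFilter:
    """Bloom filter variant with one small counter per slot, so keys can be removed.

    Hypothetical illustration; names and parameters are assumptions.
    """

    def __init__(self, capacity, error_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 slots, k = (m / n) * ln 2 hashes.
        self.num_slots = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_slots / capacity * math.log(2)))
        self.counters = bytearray(self.num_slots)  # one 8-bit counter per slot

    def _slots(self, key):
        # Double hashing: derive k slot indexes from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_slots for i in range(self.num_hashes)]

    def add(self, key):
        for slot in self._slots(key):
            if self.counters[slot] < 255:  # saturate instead of overflowing
                self.counters[slot] += 1

    def remove(self, key):
        # Only keys that appear to be present are removed; removing a key that
        # was never added would corrupt the counters of other keys.
        if key not in self:
            return False
        for slot in self._slots(key):
            if 0 < self.counters[slot] < 255:  # saturated counters stay stuck
                self.counters[slot] -= 1
        return True

    def __contains__(self, key):
        return all(self.counters[slot] for slot in self._slots(key))


cbf = CountingBloomFilter(capacity=1000, error_rate=1e-3)
cbf.add("doc-1")
cbf.add("doc-2")
removed = cbf.remove("doc-1")
```

The price is memory: each slot holds a counter instead of a single bit, so this sketch uses eight times the space of a plain bloom filter with the same number of slots.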

In my case I don't have deletions so I just need to add new keys to the
bloom filter when I make an insertion.
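
Roughly, the insert path and the query-time check look like the sketch below. This is a small in-memory stand-in written for illustration, not pybloomfiltermmap and not my actual code; `BloomFilter`, `record_insert` and `keys_worth_querying` are hypothetical names, and the example keys are made up.

```python
import hashlib
import math


class BloomFilter:
    """Plain in-memory bloom filter (illustrative stand-in, not pybloomfiltermmap)."""

    def __init__(self, capacity, error_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def record_insert(bloom, key):
    """Call alongside every document insert so the filter never misses a stored key."""
    bloom.add(key)


def keys_worth_querying(bloom, query_keys):
    """Drop keys the filter says are definitely absent; the rest go to the bulk API."""
    return [key for key in query_keys if key in bloom]


bloom = BloomFilter(capacity=10_000, error_rate=1e-3)
for key in ("user:1", "user:2", "user:3"):
    record_insert(bloom, key)

# Known keys are always kept; unknown keys survive only as rare false positives.
candidates = keys_worth_querying(bloom, ["user:2", "user:404", "user:3", "user:999"])
```

With the same sizing formula, n = 20,000,000,000 and p = 1e-7 works out to about 33.5 bits per key, roughly 84 GB, while p = 1e-3 needs about 14.4 bits per key, roughly 36 GB.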

On 17 July 2017 at 12:04, "Carlos Alonso" <carlos.alo...@cabify.com> wrote:

> Hi, this is a very interesting topic and I agree it would be lovely to have
> something like this in CouchDB, however I have one concern. How do you
> handle deletions? Bloom filters have the disadvantage that you cannot
> delete records as you don't know whether you may be affecting other
> records. The more deletions you have, the more false positives your filter
> will produce and the more it will penalise performance.
>
> Aside from that I think that's a very good idea and I'd love to collaborate
> on adding it into Couch if possible.
>
> Regards
>
> On Sun, Jul 16, 2017 at 10:23 PM aa mm <assaf.mor...@gmail.com> wrote:
>
> > Hi guys.
> >
> > For a couple of months now I've been using the bulk API to query a lot of
> > data from my databases. I have some databases with hundreds of millions of
> > documents and a few with billions of documents. All in all, about 10TB of
> > hard disk is used.
> >
> > I'm on CouchDB 2.0 in single-node mode.
> >
> > Sometimes querying for 1000-2000 keys at once can take up to 150 seconds,
> > especially with reduce=true, group=true and include_docs=true. I found that
> > ~80% of the query keys are unknown to my databases.
> >
> > What I've discovered is that using bloom filters I can reduce query times
> > in these situations to ~2-3 seconds!
> >
> > The general flow of my setup is as follows:
> > 1. Get all the keys of a view (e.g. curl "$view_url" -G -d reduce=false |
> > awk -F '"' '{print $6}' > keys)
> > 2. Build a bloom filter for this view. This can be very large. In some of
> > my views I use this configuration -
> > https://hur.st/bloomfilter?n=20000000000&p=1e-7 - which cannot cheaply be
> > stored in memory. This is why I used this library -
> > https://github.com/axiak/pybloomfiltermmap - which uses mmap and is memory
> > efficient. (I should probably use p=1e-4 or p=1e-3, because a false positive
> > is okay here.)
> > 3. When a query of multiple keys comes along, use the CouchDB bulk API only on
> > the keys that can be found in the bloom filter.
> >
> > This has worked pretty well for me, but the downside is obviously step 1 -
> > getting all the keys - which takes a lot of time. A more efficient
> > solution would be to use the changes API. This is my next plan.
> >
> > It would be great if this were part of CouchDB (check if a key exists in a
> > bloom filter before querying the database), but in the meantime I'm just
> > sharing my experience. Maybe someone will find it useful.
> >
> > I wrote an HTTP REST wrapper for https://github.com/axiak/pybloomfiltermmap
> > so that it is independent of my business logic code and I can query it
> > remotely. I also wrote an efficient command line tool to create and
> > populate a bloom filter using https://github.com/axiak/pybloomfiltermmap
> >
> > I'll open source my code in the near future.
> >
> --
> *Carlos Alonso*
> Data Engineer
> Madrid, Spain
>
> carlos.alo...@cabify.com
