You are right. We can use counting bloom filters or cuckoo filters for deletions.
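For anyone who does need deletions, the idea behind a counting bloom filter is to replace each bit with a small counter, so removing a key just decrements its slots. A minimal sketch (class name, slot layout and the double-hashing scheme here are illustrative, not taken from any particular library):

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter variant supporting deletion: each slot holds a
    counter instead of a single bit."""

    def __init__(self, num_slots: int, num_hashes: int):
        self.counters = [0] * num_slots
        self.num_hashes = num_hashes

    def _indexes(self, key: str):
        # Derive the k slot indexes from one digest via double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % len(self.counters)
                for i in range(self.num_hashes)]

    def add(self, key: str):
        for i in self._indexes(key):
            self.counters[i] += 1

    def remove(self, key: str):
        # Only safe for keys that were actually added; removing a key
        # that was never inserted can create false negatives.
        for i in self._indexes(key):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def might_contain(self, key: str) -> bool:
        return all(self.counters[i] > 0 for i in self._indexes(key))
```

The cost is memory: with the typical 4-bit counters, a counting filter is roughly four times the size of a plain bloom filter with the same parameters, which is why cuckoo filters (deletions at close to plain-bloom space) are often the better trade-off.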
In my case I don't have deletions, so I just need to add new keys to the bloom filter when I make an insertion.

On 17 July 2017 at 12:04, Carlos Alonso <carlos.alo...@cabify.com> wrote:

> Hi, this is a very interesting topic and I agree it would be lovely to have
> something like this in CouchDB, however I have one concern. How do you
> handle deletions? Bloom filters have the disadvantage that you cannot
> delete records, as you don't know whether you may be affecting other
> records. The more deletions you have, the more false positives your filter
> will produce and the more it will penalise performance.
>
> Aside from that I think that's a very good idea and I'd love to collaborate
> on adding it into Couch if possible.
>
> Regards
>
> On Sun, Jul 16, 2017 at 10:23 PM aa mm <assaf.mor...@gmail.com> wrote:
>
> > Hi guys.
> >
> > For a couple of months now I've been using the bulk API to query a lot
> > of data from my databases. I have some databases with hundreds of
> > millions of documents and a few with billions of documents. All in all,
> > about 10 TB of hard disk is used.
> >
> > I'm on 2.0 in single-node mode.
> >
> > Sometimes querying for 1000-2000 keys at once can take up to 150
> > seconds, especially with reduce=true, group=true and include_docs=true.
> > I found that ~80% of the query keys are unknown to my databases.
> >
> > What I've discovered is that using bloom filters I can reduce query
> > times in these situations to ~2-3 seconds!
> >
> > The general flow of my setup is as follows:
> > 1. Get all the keys of a view (e.g. curl "$view_url" -G -d reduce=false |
> >    awk -F '"' '{print $6}' > keys)
> > 2. Build a bloom filter for this view. This can be very large. In some
> >    of my views I use this configuration -
> >    https://hur.st/bloomfilter?n=20000000000&p=1e-7 - which cannot
> >    cheaply be stored in memory. This is why I used this library -
> >    https://github.com/axiak/pybloomfiltermmap - which uses mmap and is
> >    memory efficient.
> >    (I probably should use p=1e-4 or p=1e-3, because a false positive
> >    is okay here.)
> > 3. When a query of multiple keys comes along, use the CouchDB bulk API
> >    only on the keys that can be found in the bloom filter.
> >
> > This has worked pretty well for me, but the downside is obviously
> > step 1 - getting all the keys - which takes a lot of time. A more
> > efficient solution would be to use the changes API. This is my next
> > plan.
> >
> > It would be great if this was part of CouchDB (check if a key exists
> > in a bloom filter before querying the database), but in the meantime
> > I'm just sharing my experience. Maybe someone will find it useful.
> >
> > I wrote an HTTP REST wrapper for
> > https://github.com/axiak/pybloomfiltermmap, so it'll be independent of
> > my business logic code and I can query it remotely. I also wrote an
> > efficient command-line tool to create and populate a bloom filter
> > using https://github.com/axiak/pybloomfiltermmap.
> >
> > I'll open source my code in the near future.
>
> --
> Carlos Alonso
> Data Engineer
> Madrid, Spain
> carlos.alo...@cabify.com
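To put numbers on the p=1e-7 versus p=1e-3 trade-off discussed above, the standard sizing formulas m = -n * ln(p) / (ln 2)^2 and k = (m/n) * ln 2 can be evaluated for the linked n=20000000000 configuration:

```python
import math

def bloom_size_bits(n: int, p: float) -> float:
    """Optimal bloom filter size in bits for n keys at false-positive rate p."""
    return -n * math.log(p) / (math.log(2) ** 2)

def bloom_num_hashes(n: int, p: float) -> float:
    """Optimal number of hash functions for that size."""
    return (bloom_size_bits(n, p) / n) * math.log(2)

n = 20_000_000_000  # the n=2e10 configuration linked above
for p in (1e-7, 1e-3):
    bits = bloom_size_bits(n, p)
    print(f"p={p}: {bits / 8 / 2**30:.1f} GiB, "
          f"k={bloom_num_hashes(n, p):.0f} hashes")
```

This works out to roughly 78 GiB at p=1e-7 versus about 34 GiB at p=1e-3 (and 23 hash functions versus 10), which supports the point in the thread: when a false positive only costs one wasted bulk-API lookup, a looser p is a large saving in both memory and hashing work.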