Re: [BloomFilters] changes to BloomFilter

2020-05-10 Thread Gilles Sadowski
Hi. 2020-05-10 18:38 UTC+02:00, Claude Warren : > I keep wondering if Bloom filters belong in Collections. They are not a > collection in the standard sense of the word. If it does not depend on [Collections] code and it provides a functionality that can be used without [Collections], then IMO

Re: [BloomFilters] changes to BloomFilter

2020-05-10 Thread Claude Warren
I keep wondering if Bloom filters belong in Collections. They are not a collection in the standard sense of the word. Would it make more sense to spit it out as a new Commons project? How does one even go about that? On Wed, Apr 22, 2020 at 5:37 PM Alex Herbert wrote: > On Wed, 22 Apr 2020

Re: [BloomFilters] changes to BloomFilter

2020-04-22 Thread Alex Herbert
On Wed, 22 Apr 2020 at 17:17, Gilles Sadowski wrote: > > > - Does the BF code reuse Commons Collection utilities? > > Yes, in class "HasherBloomFilter": > import org.apache.commons.collections4.iterators.EmptyIterator; > import org.apache.commons.collections4.iterators.IteratorChain; > Missed

Re: [BloomFilters] changes to BloomFilter

2020-04-22 Thread Alex Herbert
On Wed, 22 Apr 2020 at 13:56, Gary Gregory wrote: > Hi All, > > I'd like to pat ourselves on the back for a smooth ride so far in bringing > in new code for a new non-trivial feature :-) > > IMO, let's not get hung up on generics for the first cut of the Bloom > Filter code. Why? Because

Re: [BloomFilters] changes to BloomFilter

2020-04-22 Thread Gilles Sadowski
Hello. Le mer. 22 avr. 2020 à 14:56, Gary Gregory a écrit : > > Hi All, > > I'd like to pat ourselves on the back for a smooth ride so far in bringing > in new code for a new non-trivial feature :-) > > IMO, let's not get hung up on generics for the first cut of the Bloom > Filter code. Why?

Re: [BloomFilters] changes to BloomFilter

2020-04-22 Thread Gary Gregory
Hi All, I'd like to pat ourselves on the back for a smooth ride so far in bringing in new code for a new non-trivial feature :-) IMO, let's not get hung up on generics for the first cut of the Bloom Filter code. Why? Because generics are erased by the compiler and we can always add them later

Re: [BloomFilters] changes to BloomFilter

2020-04-22 Thread Claude Warren
Bloom filters should not use generics. That has been my stated opinion. They are not like other collections in that you don't get out what you put in. They are collections of hashes so the idea that generics should be used to somehow define what goes in is misleading. If commons-collections is

Re: [BloomFilters] changes to BloomFilter

2020-04-21 Thread Gilles Sadowski
Hello. > > [...] > > > > Attempting to re-awaken this thread. > > IMO the bloomfilter package is currently under development. The original > contributor was working through submitting Bloom filter functionality in > parts. My involvement was to ensure the current code that was initially >

Re: [BloomFilters] changes to BloomFilter

2020-04-21 Thread Alex Herbert
> On 23 Mar 2020, at 10:13, Alex Herbert wrote: > > > >> On 22 Mar 2020, at 17:44, Claude Warren > > wrote: >> >> Sorry it has taken me so long to respond, but I spent a long time thinking >> about this and started a response several times. >> >> I see no reason

Re: [BloomFilters] changes to BloomFilter

2020-03-23 Thread Alex Herbert
> On 22 Mar 2020, at 17:44, Claude Warren wrote: > > Sorry it has taken me so long to respond, but I spent a long time thinking > about this and started a response several times. > > I see no reason not to go to long bit indexes. This would be reflected in > Shape.getNumberOfBits returning a

Re: [BloomFilters] changes to BloomFilter

2020-03-22 Thread Claude Warren
Sorry it has taken me so long to respond, but I spent a long time thinking about this and started a response several times. I see no reason not to go to long bit indexes. This would be reflected in Shape.getNumberOfBits returning a long as well as other places. I would like to revisit some

Re: [BloomFilters] changes to BloomFilter

2020-03-20 Thread Alex Herbert
Sorry for an absence. I have been thinking on ways to move the BloomFilter API forward that consolidates the current functionality but makes it simpler to use for the common case. > On 18 Mar 2020, at 17:12, Claude Warren wrote: > > bf.getBits() * Long.BYTES may be as long as Math.Ceil( >

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
bf.getBits() * Long.BYTES may be as long as Math.Ceil( Shape.getNumberOfBits() / 8.0 ) or it may be shorter. I am building byte buffers of fixed length that is the maximum size that any valid bf.getBits() * Long.BYTES I need to know Shape.getNumberOfBytes(). The conversion is required for some

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Alex Herbert
> On 18 Mar 2020, at 14:39, Claude Warren wrote: > >>> Shape Discussion: >>> >>> as for getNumberOfBytes() it should return the maximum number of bytes >>> returned by a getBits() call to a filter with this shape. So yes, if > there >>> is a compressed internal representation, no it won't

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
>> Shape Discussion: >> >> as for getNumberOfBytes() it should return the maximum number of bytes >> returned by a getBits() call to a filter with this shape. So yes, if there >> is a compressed internal representation, no it won't be that. It is a >> method on Shape so it should literally be

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
We are getting to the point where there are a lot of options that determine which implementation is "best". We could take a stab at creating a BloomFIlterFactory that takes a Shape as an argument and does a "finger in the air" guestimate of which implementation best fits. Store values in long

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
You don't need Iterator iterator() as we have forEachCount( BitCountConsumer ) I guess we need something like add( Iterator) or add( Collection ) or add( Stream ) It would be nice if we could have a BitCountProducer class that we could just pass to an add() method. On Wed, Mar 18, 2020 at

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Alex Herbert
> On 18 Mar 2020, at 11:14, Claude Warren wrote: > > On a slightly different note. CountingBloomFilters have no way to perform > a reload. All other bloom filters you can dump the bits and reload > (trivial) but if you preserve the counts from a bloom filter and want to > reload them you

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Alex Herbert
> On 17 Mar 2020, at 22:34, Claude Warren wrote: > > Builder discussion: > > Let's go with > >>> Builder with(CharSequence, Charset); >>> Builder withUnencoded(CharSequence); Added to master. I already note that not having it mandate UTF-8 is annoying. I had to include StandardCharsets in

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
On a slightly different note. CountingBloomFilters have no way to perform a reload. All other bloom filters you can dump the bits and reload (trivial) but if you preserve the counts from a bloom filter and want to reload them you can't. We need a constructor that takes the index,count pairs

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Claude Warren
Builder discussion: Let's go with >> Builder with(CharSequence, Charset); >> Builder withUnencoded(CharSequence); Shape Discussion: as for getNumberOfBytes() it should return the maximum number of bytes returned by a getBits() call to a filter with this shape. So yes, if there is a compressed

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Alex Herbert
> On 17 Mar 2020, at 17:06, Claude Warren wrote: > > On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert > wrote: > >> >> >>> On 17 Mar 2020, at 15:41, Claude Warren wrote: >>> >>> I agree with the HashFunction changes. >> >> OK, but which ones? >> > > DOH! this one... > >> >> Changing

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Claude Warren
On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert wrote: > > > > On 17 Mar 2020, at 15:41, Claude Warren wrote: > > > > I agree with the HashFunction changes. > > OK, but which ones? > DOH! this one... > > Changing HashFunction to have two methods: > > long hash(byte[]) > long increment(int seed)

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Alex Herbert
> On 17 Mar 2020, at 15:41, Claude Warren wrote: > > I agree with the HashFunction changes. OK, but which ones? Changing HashFunction to have two methods: long hash(byte[]) long increment(int seed) Or just correctly documenting what we have. And/or fixing the ObjectsHashIterative to be

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Claude Warren
I agree with the HashFunction changes. The only place I know of where more items are added to a hasher is in our test code. So I think just requiring Builder.build() to do a reset is correct solution. I think Builder should have with(byte[]) with(byte[], int offset, int len ) with(String) I

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Alex Herbert
> On 17 Mar 2020, at 11:08, Claude Warren wrote: > > On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert > > wrote: > >> >>> The Shape tells the hasher how many hash functions to apply to each item. >> >> OK. This is may misunderstanding. There is a contract that

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Claude Warren
On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert wrote: > > > The Shape tells the hasher how many hash functions to apply to each item. > > OK. This is may misunderstanding. There is a contract that the Hasher is > expected to fulfil but it is just not recorded in the javadoc. I can update > the

Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Alex Herbert
Another item: ObjectsHashIterative is marked as Signedness.SIGNED. The computation is done using 32-bit integers. So the long output can be negative but the upper 32-bits are always either entirely 0 or entirely 1. I think this is a candidate for converting to an unsigned long and allowing it

Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Alex Herbert
> On 16 Mar 2020, at 18:58, Claude Warren wrote: > > First I think that the hasher.getBits( Shape ) should be renamed to > iterator( Shape ). It was poorly named at the start. I can fix that. > > By definition a Hasher knows how many items it is going to insert. > > The Shape tells the

Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Claude Warren
First I think that the hasher.getBits( Shape ) should be renamed to iterator( Shape ). It was poorly named at the start. By definition a Hasher knows how many items it is going to insert. The Shape tells the hasher how many hash functions to apply to each item. The Shape number of items is how

Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Alex Herbert
On 16/03/2020 07:57, Claude Warren wrote: I made a quick pass at changing getHasher() to iterator(). A look at the feasibility or have you started work on this? If so then I'll not start work on it as well. I changed master to return a boolean for the merge operations in BloomFilter. So

Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Claude Warren
I made a quick pass at changing getHasher() to iterator(). I think we can get rid of HasherBloomFilter as its purpose was really to create a Bloom filter for temporary usage and it doesn't seem to be required if we have a hasher that can be created from a Shape and a function that creates an

Re: [BloomFilters] changes to BloomFilter

2020-03-15 Thread Alex Herbert
On Sun, 15 Mar 2020, 17:27 Claude Warren, wrote: > We have spoken elsewhere about removing getHasher() and adding iterator() > In addition should we add forEachBit( IntConsumer )?I I was thinking the same. So we provide an iterator allowing failfast on the first index that fails a criteria,