Does anyone know of a good cache testing library?

2024-04-25 Thread Claude Warren
I am looking for a cache testing library to test a new cache eviction
strategy that I have developed.  Does anybody know of one, preferably in
Java?

Claude


Re: [Collections-BloomFilter][Discuss] missing functionality?

2024-04-21 Thread Claude Warren
After looking at the LayeredBloomFilter and the LayerManager and the way it
is intended to be used I found a couple of changes that we might want to
consider.

1) WrappedBloomFilter should not implement copy(); that should be left to
the wrapping implementation.

2) We change LayerManager to be a templated type, LayerManager<T extends
BloomFilter>, that contains a List<T>, and add two methods, first() and
last(), to retrieve the first and last internal filters respectively.
These are common queries and should be directly implemented.

3) We change LayeredBloomFilter to be a templated type,
LayeredBloomFilter<T extends BloomFilter>, where T is the type of
BloomFilter to be used for each layer.

For example the use of LayeredBloomFilter for KIP-936 requires a
BloomFilter with an associated timestamp.  So when copy() is called for
duplication it must create the same BloomFilter.  It uses
WrappedBloomFilter to add the timestamp property.  The LayeredBloomFilter
in KIP-936 actually stores a custom TimestampedBloomFilter implementation
that extends WrappedBloomFilter.

This is a rather deep change to LayeredBloomFilter, but I think Alex is
correct that we should use List<T>; that usage dictates that we use T
extends BloomFilter as the type.
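
For illustration, the proposed typing might look something like the
following. This is a sketch only; the names and bodies are illustrative,
not released API:

import java.util.ArrayList;
import java.util.List;
import org.apache.commons.collections4.bloomfilter.BloomFilter;

class LayerManager<T extends BloomFilter> {
    private final List<T> filters = new ArrayList<>();

    T first() {
        return filters.get(0);                   // oldest layer
    }

    T last() {
        return filters.get(filters.size() - 1);  // newest layer
    }
}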

Claude



On Sat, Apr 20, 2024 at 3:00 PM Alex Herbert 
wrote:

> On Sat, 20 Apr 2024 at 11:36, Claude Warren  wrote:
>
> > The LayeredBloomFilter has a method find() that returns an array of ints
> > that are the indices into the layer array.  This is easily reproducible
> > using an iterator.
> > There is also a get() method that takes an integer argument (an index of
> > the bloom filter) and returns the Bloom filter from the layer.  This is
> > reproducible but not as efficient using an iterator.
> >
>
> Note that using an iterator is how the LinkedList would do this. We are not
> using an ArrayList with Order(1) retrieval time.
> So performance of the current implementation here is already Order(n).
>
>
> > I think the array is the proper structure.
> >
>
> If an array is the correct structure then we should type to List. Typing to
> an interface allows changing the implementation with no effect on the API.
> For example changing to a concurrent data structure.
>
> The question is what is the most important functionality? Order(1) add and
> remove from the head and tail of the layers (Deque) or Order(1) retrieval
> of layers by depth index (List).
>
> Re: find
>
> It currently returns the layer index. What would you do with this other
> than call get(index) to get the BloomFilter. We could support this with a
> different find method:
>
> // find each bloom filter layer that contains the IndexProducer
> // API can be changed if this is required to return as soon as a filter is
> // found that satisfies some requirement (Predicate vs Consumer).
> public void findEach(IndexProducer indexProducer, Consumer<BloomFilter> result)
>
> Why else do you require indices of layers? If there is no use case
> other than layer retrieval then this seems to be the wrong API.
>
> Alex
>
>
>
> > Claude
> >
> > On Fri, Apr 19, 2024 at 11:06 AM Alex Herbert 
> > wrote:
> >
> > > On Fri, 19 Apr 2024 at 08:26, Claude Warren  wrote:
> > >
> > > > While the Deque makes clear the idea of enqueueing and dequeueing
> the
> > > > layers it does not have the method to natively traverse and extract
> > > entries
> > > > from the middle of the queue.  Nor would I expect it to.  So I think
> > the
> > > > Deque does not accurately reflect how the collection of Bloom filters
> > is
> > > > utilized.
> > > >
> > >
> > > You can traverse and remove entries with the Iterator of the Deque:
> > >
> > > Deque<Integer> d = new LinkedList<>();
> > > d.addAll(Arrays.asList(1, 2, 3, 4, 5));
> > > for (Iterator<Integer> it = d.iterator(); it.hasNext();) {
> > >     int i = it.next();
> > >     if (i == 3) {
> > >         it.remove();
> > >     }
> > > }
> > > System.out.println(d);
> > >
> > > prints:
> > >
> > > [1, 2, 4, 5]
> > >
> > > So it is easy to iterate the layers and remove them in Order(1) time
> (per
> > > removal).
> > >
> > > Alex
> > >
> > >
> > > >
> > > > On Wed, Apr 17, 2024 at 2:17 PM Alex Herbert <
> alex.d.herb...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Looks good to me.
> > > > >
> > > > > Any opinions on changing the LayerManager to keep the layers in a
> > Deque
> > > > > rather than a LinkedList. I think it would only require a change to
> > the
> > > > > following method:
> > > > >
> > > > > public final BloomFilter get(int depth)
> > > > >
> > > > > Performance will be the same as the Deque can be a LinkedList. This
> > is
> > > > more
> > > > > about how any custom downstream code is currently using the
> > collection
> > > of
> > > > > layers.
> > > > >
> > > > > Alex
> > > >
> > > >
> > >
> >
> >
> > --
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [Collections-BloomFilter][Discuss] missing functionality?

2024-04-20 Thread Claude Warren
The LayeredBloomFilter has a method find() that returns an array of ints
that are the indices into the layer array.  This is easily reproducible
using an iterator.
There is also a get() method that takes an integer argument (an index of the
bloom filter) and returns the Bloom filter from the layer.  This is
reproducible but not as efficient using an iterator.

I think the array is the proper structure.
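
For comparison, retrieving a layer by depth through an iterator looks
something like this (a sketch, assuming the layers were held in a Deque):

// O(depth) walk, versus O(1) for an array or ArrayList
static BloomFilter getLayer(Deque<BloomFilter> layers, int depth) {
    Iterator<BloomFilter> it = layers.iterator();
    BloomFilter bf = it.next();
    for (int i = 0; i < depth; i++) {
        bf = it.next();  // throws NoSuchElementException if depth is out of range
    }
    return bf;
}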

Claude

On Fri, Apr 19, 2024 at 11:06 AM Alex Herbert 
wrote:

> On Fri, 19 Apr 2024 at 08:26, Claude Warren  wrote:
>
> > While the Deque makes clear the idea of enqueueing and dequeueing  the
> > layers it does not have the method to natively traverse and extract
> entries
> > from the middle of the queue.  Nor would I expect it to.  So I think the
> > Deque does not accurately reflect how the collection of Bloom filters is
> > utilized.
> >
>
> You can traverse and remove entries with the Iterator of the Deque:
>
> Deque<Integer> d = new LinkedList<>();
> d.addAll(Arrays.asList(1, 2, 3, 4, 5));
> for (Iterator<Integer> it = d.iterator(); it.hasNext();) {
>     int i = it.next();
>     if (i == 3) {
>         it.remove();
>     }
> }
> System.out.println(d);
>
> prints:
>
> [1, 2, 4, 5]
>
> So it is easy to iterate the layers and remove them in Order(1) time (per
> removal).
>
> Alex
>
>
> >
> > On Wed, Apr 17, 2024 at 2:17 PM Alex Herbert 
> > wrote:
> >
> > > Looks good to me.
> > >
> > > Any opinions on changing the LayerManager to keep the layers in a Deque
> > > rather than a LinkedList. I think it would only require a change to the
> > > following method:
> > >
> > > public final BloomFilter get(int depth)
> > >
> > > Performance will be the same as the Deque can be a LinkedList. This is
> > more
> > > about how any custom downstream code is currently using the collection
> of
> > > layers.
> > >
> > > Alex
> >
> >
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: Is there a blog for commons?

2024-04-19 Thread Claude Warren
I was really looking for a way to reach out to developers who do not know
that there is a Bloom filter implementation and may not know how they could
use one.  Once someone knows, they'll look in the project documentation
and/or javadoc (so I have no issue putting the info there as well).

On Fri, Apr 19, 2024 at 2:43 PM Gary Gregory  wrote:

> How about putting this in the Javadoc at the package level in
> package-info.java?
>
> Closer to the code is more likely to be maintained than out in the wild on
> a blog.
>
> Gary
>
> On Fri, Apr 19, 2024, 8:29 AM Gilles Sadowski 
> wrote:
>
> > Le ven. 19 avr. 2024 à 13:05, Gary Gregory  a
> > écrit :
> > >
> > > I think there are three places today this type of information can live
> > > within Apache:
> > >
> > > - the component website (which we can publish whenever we want)
> > > - the project wiki (which automatically is live)
> > > - https://news.apache.org/ (not sure how one posts there)
> >
> > I beg to differ.
> > From what I understand, this would be much more useful in a
> > documentation section of the component (as a practical way to
> > learn how to use non-trivial functionality implemented there).
> >
> > [If the primary purpose is personal advertisement, there is no
> > real point discussing it on the Commons "dev" ML.]
> >
> > Regards,
> > Gilles
> >
> > >
> > > [...]
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
> >
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [Collections-BloomFilter][Discuss] missing functionality?

2024-04-19 Thread Claude Warren
While the Deque makes clear the idea of enqueueing and dequeueing  the
layers it does not have the method to natively traverse and extract entries
from the middle of the queue.  Nor would I expect it to.  So I think the
Deque does not accurately reflect how the collection of Bloom filters is
utilized.

On Wed, Apr 17, 2024 at 2:17 PM Alex Herbert 
wrote:

> Looks good to me.
>
> Any opinions on changing the LayerManager to keep the layers in a Deque
> rather than a LinkedList. I think it would only require a change to the
> following method:
>
> public final BloomFilter get(int depth)
>
> Performance will be the same as the Deque can be a LinkedList. This is more
> about how any custom downstream code is currently using the collection of
> layers.
>
> Alex
>
> On Wed, 17 Apr 2024 at 10:00, Claude Warren  wrote:
>
> > I have an open pull request to fix this problem.  I could use another
> > review: https://github.com/apache/commons-collections/pull/476
> >
> >
> > On Tue, Apr 9, 2024 at 11:29 AM Claude Warren  wrote:
> >
> > > Alex,
> > >
> > > I like your solution.  To answer your question: we create a BloomFilter
> > > that has a timestamp associated with it.  When System.currentTimeMillis()
> > > exceeds the timestamp the filter is removed.  The custom cleanup
> > > calls Cleanup.removeEmptyTarget().andThen()
> > >
> > > I think that creating a cleanup() or clean() method on the
> > > LayeredBloomFilter is the appropriate solution and that it should call
> > > cleanup() on the LayerManager. (so 2 new methods, one exposed).
> > >
> > > The next() method is used when external circumstances dictate that a
> new
> > > layer should be created.  I think a StableBloomFilter I implemented
> > > required it,  but I do not have the code to hand at the moment.
> > >
> > > Claude
> > >
> > >
> > > On Tue, Apr 9, 2024 at 10:38 AM Alex Herbert  >
> > > wrote:
> > >
> > >> Hi Claude,
> > >>
> > >> Q. What is your current clean-up filter, i.e.
> > >> the Consumer<LinkedList<BloomFilter>>? I assume you are using a custom
> > >> one.
> > >>
> > >> The current collections code only has 2 functional implementations.
> One
> > >> will remove the newest filter if it is empty, one will remove the
> oldest
> > >> filters until the size is below a limit. Since neither of those will
> > >> iterate the list and purge stale objects then I assume you are using a
> > >> custom clean-up filter. So you had to have created the layer manager
> > with
> > >> your custom filter. Assuming this then there are at least two
> solutions
> > >> for
> > >> the current code:
> > >>
> > >> 1. The current implementation always calls the clean-up filter with
> the
> > >> same LinkedList since it is final. So you can capture the list and do
> > what
> > >> you want with it:
> > >>
> > >> @SuppressWarnings("rawtypes")
> > >> LinkedList[] captured = new LinkedList[1];
> > >> Consumer<LinkedList<BloomFilter>> cleanup = list -> {
> > >> captured[0] = list;
> > >> // ... do clean-up
> > >> };
> > >>
> > >> // Once you have captured the list, you can clean it when you
> > >> want:
> > >> // unchecked conversion
> > >> cleanup.accept(captured[0]);
> > >>
> > >> Obviously this is not ideal as you have to manage the captured list to
> > >> call
> > >> cleanup. But it delivers exactly what you require in terms of being
> able
> > >> to
> > >> call cleanup at any time.
> > >>
> > >> 2. The call to next() will clean the layers but also add a new layer.
> So
> > >> your custom clean method could clean stale objects and also any empty
> > >> filters not at the end of the list. This will avoid building up lots
> of
> > >> empty filters when you frequently trigger next() to purge stale
> filters.
> > >> You can call next() directly on the LayeredBloomFilter. I do not know
> > what
> > >> extend check you are using so there is some management to be done with
> > the
> > >> other settings of the LayerManager to avoid removing any functional
> > layers
> > >> which are currently empty.
> > >>
> > >> --

Re: Is there a blog for commons?

2024-04-19 Thread Claude Warren
I have what is currently a series of 4 blogs that introduce the new Bloom
filter implementations and framework in Commons Collections.  I have a
couple more in mind, they discuss what Bloom filters are and how the
Commons Collections implements them, provides extension points, and how to
implement some exotic flavors.  In fact I have an implementation for a
Kafka PID tracking problem (KIP-936) that uses layered Bloom filters to
track PIDs in a time window while handling bursty traffic and not exceeding
the desired false positive rate.

So these are rather technical posts, they could be transformed into pages
in documentation, but what I really want to do is get attention on the tool
from developers of other projects, to let them know the tools exist.  I
think that if the ASF had a technical blog we could be promoting our
projects to the wider development world.  I can think of other projects
that could have rather interesting blogs.  For example, a discussion of
Cassandra's new Accord consensus protocol could help other ASF projects
working on consensus issues.  Kafka too has a consensus protocol they are
working on.

Is there any interest in a technical blog that focus on the solutions to
technical problems within a project rather than the higher level technical
problems solved by the project.  So not that Cassandra has a solution to a
distributed database problem, but Cassandra solved the consensus problem
this way; not Kafka solves the data streaming problem, but Kafka solved the
consensus problem this way. (OK, too much consensus, but you see what I
mean).

Claude



On Wed, Apr 17, 2024 at 8:07 PM Gary Gregory  wrote:

> Well, we already have https://news.apache.org/
>
> Gary
>
> On Wed, Apr 17, 2024, 1:50 PM Elric V  wrote:
>
> > On 16/04/2024 13:08, Gary Gregory wrote:
> > > There is an Apache wide blog here:
> > > https://news.apache.org/
> >
> > There used to be a planet.apache.org which aggregated committer/project
> > blogs, but that seems to be broken.
> >
> > Would there be any interets in an aggregated ASF-project wide blog?
> > Where contributors from all projects could submit posts? Would be a
> > great way to keep up to do date, as well as learn about other ASF
> projects.
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
> >
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [Collections-BloomFilter][Discuss] missing functionality?

2024-04-17 Thread Claude Warren
I have an open pull request to fix this problem.  I could use another
review: https://github.com/apache/commons-collections/pull/476


On Tue, Apr 9, 2024 at 11:29 AM Claude Warren  wrote:

> Alex,
>
> I like your solution.  To answer your question: we create a BloomFilter
> that has a timestamp associated with it.  When System.currentTimeMillis()
> exceeds the timestamp the filter is removed.  The custom cleanup
> calls Cleanup.removeEmptyTarget().andThen()
>
> I think that creating a cleanup() or clean() method on the
> LayeredBloomFilter is the appropriate solution and that it should call
> cleanup() on the LayerManager. (so 2 new methods, one exposed).
>
> The next() method is used when external circumstances dictate that a new
> layer should be created.  I think a StableBloomFilter I implemented
> required it,  but I do not have the code to hand at the moment.
>
> Claude
>
>
> On Tue, Apr 9, 2024 at 10:38 AM Alex Herbert 
> wrote:
>
>> Hi Claude,
>>
>> Q. What is your current clean-up filter, i.e.
>> the Consumer<LinkedList<BloomFilter>>? I assume you are using a custom
>> one.
>>
>> The current collections code only has 2 functional implementations. One
>> will remove the newest filter if it is empty, one will remove the oldest
>> filters until the size is below a limit. Since neither of those will
>> iterate the list and purge stale objects then I assume you are using a
>> custom clean-up filter. So you had to have created the layer manager with
>> your custom filter. Assuming this then there are at least two solutions
>> for
>> the current code:
>>
>> 1. The current implementation always calls the clean-up filter with the
>> same LinkedList since it is final. So you can capture the list and do what
>> you want with it:
>>
>> @SuppressWarnings("rawtypes")
>> LinkedList[] captured = new LinkedList[1];
>> Consumer<LinkedList<BloomFilter>> cleanup = list -> {
>> captured[0] = list;
>> // ... do clean-up
>> };
>>
>> // Once you have captured the list, you can clean it when you
>> want:
>> // unchecked conversion
>> cleanup.accept(captured[0]);
>>
>> Obviously this is not ideal as you have to manage the captured list to
>> call
>> cleanup. But it delivers exactly what you require in terms of being able
>> to
>> call cleanup at any time.
>>
>> 2. The call to next() will clean the layers but also add a new layer. So
>> your custom clean method could clean stale objects and also any empty
>> filters not at the end of the list. This will avoid building up lots of
>> empty filters when you frequently trigger next() to purge stale filters.
>> You can call next() directly on the LayeredBloomFilter. I do not know what
>> extend check you are using so there is some management to be done with the
>> other settings of the LayerManager to avoid removing any functional layers
>> which are currently empty.
>>
>> --
>>
>> As to exposing the LayerManager and adding a clean() method to the
>> LayerManager, I think this is not in keeping with the current design. The
>> LayerManager is used during construction and then never used again. So
>> functionality to act on the layers is public through the
>> LayeredBloomFilter
>> (e.g. calling next()). So perhaps the change to the API should be to add a
>> cleanup() method to LayeredBloomFilter. This does the same as next(), but
>> does not add a new layer.
>>
>> I cannot recall the use case for next() in the LayeredBloomFilter. Would
>> the addition of cleanup() make the next() method redundant?
>>
>> --
>>
>> Note: The typing against LinkedList could be updated to java.util.Deque.
>> The only issue with this is the method:
>> public final BloomFilter get(int depth)
>>
>> This is not supported by the Deque interface. However the LinkedList
>> implementation of get(int) will use the iterator from the start or end of
>> the list (whichever is closer) to find the element. This can use the
>> iterator/descendingIterator method of Deque for the same performance (but
>> the code to do this has to be written).
>>
>> Alex
>>
>>
>> On Tue, 9 Apr 2024 at 08:45, Claude Warren  wrote:
>>
>> > Greetings,
>> >
>> > I am attempting to use the Bloom filter code in Kafka to manage PID
>> > generation.  The requirement is to remove pid tracking after a period of
>> > time.  This is possible with the LayeredBloomFilter but it has an edge
>> case
>> > problem.

Is there a blog for commons?

2024-04-16 Thread Claude Warren
I was wondering if there is a blog dedicated to commons?  I have several
blog posts about using the new Bloom filters in collections 4.5 and am
looking for a place to publish.

Claude

-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [VOTE] Release Apache Commons CLI 1.7.0 based on RC1

2024-04-14 Thread Claude Warren
+1

Ubuntu 22.04.4 LTS
Java openjdk 21.0.2 2024-01-16


On Sat, Apr 13, 2024 at 6:22 PM Bruno Kinoshita 
wrote:

> +1
>
> Building OK from tag with
>
> Apache Maven 3.8.5 (3599d3414f046de2324203b78ddcf9b5e4388aa0)
> Maven home: /opt/apache-maven-3.8.5
> Java version: 17.0.10, vendor: Private Build, runtime:
> /usr/lib/jvm/java-17-openjdk-amd64
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "5.15.0-102-generic", arch: "amd64", family:
> "unix"
>
> Inspected dist area archive, found no issues. Site reports look good, some
> low hanging fruit PMD issues, will see if I can prepare a PR to fix at
> least one that is complaining about a static import (happy if anyone beats
> me to it -- working on Imaging now).
>
> Thanks!
>
> Bruno
>
> On Sat, 13 Apr 2024 at 13:30, Gary Gregory  wrote:
>
> > We have fixed a few bugs and added enhancements since Apache Commons
> > CLI 1.6.0 was released, so I would like to release Apache Commons CLI
> > 1.7.0.
> >
> > Apache Commons CLI 1.7.0 RC1 is available for review here:
> > https://dist.apache.org/repos/dist/dev/commons/cli/1.7.0-RC1 (svn
> > revision 68470)
> >
> > The Git tag commons-cli-1.7.0-RC1 commit for this RC is
> > caed6714a73ce366aebccebf4a21e287d6d34ae0 which you can browse here:
> >
> >
> https://gitbox.apache.org/repos/asf?p=commons-cli.git;a=commit;h=caed6714a73ce366aebccebf4a21e287d6d34ae0
> > You may checkout this tag using:
> > git clone https://gitbox.apache.org/repos/asf/commons-cli.git
> > --branch 
> > commons-cli-1.7.0-RC1 commons-cli-1.7.0-RC1
> >
> > Maven artifacts are here:
> >
> >
> https://repository.apache.org/content/repositories/orgapachecommons-1718/commons-cli/commons-cli/1.7.0/
> >
> > These are the artifacts and their hashes:
> >
> > #Release SHA-512s
> > #Sat Apr 13 02:26:30 UTC 2024
> >
> >
> commons-cli-1.7.0-bom.json=fd710316564b50a1af147751c01f25c473c8ccee4af117e3ebd12f7e1eda9524b10389e8ae8705922cc82c0164bc1c5a2977327eb22eac449ce4b149b15d3d90
> >
> >
> commons-cli-1.7.0-test-sources.jar=7b1e4940a6d0b2c84939146122849db1086730d93b8c540047a6fd2b6112eba9c420008302300557c8187821316cddbb7b0599664f04fa21e325e04bd4bb06f0
> >
> >
> commons-cli-1.7.0-src.zip=e0f684163327c6180098c61d89705ff49041239b04dcd0235dd0b7b3b9ac590c23da85236ae4458135b1d804824d639db2f4a85ee9cfa281c2cc49ec2f5775f1
> >
> >
> commons-cli-1.7.0-javadoc.jar=fea0659ba7c7f23940547f2aa6ae15d85817a7ecf9d537083b7c5873012e85b39aa9dbf1054f69f49a86b30b3ea06e037dc5b972c1bc553caa2c6004be409613
> >
> >
> commons-cli_commons-cli-1.7.0.spdx.json=1527955bcb67fdbefac08bd6905290f01c3397e6921eb5980e4b96145cf68a4a6745f28b29435fb4e181f051cd1c9d43017d53db5df6a12e76244bd642c5213a
> >
> >
> commons-cli-1.7.0-bin.zip=07a2a6698ba1010c5952ec26ad2e4ce73a87476f0fd97dd34cb5280f0563c0a0bdeb65e3cc9afbe9bca50c149f2570fa716f0a617c9f277b1ffdd9baa75081d0
> >
> >
> commons-cli-1.7.0-tests.jar=08338e5488c6d12e28a65b2d5ea2da172a087832c0c58915f47e1078038b10873bf5594237dc2275a6079027487745d5a76722424c5482c5f786ba95f5c34acb
> >
> >
> commons-cli-1.7.0-src.tar.gz=d97be4df213e0df72a3866b8271e34064e984e98debb2ae8bdd1bef94ea1fd358655dbc730866cb62b4232bad7888d69b4d964a101139f2cf36cd0fcfb39d28b
> >
> >
> commons-cli-1.7.0-sources.jar=9bc86da3391d3fc87806384fda1272b8e0a0313eb529ae4e9eaf7d5fc63708a90449f6fe72b4d29da730c21c4fc497f395c01812b625803479884e08f0593090
> >
> >
> commons-cli-1.7.0-bom.xml=40623202b108d74635fee720a826e730ed63b838596db45a887f4718be949db9ca01f9661be3c03067b2d1d9246d7fde1524c83c7718db1cb2202e0d3ed748d2
> >
> >
> commons-cli-1.7.0-bin.tar.gz=c429f22021bbb80f11e9b27bc768ec28b797d9d6fc18b3af369bfc7278641f2585c62cd6d6736b7f1813e3f9690390f8fd304743f123ae9fff496edd1942cac6
> >
> >
> > I have tested this with 'mvn' and 'mvn -V -Duser.name=$my_apache_id
> > -Dcommons.release-plugin.version=$commons_release_plugin_version
> > -Prelease -Ptest-deploy -P jacoco -P japicmp clean package site
> > deploy' using:
> >
> > openjdk version "17.0.10" 2024-01-16
> > OpenJDK Runtime Environment Homebrew (build 17.0.10+0)
> > OpenJDK 64-Bit Server VM Homebrew (build 17.0.10+0, mixed mode, sharing)
> >
> > Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
> > Maven home: /usr/local/Cellar/maven/3.9.6/libexec
> > Java version: 17.0.10, vendor: Homebrew, runtime:
> > /usr/local/Cellar/openjdk@17/17.0.10/libexec/openjdk.jdk/Contents/Home
> > Default locale: en_US, platform encoding: UTF-8
> > OS name: "mac os x", version: "14.4.1", arch: "x86_64", family: "mac"
> >
> > Darwin * 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:11:05
> > PDT 2024; root:xnu-10063.101.17~1/RELEASE_X86_64 x86_64
> >
> > Details of changes since 1.6.0 are in the release notes:
> >
> >
> https://dist.apache.org/repos/dist/dev/commons/cli/1.7.0-RC1/RELEASE-NOTES.txt
> >
> >
> https://dist.apache.org/repos/dist/dev/commons/cli/1.7.0-RC1/site/changes-report.html
> >
> > Site:
> >
> >
> 

Re: [Collections-BloomFilter][Discuss] missing functionality?

2024-04-09 Thread Claude Warren
Alex,

I like your solution.  To answer your question: we create a BloomFilter
that has a timestamp associated with it.  When System.currentTimeMillis()
exceeds the timestamp the filter is removed.  The custom cleanup
calls Cleanup.removeEmptyTarget().andThen()
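
A rough sketch of that cleanup, assuming a hypothetical
TimestampedBloomFilter wrapper that exposes its timestamp through an
expires() accessor (illustrative, not the actual KIP-936 code):

// remove layers whose timestamp has passed, after the standard cleanup
Consumer<LinkedList<BloomFilter>> removeExpired = list -> list.removeIf(
        bf -> ((TimestampedBloomFilter) bf).expires() <= System.currentTimeMillis());
Consumer<LinkedList<BloomFilter>> cleanup =
        Cleanup.removeEmptyTarget().andThen(removeExpired);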

I think that creating a cleanup() or clean() method on the
LayeredBloomFilter is the appropriate solution and that it should call
cleanup() on the LayerManager. (so 2 new methods, one exposed).

The next() method is used when external circumstances dictate that a new
layer should be created.  I think a StableBloomFilter I implemented
required it,  but I do not have the code to hand at the moment.

Claude


On Tue, Apr 9, 2024 at 10:38 AM Alex Herbert 
wrote:

> Hi Claude,
>
> Q. What is your current clean-up filter, i.e.
> the Consumer<LinkedList<BloomFilter>>? I assume you are using a custom one.
>
> The current collections code only has 2 functional implementations. One
> will remove the newest filter if it is empty, one will remove the oldest
> filters until the size is below a limit. Since neither of those will
> iterate the list and purge stale objects then I assume you are using a
> custom clean-up filter. So you had to have created the layer manager with
> your custom filter. Assuming this then there are at least two solutions for
> the current code:
>
> 1. The current implementation always calls the clean-up filter with the
> same LinkedList since it is final. So you can capture the list and do what
> you want with it:
>
> @SuppressWarnings("rawtypes")
> LinkedList[] captured = new LinkedList[1];
> Consumer<LinkedList<BloomFilter>> cleanup = list -> {
> captured[0] = list;
> // ... do clean-up
> };
>
> // Once you have captured the list, you can clean it when you want:
> // unchecked conversion
> cleanup.accept(captured[0]);
>
> Obviously this is not ideal as you have to manage the captured list to call
> cleanup. But it delivers exactly what you require in terms of being able to
> call cleanup at any time.
>
> 2. The call to next() will clean the layers but also add a new layer. So
> your custom clean method could clean stale objects and also any empty
> filters not at the end of the list. This will avoid building up lots of
> empty filters when you frequently trigger next() to purge stale filters.
> You can call next() directly on the LayeredBloomFilter. I do not know what
> extend check you are using so there is some management to be done with the
> other settings of the LayerManager to avoid removing any functional layers
> which are currently empty.
>
> --
>
> As to exposing the LayerManager and adding a clean() method to the
> LayerManager, I think this is not in keeping with the current design. The
> LayerManager is used during construction and then never used again. So
> functionality to act on the layers is public through the LayeredBloomFilter
> (e.g. calling next()). So perhaps the change to the API should be to add a
> cleanup() method to LayeredBloomFilter. This does the same as next(), but
> does not add a new layer.
>
> I cannot recall the use case for next() in the LayeredBloomFilter. Would
> the addition of cleanup() make the next() method redundant?
>
> --
>
> Note: The typing against LinkedList could be updated to java.util.Deque.
> The only issue with this is the method:
> public final BloomFilter get(int depth)
>
> This is not supported by the Deque interface. However the LinkedList
> implementation of get(int) will use the iterator from the start or end of
> the list (whichever is closer) to find the element. This can use the
> iterator/descendingIterator method of Deque for the same performance (but
> the code to do this has to be written).
>
> Alex
>
>
> On Tue, 9 Apr 2024 at 08:45, Claude Warren  wrote:
>
> > Greetings,
> >
> > I am attempting to use the Bloom filter code in Kafka to manage PID
> > generation.  The requirement is to remove pid tracking after a period of
> > time.  This is possible with the LayeredBloomFilter but it has an edge
> case
> > problem.
> >
> > The LayeredBloomFilter uses the LayerManager to manage the filters that
> > comprise the layers of the LayeredBloomFilter.
> > The LayerManager uses a Consumer<LinkedList<BloomFilter>> called
> > filterCleanup to remove old layers.
> > The filterCleanup is only called when a new layer is added to the layered
> > filter.
> >
> > This solution works well in the general case where data is flowing
> through
> > the layered filter.  However if nothing is added to the filter,
> > filterCleanup is not called.
> >
> > In the Kafka case we have a LayeredBloomFilter for PIDs for each
> > producer.

[Collections-BloomFilter][Discuss] missing functionality?

2024-04-09 Thread Claude Warren
Greetings,

I am attempting to use the Bloom filter code in Kafka to manage PID
generation.  The requirement is to remove pid tracking after a period of
time.  This is possible with the LayeredBloomFilter but it has an edge case
problem.

The LayeredBloomFilter uses the LayerManager to manage the filters that
comprise the layers of the LayeredBloomFilter.
The LayerManager uses a Consumer<LinkedList<BloomFilter>> called
filterCleanup to remove old layers.
The filterCleanup is only called when a new layer is added to the layered
filter.

This solution works well in the general case where data is flowing through
the layered filter.  However if nothing is added to the filter,
filterCleanup is not called.

In the Kafka case we have a LayeredBloomFilter for PIDs for each producer.
As long as a producer is producing PIDs the filter gets updated.

However, if a producer drops from the system or goes offline for a period
of time, then they will no longer be producing PIDs and their old expired
data will remain.

We want to remove the producer from the collection when there are no more
PIDs being tracked.

I think this can be solved by adding a clean() method to the LayerManager
that simply calls the existing filterCleanup.
It would be easier to access this method if the LayeredBloomFilter had a
method to return the LayerManager that was passed in the constructor.
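
In code the proposal is small; a sketch of the two hypothetical methods
(not existing API):

// on LayerManager: run the existing cleanup without adding a layer
public void clean() {
    filterCleanup.accept(filters);
}

// on LayeredBloomFilter: expose the manager supplied to the constructor
public LayerManager getLayerManager() {
    return layerManager;
}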

Does anyone see any issues with this approach?  Are there other solutions
to be had?

Questions and comments welcomed.
-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [Collections] New release candidate for 4.5.0-beta1 or -M1 and new Bloom Filter package

2024-03-28 Thread Claude Warren
+1 on the M1 or beta1 release.

On Mon, Mar 25, 2024 at 2:12 PM Gary Gregory  wrote:

> Hi All,
>
> 4.5.0 will contain a new package for Bloom Filters.
>
> Since this is a new package and is non-trivial, I propose we release a
> version called 4.5.0-M1 or 4.5.0-beta1 to let users try this out while
> giving us the change to make breaking API changes if needed.
>
> Gary
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [CLI] kabob-format?

2024-01-07 Thread Claude Warren
Actually, I should have said "kebab-case" (Brain was apparently not fully
engaged when I typed that).  But just standard dash/hyphen between words.
kebab-case as opposed to CamelCase or snake_case.  The hyphen/dash is
disallowed in the CLI token verification.  If there is not a good reason to
reject it, I would like to allow it.


On Fri, Jan 5, 2024 at 8:40 PM Gary Gregory  wrote:

> For my money, all I would care about is "classic" dash dash for one option
> and one dash for combining single char options. I find anything else leads
> to confusing inconsistencies. For example, in Linux, for some app -list
> means the list option but in others is the same as -l -i -s -t, nasty.
>
> But more importantly, what insane variation of kebab formatting is kabob?
>
> Gary
>
> On Fri, Jan 5, 2024, 9:45 AM Claude Warren  wrote:
>
> > Is there a good reason not to support kabob-format in CLI?
> >
> > I can see that we have to strip the '-' and '--' off the front but after
> > that it seems like anything before a space should be valid.
> >
> > --
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


[CLI] kabob-format?

2024-01-05 Thread Claude Warren
Is there a good reason not to support kabob-format in CLI?

I can see that we have to strip the '-' and '--' off the front but after
that it seems like anything before a space should be valid.
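
For example, something like this is rejected today by the option validator
(the option name is illustrative):

Options options = new Options();
// throws IllegalArgumentException today: '-' is not an accepted character
options.addOption(Option.builder("max-file-size").hasArg().build());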

-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


[BEANUTILS] Is there a good reason for Converter to not be a FunctionalInterface?

2023-12-29 Thread Claude Warren
I am looking at BeanUtils as part of the CLI options to parse a command
line option string into a class.

I see that Converter is an interface with one method: converting a String
to an Object.  Is there a good reason for this not to be annotated as
@FunctionalInterface?
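
For example, using a simplified mirror of the interface (not the exact
BeanUtils declaration):

@FunctionalInterface
interface Converter {
    Object convert(Class<?> type, Object value);
}

// a single abstract method means a lambda is already a valid implementation;
// the annotation simply documents and enforces that:
Converter toInteger = (type, value) -> Integer.valueOf(value.toString());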

Claude
-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


[CLI] DISCUSS: Should Option(Builder) constructor be protected?

2023-12-28 Thread Claude Warren
Is the option class intended to be extended?  If not then it should
probably be final.  If so, then the Option(Builder) constructor should be
protected so that derived classes can avail of the default settings from
the Builder.

I have a case where I want to pass the standard builder into the derived
class constructor but I have no way of passing that on to the Option
constructor itself.
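
A sketch of the blocked pattern, where TimedOption is a hypothetical
subclass:

public class TimedOption extends Option {
    public TimedOption(Option.Builder builder) {
        super(builder);  // fails today: Option(Builder) is not visible to subclasses
    }
}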

Thoughts?

Claude

-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


[COLLECTIONS] Thread safe Bloom filter

2023-07-02 Thread Claude Warren
I have been thinking about what it would take to make SimpleBloomFilter
thread safe.  I think that the answer is to use a copy on write strategy
and a lock within all the merge methods.

However, this leaves the problem of the cardinality calculation.  Currently
it is lazily performed and cached.  I am thinking that there are 2
solutions.

   1. mark cardinality volatile and leave the calculation lazy, as it
   effectively does a copy on write.
   2. update the cardinality inside the merge.  This should be doable as we
   can get the number of bits in each long after the merge and simply add them
   up during processing.
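
A rough sketch of option 2 (a copy-on-write merge that maintains the
cardinality); illustrative only, not the SimpleBloomFilter internals:

class ThreadSafeFilter {
    private volatile long[] bitMap;
    private volatile int cardinality;

    ThreadSafeFilter(int longCount) {
        bitMap = new long[longCount];
    }

    synchronized void merge(long[] other) {
        long[] copy = bitMap.clone();          // copy on write
        int count = 0;
        for (int i = 0; i < copy.length; i++) {
            copy[i] |= other[i];
            count += Long.bitCount(copy[i]);   // add up the bits in each long
        }
        cardinality = count;
        bitMap = copy;                         // publish the new array
    }

    int cardinality() {
        return cardinality;                    // no lazy calculation required
    }
}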


Does anyone see a problem with this solution?

Claude
-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [COLLECTIONS] Trie subclassing

2023-05-03 Thread Claude Warren
I have a workaround, but not having to convert from the binary representation
to a string would be handy.  But it is not critical.  I am also not sure that
the API is documented all that well.  Perhaps I should open a low-level
ticket that we can work on after 4.5 is released.

On Sun, Apr 23, 2023 at 9:33 PM Gary Gregory  wrote:

> Claude,
>
> Do you need the API to be made public?
>
> Gary
>
> On Mon, Apr 17, 2023 at 2:53 PM Gary Gregory 
> wrote:
> >
> > I am guessing that only what is required to be public is, so as to both
> > maximize our flexibility in maintenance and minimize the public API
> > surface to support.
> >
> > We could make it public if we are sure the API is documented and the
> > code is as good as we can reasonably make it.
> >
> > Gary
> >
> >
> > On Mon, Apr 17, 2023, 11:48 Claude Warren  wrote:
> >>
> >> I was looking at the Trie and PatriciaTree class structure from version
> 4.5
> >> over the weekend.  I wanted to build a different implementation with
> slight
> >> modifications.  However, there does not seem to be a way to inherit from
> >> AbstractPatriciaTrie as it is package protected.  Was this intentional
> or
> >> just an oversight because there was only one concrete implementation
> >> provided?
> >>
> >> --
> >> LinkedIn: http://www.linkedin.com/in/claudewarren
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


[COLLECTIONS] Trie subclassing

2023-04-17 Thread Claude Warren
I was looking at the Trie and PatriciaTree class structure from version 4.5
over the weekend.  I wanted to build a different implementation with slight
modifications.  However, there does not seem to be a way to inherit from
AbstractPatriciaTrie as it is package protected.  Was this intentional or
just an oversight because there was only one concrete implementation
provided?

-- 
LinkedIn: http://www.linkedin.com/in/claudewarren


RE: [Collections] Bloom filters status?

2023-03-23 Thread Claude Warren, Jr
Gary,

The Bloom filter package is functionally complete and ready for prime
time.  As to beta, I have used it in several research projects and am
proposing using it in an in-house project at work.  I don't know what the
standard process is for commons, so I leave it to your discretion.

Claude

On 2023/03/17 13:06:48 Gary Gregory wrote:
> Hi All,
>
> What is the status of the bloom filters package? Is it ready for prime
> time? Should we have a beta?
>
> Gary
>


pulls 358 and 361

2022-11-12 Thread Claude Warren
These pulls have conflicting changes.  When one is merged I will fix the
other.
Claude

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


COLLECTIONS-824: BloomFilter: Optimize SimpleHasher.forEachIndex and SimpleHasher name change

2022-07-07 Thread Claude Warren



@aherbert

I will be out of pocket for the next 1.5 weeks.  When I return I would 
like to see if we can resolve COLLECTIONS-824.  Do you have an 
implementation of EnhancedDoubleHasher that we can use?


I found a bit shift method to do power calculations but am not certain 
that it is public domain so I believe it is unusable.  Do you have an 
implementation?


Thanks,
Claude

[1] https://issues.apache.org/jira/browse/COLLECTIONS-824


-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [Commons-Collections] remove bloom filters?

2021-08-30 Thread Claude Warren
I have recently implemented something similar, in Rust, for work.  I think
there are significant changes around Shape and the Hasher classes.

Shape can be reduced to numberOfBits and numberOfHashFunctions.

Hasher can be reduced to 2 Longs.

A compound hasher can be created by multiple hashers in a list.
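
In Java terms the simplification might look like this (a sketch of the
idea using records, not a proposed API):

record Shape(int numberOfBits, int numberOfHashFunctions) {}

record Hasher(long initial, long increment) {}        // 2 longs per item

record CompoundHasher(List<Hasher> hashers) {}        // multiple hashers in a list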

The Rust implementation forced an intense look at the architecture and a
significant simplification.  I think the simplification refactoring should
be applied before the Bloom filters are released.

I would also note that the current implementation is too slow.  It
significantly changes the timings of indexes, so much so that I stopped
working on a paper because the differences in the index timings were
overwhelmed by the change in Bloom filter performance.

Claude

On Mon, Aug 30, 2021 at 2:11 PM Alex Herbert 
wrote:

> Hi Claude,
>
> I previously raised the issue of moving the Bloom filters to another branch
> while still under development. However this would reduce the visibility of
> the code and discourage potential contributors. To maintain visibility for
> contributors I would suggest that a collections release could avoid
> releasing this package.
>
> IIRC the package was fairly mature in its current form with a few
> exceptions such as changes to move to Splitterators in-place of Iterators.
> Other discussions we had were about more major reengineering of the type
> not possible while maintaining binary compatibility. Is a major
> reengineering what you are suggesting?
>
> Regards,
>
> Alex
>
> On Mon, 30 Aug 2021 at 13:37, Claude Warren  wrote:
>
> > Greetings,
> >
> > I see that the Bloom filter implementation has not been released.  It
> would
> > be in V4.5.  I have not had time to come back and clean it up as it
> should
> > be to make it simpler and faster.   I am concerned that there may be an
> > upcoming release of 4.5 which would lock the implementation and freeze
> many
> > of the discussions we had before.
> >
> > Is it possible to remove the Bloom filter implementations from
> > Commons-Collections.  I will work on them when I get some time and
> > resubmit.  We should also probably have a discussion about whether or not
> > collections is the proper place for them.
> >
> > Claude
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


[Commons-Collections] remove bloom filters?

2021-08-30 Thread Claude Warren
Greetings,

I see that the Bloom filter implementation has not been released.  It would
be in V4.5.  I have not had time to come back and clean it up as it should
be to make it simpler and faster.   I am concerned that there may be an
upcoming release of 4.5 which would lock the implementation and freeze many
of the discussions we had before.

Is it possible to remove the Bloom filter implementations from
Commons-Collections.  I will work on them when I get some time and
resubmit.  We should also probably have a discussion about whether or not
collections is the proper place for them.

Claude

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [all] Thoughts on build system maven -> gradle??

2020-07-17 Thread Claude Warren
-1 from me.  I have a philosophical objection.

Much like Apache HTTPD's mod_rewrite, Gradle's greatest strength is that it
allows the developer to do so much in so many ways.  But its greatest
weakness is that it allows the developer to do so much in so many ways.

My experience with Ant and Gradle is that in a fairly short period of time
the build scripts become so complex that it takes a development team just
to maintain them.

The greatest strength of Maven is that it requires developers to seriously
work to do things outside of the "Maven way".  But its greatest weakness is
that developers have to seriously work  to do things outside of the "Maven
way".

My experience with Maven is that if the developer wants to do something
outside of the "Maven way" it is difficult, but it also forces the
developer to think about what they are doing.

With respect to this project, it is my opinion that we do not want to
create a build system that is so complex that it becomes difficult for the
uninitiated developer to contribute to the project.

If the goal here is to provide a Gradle build in parallel with the Maven
build then I would want to have a -1.5 vote.  Adding the extra work
associated with keeping the Gradle build sane onto the work associated with
ensuring the 2 builds provide the same output detracts further from the
goal of having a project that most experienced developers can join with
minimal effort.

Not to put too fine a point on it, but I think we want to focus on
developing, supporting and maintaining the software not developing,
supporting, and maintaining a build system to make the software.

Claude

On Fri, Jul 17, 2020 at 12:12 AM Singh, Baljit (GE Aviation, US) <
balsi...@ge.com> wrote:

> +1 from me. I prefer Gradle for two main reasons:
> - Control over how a library's dependencies are exposed ("api" vs
> "implementation", see
> https://docs.gradle.org/current/userguide/java_library_plugin.html#sec:java_library_configurations_graph
> )
> - build.gradle is a lot simpler and a lot less verbose than pom.xml
>
> Drawback: Maven probably has a lot more plugins.
>
>
> On 7/16/20, 7:00 PM, "Alex Remily"  wrote:
>
> For those of us not as familiar with Gradle, what are some of the
> benefits?  Drawbacks?
>
> On Thu, Jul 16, 2020 at 5:30 PM Rob Tompkins 
> wrote:
> >
> > I think we might be coming towards time to make this move or at
> least accommodate for gradle builds in commons. Let’s look to the success
> the Spring Framework has had here with gradle. That said, I’m merely trying
> to gauge opinions here and am entirely content to stay with maven, if
> that’s what the community wishes.
> >
> > I’m a +1 on at least letting gradle be a part of our systems (don’t
> have to get rid of maven either).
> >
> > Cheers,
> > -Rob
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [BloomFilters] changes to BloomFilter

2020-05-10 Thread Claude Warren
I keep wondering if Bloom filters belong in Collections.  They are not a
collection in the standard sense of the word.  Would it make more sense to
split it out as a new Commons project?  How does one even go about that?

On Wed, Apr 22, 2020 at 5:37 PM Alex Herbert 
wrote:

> On Wed, 22 Apr 2020 at 17:17, Gilles Sadowski 
> wrote:
>
> >
> > > - Does the BF code reuse Commons Collection utilities?
> >
> > Yes, in class "HasherBloomFilter":
> > import org.apache.commons.collections4.iterators.EmptyIterator;
> > import org.apache.commons.collections4.iterators.IteratorChain;
> >
>
> Missed that one. But I believe one of the discussions was how to improve
> the HasherBloomFilter to remove this code or even drop the
> HasherBloomFilter as the functionality it provides was heading towards
> being redundant. I'd have to check back what was discussed.
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: what became of beanshell in Apache commons?

2020-04-25 Thread Claude Warren
I don't know for certain, but here is what I pieced together.

The project currently resides at:  https://github.com/beanshell/beanshell

The documentation there says:

> BeanShell was proposed as an incubator project to move to the Apache
> Software Foundation. In preparation for this, the codebase for BeanShell
> 2.0b4 was donated to ASF by a code grant, and the license changed to the
> Apache License, version 2.0.
>
> The source code was moved to http://apache-extras.org/ - a project home
> hosted by Google Code, that was only informally associated with Apache
> Software Foundation. Many of the BeanShell committers were Apache
> committers, and thus Apache Extras seemed a natural home.
>
> However the project did not move into the Apache incubator, and remained
> at apache-extras.org as an independent project.
>

A quick review of the Apache Incubator projects does not show that it was
ever an incubator project, so I have to assume that it never came over.

Claude

On Fri, Apr 24, 2020 at 10:27 PM Peter Kovacs  wrote:

> Hello all,
>
> I hope I do not sound too stupid.
> In Apache OpenOffice we still use BeanShell 2.0b6.
> Now I figured that BeanShell was included into Apache Commons, which we
> also use.
>
> Can anyone help me, and guide on the transition so we can drop the old
> dependency for good?
>
> Any information helps, links and so on. Sorry I am a bit lost having to
> bridge 7 years gap.
> Thanks a lot for any support! (I am subscribing, So no worries.)
>
> All the Best
> Peter
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [BloomFilters] changes to BloomFilter

2020-04-22 Thread Claude Warren
Bloom filters should not use generics.  That has been my stated opinion.
They are not like other collections in that you don't get out what you put
in.  They are collections of hashes so the idea that generics should be
used to somehow define what goes in is misleading.

If commons-collections is supposed to be all about putting things into a
collection and getting them back out then perhaps bloom filters do not
belong here.

The only major point of contention is whether the package should end up
looking like the Guava Bloom filter package.  In my opinion the Guava
package is very restrictive.  It does not allow for different hashing
implementations, it forces a one-to-one correspondence between an Object
and the filter, this makes putting properties of an Object into the filter
as separate items difficult if not impossible, and it makes creating
partial filters for matching the same still more difficult.

Any more complex usage of Bloom filters (e.g. in genomics) will be much
harder if not impossible.

Going the Guava route also begs the question: why not just use Guava?

The intention of this contribution was a framework that allows the
developer to build Bloom filters that
a) met the requirements of the application.
b) were easy to share between applications.
c) could implement most strategies for Bloom filters.

The Guava implementation is easy enough to construct with the framework as
defined in the current Commons Collections Bloom filter code.  And I have
no objection to providing a Simple Bloom filter implementation that does
that.  But doing so should not modify the framework in such a way as to
make other usage more difficult.

There have been lots of good conversations and lots of improvements since
the code was contributed.  I have several open source projects that
utilized the original code and have been able to easily modify them to use
the Commons versions as development and improvements progressed.  I would
hope to continue to be able to do that as the code moves to a releasable
state.

As I said above, it may be that commons collections is not the place for
Bloom filters.  Perhaps they belong in codec or in its own project.  I
leave that to others to decide.

Claude




On Wed, Apr 22, 2020 at 12:47 AM Gilles Sadowski 
wrote:

> Hello.
>
> > > [...]
> > >
> >
> > Attempting to re-awaken this thread.
> >
> > IMO the bloomfilter package is currently under development. The original
> contributor was working through submitting Bloom filter functionality in
> parts. My involvement was to ensure the current code that was initially
> merged was documented (in javadoc) and functioned as described. This led to
> changes of the current functionality and ideas for changes. There were many
> ideas in this thread history that were discussed, some agreed and not yet
> implemented, some still open for discussion.
> >
> > Development has stalled and the vision for the final package has not
> been completed. At present the bloomfilter added to collections does not
> use generics and it has very little to do with java.util.Collection. So is
> it even in the remit of commons-collections?
>
> What would be the alternative(s)?
>
> >
> > I suggested moving to a development branch while the package
> functionality is stabilised if any other contributors wish to work on this.
>
> IIRC, in Commons, branches other than "master" have usually become stale,
> unfortunately.  They've been rare, and worked on by a single person...
> Perhaps somehow mark the feature as not ready, and to be removed from
> the release branch (?).
>
> > I have left development alone as it seems redundant to progress without
> a defined scope for the ultimate functionality.
>
> Do you mean that progress depends on intended usage?
>
> >
> > Opinions on how to proceed?
>
> I got lost along the way; sorry.
> Are there still uncertainties about the "basic" features?
>
> Regards,
> Gilles
>
> >
> > Alex
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [BloomFilters] changes to BloomFilter

2020-03-22 Thread Claude Warren
 added. It is commonly used in applications like Hadoop to determine
which nodes might contain a specific file.
Multidimensional representation Bloom filters

This is a use case that extends the ability to store excess items (>"n")
into a Bloom filter without significantly modifying the probability of
false positives. To do this efficiently the Multidimensional filter has to
know the Shape of the filters being inserted. In the most efficient cases
it requires the hasher so that it can build filters of different but
associated shapes. In addition, this type of filter can return ID’s
associated with the various filters. So that it in effect becomes an index
of objects associated with Bloom filters.

This type of filter is used in systems that have many (>1000) locations to
check for a Bloom filter match as they can perform the match much faster
than the linear search alternative.
Suggestions

Move the hashing into the Bloom filter

Originally the Hasher was called a ProtoBloomFilter. There was a request
for a name change early on. Once a Hasher is created it can be used to
create any concrete Bloom filter. Currently this is done with the
getBits(Shape) method. Perhaps what is needed is a rework in this section.

If the Hasher had a method to return a LongSupplier for each item then the
Bloom filter constructor could be in control of the number of hash
functions that were called for each item.

So assume Hasher has 2 new methods:

int items(); // the number of items

LongSupplier supplier(int itemNumber);

Then the BloomFilter(Hasher, Shape) constructor could call:

for (int i = 0; i < hasher.items(); i++) {
    // consume hasher.supplier(i), invoking it once per hash function
}

First, I think this would be a merge(T)/contains(T) pair. But I think it is a
bad idea to confuse the general BloomFilter responsibility with the Hasher
responsibility. I suggest leaving the general structure as defined above in
place and creating a TypedBloomFilter<T> class that works much as you
suggest. We have to make a number of assumptions. For example we assume
that BitSetBloomFilter is the appropriate implementation and the
DynamicHasher the appropriate Hasher.

class TypedBloomFilter<T> extends BitSetBloomFilter {

    // assumes fn maps the object to the byte[] fed to the hasher
    private final Function<T, byte[]> fn;

    TypedBloomFilter(Function<T, byte[]> fn, Shape shape) {
        super(shape);
        this.fn = fn;
    }

    public void merge(T t) {
        merge(new DynamicHasher.Builder().with(fn.apply(t)).build());
    }

    public boolean contains(T t) {
        return contains(new DynamicHasher.Builder().with(fn.apply(t)).build());
    }
}
Thread Safety

I think we note that Bloom filters are not generally safe for multithreaded
use without external synchronization.
Rework the HashFunction and HashFunctionIdentifier issues

We recognize that there are issues with HashFunctions and the identifier as
preserved in the Shape. I think that it is overly complex and prone to
error. Perhaps the proper solution is to remove the HasherIdentity from
Shape and create a method on Hasher String getID(). In most cases the
string would return the fully qualified class name, in cases where
difference hashing algorithms can be defined the string should return the
className prepended to the algorithm name or something similar. We can then
remove the HashFunction interface and all the stuff that derives from it.
Bloom filter equivalence will then depend upon Shape and Hasher equivalence.
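
A sketch of the suggested method (illustrative):

public interface Hasher {
    /**
     * Identifies the hashing implementation: normally the fully qualified
     * class name, with the algorithm name added where one class supports
     * several algorithms.
     */
    String getID();
}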

On Fri, Mar 20, 2020 at 10:51 AM Alex Herbert 
wrote:

> Sorry for an absence. I have been thinking on ways to move the BloomFilter
> API forward that consolidates the current functionality but makes it
> simpler to use for the common case.
>
> > On 18 Mar 2020, at 17:12, Claude Warren  wrote:
> >
> > bf.getBits() * Long.BYTES  may be as long as Math.Ceil(
> > Shape.getNumberOfBits() / 8.0 ) or it may be shorter.
>
> I presume you mean:
>
> bf.getBits().length * Long.BYTES  may be equal to Math.Ceil(
> Shape.getNumberOfBits() / 8.0 ) or it may be longer.
>
> >
> > I am building byte buffers of fixed length that is the maximum size that
> > any valid bf.getBits() * Long.BYTES I need to know
> > Shape.getNumberOfBytes().
> > The  conversion is required for some Bloom filter indexing techniques.
>
> OK. I gave you an example of how to do this. In my example the
> Shape.getNumberOfBytes() was 1 line of code in 8.
>
> This method has a practical use if you are creating a byte[]
> representation of the bits in a filter. So you already have to be writing
> code to do that. In the context of this process it is not going to save you
> a lot of code. It seems like this function is very specific to your need
> and not something generic required within the API.
>
> >
> > And while serialization is outside the scope of the library, it is only
> > reasonable that we provide enough information to allow developers to
> > serialize/deserialize the data.  For example BloomFilter allows you to get
> > either the long[] representation or the list of bit indexes (via OfInt) and
> > there are ways to reconstruct a BloomFilter if you were to write that out
> > and read it back.
Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
bf.getBits() * Long.BYTES  may be as long as Math.Ceil(
Shape.getNumberOfBits() / 8.0 ) or it may be shorter.

I am building byte buffers of fixed length that is the maximum size that
any valid bf.getBits() * Long.BYTES I need to know
Shape.getNumberOfBytes().
The  conversion is required for some Bloom filter indexing techniques.

And while serialization is outside the scope of the library, it is only
reasonable that we provide enough information to allow developers to
serialize/deserialize the data.  For example BloomFilter allows you to get
either the long[] representation or the list of bit indexes (via OfInt) and
there are ways to reconstruct a BloomFilter if you were to write that out
and read it back.



On Wed, Mar 18, 2020 at 4:07 PM Alex Herbert 
wrote:

>
>
> > On 18 Mar 2020, at 14:39, Claude Warren  wrote:
> >
> >>> Shape Discussion:
> >>>
> >>> as for getNumberOfBytes() it should return the maximum number of bytes
> >>> returned by a getBits() call to a filter with this shape.  So yes, if
> > there
> >>> is a compressed internal representation, no it won't be that.  It is a
> >>> method on Shape so it should literally be Math.ceil( getNumberOfBits()
> /
> >>> 8.0 )
> >>>
> >>> Basically, if you want to create an array that will fit all the bits
> >>> returned by BloomFilter.iterator() you need an array of
> >>> Shape.getNumberOfBytes().  And that is actually what I use it for.
> >
> >> Then you are also mapping the index to a byte index and a bit within the
> > byte. So if you are doing these two actions then this is something that
> you
> > should control.
> >
> > BloomFilter.getBits returns a long[].  That long[] may be shorter than
> the
> > absolute number of bytes specified by Shape.  It also may be longer.
> >
> > If you want to create a copy of the byte[] you have to know how long it
> > should be.  The only way to determine that is from Shape, and currently
> > only if you do the Ceil() method noted above.  There is a convenience in
> > knowing how long (in bytes) the buffer can be.
>
> Copy of what byte[]?
>
> There is no method to create a byte[] for a BloomFilter. So no need for
> getNumberOfBytes().
>
> Are you talking about compressing the long[] to a byte[] by truncating the
> final long into 1-8 bytes?
>
> BloomFilter bf;
> long[] bits = bf.getBits();
> ByteBuffer bb = ByteBuffer.allocate(bits.length *
> Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
> Arrays.stream(bits).forEachOrdered(bb::putLong);
> byte[] bytes = bb.array();
> int expected = (int) Math.ceil(bf.getShape().getNumberOfBits() / 8.0);
> if (bytes.length != expected) {
> bytes = Arrays.copyOf(bytes, expected);
> }
>
> For a BloomFilter of any reasonable number of bits the storage saving will
> be small.
>
> Is this for serialisation? This is outside of the scope of the library.
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
>> Shape Discussion:
>>
>> as for getNumberOfBytes() it should return the maximum number of bytes
>> returned by a getBits() call to a filter with this shape.  So yes, if
there
>> is a compressed internal representation, no it won't be that.  It is a
>> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
>> 8.0 )
>>
>> Basically, if you want to create an array that will fit all the bits
>> returned by BloomFilter.iterator() you need an array of
>> Shape.getNumberOfBytes().  And that is actually what I use it for.

>Then you are also mapping the index to a byte index and a bit within the
byte. So if you are doing these two actions then this is something that you
should control.

BloomFilter.getBits returns a long[].  That long[] may be shorter than the
absolute number of bytes specified by Shape.  It also may be longer.

If you want to create a copy of the byte[] you have to know how long it
should be.  The only way to determine that is from Shape, and currently
only if you do the Ceil() method noted above.  There is a convenience in
knowing how long (in bytes) the buffer can be.
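
The convenience method being argued for is a one-liner on Shape, sketched
here with integer arithmetic to avoid the floating-point rounding:

public int getNumberOfBytes() {
    // equivalent to (int) Math.ceil(getNumberOfBits() / 8.0)
    return (getNumberOfBits() + Byte.SIZE - 1) / Byte.SIZE;
}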



On Wed, Mar 18, 2020 at 2:19 PM Claude Warren  wrote:

> We are getting to the point where there are a lot of options that
> determine which implementation is "best".  We could take a stab at creating
> a BloomFilterFactory that takes a Shape as an argument and does a "finger
> in the air" guestimate of which implementation best fits.  Store values in
> long blocks or as integers in a list, that sort of thing.  Perhaps in a
> month or so when we really have some idea.
>
>
>
> On Wed, Mar 18, 2020 at 2:16 PM Claude Warren  wrote:
>
>> You don't need Iterator<IndexCount> iterator() as we have forEachCount(
>> BitCountConsumer )
>>
>> I guess we need something like add( Iterator<IndexCount> ) or add(
>> Collection<IndexCount> ) or add( Stream<IndexCount> )
>>
>> It would be nice if we could have a BitCountProducer class that we could
>> just pass to an add() method.
>>
>>
>>
>>
>> On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert 
>> wrote:
>>
>>>
>>>
>>> > On 18 Mar 2020, at 11:14, Claude Warren  wrote:
>>> >
>>> > On a slightly different note.  CountingBloomFilters have no way to
>>> perform
>>> > a reload.  All other bloom filters you can dump the bits and reload
>>> > (trivial) but if you preserve the counts from a bloom filter and want
>>> to
>>> > reload them you can't.  We need a constructor that takes the
>>> index,count
>>> > pairs somehow.
>>>
>>> Iterator<int[]> ?
>>>
>>> Or foolproof:
>>>
>>> class IndexCount {
>>> final int index;
>>> final int count;
>>> // ...
>>> }
>>>
>>> Iterator<IndexCount>
>>>
>>>
>>> The CountingBloomFilter already has a method forEachCount(…).
>>>
>>> I was reluctant to add some sort of iterator:
>>>
>>> Iterator<int[]> iterator()
>>>
>>> But we could put in:
>>>
>>> Iterator<IndexCount> iterator()
>>>
>>> It would be inefficient but at least it is fool-proof. The operation is
>>> unlikely to be used very often.
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>
>>>
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
We are getting to the point where there are a lot of options that determine
which implementation is "best".  We could take a stab at creating a
BloomFilterFactory that takes a Shape as an argument and does a "finger in
the air" guestimate of which implementation best fits.  Store values in
long blocks or as integers in a list, that sort of thing.  Perhaps in a
month or so when we really have some idea.



On Wed, Mar 18, 2020 at 2:16 PM Claude Warren  wrote:

> You don't need Iterator<IndexCount> iterator() as we have forEachCount(
> BitCountConsumer )
>
> I guess we need something like add( Iterator<IndexCount> ) or add(
> Collection<IndexCount> ) or add( Stream<IndexCount> )
>
> It would be nice if we could have a BitCountProducer class that we could
> just pass to an add() method.
>
>
>
>
> On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert 
> wrote:
>
>>
>>
>> > On 18 Mar 2020, at 11:14, Claude Warren  wrote:
>> >
>> > On a slightly different note.  CountingBloomFilters have no way to
>> perform
>> > a reload.  All other bloom filters you can dump the bits and reload
>> > (trivial) but if you preserve the counts from a bloom filter and want to
>> > reload them you can't.  We need a constructor that takes the index,count
>> > pairs somehow.
>>
>> Iterator<int[]> ?
>>
>> Or foolproof:
>>
>> class IndexCount {
>> final int index;
>> final int count;
>> // ...
>> }
>>
>> Iterator<IndexCount>
>>
>>
>> The CountingBloomFilter already has a method forEachCount(…).
>>
>> I was reluctant to add some sort of iterator:
>>
>> Iterator<int[]> iterator()
>>
>> But we could put in:
>>
>> Iterator<IndexCount> iterator()
>>
>> It would be inefficient but at least it is fool-proof. The operation is
>> unlikely to be used very often.
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
You don't need Iterator<IndexCount> iterator() as we have forEachCount(
BitCountConsumer )

I guess we need something like add( Iterator<IndexCount> ) or add(
Collection<IndexCount> ) or add( Stream<IndexCount> )

It would be nice if we could have a BitCountProducer class that we could
just pass to an add() method.




On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert 
wrote:

>
>
> > On 18 Mar 2020, at 11:14, Claude Warren  wrote:
> >
> > On a slightly different note.  CountingBloomFilters have no way to
> perform
> > a reload.  All other bloom filters you can dump the bits and reload
> > (trivial) but if you preserve the counts from a bloom filter and want to
> > reload them you can't.  We need a constructor that takes the index,count
> > pairs somehow.
>
> Iterator<int[]> ?
>
> Or foolproof:
>
> class IndexCount {
> final int index;
> final int count;
> // ...
> }
>
> Iterator<IndexCount>
>
>
> The CountingBloomFilter already has a method forEachCount(…).
>
> I was reluctant to add some sort of iterator:
>
> Iterator<int[]> iterator()
>
> But we could put in:
>
> Iterator<IndexCount> iterator()
>
> It would be inefficient but at least it is fool-proof. The operation is
> unlikely to be used very often.
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren
On a slightly different note.  CountingBloomFilters have no way to perform
a reload.  All other bloom filters you can dump the bits and reload
(trivial) but if you preserve the counts from a bloom filter and want to
reload them you can't.  We need a constructor that takes the index,count
pairs somehow.
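
One possible shape for that reload path, sketched with hypothetical names
(the IndexCount carrier proposed below and an add(index, count) mutator):

// rebuild a counting filter from persisted (index, count) pairs
CountingBloomFilter reloaded = newEmptyCountingFilter( shape ); // hypothetical factory
Iterator<IndexCount> saved = ...;                               // previously persisted counts
saved.forEachRemaining( ic -> reloaded.add( ic.index, ic.count ) );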

On Tue, Mar 17, 2020 at 10:34 PM Claude Warren  wrote:

> Builder discussion:
>
> Let's go with
>
> >> Builder with(CharSequence, Charset);
> >> Builder withUnencoded(CharSequence);
>
> Shape Discussion:
>
> as for getNumberOfBytes() it should return the maximum number of bytes
> returned by a getBits() call to a filter with this shape.  So yes, if there
> is a compressed internal representation, no it won't be that.  It is a
> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
> 8.0 )
>
> Basically, if you want to create an array that will fit all the bits
> returned by BloomFilter.iterator() you need an array of
> Shape.getNumberOfBytes().  And that is actually what I use it for.
>
> Bloom Filter Discussion:
>
> I have not used probability() on a single filter, just on the Shape.
> However, I can see the value of it.  It seems to me that the difference
> between Shape.probability() and BloomFilter.probability() would give you an
> indication of how many collisions have already occurred.
>
> In the SetOperations class there is an "estimateSize()" method.  It is the
> only single Bloom filter argument method in the class and I think it should
> be moved into BloomFilter so we have 2 new methods:
>
> probability()
> estimateSize()
>
> Counting Filter Discussion:
>
> As for counting filters we could implement several
> int
> short
> byte
>
> Currently they would all have to return int counts but realistically that
> is what would be used in code anyway.  Also, once primitives can be used in
> generics this will be easier.
>
> As for contains( Hasher, int ), I think we would need to add contains(
> BloomFilter, int).  If I understand correctly, contains( BloomFilter, X )
> would test that a BloomFilter has been added X times or rather that there
> are enough counts in the right places to make it appear that BloomFilter
> has been added X times.  When used with a Hasher, remove the duplicates,
> and perform the same test.
>
> I see no reason not to add them.
>
> On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert 
> wrote:
>
>>
>>
>> > On 17 Mar 2020, at 17:06, Claude Warren  wrote:
>> >
>> > On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert 
>> > wrote:
>> >
>> >>
>> >>
>> >>> On 17 Mar 2020, at 15:41, Claude Warren  wrote:
>> >>>
>> >>> I agree with the HashFunction changes.
>> >>
>> >> OK, but which ones?
>> >>
>> >
>> > DOH! this one...
>> >
>> >>
>> >> Changing HashFunction to have two methods:
>> >>
>> >> long hash(byte[])
>> >> long increment(int seed)
>>
>> OK. I’ll update.
>>
>> >>
>> >>> I think Builder should have
>> >>> with(byte[])
>> >>> with(byte[], int offset, int len )
>> >>
>> >> Not convinced here. The HashFunction requires a byte[] and cannot
>> operate
>> >> on a range. This change should be made in conjunction with a similar
>> change
>> >> to HashFunction. So should we update HashFunction to:
>> >>
>> >>
>> > Given the depth of the change let's just leave the with( byte[] )
>> >
>> >
>> >>> with(String)
>> >>>
>> >>> I find that I use with(String) more than any other with() method.
>> >>
>> >> That may be so but String.getBytes(Charset) is trivial to call for the
>> >> user. Then they get to decide on the encoding and not leave it to the
>> >> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice
>> as a
>> >> cross-language standard. Leave it out of the API for now, or add both:
>> >>
>> >> Builder with(CharSequence, Charset);
>> >> Builder withUnencoded(CharSequence);
>> >>
>> >
>> > CharSequence has no easy method to convert to a byte[]. While it could
>> be
>> > done, it looks to be more of a streaming interface.  Let's leave that
>> out.
>>
>> I was thinking:
>>
>> /**
>>  * Adds a character sequence item to the hasher using the
>> specified encoding.
>>  *
>>  * @p

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Claude Warren
Builder discussion:

Let's go with

>> Builder with(CharSequence, Charset);
>> Builder withUnencoded(CharSequence);

Shape Discussion:

as for getNumberOfBytes() it should return the maximum number of bytes
returned by a getBits() call to a filter with this shape.  So yes, if there
is a compressed internal representation, no it won't be that.  It is a
method on Shape so it should literally be Math.ceil( getNumberOfBits() /
8.0 )

Basically, if you want to create an array that will fit all the bits
returned by BloomFilter.iterator() you need an array of
Shape.getNumberOfBytes().  And that is actually what I use it for.

Bloom Filter Discussion:

I have not used probability() on a single filter, just on the Shape.
However, I can see the value of it.  It seems to me that the difference
between Shape.probability() and BloomFilter.probability() would give you an
indication of how many collisions have already occurred.

In the SetOperations class there is an "estimateSize()" method.  It is the
only single Bloom filter argument method in the class and I think it should
be moved into BloomFilter so we have 2 new methods:

probability()
estimateSize()
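
For reference, the usual estimate computed from the cardinality c is
n ≈ -(m/k) * ln(1 - c/m); as a sketch of the moved method (accessor names
assumed from this thread):

public double estimateSize() {
    double m = getShape().getNumberOfBits();
    double k = getShape().getNumberOfHashFunctions();
    double c = cardinality();
    return -(m / k) * Math.log(1.0 - c / m);
}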

Counting Filter Discussion:

As for counting filters we could implement several
int
short
byte

Currently they would all have to return int counts but realistically that
is what would be used in code anyway.  Also, once primitives can be used in
generics this will be easier.

As for contains( Hasher, int ), I think we would need to add contains(
BloomFilter, int).  If I understand correctly, contains( BloomFilter, X )
would test that a BloomFilter has been added X times or rather that there
are enough counts in the right places to make it appear that BloomFilter
has been added X times.  When used with a Hasher, remove the duplicates,
and perform the same test.

I see no reason not to add them.
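
Sketched against the proposed iterator() and a getCount(index) accessor,
the multiplicity test reads:

public boolean contains( BloomFilter other, int x ) {
    // every cell enabled by the other filter must hold at least x counts
    PrimitiveIterator.OfInt indexes = other.iterator();
    while (indexes.hasNext()) {
        if (getCount( indexes.nextInt() ) < x) {
            return false;
        }
    }
    return true;
}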

On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert 
wrote:

>
>
> > On 17 Mar 2020, at 17:06, Claude Warren  wrote:
> >
> > On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert 
> > wrote:
> >
> >>
> >>
> >>> On 17 Mar 2020, at 15:41, Claude Warren  wrote:
> >>>
> >>> I agree with the HashFunction changes.
> >>
> >> OK, but which ones?
> >>
> >
> > DOH! this one...
> >
> >>
> >> Changing HashFunction to have two methods:
> >>
> >> long hash(byte[])
> >> long increment(int seed)
>
> OK. I’ll update.
>
> >>
> >>> I think Builder should have
> >>> with(byte[])
> >>> with(byte[], int offset, int len )
> >>
> >> Not convinced here. The HashFunction requires a byte[] and cannot
> operate
> >> on a range. This change should be made in conjunction with a similar
> change
> >> to HashFunction. So should we update HashFunction to:
> >>
> >>
> > Given the depth of the change let's just leave the with( byte[] )
> >
> >
> >>> with(String)
> >>>
> >>> I find that I use with(String) more than any other with() method.
> >>
> >> That may be so but String.getBytes(Charset) is trivial to call for the
> >> user. Then they get to decide on the encoding and not leave it to the
> >> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice
> as a
> >> cross-language standard. Leave it out of the API for now, or add both:
> >>
> >> Builder with(CharSequence, Charset);
> >> Builder withUnencoded(CharSequence);
> >>
> >
> > CharSequence has no easy method to convert to a byte[]. While it could be
> > done, it looks to be more of a streaming interface.  Let's leave that
> out.
>
> I was thinking:
>
> /**
>  * Adds a character sequence item to the hasher using the
> specified encoding.
>  *
>  * @param item the item to add
>  * @param charset the character set
>  * @return a reference to this object
>  */
> default Builder with(CharSequence item, Charset charset) {
> return with(item.toString().getBytes(charset));
> }
>
> /**
>  * Adds a character sequence item to the hasher. Each 16-bit
> character is
>  * converted to 2 bytes using little-endian order.
>  *
>  * @param item the item to add
>  * @return a reference to this object
>  */
> default Builder withUnencoded(CharSequence item) {
>     final int length = item.length();
>     final byte[] bytes = new byte[length * 2];
>     for (int i = 0; i < length; i++) {
>         final char ch = item.charAt(i);
>         // 16-bit char to 2 bytes, little-endian as documented above
>         bytes[i * 2] = (byte) ch;
>         bytes[i * 2 + 1] = (byte) (ch >>> 8);
>     }
>     return with(bytes);
> }

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Claude Warren
On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert 
wrote:

>
>
> > On 17 Mar 2020, at 15:41, Claude Warren  wrote:
> >
> > I agree with the HashFunction changes.
>
> OK, but which ones?
>

DOH! this one...

>
> Changing HashFunction to have two methods:
>
> long hash(byte[])
> long increment(int seed)
>
> > I think Builder should have
> > with(byte[])
> > with(byte[], int offset, int len )
>
> Not convinced here. The HashFunction requires a byte[] and cannot operate
> on a range. This change should be made in conjunction with a similar change
> to HashFunction. So should we update HashFunction to:
>
>
Given the depth of the change let's just leave the with( byte[] )


> > with(String)
> >
> > I find that I use with(String) more than any other with() method.
>
> That may be so but String.getBytes(Charset) is trivial to call for the
> user. Then they get to decide on the encoding and not leave it to the
> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice as a
> cross-language standard. Leave it out of the API for now, or add both:
>
> Builder with(CharSequence, Charset);
> Builder withUnencoded(CharSequence);
>

CharSequence has no easy method to convert to a byte[]. While it could be
done, it looks to be more of a streaming interface.  Let's leave that out.


> I would argue that you may use BloomFilters for Strings but if we see a
> BloomFilter as a collection then we should really support all Objects (with
> a decorator or by typing the Builder) or not support Objects. Currently we
> are not supporting any Object so for now would drop this and the
> Hasher.Builder then becomes a very simple API that specifies that you put
> in items represented as a byte[] and call build to create a Hasher
> containing those items and reset for further use.
>

I have code example in several places where I hash GeoCode entities.  Since
they are comprised of strings, for the most part, building a hasher for
them simply requires hashing the Strings.  Many web services use JSON and
most JSON is string based.  I disagree with removing with(String) because
it is so convenient in so many cases.  It also makes the code
cleaner/easier to read.  But if you feel strongly about removing it then OK.

The only other thing that is really bothersome is the lack of
Shape.getNumberOfBytes().  Yes it is easy to call Math.ceil(
Shape.getNumberOfBits() / 8.0 ).  But getNumberOfBytes() is much more
readable in the code.

Claude


Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Claude Warren
I agree with the HashFunction changes.

The only place I know of where more items are added to a hasher is in our
test code.  So I think just requiring Builder.build() to do a reset is
the correct solution.
I think Builder should have
with(byte[])
with(byte[], int offset, int len )
with(String)

I find that I use with(String) more than any other with() method.

If you want to add with(ByteBuffer) with the implementation you have above
I think that would be a good addition.  Anything else can probably be done
with a Decorator.

I would like to see https://github.com/apache/commons-collections/pull/131
get merged so that we can have more than one example of a hasher that
actually hashes.




On Tue, Mar 17, 2020 at 1:53 PM Alex Herbert 
wrote:

>
>
> > On 17 Mar 2020, at 11:08, Claude Warren  wrote:
> >
> > On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert  <mailto:alex.d.herb...@gmail.com>>
> > wrote:
> >
> >>
> >>> The Shape tells the hasher how many hash functions to apply to each
> item.
> >>
> >> OK. This is may misunderstanding. There is a contract that the Hasher is
> >> expected to fulfil but it is just not recorded in the javadoc. I can
> update
> >> the docs to indicate that:
> >>
> >> "A Hasher represents items of arbitrary byte size as a byte
> representation
> >> of fixed size (a hash). The hash for each item is created using a hash
> >> function; use of different seeds allows generation of different hashes
> for
> >> the same item. The hashes can be dynamically converted into the bit
> index
> >> representation used by a Bloom filter. The shape of the Bloom filter
> >> defines the number of indexes per item and the range of the indexes. The
> >> hasher functions to generate the correct number of indexes in the range
> >> required by the Bloom filter for each item it represents.
> >>
> >> Note that the process of generating hashes and mapping them to a Bloom
> >> filter shape may create duplicate indexes. The hasher may generate fewer
> >> than the required number of hash functions per item if duplicates have
> been
> >> removed."
> >>
> >> Looks Good
>
> Added to master.
>
> >
> >>> The Shape number of items is how many items are expected to be in the
> >> final
> >>> Bloom filter, it is more the expected value not a hard limit.
> >>
> >> Yes. As discussed before this is not actually required for a Bloom
> filter
> >> to function, it is required to maintain the intended purpose of the
> filter
> >> when it was constructed.
> >>
> >>
> >>
> >>>
> >>> The static hasher for example will not return duplicates so it might
> >> appear
> >>> that it has not respected the number of functions.  In addition there
> is
> >> no
> >>> indication from the hasher how many items it contains.
> >>
> >> Yes. So we state that the hasher represents one or more items.
> >>
> >>>
> >>> The inputs to the hash.builder are byte buffers that are fed to the
> hash
> >>> algorithm.  They are inputs to that algorithm.  So primitive types
> would
> >>> simply convert from the primitive type to its byte buffer
> representation.
> >>> Is that what you meant?
> >>
> >> I was unclear on the purpose of the Hasher.Builder. It seemed
> incomplete.
> >> If the builder is to add items then it seems strange to have:
> >>
> >> with(byte property)
> >> with(String property)
> >>
> >> It also seems strange to throw 'IllegalStateException if the Hasher is
> >> locked’ without explaining what this means. Is the builder intended to
> be
> >> concurrent? What is ‘locked’? Etc.
> >>
> >
> > Good catch. Documentation error.  Originally the Hasher.Builder was
> locked
> > once it generated a Hasher that was removed but the documentation was not
> > cleaned up.
>
> I’ve removed this from the Hasher.Builder interface.
>
> >
> >
> >> The byte could not possibly represent many meaningful objects. The
> string
> >> is trivially converted to UTF-8 bytes (as is done in the DynamicHasher).
> >> Both these methods could be added to the interface as default methods or
> >> preferrably dropped as they are so trivial.
> >>
> >> I changed the documentation to remove the encoding as UTF-8 requirement
> >> from the with(String) method. It seems like an implementation detail
> and a
> >> Hash

Re: [BloomFilters] changes to BloomFilter

2020-03-17 Thread Claude Warren
byte[] representation to feed to a
> HashFunction. Moving away from the current design would change HashFunction
> to specify methods to accept blocks of data and have a final method to get
> the hash. So making the HashFunction an online hash. This would then match
> Guava by having the some object accept items T and require a function to
> map the item T to blocks of data.
>
> However I note that the current implementation that accepts a byte[] for
> each call to get a hash value with a different seed can either use the
> byte[] or not (if cyclic). If the HashFunction was online then the choice
> of cyclic or iterative would not easily be possible. The Guava
> implementation creates a single hash and then the BloomFilter always uses
> this with a cyclic method. So the move away from the current design would
> be less flexible to allow different implementations of hashing.
>
> So we keep the byte[] interface to HashFunction for now. A performance
> test can be used to determine if there is an advantage to an advanced
> DynamicHasher.Builder which can recycle byte[] space. Javadoc should be
> added to the HashFunction to indicate that the same bytes passed with the
> same seed should create the same output. The same bytes with a different
> seed should create different output with very high probability. A seed of
> zero is used as a reset signal for implementations that have cached
> computation results that the byte[] input is different from the previous
> call.
>
>
> The last thing is that the Hasher.isEmpty() is not used anywhere except
> the unit tests. It seems strange to have it. Can we just assume a Hasher
> is not empty. An empty hasher would return an iterator that does nothing.
>
Yes, I guess we don't need isEmpty(). I think it was originally used in
BloomFilter.merge() in a guard statement to test if the merge actually
needed to attempt to do anything.


> In summary:
>
> 1. change Hasher getBits to iterator
>
agree

> 2. improve documentation of Hasher and the contract that it should fulfil
> with respect to items and a Shape
>
absolutely

> 3. potentially drop Hasher.Builder unless there is a way to reset the
> Builder or create more
>
add the reset or have build() implicitly do a reset.

4. Or make Hasher.Builder typed to an object <T> so it is clear the with(…)
> methods are to accept a full representation of an item and add it to the
> in-progress Hasher currently being built
>
disagree.

5. Improve HashFunction javadoc on the use of the seed as a reset signal

agree

> 6. Drop Hasher.isEmpty()
>
ambivalent.

>
> That should clarify the currently functionality.
>
> Thought on this?
>
> Alex
>
>
> >
> > Claude
> >
> >
> > On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert 
> > wrote:
> >
> >>
> >> On 16/03/2020 07:57, Claude Warren wrote:
> >>> I made a quick pass at changing getHasher() to iterator().
> >>
> >> A look at the feasibility or have you started work on this? If so then
> >> I'll not start work on it as well.
> >>
> >> I changed master to return a boolean for the merge operations in
> >> BloomFilter. So the outstanding changes are to drop getHasher() from the
> >> BloomFilter interface in favour of an iterator, spliterator and a
> >> forEachBit method.
> >>
> >>> I think we can get rid of HasherBloomFilter as its purpose was really
> to
> >>> create a Bloom filter for temporary usage and it doesn't seem to be
> >>> required if we have a hasher that can be created from a Shape and a
> >>> function that creates an Iterator.
> >>
> >> I agree.
> >>
> >> One change that could be made is to clarify the contract between a
> >> Hasher and a BloomFilter. At present the Hasher can operate without a
> >> defined contract in this method:
> >>
> >> PrimitiveIterator.OfInt getBits(Shape shape)
> >>
> >> It should validate that it can generate indexes for the shape. But it
> >> doesn't have to. It could return unlimited indexes and they could be
> >> outside the number of bits of the BloomFilter.
> >>
> >> There does not appear to be any control anywhere on the number of hash
> >> functions generated by the Hasher. I would expect this test in the
> >> AbstractBloomFilterTest to pass:
> >>
> >> @Test
> >> public void hasherMergeTest() {
> >> int n = 1;
> >> int m = 10;
> >> HashFunctionIdentity h = new
> >> HashFunctionIdentityImpl("provider", "name",
> Signedness.SIGNED, ProcessType.CYCLIC, 0L);

Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Claude Warren
First I think that the hasher.getBits( Shape ) should be renamed to
iterator( Shape ).  It was poorly named at the start.

By definition a Hasher knows how many items it is going to insert.

The Shape tells the hasher how many hash functions to apply to each item.
The Shape number of items is how many items are expected to be in the final
Bloom filter, it is more the expected value not a hard limit.

Keeping in mind the possibility of hash collisions, I don't see a way to
check that the Hasher has respected the number of functions.

The static hasher for example will not return duplicates so it might appear
that it has not respected the number of functions.  In addition there is no
indication from the hasher how many items it contains.

The inputs to the hash.builder are byte buffers that are fed to the hash
algorithm.  They are inputs to that algorithm.  So primitive types would
simply convert from the primitive type to its byte buffer representation.
Is that what you meant?

The hasher contract is that it will generate integers in the proper range
and use the proper number of hash functions for each item that was added to
the builder and that repeated calls to getBits(Shape) will return the same
values.

Did I misunderstand something?

Claude


On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert 
wrote:

>
> On 16/03/2020 07:57, Claude Warren wrote:
> > I made a quick pass at changing getHasher() to iterator().
>
> A look at the feasibility or have you started work on this? If so then
> I'll not start work on it as well.
>
> I changed master to return a boolean for the merge operations in
> BloomFilter. So the outstanding changes are to drop getHasher() from the
> BloomFilter interface in favour of an iterator, spliterator and a
> forEachBit method.
>
> > I think we can get rid of HasherBloomFilter as its purpose was really to
> > create a Bloom filter for temporary usage and it doesn't seem to be
> > required if we have a hasher that can be created from a Shape and a
> > function that creates an Iterator.
>
> I agree.
>
> One change that could be made is to clarify the contract between a
> Hasher and a BloomFilter. At present the Hasher can operate without a
> defined contract in this method:
>
> PrimitiveIterator.OfInt getBits(Shape shape)
>
> It should validate that it can generate indexes for the shape. But it
> doesn't have to. It could return unlimited indexes and they could be
> outside the number of bits of the BloomFilter.
>
> There does not appear to be any control anywhere on the number of hash
> functions generated by the Hasher. I would expect this test in the
> AbstractBloomFilterTest to pass:
>
>  @Test
>  public void hasherMergeTest() {
>  int n = 1;
>  int m = 10;
>  HashFunctionIdentity h = new
> HashFunctionIdentityImpl("provider", "name",
>  Signedness.SIGNED, ProcessType.CYCLIC, 0L);
>  Hasher hasher = new Hasher() {
>  @Override
>  public boolean isEmpty() {
>  return false;
>  }
>  @Override
>  public HashFunctionIdentity getHashFunctionIdentity() {
>  return h;
>  }
>  @Override
>  public OfInt getBits(Shape shape) {
>  // Do not respect the shape number of hash functions
> but do respect
>  // the number of bits
>  return IntStream.range(0, m).iterator();
>  }
>  };
>  for (int k = 1; k < 5; k++) {
>  Shape shape = new Shape(h, n, m, k);
>  BloomFilter bf = createEmptyFilter(shape);
>  bf.merge(hasher);
>  assertEquals("incorrect cardinality", k, bf.cardinality());
>  }
>  }
>
> It currently does not as all the BloomFilters will not respect the Shape
> with which they were created, i.e. they disregard the number of hash
> functions in the Shape. So does the Hasher.
>
> I think some of the control should be returned to the BloomFilter. The
> Hasher would be reduced to a simple generator of data for the
> BloomFilter, for example:
>
>  PrimitiveIterator.OfInt getBits(int m);
>  PrimitiveIterator.OfInt getBits(int k, int m);
>  PrimitiveIterator.OfLong getBits();
>
> The BloomFilter then accept responsibility for converting the primitives
> to a suitable index and creating the correct number of hash functions
> (i.e. indexes).
>
> A merge operation with a BloomFilter then becomes:
>
> - check the Hasher is using the correct hash function identity
> - ask the Hasher for an iterator
> - set k bits in the filter using the iterator, mapping each to the range
> [0, m).

Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Claude Warren
I made a quick pass at changing getHasher() to iterator().

I think we can get rid of HasherBloomFilter as its purpose was really to
create a Bloom filter for temporary usage and it doesn't seem to be
required if we have a hasher that can be created from a Shape and a
function that creates an Iterator.

On Sun, Mar 15, 2020 at 6:08 PM Alex Herbert 
wrote:

> On Sun, 15 Mar 2020, 17:27 Claude Warren,  wrote:
>
> > We have spoken elsewhere about removing getHasher() and adding iterator().
> > In addition should we add forEachBit( IntConsumer )?
>
>
> I was thinking the same. So we provide an iterator allowing failfast on the
> first index that fails a criteria, e.g. for contains, and a foreach
> allowing efficient receipt of all indexes.
>
> The only thing missing is whether we add a spliterator which has in its API
> the ability to specify DISTINCT and the exact size of the number of
> indexes. The spliterator can be a default method using the iterator to
> create it. An implementation can provide one if it wants.
>
>
>
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


[BloomFilters] changes to BloomFilter

2020-03-15 Thread Claude Warren
We have spoken elsewhere about removing getHasher() and adding iterator().
In addition should we add forEachBit( IntConsumer )?
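
If iterator() is adopted, forEachBit could even be a default method; a
sketch:

default void forEachBit( IntConsumer consumer ) {
    // PrimitiveIterator.OfInt accepts an IntConsumer directly
    iterator().forEachRemaining( consumer );
}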



-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] Bloom filters

2020-03-14 Thread Claude Warren
I am confused.

If we restrict what Shape stores to "m" and "k" we can provide methods to
estimate "n", and "p" for that estimated "n".

If the constructors for Shape remains as as they are then developers don't
need to go online to get numbers to plug into Shape.  They only need to go
online if they want to understand the interplay between the values.

I think that most of the time developers will know the number of items and
a probability and let "k" and "m" default.

It seems like the API is built into the Shape constructor.  Pass in any
hashFunctionIdentity and various parameters and you can get the other
parameters.

On Sat, Mar 14, 2020 at 9:50 AM Alex Herbert 
wrote:

> Should we not provide an API to compute this in code. Otherwise every
> developer who uses the code will have to go to the online Bloom filter
> calculator to check. They cannot do this dynamically in a program. If the
> equations are standard then I think we should provide a calculator.
>
> And Shape does not need to store the number of items to support a Bloom
> filter. Strictly speaking it only needs the number of bits. But bundling
> the hash function identity and number of hash functions saves you having to
> pass that separately to any Bloom filter and removes the requirement to
> specify these separately in the Bloom filter interface.
>
>
>
> On Sat, 14 Mar 2020, 09:31 Claude Warren,  wrote:
>
> > Shape is not intended to "Perform the standard computations using some of
> > n, m, k, p to produce optimal values for the other values of n, m, k, p:"
> > that is left to the developer to determine possibly with the help of
> > https://hur.st/bloomfilter/ as referenced in the class javadoc. However,
> > writing the constructors ends up performing the standard computations.
> >
> > On Sat, Mar 14, 2020 at 8:54 AM Alex Herbert 
> > wrote:
> >
> > >
> > >
> > > > On 10 Mar 2020, at 01:36, Alex Herbert 
> > wrote:
> > > >
> > > > I was going to modify the BloomFilter interface as discussed on this
> > > thread.
> > > >
> > > > However before that I have made some changes to the current code to
> > tidy
> > > it up. PR is [1]. The changes target the following:
> > >
> > > >
> > > > [1] https://github.com/apache/commons-collections/pull/140 <
> > > https://github.com/apache/commons-collections/pull/140>
> > > >
> > >
> > > This PR has updated the validations that the Shape does during
> > > construction. Number of bits has changed from >= 8 to > 0. But working
> on
> > > this it seems to me that the Shape is a composite class and should be
> > > refactored.
> > >
> > > I think that the Shape is combining two functionalities and should be
> > > separated:
> > >
> > > 1. Perform the standard computations using some of n, m, k, p to
> produce
> > > optimal values for the other values of n, m, k, p:
> > >
> > > n = number of items
> > > m = number of bits
> > > k = number of hash functions
> > > p = Probability of false positives (when storing n items)
> > >
> > > n = ceil(m / (-k / ln(1 - exp(ln(p) / k))))
> > > p = pow(1 - exp(-k / (m / n)), k)
> > > m = ceil((n * ln(p)) / ln(1 / pow(2, ln(2))))
> > > k = round((m / n) * ln(2))
> > >
> > > 2. Provide the functional shape for a Bloom filter
> > >
> > > This only requires storing m and k. It also requires storing the hash
> > > function identity.
> > >
> > > I suggest:
> > >
> > > - Moving all the computations to a ShapeCalculator
> > >
> > > - Only storing k and m in the Shape.
> > >
> > > - If you want to know n given p or p given n then you just call the
> > method
> > > in the ShapeCalculator.
> > >
> > > - Change constructors in the Shape to allow construction with:
> > >
> > > n, m; where n < m; compute k
> > > n, m, k; where n < m; validate p
> > > n, p; compute k and m
> > > m, k, p; compute n
> > > k, m;
> > >
> > > This adds the constructor (k, m) which is all you actually need for a
> > > working Bloom filter. Possible we should validate k < m.
> > >
> > > The constructor using n, m and k only serves to validate p is less
> than 1
> > > when the filter is saturated since n is only used to compute p. Perhaps
> > > this constructor should be removed.
> > >

Re: [collections] Bloom filters

2020-03-14 Thread Claude Warren
Shape is not intended to "Perform the standard computations using some of
n, m, k, p to produce optimal values for the other values of n, m, k, p:"
that is left to the developer to determine possibly with the help of
https://hur.st/bloomfilter/ as referenced in the class javadoc. However,
writing the constructors ends up performing the standard computations.
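
For what it is worth, those computations lift straight into such a
calculator; a sketch using the formulas from the javadoc-linked page
(method names illustrative):

final class ShapeCalculator {

    // k = round((m / n) * ln(2))
    static int optimalK( int m, int n ) {
        return (int) Math.round( (double) m / n * Math.log( 2 ) );
    }

    // m = ceil((n * ln(p)) / ln(1 / pow(2, ln(2))))
    static int optimalM( int n, double p ) {
        return (int) Math.ceil( n * Math.log( p )
            / Math.log( 1.0 / Math.pow( 2, Math.log( 2 ) ) ) );
    }

    // p = pow(1 - exp(-k / (m / n)), k)
    static double probability( int m, int n, int k ) {
        return Math.pow( 1 - Math.exp( -k / ( (double) m / n ) ), k );
    }
}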

On Sat, Mar 14, 2020 at 8:54 AM Alex Herbert 
wrote:

>
>
> > On 10 Mar 2020, at 01:36, Alex Herbert  wrote:
> >
> > I was going to modify the BloomFilter interface as discussed on this
> thread.
> >
> > However before that I have made some changes to the current code to tidy
> it up. PR is [1]. The changes target the following:
>
> >
> > [1] https://github.com/apache/commons-collections/pull/140 <
> https://github.com/apache/commons-collections/pull/140>
> >
>
> This PR has updated the validations that the Shape does during
> construction. Number of bits has changed from >= 8 to > 0. But working on
> this it seems to me that the Shape is a composite class and should be
> refactored.
>
> I think that the Shape is combining two functionalities and should be
> separated:
>
> 1. Perform the standard computations using some of n, m, k, p to produce
> optimal values for the other values of n, m, k, p:
>
> n = number of items
> m = number of bits
> k = number of hash functions
> p = Probability of false positives (when storing n items)
>
> n = ceil(m / (-k / ln(1 - exp(ln(p) / k))))
> p = pow(1 - exp(-k / (m / n)), k)
> m = ceil((n * ln(p)) / ln(1 / pow(2, ln(2))))
> k = round((m / n) * ln(2))
>
> 2. Provide the functional shape for a Bloom filter
>
> This only requires storing m and k. It also requires storing the hash
> function identity.
>
> I suggest:
>
> - Moving all the computations to a ShapeCalculator
>
> - Only storing k and m in the Shape.
>
> - If you want to know n given p or p given n then you just call the method
> in the ShapeCalculator.
>
> - Change constructors in the Shape to allow construction with:
>
> n, m; where n < m; compute k
> n, m, k; where n < m; validate p
> n, p; compute k and m
> m, k, p; compute n
> k, m;
>
> This adds the constructor (k, m) which is all you actually need for a
> working Bloom filter. Possible we should validate k < m.
>
> The constructor using n, m and k only serves to validate p is less than 1
> when the filter is saturated since n is only used to compute p. Perhaps
> this constructor should be removed.
>
> My argument is that a Bloom filter does not require anything to function
> other than the number of bits (m) and a hash functions (k). Typical use
> case is to construct using (n, p) if you know your number of items and
> filter efficiency, or a constructor using m if you know your memory
> limitations.
>
> Do we need any other constructors, e.g.?
>
> m, p; compute k
>
> Note that bundling the HashFunctionIdentity with the Shape ties the Shape
> to a functional Bloom filter. It could be renamed to BloomFilterShape to
> make it clear that it is directly linked to the Bloom filter. But that is
> over verbose.
>
> However the name Shape is generic and does not incorporate the fact that
> it also has a HashFunctionIdentity. There should be something in the Shape
> javadoc to make it clear the Shape is not just a store for values
> associated with any Bloom filter (m, k) but is specific to a working Bloom
> filter as it also defines the hash function to be used with a Bloom filter.
>
> Alex
>
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Restart Build?

2020-03-08 Thread Claude Warren
I have a pull request (
https://github.com/apache/commons-collections/pull/131) that failed due to
an external connection being reset.  Is there a way to restart the build
without creating a new pull request or pushing to git?

Claude

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] Bloom filters

2020-03-08 Thread Claude Warren
With the upcoming change the StaticHasher usage model has changed.  It was
serving two purposes:

   1. as a mechanism to preserve the list of integers from the BloomFilter
   as well as the shape.
   2. as a way to construct a Hasher from a collection of integers and a
   shape so that they could be merged into a Bloom filter without the overhead
   of constructing a temporary Bloom filter to use in the merge.

The first purpose is removed by the removal of getHasher and addition of
iterator() in the BloomFilter interface.

I think we need a Hasher that accepts a shape and a collection of Integers
or a function producing an iterator.  Something like:

{code:java}
public static class CollectionHasher implements Hasher {

    Shape shape;
    Supplier<PrimitiveIterator.OfInt> func;

    CollectionHasher( Supplier<PrimitiveIterator.OfInt> func, Shape shape ) {
        this.shape = shape;
        this.func = func;
    }

    CollectionHasher( Collection<Integer> collection, Shape shape ) {
        this.shape = shape;
        this.func = new Supplier<PrimitiveIterator.OfInt>() {
            Collection<Integer> coll = collection;

            @Override
            public PrimitiveIterator.OfInt get() {
                return new PrimitiveIterator.OfInt() {
                    Iterator<Integer> iter = coll.iterator();

                    @Override
                    public boolean hasNext() {
                        return iter.hasNext();
                    }

                    @Override
                    public int nextInt() {
                        return iter.next().intValue();
                    }

                    @Override
                    public Integer next() {
                        return iter.next();
                    }
                };
            }
        };
    }

    @Override
    public PrimitiveIterator.OfInt getBits(Shape shape) {
        if (!this.shape.equals(shape)) {
            throw new IllegalArgumentException(String.format(
                "Hasher shape (%s) is not the same as shape (%s)",
                this.shape.toString(), shape.toString()));
        }
        return func.get();
    }

    @Override
    public HashFunctionIdentity getHashFunctionIdentity() {
        return shape.getHashFunctionIdentity();
    }

    @Override
    public boolean isEmpty() {
        return !func.get().hasNext();
    }
}
{code}
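
With something like that in place, a persisted list of enabled bits can be
merged straight into a filter without the temporary Bloom filter (sketch):

{code:java}
// savedIndexes is a Collection<Integer> of previously stored bit indexes
filter.merge( new CollectionHasher( savedIndexes, shape ) );
{code}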

On Sun, Mar 8, 2020 at 12:39 AM Alex Herbert 
wrote:

>
> > On 6 Mar 2020, at 02:14, Alex Herbert  wrote:
> >
> > The change to make the CountingBloomFilter an interface is in this PR
> [1].
>
> Claude has stated in a review of the PR on GitHub that the change to
> CountingBloomFilter as an interface is good.
>
> I will now progress to updating the BloomFilter interface as previously
> discussed and put that into a PR. Changes would be:
>
> - boolean return values from the merge operations.
> - remove getHasher() and switch to providing an iterator of enabled indexes
>
> As per below:
> > public interface BloomFilter {
> >
> >   int andCardinality(BloomFilter other);
> >
> >   int cardinality();
> >
> >   boolean contains(BloomFilter other);
> >
> >   boolean contains(Hasher hasher);
> >
> >   long[] getBits();
> >
> >   // Change
> >   PrimitiveIterator.OfInt iterator();
> >
> >   Shape getShape();
> >
> >   // Change
> >   boolean merge(BloomFilter other);
> >
> >   // Change
> >   boolean merge(Hasher hasher);
> >
> >   int orCardinality(BloomFilter other);
> >
> >   int xorCardinality(BloomFilter other);
> >
> > }
>
> Given the CountingBloomFilter provides a forEach(BitCountConsumer) method
> it may be useful to also have the following method to receive all the
> enabled indexes:
>
> forEach(IntConsumer)
>
> Thus you can use the iterator of indexes for fail-fast checking against
> each index, or use the forEach method when you know you want to process all
> the bit indexes. In many cases the forEach can be more efficiently
> implemented than an iterator and would avoid an iterator object creation.
>
>
> >
> >
> > [1] https://github.com/apache/commons-collections/pull/137
> >  
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] Bloom filters

2020-03-03 Thread Claude Warren
take a look at
https://github.com/apache/commons-collections/pull/131/files#diff-8b2bf046dc35c88908eef196937173e1

This is a different Hasher with a smaller data footprint than
DynamicHasher.  Like the DynamicHasher it builds the values on the fly.
Can Spliterators be implemented on them easily?

I am torn between the interesting uses of a Spliterator and the well-known
usage pattern for an iterator.  Perhaps we should implement both.  It
seems to me (without really looking) that we should be able to implement an
Iterator that uses the Spliterator.

Thoughts?

On Tue, Mar 3, 2020 at 11:54 AM Alex Herbert 
wrote:

>
> On 02/03/2020 22:34, Claude Warren wrote:
> > So what we have then is:
> >
> > public interface BloomFilter {
> >
> >   int andCardinality(BloomFilter other);
> >
> >   int cardinality();
> >
> >   boolean contains(BloomFilter other);
> >
> >   boolean contains(Hasher hasher);
> >
> >   long[] getBits();
> >
> >   // Change
> >   PrimitiveIterator.OfInt iterator();
> >
> >   Shape getShape();
> >
> >   // Change
> >   boolean merge(BloomFilter other);
> >
> >   // Change
> >   boolean merge(Hasher hasher);
> >
> >   int orCardinality(BloomFilter other);
> >
> >   int xorCardinality(BloomFilter other);
> >
> > }
> >
> > public interface BitCountConsumer {
> >  boolean accept(int index, int count);
> > }
> >
> > public interface CountingBloomFilter extends BloomFilter {
> >
> >   // Change
> >   boolean remove(BloomFilter other);
> >
> >   // Change
> >   boolean remove(Hasher hasher);
> >
> >   // Change
> >   boolean add(CountingBloomFilter other);
> >
> >   // Change
> >   boolean subtract(CountingBloomFilter other);
> >
> >   // Change
> >   void getCounts(BitCountConsumer consumer);
> >
> > }
> >
> > public final class StaticHasher implements Hasher {
> >
> >   public StaticHasher(final Hasher hasher, final Shape shape) { ... }
> >
> >   public StaticHasher(final Iterator<Integer> iter, final Shape shape) { ... }
> >
> >   public StaticHasher(final Collection<Integer> collection, final Shape shape) { ... }
> >
> >   public Shape getShape() { ... }
> >
> >   public int size() { ... }
> >
> >   // Change
> >   public int max() { ... }
> >
> > }
> >
> > public final class FixedHasher implements Hasher {
> >
> >   public FixedHasher(final Hasher hasher, final Shape shape) { ... }
> >
> >   public FixedHasher(final Iterator<Integer> iter, final Shape shape) { ... }
> >
> >   public FixedHasher(final Collection<Integer> collection, final Shape shape) { ... }
> >
> >   public Shape getShape() { ... }
> >
> >   public int size() { ... }
> >
> >   public int max() { ... }
> >
> > }
> >
> > I have a pull request in to add a CachingHasher.
> >
> > While we are at it, I think we should change Hasher.getBits(Shape) to
> > Hasher.iterator(Shape) to match BloomFilter.iterator() and to limit
> > confusion between Hasher.getBits( Shape ) (which returns an iterator) and
> > BloomFilter.getBits() (which returns a long[]).
> >
> > The interface changes look good to me.  I agree with your implementation
> > ideas.
> >
> > I do have one use case where I need the BloomFilter.iterator() results in
> > reverse order (I am estimating Log2 of the filter), but I can do that
> with
> > the methods here.  I don't think any of these changes significantly
> impacts
> > the uses that I currently have.
> >
> > Shall we move forward?
> >
> > Claude
>
> I finished a first draft of the CountingBloomFilter. I still have the
> tests to add for the new API.
>
> When working with this I had to write an Iterator (to create a Hasher
> for the soon to be changed API) which has very few lines of code, most
> of which are repeats due to the separation of hasNext() and nextInt()
> and the requirement to be able to call hasNext() repeatedly without
> calling nextInt(). This mandates storing the next value rather than a
> current position to search for a non-zero index:
>
>  static class IndexIterator implements PrimitiveIterator.OfInt {
>      int next;
>
>      IndexIterator() {
>          advance(0);
>      }

Re: [collections] Bloom filters

2020-03-02 Thread Claude Warren
So what we have then is:

public interface BloomFilter {

    int andCardinality(BloomFilter other);

    int cardinality();

    boolean contains(BloomFilter other);

    boolean contains(Hasher hasher);

    long[] getBits();

    // Change
    PrimitiveIterator.OfInt iterator();

    Shape getShape();

    // Change
    boolean merge(BloomFilter other);

    // Change
    boolean merge(Hasher hasher);

    int orCardinality(BloomFilter other);

    int xorCardinality(BloomFilter other);

}

public interface BitCountConsumer {
    boolean accept(int index, int count);
}

public interface CountingBloomFilter extends BloomFilter {

    // Change
    boolean remove(BloomFilter other);

    // Change
    boolean remove(Hasher hasher);

    // Change
    boolean add(CountingBloomFilter other);

    // Change
    boolean subtract(CountingBloomFilter other);

    // Change
    void getCounts(BitCountConsumer consumer);

}

public final class StaticHasher implements Hasher {

    public StaticHasher(final Hasher hasher, final Shape shape) { ... }

    public StaticHasher(final Iterator<Integer> iter, final Shape shape) { ... }

    public StaticHasher(final Collection<Integer> collection, final Shape shape) { ... }

    public Shape getShape() { ... }

    public int size() { ... }

    // Change
    public int max() { ... }

}

public final class FixedHasher implements Hasher {

    public FixedHasher(final Hasher hasher, final Shape shape) { ... }

    public FixedHasher(final Iterator<Integer> iter, final Shape shape) { ... }

    public FixedHasher(final Collection<Integer> collection, final Shape shape) { ... }

    public Shape getShape() { ... }

    public int size() { ... }

    public int max() { ... }

}

I have a pull request in to add a CachingHasher.

While we are at it, I think we should change Hasher.getBits(Shape) to
Hasher.iterator(Shape) to match BloomFilter.iterator() and to limit
confusion between Hasher.getBits( Shape ) (which returns an iterator) and
BloomFilter.getBits() (which returns a long[]).

The interface changes look good to me.  I agree with your implementation
ideas.

I do have one use case where I need the BloomFilter.iterator() results in
reverse order (I am estimating Log2 of the filter), but I can do that with
the methods here.  I don't think any of these changes significantly impacts
the uses that I currently have.

Shall we move forward?

Claude


On Mon, Mar 2, 2020 at 6:02 PM Alex Herbert 
wrote:

>
> On 02/03/2020 16:12, Claude Warren wrote:
> > Does getCounts() return a snapshot of the values when the call was made
> or
> > does it return values that may be updated during the retrieval.  If there
> > are 2 threads (one reading counts and one doing a merge) it seems to me
> > that the "iterate over the data without constructing objects" approach
> > means that the data may not be internally consistent.  But then we don't
> > really check that issue in the other methods either.  So let's go with
> the
> > fail fast BitCountConsumer approach.
>
> My assumption on this is that no filter is safe for concurrent use.
> Allowing a merge while reading the state is a concurrency issue. I don't
> think we want to go down the route of having fail-fast concurrency
> checking using a 'modified count' strategy, i.e. all operations save the
> current 'modified count' at the start and increment it at the end, if it
> changes before the end then there has been concurrent modification. It
> would be simpler to state that no filters are designed to be used
> concurrently.
>
> >
> > On the StaticHasher it is used for two purposes:
> >
> > 1) getting a list of enabled bits.
> > 2) creating a bloom filter from a list of enabled bits.  There was a
> > request early on for the ability to test membership in Bloom filter
> without
> > having to construct a bloom filter.  Basically, the developer has a list
> of
> > values and wants to check membership.
>
> I think the issue is that for 1 and 2 there are a few scenarios with
> different requirements since the list of enabled bits can be small or big.
>
> One feature of the StaticHasher is the requirement to remove duplicates
> and return a sorted order. This allows for an efficient storage form
> factor with the intension that it may be around for a while. But it is
> an overhead when creating a large one from a known set of non-duplicate
> bits, i.e. a Bloom filter.
>
> For (2) if you want to do a dynamic check to test membership then you
> use BloomFilter.contains(Hasher). I would pass an on-the-fly Hasher for
> this and not construct the entire set of indices into a StaticHasher
> anyway.
>
> This would be done using the DynamicHasher to build hashes on-the-fly,
> which may have duplicates.

Re: [collections] Bloom filters

2020-03-02 Thread Claude Warren
Does getCounts() return a snapshot of the values when the call was made or
does it return values that may be updated during the retrieval?  If there
are 2 threads (one reading counts and one doing a merge) it seems to me
that the "iterate over the data without constructing objects" approach
means that the data may not be internally consistent.  But then we don't
really check that issue in the other methods either.  So let's go with the
fail fast BitCountConsumer approach.

On the StaticHasher it is used for two purposes:

1) getting a list of enabled bits.
2) creating a bloom filter from a list of enabled bits.  There was a
request early on for the ability to test membership in Bloom filter without
having to construct a bloom filter.  Basically, the developer has a list of
values and wants to check membership.

It might make more sense to reimplement the StaticHasher using a long[]
internally (like the BloomFilter.getBits() method returns).  We have the
shape so we know the maximum length and we could trim it after loading to
remove trailing zero value longs.  This would remove the necessity of
converting the iterator into a set of some sort.  We can do the limit
checks quickly as we turn bits on.

Makes me think we might need to implement StandardBloomFilter to use long[]
as well.
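
A minimal sketch of the long[] idea (assuming Shape exposes the number of
bits via getNumberOfBits(); the bounds check is an assumption about how
the limit check would be done):

    // Sketch: load indices into a long[] the way a long[]-backed
    // StaticHasher might.  Duplicates collapse naturally and the range
    // check happens as each bit is turned on.
    static long[] toBitMap(PrimitiveIterator.OfInt indices, Shape shape) {
        final long[] bits = new long[(shape.getNumberOfBits() + 63) / 64];
        while (indices.hasNext()) {
            final int index = indices.nextInt();
            if (index < 0 || index >= shape.getNumberOfBits()) {
                throw new IllegalArgumentException("index out of range: " + index);
            }
            bits[index >> 6] |= 1L << index; // shift uses index % 64
        }
        return bits;
    }

Trailing zero-value longs could then be trimmed before the array is kept.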

Claude

On Mon, Mar 2, 2020 at 1:12 PM Alex Herbert 
wrote:

>
> On 02/03/2020 11:32, Claude Warren wrote:
> > my thought on changing the BloomFilter.merge() to return a boolean is
> along
> > the lines you had: return true on successful merge (even if there are no
> > changes in the enabled bits). And yes, for most cases the standard bloom
> > filter will return true, but the change is really to support extensions
> to
> > Bloom filter so I think it is reasonable.
> >
> > As for the getCounts().  Suppose we split it into two methods:
> >
> >  // return the indexes that are enabled.  This is equivalent to
> > CountingBloomFilter.getHasher().getBits( CountingBloomFilter.getShape()
> );
> >  Iterator<Integer> getIndexes()
> >  // get the count for the specific index.
> >  int getCount( int index );
> >
> > With these methods It becomes possible to construct an iterator of int[]
> or
> > Map.Entry or whatever else the developer wants.
>
> Having to call getCount() for every index is likely to be slower than
> all the garbage collection involved with iterating a disposable <index,
> count> pair. For an array backed storage the access will be Order(1).
> But other storage may be a lot slower. This method ideally should
> traverse the storage once and make each <index, count> pair available.
> Ideally this would be without creating each <index, count> pair as a
> disposable object.
>
> Following on from previously, another disadvantage of the consumer
> approach is the lack of fast fail. You can remedy this by adding a
> boolean return from the consumer to indicate if you want to consume more
> items:
>
> interface BitCountConsumer {
>  boolean accept(int index, int count);
> }
>
> A CountingBloomFilter implementation then has something like:
>
>  void getCounts(BitCountConsumer consumer) {
>      while (hasNext()) {
>          next();
>          if (!consumer.accept(index(), count())) {
>              break;
>          }
>      }
>  }
>
> Looking at the getHasher().getBits(...) idea I noticed that StaticHasher
> is constructed using:
>
> Iterator<Integer> iter
>
> not:
>
> PrimitiveIterator.OfInt
>
> I'd prefer the primitive specialisation in this constructor. It would
> break the HasherBloomFilter merge function but that can be fixed.
> However it may be redundant. The StaticHasher has some oddities with the
> available constructors:
>
> public StaticHasher(final StaticHasher hasher, final Shape shape)
>
> - why have a shape parameter? The constructor checks the shape is the
> same as that for the hasher. So why not just a plain copy constructor:
>
> public StaticHasher(final StaticHasher hasher)
>
> This constructor is only used in the unit tests. Given the StaticHasher
> is immutable then a copy constructor seems odd. I think it should be
> dropped.
>
>
> Here are is a constructor that is missing IMO:
>
> public StaticHasher(final Set<Integer> indices, final Shape shape)
>
> - Could be used to generically create a static hasher. The indices are
> ensured to be unique by the Set but should be checked to be within the
> shape.
>
>
> With these constructors the public StaticHasher(final Iterator<Integer>
> iter, final Shape shape) is only used by the BitSetBloomFilter:
>
>  public StaticHasher getHasher() {
>  return new StaticHasher(bitSet.stream().iterator(), getShape());
>  }
>
> Becomes:

Re: [collections] Bloom filters

2020-03-02 Thread Claude Warren
my thought on changing the BloomFilter.merge() to return a boolean is along
the lines you had: return true on successful merge (even if there are no
changes in the enabled bits). And yes, for most cases the standard bloom
filter will return true, but the change is really to support extensions to
Bloom filter so I think it is reasonable.

As for the getCounts().  Suppose we split it into two methods:

// return the indexes that are enabled.  This is equivalent to
CountingBloomFilter.getHasher().getBits( CountingBloomFilter.getShape() );
Iterator<Integer> getIndexes()
// get the count for the specific index.
int getCount( int index );

With these methods it becomes possible to construct an iterator of int[] or
Map.Entry or whatever else the developer wants.
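
For illustration, a minimal sketch of that composition (CountingBloomFilter
refers to the draft interface in this thread; the wrapper method itself is
hypothetical):

    // Sketch: build an Iterator of {index, count} pairs from the two
    // proposed accessors.
    static Iterator<int[]> countIterator(CountingBloomFilter filter) {
        final Iterator<Integer> indexes = filter.getIndexes();
        return new Iterator<int[]>() {
            @Override
            public boolean hasNext() {
                return indexes.hasNext();
            }

            @Override
            public int[] next() {
                final int index = indexes.next();
                return new int[] { index, filter.getCount(index) };
            }
        };
    }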

Claude

On Mon, Mar 2, 2020 at 10:48 AM Alex Herbert 
wrote:

> On 02/03/2020 09:38, Claude Warren wrote:
> > It is not too late to update the BloomFilter interface to have the merge
> > return a boolean.  The incorrect Shape would still throw an exception, so
> > the return value would only come into play if the bits could not be set.
> >
> > thoughts?
>
> I don't see the harm in it. But what would the return value be for?
>
> For a standard collection it would be if the collection was changed by
> the operation:
>
> Collection.add/remove return "true if this collection changed as a
> result of the call"
>
> So here is the equivalent:
>
> return "true if this filter was changed as a result of the call"
>
> This is computationally slow to track. It also is confusing if the
> filter was successfully merged but no bits were changed to then return
> false because the filter was actually incorporated. So it would go along
> the lines that we discussed for the counting Bloom filter:
>
> return "true if this filter was successfully merged as a result of the
> call"
>
> For most cases in the current library it would be true when an exception
> is not thrown. However the merge of the counting Bloom filter may have
> reason to return false, e.g. overflow.
>
> > On Mon, Mar 2, 2020 at 7:56 AM Claude Warren  wrote:
> >
> >> for the remove(), add(), and subtract() methods I agree that void is not
> >> correct and it should be boolean and be the same as the value you would
> get
> >> from calling isValid().
> >>
> >> You are correct the getCounts() should return an iterator of some type
> on
> >> int[], I don't know why I thought long[].  I am happy with a plain
> >> Iterator as the return.
>
> For the getCounts() method I am still looking for a way around having to
> create an <index, count> pair for everything in the filter. An
> alternative to an iterator is to use the consumer idea. Given there is
> no primitive specialisation of BiConsumer in the JDK 8 function
> interfaces we define our own:
>
> interface BitCountConsumer {
>  void accept(int index, int count);
> }
>
> The CountingBloomFilter then has:
>
>  void forEachCount(BitCountConsumer consumer);
>  // Or
>  void getCounts(BitCountConsumer consumer);
>
>
> You can then directly pass the counts from the backing storage to the
> destination.
>
> Advantages:
> - No object creation
> Disadvantages
> - The counts cannot be streamed
>
> An alternative is to provide an Iterator of long with the index and
> count packed into a long with methods to extract them:
>
> PrimitiveIterator.OfLong getCounts();
>
> static int getIndex(long indexCountPair) {
>     return (int) (indexCountPair >>> 32);
> }
>
> static int getCount(long indexCountPair) {
>     return (int) indexCountPair;
> }
>
> This will almost certainly be a cause for bugs/issues from users.
>
> I believe that the counts will be used for 2 main use cases:
>
> 1. Storage
>
> 2. Adding to another counting Bloom filter
>
> Both cases are likely to be done serially and not in parallel. So
> providing a consumer based API to receive the counts would work.
>
> WDYT?
>
> >>
> >> Claude
> >>
> >>
> >>
> >> On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert 
> >> wrote:
> >>
> >>>
> >>>> On 1 Mar 2020, at 15:39, Claude Warren  wrote:
> >>>>
> >>>> I think the CountingBloomFilter interface needs to extend BloomFilter.
> >>> I said that but did not write it, sorry.
> >>>
> >>>> I think I am confused.
> >>>>
> >>>> I would expect CountingBloomFilter to have
> >>>>
> >>>> interface CountingBloomFilter extends BloomFilter {

Re: [collections] Bloom filters

2020-03-02 Thread Claude Warren
It is not too late to update the BloomFilter interface to have the merge
return a boolean.  The incorrect Shape would still throw an exception, so
the return value would only come into play if the bits could not be set.

thoughts?

On Mon, Mar 2, 2020 at 7:56 AM Claude Warren  wrote:

> for the remove(), add(), and subtract() methods I agree that void is not
> correct and it should be boolean and be the same as the value you would get
> from calling isValid().
>
> You are correct the getCounts() should return an iterator of some type on
> int[], I don't know why I thought long[].  I am happy with a plain
> Iterator<int[]> as the return.
>
> Claude
>
>
>
> On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert 
> wrote:
>
>>
>>
>> > On 1 Mar 2020, at 15:39, Claude Warren  wrote:
>> >
>> > I think the CountingBloomFilter interface needs to extend BloomFilter.
>>
>> I said that but did not write it, sorry.
>>
>> >
>> > I think I am confused.
>> >
>> > I would expect CountingBloomFilter to have
>> >
>> > interface CountingBloomFilter extends BloomFilter {
>> >// these 2 methods are the reverse of merge()
>> >void remove( BloomFilter );
>> >void remove( Hasher );
>>
>> Fine. Same intention but different method names. But why void? It forces
>> you to check if the remove was valid with a second call. On the plus side
>> it matches the void merge(…) methods and in most cases a user would not
>> care to check anyway. If they are controlling the filter then leave it to
>> them to make sure they do not remove something they did not add.
>>
>> >
>> >// these 2 methods are the addition/subtraction of counts
>> >void add( CountingBloomFilter )
>> >void subtract( CountingBloomFilter );
>>
>> Fine. But same comment as above with the void return.
>>
>> >
>> >// 2 methods to retrieve data
>> >Stream<long[]> getCounts();
>>
>> I don’t like this use of long[]. In my previous post I argued that if you
>> were to ever want to store more than max integer items in a filter then the
>> Bloom filter would have more bit indices than max integer. So we never have
>> to support long counts. A filter that exceeds max integer for a count is
>> highly likely to be saturated and no use as a filter anyway.
>>
>> For most backing implementations the object type of the stream will be
>> different so you will have to write a Spliterator implementation or wrap
>> some iterator anyway. So why not return the Spliterator:
>>
>>Spliterator<int[]> getCounts();
>>
>> Since the backing implementation will likely not store int[] pairs then
>> this will have a lot of object creation and garbage collection overhead to
>> go through the counts. This does not seem to be a big concern here if the
>> purpose is the same as for the BloomFilter for long[] getBits(), i.e. to
>> get a canonical representation for storage.
>>
>> Note: The Spliterator may not have a known size (number of non zero bits)
>> at creation, for example if the counts are stored in a fixed size array.
>> Thus efficient parallel traversal by binary tree splitting is limited by
>> how evenly the counts are distributed. For a backing implementation using a
>> collection then the size should be known. In this case a Spliterator would
>> be of more use than a plain Iterator. You can convert one to the other
>> using:
>>
>> java.util.Spliterators:
>> public static <T> Iterator<T> iterator(Spliterator<? extends T> spliterator)
>> public static <T> Spliterator<T> spliterator(Iterator<? extends T> iterator,
>>                                              long size,
>>                                              int characteristics)
>>
>> So which to choose for the API?
>>
>> The Hasher currently uses an Iterator:
>>
>> PrimitiveIterator.OfInt getBits(Shape shape);
>>
>> In the case of a StaticHasher this could return a spliterator. But the
>> DynamicHasher needs a reworking of the internal Iterator class. It could be
>> a Spliterator to use the new IntConsumer API but in most (all) cases
>> splitting would not be possible for dynamic hashing as the parts are
>> produced in order. It is likely that they will be consumed sequentially too.
>>
>> I would suggest that Spliterator is the more modern implementation,
>> despite not always being applicable to parallelisation in a stream.
>> Currently the iterator from the Hasher is used in forEachRemaining() and
>> while loop is approximately equal measure. The while loops are for a fast
>> exit and would be uglier if rewritten for a
>> Spliterator.tryAdvance(IntConsumer) syntax.

Re: [collections] Bloom filters

2020-03-01 Thread Claude Warren
for the remove(), add(), and subtract() methods I agree that void is not
correct and it should be boolean and be the same as the value you would get
from calling isValid().

You are correct the getCounts() should return an iterator of some type on
int[], I don't know why I thought long[].  I am happy with a plain
Iterator<int[]> as the return.

Claude



On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert 
wrote:

>
>
> > On 1 Mar 2020, at 15:39, Claude Warren  wrote:
> >
> > I think the CountingBloomFilter interface needs to extend BloomFilter.
>
> I said that but did not write it, sorry.
>
> >
> > I think I am confused.
> >
> > I would expect CountingBloomFilter to have
> >
> > interface CountingBloomFilter extends BloomFilter {
> >// these 2 methods are the reverse of merge()
> >void remove( BloomFilter );
> >void remove( Hasher );
>
> Fine. Same intention but different method names. But why void? It forces
> you to check if the remove was valid with a second call. On the plus side
> it matches the void merge(…) methods and in most cases a user would not
> care to check anyway. If they are controlling the filter then leave it to
> them to make sure they do not remove something they did not add.
>
> >
> >// these 2 methods are the addition/subtraction of counts
> >void add( CountingBloomFilter )
> >void subtract( CountingBloomFilter );
>
> Fine. But same comment as above with the void return.
>
> >
> >// 2 methods to retrieve data
> >Stream<long[]> getCounts();
>
> I don’t like this use of long[]. In my previous post I argued that if you
> were to ever want to store more than max integer items in a filter then the
> Bloom filter would have more bit indices than max integer. So we never have
> to support long counts. A filter that exceeds max integer for a count is
> highly likely to be saturated and no use as a filter anyway.
>
> For most backing implementations the object type of the stream will be
> different so you will have to write a Spliterator implementation or wrap
> some iterator anyway. So why not return the Spliterator:
>
>Spliterator<int[]> getCounts();
>
> Since the backing implementation will likely not store int[] pairs then
> this will have a lot of object creation and garbage collection overhead to
> go through the counts. This does not seem to be a big concern here if the
> purpose is the same as for the BloomFilter for long[] getBits(), i.e. to
> get a canonical representation for storage.
>
> Note: The Spliterator may not have a known size (number of non zero bits)
> at creation, for example if the counts are stored in a fixed size array.
> Thus efficient parallel traversal by binary tree splitting is limited by
> how evenly the counts are distributed. For a backing implementation using a
> collection then the size should be known. In this case a Spliterator would
> be of more use than a plain Iterator. You can convert one to the other
> using:
>
> java.util.Spliterators:
> public static <T> Iterator<T> iterator(Spliterator<? extends T> spliterator)
> public static <T> Spliterator<T> spliterator(Iterator<? extends T> iterator,
>                                              long size,
>                                              int characteristics)
>
> So which to choose for the API?
>
> The Hasher currently uses an Iterator:
>
> PrimitiveIterator.OfInt getBits(Shape shape);
>
> In the case of a StaticHasher this could return a spliterator. But the
> DynamicHasher needs a reworking of the internal Iterator class. It could be
> a Spliterator to use the new IntConsumer API but in most (all) cases
> splitting would not be possible for dynamic hashing as the parts are
> produced in order. It is likely that they will be consumed sequentially too.
>
> I would suggest that Spliterator is the more modern implementation,
> despite not always being applicable to parallelisation in a stream.
> Currently the iterator from the Hasher is used in forEachRemaining() and
> while loop is approximately equal measure. The while loops are for a fast
> exit and would be uglier if rewritten for a
> Spliterator.tryAdvance(IntConsumer) syntax.
>
> There is a use of the IteratorChain in HasherBloomFilter that would need a
> rethink for spliterators.
>
> The path of least resistance is to use Iterator for the API of
> CountingBloomFilter to be consistent with Hasher’s use of Iterator.
>
> WDYT?
>
>
> >boolean isValid()
>
> Fine. Allows some level of feedback that the counts are corrupt.
>
> > }
> >
> > Claude
> >
> >
> > On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert wrote:
> >

Re: [collections] Bloom filters

2020-03-01 Thread Claude Warren
I think the CountingBloomFilter interface needs to extend BloomFilter.

I think I am confused.

I would expect CountingBloomFilter to have

interface CountingBloomFilter extends BloomFilter {
    // these 2 methods are the reverse of merge()
    void remove( BloomFilter );
    void remove( Hasher );

    // these 2 methods are the addition/subtraction of counts
    void add( CountingBloomFilter );
    void subtract( CountingBloomFilter );

    // 2 methods to retrieve data
    Stream<long[]> getCounts();
    boolean isValid();
}
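
As a usage sketch of that shape (all names come from the draft above, not
a released API):

    // Hypothetical usage of the draft interface: merge another counting
    // filter in, reverse it, and check the validity flag.
    static void roundTrip(CountingBloomFilter target, CountingBloomFilter other) {
        target.add(other);        // add the counts from 'other'
        target.subtract(other);   // reverse the addition
        if (!target.isValid()) {
            throw new IllegalStateException("a count over/underflowed");
        }
    }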

Claude


On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert 
wrote:

>
>
> > On 1 Mar 2020, at 09:28, Claude Warren  wrote:
> >
> > The idea of a backing array is fine.  The only problem I see with it is
> > in very large filters (on the order of 10^8 bits and larger), but document
> > the size calculation and let the developer worry about it.
>
> Let us look at the use case where we max out the array. Using the Bloom
> filter calculator:
>
> n = 149,363,281
> p = 0.00125 (1 in 1000)
> m = 2147483647 (256MiB)
> k = 10
>
> n = 74,681,641
> p = 0.01 (1 in 50)
> m = 2147483647 (256MiB)
> k = 20
>
> n = 49,787,761
> p = 0.1 (1 in 24899)
> m = 2147483647 (256MiB)
> k = 30
>
> So you will be able to put somewhere in the order of 10^8 or 10^7 items
> into the filter. I would say that anyone putting more than that into the
> filter has an unusual use case. The CountingBloomFilter can throw an
> exception if m is too large and will throw an OutOfMemoryError if you
> cannot allocate an array large enough.
>
> One clear point here is that you cannot use a short as a 16-bit count
> would easily overflow. So you have to use an integer array for the counts.
>
> A maximum length int[] is roughly 8GB.
>
> What would another implementation cost in terms of memory? The
> TreeMap was the most space efficient. In the previous
> e-mail the saturation of a Bloom filter bits was approximately 50% when at
> the intended capacity. So we have to estimate the size of a TreeMap
> containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows the
> TreeMap memory scales linearly with entries:
>
>  32768 /  65536 (0.500) : TreeMap  =  1834061 bytes
>  65536 / 131072 (0.500) : TreeMap  =  3669080 bytes
> 131072 / 262144 (0.500) : TreeMap  =  7339090 bytes
>
> So what is the memory for a TreeMap with 2^30 indices. I make it about:
>
> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB
>
> I would say that this amount of RAM is unusual. It is definitely not as
> efficient as using an array. So very large counting Bloom filters are going
> to require some thought as to the hardware they run on. This may not be the
> case in 10 years time.
>
> I would say that we try an int[] backing array for the storage
> implementation and document it’s limitations. A different implementation
> could be provided in future if required.
>
> This could be done by making CountingBloomFilter an interface that extends
> BloomFilter with the methods:
>
> subtract(BloomFilter filter)
> subtract(Hasher filter)
>
> These will negate the effect of the corresponding merge(BloomFilter)
> operation.
>
> Do we also need access to the counts and add/subtract of another
> CountingBloomFilter?:
>
> add(CountingBloomFilter filter);
> subtract(CountingBloomFilter filter);
>
> Iterator<int[]> getCounts();
> int getSize(); // Number of items added
>
> The CountingBloomFilter is then an interface that defines how to reverse
> the merge of some bits into the filter.
>
> My concern is the inefficiency of the creation of objects in any method
> that provides access to the counts (e.g. above using an iterator as for
> Hasher.getBits()). I presume this method would be to allow some type of
> storage/serialisation of the filter akin to the long[] getBits() method of
> BloomFilter. So it may be better done using a method:
>
> int getCount(int index);
>
> The caller can then use long[] getBits() to get the indices set in the
> filter and then for each non-zero bit index call getCount(index). Or just
> not include the method as the counts are only of concern when storing the
> filter. This functionality is cleaner pushed into an implementation.
>
> In a previous post we discussed whether to throw an exception on
> overflow/underflow or raise an invalid flag. Using the invalid flag idea
> the interface would be:
>
> interface CountingBloomFilter {
> int add(CountingBloomFilter filter);
> int subtract(BloomFilter filter);
> int subtract(Hasher filter);
> int subtract(CountingBloomFilter filter);
> int getStatus();
>
> // Maybe
> int getSize();
> 

Re: [collections] Bloom filters

2020-03-01 Thread Claude Warren
The idea of a backing array is fine.  The only problem I see with it is
in very large filters (on the order of 10^8 bits and larger), but document
the size calculation and let the developer worry about it.

As for the merge question.  merge is a standard bloom filter operation.  It
is well defined in the literature.  Merging a bloom filter into a counting
bloom filter means incrementing the bit counts.  I think that merge/remove
should continue to operate as though the parameter were a standard bloom
filter.
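
A minimal sketch of that semantic, assuming an int[] of counts indexed by
bit and the PrimitiveIterator.OfInt form of BloomFilter.iterator() proposed
later in this thread:

    // Sketch: merging a standard Bloom filter into a counting filter
    // increments the count of every enabled bit index exactly once.
    static void merge(int[] counts, BloomFilter other) {
        final PrimitiveIterator.OfInt indices = other.iterator();
        while (indices.hasNext()) {
            counts[indices.nextInt()]++;
        }
    }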

We had spoken of implementing an adding/deleting method pair that would
operate on CountingBloomFilters and would add/subtract the counts.  (e.g.
add(CountingBloomFilter) and subtract(CountingBloomFilter))

I disagree with your proposal for the merge(Hasher) implementation, and I
am not certain that an add(Hasher) makes sense.  First consider that the
Hasher returns the bits that are to be enabled in the Bloom filter so
collisions are expected.  In the normal course of events a Hasher is used
to create a normal Bloom filter where all the duplicates are removed.  That
filter is then merged into a CountingBloomFilter.  So in some sense the
Hasher and the normal Bloom filter are the same.  So I would expect the
merge of the Hasher and the merge of the normal Bloom filter created from
that hasher into a CountingBloomFilter to yield the same result.  If you
wanted to add an add(Hasher)/delete(Hasher) pair of functions to a
CountingBloomFilter you could implement with duplicate counting, but I am
not certain of the validity of such a count and I fear that it muddies the
waters with respect to what the CountingBloomFilter is counting.
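
A small sketch of the distinction, again assuming an int[] of counts (an
illustration, not library code):

    // Sketch: a Hasher may yield duplicate indices; flattening them
    // through a BitSet (as building a normal Bloom filter effectively
    // does) removes the duplicates before the counts are incremented.
    static void mergeDeduplicated(int[] counts, PrimitiveIterator.OfInt indices) {
        final java.util.BitSet seen = new java.util.BitSet(counts.length);
        while (indices.hasNext()) {
            seen.set(indices.nextInt());   // duplicates collapse here
        }
        for (int i = seen.nextSetBit(0); i >= 0; i = seen.nextSetBit(i + 1)) {
            counts[i]++;                   // each index counted once
        }
    }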

Claude

On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert 
wrote:

>
>
> > On 29 Feb 2020, at 07:46, Claude Warren  wrote:
> >
> > Alex would you take a look at pull request 131 [1].  it adds a new hasher
> > implementation and makes the HashFunctionValidator available for public
> use.
> >
> > https://github.com/apache/commons-collections/pull/131
>
> OK. I’ll take a look.
>
> I’ve been thinking about the counting Bloom filter and the backing
> storage. In summary:
>
> 1. The backing storage should be a fixed array.
> 2. Merging a Hasher should count duplicate indices, not flatten them all
> to a single count.
>
> For background I’ve used the standard formulas to estimate the number of
> indices that will be non-zero in a Bloom filter. The wikipedia page gives
> this formula for the expected number of bits set to 0 (E(q)) if you have
> inserted i elements into a filter of size m using k hash functions:
>
> E(q) = (1 - 1/m)^(ki)
>
> So a rough guess of the number of indices (bits) used by a filter is
> 1-E(q).
>
> Here is a table of Bloom filters with different collision probabilities
> and the proportion of bits that will be set when 1%, 10%, 100% of the
> capacity of the filter has been met:
>
> n      p      m       k   I      E(q)     bits
> 1000   1E-04  19171   13  10     0.9932   0.0068
> 1000   1E-04  19171   13  100    0.9344   0.0656
> 1000   1E-04  19171   13  1000   0.5076   0.4924
> 1000   1E-05  23963   17  10     0.9929   0.0071
> 1000   1E-05  23963   17  100    0.9315   0.0685
> 1000   1E-05  23963   17  1000   0.4919   0.5081
> 1000   1E-06  28756   20  10     0.9931   0.0069
> 1000   1E-06  28756   20  100    0.9328   0.0672
> 1000   1E-06  28756   20  1000   0.4988   0.5012
> 10000  1E-06  287552  20  100    0.9931   0.0069
> 10000  1E-06  287552  20  1000   0.9328   0.0672
> 10000  1E-06  287552  20  10000  0.4988   0.5012
>
> The point is that if you create a Bloom filter and fill it to 10% of the
> intended capacity the number of indices used will be about 6-7% of the
> filter bits.
>
> So how to store the counts? Currently the counting bloom filter uses a
> TreeMap. I tried:
>
> TreeMap<Integer, Integer>
> HashMap<Integer, Integer>
> TreeSet<MutableCount>
> HashSet<MutableCount>
> int[]
>
> The MutableCount is a custom object that stores the bit index and uses it
> for a hash code and then has a mutable integer count field. It allows the
> count to be incremented/decremented if the object is in the set:
>
> static final class MutableCount implements Comparable<MutableCount> {
>     final int key;
>     int count;
>     // etc
> }
>
> This is adapted from the Bag collection which stores an item count with
> a MutableInteger. Here the mutable count is part of the same object T
> inserted into the Set. So you can find the object, change the count and not
> have to put it back into the set. This is more efficient than changing the
> Integer stored in a Map.
>
> I’ve estimated memory usage using an idea based on this article.

Re: [collections] Bloom filters

2020-02-28 Thread Claude Warren
Alex would you take a look at pull request 131 [1].  it adds a new hasher
implementation and makes the HashFunctionValidator available for public use.

https://github.com/apache/commons-collections/pull/131

On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert 
wrote:

> I have created a PR that contains most of the changes discussed in this
> thread (see [1]).
>
> Please review the changes and comment here or on GitHub.
>
> I have left the CountingBloomFilter alone. I will reimplement the
> add/subtract functionality as discussed into another PR.
>
> Alex
>
> [1] https://github.com/apache/commons-collections/pull/133
>
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] Bloom filters

2020-02-19 Thread Claude Warren
Way back in your first post on this topic you talked about changing the
AbstractBloomFilter to use:


 @Override
 public int cardinality() {
     int count = 0;
     for (final long bits : getBits()) {
         count += Long.bitCount(bits);
     }
     return count;
 }

among other changes.  Those were the changes I was referring to.

Claude

On Wed, Feb 19, 2020 at 11:33 PM Alex Herbert 
wrote:

>
>
> > On 19 Feb 2020, at 21:14, Claude Warren  wrote:
> >
> > I think the compromise solution of removing the thrown exception and
> adding
> > an isInvalid() method makes sense and fits within the parameters of how
> the
> > counting Bloom filter should work.  However, I think the add() and
> remove()
> > methods (or add and subtract) should only accept
> CountingBloomFilter
> > arguments.  The reasoning being that the methods are specific  to the
> > counting filter and are defined in terms of the internal structure of the
> > filter (the count of each bit) that are not present in the "standard"
> bloom
> > filter.  I think that it makes it lexically clear.
>
> OK. We can always add a transaction counting Bloom filter later that will
> ensure entire filters are added or removed cleanly. For now it can be user
> beware documentation that states if you subtract a filter you did not add
> then you can end up with a set of counts that are not a sum of filters.
>
> I do note that you could do this anyway even with a transaction counting
> Bloom filter, e.g. A+B+C+D - E. It may or may not throw and the trust in on
> the user to ensure they do not subtract E from something that never had it
> added.
>
> Thinking about the invalid flag perhaps the methods merge/add/subtract
> should just return 0 if the filter state is valid and -1 if invalid. Thus
> once a filter is invalid all calls would return the invalid flag.
>
> The reason for this is efficiency. The counts are allowed to roam freely
> over any range for an integer. You have a valid flag that is the bitwise OR
> of every count you have adjusted. If any over/underflow occurs then the
> valid flag will have a negative sign bit. The methods then return the
> signed shift of the valid flag:
>
> return valid >> 31;
>
> This would be -1 if any count ever over/underflows, zero otherwise.
>
> This eliminates any conditional statements from the filter operations. The
> user just has to know this is how it works and so any downstream
> consumption of the counts would have to first check the isInvalid flag and
> sanitise them as necessary. You can even view the counts as unsigned int32
> if you never subtract filters.
>
> >
> > Do you have a branch someplace where you are doing this work?   I would
> > like to see the proposed changes you outlined for the AbstractBloomFilter
> > implemented as I think you are correct and they would be faster and
> cleaner.
>
> No but I’ll create one. I’m waiting for a big PR against collections to go
> through (an update to force use of checkstyle). Then I’ll revert the
> changes I made to the CountingBloomFilterTest to its previous state and put
> the changes into a branch. These would have a new internal structure for
> the counts and the methods for add/subtract.
>
> For the AbstractBloomFilter do you mean the use of the mod(long, int)
> method in the Signedness enum? I think the removal of Math.floorMod can be
> done in two places with bit shift but should we also use the % operator
> instead of Math.floorMod for Signedness.UNSIGNED?
>
>
> >
> > Claude
> >
> > On Wed, Feb 19, 2020 at 12:41 AM Alex Herbert 
> > wrote:
> >
> >>
> >>
> >>> On 18 Feb 2020, at 22:34, Gary Gregory  wrote:
> >>>
> >>> On Tue, Feb 18, 2020 at 5:32 PM Claude Warren 
> wrote:
> >>>
> >>>> Last one first, why a tree map?  I think it is a holdover from an
> >> earlier
> >>>> implementation.  It can be any reasonable Map (e.g. HashMap).
> >>>>
> >>>> on the remove() issue.  You are absolutely correct, there is a bug.  I
> >>>>
> >>>
> >>> May you please add a test to validate whatever fix you guys implement?
> >> I'm
> >>> not following the details here but I want to stress the importance of
> >> tests
> >>> to help future maintainers validate any future changes and protect us
> >> from
> >>> regression bugs. IOW, while the current code base is in your head, more
> >>> tests! ;-)
> >>>
> >>
> >> I added a test that shows the current implementation does not do what I
> >> thought it should do based on the documentation in javadoc.
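
A minimal sketch of the branch-free valid-flag idea described earlier in
this message, assuming an int[] backing store (names are illustrative):

    // Sketch: 'valid' ORs in every adjusted count; its sign bit is set
    // once any count over/underflows, and it stays set.
    final class CountSketch {
        private final int[] counts;
        private int valid;

        CountSketch(int numberOfBits) {
            counts = new int[numberOfBits];
        }

        int adjust(int index, int delta) {
            final int updated = counts[index] + delta;
            counts[index] = updated;
            valid |= updated;   // a negative result flips the sign bit
            return valid >> 31; // -1 once invalid, 0 while valid
        }
    }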

Re: [collections] Bloom filters

2020-02-19 Thread Claude Warren
I think the compromise solution of removing the thrown exception and adding
an isInvalid() method makes sense and fits within the parameters of how the
counting Bloom filter should work.  However, I think the add() and remove()
methods (or add and subtract) should only accept CountingBloomFilter
arguments.  The reasoning is that the methods are specific to the
counting filter and are defined in terms of the internal structure of the
filter (the count of each bit) that are not present in the "standard" bloom
filter.  I think that it makes it lexically clear.

Do you have a branch someplace where you are doing this work?   I would
like to see the proposed changes you outlined for the AbstractBloomFilter
implemented as I think you are correct and they would be faster and cleaner.

Claude

On Wed, Feb 19, 2020 at 12:41 AM Alex Herbert 
wrote:

>
>
> > On 18 Feb 2020, at 22:34, Gary Gregory  wrote:
> >
> > On Tue, Feb 18, 2020 at 5:32 PM Claude Warren  wrote:
> >
> >> Last one first, why a tree map?  I think it is a holdover from an
> earlier
> >> implementation.  It can be any reasonable Map (e.g. HashMap).
> >>
> >> on the remove() issue.  You are absolutely correct, there is a bug.  I
> >>
> >
> > May you please add a test to validate whatever fix you guys implement?
> I'm
> > not following the details here but I want to stress the importance of
> tests
> > to help future maintainers validate any future changes and protect us
> from
> > regression bugs. IOW, while the current code base is in your head, more
> > tests! ;-)
> >
>
> I added a test that shows the current implementation does not do what I
> thought it should do based on the documentation in javadoc:
>
> "For each bit that is turned on in the other filter; if the other filter is
>  also a CountingBloomFilter the count is added to this filter, otherwise
> the
>  count is incremented by one."
>
> My test checks the counts are added.
>
> However it does what Claude thought it should do. The original test checks
> that the indices are incremented by 1.
>
> So what to do about counting? If overflow or underflow occur on addition
> or subtraction you can throw an exception to signal error but if a partial
> operation has been made this leaves the filter in an undetermined state.
> The filter does not actually correspond to a sum of filters anymore. So it
> is invalid. Should any further operations on the filter be prevented? In
> this case throwing an IllegalStateException for all further operations
> seems extreme. It may be better to have the addition/subtraction return
> true/false if the operation was clean, i.e similar to a standard Set
> collection that returns true if adding/removing an item changes the state,
> false otherwise. The filter could also maintain a state flag to indicate
> that an overflow/underflow has occurred and the filter no longer represents
> a sum of filters.
>
> Given this idea we change merge to increment the indices by 1 for any
> BloomFilter including a counting filter. It will not throw if it overflows
> but will set an invalid state flag. The logic here is the method is
> inherited from BloomFilter and so will function in the same way as other
> filters by accumulating set bits (indices).
>
> Then add some methods:
>
> boolean add(BloomFilter)
> boolean remove(BloomFilter)
> boolean isInvalid()
>
> These update the counts using the counts of the other filter if it is a
> CountingBloomFilter, otherwise add or subtract 1. The methods return true
> if the operation added or removed the filter entirely, i.e. it is now a
> part of the sum or was removed from the sum. Otherwise they return false
> and the filter is marked as an invalid sum of filters. This leaves the
> counting filter in an undetermined state but it can still be used as a
> non-counting filter as the indices will be correctly maintained.
>
> It is a compromise but there does not seem to be a nice way to do this
> without a lot of overhead. That would be to check the add/remove can be
> done entirely before actually performing the operation. If the operation
> would under/overflow then an exception can be thrown and the filter is left
> in a usable state. This could be achieved with some efficiency if the
> filter stored the maximum count. Then any add would be able to check for
> potential overflow quickly and do a safe add. However the subtraction could
> not know the indices that may underflow using a max value (since 0 - x
> underflows) and would always have to do a safe removal by checking all
> decrements before actually doing the entire removal. One idea here is a
> reusable list is kept to store all the indices that

Re: [collections] Bloom filters

2020-02-18 Thread Claude Warren
> be "subtracted" from the other.
>
> Change (b) is faster but could lead to errors. So how likely is overflow
> in a counting Bloom filter?
>
> Both can be supported but it would have to use a custom Bag
> implementation. This is pretty trivial and would actually speed up other
> parts of the filter by having direct access to the Map underlying the
> Bag for faster merge and remove operations. This would then be a change
> to the current Map<Integer, Integer> to a Map<Integer, MutableInteger>
> to store counts.
>
> I can update the code to fix the implementation but will only do so
> after a decision on overflow is made. We could even have a custom
> implementation that stores counts as a long for very high counting
> filters. Is this likely to be used? Only in a situation where the
> exceptions from overflow are an issue.
>
>
> Other questions about this:
>
> - Why public Stream<Map.Entry<Integer, Integer>> getCounts()?
>
> Perhaps this is better served raw as arrays of length 2. This has lower
> memory overhead:
>
> public Stream<int[]> getCounts()
>
> I've ruled out an accessor method as there is no equivalent for boolean
> getBit(int index) in BloomFilter, e.g:
>
> public int getCount(int index)
>
> - Why a TreeMap? Is the order of the indices important?
>
> Or do you have some benchmarks to show that the TreeMap handles lots of
> growth and shrinkage better than a HashMap. There are situations where
> each one would be a better choice and so perhaps this is a case for
> having a CountingBloomFilter with a choice for the backing Map. This
> could be done using an abstract class with implementations. This is one
> to leave until some benchmarks are in place in the main code.
>
>
> On 18/02/2020 09:54, Claude Warren wrote:
> > On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert 
> > wrote:
> >
> >
> >>> My maths is rusty.  If A=0xF000ABCD is interpreted as an unsigned int and
> >>> B=0xF000ABCD is interpreted as a signed int, does (A mod N) = (B mod N) for
> >> all
> >>> positive values of N?  If so then you are correct and Signedness does
> not
> >>> matter and can be removed.  I thought that the statement was false.
> >>>
> >> Yes, you are correct. Addition and multiplication use the bits in the
> same
> >> way. Modulus does not.
> >>
> >> So are the Murmur3 hash functions actually unsigned? I thought the
> >> original c code returned unsigned 64-bit values (uint64_t).
> >>
> >> IIUC the purpose of the Signedness enum is to specify that a negative
> >> value returned from the hash function should be used as a numerical
> >> magnitude if then used in downstream computations since sign extension
> on
> >> negatives will set leading bits to 1s. I.e. using the negative as if it
> >> were unsigned (and all bits are random) is not valid. Perhaps you can
> give
> >> an example of how signed or unsigned would be handled differently.
> >>
> >>
> > The murmur hash is not a good example here as it returns the resulting
> > buffer as 2 signed longs.  The MD5 is probably a better example as the
> > messageDigest.digest() returns a byte[16] (128 bits).  The code then interprets
> that
> > as 2 signed longs.  A C implementation could interpret that as 2 unsigned
> > longs.  Additionally, there are hashes that return smaller buffers such
> as
> > with murmur2 which returns a 32bit unsigned int in the C code.  In java
> > that could be represented as a long as per the Integer.toUnsignedLong()
> > method or it could just be cast to a long and thus signed.
> >
> > The values returned are then passed to floorMod( value, numberOfBits ) to
> > turn on the appropriate bit in the resulting bit vector.  As noted above
> > this will yield different values depending upon how the byte[] was
> > represented.
> >
> > Claude
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] Bloom filters

2020-02-18 Thread Claude Warren
On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert 
wrote:


> > My maths is rusty.  If A=0xF000ABCD is interpreted as an unsigned int and
> > B=0xF000ABCD is interpreted as a signed int, does (A mod N) = (B mod N) for
> all
> > positive values of N?  If so then you are correct and Signedness does not
> > matter and can be removed.  I thought that the statement was false.
> >
>
> Yes, you are correct. Addition and multiplication use the bits in the same
> way. Modulus does not.
>
> So are the Murmur3 hash functions actually unsigned? I thought the
> original c code returned unsigned 64-bit values (uint64_t).
>
> IIUC the purpose of the Signedness enum is to specify that a negative
> value returned from the hash function should be used as a numerical
> magnitude if then used in downstream computations since sign extension on
> negatives will set leading bits to 1s. I.e. using the negative as if it
> were unsigned (and all bits are random) is not valid. Perhaps you can give
> an example of how signed or unsigned would be handled differently.
>
>
The murmur hash is not a good example here as it returns the resulting
buffer as 2 signed longs.  The MD5 is probably a better example as the
messageDigest.digest() returns a byte[16] (128 bits).  The code then interprets that
as 2 signed longs.  A C implementation could interpret that as 2 unsigned
longs.  Additionally, there are hashes that return smaller buffers such as
with murmur2 which returns a 32bit unsigned int in the C code.  In java
that could be represented as a long as per the Integer.toUnsignedLong()
method or it could just be cast to a long and thus signed.

The values returned are then passed to floorMod( value, numberOfBits ) to
turn on the appropriate bit in the resulting bit vector.  As noted above
this will yield different values depending upon how the byte[] was
represented.
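
A small self-contained sketch of that difference (the 1000-bit size is
illustrative):

    // Sketch: the same 32 bits interpreted as signed vs unsigned give
    // different results under Math.floorMod, which is why Signedness
    // matters when selecting a bit index.
    public class SignednessDemo {
        public static void main(String[] args) {
            final int raw = 0xF000ABCD;          // raw 32 bits from a hash
            final long asSigned = raw;           // sign-extended: -268391475
            final long asUnsigned = Integer.toUnsignedLong(raw); // 4026575821
            final long numberOfBits = 1000;
            System.out.println(Math.floorMod(asSigned, numberOfBits));   // 525
            System.out.println(Math.floorMod(asUnsigned, numberOfBits)); // 821
        }
    }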

Claude


Re: [collections] Bloom filters

2020-02-18 Thread Claude Warren
On Mon, Feb 17, 2020 at 9:52 PM Alex Herbert 
wrote:

>
>
> > On 17 Feb 2020, at 20:30, Claude Warren  wrote:
> >
> > Alex,
> >
> > Thank you for your comments.
> >
> > See comments inline.
> >
> >
> >
> > On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert wrote:
> >
> >> I had a look through all the BloomFilter code. Thanks Claude for the
> >> contribution.
> >>
> >> Some items that need clarifying:
> >>
> >>
> >> 1. HashFunctionIdentity.Signedness
> >>
> >> This is not fully documented as to what the sign applies to. There is no
> >> javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
> >> states "Identifies the signedness of the calculations for this
> function".
> >>
> >> I assume this applies to the Hash computation in 'long
> >> HashFunction.apply(byte[], int)'
> >>
> >> Does this mean the hash algorithm has a variant that can treat the
> >> bytes/seed as signed or unsigned. If so which one because 2 enum values
> >> cannot cover all 4 possibilities. Since there is no javadoc it is
> >> unclear exactly what this property is supposed to indicate.
> >>
> >> I will attempt to clarify in the javadoc.  Basically when the underlying
> > hash function executes it returns an array of bytes.  The algorithm can
> > treat those bytes as signed or unsigned long values.  For the java
> > implementations this is probably signed but other languages could
> implement
> > it as unsigned.   For example the murmur128 hash returns 128 bits; this
> > could be interpreted as 2 64-bit signed or unsigned longs.  The
> Signedness
> > attribute describes how that is interpreted.  As for the other values the
> > seed is a signed value and the byte buffer is a byte buffer interpreted
> as
> > per the underlying hash function.
>
> OK. So we add comments to the javadoc to state that the signedness is the
> interpretation of the return value of
>
> long HashFunction.apply(byte[], int)
>
> So there can be algorithms that return a signed long and those that should
> return an unsigned long but this is not a supported type in Java.
>
> Either way it doesn’t really matter, it seems more for correctness than
> for actual practical use given that unsigned and signed arithmetic on
> twos-complement integers is the same.
>

My maths is rusty.  If A=0xF000ABCD is interpreted as an unsigned int and
B=0xF000ABCD is interpreted as a signed int, does (A mod N) = (B mod N) for all
positive values of N?  If so then you are correct and Signedness does not
matter and can be removed.  I thought that the statement was false.


> >
> >
> >> 2. HashFunctionIdentity comparators
> >>
> >> The interface HashFunctionIdentity defines constants which are
> >> comparators. Do these need to be here? Typically comparators should be
> >> Serializable. Are these comparators ever going to be bundled with
> >> BloomFilters and serialized? There are lots of ways to compare
> >> HashFunctionIdentities. Comparators are for ordering. So why do you need
> >> to order a HashFunction and why choose the two implementations provided?
> >>
> >
> > There was a discussion on the list about building factories where the
> > provider of the HashFunctionIdentity would be important (Think in terms
> of
> > the JCA where multiple providers can provide the same hash/encryption
> > functions but an application may wish to choose one over the other).
>
> OK. But given that this is not being done we could drop it from the API
> for now to make it cleaner.
>
OK

>
> >
> >>
> >> From what I can see the COMMON comparator is used in the main code in
> >> Shape, DynamicHasher and StaticHasher with:
> >>
> >> if (compare(a, b) != 0) {
> >> // not allowed ...
> >> }
> >>
> >> The DEEP comparator is not used in the main code.
> >
> >
> >> The use case for checking that the hash function is the same seems to be
> >> better satisfied by using a package private HashFunctionIdentityEquality
> >> helper method:
> >>
> >> static boolean areEqual(HashFunctionIdentity a, HashFunctionIdentity b) {
> >>     return (a.getSignedness() == b.getSignedness() &&
> >>             a.getProcessType() == b.getProcessType() &&
> >>             a.getName().equalsIgnoreCase(b.getName()));
> >> }
> >>
>

Re: [collections] Bloom filters

2020-02-17 Thread Claude Warren
Alex,

Thank you for your comments.

See comments inline.



On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert 
wrote:

> I had a look through all the BloomFilter code. Thanks Claude for the
> contribution.
>
> Some items that need clarifying:
>
>
> 1. HashFunctionIdentity.Signedness
>
> This is not fully documented as to what the sign applies to. There is no
> javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
> states "Identifies the signedness of the calculations for this function".
>
> I assume this applies to the Hash computation in 'long
> HashFunction.apply(byte[], int)'
>
> Does this mean the hash algorithm has a variant that can treat the
> bytes/seed as signed or unsigned. If so which one because 2 enum values
> cannot cover all 4 possibilities. Since there is no javadoc it is
> unclear exactly what this property is supposed to indicate.
>
> I will attempt to clarify in the javadoc.  Basically when the underlying
hash function executes it returns an array of bytes.  The algorithm can
treat those bytes as signed or unsigned long values.  For the java
implementations this is probably signed but other languages could implement
it as unsigned.   For example the murmur128 hash returns 128 bits; this
could be interpreted as 2 64-bit signed or unsigned longs.  The Signedness
attribute describes how that is interpreted.  As for the other values the
seed is a signed value and the byte buffer is a byte buffer interpreted as
per the underlying hash function.


> 2. HashFunctionIdentity comparators
>
> The interface HashFunctionIdentity defines constants which are
> comparators. Do these need to be here? Typically comparators should be
> Serializable. Are these comparators ever going to be bundled with
> BloomFilters and serialized? There are lots of ways to compare
> HashFunctionIdentities. Comparators are for ordering. So why do you need
> to order a HashFunction and why choose the two implementations provided?
>

There was a discussion on the list about building factories where the
provider of the HashFunctionIdentity would be important (Think in terms of
the JCA where multiple providers can provide the same hash/encryption
functions but an application may wish to choose one over the other).

>
>  From what I can see the COMMON comparator is used in the main code in
> Shape, DynamicHasher and StaticHasher with:
>
> if (compare(a, b) != 0) {
>  // not allowed ...
> }
>
> The DEEP comparator is not used in the main code.


> The use case for checking that the hash function is the same seems to be
> better satisfied by using a package private HashFunctionIdentityEquality
> helper method:
>
>  static boolean areEqual(HashFunctionIdentity a, HashFunctionIdentity b) {
>      return (a.getSignedness() == b.getSignedness() &&
>              a.getProcessType() == b.getProcessType() &&
>              a.getName().equalsIgnoreCase(b.getName()));
>  }
>
> This would drop the comparators that no-one actually requires and
> clean-up the public API.
>
> I can see the value of this solution.

I note that not everything uses the comparator.
> AbstractBloomFilter.verifyHasher uses:
>
> if (a.getSignature() != b.getSignature()) {
>  // not allowed ...
> }
>
> However the signature (a long value) could be the same by chance with
> p=2^-64. So why is this class allowed to use getSignature() but others do
> a full hash identity comparison? It seems to me you either do one or the
> other: a full property equality test; or a quick signature check (which
> is error prone); but not mix and match in different parts of the code.
>
> The reasoning for the second part is that bloom filter comparisons must be
very fast.  The possibility of signature collision is low but not zero.
However, trusting the name has issues as well.  There are a number of
murmur128 hash implementations that are not correct.  Using the signature
will detect those because the signature value is built by the hashing
function executing on the name.  Keep in mind that the actual function that
performed the hash may not be present when the signature is verified.

>
> 3. AbstractBloomFilter
>
> I considered changing the cardinality methods to avoid using BitSet
> cardinality() and computing the cardinality direct. BitSet will
> duplicate the input long[] in the constructor. For the or/xor/and
> cardinality this is 2 array creations that can be avoided with a direct
> implementation. For the cardinality method it avoids 1 array creation
> via BitSet. Here is the type of implementation for a direct computation:
>
>  @Override
>  public int cardinality() {
>      int count = 0;
>      for (final long bits : getBits()) {
>          count += Long.bitCount(bits);
>      }
>      return count;
>  }
>
>  @Override
>  public int andCardinality(final BloomFilter other) {
>      verifyShape(other);
>      final long[] mine = getBits();
>      final long[] theirs = other.getBits();
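
The archived message truncates here; a sketch of how such a direct
computation might finish, assuming the two arrays can differ in length (an
assumption, not necessarily the original code):

    // Sketch completion of the direct andCardinality idea above.
    int count = 0;
    final int limit = Math.min(mine.length, theirs.length);
    for (int i = 0; i < limit; i++) {
        count += Long.bitCount(mine[i] & theirs[i]);
    }
    return count;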

Re: [collections] example code.

2020-01-26 Thread Claude Warren
The doxia snippet extension allows you to reference code in the source tree
and include it in the generated HTML files.  This means that the example
code in the documentation/site can come directly from code that compiles
and can be run in the examples.  Best of both worlds I think.

Claude

On Sun, Jan 26, 2020 at 8:48 PM Gary Gregory  wrote:

> No code in the site please. It must be compiled with the build and
> available as a plain source file for refactoring from IDEs and tools.
>
> Gary
>
> On Sun, Jan 26, 2020, 15:19 Bruno P. Kinoshita  wrote:
>
> >  That's exactly what we have in commons imaging. +1 to either an example
> > directory in src/test and maybe a link in the website, or the code direct
> > in the site.
> > Cheers
> > Bruno
> >
> > On Monday, 27 January 2020, 6:07:21 am NZDT, Gary Gregory <
> > garydgreg...@gmail.com> wrote:
> >
> >  I think the simplest is to create an examples package under src/test
> which
> also lets you put example data under src/resources.
> >
> > This way, that code would get processed just like any other test code
> > included the source xref report.
> >
> > Gary
> >
> > On Sun, Jan 26, 2020, 11:52 Claude Warren  wrote:
> >
> > > I see that there is no example code directory in the collections
> project.
> > > I was thinking of contributing an example of how to construct a Bloom
> > > filter that operates like the Hadoop Bloom filter but this seems like
> > > something that we may not want to include in the library.
> > >
> > > In the Jena project we have example code that we package for users if
> > they
> > > wish and we reference it in the documentation as well.
> > >
> > > Any thoughts on this?
> > >
> > > Claude
> > >
> > > --
> > > I like: Like Like - The likeliest place on the web
> > > <http://like-like.xenei.com>
> > > LinkedIn: http://www.linkedin.com/in/claudewarren
> > >
> >
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


FOSDEM 2020

2020-01-26 Thread Claude Warren
Is anyone on this list (besides me) planning on attending FOSDEM 2020[1]?
If so would you be interested in hosting the Apache table?  By "hosting" I
mean we stand there talk to people and promote Apache commons.

Claude

[1] https://cwiki.apache.org/confluence/display/COMDEV/FOSDEM+2020

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


[collections] example code.

2020-01-26 Thread Claude Warren
I see that there is no example code directory in the collections project.
I was thinking of contributing an example of how to construct a Bloom
filter that operates like the Hadoop Bloom filter but this seems like
something that we may not want to include in the library.

In the Jena project we have example code that we package for users if they
wish and we reference it in the documentation as well.

Any thoughts on this?

Claude

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [Collections] Bloom filters are in

2020-01-25 Thread Claude Warren
Now to get some documentation done!

On Sat, Jan 25, 2020 at 1:58 PM Gary Gregory  wrote:

> ... git master. Thank you Claude!
>
> Gary
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


[collections] Bloom filter signature calculation

2020-01-23 Thread Claude Warren
The HashFunctionIdentity.getSignature() method is intended to be used in
a quick comparison of HashFunctionIdentities.  As such it is supposed to
encompass the name, signedness and process as well as some indication that
the function implementation is the same as any other implementation of the
same function.  To do this it calls the hashing function (apply()) with a
seed of 0 (zero).

There was recent work on the commons-codec Murmur3 implementations that
dealt with sign extension errors in the seed.  In light of this, I wonder
if the seed for the signature shouldn't be negative rather than zero as
this may have a higher probability of exposing implementation issues.  But
then I wonder if I am tainted by the Murmur3 issue.

The apply() method is supposed to be implemented as:

 apply( String.format( "%s-%s-%s", getName().toUpperCase( Locale.ROOT ),
getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 );
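
For illustration, here is a minimal sketch of that calculation.  The
HashFunction stand-in below mirrors the description in this thread, not the
exact library API:

import java.nio.charset.StandardCharsets;
import java.util.Locale;

public final class SignatureSketch {

    /** Illustrative stand-in for the real interface. */
    interface HashFunction {
        String getName();
        String getSignedness();
        String getProcess();
        long apply(byte[] buffer, int seed);
    }

    static long signature(final HashFunction f) {
        final byte[] key = String.format("%s-%s-%s",
                f.getName().toUpperCase(Locale.ROOT),
                f.getSignedness(), f.getProcess())
                .getBytes(StandardCharsets.UTF_8);
        // Seed of 0 per the current spec; calling apply(key, -1) as well
        // would exercise the sign-extension path questioned above.
        return f.apply(key, 0);
    }
}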

Thoughts?

Claude

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] bloom filters comments

2020-01-13 Thread Claude Warren
Gary suggested merging the pull request, Gilles pointed out that there were
issues raised on the mailing list, and I posted what I think was an accurate
summary.

What needs to be done to resolve this so that a final decision on the pull
request can be made?

Claude

On Thu, Jan 9, 2020 at 2:57 AM Bruno P. Kinoshita
 wrote:

>  Sorry, I'd read Gary's reply as if there was no PR yet. Reviewed it a bit
> now, lots of tests! Will play with the code and read the comments and
> finish the review by the end of the week.
>
> Thanks Claude
>
> On Thursday, 9 January 2020, 11:20:40 am NZDT, Claude Warren <
> cla...@xenei.com> wrote:
>
>  There is a pull request open:
> https://github.com/apache/commons-collections/pull/83
>
> On Wed, Jan 8, 2020 at 9:32 PM Bruno P. Kinoshita
>  wrote:
>
> >  Thanks for the summary Claude. I read some of the messages, but got lost
> > in the middle of the discussion, and tend to understand problems better
> if
> > there's some code to read along. And I am used to GitHub/GitLab diff
> > interface.
> > So I agree with Gary that this could be a good time for a PR (maybe a
> > draft one).
> > Bruno
> >
> >
> >On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <
> > cla...@xenei.com> wrote:
> >
> >  I believe the issue (I think history is at
> >
> >
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17003600
> > )
> > is about the identification of hash implementations.
> >
> > Currently there are a couple of classes involved:
> >
> > Hasher interface, has a method that returns a HashFunctionIdentity and a
> > method that returns an iterator of enabled bits.  There are a couple of
> > implementations of Hasher:  the DynamicHasher contains buffers that are
> > passed to the hash function several times, the StaticHasher contains a
> list
> > of bits enabled by a hasher for a specific Shape.
> >
> > HashFunction interface: extends HashFunctionIdentity and adds a method
> that
> > calls the actual hash function.
> >
> > HashFunctionIdentity: contains the name of the hash function, the name of
> > the provider, the processType (cyclic or iterative), Signedness and a
> > signature.
> >
> > There are places in the code where the actual function is not required
> and
> > in some use cases would make the implementation difficult or fragile.
> > These code places are where the Bloom filter has been built and the
> system
> > is verifying that two filters used the same hash function.  In these
> cases
> > the comparison is the hashName, processType and Signedness.  In cases
> where
> > the bloom filters are stored in a database retrieval would mean some sort
> > of serialization/deserialization of the hash function or ensure that the
> > hash function is otherwise available.  This is problematic.
> >
> > The provider was added in a nod to a future factory that would follow the
> > JCA pattern and allow implementations of multiple providers.
> >
> > The signature was added to support a requested quick check.  The
> signature
> > is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
> > getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).
> >
> > There were suggestions to create an enum of HashFunctions controlled  by
> > the Collections.  I think that this adds a layer of coordination and
> > management on the Collections team that as a team we may not want to take
> > on.  In addition, it makes it almost impossible for 3rd party users to
> > create new hash functions and test them with the library.
> >
> > I believe the current implementation provides the minimal information
> > necessary to determine if two functions are supposed to produce the same
> > result.  In my mind the signature and provider methods are extra and not
> > necessary but desirable.
> >
> > I think this is a summary of the open discussion.
> >
> >
> > On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski 
> > wrote:
> >
> > > Le mer. 8 janv. 2020 à 15:15, Gary Gregory  a
> > > écrit :
> > > >
> > > > I think it is time to bring this PR in and make any adjustments
> within
> > > > master beyond that. This will be quicker and simpler than going round
> > and
> > > > round for simple things like Javadoc tweaks and small non-functional
> > > > changes (formatting, variable names, and so on.) I'll proceed with
> that
> > > tonight.

Re: [collections] bloom filters comments

2020-01-08 Thread Claude Warren
There is a pull request open:
https://github.com/apache/commons-collections/pull/83

On Wed, Jan 8, 2020 at 9:32 PM Bruno P. Kinoshita
 wrote:

>  Thanks for the summary Claude. I read some of the messages, but got lost
> in the middle of the discussion, and tend to understand problems better if
> there's some code to read along. And I am used to GitHub/GitLab diff
> interface.
> So I agree with Gary that this could be a good time for a PR (maybe a
> draft one).
> Bruno
>
>
> On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <
> cla...@xenei.com> wrote:
>
>  I believe the issue (I think history is at
>
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17003600
> )
> is about the identification of hash implementations.
>
> Currently there are a couple of classes involved:
>
> Hasher interface, has a method that returns a HashFunctionIdentity and a
> method that returns an iterator of enabled bits.  There are a couple of
> implementations of Hasher:  the DynamicHasher contains buffers that are
> passed to the hash function several times, the StaticHasher contains a list
> of bits enabled by a hasher for a specific Shape.
>
> HashFunction interface: extends HashFunctionIdentity and adds a method that
> calls the actual hash function.
>
> HashFunctionIdentity: contains the name of the hash function, the name of
> the provider, the processType (cyclic or iterative), Signedness and a
> signature.
>
> There are places in the code where the actual function is not required and
> in some use cases would make the implementation difficult or fragile.
> These code places are where the Bloom filter has been built and the system
> is verifying that two filters used the same hash function.  In these cases
> the comparison is the hashName, processType and Signedness.  In cases where
> the bloom filters are stored in a database retrieval would mean some sort
> of serialization/deserialization of the hash function or ensure that the
> hash function is otherwise available.  This is problematic.
>
> The provider was added in a nod to a future factory that would follow the
> JCA pattern and allow implementations of multiple providers.
>
> The signature was added to support a requested quick check.  The signature
> is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
> getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).
>
> There were suggestions to create an enum of HashFunctions controlled  by
> the Collections.  I think that this adds a layer of coordination and
> management on the Collections team that as a team we may not want to take
> on.  In addition, it makes it almost impossible for 3rd party users to
> create new hash functions and test them with the library.
>
> I believe the current implementation provides the minimal information
> necessary to determine if two functions are supposed to produce the same
> result.  In my mind the signature and provider methods are extra and not
> necessary but desirable.
>
> I think this is a summary of the open discussion.
>
>
> On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski 
> wrote:
>
> > Le mer. 8 janv. 2020 à 15:15, Gary Gregory  a
> > écrit :
> > >
> > > I think it is time to bring this PR in and make any adjustments within
> > > master beyond that. This will be quicker and simpler than going round
> and
> > > round for simple things like Javadoc tweaks and small non-functional
> > > changes (formatting, variable names, and so on.) I'll proceed with that
> > > tonight.
> >
> > Design issues were raised on the ML: With no agreement and no opinions
> > other than Claude's and mine, things stayed where they were.
> >
> > Gilles
> >
> > >> [...]
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] bloom filters comments

2020-01-08 Thread Claude Warren
I believe the issue (I think history is at
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17003600)
is about the identification of hash implementations.

Currently there are a couple of classes involved:

The Hasher interface has a method that returns a HashFunctionIdentity and a
method that returns an iterator of enabled bits.  There are a couple of
implementations of Hasher:  the DynamicHasher contains buffers that are
passed to the hash function several times, while the StaticHasher contains a
list of bits enabled by a hasher for a specific Shape.

HashFunction interface: extends HashFunctionIdentity and adds a method that
calls the actual hash function.

HashFunctionIdentity: contains the name of the hash function, the name of
the provider, the processType (cyclic or iterative), Signedness and a
signature.

There are places in the code where the actual function is not required and
in some use cases would make the implementation difficult or fragile.
These code places are where the Bloom filter has been built and the system
is verifying that two filters used the same hash function.  In these cases
the comparison is the hashName, processType and Signedness.  In cases where
the Bloom filters are stored in a database, retrieval would mean some sort
of serialization/deserialization of the hash function, or ensuring that the
hash function is otherwise available.  This is problematic.
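
As a rough sketch, assuming only the metadata getters described above, the
verification could look like this (all names illustrative):

public final class HashIdentityCheck {

    /** Illustrative stand-in; the real interface also carries provider and signature. */
    interface HashFunctionIdentity {
        String getName();
        String getSignedness();
        String getProcess();
    }

    /**
     * True when two filters can be treated as built with the same hash
     * function, using only the identity metadata; no executable hash
     * function is required.
     */
    static boolean sameHashFunction(final HashFunctionIdentity a,
            final HashFunctionIdentity b) {
        return a.getName().equalsIgnoreCase(b.getName())
                && a.getSignedness().equals(b.getSignedness())
                && a.getProcess().equals(b.getProcess());
    }
}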

The provider was added in a nod to a future factory that would follow the
JCA pattern and allow implementations of multiple providers.

The signature was added to support a requested quick check.  The signature
is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).

There were suggestions to create an enum of HashFunctions controlled by
the Collections.  I think that this adds a layer of coordination and
management on the Collections team that as a team we may not want to take
on.  In addition, it makes it almost impossible for 3rd party users to
create new hash functions and test them with the library.

I believe the current implementation provides the minimal information
necessary to determine if two functions are supposed to produce the same
result.  In my mind the signature and provider methods are extra and not
necessary but desirable.

I think this is a summary of the open discussion.


On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski  wrote:

> Le mer. 8 janv. 2020 à 15:15, Gary Gregory  a
> écrit :
> >
> > I think it is time to bring this PR in and make any adjustments within
> > master beyond that. This will be quicker and simpler than going round and
> > round for simple things like Javadoc tweaks and small non-functional
> > changes (formatting, variable names, and so on.) I'll proceed with that
> > tonight.
>
> Design issues were raised on the ML: With no agreement and no opinions
> other than Claude's and mine, things stayed where they were.
>
> Gilles
>
> >> [...]
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [VOTE] Release Apache Commons Codec 1.14 based on RC1

2020-01-03 Thread Claude Warren
I am not a PMC member, but I'll report anyway.

+0 (non-binding)

*FindBugs* issues show MurmurHash3 case fall-through issues.  I believe
these are expected and can be fixed with an annotation.
Suggest releasing and fixing in the next update.

*CPD* issues show private static long getLittleEndianLong(final byte[]
data, final int index) duplicated in both Murmur2 and Murmur3.  This might
actually make a reasonable method in an "interpret buffer as number" class.
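
For reference, the shared helper is roughly the following (reconstructed for
illustration rather than copied from either class):

// Masks each byte before shifting so negative bytes do not sign-extend
// into the upper bits of the long.
private static long getLittleEndianLong(final byte[] data, final int index) {
    return ((long) data[index] & 0xff)
            | (((long) data[index + 1] & 0xff) << 8)
            | (((long) data[index + 2] & 0xff) << 16)
            | (((long) data[index + 3] & 0xff) << 24)
            | (((long) data[index + 4] & 0xff) << 32)
            | (((long) data[index + 5] & 0xff) << 40)
            | (((long) data[index + 6] & 0xff) << 48)
            | (((long) data[index + 7] & 0xff) << 56);
}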




Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555;
2019-04-04T20:00:29+01:00)
Maven home: /home/claude/bin/apache-maven-3.6.1
Java version: 11.0.5, vendor: Private Build, runtime:
/usr/lib/jvm/java-11-openjdk-amd64
Default locale: en_IE, platform encoding: UTF-8
OS name: "linux", version: "5.0.0-37-generic", arch: "amd64", family: "unix"

Linux zoltan-the-great 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14
12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


Claude


On Fri, Jan 3, 2020 at 4:20 AM Bruno P. Kinoshita
 wrote:

>[X] +1 Release these artifacts
>
>
> Build from tag passed with `mvn clean install site` on
>
>
> Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe;
> 2018-06-18T06:33:14+12:00)
> Maven home: /opt/apache-maven-3.5.4
> Java version: 1.8.0_232, vendor: Private Build, runtime:
> /usr/lib/jvm/java-8-openjdk-amd64/jre
> Default locale: en_NZ, platform encoding: UTF-8
> OS name: "linux", version: "4.15.0-72-generic", arch: "amd64", family:
> "unix"
>
> Build with same command as above passed fine too with JDK 13, using the
> ZIP sources from the dist area.
>
>
> Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe;
> 2018-06-18T06:33:14+12:00)
> Maven home: /opt/apache-maven-3.5.4
> Java version: 13, vendor: Oracle Corporation, runtime:
> /home/kinow/Development/java/jdk-13
> Default locale: en_NZ, platform encoding: UTF-8
> OS name: "linux", version: "4.15.0-72-generic", arch: "amd64", family:
> "unix"
>
> Had a look at files in the dist area and found no issues. Site reports look
> good.
>
>
>
> Checked signatures and found no issues too.
>
>
> Thanks,
> Bruno
>
>
>
> On Tuesday, 31 December 2019, 2:00:12 pm NZDT, Gary Gregory <
> ggreg...@apache.org> wrote:
>
>  We have fixed quite a few bugs and added some significant enhancements
> since Apache Commons Codec 1.13 was released, so I would like to release
> Apache Commons Codec 1.14.
>
> Apache Commons Codec 1.14 RC1 is available for review here:
> https://dist.apache.org/repos/dist/dev/commons/codec/1.14-RC1 (svn
> revision 37409)
>
> The Git tag commons-codec-1.14-RC1 commit for this RC is
> af7b94750e2178b8437d9812b28e36ac87a455f2 which you can browse here:
>
>
> https://gitbox.apache.org/repos/asf?p=commons-codec.git;a=commit;h=af7b94750e2178b8437d9812b28e36ac87a455f2
> You may checkout this tag using:
> git clone https://gitbox.apache.org/repos/asf/commons-codec.git
> --branch 
> commons-codec-1.14-RC1 commons-codec-1.14-RC1
>
> Maven artifacts are here:
>
>
> https://repository.apache.org/content/repositories/orgapachecommons-1483/commons-codec/commons-codec/1.14/
>
> These are the artifacts and their hashes:
>
> #Release SHA-512s
> #Mon Dec 30 19:44:19 EST 2019
>
> commons-codec-1.14-bin.tar.gz=bb21a15d0415513050d7738914b66bfe1f4612d7b44805ba208ec6d4fd923af244c98e828986cd4aa6cdcf9c78e23de981b5a34ac1abf805691a55f5be8cc0cf
>
> commons-codec-1.14-bin.zip=1119a418683ee9b599148b90056055fa93062e839ef426004c756c8054548cc1a80479f3b77a5a60c292e05ae67ae8699d46ae4ec25bad0070e17eb0e9e69756
>
> commons-codec-1.14-javadoc.jar=53dda4afcd159debd68aa8b3fdb22c6de02903e725d82203db2c8692da86091a4435664eb72df12a69e65f461712435ece81da9f0658587b2f6df3991d5a048a
>
> commons-codec-1.14-sources.jar=a0791dacf27154fa9453e16856d5b911b568eaada52a02535bb6170bdd13e97da32051d5359aa92ad0e69948450ff6f463aedc9a04f5f9ff4d5704e31e51a6b9
>
> commons-codec-1.14-src.tar.gz=1a98dc017d3213d8a89d11524683d45e9616fc6da8b4723cd1cfae23d73fefafeccc1c0f2994d1c4c52d79e1b825e4e09e601071572e2eb229105b3b4fd4a49b
>
> commons-codec-1.14-src.zip=0a7ebe348c9b12ee360e137c5bee833df4f7cfb259435ce6bda24073e8e1d3b76a202d4fd451efe29ddebcbcc31b3f7a287c7530d23bdc672129280d5c12d90e
>
> commons-codec-1.14-test-sources.jar=02c4fba74f028749abbf4bc44b328ed40fae3f60f53998dcaa2699284d337180b934ca93344271739f4e4df44643e87f7ce12dee35b252fa60f8f32ae60df460
>
> commons-codec-1.14-tests.jar=18dd09a0bd91040d857d84b0be39a482e884c701bcbbb5bbba990c80cf02d0270158bdbce8fa1b4473eeeb86d03ce1c4a3232d280f8b1356ee5c9f64d8a685a8
>
> I have tested this with:
>
> mvn -V -Duser.name=%my_apache_id%
> -Dcommons.release-plugin.version=%commons.release-plugin.version% -Prelease
> -Pjacoco -Pjapicmp -Ptest-deploy clean package site deploy
>
> using:
>
> Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
> Maven home: C:\Java\apache-maven-3.6.3\bin\..
> Java version: 1.8.0_231, vendor: Oracle Corporation, runtime: C:\Program
> Files\Java\jdk1.8.0_231\jre
> Default 

Re: [codec] release soon

2019-12-29 Thread Claude Warren
Changing the getBytes() call will change the behaviour relative to the
previous release.  That would be a major release change, would it not?
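
A small illustration of the platform dependence (a hypothetical example, not
library code):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    public static void main(final String[] args) {
        final String s = "Murmur3";
        final byte[] platform = s.getBytes();                   // uses file.encoding
        final byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // identical everywhere
        // For pure-ASCII input the two arrays (and so the hashes) agree;
        // for non-ASCII input they can differ, which is why fixing the
        // charset would change results that existing users may rely on.
        System.out.println(Arrays.equals(platform, utf8));
    }
}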

On Sun, Dec 29, 2019 at 4:20 PM Gary Gregory  wrote:

> On Sun, Dec 29, 2019 at 10:58 AM Gary Gregory 
> wrote:
>
> > The two murmur hash classes call String#getBytes() instead of
> > String#getBytes(Charset|String).
> >
> > This means you can get different results depending on where in the world
> > you run the code or by changing the "file.encoding" system property. I
> > can't imagine that's the intention here.
> >
> > Why not use UTF-8?
> >
>
> Or ISO_8859_1...
>
> Gary
>
> >
> > Gary
> >
> >
> > On Sun, Dec 29, 2019 at 8:28 AM Gary Gregory 
> > wrote:
> >
> >> On Sun, Dec 29, 2019 at 4:20 AM Alex Herbert 
> >> wrote:
> >>
> >>> On Sun, 29 Dec 2019, 01:14 Gary Gregory, 
> wrote:
> >>>
> >>> > It looks like public methods have been removed
> >>> > from org.apache.commons.codec.digest.MurmurHash3$IncrementalHash32,
> >>> These
> >>> > need to go back in to maintain binary compatibility. Then we can
> have a
> >>> > release candidate.
> >>> >
> >>>
> >>> To fix the broken hash implementation a new parent class with the
> correct
> >>> implementation was introduced and some methods bumped up to that. So
> the
> >>> methods still exist but they are in the parent class. When I ran clirr
> >>> during the development it did not complain. Is there a report stating
> >>> that
> >>> binary compatibility is broken? Maybe JApiCmp has a different opinion.
> >>>
> >>> Doing it this way allows the broken class to be deprecated. The other
> way
> >>> to have the new correct class as the child means that the broken class
> >>> cannot be deprecated as it has a child. Making the broken method
> >>> deprecated
> >>> would then have a child overriding a deprecated method.
> >>>
> >>
> >> You're correct, I misread the JApiCmp report. Sorry about that.
> >>
> >> Gary
> >>
> >>
> >>>
> >>> Alex
> >>>
> >>>
> >>> > Gary
> >>> >
> >>> > On Fri, Dec 27, 2019 at 7:02 PM Gary Gregory  >
> >>> > wrote:
> >>> >
> >>> > > On Fri, Dec 27, 2019 at 3:18 PM Alex Herbert <
> >>> alex.d.herb...@gmail.com>
> >>> > > wrote:
> >>> > >
> >>> > >>
> >>> > >>
> >>> > >> > On 27 Dec 2019, at 16:35, Gary Gregory 
> >>> > wrote:
> >>> > >> >
> >>> > >> > Great, TY. Feel free to add more tests if need be. Edge cases
> and
> >>> so
> >>> > on.
> >>> > >> >
> >>> > >> > Gary
> >>> > >>
> >>> > >> If you look at the Jacoco report for MurmurHash3 the only line
> >>> missing
> >>> > >> execution is the throwing of an AssertionError in a default block
> >>> of a
> >>> > >> switch statement for a line that should not be possible to reach
> >>> (line
> >>> > >> 1057).
> >>> > >>
> >>> > >> So it is missing coverage of unreachable code.
> >>> > >>
> >>> > >> This is part of the original code that I did not update. I can
> >>> rewrite
> >>> > it
> >>> > >> to drop the unreachable code but as it stands it is self
> >>> documenting.
> >>> > >>
> >>> > >> My preference would be to drop the unreachable code. It is not
> there
> >>> > >> because it needs to be, for example a catch block to handle a
> >>> declared
> >>> > >> exception that you never expect. It seems to be to add a default
> >>> block
> >>> > for
> >>> > >> the switch statement.
> >>> > >>
> >>> > >
> >>> > > I'm OK to drop the code, or replace the AssertionError with an
> >>> > > IllegalStateException? If any kind of code remains, the exception
> >>> message
> >>> > > and/or comment should state "this should never happen".

Re: [collections] bloom filters comments

2019-12-29 Thread Claude Warren
It is currently a nested class.  There was a suggestion to move it outside of
the BloomFilter interface.

On Sun, Dec 29, 2019 at 3:47 PM Gilles Sadowski 
wrote:

> Le dim. 29 déc. 2019 à 15:30, Claude Warren  a écrit :
> >
> > If the Shape class (BloomFilter.Shape) is extracted from the BloomFilter
> > interface and made a stand-alone class would the name change or is the
> fact
> > that it is in the o.a.c.c.bloomfilter package enough?
> >
> > I prefer the name Shape to BloomFilterShape.
>
> If "Shape" is only used for "BloomFilter" (as the suggestion above
> seems to indicate, why not declare it as a sub-interface:
> ---CUT---
> public interface BloomFilter {
> // ...
>
> public interface Shape {
> // ...
> }
> }
> ---CUT---
> ?
>
> Regards,
> Gilles
>
> >
> > Claude
> >
> > On Sat, Dec 28, 2019 at 9:06 AM Claude Warren  wrote:
> >
> > > Once the interface is extracted and reduced to the minimum necessary
> the
> > > following methods are removed:
> > >
> > > orCardinality() -- we have andCardinality() and xorCardinality() this
> was
> > > included for completeness.
> > >
> > > isFull() -- true if all the bits in the vector are on. A convenience
> > > method used to short circuit some logic.
> > >
> > > I think these 2 methods should go into the BloomFilter interface.
> > >
> > >
> > > Set operations:
> > >
> > > jaccardSimilarity -- a measure of similarity in sets with range [0,1]
> > >
> > > jaccardDistance -- defined as 1 - jaccardSimilarity
> > >
> > > cosineSimilarity -- a measure of similarity with range  [0,1]
> > >
> > > cosineDistance -- defined as 1 - cosineSimilarity
> > >
> > > estimateSize -- Set operation estimate of the number of items that were
> > > placed in the filter. (Requires Shape)
> > >
> > > estimateUnionSize -- Set operation estimate of the number of items that
> > > would be represented by the merging of two filters. (Requires Shape)
> > >
> > > estimateIntersectionSize -- Set operations estimate of the number of
> items
> > > in the intersection. (Requires Shape)
> > >
> > > Perhaps it makes sense to move the Set operations into their own class
> > > with static methods?  The set operations all operate on 2 Bloom
> filters.
> > > Moving them would clarify the AbstractBloomFilter class.
> > >
> > > Claude
> > >
> > >
> > > On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory 
> > > wrote:
> > >
> > >> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren 
> wrote:
> > >>
> > >> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <
> garydgreg...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi Claude and all:
> > >> > >
> > >> > > Here are a couple of comments on the bloom filter PR.
> > >> > >
> > >> > > 10,100 ft level comment we do not have to worry about today:
> Before
> > >> the
> > >> > > release, we might want to split Commons Collection into a
> multi-module
> > >> > > project and have the BF code in its own module. The _main_ reason
> for
> > >> this
> > >> > > is that it will allow the dependency on Commons Codec to be a
> > >> non-optional
> > >> > > dependency. Optional dependencies are a bit of a pain IMO,
> especially
> > >> > when
> > >> > > you deal with large stacks. You end up sucking in everything that
> is
> > >> > > optional when you deliver an app because you do not know for
> certain
> > >> > what's
> > >> > > really required at runtime.
> > >> > >
> > >> > > Closer to the ground: I would make BloomFilter an interface and
> rename
> > >> > > the current BloomFilter class AbstractBloomFilter implements
> > >> BloomFilter.
> > >> > > This will allow the user and maintainer to see what is truly
> public.
> > >> > >
> > >> >
> > >> > I have done this (not checked in) but the net effect is that all the
> > >> public
> > >> > methods in the AbstractBloomFilter class are reflected in the
> > >> BloomFilter
> > >> > interface and the Shape class has moved to the Interface as well.
> >

Re: [collections] bloom filters comments

2019-12-29 Thread Claude Warren
If the Shape class (BloomFilter.Shape) is extracted from the BloomFilter
interface and made a stand-alone class would the name change or is the fact
that it is in the o.a.c.c.bloomfilter package enough?

I prefer the name Shape to BloomFilterShape.

Claude

On Sat, Dec 28, 2019 at 9:06 AM Claude Warren  wrote:

> Once the interface is extracted and reduced to the minimum necessary the
> following methods are removed:
>
> orCardinality() -- we have andCardinality() and xorCardinality() this was
> included for completeness.
>
> isFull() -- true if all the bits in the vector are on. A convenience
> method used to short circuit some logic.
>
> I think these 2 methods should go into the BloomFilter interface.
>
>
> Set operations:
>
> jaccardSimilarity -- a measure of similarity in sets with range [0,1]
>
> jaccardDistance -- defined as 1 - jaccardSimilarity
>
> cosineSimilarity -- a measure of similarity with range  [0,1]
>
> cosineDistance -- defined as 1 - cosineSimilarity
>
> estimateSize -- Set operation estimate of the number of items that were
> placed in the filter. (Requires Shape)
>
> estimateUnionSize -- Set operation estimate of the number of items that
> would be represented by the merging of two filters. (Requires Shape)
>
> estimateIntersectionSize -- Set operations estimate of the number of items
> in the intersection. (Requires Shape)
>
> Perhaps it makes sense to move the Set operations into their own class
> with static methods?  The set operations all operate on 2 Bloom filters.
> Moving them would clarify the AbstractBloomFilter class.
>
> Claude
>
>
> On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory 
> wrote:
>
>> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren  wrote:
>>
>> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory 
>> > wrote:
>> >
>> > > Hi Claude and all:
>> > >
>> > > Here are a couple of comments on the bloom filter PR.
>> > >
>> > > 10,100 ft level comment we do not have to worry about today: Before
>> the
>> > > release, we might want to split Commons Collection into a multi-module
>> > > project and have the BF code in it own module. The _main_ reason for
>> this
>> > > is that it will allow the dependency on Commons Codecis be a
>> non-optional
>> > > dependency. Optional dependency are a a bit of a pain IMO, especially
>> > when
>> > > you deal with large stacks. You end up sucking in everything that is
>> > > optional when you deliver an app because you do not know for certain
>> > what's
>> > > really required at runtime.
>> > >
>> > > Closer to the ground: I would make BloomFilter an interface and rename
>> > > the current BloomFilter class AbstractBloomFilter implements
>> BloomFilter.
>> > > This will allow the user and maintainer to see what is truly public.
>> > >
>> >
>> > I have done this (not checked in) but the net effect is that all the
>> public
>> > methods in the AbstractBloomFilter class are reflected in the
>> BloomFilter
>> > interface and the Shape class has moved to the Interface as well.
>> >
>>
>> OK, I think I like BloomFilter as an interface especially since it is used
>> as an argument in methods. I'll leave it to you as to whether Shape needs
>> to be folded in. I did notice that Shape is an argument in a few places.
>> Might we lose some focus and abstraction by folding Shape into the
>> BloomFilter interface?
>>
>>
>> > I can remove several methods from BloomFilter that are not strictly
>> > necessary for this code to function.  I am and have been reticent to do
>> so
>> > since there are a number of home-grown libraries used by various
>> > researchers that provide these functions.  But if that is what it takes
>> to
>> > get this out the door it will be done.
>> >
>>
>> Once you have BloomFilter as an interface with an implementing class, it
>> might make it much clearer as to what really belongs in the interface. The
>> handy methods can stay in the abstract class of course.
>>
>>
>> >
>> > > Since this is a first cut for Commons Collection, I would consider
>> only
>> > > making APIs public that must be public. Once released, we MUST
>> maintain
>> > > binary compatibility within minor releases. Major releases allow us to
>> > > break this compatibility but these do not happen very often. I do not
>> > know
> > > what this means yet for BF but it's a thought to consider.

Re: [commons-codec] branch master updated: Change AssertionError to IllegalStateException

2019-12-28 Thread Claude Warren
Wherever the note is found the javadoc should include

 * This implementation contains a sign-extension bug in the seed
initialization.
 * This manifests if the seed is negative.
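
A minimal demonstration of the effect (illustrative only; the mask shown is
the conventional fix):

public class SeedSignExtension {
    public static void main(final String[] args) {
        final int seed = -1;
        final long broken = seed;               // sign-extends to 0xffffffffffffffff
        final long masked = seed & 0xffffffffL; // stays 0x00000000ffffffff
        System.out.printf("%016x vs %016x%n", broken, masked);
    }
}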

On Sat, Dec 28, 2019 at 1:45 AM Gary Gregory  wrote:

> On Fri, Dec 27, 2019 at 8:17 PM  wrote:
>
> > This is an automated email from the ASF dual-hosted git repository.
> >
> > aherbert pushed a commit to branch master
> > in repository https://gitbox.apache.org/repos/asf/commons-codec.git
> >
> >
> > The following commit(s) were added to refs/heads/master by this push:
> >  new 1cf4d19  Change AssertionError to IllegalStateException
> > 1cf4d19 is described below
> >
> > commit 1cf4d19069c64d0493f8b92178ffdb728c0c0ac2
> > Author: Alex Herbert 
> > AuthorDate: Sat Dec 28 01:17:17 2019 +
> >
> > Change AssertionError to IllegalStateException
> > ---
> >  src/main/java/org/apache/commons/codec/digest/MurmurHash3.java | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git
> > a/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java
> > b/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java
> > index d4a95ea..5d9aa9d 100644
> > --- a/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java
> > +++ b/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java
> > @@ -1054,7 +1054,7 @@ public final class MurmurHash3 {
> >  k = orBytes(unprocessed[0], unprocessed[1],
> > unprocessed[2], data[offset]);
> >  break;
> >  default:
> > -throw new AssertionError("Unprocessed length should
> > be 1, 2, or 3: " + unprocessedLength);
> > +throw new IllegalStateException("Unprocessed length
> > should be 1, 2, or 3: " + unprocessedLength);
> >  }
> >  hash = mix32(k, hash);
> >  // Update the offset and length
> >
>
> That seems clearer, thanks.
>
> This seems like the kind of code we might want to fuzz test. It seems
> quite unlikely otherwise we'd hit this case. I also wonder if
>
> Thoughts?
>
> I see in several places:
>
> // Note: This fails to apply masking using 0xffffffffL to the seed.
>
> Shouldn't this be in the Javadoc?
>
> Gary
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] bloom filters comments

2019-12-28 Thread Claude Warren
Once the interface is extracted and reduced to the minimum necessary the
following methods are removed:

orCardinality() -- we have andCardinality() and xorCardinality() this was
included for completeness.

isFull() -- true if all the bits in the vector are on. A convenience method
used to short-circuit some logic.

I think these 2 methods should go into the BloomFilter interface.


Set operations:

jaccardSimilarity -- a measure of similarity in sets with range [0,1]

jaccardDistance -- defined as 1 - jaccardSimilarity

cosineSimilarity -- a measure of similarity with range  [0,1]

cosineDistance -- defined as 1 - cosineSimilarity

estimateSize -- Set operation estimate of the number of items that were
placed in the filter. (Requires Shape)

estimateUnionSize -- Set operation estimate of the number of items that
would be represented by the merging of two filters. (Requires Shape)

estimateIntersectionSize -- Set operations estimate of the number of items
in the intersection. (Requires Shape)

Perhaps it makes sense to move the Set operations into their own class with
static methods?  The set operations all operate on 2 Bloom filters.  Moving
them would clarify the AbstractBloomFilter class.
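
A minimal sketch of such a class, assuming the BloomFilter interface keeps
cardinality(), andCardinality() and xorCardinality() (names illustrative,
not a final API):

public final class SetOperations {

    /** Stand-in for the interface under discussion. */
    interface BloomFilter {
        int cardinality();
        int andCardinality(BloomFilter other);
        int xorCardinality(BloomFilter other);
    }

    private SetOperations() {
    }

    // |A intersect B| / |A union B| in [0,1]; the union cardinality is
    // derived as and + xor, so no orCardinality() is needed.
    public static double jaccardSimilarity(final BloomFilter a, final BloomFilter b) {
        final int and = a.andCardinality(b);
        final int union = and + a.xorCardinality(b);
        return union == 0 ? 0.0 : (double) and / union;
    }

    public static double jaccardDistance(final BloomFilter a, final BloomFilter b) {
        return 1.0 - jaccardSimilarity(a, b);
    }

    // |A intersect B| / (sqrt(|A|) * sqrt(|B|)) in [0,1].
    public static double cosineSimilarity(final BloomFilter a, final BloomFilter b) {
        final int cardA = a.cardinality();
        final int cardB = b.cardinality();
        return cardA == 0 || cardB == 0
                ? 0.0
                : a.andCardinality(b) / (Math.sqrt(cardA) * Math.sqrt(cardB));
    }
}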

Claude


On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory  wrote:

> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren  wrote:
>
> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory 
> > wrote:
> >
> > > Hi Claude and all:
> > >
> > > Here are a couple of comments on the bloom filter PR.
> > >
> > > 10,100 ft level comment we do not have to worry about today: Before the
> > > release, we might want to split Commons Collection into a multi-module
> > > project and have the BF code in its own module. The _main_ reason for
> this
> > > is that it will allow the dependency on Commons Codec to be a
> non-optional
> > > dependency. Optional dependencies are a bit of a pain IMO, especially
> > when
> > > you deal with large stacks. You end up sucking in everything that is
> > > optional when you deliver an app because you do not know for certain
> > what's
> > > really required at runtime.
> > >
> > > Closer to the ground: I would make BloomFilter an interface and rename
> > > the current BloomFilter class AbstractBloomFilter implements
> BloomFilter.
> > > This will allow the user and maintainer to see what is truly public.
> > >
> >
> > I have done this (not checked in) but the net effect is that all the
> public
> > methods in the AbstractBloomFilter class are reflected in the BloomFilter
> > interface and the Shape class has moved to the Interface as well.
> >
>
> OK, I think I like BloomFilter as an interface especially since it is used
> as an argument in methods. I'll leave it to you as to whether Shape needs
> to be folded in. I did notice that Shape is an argument in a few places.
> Might we lose some focus and abstraction by folding Shape into the
> BloomFilter interface?
>
>
> > I can remove several methods from BloomFilter that are not strictly
> > necessary for this code to function.  I am and have been reticent to do
> so
> > since there are a number of home-grown libraries used by various
> > researchers that provide these functions.  But if that is what it takes
> to
> > get this out the door it will be done.
> >
>
> Once you have BloomFilter as an interface with an implementing class, it
> might make it much clearer as to what really belongs in the interface. The
> handy methods can stay in the abstract class of course.
>
>
> >
> > > Since this is a first cut for Commons Collection, I would consider only
> > > making APIs public that must be public. Once released, we MUST maintain
> > > binary compatibility within minor releases. Major releases allow us to
> > > break this compatibility but these do not happen very often. I do not
> > know
> > > what this means yet for BF but it's a thought to consider. This kind of
> > > work is made harder due to the current packaging of the BF code.
> > >
> > > We could consider putting it all in one package if that helps reduce
> the
> > > public API footprint.
> > >
> >
> > I think that putting all the pieces into a single package will muddy the
> > waters a bit.  The classes in o.a.c.c.bloomfilter are high level classes
> > and used wherever the Bloom filter code is used.  Objects in the
> > o.a.c.c.bloomfilter.hasher classes are Hashers and are really only
> selected
> > when the application developer determines which hasher is best suited for
> > the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
> > HasherFunction implementations.  These can be used or special cases built
> > by the implementer as necessary.

Re: [collections] bloom filters comments

2019-12-27 Thread Claude Warren
On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory  wrote:

> Hi Claude and all:
>
> Here are a couple of comments on the bloom filter PR.
>
> 10,100 ft level comment we do not have to worry about today: Before the
> release, we might want to split Commons Collection into a multi-module
> project and have the BF code in its own module. The _main_ reason for this
> is that it will allow the dependency on Commons Codec to be a non-optional
> dependency. Optional dependencies are a bit of a pain IMO, especially when
> you deal with large stacks. You end up sucking in everything that is
> optional when you deliver an app because you do not know for certain what's
> really required at runtime.
>
> Closer to the ground: I would make BloomFilter an interface and rename
> the current BloomFilter class AbstractBloomFilter implements BloomFilter.
> This will allow the user and maintainer to see what is truly public.
>

I have done this (not checked in) but the net effect is that all the public
methods in the AbstractBloomFilter class are reflected in the BloomFilter
interface and the Shape class has moved to the Interface as well.

I can remove several methods from BloomFilter that are not strictly
necessary for this code to function.  I am and have been reticent to do so
since there are a number of home-grown libraries used by various
researchers that provide these functions.  But if that is what it takes to
get this out the door it will be done.


> Since this is a first cut for Commons Collection, I would consider only
> making APIs public that must be public. Once released, we MUST maintain
> binary compatibility within minor releases. Major releases allow us to
> break this compatibility but these do not happen very often. I do not know
> what this means yet for BF but it's a thought to consider. This kind of
> work is made harder due to the current packaging of the BF code.
>
> We could consider putting it all in one package if that helps reduce the
> public API footprint.
>

I think that putting all the pieces into a single package will muddy the
waters a bit.  The classes in o.a.c.c.bloomfilter are high level classes
and used wherever the Bloom filter code is used.  Objects in the
o.a.c.c.bloomfilter.hasher classes are Hashers and are really only selected
when the application developer determines which hasher is best suited for
the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
HasherFunction implementations.  These can be used or special cases built
by the implementer as necessary.  Perhaps I am being a bit too pedantic but
I do not think it makes sense to merge them into a single package.


>
> Q: All the code in main's
> org.apache.commons.collections4.bloomfilter.hasher.function is only called
> from test code. Why is that?
>

They are pluggable modules.  When a developer uses them, an instance is
created and passed to the BloomFilter.Shape() constructor as well as
to the various Hasher constructors.  For example:

HashFunction hashFunction = new Murmur128x86Cyclic();
Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 / 10);


>
> Nit: Some odd formatting like spaces after open parens: 'toUpperCase(
> Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
>
> Gary
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [codec] release soon

2019-12-26 Thread Claude Warren
For the contributions and issues I was involved in, the Javadoc appears to
be correct.

Claude

On Thu, Dec 26, 2019 at 1:30 PM Gary Gregory  wrote:

> It looks like we will need a new version of Commons Codec out before we can
> use new code there from Commons Collections. So now's the time to polish,
> PR, and so on.
>
> If you've contributed code to Codec, please make sure the Javadoc are
> helpful and the site up to date.
>
> Thank you!
> Gary
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [Collection] adding Codec dependency?

2019-12-01 Thread Claude Warren
The contribution would work without the Murmur hash, but losing it means no
Murmur-based hashes in the factory methods.  There are workarounds for
this, but all are a bit messy.

It might be possible to clean it up a bit by creating a HashFunction
class/interface that could be used to create the Hasher implementations
rather than the name and function that is currently used.  Then it might be
much easier for integrators to implement other hash functions.
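
A rough sketch of that idea (all names illustrative, not a proposed final
API):

/** Stand-in for the proposed pluggable interface. */
interface HashFunction {
    String getName();
    long apply(byte[] buffer, int seed);
}

/** A Hasher built around the function itself rather than a (name, function) pair. */
class DynamicHasher {
    private final HashFunction function;

    DynamicHasher(final HashFunction function) {
        this.function = function;
    }
    // ... buffer handling as in the current implementation ...
}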

I could make it an optional dependency.

On Sat, Nov 30, 2019 at 4:13 PM sebb  wrote:

> On Sat, 30 Nov 2019 at 12:48, Claude Warren  wrote:
>
> > Greetings,
> >
> > I have a contribution[1] that requires a Murmur128 hash.  The options are
> > add the hash and associated tests to the Collections classes or use the
> > Murmur hash found in the Commons Codec project.
> >
> > In the contribution I elected to go with adding the dependency on the
> codec
> > 1.14 version.[2]  However, this is currently a SNAPSHOT version and will
> > block the release of Collections until Codec releases.
> >
> > Are there any opinions on how to proceed so that the contribution can  be
> > accepted?
> >
> >
> If the Murmur128 hash is added to Collections, it might be best if it is
> added as an internal package, so it can easily be removed if required.
> Will the contribution work at all without the Murmur128 hash?
> If so, perhaps make it an optional dependency?
>
>
> > Any opinions on the contribution as a whole?
> >
> > Thank you for your attention,
> > Claude
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=16985141
> > [2] https://github.com/apache/commons-collections/pull/83
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren


[Collection] adding Codec dependency?

2019-11-30 Thread Claude Warren
Greetings,

I have a contribution[1] that requires a Murmur128 hash.  The options are
add the hash and associated tests to the Collections classes or use the
Murmur hash found in the Commons Codec project.

In the contribution I elected to go with adding the dependency on the codec
1.14 version.[2]  However, this is currently a SNAPSHOT version and will
block the release of Collections until Codec releases.

Are there any opinions on how to proceed so that the contribution can  be
accepted?

Any opinions on the contribution as a whole?

Thank you for your attention,
Claude

[1]
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=16985141
[2] https://github.com/apache/commons-collections/pull/83

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


[commons-codec] is there a commons-codec2

2019-11-24 Thread Claude Warren
Is there a commons-codec2?  I see it listed as
group ID: org.apache.commons
artifact ID: commons-codec2
version: 2.0-SNAPSHOT

in the apache snapshot repository:
https://repository.apache.org/content/repositories/snapshots/org/apache/commons/commons-codec2/2.0-SNAPSHOT/

But I don't see any way to get to the code from the commons website and all
development seems dead.

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [CODEC] Sign Extension Error in Murmur3

2019-11-11 Thread Claude Warren
https://github.com/apache/commons-codec/pull/27

On Mon, Nov 11, 2019 at 4:25 PM Claude Warren  wrote:

> I took the approach that I would leave the original code there and add new
> methods hash128_x64 and hash32_x86.  I also marked the older methods as
> deprecated with a note that the implementation has a sign extension error
> and to use the new methods for new development and cases where the change
> does not impact the code in the wild.
>
> Claude
>
> On Mon, Nov 4, 2019 at 10:50 AM Claude Warren  wrote:
>
>> There are a number of issues with the format and potential bugs in the
>> Codec Murmur3 code. ( See spotbugs, PMD, and codestyle reports)  The one
>> that tripped me up was the mix of tab/space lines.  Anyway, these issues
>> can be fixed in later pull requests.  I think there is one overarching question
>> and one specific question:
>>
>> In general, what is the policy for how do we deal with digest
>> implementations that have a bug and that have been released into the wild?
>>
>> Specifically in this case, what do we do to solve the problem?
>>
>> I was in favor of the following specific action:
>>
>> 1. Deprecate all affected methods and document the issue.
>> 2. Move the deprecated methods into a new class called HiveMurmur3 and
>> have all deprecation point to this class.
>> 3. Create a new Murmur3 class with method names like hash128_x64 (ones
>> that do not conflict with the old names).
>>
>> But the more I think about it the more I think that we should follow KISS
>> and
>>
>> 1. Create a new Murmur3 class with method names like hash128_x64 (ones
>> that do not conflict with the old names).
>> 2. Update the javadocs for the defective methods with notes indicating
>> they have defects.  Perhaps write a readme type document explaining in
>> detail the issue.
>>
>> I think in general we should follow the same action with any digest
>> defect: New methods and Javadoc old.
>>
>> Claude
>>
>>
>>
>>
>> On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert 
>> wrote:
>>
>>>
>>>
>>> > On 4 Nov 2019, at 02:13, sebb  wrote:
>>> >
>>> > On Mon, 4 Nov 2019 at 00:53, Claude Warren >> cla...@xenei.com>> wrote:
>>> >>
>>> >> I think the way to prove they behave correctly is to test the results
>>> >> against the original "C" implementation.  With this in mind I
>>> proposed the
>>> >> changes.
>>> >>
>>> >> I agree that there should be a way to revert to the old code as there
>>> is
>>> >> code in the wild that depends on the earlier implementation.
>>> >>
>>> >> I know that there are broken implementations in other Apache
>>> projects, but
>>> >> those are not libraries (i.e. Cassandra).
>>> >>
>>> >> I would suggest that perhaps the changes that I provided be renamed as
>>> >> hash128Standard and hash32Standard (or some other designator to show
>>> that
>>> >> it is different from the original CODEC implementation).
>>> >>
>>> >> Alternatively hash128 and hash32 could be deprecated  and replaced
>>> with
>>> >> hashx86_32Old, and hashx64_128Old and provide the corrected versions
>>> for
>>> >> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
>>> >> hashx86_128 implementation.
>>> >>
>>> >> My opinion is that we should provide Murmur3 hash implementations that
>>> >> produce the same results as the "C"/C++ implementations for cross
>>> >> application compatibility.
>>> >
>>> > I agree: but the original implementation has far fewer methods.
>>> > So how does one check the Codec methods that have no corresponding
>>> C/C++ method?
>>>
>>> AFAICT the methods that accept a short, int, or long are convenience
>>> methods to avoid creating a byte[] for the data. The JUnit tests
>>> demonstrate this by passing the same values to a ByteBuffer and then
>>> calling the corresponding byte[] methods. The methods do assume the endian
>>> order of the primitive datatypes as processed by a default ByteBuffer (Big
>>> Endian). So perhaps this should be added to the Javadoc with an example for
>>> the equivalent call to the method using the byte[].
>>>
>>> Note that these methods will not be broken. The bug in the code is for
>>> input byte[] lengths that are not a factor of 4. The fix is for any of
>>> the leftover 1, 2, or 3 bytes when computing the hash.

Re: [CODEC] Sign Extension Error in Murmur3

2019-11-11 Thread Claude Warren
I took the approach that I would leave the original code there and add new
methods hash128_x64 and hash32_x86.  I also marked the older methods as
deprecated with a note that the implementation has a sign extension error
and to use the new methods for new development and cases where the change
does not impact the code in the wild.
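
In outline, the pattern looks something like this (signatures illustrative,
bodies elided):

public final class Murmur3 {

    /**
     * @deprecated Contains a sign-extension error when the seed is negative.
     * Use {@link #hash32_x86} for new development.
     */
    @Deprecated
    public static int hash32(final byte[] data, final int length, final int seed) {
        throw new UnsupportedOperationException("original behaviour elided in this sketch");
    }

    /** Corrected implementation matching the reference C++ results. */
    public static int hash32_x86(final byte[] data, final int offset,
            final int length, final int seed) {
        throw new UnsupportedOperationException("corrected behaviour elided in this sketch");
    }
}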

Claude

On Mon, Nov 4, 2019 at 10:50 AM Claude Warren  wrote:

> There are a number of issues with the format and potential bugs in the
> Codec Murmur3 code. ( See spotbugs, PMD, and codestyle reports)  The one
> that tripped me up was the mix of tab/space lines.  Anyway, these issues
> can be fixed in later pull requests.  I think there is one overarching question
> and one specific question:
>
> In general, what is the policy for how do we deal with digest
> implementations that have a bug and that have been released into the wild?
>
> Specifically in this case, what do we do to solve the problem?
>
> I was in favor of the following specific action:
>
> 1. Deprecate all affected methods and document the issue.
> 2. Move the deprecated methods into a new class called HiveMurmur3 and
> have all deprecation point to this class.
> 3. Create a new Murmur3 class with method names like hash128_x64 (ones
> that do not conflict with the old names).
>
> But the more I think about it the more I think that we should follow KISS
> and
>
> 1. Create a new Murmur3 class with method names like hash128_x64 (ones
> that do not conflict with the old names).
> 2. Update the javadocs for the defective methods with notes indicating
> they have defects.  Perhaps write a readme type document explaining in
> detail the issue.
>
> I think in general we should follow the same action with any digest
> defect: New methods and Javadoc old.
>
> Claude
>
>
>
>
> On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert 
> wrote:
>
>>
>>
>> > On 4 Nov 2019, at 02:13, sebb  wrote:
>> >
>> > On Mon, 4 Nov 2019 at 00:53, Claude Warren > cla...@xenei.com>> wrote:
>> >>
>> >> I think the way to prove they behave correctly is to test the results
>> >> against the original "C" implementation.  With this in mind I proposed
>> the
>> >> changes.
>> >>
>> >> I agree that there should be a way to revert to the old code as there
>> is
>> >> code in the wild that depends on the earlier implementation.
>> >>
>> >> I know that there are broken implementations in other Apache projects,
>> but
>> >> those are not libraries (i.e. Cassandra).
>> >>
>> >> I would suggest that perhaps the changes that I provided be renamed as
>> >> hash128Standard and hash32Standard (or some other designator to show
>> that
>> >> it is different from the original CODEC implementation).
>> >>
>> >> Alternatively hash128 and hash32 could be deprecated  and replaced with
>> >> hashx86_32Old, and hashx64_128Old and provide the corrected versions
>> for
>> >> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
>> >> hashx86_128 implementation.
>> >>
>> >> My opinion is that we should provide Murmur3 hash implementations that
>> >> produce the same results as the "C"/C++ implementations for cross
>> >> application compatibility.
>> >
>> > I agree: but the original implementation has far fewer methods.
>> > So how does one check the Codec methods that have no corresponding
>> C/C++ method?
>>
>> AFAICT the methods that accept a short, int, or long are convenience
>> methods to avoid creating a byte[] for the data. The JUnit tests
>> demonstrate this by passing the same values to a ByteBuffer and then
>> calling the corresponding byte[] methods. The methods do assume the endian
>> order of the primitive datatypes as processed by a default ByteBuffer (Big
>> Endian). So perhaps this should be added to the Javadoc with an example for
>> the equivalent call to the method using the byte[].
>>
>> Note that these methods will not be broken. The bug in the code is for
>> input byte[] lengths that are not a factor of 4. The fix is for any of the
>> leftover 1, 2, or 3 bytes when computing the hash.
>>
>> The broken methods are:
>>
>> public static int hash32(final byte[] data)
>> public static int hash32(final String data)
>> public static int hash32(final byte[] data, final int length)
>> public static int hash32(final byte[] data, final int length, final int
>> seed)
>> public static int hash32(final byte[] data, final int offset, final int
>> length, final int seed)

Re: [IO] Don't deprecate IOUtils#closeQuietly()

2019-11-05 Thread Claude Warren
+1 I have similar use cases.
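
For context, the null-safe pattern under discussion is roughly the following
(a sketch, not the IOUtils source):

import java.io.Closeable;
import java.io.IOException;

public final class CloseUtil {
    /** Null-safe quiet close. */
    public static void closeQuietly(final Closeable closeable) {
        if (closeable != null) {
            try {
                closeable.close();
            } catch (final IOException ignored) {
                // deliberately swallowed; the caller asked for a quiet close
            }
        }
    }
}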

On Tue, Nov 5, 2019 at 3:12 PM Gary Gregory  wrote:

> In this vein, I'd also like to add a null-safe close(Closeable) method.
>
> Gary
>
> On Tue, Nov 5, 2019 at 10:10 AM Gary Gregory 
> wrote:
>
> > Hi All:
> >
> > I propose that we do NOT deprecate IOUtils#closeQuietly(). There are
> > use-cases for this method outside of try-with-resources blocks.
> >
> > For example, I have some wrapper objects that contain Closeable objects.
> > When the wrapper is closed, I do want to add calls to
> > IOUtils#closeQuietly() to avoid null checks in call sites.
> >
> > Gary
> >
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [CODEC] Sign Extension Error in Murmur3

2019-11-04 Thread Claude Warren
There are a number of issues with the format and potential bugs in the
Codec Murmur3 code (see the spotbugs, PMD, and codestyle reports).  The one
that tripped me up was the mix of tab/space lines.  Anyway, these issues
can be fixed in later pull requests.  I think there is one overarching
question and one specific question:

In general, what is the policy for how do we deal with digest
implementations that have a bug and that have been released into the wild?

Specifically in this case, what do we do to solve the problem?

I was in favor of the following specific action:

1. Deprecate all affected methods and document the issue.
2. Move the deprecated methods into a new class called HiveMurmur3 and have
all deprecations point to this class.
3. Create a new Murmur3 class with method names like hash128_x64 (ones that
do not conflict with the old names).

But the more I think about it the more I think that we should follow KISS
and

1. Create a new Murmur3 class with method names like hash128_x64 (ones that
do not conflict with the old names).
2. Update the javadocs for the defective methods with notes indicating they
have defects.  Perhaps write a readme type document explaining in detail
the issue.

I think in general we should follow the same action with any digest defect:
New methods and Javadoc old.

Claude




On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert 
wrote:

>
>
> > On 4 Nov 2019, at 02:13, sebb  wrote:
> >
> > On Mon, 4 Nov 2019 at 00:53, Claude Warren  cla...@xenei.com>> wrote:
> >>
> >> I think the way to prove they behave correctly is to test the results
> >> against the original "C" implementation.  With this in mind I proposed
> the
> >> changes.
> >>
> >> I agree that there should be a way to revert to the old code as there is
> >> code in the wild that depends on the earlier implementation.
> >>
> >> I know that there are broken implementations in other Apache projects,
> but
> >> those are not libraries (i.e. Cassandra).
> >>
> >> I would suggest that perhaps the changes that I provided be renamed as
> >> hash128Standard and hash32Standard (or some other designator to show
> that
> >> it is different from the original CODEC implementation).
> >>
> >> Alternatively hash128 and hash32 could be deprecated  and replaced with
> >> hashx86_32Old, and hashx64_128Old and provide the corrected versions for
> >> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
> >> hashx86_128 implementation.
> >>
> >> My opinion is that we should provide Murmur3 hash implementations that
> >> produce the same results as the "C"/C++ implementations for cross
> >> application compatibility.
> >
> > I agree: but the original implementation has far fewer methods.
> > So how does one check the Codec methods that have no corresponding C/C++
> method?
>
> AFAICT the methods that accept a short, int, or long are convenience
> methods to avoid creating a byte[] for the data. The JUnit tests
> demonstrate this by passing the same values to a ByteBuffer and then
> calling the corresponding byte[] methods. The methods do assume the endian
> order of the primitive datatypes as processed by a default ByteBuffer (Big
> Endian). So perhaps this should be added to the Javadoc with an example for
> the equivalent call to the method using the byte[].
>
> Note that these methods will not be broken. The bug in the code is for
> input byte[] lengths that are not a multiple of 4. The fix is for any of the
> leftover 1, 2, or 3 bytes when computing the hash.
>
> The broken methods are:
>
> public static int hash32(final byte[] data)
> public static int hash32(final String data)
> public static int hash32(final byte[] data, final int length)
> public static int hash32(final byte[] data, final int length, final int
> seed)
> public static int hash32(final byte[] data, final int offset, final int
> length, final int seed)
>
> This class is also broken:
> public static class IncrementalHash32
>
> >
> > Maybe the Codec implementation needs to be drastically pruned as well.
> >
> >> Claude
> >>
> >> On Sun, Nov 3, 2019 at 11:34 PM sebb  wrote:
> >>
> >>> As I see it, there are two use cases here.
> >>>
> >>> 1) Commons Codec code is used exclusively, in which case it is
> >>> essential that the output does not change for a given seed.
> >>>
> >>> 2) Commons Codec is used in conjunction with other implementations, in
> >>> which case it is essential that Codec is aligned with other
> >>> implementations
> >>>
> >

Re: [CODEC] Sign Extension Error in Murmur3

2019-11-03 Thread Claude Warren
As a third option, as @melloware said in the pull request comments, the
original implementation came from Apache Hive.  The current hashes could be
named hash32Hive and hash128Hive.

On Mon, Nov 4, 2019 at 12:53 AM Claude Warren  wrote:

> I think the way to prove they behave correctly is to test the results
> against the original "C" implementation.  With this in mind I proposed the
> changes.
>
> I agree that there should be a way to revert to the old code as there is
> code in the wild that depends on the earlier implementation.
>
> I know that there are broken implementations in other Apache projects, but
> those are not libraries (i.e. Cassandra).
>
> I would suggest that perhaps the changes that I provided be renamed as
> hash128Standard and hash32Standard (or some other designator to show that
> it is different from the original CODEC implementation).
>
> Alternatively hash128 and hash32 could be deprecated  and replaced with
> hashx86_32Old, and hashx64_128Old and provide the corrected versions for
> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
> hashx86_128 implementation.
>
> My opinion is that we should provide Murmur3 hash implementations that
> produce the same results as the "C"/C++ implementations for cross
> application compatibility.
>
> Claude
>
> On Sun, Nov 3, 2019 at 11:34 PM sebb  wrote:
>
>> As I see it, there are two use cases here.
>>
>> 1) Commons Codec code is used exclusively, in which case it is
>> essential that the output does not change for a given seed.
>>
>> 2) Commons Codec is used in conjunction with other implementations, in
>> which case it is essential that Codec is aligned with other
>> implementations
>>
>> Both use cases seem valid to me, so I think it's important to allow
>> both old and new algorithms to co-exist going forward.
>>
>> However:
>> Should the default behaviour change (with option to revert)?
>> Or would it be better to make the corrected behaviour optional?
>>
>> Or deliberately break the API to force users to choose between old and
>> new behaviour?
>>
>> Note that Wikipedia says that the x86 and x64 versions of the
>> algorithm generate different results in the case of the 128 bit hash.
>> However Commons Codec does not have different implementations for x86
>> and x64; it is not clear which has been implemented.
>> The JavaDoc also fails to mention that there is a 64bit hash.
>>
>> The class clearly needs additional test data; it's possible that there
>> are discrepancies in other hash sizes as well.
>>
>> The code has multiple implementations for each hash size depending on
>> the parameter types, which increases the number of tests needed.
>>
>> Note that the original C++ implementation only has 3 hash methods:
>> MurmurHash3_x86_32
>> MurmurHash3_x86_128
>> MurmurHash3_x64_128
>>
>> AFAICT these all take the same parameters:
>> - unsigned byte array
>> - length
>> - unsigned 32 bit seed
>> - pointer to output buffer
>>
>> No idea why the Codec version allows additional input such as one (or
>> two) longs, etc.
>> Unless there is a defined standard for how these should be handled, it
>> will be impossible to prove that they behave correctly.
>>
>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert 
>> wrote:
>> >
>> >
>> >
>> > > On 3 Nov 2019, at 21:45, Gary Gregory  wrote:
>> > >
>> > > I feel like I am missing something basic in the assumption of this
>> issue:
>> > > there is no such thing as an unsigned int in Java and the ticket talks
>> > > about (C?) unsigned ints. Please help me understand how or why we
>> should
>> > > care about C vs. Java ints. Are we comparing apples to oranges here?
>> >
>> > When a byte is converted to an int there is sign extension if it is
>> negative. A negative byte will have all bits above the 8th bit set to 1. So
>> if the byte is negative then when converted to an int for bit shift and xor
>> operations the raw bits are not the same.
>> >
>> > These are not the same:
>> >
>> > byte b = -1;
>> > (int) b != (b & 0xff);
>> > b << 8 != (b & 0xff) << 8;
>> > b << 16 != (b & 0xff) << 16;
>> >
>> > The original code has the use of the 0xff mask for most of the murmur3
>> algorithm. It has been missed from the final steps applied to the last 3
>> bytes in the hash32 algorithm variant.
>> >
>> > Alex
>> >
>>

Re: [CODEC] Sign Extension Error in Murmur3

2019-11-03 Thread Claude Warren
I think the way to prove they behave correctly is to test the results
against the original "C" implementation.  With this in mind I proposed the
changes.

I agree that there should be a way to revert to the old code as there is
code in the wild that depends on the earlier implementation.

I know that there are broken implementations in other Apache projects, but
those are not libraries (i.e. Cassandra).

I would suggest that perhaps the changes that I provided be renamed as
hash128Standard and hash32Standard (or some other designator to show that
it is different from the original CODEC implementation).

Alternatively hash128 and hash32 could be deprecated  and replaced with
hashx86_32Old, and hashx64_128Old and provide the corrected versions for
hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
hashx86_128 implementation.

My opinion is that we should provide Murmur3 hash implementations that
produce the same results as the "C"/C++ implementations for cross
application compatibility.

Claude

On Sun, Nov 3, 2019 at 11:34 PM sebb  wrote:

> As I see it, there are two use cases here.
>
> 1) Commons Codec code is used exclusively, in which case it is
> essential that the output does not change for a given seed.
>
> 2) Commons Codec is used in conjunction with other implementations, in
> which case it is essential that Codec is aligned with other
> implementations
>
> Both use cases seem valid to me, so I think it's important to allow
> both old and new algorithms to co-exist going forward.
>
> However:
> Should the default behaviour change (with option to revert)?
> Or would it be better to make the corrected behaviour optional?
>
> Or deliberately break the API to force users to choose between old and
> new behaviour?
>
> Note that Wikipedia says that the x86 and x64 versions of the
> algorithm generate different results in the case of the 128 bit hash.
> However Commons Codec does not have different implementations for x86
> and x64; it is not clear which has been implemented.
> The JavaDoc also fails to mention that there is a 64bit hash.
>
> The class clearly needs additional test data; it's possible that there
> are discrepancies in other hash sizes as well.
>
> The code has multiple implementations for each hash size depending on
> the parameter types, which increases the number of tests needed.
>
> Note that the original C++ implementation only has 3 hash methods:
> MurmurHash3_x86_32
> MurmurHash3_x86_128
> MurmurHash3_x64_128
>
> AFAICT these all take the same parameters:
> - unsigned byte array
> - length
> - unsigned 32 bit seed
> - pointer to output buffer
>
> No idea why the Codec version allows additional input such as one (or
> two) longs, etc.
> Unless there is a defined standard for how these should be handled, it
> will be impossible to prove that they behave correctly.
>
> On Sun, 3 Nov 2019 at 22:01, Alex Herbert 
> wrote:
> >
> >
> >
> > > On 3 Nov 2019, at 21:45, Gary Gregory  wrote:
> > >
> > > I feel like I am missing something basic in the assumption of this
> issue:
> > > there is no such thing as an unsigned int in Java and the ticket talks
> > > about (C?) unsigned ints. Please help me understand how or why we
> should
> > > care about C vs. Java ints. Are we comparing apples to oranges here?
> >
> > When a byte is converted to an int there is sign extension if it is
> negative. A negative byte will have all bits above the 8th bit set to 1. So
> if the byte is negative then when converted to an int for bit shift and xor
> operations the raw bits are not the same.
> >
> > These are not the same:
> >
> > byte b = -1;
> > (int) b != (b & 0xff);
> > b << 8 != (b & 0xff) << 8;
> > b << 16 != (b & 0xff) << 16;
> >
> > The original code has the use of the 0xff mask for most of the murmur3
> algorithm. It has been missed from the final steps applied to the last 3
> bytes in the hash32 algorithm variant.
> >
> > Alex
> >
> >
> >
> > >
> > > Thank you,
> > > Gary
> > >
> > > On Sun, Nov 3, 2019, 07:59 Claude Warren  wrote:
> > >
> > >> There is an error in the current Murmur3 code introduced by sign
> extension
> > >> errors.  This is documented in CODEC-264.[1]
> > >>
> > >> I have created a pull request to fix it.[2]
> > >>
> > >> While the code changes did not change any of the existing Murmur3
> tests, I
> > >> did add new tests that failed until the changes were applied.
> > >>
> > >> However, I

[CODEC] Sign Extension Error in Murmur3

2019-11-03 Thread Claude Warren
There is an error in the current Murmur3 code introduced by sign extension
errors.  This is documented in CODEC-264.[1]

I have created a pull request to fix it.[2]

While the code changes did not change any of the existing Murmur3 tests, I
did add new tests that failed until the changes were applied.

However, I am concerned that the commons-codec Murmur3 code is in the wild
and this change will impact some hashes.
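To make the defect concrete, here is a minimal illustration of the
sign-extension hazard (a simplified sketch, not the actual commons-codec
code):

public class SignExtensionDemo {

    // Buggy: a negative byte sign-extends, setting all upper bits of the int.
    static int foldTailBuggy(byte[] data, int tail) {
        return data[tail + 1] << 8 | data[tail];
    }

    // Fixed: masking with 0xff keeps only the low 8 bits of each byte.
    static int foldTailFixed(byte[] data, int tail) {
        return (data[tail + 1] & 0xff) << 8 | (data[tail] & 0xff);
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xFF, (byte) 0x80 };
        System.out.printf("buggy: %08x%n", foldTailBuggy(data, 0)); // ffffffff
        System.out.printf("fixed: %08x%n", foldTailFixed(data, 0)); // 000080ff
    }
}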

Therefore, I am asking for a discussion concerning how to apply any patches
like CODEC-264 to hashing algorithms that are in the wild.

Thx,
Claude

[1] https://issues.apache.org/jira/browse/CODEC-264
[2] https://github.com/apache/commons-codec/pull/27


Re: [lang] immutable BitSet

2019-10-28 Thread Claude Warren
Having "ImmutableBitSet" inherit from "BitSet" breaks the latter's contract.
>
>
no more so than the ImmutableSet breaks the Set contract.  Yes, it does, but
the pattern is well established.
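For reference, the established pattern is roughly this (a sketch; the
mutators are rejected at runtime, which is exactly the contract-breaking
trade-off under discussion):

import java.util.BitSet;

public class ImmutableBitSet extends BitSet {

    public ImmutableBitSet(BitSet source) {
        super(source.length());
        super.or(source); // copy the source state once, via the super call
    }

    @Override
    public void set(int bitIndex) {
        throw new UnsupportedOperationException("immutable");
    }

    @Override
    public void clear(int bitIndex) {
        throw new UnsupportedOperationException("immutable");
    }

    @Override
    public void flip(int bitIndex) {
        throw new UnsupportedOperationException("immutable");
    }

    // ...the remaining mutators (and, andNot, or, xor, the range variants,
    // etc.) would be overridden in the same way.
}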


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] BloomFilter package architecture discussion

2019-10-28 Thread Claude Warren
Based on the feedback and questions I have created a different
implementation (https://github.com/Claudenw/BloomFilter/tree/Hasher) --
note this is the Hasher branch.

This implementation uses an abstract BloomFilter to implement the basic
functionality and requires that implementations provide 4 methods at a
minimum (a sketch follows the list below):

getBits() -- retrieves a LongBuffer with the associated enabled bits turned
on.
getHasher() -- retrieves a StaticHasher that will iterate the index of the
enabled bits.
merge(BloomFilter) -- merges an abstract bloom filter into the current
bloom filter using one of the two methods above.
merge(Shape,Hasher) -- merges the bits from the hasher using the specified
shape into the filter.

Other methods are expected to be specialized for operation with Bloom
filters of the same type.
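A minimal sketch of the abstract base as described (Shape, Hasher and
StaticHasher are the types from the Hasher branch; the placeholder
declarations and signatures here are illustrative only):

import java.nio.LongBuffer;

// Placeholders for the types described above (defined elsewhere on the branch).
interface Hasher { }
interface StaticHasher extends Hasher { }
class Shape { }

public abstract class BloomFilter {

    /** The enabled bits packed into longs. */
    public abstract LongBuffer getBits();

    /** A StaticHasher that iterates the indices of the enabled bits. */
    public abstract StaticHasher getHasher();

    /** Merges any BloomFilter implementation into this filter. */
    public abstract void merge(BloomFilter other);

    /** Merges the bits produced by the hasher, for the given shape. */
    public abstract void merge(Shape shape, Hasher hasher);

    /** Example of a query implemented once against the abstract methods. */
    public int hammingValue() {
        int count = 0;
        LongBuffer bits = getBits().duplicate();
        while (bits.hasRemaining()) {
            count += Long.bitCount(bits.get());
        }
        return count;
    }
}

The hammingValue() method shows the intent: queries implemented once
against the abstract methods, with subclasses free to override when their
internal structure can answer more efficiently.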

There are 4 Bloom filter implementations to demonstrate how this would be
done.  I do not expect the EWAH implementation to be included in an Apache
library as it depends on 3rd-party code; it is here as an example.

The BloomFilter class contains an enclosed Shape class that describes the
shape of the BloomFilter.  This is very similar to the Shape presented by
Gilles.

There is a construct called a Hasher that comprises a hashName and a
ToLongBiFunction<byte[], Integer>.  The hashing functions are moved
to DynamicHasher, and the result of hashing can be stored for replay later
using the StaticHasher.

There is a HasherFactory to build well-known hashers.  The HasherFactory
also contains a description of how the ToLongBiFunction<byte[], Integer>
should function (i.e. what the parameters are) so that new
implementations can be created.  There are 4 implementations of hashing
functions included as examples (see the sketch below).
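The implied contract is (buffer, seed) -> long; a minimal sketch of such a
function (a hypothetical hash for illustration, not one of the included
implementations):

import java.util.function.ToLongBiFunction;

public class HasherSketch {
    public static void main(String[] args) {
        // The factory contract implies (buffer, seed) -> long.
        ToLongBiFunction<byte[], Integer> simpleHash = (buffer, seed) -> {
            long h = seed;
            for (byte b : buffer) {
                h = h * 31 + (b & 0xff); // mask to avoid sign extension
            }
            return h;
        };
        System.out.println(simpleHash.applyAsLong("foo".getBytes(), 1));
    }
}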

Claude


On Sun, Oct 20, 2019 at 5:04 PM Gilles Sadowski 
wrote:

> Hi Claude.
>
> Le dim. 20 oct. 2019 à 14:20, Claude Warren  a écrit :
> >
> > A Bloom filter is a set of bits:
>
> It is not, according to the quote here:
>
> > "The hash area is considered as N individual addressable bits, with
> > addresses 0 through N - 1. It is assumed that all bits in the hash area
> are
> > first set to 0. Next, each message in the set to be stored is hash
> coded
> > into a number of distinct bit addresses, say a1, a2, ..., ad. Finally,
> all
> > d bits addressed by a1 through ad are set to 1." -- Burton Bloom,
> > "Space/Time Trade-offs in Hash Coding with Allowable Errors"[1]
>
> In effect, the functionality needs some structures (cf. "hash", "message",
> "addressable bits") to operate on, in order to provide its service (what I
> called the _user_ API).
> Either your contribution still confuses API with implementation, or I'm
> still totally missing what it tries to achieve (in terms of "service").
> My feeling is that we are stuck because of different programming habits.
> My proposal (to expand the contents of the "BloomFilter.java" file) should
> be seen as a practical exercise around the misunderstanding, that would
> hopefully lead us to a consensus.
>
> Regards,
> Gilles
>
> > Claude
> >
> > [1] http://crystall.uta.edu/~mcguigan/cse6350/papers/Bloom.pdf
> >
> > On Sat, Oct 19, 2019 at 11:40 PM Claude Warren  wrote:
> >
> > > As an experiment I attempted to implement a complete BloomFilter (see
> > > BloomFilterI2 in https://issues.apache.org/jira/browse/COLLECTIONS-728
> )
> > > using only IntStream and merge( BloomFilter ).  It is possible, though
> > > probably not the fastest implementation.
> > >
> > > We could define something like:
> > >  ByteBuffer getBits()
> > >
> > > or a method that returns a stream of Boolean (this will be very long in
> > > processing).
> > > or a method that returns a stream of ByteBuffers (the concatenation of
> > > which is the entire set of bits)
> > > or a method that returns an IntStream (the concatenation of which is
> the
> > > entire set of bits).
> > >
> > > I think that most implementations will have some internal structure
> that
> > > can generate the above.  The wider the item in the stream the faster
> we can
> > > move the bits, the better the average performance will be.
> > >
> > >
> > > Claude
> > >
> > >
> > >
> > >
> > > On Sat, Oct 19, 2019 at 4:48 PM Gilles Sadowski 
> > > wrote:
> > >
> > >> Hello.
> > >>
> > >> 2019-10-19 16:20 UTC+02:00, Claude Warren :
> > >> > On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski <
> gillese...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Hi.
> > >> >

Re: [All] Using "SemVer"?

2019-10-21 Thread Claude Warren
I think there is a general belief among users that Apache follows
SemVer for most Java code.  I think it would be best if each module
specified whether it follows SemVer or not -- just to clear up any confusion
on the part of the user.

Claude

On Fri, Oct 18, 2019 at 4:08 PM Stefan Bodewig  wrote:

> On 2019-10-18, Gilles Sadowski wrote:
>
> > My point/question was whether we do not already follow it.
>
> We don't, at least not in all components.
>
> Quite a few of our components don't have a patch number at all and they
> sometimes create minor releases that would be patch releases if we
> followed SemVer. And sometimes we convince ourselves something that in a
> release that would be a major release it is OK to not change the major
> version number. Both ends of this are true for Compress which I happen
> to know best.
>
> > What would be the additional burden?
>
> Having to obey more rules.
>
> Don't get me wrong. I haven't got any problems with components who want
> to adhere to SemVer.
>
> It is that I just don't think we need to force all components to do
> so. As little as we don't need to force all components to use Maven or
> need to force all components to use an indentation of four spaces. There
> is no reason to standardize this across components IMHO.
>
> Stefan
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] BloomFilter package architecture discussion

2019-10-20 Thread Claude Warren
A Bloom filter is a set of bits:

"The hash area is considered as N individual addressable bits, with
addresses 0 through N - 1. It is assumed that all bits in the hash area are
first set to 0. Next, each message in the set to be stored is hash coded
into a number of distinct bit addresses, say a1, a2, ..., ad. Finally, all
d bits addressed by a1 through ad are set to 1." -- Burton Bloom,
"Space/Time Trade-offs in Hash Coding with Allowable Errors"[1]

Claude

[1] http://crystal.uta.edu/~mcguigan/cse6350/papers/Bloom.pdf

On Sat, Oct 19, 2019 at 11:40 PM Claude Warren  wrote:

> As an experiment I attempted to implement a complete BloomFilter (see
> BloomFilterI2 in https://issues.apache.org/jira/browse/COLLECTIONS-728)
> using only IntStream and merge( BloomFilter ).  It is possible, though
> probably not the fastest implementation.
>
> We could define something like:
>  ByteBuffer getBits()
>
> or a method that returns a stream of Boolean (this will be very long in
> processing).
> or a method that returns a stream of ByteBuffers (the concatenation of
> which is the entire set of bits)
> or a method that returns an IntStream (the concatenation of which is the
> entire set of bits).
>
> I think that most implementations will have some internal structure that
> can generate the above.  The wider the item in the stream the faster we can
> move the bits, the better the average performance will be.
>
>
> Claude
>
>
>
>
> On Sat, Oct 19, 2019 at 4:48 PM Gilles Sadowski 
> wrote:
>
>> Hello.
>>
>> 2019-10-19 16:20 UTC+02:00, Claude Warren :
>> > On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski 
>> > wrote:
>> >
>> >> Hi.
>> >>
>> >> >>> [...]
>> >> > >
>> >> > > Maybe I was not clear enough: I'm not saying that we should prefer
> >> > > some representation (of the state) over another; only that how
>> >> > > the state is represented externally need not be the same as the
>> >> internal
>> >> > > representation.
>> >> > >
>> >> > > But if the state is fully encapsulated by a "BitSet", then internal
>> >> > > and
>> >> > > external can be the same, making easier both for developers and for
>> >> > > users.  So let's have indeed a "getBitSet()" that returns a copy of
>> >> > > the
>> >> > > (private final) instance field (of type "BitSet") used to represent
>> >> > > the
>> >> > > state of a "BloomFilter".
>> >> > >
>> >> > >
>> >> > I agree with most of this statement. However, I think the use of
>> BitSet
>> >> for
>> >> > the internal representation is questionable.
>> >>
>> >> Hence, that settles the issue of "class" versus "interface":
>> >> "BloomFilter"
>> >> must be an interface.
>> >>
>> >> *I am still not convinced. How do you then provide consistent
>> >> implementations of the distance functions? We could use “default”
>> >> implementations but that seems like a misuse of the intent for
>> “default”,
>> >> though perhaps I misunderstand that.*
>>
>> What's the issue with having "distance(BloomFilter other)"
>> in the interface?
>>
>> >>
>> >>
>> >> [snip]
>> >
>> >> > In addition, there are researchers that do not use
>> >> > BitSet because it is too slow or too bloated.
>> >>
>> >> Important point.
>> >> [I thought that "BitSet" was evoked because it perfectly fitted the
>> >> bill.]
>> >>
>> >
>> > BitSet fits the bill in terms of required data read functionality.
>>
>> Yes; but not in terms of performance for certain use cases.
>> Hence why it cannot be a common denominator for all
>> implementations.
>>
>> >
>> > [snip]
>> >
>> >> > We can use the
>> >> > Implementation in our classes and use the interface where possible
>> but
>> >> > it
>> >> > leaves open the possibility of switching in a version of
>> RoaringBitmap
>> >> > or
>> >> > similar without changing the external API. It is this interface that
>> is
>> >> the
>> >> > “external reference implementation of state”. I think I wou

Re: [collections] BloomFilter package architecture discussion

2019-10-19 Thread Claude Warren
As an experiment I attempted to implement a complete BloomFilter (see
BloomFilterI2 in https://issues.apache.org/jira/browse/COLLECTIONS-728)
using only IntStream and merge( BloomFilter ).  It is possible, though
probably not the fastest implementation.

We could define something like:
 ByteBuffer getBits()

or a method that returns a stream of Boolean (this will be very long in
processing).
or a method that returns a stream of ByteBuffers (the concatenation of
which is the entire set of bits)
or a method that returns an IntStream (the concatenation of which is the
entire set of bits).

I think that most implementations will have some internal structure that
can generate the above.  The wider the item in the stream, the faster we can
move the bits and the better the average performance will be.
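Side by side, the candidate accessors look like this (a sketch for
comparison only; we would pick one of them):

import java.nio.ByteBuffer;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public interface BloomFilterBits {

    ByteBuffer getBits();                  // one buffer holding all the bits

    Stream<Boolean> bitStream();           // one Boolean per bit (very long)

    Stream<ByteBuffer> byteBufferStream(); // chunks; concatenation is the
                                           // entire set of bits

    IntStream intStream();                 // int-wide chunks, concatenated
}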


Claude




On Sat, Oct 19, 2019 at 4:48 PM Gilles Sadowski 
wrote:

> Hello.
>
> 2019-10-19 16:20 UTC+02:00, Claude Warren :
> > On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski 
> > wrote:
> >
> >> Hi.
> >>
> >> >>> [...]
> >> > >
> >> > > Maybe I was not clear enough: I'm not saying that we should prefer
> >> > > some representation (of the state) over another; only that how
> >> > > the state is represented externally need not be the same as the
> >> internal
> >> > > representation.
> >> > >
> >> > > But if the state is fully encapsulated by a "BitSet", then internal
> >> > > and
> >> > > external can be the same, making easier both for developers and for
> >> > > users.  So let's have indeed a "getBitSet()" that returns a copy of
> >> > > the
> >> > > (private final) instance field (of type "BitSet") used to represent
> >> > > the
> >> > > state of a "BloomFilter".
> >> > >
> >> > >
> >> > I agree with most of this statement. However, I think the use of
> BitSet
> >> for
> >> > the internal representation is questionable.
> >>
> >> Hence, that settles the issue of "class" versus "interface":
> >> "BloomFilter"
> >> must be an interface.
> >>
> >> *I am still not convinced. How do you then provide consistent
> >> implementations of the distance functions? We could use “default”
> >> implementations but that seems like a misuse of the intent for
> “default”,
> >> though perhaps I misunderstand that.*
>
> What's the issue with having "distance(BloomFilter other)"
> in the interface?
>
> >>
> >>
> >> [snip]
> >
> >> > In addition, there are researchers that do not use
> >> > BitSet because it is too slow or too bloated.
> >>
> >> Important point.
> >> [I thought that "BitSet" was evoked because it perfectly fitted the
> >> bill.]
> >>
> >
> > BitSet fits the bill in terms of required data read functionality.
>
> Yes; but not in terms of performance for certain use cases.
> Hence why it cannot be a common denominator for all
> implementations.
>
> >
> > [snip]
> >
> >> > We can use the
> >> > Implementation in our classes and use the interface where possible but
> >> > it
> >> > leaves open the possibility of switching in a version of RoaringBitmap
> >> > or
> >> > similar without changing the external API. It is this interface that
> is
> >> the
> >> > “external reference implementation of state”. I think I would like to
> >> call
> >> > this interface BitSetI and will refer to it this way later. I think it
> >> > should contain the following methods as defined in BitSet: and(),
> >> andNot(),
> >> > cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
> >> > nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(),
> >> xor().
> >>
> >> IMHO, this proposal orthogonal to the "BloomFilter" functionality, and
> >> we must discuss separately whether [Collections] wants to host such an
> >> API.
> >>
> >
> > I think a core question needs to be answered: Is a Bloom filter an
> > extension of functionality on a BitSet (isA) or does it simply use the
> > functionality of a BitSet (hasA).
>
> AFAICT, neither.
>
> > I find myself of two minds on this
> > question at the start, but seem to come down on the isA side after
> analysis
> > of how the Bloom fi

Re: [collections] BloomFilter package architecture discussion

2019-10-19 Thread Claude Warren
On Fri, Oct 18, 2019 at 3:22 PM Gilles Sadowski 
wrote:

> Hi.
>
> >>> [...]
> > >
> > > Maybe I was not clear enough: I'm not saying that we should prefer
> > > some representation (of the state) over another; only that how
> > > the state is represented externally need not be the same as the
> internal
> > > representation.
> > >
> > > But if the state is fully encapsulated by a "BitSet", then internal and
> > > external can be the same, making easier both for developers and for
> > > users.  So let's have indeed a "getBitSet()" that returns a copy of the
> > > (private final) instance field (of type "BitSet") used to represent the
> > > state of a "BloomFilter".
> > >
> > >
> > I agree with most of this statement. However, I think the use of BitSet
> for
> > the internal representation is questionable.
>
> Hence, that settles the issue of "class" versus "interface": "BloomFilter"
> must be an interface.
>
> *I am still not convinced. How do you then provide consistent
> implementations of the distance functions? We could use “default”
> implementations but that seems like a misuse of the intent for “default”,
> though perhaps I misunderstand that.*
>
>
> [snip]

> > In addition, there are researchers that do not use
> > BitSet because it is too slow or too bloated.
>
> Important point.
> [I thought that "BitSet" was evoked because it perfectly fitted the bill.]
>

BitSet fits the bill in terms of required data read functionality.

[snip]

> > We can use the
> > Implementation in our classes and use the interface where possible but it
> > leaves open the possibility of switching in a version of RoaringBitmap or
> > similar without changing the external API. It is this interface that is
> the
> > “external reference implementation of state”. I think I would like to
> call
> > this interface BitSetI and will refer to it this way later. I think it
> > should contain the following methods as defined in BitSet: and(),
> andNot(),
> > cardinality(), get(), intersects(), isEmpty(), nextClearBit(),
> > nextSetBit(), or(), previousClearBit(), previousSetBit(), stream(),
> xor().
>
> IMHO, this proposal orthogonal to the "BloomFilter" functionality, and
> we must discuss separately whether [Collections] wants to host such an
> API.
>

I think a core question needs to be answered: Is a Bloom filter an
extension of functionality on a BitSet (isA) or does it simply use the
functionality of a BitSet (hasA). I find myself of two minds on this
question at the start, but seem to come down on the isA side after analysis
of how the Bloom filter is used. To that end BloomFilter should provide the
data read methods of the BitSet functionality -- what I called BitSetI
before (a sketch follows below).
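A sketch of that read-only view, keeping just the non-mutating subset of
the BitSet methods listed earlier in the thread (the bitwise operations
like and/or/xor would need read-only variants, so they are omitted here):

import java.util.stream.IntStream;

public interface BitSetI {
    int cardinality();
    boolean get(int bitIndex);
    boolean intersects(BitSetI other);
    boolean isEmpty();
    int nextClearBit(int fromIndex);
    int nextSetBit(int fromIndex);
    int previousClearBit(int fromIndex);
    int previousSetBit(int fromIndex);
    IntStream stream();
}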

>
> A priori, how the classes that implements the "BloomFilter" interface
> represents the "set of bits" is an implementation detail.  Even that there
> is a "set of bits" is a detail to *users* of the interface.
> Of course, a practical question is whether the representation should be
> "pluggable".  If not, we could change to using one or another from those
> classes that you mentioned (and change it from version to version).
> In any case, I think that this would be a feature to be left for later (and
> that could be added in a compatible way, if nothing of the "state" is
> leaked
> into the public API).
>

To my mind the “set of bits” interface is what I called BitSetI. Again,
this defines how the Bloom filter presents itself as a “set of bits” not
how it implements it.

[snip]

> > The implementation may not want to use the BitSet class because the
> filter
> > itself can be large. For example in DNA matching it is not uncommon for
> the
> > Bloom filter to be 100s of MB in size. This also means we don’t want to
> > make copies if we don’t have to. (perhaps I should add DNA matching to
> the
> > Use cases to capture the large number of bits).
>
> So this use case would entail that the "BloomFilter" interface must *not*
> mandate a "toBitSet()" method.
>

Agreed.


> > > [...]
> > >
> >
> > I agree with most of this statement. I still don’t like forcing a
> standard
> > BitSet implementation onto all instances of Bloom filter
>
> +1
>
> > when it is the
> > logical construct (BitSetI) that we are looking for.
>
> -1
> [IIUC; see my comments above.]
>

Somehow I think we are in violent agreement here, but perhaps we are talking
past each other somehow.


> > >
> > > > There are several other examples such as layered Bloom
> > > > filters (
> > > https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters)
> > > > and attenuated Bloom filters (
> > > > https://en.wikipedia.org/wiki/Bloom_filter#Attenuated_Bloom_filters)
> > > where
> > > > the internal structure deviates from the BitSet model but that still
> need
> > > > to present themselves as AbstractBloomFilters do. I think that
> Abstract
> > > > class is the correct pattern here.
> > >
> > > Not sure: "abstract" is adequate when several subclasses need to
> > > override behaviours; 

[collections] Bloom filter - Discussion of Shape

2019-10-18 Thread Claude Warren
 I think the other discussion is getting a bit long so I thought we could
start this discussion here and see if we can close out the other discussion
with agreement on the remaining topics.

The “Shape” of a bloom filter (excluding the hash algo) is defined
mathematically by

Number of Items (AKA: n): n = ceil(m / (-k / log(1 - exp(log(p) / k))))

Probability of Collision (AKA: p): p = (1 - exp(-kn/m))^k

Number of Bits (AKA: m): m = ceil((n * log(p)) / log(1 / pow(2, log(2))))

Number of Functions (AKA: k): k = round((m / n) * log(2))

Since the probability (p) is actually calculated directly from the other 3,
the shape is concretely defined by n, m, and k.

This is all in the BloomFilterConfiguration class.

The BloomFilterConfiguration class has 4 constructors: (n,p), (p,m,k),
(n,m), and (n,m,k). In all cases where there is a "p" it is assumed to be a
desired value and the values of the other missing properties are calculated.
The actual value for p is then recalculated from n, m and k (a sketch of
this arithmetic follows below).
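A sketch of the (n, p) constructor arithmetic using the formulas above
(illustrative, not the actual BloomFilterConfiguration code):

public class ShapeSketch {
    public static void main(String[] args) {
        int n = 1000;        // number of items
        double p = 0.01;     // desired probability of collision

        // m = ceil((n * log(p)) / log(1 / pow(2, log(2))))
        int m = (int) Math.ceil((n * Math.log(p))
                / Math.log(1.0 / Math.pow(2, Math.log(2))));

        // k = round((m / n) * log(2))
        int k = (int) Math.round(((double) m / n) * Math.log(2));

        // recalculate the actual p from n, m and k
        double actualP = Math.pow(1 - Math.exp(-k * (double) n / m), k);

        System.out.printf("m=%d k=%d p=%.4f%n", m, k, actualP);
    }
}

For n = 1000 and p = 0.01 this yields m = 9586 and k = 7, with the
recalculated p coming back out at roughly 0.01.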

Now it makes sense that the hash algo should also be specified. I am happy
to assume an enum here for the moment, and also assume a mechanism for
resolving the enum to a concrete hash implementation.

I think that you will find that BloomFilterConfiguration fits the bill but
perhaps should be renamed to BloomFilterShape.



There is one other component that we have not addressed lately: The
ProtoBloomFilter.

The ProtoBloomFilter addresses an issue that arises when there are filters
with different “shape” but the same hashing algo.

Since hashing is the expensive operation, we want to try to cache the hash
values and reuse them rather than recalculate them.

The ProtoBloomFilter contains the results of the hashing in a format that
is easily used to create filters of the specified shape.

The current implementation uses a circular method whereby a murmur128 hash
is generated for a parameter to a with() call (e.g. with(“foo”) generates a
murmur128 hash of “foo”).  The hash is split into 2 longs.

Now when the shape has a “k” of:
  1 we return the first long.
  2 we add the second long to the first long and return that.
  3 we add the second long to the result above and return that.

This has been shown to be as randomly distributed as calling
  murmur128( “foo”, (seed1) )
  murmur128( “foo”, (seed2) )
  murmur128( “foo”, (seed3) )

and much faster.

We can do this with any implementation that produces 128 bit hashes.

Once we have the long value for the specific k we take the mod() of that
and the number of bits in the Bloom filter (m); a sketch follows below. For
Bloom filters with a small enough (m) we could use smaller integers and it
would not have a significant impact on the distribution of the bits.
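In code, the index generation described above is roughly this (a sketch;
floorMod is one possible answer to the signed/unsigned question discussed
next):

public class ProtoHashSketch {

    /** Bit indices for one item, given the two halves of a 128-bit hash. */
    public static int[] indices(long h1, long h2, int k, int m) {
        int[] result = new int[k];
        long hash = h1;
        for (int i = 0; i < k; i++) {
            // floorMod maps the signed long to an index in [0, m)
            result[i] = (int) Math.floorMod(hash, (long) m);
            hash += h2; // the next "function" adds the second long again
        }
        return result;
    }
}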

There is an issue of whether the long value is considered signed or
unsigned as this significantly impacts the mod(). This is why I suggested a
string like Murmur128-UC or Murmur128-SC (unsigned or signed).

If instead the ProtoBloomFilter called the hash function directly then it
would be what I called an incremented call. So calling

  murmur128( “foo”, (seed1) )
  murmur128( “foo”, (seed2) )
  murmur128( “foo”, (seed3) )

inside the ProtoBloomFilter would yield a Murmur128-UI or Murmur128-SI hash
depending on whether the values were assumed to be signed or unsigned for
the mod() call.

Anyway, we can rename the ProtoBloomFilter to something else, but it
effectively generates the required number of functions (k) for each with(x)
parameter.

So you take the BloomFilterConfiguration (Shape) and the ProtoBloomFilter
(k provider) and you can build the BloomFilter.

There are some problems here. Almost none of the nomenclature is common, so
we can define it as we go. I think there is a need for a proper hash-naming
scheme where we can determine how the hash was created from the name (as
with encryption algo names, and similar to the RandomSource code you pointed
me to), but I have found no standards here.

Claude

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [All] Using "SemVer"?

2019-10-18 Thread Claude Warren
+1 ensures interoperability for our users for minimal pain on our side.

On Fri, Oct 18, 2019 at 12:01 PM Emmanuel Bourg  wrote:

> Le 18/10/2019 à 12:46, Gilles Sadowski a écrit :
>
> > Why not state it explicitly (and make it a formal requirement for
> > a release)?
>
> -1, it restricts our freedom for no real gain, we have enough
> constraints already.
>
> Emmanuel Bourg
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: strange change to src/main/java/org/apache/bcel/generic/FieldGenOrMethodGen.java

2019-10-18 Thread Claude Warren
The change from public to protected would indicate a major version change as
it changes the API.  Though I suppose this could also be done if code were
being contributed to a project from outside.  In which case the minor
(middle) number would have to have changed.

In either case changing from a protected to a public method would be
allowed under semantic versioning as a minor (backwards-compatible) change.

As this is a backwards-compatible change, and there is a need for the
functionality, perhaps it should revert to the public form that it had in
the past.

@sebbaz can you explain why making a method public makes it more difficult
to test?  Seems counterintuitive to me.

Claude



On Fri, Oct 18, 2019 at 1:55 AM Mark Roberts 
wrote:

> So I'm forced to add pass through methods to MethodGen?  That seems a
> waste of effort and still requires testing.  I repeat - you can already
> manipulate Attributes - you should be able to manipulate Annotations in
> exactly the same fashion.  It is a missing capability that is needed - at
> least by me, and as we move forward into Java 9 and beyond I'm sure others
> are going to run into the problem.
>
> -Original Message-
> From: sebb [mailto:seb...@gmail.com]
> Sent: Thursday, October 17, 2019 3:55 PM
> To: Commons Developers List 
> Subject: Re: strange change to
> src/main/java/org/apache/bcel/generic/FieldGenOrMethodGen.java
>
> On Thu, 17 Oct 2019 at 23:16, Mark Roberts 
> wrote:
> >
> > When the BCEL package was renamed back to org.apache.bcel  in:
> >
> > commit d522432b79044740831a132d8b92e7dab5477444
> > Author: Benedikt Ritter 
> > Date:   Tue Jun 7 17:28:43 2016 +
> >
> > The methods to add and delete annotations were changed from public to
> protected with a confusing comment:
> > < public void addAnnotationEntry(AnnotationEntryGen ag)
> > ---
> > > protected void addAnnotationEntry(final AnnotationEntryGen ag) //
> TODO could this be package protected?
>
> The comment makes sense to me.
>
> The method was changed from public to protected, reducing the visibility.
> The TODO asks if it could be made package protected, i.e. further reducing
> the visibility.
>
> > I think this might have been a cut and paste error as the same comment
> was added to other methods, but they were left public (so the comment makes
> sense).
>
> It seems to me that the other methods were also probably intended to be
> reduced in visibility to protected, but this was not done.
>
> > In any case, the current situation is you can add and delete Attributes
> but not Annotations.  And, surprise, that is exactly what I need to do.
> >
> > Any reason not to change these back to public?
>
> Increasing visibility increases the difficulty of testing and increases
> the likelihood of subtle bugs.
>
> > Thanks,
> > Mark
> >
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: [collections] BloomFilter package architecture discussion

2019-10-17 Thread Claude Warren
On Wed, Oct 16, 2019 at 2:08 AM Gilles Sadowski 
wrote:

> Hi.
>
> 2019-10-15 20:05 UTC+02:00, Claude Warren :
> > On Tue, Oct 15, 2019 at 1:46 AM Gilles Sadowski 
> > wrote:
> >
> >> Hello.
> >>
> >>
> >>
> >> > Furthermore,
> >> > the other potential users and supporters have not responded to any
> >> > communication about this proposal so I am floundering on that front
> >> > too.
> >>
> >> Who are they?
> >>
> >
> > Developers I have worked with or know of that have expertise utilizing
> and
> > analyzing Bloom filters in various projects and papers. Apache developers
> > that are considering adding Bloom filters to their projects.
>
> For sure, it would clear the path (to a design) if more people
> would lay out (usage) requirements.
>
> >
> >>
> >> > What I want to do is take a step back. Let’s remove the collection
> >> packages
> >> > from consideration and work out the base package first.
> >>
> >> +1
> >>
> >> > With that in mind I
> >> > would like to present reasoning behind the classes as they now stand
> >> > and
> >> > suggest a way forward to get the core classes in before adding the
> >> > collections.
> >>
> >> For sure, I appreciate your willingness to provide more explanations.
> :-)
> >>
> >> > As discussed earlier Bloom filters are a concept. In most people's
> >> > minds
> >> > they are bit vectors with specific named functions (e.g. match => A
> and
> >> B =
> >> > A, hammingValue => cardinality(this)). This idea of the filter as a
> bit
> >> > vector led me to using a BitSet. Originally the BloomFilter was
> >> > serializable but valid arguments against that and toward making the
> >> > internal representation visible enough to serialize that led to
> >> > making
> >> > the BitSet public.
> >>
> >> Maybe I'm missing your point, but I don't see a direct connection
> >> between persistence of a "BloomFilter" instance and the visibility
> >> of "BitSet" (if that is used to represent its internal state).
> >>
> >> Realistically, when we “persist” a Bloom filter we are preserving the
> > internal state. If the Bloom filter itself is not serializable then the
> > internal state has to be preserved somehow. You made a comment on Sep 23,
> > 2019 when we were discussing the serialization on the “New Sub-project
> > Proposal.” topic in which you said “At first sight, I'd say that
> > serialization is out-of-scope (we should let application developers deal
> > with that using the available accessors).” Perhaps I have misunderstood
> but
> > when I read that a light went on. What has to happen is that enough of
> the
> > internal state has to be exposed for the developer to preserve that state
> > and be able to reinstantiate it later. As the state is really a bit
> array
> > I figured the most logical thing would be to expose the BitSet.
> > Alternatively we could expose a ByteBuffer or any of a number of
> structures
> > that can represent a bit array. However, the BitSet is, semantically
> > speaking, what we are exposing.
>
> Maybe I was not clear enough: I'm not saying that we should prefer
> some representation (of the state) over another; only that how
> the state is represented externally need not be the same as the internal
> representation.
>
> But if the state is fully encapsulated by a "BitSet", then internal and
> external can be the same, making easier both for developers and for
> users.  So let's have indeed a "getBitSet()" that returns a copy of the
> (private final) instance field (of type "BitSet") used to represent the
> state of a "BloomFilter".
>
>
I agree with most of this statement. However, I think the use of BitSet for
the internal representation is questionable. For the simple cases it works
and there is a whole class of Bloom filters where this would be acceptable.
However for filters that have some concept of a count (e.g. Counting Bloom
Filter, Stable Bloom Filters) it doesn’t make sense to force the use of the
BitSet internally. In addition, there are researchers that do not use
BitSet because it is too slow or too bloated. Examples are
https://github.com/lemire/javaewah and
https://github.com/RoaringBitmap/RoaringBitmap, which are used in Apache
projects. The dilemma here is whether to require BitSet or simply require
BitSet functionality. 

Re: [collections] BloomFilter package architecture discussion

2019-10-17 Thread Claude Warren
I have added a list of Use Cases (
https://github.com/Claudenw/commons-collections/blob/master/src/main/java/org/apache/commons/collections4/bloomfilter/Usage.md).
I expect that these will end up in documentation somewhere.

On Wed, Oct 16, 2019 at 2:08 AM Gilles Sadowski 
wrote:

> Hi.
>
> 2019-10-15 20:05 UTC+02:00, Claude Warren :
> > On Tue, Oct 15, 2019 at 1:46 AM Gilles Sadowski 
> > wrote:
> >
> >> Hello.
> >>
> >>
> >>
> >> > Furthermore,
> >> > the other potential users and supporters have not responded to any
> >> > communication about this proposal so I am floundering on that front
> >> > too.
> >>
> >> Who are they?
> >>
> >
> > Developers I have worked with or know of that have expertise utilizing
> and
> > analyzing Bloom filters in various projects and papers. Apache developers
> > that are considering adding Bloom filters to their projects.
>
> For sure, it would clear the path (to a design) if more people
> would lay out (usage) requirements.
>
> >
> >>
> >> > What I want to do is take a step back. Let’s remove the collection
> >> packages
> >> > from consideration and work out the base package first.
> >>
> >> +1
> >>
> >> > With that in mind I
> >> > would like to present reasoning behind the classes as they now stand
> >> > and
> >> > suggest a way forward to get the core classes in before adding the
> >> > collections.
> >>
> >> For sure, I appreciate your willingness to provide more explanations.
> :-)
> >>
> >> > As discussed earlier Bloom filters are a concept. In most people's
> >> > minds
> >> > they are bit vectors with specific named functions (e.g. match => A
> and
> >> B =
> >> > A, hammingValue => cardinality(this)). This idea of the filter as a
> bit
> >> > vector led me to using a BitSet. Originally the BloomFilter was
> >> > serializable but valid arguments against that and toward making the
> >> > internal representation visible enought to serialize that lead to
> >> > making
> >> > the BitSet public.
> >>
> >> Maybe I'm missing your point, but I don't see a direct connection
> >> between persistence of a "BloomFilter" instance and the visibility
> >> of "BitSet" (if that is used to represent its internal state).
> >>
> >> Realistically, when we “persist” a Bloom filter we are preserving the
> > internal state. If the Bloom filter itself is not serializable then the
> > internal state has to be preserved somehow. You made a comment on Sep 23,
> > 2019 when we were discussing the serialization on the “New Sub-project
> > Proposal.” topic in which you said “At first sight, I'd say that
> > serialization is out-of-scope (we should let application developers deal
> > with that using the available accessors).” Perhaps I have misunderstood
> but
> > when I read that a light went on. What has to happen is that enough of
> the
> > internal state has to be exposed for the developer to preserve that state
> > and be able to reinstantiate it later. As the state is really a bit
> array
> > I figured the most logical thing would be to expose the BitSet.
> > Alternatively we could expose a ByteBuffer or any of a number of
> structures
> > that can represent a bit array. However, the BitSet is, semantically
> > speaking, what we are exposing.
>
> Maybe I was not clear enough: I'm not saying that we should prefer
> some representation (of the state) over another; only that how
> the state is represented externally need not be the same as the internal
> representation.
>
> But if the state is fully encapsulated by a "BitSet", then internal and
> external can be the same, making easier both for developers and for
> users.  So let's have indeed a "getBitSet()" that returns a copy of the
> (private final) instance field (of type "BitSet") used to represent the
> state of a "BloomFilter".
>
> >> > In addition to the normal Bloom filter methods
> >> > (hammingValue, match, etc.) there are some methods (log, cosine
> >> > distance,
> >> > jaccard distance, etc.) that are used by researchers and other
> >> > attempting
> >> > to do some work with indexing and such for large collections of
> >> > filters.
> >>
> >> IIUC, you indeed distinguish between core functionality (i.e. "norm

Re: [collections] BloomFilter package architecture discussion

2019-10-15 Thread Claude Warren
On Tue, Oct 15, 2019 at 1:46 AM Gilles Sadowski 
wrote:

> Hello.
>
>
>
> > Furthermore,
> > the other potential users and supporters have not responded to any
> > communication about this proposal so I am floundering on that front too.
>
> Who are they?
>

Developers I have worked with or know of that have expertise utilizing and
analyzing Bloom filters in various projects and papers. Apache developers
that are considering adding Bloom filters to their projects.


>
> > What I want to do is take a step back. Let’s remove the collection
> packages
> > from consideration and work out the base package first.
>
> +1
>
> > With that in mind I
> > would like to present reasoning behind the classes as they now stand and
> > suggest a way forward to get the core classes in before adding the
> > collections.
>
> For sure, I appreciate your willingness to provide more explanations. :-)
>
> > As discussed earlier Bloom filters are a concept. In most people's minds
> > they are bit vectors with specific named functions (e.g. match => A and
> B =
> > A, hammingValue => cardinality(this)). This idea of the filter as a bit
> > vector led me to using a BitSet. Originally the BloomFilter was
> > serializable but valid arguments against that and toward making the
> > internal representation visible enough to serialize that led to making
> > the BitSet public.
>
> Maybe I'm missing your point, but I don't see a direct connection
> between persistence of a "BloomFilter" instance and the visibility
> of "BitSet" (if that is used to represent its internal state).
>
> Realistically, when we “persist” a Bloom filter we are preserving the
internal state. If the Bloom filter itself is not serializable then the
internal state has to be preserved somehow. You made a comment on Sep 23,
2019 when we were discussing the serialization on the “New Sub-project
Proposal.” topic in which you said “At first sight, I'd say that
serialization is out-of-scope (we should let application developers deal
with that using the available accessors).” Perhaps I have misunderstood but
when I read that a light went on. What has to happen is that enough of the
internal state has to be exposed for the developer to preserve that state
and be able to reinstantiate it later. As the state is really a bit array
I figured the most logical thing would be to expose the BitSet.
Alternatively we could expose a ByteBuffer or any of a number of structures
that can represent a bit array. However, the BitSet is, semantically
speaking, what we are exposing.
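For illustration, preserving and reinstantiating that state via an exposed
BitSet could be as simple as this (a sketch, assuming an accessor like
getBitSet() and a constructor that accepts a BitSet):

import java.util.BitSet;

public class PersistenceSketch {

    public static byte[] save(BitSet state) {
        return state.toByteArray(); // compact, stable external form
    }

    public static BitSet restore(byte[] bytes) {
        return BitSet.valueOf(bytes); // rebuilds the same enabled bits
    }
}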


> > In addition to the normal Bloom filter methods
> > (hammingValue, match, etc.) there are some methods (log, cosine distance,
> > jaccard distance, etc.) that are used by researchers and others attempting
> > to do some work with indexing and such for large collections of filters.
>
> IIUC, you indeed distinguish between core functionality (i.e. "normal
> Bloom filter methods") and additional support for (?)...
> If so, "core" should indeed be implemented first, with maximum
> encapsulation (even if it would seem to prevent "research" usage).
>
> Indeed I distinguish. I distinguish to align various functionality with
external documentation. For example, in Bloom's paper he does not talk about
Hamming values, a term and concept that is found in every Bloom filter
implementation I have seen and which is frequent in the literature. So a
“Strict” implementation of a Bloom filter might only have a single method
“match()” and still be considered an implementation of a Bloom filter, but
it would not be of much use in modern systems. Realistically, there are a
number of methods that are commonly found in Bloom filter implementations.
I believe that at a minimum they are “match()”, “merge()” and
“hammingValue()”.
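As a sketch, that minimal contract is just the following (the match
semantics shown are one common reading of "A and B = A"):

public interface BloomFilter {

    /** match: true when (this AND other) == this, i.e. this may be in other. */
    boolean match(BloomFilter other);

    /** merge: enable every bit that is enabled in other. */
    void merge(BloomFilter other);

    /** hammingValue: the count of enabled bits (cardinality). */
    int hammingValue();
}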



> > Finally there is the question of “normal” usage. In his paper on
> > “Compressed Bloom Filters”[1] (an implementation we do not have yet)
> > Michael Mitzenmacher states “the Bloom filter plays a dual role. It is
> both
> > a data structure being used at the proxies, and a message being passed
> > between them”. He is discussing the issue of using bloom filters in a
> > distributed cache where the proxies distributed their bloom filter so
> that
> > all the proxies know which proxy might have the document. So the Bloom
> > filter needs both an internal representation and a mechanism to pass the
> > data between systems that use it. It seems obvious that the BitSet works
> as
> > the mechanism to pass the data between systems, though an immutable
> version
> > would be preferred.
>
> So it seems we can have a Bloom filter state represented by a
> "private" and "final" instance of "BitSet".  Is there any reason that
> the implementation in [Collections] would ever need to use several
> representations of its state?
> If not, I'd think that "BloomFilter" can be a concrete class (with, for
> now, a "BitSet" as its state).
>
> The counter examples that come to mind is not within a single Bloom filter
type. A 

Re: [ALL] Update to commons security page

2019-10-15 Thread Claude Warren
If the style is to rely on external code to do input validation, then I
think that should be in the javadocs as well as on the page you mention.

Claude

On Tue, Oct 15, 2019 at 10:59 AM sebb  wrote:

> It might be useful to add a note to the commons security page about
> automated vulnerability checkers.
>
> These tend to produce a lot of false positives and may report items
> which could never be a security issue (e.g. poor code style, dead
> code).
>
> Even if the issue is potentially a vulnerability, it often depends on
> the context.
> This is particularly true of Commons - the code generally relies on
> the application to do validation of input parameters.
>
> Thoughts?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


[collections] BloomFilter package architecture discussion

2019-10-14 Thread Claude Warren
Greetings,


I feel like I have been beating my head against a wall trying to get this
contribution accepted; this is not a complaint about process or
personalities, just a statement of how I feel. I realize that I have made
progress but I also realize that there is a long way to go. Furthermore,
the other potential users and supporters have not responded to any
communication about this proposal so I am floundering on that front too.
What I want to do is take a step back. Let’s remove the collection packages
from consideration and work out the base package first. With that in mind I
would like to present reasoning behind the classes as they now stand and
suggest a way forward to get the core classes in before adding the
collections.


As discussed earlier Bloom filters are a concept. In most people's minds
they are bit vectors with specific named functions (e.g. match => A and B =
A, hammingValue => cardinality(this)). This idea of the filter as a bit
vector led me to using a BitSet. Originally the BloomFilter was
serializable, but valid arguments against that, and toward making the
internal representation visible enough to serialize, led to making
the BitSet public.
(hammingValue, match, etc.) there are some methods (log, cosine distance,
jaccard distance, etc.) that are used by researchers and others attempting
to do some work with indexing and such for large collections of filters.


Finally there is the question of “normal” usage. In his paper on
“Compressed Bloom Filters”[1] (an implementation we do not have yet)
Michael Mitzenmacher states “the Bloom filter plays a dual role. It is both
a data structure being used at the proxies, and a message being passed
between them”. He is discussing the issue of using bloom filters in a
distributed cache where the proxies distribute their bloom filter so that
all the proxies know which proxy might have the document. So the Bloom
filter needs both an internal representation and a mechanism to pass the
data between systems that use it. It seems obvious that the BitSet works as
the mechanism to pass the data between systems, though an immutable version
would be preferred.


So the Bloom filter needs an external reference implementation of its
state, so that it can communicate said state with other Bloom filters. The
question of what is necessary for transmission of that state is still open.
I think we can agree a mechanism to identify which bits are on is required
as part of the state. I have seen discussions that indicate the hashing
algorithm and other configuration information should also be included, but
I believe this is dependent upon where the filter is being sent. All Bloom
filters within a system, by necessity, have to use the same hash and
configuration parameters. Only when sending outside the system does the
other data need to be made available. In my mind this means those data
structures have to be available to be transmitted along with the Bloom
filter or filters.


So now to the code in the current contribution. The argument that
BloomFilter should be a class not an interface is a good one. However, as
the internal representation of the bit structure can vary wildly, I think
it should be an abstract class. My reasoning is as follows: Assuming we
have an external reference implementation of state, many/all of the methods
could be implemented against that definition. This is what the current
implementation attempts to do, but not successfully. Any implementation of
the abstract class could develop an internal representation of state that
is more efficient for whatever the use case is (e.g. counting Bloom
filters, compressed Bloom filters, attenuated Bloom filters, etc) and
override the methods that are more efficiently answered within its internal
structure. In addition, the abstract Bloom filter might implement the “data
read” methods from BitSet directly. This would allow more efficient
implementation of several methods that currently depend upon the creation
of a BitSet and the manipulation of same.


I think this change yields the following:


1. Creation of an external reference implementation of state.

2. Conversion of BloomFilter from interface to AbstractBloomFilter class

3. Merge BloomFilterFunctions into AbstractBloomFilter

4. Rename StandardBloomFilter to BloomFilter (should we do this?)

5. Modify (Standard)BloomFilter and CountingBloomFilter to extend the
AbstractBloomFilter class.


On the topic of an external reference implementation of state, could this
be an interface that AbstractBloomFilter implements? Might it comprise the
“read data” methods from the BitSet class?


[1] https://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


  1   2   >