[jira] [Comment Edited] (FLINK-7465) Add build-in BloomFilterCount on TableAPI

2017-08-28 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137425#comment-16137425
 ] 

Fabian Hueske edited comment on FLINK-7465 at 8/28/17 10:20 AM:


I'm sorry, I confused count-min sketches (for approximate group counts) and 
HyperLogLog (for approximate distinct counts). 

I assume the goal of the BloomFilterCount function is to (approximately) count 
the number of distinct values. In contrast to HyperLogLog, Bloom filters are 
not specifically designed for approximate distinct counting but for approximate 
membership testing. AFAIK, bloom filters should be more precise for log 
distinct cardinalities but HyperLogLog should provide much better results for 
larger cardinalities.

IMO, [~jark]'s idea to split the bitmask into multiple long values is pretty 
nice. OTOH, multiple RocksDB point lookups might also be more expensive than a 
single lookup with larger serialization payload (the deserialization logic for 
byte arrays shouldn't be very costy).


was (Author: fhueske):
I'm sorry, I confused count-min sketches (for approximate group counts) and 
HyperLogLog (for approximate distinct counts). 

I assume the goal of the BloomFilterCount function is to (approximately) count 
the number of distinct values. In contrast to HyperLogLog, Bloom filters are 
not specifically designed for approximate distinct counting but for approximate 
membership testing. AFAIK, bloom filters should be more precise for log 
distinct cardinalities but HyperLogLog should provide much better results for 
larger cardinalities.

IMO, [~jark]'s idea to split the bitmask into multiple long values is pretty 
nice. OTOH, point multiple RocksDB lookups might also be more expensive than a 
single lookup with larger serialization payload (the deserialization logic for 
byte arrays shouldn't be very costy).

> Add build-in BloomFilterCount on TableAPI
> -
>
> Key: FLINK-7465
> URL: https://issues.apache.org/jira/browse/FLINK-7465
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: sunjincheng
>Assignee: sunjincheng
> Attachments: bloomfilter.png
>
>
> In this JIRA. use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also 
> be k different hash functions defined, each of which maps or hashes some set 
> element to one of the m array positions, generating a uniform random 
> distribution. Typically, k is a constant, much smaller than m, which is 
> proportional to the number of elements to be added; the precise choice of k 
> and the constant of proportionality of m are determined by the intended false 
> positive rate of the filter.
> To add an element, feed it to each of the k hash functions to get k array 
> positions. Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of 
> the k hash functions to get k array positions. If any of the bits at these 
> positions is 0, the element is definitely not in the set – if it were, then 
> all the bits would have been set to 1 when it was inserted. If all are 1, 
> then either the element is in the set, or the bits have by chance been set to 
> 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored 
> arrows show the positions in the bit array that each set element is mapped 
> to. The element w is not in the set {x, y, z}, because it hashes to one 
> bit-array position containing 0. For this figure, m = 18 and k = 3. The 
> sketch as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. 
> https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr] I appreciated if you can give me some advice. :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (FLINK-7465) Add build-in BloomFilterCount on TableAPI

2017-08-28 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137425#comment-16137425
 ] 

Fabian Hueske edited comment on FLINK-7465 at 8/28/17 10:20 AM:


I'm sorry, I confused count-min sketches (for approximate group counts) and 
HyperLogLog (for approximate distinct counts). 

I assume the goal of the BloomFilterCount function is to (approximately) count 
the number of distinct values. In contrast to HyperLogLog, Bloom filters are 
not specifically designed for approximate distinct counting but for approximate 
membership testing. AFAIK, bloom filters should be more precise for log 
distinct cardinalities but HyperLogLog should provide much better results for 
larger cardinalities.

IMO, [~jark]'s idea to split the bitmask into multiple long values is pretty 
nice. OTOH, multiple RocksDB point lookups might also be more expensive than a 
single lookup with larger serialization payload (the deserialization logic for 
byte arrays shouldn't be very costly).


was (Author: fhueske):
I'm sorry, I confused count-min sketches (for approximate group counts) and 
HyperLogLog (for approximate distinct counts). 

I assume the goal of the BloomFilterCount function is to (approximately) count 
the number of distinct values. In contrast to HyperLogLog, Bloom filters are 
not specifically designed for approximate distinct counting but for approximate 
membership testing. AFAIK, bloom filters should be more precise for log 
distinct cardinalities but HyperLogLog should provide much better results for 
larger cardinalities.

IMO, [~jark]'s idea to split the bitmask into multiple long values is pretty 
nice. OTOH, multiple RocksDB point lookups might also be more expensive than a 
single lookup with larger serialization payload (the deserialization logic for 
byte arrays shouldn't be very costy).

> Add build-in BloomFilterCount on TableAPI
> -
>
> Key: FLINK-7465
> URL: https://issues.apache.org/jira/browse/FLINK-7465
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: sunjincheng
>Assignee: sunjincheng
> Attachments: bloomfilter.png
>
>
> In this JIRA. use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also 
> be k different hash functions defined, each of which maps or hashes some set 
> element to one of the m array positions, generating a uniform random 
> distribution. Typically, k is a constant, much smaller than m, which is 
> proportional to the number of elements to be added; the precise choice of k 
> and the constant of proportionality of m are determined by the intended false 
> positive rate of the filter.
> To add an element, feed it to each of the k hash functions to get k array 
> positions. Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of 
> the k hash functions to get k array positions. If any of the bits at these 
> positions is 0, the element is definitely not in the set – if it were, then 
> all the bits would have been set to 1 when it was inserted. If all are 1, 
> then either the element is in the set, or the bits have by chance been set to 
> 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored 
> arrows show the positions in the bit array that each set element is mapped 
> to. The element w is not in the set {x, y, z}, because it hashes to one 
> bit-array position containing 0. For this figure, m = 18 and k = 3. The 
> sketch as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. 
> https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr] I appreciated if you can give me some advice. :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (FLINK-7465) Add build-in BloomFilterCount on TableAPI

2017-08-24 Thread sunjincheng (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139810#comment-16139810
 ] 

sunjincheng edited comment on FLINK-7465 at 8/24/17 9:33 AM:
-

Hi [~jparkie] [~fhueske] [~jark], Thanks for your comments. 

* HyperLogLog/HyperLogLog++ used for cardinality statistics;
* CountMinSketch used for frequency statistics;
* BloomFilter used for membership judgment;

[~jparkie] [~fhueske] you are right, the {{HyperLogLog/HyperLogLog++}} is the 
best solution for this approximately distinct count. And 
[HyperLogLog|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf] can 
count A Billion distinct objects using only using 1.5kB (kilobyte) of 
storage(typical accuracy of 2%). So I want add approximately distinct counting 
by [HyperLogLog|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf] 
[HyperLogLog 
Plus|https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/40671.pdf]

But here a question that how can we process the retract record? What do you 
think? [~jparkie][~fhueske] [~jark]

Thanks, jincheng
 



was (Author: sunjincheng121):
Hi [~jparkie] [~fhueske] [~jark], Thanks for your comments. 

* HyperLogLog/HyperLogLog++ used for cardinality statistics;
* CountMinSketch used for frequency statistics;
* BloomFilter used for membership judgment;

[~jparkie] [~fhueske] you are right, the {{HyperLogLog/HyperLogLog++}} is the 
best solution for this approximately distinct count. And 
[HyperLogLog|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf] can 
count A Billion distinct objects using only using 1.5kB (kilobyte) of 
storage(typical accuracy of 2%). So I want add approximately distinct counting 
by 
[HyperLogLog|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]/[HyperLogLog++|https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/40671.pdf]

But here a question that how can we process the retract record? What do you 
think? [~jparkie][~fhueske] [~jark]

Thanks, jincheng
 


> Add build-in BloomFilterCount on TableAPI
> -
>
> Key: FLINK-7465
> URL: https://issues.apache.org/jira/browse/FLINK-7465
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: sunjincheng
>Assignee: sunjincheng
> Attachments: bloomfilter.png
>
>
> In this JIRA. use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also 
> be k different hash functions defined, each of which maps or hashes some set 
> element to one of the m array positions, generating a uniform random 
> distribution. Typically, k is a constant, much smaller than m, which is 
> proportional to the number of elements to be added; the precise choice of k 
> and the constant of proportionality of m are determined by the intended false 
> positive rate of the filter.
> To add an element, feed it to each of the k hash functions to get k array 
> positions. Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of 
> the k hash functions to get k array positions. If any of the bits at these 
> positions is 0, the element is definitely not in the set – if it were, then 
> all the bits would have been set to 1 when it was inserted. If all are 1, 
> then either the element is in the set, or the bits have by chance been set to 
> 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored 
> arrows show the positions in the bit array that each set element is mapped 
> to. The element w is not in the set {x, y, z}, because it hashes to one 
> bit-array position containing 0. For this figure, m = 18 and k = 3. The 
> sketch as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. 
> https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr] I appreciated if you can give me some advice. :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (FLINK-7465) Add build-in BloomFilterCount on TableAPI

2017-08-24 Thread sunjincheng (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139810#comment-16139810
 ] 

sunjincheng edited comment on FLINK-7465 at 8/24/17 9:32 AM:
-

Hi [~jparkie] [~fhueske] [~jark], Thanks for your comments. 

* HyperLogLog/HyperLogLog++ used for cardinality statistics;
* CountMinSketch used for frequency statistics;
* BloomFilter used for membership judgment;

[~jparkie] [~fhueske] you are right, the {{HyperLogLog/HyperLogLog++}} is the 
best solution for this approximately distinct count. And 
[HyperLogLog|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf] can 
count A Billion distinct objects using only using 1.5kB (kilobyte) of 
storage(typical accuracy of 2%). So I want add approximately distinct counting 
by 
[HyperLogLog|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]/[HyperLogLog++|https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/40671.pdf]

But here a question that how can we process the retract record? What do you 
think? [~jparkie][~fhueske] [~jark]

Thanks, jincheng
 



was (Author: sunjincheng121):
Hi [~jparkie] [~fhueske] [~jark], Thanks for your comments. 

* HyperLogLog/HyperLogLog++ used for cardinality statistics;
* CountMinSketch used for frequency statistics;
* BloomFilter used for membership judgment;

[~jparkie] [~fhueske] you are right, the {{HyperLogLog/HyperLogLog++}} is the 
best solution for this approximately distinct count. And 
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) can 
count A Billion distinct objects using only using 1.5kB (kilobyte) of 
storage(typical accuracy of 2%). So I want add approximately distinct counting 
by 
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)/[HyperLogLog++](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/40671.pdf)
 

But here a question that how can we process the retract record? What do you 
think? [~jparkie][~fhueske] [~jark]

Thanks, jincheng
 


> Add build-in BloomFilterCount on TableAPI
> -
>
> Key: FLINK-7465
> URL: https://issues.apache.org/jira/browse/FLINK-7465
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: sunjincheng
>Assignee: sunjincheng
> Attachments: bloomfilter.png
>
>
> In this JIRA. use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also 
> be k different hash functions defined, each of which maps or hashes some set 
> element to one of the m array positions, generating a uniform random 
> distribution. Typically, k is a constant, much smaller than m, which is 
> proportional to the number of elements to be added; the precise choice of k 
> and the constant of proportionality of m are determined by the intended false 
> positive rate of the filter.
> To add an element, feed it to each of the k hash functions to get k array 
> positions. Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of 
> the k hash functions to get k array positions. If any of the bits at these 
> positions is 0, the element is definitely not in the set – if it were, then 
> all the bits would have been set to 1 when it was inserted. If all are 1, 
> then either the element is in the set, or the bits have by chance been set to 
> 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored 
> arrows show the positions in the bit array that each set element is mapped 
> to. The element w is not in the set {x, y, z}, because it hashes to one 
> bit-array position containing 0. For this figure, m = 18 and k = 3. The 
> sketch as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. 
> https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hive/common/util/BloomFilter.java
> Hi [~fhueske] [~twalthr] I appreciated if you can give me some advice. :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (FLINK-7465) Add build-in BloomFilterCount on TableAPI

2017-08-22 Thread sunjincheng (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136062#comment-16136062
 ] 

sunjincheng edited comment on FLINK-7465 at 8/22/17 9:01 AM:
-

[~fhueske] I want add accuracy and maxElement as function parameter,the 
function signature looks like: 
{code}
count-bf(accuracy:Double, maxKeyCount, col:Any)
{code}
That mean we configure the accuracy when the function is used. Is this make 
sense for you? [~fhueske]

I think {{count-min}}  is very useful in some certain cases. so does the 
{{HyperLogLog}} (cardinality counting). After we complete the this JIRA. we can 
discuss these implementations.

[~jark] The de/serialize of bitArray if very important in the implementation. I 
think the best way is do the de/serialization at check point or in 
{{open/close}} method, but currently we can not access the {{RuntimeContext}} 
from {{FunctionContext}},we need do some change. OR using DataView.  Currently 
In my mind we have some choices as follows:
* de/serialization bitArray every call the {{accumulate}}(bitArray as member of 
ACC)
* de/serialization bitArray in check point.( bitArray as member of AGG)
* de/serialization bitArray in {{open/close}} .( bitArray as member of AGG)

What do you think? [~jark] [~fhueske]




was (Author: sunjincheng121):
[~fhueske] I want add accuracy and maxElement as function parameter,the 
function signature looks like: 
{code}
count-bf(accuracy:Double, maxKeyCount, col:Any)
{code}
 
And we will use the following formula to calculate the bitarray size(bsize):
{code}
(-maxKeyCount * Math.log(accuracy) / (Math.log(2) * Math.log(2)))
{code}
And we will use the following formula to calculate the cont of hash function:
{code}
Math.max(1, Math.round(bsize.asInstanceOf[Double] / maxKeyCount * Math.log(2)))
{code}
The formula same as the reference of the JIRA. description. 
That mean we configure the accuracy when the function is used. Is this make 
sense for you? [~fhueske]

I think {{count-min}}  is very useful in some certain cases. so does the 
{{HyperLogLog}} (cardinality counting). After we complete the this JIRA. we can 
discuss these implementations.

[~jark] The de/serialize of bitArray if very important in the implementation. I 
think the best way is do the de/serialization at check point or in 
{{open/close}} method, but currently we can not access the {{RuntimeContext}} 
from {{FunctionContext}},we need do some change. OR using DataView.  Currently 
In my mind we have some choices as follows:
* de/serialization bitArray every call the {{accumulate}}(bitArray as member of 
ACC)
* de/serialization bitArray in check point.( bitArray as member of AGG)
* de/serialization bitArray in {{open/close}} .( bitArray as member of AGG)

What do you think? [~jark] [~fhueske]



> Add build-in BloomFilterCount on TableAPI
> -
>
> Key: FLINK-7465
> URL: https://issues.apache.org/jira/browse/FLINK-7465
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: sunjincheng
>Assignee: sunjincheng
> Attachments: bloomfilter.png
>
>
> In this JIRA. use BloomFilter to implement counting functions.
> BloomFilter Algorithm description:
> An empty Bloom filter is a bit array of m bits, all set to 0. There must also 
> be k different hash functions defined, each of which maps or hashes some set 
> element to one of the m array positions, generating a uniform random 
> distribution. Typically, k is a constant, much smaller than m, which is 
> proportional to the number of elements to be added; the precise choice of k 
> and the constant of proportionality of m are determined by the intended false 
> positive rate of the filter.
> To add an element, feed it to each of the k hash functions to get k array 
> positions. Set the bits at all these positions to 1.
> To query for an element (test whether it is in the set), feed it to each of 
> the k hash functions to get k array positions. If any of the bits at these 
> positions is 0, the element is definitely not in the set – if it were, then 
> all the bits would have been set to 1 when it was inserted. If all are 1, 
> then either the element is in the set, or the bits have by chance been set to 
> 1 during the insertion of other elements, resulting in a false positive.
> An example of a Bloom filter, representing the set {x, y, z}. The colored 
> arrows show the positions in the bit array that each set element is mapped 
> to. The element w is not in the set {x, y, z}, because it hashes to one 
> bit-array position containing 0. For this figure, m = 18 and k = 3. The 
> sketch as follows:
> !bloomfilter.png!
> Reference:
> 1. https://en.wikipedia.org/wiki/Bloom_filter
> 2. 
>