[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2016-10-18 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587387#comment-15587387
 ] 

Matthew Hayes commented on DATAFU-16:
-------------------------------------

I don't think the exponential jump version got added.

> weighted reservoir sampling with exponential jumps UDF
> ------------------------------------------------------
>
> Key: DATAFU-16
> URL: https://issues.apache.org/jira/browse/DATAFU-16
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac, Linux, pig-0.11
> Reporter: jian wang
> Assignee: jian wang
> Priority: Minor
> Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, 
> WeightedSamplingCorrectnessTests.java
>
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted 
> reservoir sampling algorithm with exponential jumps. The investigation is 
> tracked in https://github.com/linkedin/datafu/issues/80. This task is part 
> of an experiment with different weighted sampling algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2016-10-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586085#comment-15586085
 ] 

Eyal Allweil commented on DATAFU-16:
------------------------------------

It looks like this got added - can this issue be closed?



[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2014-02-11 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897828#comment-13897828
 ] 

jian wang commented on DATAFU-16:
---------------------------------

I have updated WeightedSamplingCorrectnessTests.java; it now includes a 
simulated perf test. The output of the test follows.

   [testng] *** Running reservoirExpJPerfTest ***
   [testng] Output:
   [testng] accumulateDuration  accumulateExpJDuration
   [testng] 8563                1563

accumulateDuration: test duration for weighted sampling without exponential 
jumps, in accumulate mode
accumulateExpJDuration: test duration for weighted sampling with exponential 
jumps

Durations are in milliseconds.







[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2014-02-10 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897085#comment-13897085
 ] 

Matthew Hayes commented on DATAFU-16:
-------------------------------------

I think an exponential jump version of the accumulator-based reservoir sample 
UDF could make sense.  It seems like this could help with performance in some 
cases, especially when producing a large sample.  Have you run any performance 
tests to compare the two accumulator-based implementations to see under what 
circumstances it helps and by how much?
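
For readers following the thread, here is a minimal Python sketch of weighted 
reservoir sampling with exponential jumps in a single accumulate-style pass. 
It is not the attached ScoredExpJmpReservoir.java; the function names and the 
(item, weight) input shape are assumptions made for illustration. The point 
it shows is that, once the reservoir is full, skipped items only subtract 
their weight from a jump counter and cost no random draw.

```python
import heapq
import math
import random


def _draw_jump(rng, min_key):
    """Total weight to skip before the next item enters the reservoir."""
    if min_key <= 0.0:
        return 0.0       # degenerate key: the very next item replaces it
    if min_key >= 1.0:
        return math.inf  # a key of 1.0 can never be beaten
    # 1.0 - random() lies in (0, 1], so log() never sees zero.
    return math.log(1.0 - rng.random()) / math.log(min_key)


def weighted_reservoir_sample_expj(items, k, rng=None):
    """Sample up to k items from (item, weight) pairs in one pass."""
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap = []    # min-heap of (key, arrival index, item)
    jump = None  # set once the reservoir is full
    for i, (item, w) in enumerate(items):
        if w <= 0:
            continue  # non-positive weights are never sampled
        if len(heap) < k:
            # Fill phase: every item gets a key u ** (1 / w).
            heapq.heappush(heap, (rng.random() ** (1.0 / w), i, item))
            if len(heap) == k:
                jump = _draw_jump(rng, heap[0][0])
            continue
        jump -= w  # skipped items cost no random draw
        if jump <= 0:
            # Draw the key conditioned on beating the current minimum.
            threshold = heap[0][0] ** w
            key = rng.uniform(threshold, 1.0) ** (1.0 / w)
            heapq.heapreplace(heap, (key, i, item))
            jump = _draw_jump(rng, heap[0][0])
    return [item for _, _, item in heap]
```

Calling weighted_reservoir_sample_expj(stream, 100) on a long stream draws 
random numbers only during the fill phase and at each replacement, so any 
saving depends on how often items actually enter the reservoir.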



[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2014-02-08 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895791#comment-13895791
 ] 

jian wang commented on DATAFU-16:
---------------------------------

Matt, do you think we should go ahead and implement the exponential jump only 
for the accumulate-based model? For algebraic, we would still use weighted 
reservoir sampling without exponential jumps. 

The upside of introducing the exp jump: it could improve job performance, 
especially when there is a lot of data to process, without sacrificing much 
sampling precision (the per-item sampling probability stays close to 
w/sum(w)). 

The downside: the accumulate-based model is probably used less often than the 
algebraic one, so is it worthwhile to introduce this enhancement?
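
As a reference point for the comparison above, this is a minimal Python 
sketch of weighted reservoir sampling without exponential jumps. It is not 
the attached ScoredReservoir.java; the function name and the (item, weight) 
input shape are assumptions. Every item draws one random number and gets the 
key u ** (1 / w); the k largest keys form the sample. For a sample of size 
one, each item is picked with probability w/sum(w).

```python
import heapq
import random


def weighted_reservoir_sample(items, k, rng=None):
    """Keep the k items with the largest keys u ** (1 / w)."""
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap = []  # min-heap of (key, arrival index, item)
    for i, (item, w) in enumerate(items):
        if w <= 0:
            continue  # non-positive weights are never sampled
        key = rng.random() ** (1.0 / w)  # one random draw per item
        if len(heap) < k:
            heapq.heappush(heap, (key, i, item))
        elif key > heap[0][0]:
            # The new key beats the smallest key in the reservoir.
            heapq.heapreplace(heap, (key, i, item))
    return [item for _, _, item in heap]
```

The one-draw-per-item cost in the loop is what the exponential jump variant 
tries to avoid.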



[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2014-01-27 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883575#comment-13883575
 ] 

Matthew Hayes commented on DATAFU-16:
-------------------------------------

Thanks for running the experiment, Jian!  I expected there might be an issue 
with the "weighted reservoir sampling exponential jump algebraic" case.  I 
think that the exponential jump method only works on an accumulate-based 
model.  For algebraic, the usage of a combiner probably breaks the 
assumptions behind this approach.
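
To illustrate the combiner point, here is a minimal Python sketch (not the 
DataFu code; all function names are assumptions) of why the variant without 
jumps fits an algebraic model: each key depends only on its own item, so the 
top k keys of the merged partial reservoirs equal the top k keys of a single 
pass. The exponential jump variant instead derives its skip distance from 
the minimum key of the reservoir it is currently filling, which is state 
local to one partial pass. Whether that explains the higher algebraic error 
is not established in this thread.

```python
import heapq
import random


def assign_keys(items, rng):
    """Give each (item, weight) pair an independent key u ** (1 / w)."""
    return [(rng.random() ** (1.0 / w), item) for item, w in items if w > 0]


def top_k(keyed, k):
    """Partial reservoir: the k (key, item) pairs with the largest keys."""
    return heapq.nlargest(k, keyed, key=lambda pair: pair[0])


def merge_reservoirs(partials, k):
    """Combiner step: pool the partial reservoirs and keep the top k keys."""
    pooled = [pair for partial in partials for pair in partial]
    return top_k(pooled, k)
```

For example, splitting the keyed items into four partitions, taking 
top_k(partition, k) on each and then merge_reservoirs(partials, k) returns 
the same pairs as top_k over all keyed items.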



[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2014-01-25 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881922#comment-13881922
 ] 

jian wang commented on DATAFU-16:
---------------------------------

I ran an experiment to test how accurately each algorithm reproduces the 
per-item sampling probability, using the same methodology described in the 
original issue: https://github.com/linkedin/datafu/issues/80. 

Using (weight / sum(weight)) as the ground truth for each item's sampling 
probability, I calculated the average squared error of each algorithm's 
per-item sampling probability. 
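
The error metric described above can be sketched in Python as follows. This 
is not the attached test code: the function names are assumptions, and 
Python's random.choices stands in for the sampler under test. The sketch 
also assumes one item is drawn per trial, the case in which 
weight / sum(weight) is the exact selection probability; the thread does not 
state the sample size used in the experiment.

```python
import random


def sampling_mse(weights, counts):
    """Average squared error of observed frequencies vs. w / sum(w)."""
    trials = sum(counts)
    total = sum(weights)
    errors = [(c / trials - w / total) ** 2 for w, c in zip(weights, counts)]
    return sum(errors) / len(errors)


def count_single_draws(weights, trials, rng):
    """Draw one item per trial (stand-in sampler) and count the picks."""
    counts = [0] * len(weights)
    indexes = range(len(weights))
    for picked in rng.choices(indexes, weights=weights, k=trials):
        counts[picked] += 1
    return counts
```

Swapping count_single_draws for counts produced by another sampler gives a 
like-for-like error figure, provided the number of trials is the same.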

Using exponential jumps in weighted reservoir sampling seems OK in accumulate 
mode, but it is not clear whether it is OK in algebraic mode, since that case 
has a higher error than the other algorithms. [I am verifying the test code 
to see whether something is wrong with it.]

testAccumulateExpJ() simulates Accumulate() on a data stream.

testAlgebraicExpJ() simulates Initial/Intermediate/Final using 100 combiners, 
where each Initial processes only one sample, which is what happens in the 
majority of real-world cases.

Experiment results

err_ws:       1.174525314652248E-5
err_acc:      1.1883407123610779E-5
err_alg:      1.2130630748818072E-5
err_skip_acc: 1.2081897301243E-5
err_skip_alg: 1.3854125917604345E-4

err_ws is for the weighted sampling UDF
err_acc is for weighted reservoir sampling, accumulate
err_alg is for weighted reservoir sampling, algebraic
err_skip_acc is for weighted reservoir sampling with exponential jumps, 
accumulate
err_skip_alg is for weighted reservoir sampling with exponential jumps, 
algebraic


Please see the attached test code.





[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2014-01-22 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878635#comment-13878635
 ] 

jian wang commented on DATAFU-16:
---------------------------------

Following Matt's feedback on review request 
https://reviews.apache.org/r/17058/, I need to rethink how we implement 
reservoir sampling with exponential jumps. I will run an offline simulation 
by this weekend.
