[jira] [Comment Edited] (IGNITE-10144) Optimize bagging upstream transformer

2018-11-29 Thread Artem Malykh (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703391#comment-16703391
 ] 

Artem Malykh edited comment on IGNITE-10144 at 11/29/18 3:58 PM:
-

According to a series of quick experiments, the approach currently in use (tested with "count")
{code:java}
upstream.sequential()
    .flatMap(en -> Stream.generate(() -> en).limit(poisson.sample()))
    .count()
{code}
 

outperforms the following approaches:

1. Using the entry's hash code for determinism instead of "sequential" and then 
switching to parallel (this is a bad approach because setSeed is not thread-safe)

 
{code:java}
PoissonWithUnderlyingRandomAccess p = new PoissonWithUnderlyingRandomAccess(
    new Well19937c(123L),
    0.3,
    PoissonDistribution.DEFAULT_EPSILON,
    PoissonDistribution.DEFAULT_MAX_ITERATIONS
);

upstream
    .flatMap(en -> {
        // Reseed the single shared generator from the entry's hash code.
        p.getRand().setSeed(1234L + en.hashCode());
        return Stream.generate(() -> en).limit(p.sample());
    })
    .parallel()
    .count();
{code}
2. Zipping the upstream with a stream of indexes and then using the index for determinism.
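
The thread-safety problem in approach 1 comes from reseeding one shared generator inside a parallel pipeline. A minimal sketch of one way around it is to create a fresh generator per entry, so nothing is shared between threads. Everything here is an assumption made for illustration: the class name `HashSeededBagging` and the helper `samplePoisson` are hypothetical, `java.util.Random` stands in for `Well19937c`, and Knuth's multiplication method stands in for commons-math `PoissonDistribution`. This is not the code that was benchmarked above.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class HashSeededBagging {
    /** Knuth's multiplication method; adequate for small means such as 0.3. */
    static int samplePoisson(Random rnd, double mean) {
        double limit = Math.exp(-mean);
        double product = rnd.nextDouble();
        int k = 0;
        while (product > limit) {
            product *= rnd.nextDouble();
            k++;
        }
        return k;
    }

    /** Baseline: one shared generator, so the stream is forced to be sequential. */
    static <T> long sequentialCount(Stream<T> upstream, long seed, double mean) {
        Random shared = new Random(seed);
        return upstream.sequential()
            .flatMap(en -> Stream.generate(() -> en).limit(samplePoisson(shared, mean)))
            .count();
    }

    /** Parallel variant: a generator per entry, seeded from the entry's hash code. */
    static <T> long parallelCount(Stream<T> upstream, long seed, double mean) {
        return upstream.parallel()
            .flatMap(en -> {
                // Local to this entry, so no generator is reseeded concurrently.
                Random local = new Random(seed + en.hashCode());
                return Stream.generate(() -> en).limit(samplePoisson(local, mean));
            })
            .count();
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.range(0, 10_000).boxed().collect(Collectors.toList());
        long first = parallelCount(data.stream(), 1234L, 0.3);
        long second = parallelCount(data.stream(), 1234L, 0.3);
        System.out.println("same count on both runs: " + (first == second));
    }
}
```

The result is reproducible only if `hashCode()` is stable across runs, which holds for types such as `Integer` and `String` but not for the default `Object.hashCode()`. Allocating a generator per entry also has a cost, and `java.util.Random` instances created from nearby seeds may produce correlated first draws, so a production version would likely need a better seed mixer.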

 


> Optimize bagging upstream transformer
> -
>
> Key: IGNITE-10144
> URL: https://issues.apache.org/jira/browse/IGNITE-10144
> Project: Ignite
> Issue Type: Improvement
> Components: ml
> Reporter: Artem Malykh
> Assignee: Artem Malykh
> Priority: Minor
> Fix For: 2.8
>
>
> For now, BaggingUpstreamTransformer makes the upstream sequential to make the 
> transformation deterministic. Maybe we should do it another way, for example 
> by using a mapping of the form (entryIdx, en) -> Stream.generate(() -> 
> en).limit(new PoissonDistribution(new Well19937c(entryIdx + seed), 
> ...).sample())
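
The index-based mapping proposed in the description could be sketched roughly as follows. This is an illustration under stated assumptions, not the transformer's implementation: the input is assumed to be a random-access `List` so that an index stream can replace an explicit zip, the class name `IndexSeededBagging` and the helper `samplePoisson` are hypothetical, and `java.util.Random` with Knuth's method stands in for `Well19937c` and `PoissonDistribution`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class IndexSeededBagging {
    /** Knuth's multiplication method; adequate for small means. */
    static int samplePoisson(Random rnd, double mean) {
        double limit = Math.exp(-mean);
        double product = rnd.nextDouble();
        int k = 0;
        while (product > limit) {
            product *= rnd.nextDouble();
            k++;
        }
        return k;
    }

    /** Repeats each entry a Poisson-distributed number of times, seeded by its index. */
    static <T> List<T> bag(List<T> entries, long seed, double mean, boolean parallel) {
        IntStream indexes = IntStream.range(0, entries.size());
        if (parallel)
            indexes = indexes.parallel();

        return indexes.boxed()
            .flatMap(idx -> {
                T en = entries.get(idx);
                // The repeat count depends only on (seed, idx), not on thread scheduling.
                Random local = new Random(seed + idx);
                return Stream.generate(() -> en).limit(samplePoisson(local, mean));
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
        List<String> seq = bag(entries, 42L, 0.3, false);
        List<String> par = bag(entries, 42L, 0.3, true);
        System.out.println("parallel matches sequential: " + seq.equals(par));
    }
}
```

Because each repeat count is a function of the seed and the index alone, the sequential and parallel runs produce the same list. Whether this is faster than the sequential version is a separate question that only measurement can answer.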



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-10144) Optimize bagging upstream transformer

2018-11-29 Thread Artem Malykh (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703391#comment-16703391
 ] 

Artem Malykh edited comment on IGNITE-10144 at 11/29/18 3:56 PM:
-

According to a series of quick experiments, the approach currently in use (tested with "count")
{code:java}
upstream.sequential()
    .flatMap(en -> Stream.generate(() -> en).limit(poisson.sample()))
    .count()
{code}
 

outperforms the following approaches:

1. Using the entry's hash code for determinism instead of "sequential" and then 
switching to parallel

 
{code:java}
PoissonWithUnderlyingRandomAccess p = new PoissonWithUnderlyingRandomAccess(
    new Well19937c(123L),
    0.3,
    PoissonDistribution.DEFAULT_EPSILON,
    PoissonDistribution.DEFAULT_MAX_ITERATIONS
);

upstream
    .flatMap(en -> {
        // Reseed the single shared generator from the entry's hash code.
        p.getRand().setSeed(1234L + en.hashCode());
        return Stream.generate(() -> en).limit(p.sample());
    })
    .parallel()
    .count();
{code}
2. Zipping the upstream with a stream of indexes and then using the index for determinism.

 





