[jira] [Comment Edited] (IGNITE-10144) Optimize bagging upstream transformer
[ https://issues.apache.org/jira/browse/IGNITE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703391#comment-16703391 ]

Artem Malykh edited comment on IGNITE-10144 at 11/29/18 3:58 PM:
-----------------------------------------------------------------

According to a series of quick experiments, the approach used now (tested on "count")
{code:java}
upstream.sequential().flatMap(en -> Stream.generate(() -> en).limit(poisson.sample())).count()
{code}
outperforms the following approaches:

1. Using the hash code for determinism instead of "sequential" and then using parallel (this is a bad approach because setSeed is not thread-safe)
{code:java}
PoissonWithUnderlyingRandomAccess p = new PoissonWithUnderlyingRandomAccess(
    new Well19937c(123L),
    0.3,
    PoissonDistribution.DEFAULT_EPSILON,
    PoissonDistribution.DEFAULT_MAX_ITERATIONS);

upstream
    .flatMap(en -> {
        p.getRand().setSeed(1234L + en.hashCode());
        return Stream.generate(() -> en).limit(p.sample());
    }).parallel().count();
{code}
2. Zipping the upstream with a stream of indexes and then using the index for determinism.

was (Author: amalykh):
According to a series of quick experiments, the approach used now (tested on "count")
{code:java}
upstream.sequential().flatMap(en -> Stream.generate(() -> en).limit(poisson.sample())).count()
{code}
outperforms the following approaches:

1. Using the hash code for determinism instead of "sequential" and then using parallel
{code:java}
PoissonWithUnderlyingRandomAccess p = new PoissonWithUnderlyingRandomAccess(
    new Well19937c(123L),
    0.3,
    PoissonDistribution.DEFAULT_EPSILON,
    PoissonDistribution.DEFAULT_MAX_ITERATIONS);

upstream
    .flatMap(en -> {
        p.getRand().setSeed(1234L + en.hashCode());
        return Stream.generate(() -> en).limit(p.sample());
    }).parallel().count();
{code}
2. Zipping the upstream with a stream of indexes and then using the index for determinism.
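
For illustration, the sequential approach from the comment above can be sketched with the JDK alone. This is a minimal sketch, not the Ignite implementation: the class name SequentialBaggingSketch and the helper samplePoisson are hypothetical, SplittableRandom stands in for Well19937c, and Knuth's multiplication method stands in for Commons Math's PoissonDistribution. It shows why forcing the stream to be sequential keeps the result reproducible: a single generator is consumed in encounter order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SplittableRandom;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch class; not part of Ignite.
final class SequentialBaggingSketch {
    // Knuth's multiplication method for Poisson sampling (suitable for small means).
    // Assumption: used here in place of Commons Math's PoissonDistribution.
    static int samplePoisson(SplittableRandom rnd, double mean) {
        double limit = Math.exp(-mean);
        double product = rnd.nextDouble();
        int count = 0;
        while (product > limit) {
            count++;
            product *= rnd.nextDouble();
        }
        return count;
    }

    // Repeats each entry a Poisson-distributed number of times.
    // One shared generator is safe only because the pipeline is forced sequential.
    static <T> List<T> resample(Stream<T> upstream, long seed, double mean) {
        SplittableRandom rnd = new SplittableRandom(seed);
        return upstream.sequential()
            .flatMap(en -> Stream.generate(() -> en).limit(samplePoisson(rnd, mean)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d", "e");
        // The same seed yields the same sample, even if the source stream was parallel.
        System.out.println(resample(data.parallelStream(), 123L, 0.3));
        System.out.println(resample(data.stream(), 123L, 0.3));
    }
}
```

The shared generator is the reason this pipeline cannot simply be switched to parallel: the order in which threads draw from it would no longer be fixed.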
> Optimize bagging upstream transformer
> -------------------------------------
>
>                 Key: IGNITE-10144
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10144
>             Project: Ignite
>          Issue Type: Improvement
>          Components: ml
>            Reporter: Artem Malykh
>            Assignee: Artem Malykh
>            Priority: Minor
>             Fix For: 2.8
>
>
> For now, BaggingUpstreamTransformer makes the upstream sequential to keep the
> transformation deterministic. Maybe we should do it another way, for example
> by using a mapping of the form (entryIdx, en) -> Stream.generate(() -> en).limit(new
> PoissonDistribution(new Well19937c(entryIdx + seed), ...).sample())

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
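
The index-based mapping proposed in the issue description could look roughly like the following. This is a hedged sketch under stated assumptions, not the Ignite code: IndexSeededBaggingSketch and samplePoisson are hypothetical names, the upstream is assumed to be a List so that entries can be addressed by index, SplittableRandom replaces Well19937c, and Knuth's multiplication method replaces PoissonDistribution so the example needs only the JDK. Each entry gets its own generator seeded with seed + entryIdx, so no mutable generator state is shared between threads and the pipeline can run in parallel while staying deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SplittableRandom;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical sketch class; not part of Ignite.
final class IndexSeededBaggingSketch {
    // Knuth's multiplication method for Poisson sampling (suitable for small means).
    // Assumption: used here in place of Commons Math's PoissonDistribution.
    static int samplePoisson(SplittableRandom rnd, double mean) {
        double limit = Math.exp(-mean);
        double product = rnd.nextDouble();
        int count = 0;
        while (product > limit) {
            count++;
            product *= rnd.nextDouble();
        }
        return count;
    }

    // Repeats entry i a Poisson-distributed number of times, drawn from a
    // generator seeded with (seed + i). No generator is shared across entries.
    static <T> List<T> resample(List<T> upstream, long seed, double mean) {
        return IntStream.range(0, upstream.size())
            .parallel()
            .boxed()
            .flatMap(idx -> {
                SplittableRandom rnd = new SplittableRandom(seed + idx);
                T en = upstream.get(idx);
                return Stream.generate(() -> en).limit(samplePoisson(rnd, mean));
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d", "e");
        // Two parallel runs with the same seed produce the same list.
        System.out.println(resample(data, 123L, 0.3));
        System.out.println(resample(data, 123L, 0.3));
    }
}
```

Constructing a generator per entry has a cost of its own, which may be one reason such variants did not outperform the sequential version in the experiments reported in the comment.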
[jira] [Comment Edited] (IGNITE-10144) Optimize bagging upstream transformer
[ https://issues.apache.org/jira/browse/IGNITE-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703391#comment-16703391 ]

Artem Malykh edited comment on IGNITE-10144 at 11/29/18 3:56 PM:
-----------------------------------------------------------------

According to a series of quick experiments, the approach used now (tested on "count")
{code:java}
upstream.sequential().flatMap(en -> Stream.generate(() -> en).limit(poisson.sample())).count()
{code}
outperforms the following approaches:

1. Using the hash code for determinism instead of "sequential" and then using parallel
{code:java}
PoissonWithUnderlyingRandomAccess p = new PoissonWithUnderlyingRandomAccess(
    new Well19937c(123L),
    0.3,
    PoissonDistribution.DEFAULT_EPSILON,
    PoissonDistribution.DEFAULT_MAX_ITERATIONS);

upstream
    .flatMap(en -> {
        p.getRand().setSeed(1234L + en.hashCode());
        return Stream.generate(() -> en).limit(p.sample());
    }).parallel().count();
{code}
2. Zipping the upstream with a stream of indexes and then using the index for determinism.

was (Author: amalykh):
According to a series of quick experiments, the approach used now
{code:java}
upstream.sequential().flatMap(en -> Stream.generate(() -> en).limit(poisson.sample()))
{code}
outperforms the following approaches:

1. Using the hash code for determinism instead of "sequential" and then using parallel
{code:java}
PoissonWithUnderlyingRandomAccess p = new PoissonWithUnderlyingRandomAccess(
    new Well19937c(123L),
    0.3,
    PoissonDistribution.DEFAULT_EPSILON,
    PoissonDistribution.DEFAULT_MAX_ITERATIONS);

upstream
    .flatMap(en -> {
        p.getRand().setSeed(1234L + en.hashCode());
        return Stream.generate(() -> en).limit(p.sample());
    }).parallel().count();
{code}
2. Zipping the upstream with a stream of indexes and then using the index for determinism.