[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-10-13 Thread Marko Asplund (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955444#comment-14955444
 ] 

Marko Asplund commented on SPARK-10791:
---

Please see the spark-user mailing list archive (Sep 2015 / thread view / page 8), 
thread title "How to speed up MLlib LDA?":

https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3CCANoUZR-xcmvj%3DYgUc1JEHu54vWfyP0n-%3DHfz2dxiWFRuk8zRpQ%40mail.gmail.com%3E







[jira] [Commented] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-09-25 Thread Marko Asplund (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907975#comment-14907975
 ] 

Marko Asplund commented on SPARK-10791:
---

This performance issue was actually discussed on the Spark user mailing list.
Please see the full discussion here: 
https://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/browser

My tests were performed on a single node.







[jira] [Created] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

2015-09-24 Thread Marko Asplund (JIRA)
Marko Asplund created SPARK-10791:
-

 Summary: Optimize MLlib LDA topic distribution query performance
 Key: SPARK-10791
 URL: https://issues.apache.org/jira/browse/SPARK-10791
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
 Environment: Ubuntu 13.10, Oracle Java 8
Reporter: Marko Asplund


I've been testing MLlib LDA training with 100 topics, 105 K vocabulary size and 
~3.4 M documents using EMLDAOptimizer.
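
For reference, the training setup is essentially the following. This is only a 
minimal sketch: the corpus RDD and the names trainAndSaveLda, corpus and 
modelPath are simplified placeholders rather than the exact code linked below.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: document id -> sparse term-count vector (~105 K terms, ~3.4 M documents)
def trainAndSaveLda(ctx: SparkContext, corpus: RDD[(Long, Vector)], modelPath: String): Unit = {
  val lda = new LDA()
    .setK(100)                        // 100 topics
    .setOptimizer(new EMLDAOptimizer) // EM-based training, as in the tests above
  val model = lda.run(corpus).asInstanceOf[DistributedLDAModel]
  model.save(ctx, modelPath)          // loading this persisted model back takes ~2 minutes
}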

Training the model took ~2.5 hours with MLlib, whereas Vowpal Wabbit training 
on the same data set and on the same system took ~5 minutes. Loading the 
persisted model from disk (~2 minutes) and querying LDA model topic 
distributions (~4 seconds for one document) are also quite slow operations.

Our application queries the LDA model topic distribution (for one document at a 
time) as part of the end-user operation execution flow, so a ~4 second execution 
time is very problematic.

The log includes the following message, which, AFAIK, should mean that 
netlib-java is using a machine-optimised native implementation: 
"com.github.fommil.jni.JniLoader - successfully loaded 
/tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"

My test code can be found here:
https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57

I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable change 
in training performance. Model loading time was reduced from ~2 minutes to 
~5 seconds (the model is now persisted as a LocalLDAModel). However, 
query/prediction time was unchanged, and unfortunately this is the critical 
performance characteristic in our case.

I did some profiling of my LDA prototype code that requests topic distributions 
from a model. According to Java Mission Control, more than 80 % of the execution 
time during the sampling interval is spent in the following methods:

- org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
- org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
- org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; 6.98%
- java.lang.Double.valueOf(double); count: 31; 4.33%

Is there any way of using the API more optimally?
Are there any opportunities for optimising the "topicDistributions" code
path in MLlib?

My query test code looks essentially like this:

// executed once
val model = LocalLDAModel.load(ctx, ModelFileName)

// executed four times
val samples = Transformers.toSparseVectors(vocabularySize, ctx.parallelize(Seq(input))) // fast
model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this seems to take about 4 seconds to execute
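
For comparison, here is a sketch of the same query path with several documents 
batched into a single topicDistributions call, so that the per-job overhead is 
amortised over the whole batch. topicDistributionsForBatch is just an 
illustrative wrapper name; Transformers, ctx, vocabularySize and model are the 
same values used above, and whether batching actually helps is an assumption to 
be verified.

import org.apache.spark.mllib.linalg.Vector

// Hypothetical wrapper: run one topicDistributions job per batch of documents
// instead of one job per document.
def topicDistributionsForBatch(inputs: Seq[String]): Map[Long, Vector] = {
  val docs = Transformers.toSparseVectors(vocabularySize, ctx.parallelize(inputs)) // same helper as above
  model.topicDistributions(docs.zipWithIndex.map(_.swap)).collect().toMap // keys are batch-local indices
}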







[jira] [Commented] (SPARK-10557) Publish Spark 1.5.0 on Maven central

2015-09-12 Thread Marko Asplund (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742209#comment-14742209
 ] 

Marko Asplund commented on SPARK-10557:
---

thanks! (y)







[jira] [Commented] (SPARK-10557) Publish Spark 1.5.0 on Maven central

2015-09-11 Thread Marko Asplund (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740518#comment-14740518
 ] 

Marko Asplund commented on SPARK-10557:
---

The artifacts seem to have been created on 2015-09-01, so it's strange that they 
don't show up in search even 10 days after creation.

I understand that Maven central is not controlled by the Spark development 
team, but I think it's still a problem for Spark users if the artifacts aren't 
searchable.
Perhaps someone from the Spark team could contact Maven Central about this?







[jira] [Created] (SPARK-10557) Publish Spark 1.5.0 on Maven central

2015-09-10 Thread Marko Asplund (JIRA)
Marko Asplund created SPARK-10557:
-

 Summary: Publish Spark 1.5.0 on Maven central
 Key: SPARK-10557
 URL: https://issues.apache.org/jira/browse/SPARK-10557
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 1.5.0
Reporter: Marko Asplund


Spark v1.5.0 has been officially released, but it has not been published on Maven 
central.
https://spark.apache.org/releases/spark-release-1-5-0.html

Also, in Jira, 1.5.0 is listed as an "unreleased" version.
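
For context, these are the standard sbt coordinates users would expect to 
resolve from Maven central once the 1.5.0 artifacts are visible there (an 
illustrative build.sbt fragment, not part of the issue itself):

// build.sbt: Spark 1.5.0 artifact coordinates on Maven central
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.5.0",
  "org.apache.spark" %% "spark-mllib" % "1.5.0"
)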



