Hi,

We are trying to port some code from Mahout Logistic Regression to MLlib Logistic Regression, and our preliminary tests indicate a performance bottleneck. It is not clear to me which of three factors is responsible:
  o Comparing apples to oranges
  o Inadequate tuning
  o Insufficient parallelism

The test results and the code that produced them are below. I am hoping that someone can shed some light on the performance problem we are having.

Thanks much,
-Raj

P.S. Apologies if this is a duplicate posting. I got a response to a previous posting that suggested that it may not have registered correctly.

----- Mahout LR vs. MLlib LR -----

Data     Cluster      MLlib                 Mahout
size     type         Train  Test  Rate     Train  Test  Rate
-------  ----------   -----  ----  ----     -----  ----  ----
100      local[*]     .03    .1    54       1.1    11    100
100      Cluster[6]   .036   .09   59       1      9     100
500,000  local[*]     32     9     83       326    1086  82
500,000  Cluster[6]   8      4     83       310    877   81

Train and Test are rates in records/millisecond.
Rate is the % of the test set that was labeled correctly.
The 100-record dataset is sample_libsvm_data.txt.
The cluster was a set of 6 worker machines on AWS.
The latest versions of MLlib (1.6) and Mahout (0.9) were used in the tests.
--------------------------------------------

MllMahout.scala <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26346/MllMahout.scala>