Re: Mllib Logistic Regression performance relative to Mahout
Hi,

If your features are numeric, try feature scaling (e.g. standardizing each feature to zero mean and unit variance) before feeding the data to Spark's Logistic Regression; it may improve the prediction accuracy.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Logistic-Regression-performance-relative-to-Mahout-tp26346p26358.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
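The idea behind feature scaling can be sketched in plain Python (in MLlib itself you could use its `StandardScaler`; the helper name below is illustrative):

```python
# Minimal sketch of standardization: rescale each feature column
# to zero mean and unit variance, so no single feature dominates
# the gradient updates in logistic regression.
def scale_features(rows):
    """Standardize each column of a list of feature rows: (x - mean) / std."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    # Population std dev; fall back to 1.0 for constant columns.
    stds = [(sum((r[d] - means[d]) ** 2 for r in rows) / n) ** 0.5 or 1.0
            for d in range(dims)]
    return [[(r[d] - means[d]) / stds[d] for d in range(dims)] for r in rows]
```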
Re: Spark Integration Patterns
Hi,

To connect to Spark from a remote location and submit jobs, you can try Spark Job Server; it has been open sourced now.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354p26357.html
Re: How to unpersist a DStream in Spark Streaming
Hi,

DStreams (Discretized Streams) are made up of multiple RDDs. You can unpersist each underlying RDD by accessing it individually: dstream.foreachRDD { rdd => rdd.unpersist() }

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-unpersist-a-DStream-in-Spark-Streaming-tp25281p25284.html
Re: SparkR vs R
Hi,

1. The main difference between SparkR and R is that SparkR can handle big data. Yes, you can use other core R libraries inside SparkR, but not whole-dataset algorithms like lm(), glm(), or kmeans().
2. Yes, core R libraries will not be distributed. You can use functions from these libraries that fit a mapper-style pattern, i.e. functions that can be applied to each line individually.
3. SparkR is a wrapper around underlying Scala code, whereas R is not. R gives you complete flexibility to do any machine learning you want, while SparkR is still at a developing stage.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-vs-R-tp24772p24778.html
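The "mapper-style" distinction in point 2 can be illustrated in plain Python (the names here are illustrative, not a SparkR API): a per-record function gives the same answer whether the data is in one list or split across partitions, while a whole-dataset fit like lm() needs every row at once and does not decompose this way.

```python
# A per-record function: each record is transformed independently,
# so partitions can be processed on separate workers and concatenated.
def parse_record(line):
    return line.strip().upper()

partitions = [["a\n", "b\n"], ["c\n"]]              # data split across "workers"
distributed = [parse_record(x) for part in partitions for x in part]
local = [parse_record(x) for x in ["a\n", "b\n", "c\n"]]
# distributed == local: per-record work is partition-safe. A single
# regression fit over all rows (lm()) has no such per-row decomposition,
# which is why core R algorithms are not distributed by SparkR.
```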
Re: Partitions on RDDs
Hi,

In the first RDD transformation (e.g. reading from a file: sc.textFile("path", partitions)), the partition count you specify carries through to all further transformations and actions on that RDD. In a few places, repartitioning your RDD will give an added advantage; repartitioning is usually done just before the action stage.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-on-RDDs-tp24775p24779.html
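How records end up in partitions can be sketched in plain Python; this is a simplified stand-in for Spark's hash partitioner, not Spark code:

```python
# Simplified hash partitioning: each (key, value) record is routed to
# a partition by key hash, so downstream tasks split the work evenly.
def partition_records(records, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts
```

Increasing `num_partitions` (as `repartition(n)` does) creates more, smaller tasks, which is what helps when a few large partitions would otherwise straggle.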
Re: Slow Performance with Apache Spark Gradient Boosted Tree training runs
Hi vkutsenko,

Can you just give partitions to the input labeled RDD, like: data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD().repartition(5); Here, I used 5 since you have 5 cores. Also, for further benchmarking and performance tuning:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Slow-Performance-with-Apache-Spark-Gradient-Boosted-Tree-training-runs-tp24758p24764.html
Re: MLlib inconsistent documentation
Hi,

I guess the double values are the number of visits rather than a visit flag (and a count should be more useful than a 1/0 flag). This is based on the assumption that, when doing matrix factorization with implicit feedback, the training ratings should not be binary, as binary values give poor feature values and, in turn, poor predictions.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-inconsistent-documentation-tp24742p24767.html
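One way to see why counts beat a 1/0 flag: implicit-feedback ALS (following Hu, Koren & Volinsky, the formulation MLlib's implicit mode is based on) turns the raw value r into a confidence weight c = 1 + alpha * r. A plain-Python sketch (the alpha value here is illustrative):

```python
# Implicit-feedback confidence weighting: c = 1 + alpha * r.
# With a binary flag every observed interaction gets identical weight,
# so the model cannot tell a one-time visitor from a regular one;
# with visit counts, frequent visits carry much stronger confidence.
def confidence(r, alpha=40.0):
    return 1.0 + alpha * r

flags  = [confidence(1), confidence(1)]    # binary: both interactions equal
counts = [confidence(1), confidence(25)]   # counts: 25 visits dominate
```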