Re: Mllib Logistic Regression performance relative to Mahout

2016-02-28 Thread Yashwanth Kumar
Hi,
If your features are numeric, try feature scaling before feeding them to
Spark Logistic Regression. It might improve the accuracy rate.
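For illustration, here is a minimal Scala sketch of that approach using
MLlib's StandardScaler (the input path is hypothetical, and an existing
SparkContext sc is assumed):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Hypothetical path; replace with your own data set.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

// Fit a scaler for unit variance (centering is left off so that
// sparse vectors stay sparse).
val scaler = new StandardScaler(withMean = false, withStd = true)
  .fit(data.map(_.features))

// Scale every point, keeping the labels unchanged.
val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
scaled.cache()

// Train logistic regression on the scaled features.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(scaled)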



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Logistic-Regression-performance-relative-to-Mahout-tp26346p26358.html



Re: Spark Integration Patterns

2016-02-28 Thread Yashwanth Kumar
Hi,
To connect to Spark from a remote location and submit jobs, you can try
Spark Job Server (https://github.com/spark-jobserver/spark-jobserver). It
has been open sourced now, and it exposes a REST API for uploading jars
and submitting jobs to a shared SparkContext.
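As a rough sketch, a job for Spark Job Server implements its SparkJob
trait (the object name and config key below are hypothetical):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

// A minimal word-count job that the job server can run remotely.
object WordCountJob extends SparkJob {

  // Reject the request early if the expected input is missing.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("missing input.string")

  // The job body; the return value is serialized back to the caller.
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq)
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap
}

You package this in a jar, upload it to the server's /jars endpoint, and
then POST to /jobs to run it against the long-lived SparkContext.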



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354p26357.html



Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread Yashwanth Kumar
Hi,

A DStream (Discretized Stream) is made up of multiple RDDs.
You can unpersist each RDD by accessing the individual RDDs with foreachRDD:

dstream.foreachRDD { rdd =>
  rdd.unpersist()
}
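Note that Spark Streaming also unpersists old batch RDDs automatically
(governed by spark.streaming.unpersist, which defaults to true), so the
explicit call only matters when you want blocks released sooner. A
self-contained sketch (hypothetical host and port):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("UnpersistDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical socket source; any DStream works the same way.
val lines = ssc.socketTextStream("localhost", 9999)
lines.persist()  // cache each batch RDD

lines.foreachRDD { rdd =>
  println(rdd.count())  // use the cached RDD
  rdd.unpersist()       // then release its blocks explicitly
}

ssc.start()
ssc.awaitTermination()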



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-unpersist-a-DStream-in-Spark-Streaming-tp25281p25284.html



Re: SparkR vs R

2015-09-22 Thread Yashwanth Kumar
Hi,

1. The main difference between SparkR and R is that SparkR can handle
big data.

Yes, you can use other core libraries inside SparkR (but not algorithms
like lm(), glm(), kmeans(), which run on a single machine).

2. Yes, core R libraries will not be distributed. You can use functions
from these libraries that fit a mapper kind of pattern, i.e. functions
which can be applied to each line individually.

3. SparkR is a wrapper around underlying Scala code, whereas R is not.
R gives you complete flexibility to do any machine learning you want, while
SparkR is still at an early stage of development.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-vs-R-tp24772p24778.html



Re: Partitions on RDDs

2015-09-22 Thread Yashwanth Kumar
Hi,
In the first RDD transformation (e.g. reading from a file with
sc.textFile("path", minPartitions)), the partition count you specify
carries through to all further transformations and actions derived from
that RDD.

In a few places, repartitioning your RDD will give an added advantage.
Repartitioning is usually done just before an expensive stage or action.
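A small Scala sketch of both points (path and counts are hypothetical,
and an existing SparkContext sc is assumed):

// Ask for at least 8 partitions when reading the file.
val lines = sc.textFile("hdfs:///data/input.txt", 8)
println(lines.partitions.length)    // 8 (or more, depending on input splits)

// Narrow transformations inherit the same partitioning.
val words = lines.flatMap(_.split(" "))
println(words.partitions.length)    // same as lines

// Repartition when the inherited split is a poor fit for the next stage.
val balanced = words.repartition(16)
println(balanced.partitions.length) // 16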



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-on-RDDs-tp24775p24779.html



Re: Slow Performance with Apache Spark Gradient Boosted Tree training runs

2015-09-22 Thread Yashwanth Kumar
Hi vkutsenko,

Can you repartition the input LabeledPoint RDD, like:

  data = MLUtils.loadLibSVMFile(jsc.sc(),
      "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD().repartition(5);

Here I used 5, since you have 5 cores.
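In Scala, the same repartition-then-train flow looks roughly like this
(the path is the one from your mail; the tree settings are placeholders,
and an existing SparkContext sc is assumed):

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

// Spread the training data across the available cores before training.
val data = MLUtils
  .loadLibSVMFile(sc, "s3://somebucket/somekey/plaintext_libsvm_file")
  .repartition(5)
  .cache()

// Default boosted-tree settings; tune these for your data.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 10
boostingStrategy.treeStrategy.maxDepth = 5

val model = GradientBoostedTrees.train(data, boostingStrategy)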

Also, for further benchmarking and performance tuning, see:

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Slow-Performance-with-Apache-Spark-Gradient-Boosted-Tree-training-runs-tp24758p24764.html



Re: MLlib inconsistent documentation

2015-09-22 Thread Yashwanth Kumar
Hi,


I guess the double values are the number of visits rather than a visit
flag (a count is obviously more useful than a 1/0 flag).

This is based on the assumption that when doing matrix factorisation,
ratings trained with implicit feedback should not be binary, as binary
values give poor latent feature values and, in turn, poor predictions.
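For context, that corresponds to MLlib's implicit-feedback ALS. A hedged
sketch (hypothetical visit counts, existing SparkContext sc assumed):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical (user, item, visit-count) triples.
val visits = sc.parallelize(Seq(
  Rating(1, 10, 3.0),  // user 1 visited item 10 three times
  Rating(1, 11, 1.0),
  Rating(2, 10, 5.0)
))

// trainImplicit treats the value as a confidence weight rather than a
// literal rating, so visit counts carry more signal than 1/0 flags.
// Arguments: ratings, rank, iterations, lambda, alpha.
val model = ALS.trainImplicit(visits, 10, 10, 0.01, 1.0)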




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-inconsistent-documentation-tp24742p24767.html