Re: Reading Hive RCFiles?

2018-01-20 Thread Jörn Franke
Forgot to add the mailing list > On 18. Jan 2018, at 18:55, Jörn Franke wrote: > > Well you can use: >

Re: Reading Hive RCFiles?

2018-01-20 Thread Prakash Joshi
If it's simply reading the files from the source in HDFS, then we have the option of sc.hadoopFile in the Spark API. Not sure if Spark SQL provides a direct method to read. On Jan 18, 2018 9:32 PM, "Michael Segel" wrote: > No idea on how that last line of garbage got in the

Re: Saving each line of RDD as a separate file with key as the file name

2018-01-20 Thread Jörn Franke
Not sure if I understood exactly what you need, but you could have one partition by line. Alternatively you could use the MultipleOutput format in Hadoop. > On 20. Jan 2018, at 22:56, pooja bhojwani wrote: > > Hi all, > > So, I have a Java Pair RDD with let’s say n
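A minimal sketch of the "one output file per key" idea, written in plain Python so it runs without a cluster. The helper name `write_record`, the `.txt` suffix, and the JSON serialization of the map value are all assumptions for illustration; none of them come from the thread, and this is not the Hadoop multiple-output route mentioned above.

```python
import json
import os
import tempfile
from typing import Any, Dict, Tuple


def write_record(record: Tuple[str, Dict[str, Any]], out_dir: str) -> str:
    """Write one (key, map) pair to its own text file named after the key.

    Hypothetical helper: assumes keys are unique and are safe file names
    (no path separators). Returns the path that was written.
    """
    key, value = record
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{key}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        # sort_keys gives a stable layout regardless of map insertion order
        handle.write(json.dumps(value, sort_keys=True))
    return path


# Small local demo on a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_record(("user42", {"b": 2, "a": 1}), demo_dir)

# With PySpark (an assumption, not shown in the thread), something like
#   pair_rdd.foreach(lambda kv: write_record(kv, out_dir))
# would run this function on the executors, so the files would land on
# each executor's local disk rather than in HDFS.
```

The point of the sketch is that the per-record write happens inside a function applied to each element, instead of calling an RDD action such as saveAsTextFile from within another RDD operation.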

Saving each line of RDD as a separate file with key as the file name

2018-01-20 Thread pooja bhojwani
Hi all, So, I have a Java Pair RDD with let’s say n lines, each line has a unique key and a hash map as the value (there are no duplicate keys). I want to save each line as a separate text file and since saveAsTextFile is not serializable, I need to somehow split the RDD into n RDDs or so and

Re: Spark MLLib vs. SciKitLearn

2018-01-20 Thread Aakash Basu
Any help on the below? On 19-Jan-2018 7:12 PM, "Aakash Basu" wrote: > Hi all, > > I am totally new to ML APIs. Trying to get the *ROC_Curve* for Model > Evaluation on both *ScikitLearn* and *PySpark MLLib*. I do not find any > API for ROC_Curve calculation for
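For reference, scikit-learn exposes this calculation as `sklearn.metrics.roc_curve`. The sketch below is a library-independent illustration of what such a function computes, so the same numbers can be checked against either toolkit. The function names `roc_points` and `auc` are hypothetical, and the sample labels and scores are made up for the example.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points,
    one per distinct score threshold, starting at (0.0, 0.0)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Sort by descending score: lowering the threshold admits examples in order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a group of tied scores.
        last_of_group = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_group:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up example data: label 1 is the positive class.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_curve = roc_points(example_labels, example_scores)
example_auc = auc(example_curve)
```

Plotting `example_curve` with the false positive rate on the x-axis gives the ROC curve; the same (label, score) pairs can be fed to whichever library is under comparison.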

Re: external shuffle service in mesos

2018-01-20 Thread Susan X. Huynh
Hi Igor, The best way I know of is with Marathon. * Placement constraint: you could combine constraints in Marathon. Like: "constraints": [ ["hostname", "UNIQUE"], ["hostname", "LIKE", "host1|host2|host3"] ] https://groups.google.com/forum/#!topic/marathon-framework/hfLUw3TIw2I *
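The placement constraints quoted above can be sketched as a Marathon app definition built in Python. This is only an illustration under stated assumptions: the app id, the `cmd` string, and the helper name `shuffle_service_app` are hypothetical placeholders, and only the two `constraints` entries come from the message.

```python
import json
from typing import Any, Dict, List


def shuffle_service_app(hosts: List[str], cmd: str,
                        app_id: str = "/spark-shuffle-service") -> Dict[str, Any]:
    """Build a Marathon app definition that runs one instance per listed host.

    `app_id` and `cmd` are placeholders; replace them with whatever actually
    starts the shuffle service in your environment.
    """
    return {
        "id": app_id,
        "cmd": cmd,
        # One instance for each host named in the LIKE pattern.
        "instances": len(hosts),
        "constraints": [
            ["hostname", "UNIQUE"],                 # at most one task per host
            ["hostname", "LIKE", "|".join(hosts)],  # only on these hosts
        ],
    }


# Hypothetical command and host names, for illustration only.
app = shuffle_service_app(["host1", "host2", "host3"],
                          cmd="./start-shuffle-service.sh")
app_json = json.dumps(app, indent=2)
```

Combining UNIQUE with a LIKE pattern and an instance count equal to the number of hosts is what places exactly one task on each named host.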

external shuffle service in mesos

2018-01-20 Thread igor.berman
Hi, I wanted to get some advice regarding managing the external shuffle service in Mesos environments. In the Spark documentation Marathon is mentioned, however there is very limited documentation. I've tried to search for some documentation and it seems not too difficult to configure it under