Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Xiaoye Sun
Hi Jeff and Prabhu, thanks for your help. I looked deeper into the nodemanager log and found an error message like this: 2016-03-02 03:13:59,692 ERROR org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: error opening leveldb file file:/data/yarn/cache/yarn/nm-local-dir/registere
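For context, that error comes from the external shuffle service that dynamic allocation depends on. A minimal configuration sketch, assuming a standard YARN setup (values are illustrative, not taken from the thread):

  // Dynamic allocation on YARN needs the external shuffle service, which
  // keeps executor registrations in a leveldb file under the NodeManager
  // local dirs (the file the error above refers to).
  val conf = new org.apache.spark.SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
  // On the YARN side, yarn.nodemanager.aux-services must include spark_shuffle
  // and the Spark YARN shuffle jar must be on the NodeManager classpath.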

Re: select count(*) return wrong row counts

2016-03-02 Thread Mich Talebzadeh
This works fine: scala> sql("use oraclehadoop") res1: org.apache.spark.sql.DataFrame = [result: string] scala> sql("select count(1) from sales").show +---+ |_c0| +---+ |4991761| +---+ You can do "select count(*) from tablename" as it is not dynamic SQL. Does it actually work? Sin
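A short sketch of the same check in spark-shell (Spark 1.x style, where sql comes from the pre-built sqlContext; the database and table names are the ones quoted above):

  sql("use oraclehadoop")
  sql("select count(*) from sales").show()   // count(*) and count(1) should give the same row count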

Sorting the RDD

2016-03-02 Thread Angel Angel
Hello Sir/Madam, I am trying to sort an RDD using the *sortByKey* function but I am getting the following error. My code: 1) convert the RDD array into key-value pairs; 2) after that, sort by key, but I get the error *No implicit Ordering defined for any * [image: Inline image 1] thanks
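A common cause is that the key type of the pair RDD has no Ordering (for example an Array, or a key inferred as Any). A minimal sketch with made-up data showing the two usual fixes:

  // Works out of the box: Int (and String, Long, ...) already have an Ordering
  val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
  pairs.sortByKey().collect()

  // For a custom key type, bring an Ordering into scope explicitly
  case class MyKey(id: Int)
  implicit val myKeyOrdering: Ordering[MyKey] = Ordering.by(_.id)
  sc.parallelize(Seq((MyKey(2), "b"), (MyKey(1), "a"))).sortByKey().collect()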

RE: Converting array to DF

2016-03-02 Thread Mao, Wei
A “Seq” will be implicitly converted to “DataFrameHolder”, and the “toDF” method is defined on “DataFrameHolder”. There is no such implicit conversion for Array, so the user has to convert explicitly. implicit def localSeqToDataFrameHolder[A <: Product : TypeTag](data: Seq[A]): DataFrameHolder = { Da
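A small sketch of the explicit conversion (Spark 1.x sqlContext implicits assumed; Person is a made-up case class):

  import sqlContext.implicits._
  case class Person(name: String, age: Int)
  val arr = Array(Person("a", 1), Person("b", 2))
  val df1 = arr.toSeq.toDF()             // goes through localSeqToDataFrameHolder
  val df2 = sc.parallelize(arr).toDF()   // or convert via an RDD instead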

Spark Mllib kmeans execution

2016-03-02 Thread Priya Ch
Hi All, I am running the k-means clustering algorithm. Now, when I am running the algorithm as - val conf = new SparkConf val sc = new SparkContext(conf) . . val kmeans = new KMeans() val model = kmeans.run(RDD[Vector]) . . . The 'kmeans' object gets created on the driver. Now does *kmeans.run()* get e
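A minimal sketch of the same pattern with the MLlib API (file path and parameters are hypothetical): the KMeans object lives on the driver, but run() schedules distributed jobs over the RDD's partitions on the executors, and only the fitted model comes back to the driver.

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val points = sc.textFile("hdfs:///path/to/points.txt")                  // assumed path
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()

  val kmeans = new KMeans().setK(3).setMaxIterations(20)
  val model = kmeans.run(points)   // distributed computation happens here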

Re: Spark Mllib kmeans execution

2016-03-02 Thread Sonal Goyal
It will run distributed On Mar 2, 2016 3:00 PM, "Priya Ch" wrote: > Hi All, > > I am running k-means clustering algorithm. Now, when I am running the > algorithm as - > > val conf = new SparkConf > val sc = new SparkContext(conf) > . > . > val kmeans = new KMeans() > val model = kmeans.run(RDD[

rdd cache name

2016-03-02 Thread charles li
Hi there, I feel a little confused about *cache* in Spark. First, is there any way to *customize the cached RDD name*? It is not convenient when looking at the Storage page: the RDD Name column only shows the kind of RDD, and I would like it to show my own name instead of names like 'rdd 1', 'rrd
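A small sketch of the usual answer (input path is hypothetical): RDD.setName sets the label shown on the Storage page, so name the RDD before caching it.

  val words = sc.textFile("hdfs:///path/to/input")   // assumed path
    .flatMap(_.split(" "))
    .setName("words-from-input")                     // appears under RDD Name
    .cache()
  words.count()                                      // materializes the cache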

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread James Hammerton
Hi, Based on the behaviour I've seen using Parquet, the number of partitions in the DataFrame determines the number of files in each Parquet partition. I.e. when you use "PARTITION BY" you are actually partitioning twice: once via the partitions Spark has created internally, and then again with
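A minimal sketch of the usual way to cap the file count (column names and paths are made up): reduce the number of DataFrame partitions before writing, since each Spark partition can emit one file per partitionBy directory.

  df.repartition(4)                      // at most 4 files per output directory
    .write
    .partitionBy("year", "month")        // assumed partition columns
    .parquet("hdfs:///path/to/output")   // assumed output path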
