Launching multiple spark jobs within a main spark job.

2016-12-20 Thread Naveen
Hi Team, Is it OK to spawn multiple Spark jobs from within a main Spark job? My main Spark job's driver, which was launched on the YARN cluster, will do some preprocessing and, based on it, needs to launch multiple Spark jobs on the YARN cluster. I'm not sure if this is the right pattern. Please share your thoughts.
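
One common reading of this pattern is to keep a single SparkContext and run several independent jobs (actions) concurrently from the driver; launching entirely separate YARN applications from inside a driver is a different exercise. Below is a minimal sketch of the in-application variant, with placeholder table names and paths:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.SparkSession

    object MultiJobDriver {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("multi-job-driver").getOrCreate()

        // Preprocessing step that decides which downstream work is needed
        // (the table names and paths here are placeholders).
        val inputs = Seq("tableA", "tableB", "tableC")

        // Each action becomes its own Spark job; wrapping the actions in Futures
        // lets the scheduler run them concurrently inside this single application.
        val jobs = inputs.map { name =>
          Future { spark.read.parquet(s"/data/$name").count() }
        }

        val counts = Await.result(Future.sequence(jobs), Duration.Inf)
        inputs.zip(counts).foreach { case (name, n) => println(s"$name -> $n rows") }
        spark.stop()
      }
    }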

Re: Aggregating over sorted data

2016-12-20 Thread Liang-Chi Hsieh
Hi, Can you try the combination of `repartition` + `sortWithinPartitions` on the dataset? E.g., val df = Seq((2, "b c a"), (1, "c a b"), (3, "a c b")).toDF("number", "letters") val df2 = df.explode('letters) { case Row(letters: String) => letters.split("
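
For reference, a minimal, self-contained sketch of the suggested combination (the column names and sample data are illustrative, not taken from the original thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sorted-within-partitions").getOrCreate()
    import spark.implicits._

    val df = Seq((2, "b"), (1, "c"), (1, "a"), (2, "d")).toDF("number", "letter")

    // Rows with the same "number" land in the same partition and arrive sorted,
    // so a downstream mapPartitions or aggregation sees each group's rows in order.
    val sorted = df.repartition($"number").sortWithinPartitions($"number", $"letter")
    sorted.show()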

Re: Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-20 Thread Liang-Chi Hsieh
Hi, You can't invoke any RDD actions/transformations inside another transformation; they must be invoked by the driver. If I understand your purpose correctly, you can partition your data (i.e., with `partitionBy`) when writing out to Parquet files. - Liang-Chi Hsieh | @viirya Spark
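
As an illustration of the suggested approach (the path and column name are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
    val df = spark.read.parquet("/path/to/input")

    // One output directory per zip_code (e.g. .../zip_code=94105/) is written,
    // instead of filtering the DataFrame once per value inside an RDD operation.
    df.write.partitionBy("zip_code").parquet("/path/to/output")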

Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-20 Thread satyajit vegesna
Hi All, PFB sample code: val df = spark.read.parquet() df.registerTempTable("df") val zip = df.select("zip_code").distinct().as[String].rdd def comp(zipcode:String):Unit={ val zipval = "SELECT * FROM df WHERE zip_code='$zipvalrepl'".replace("$zipvalrepl", zipcode) val data =
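
A hedged sketch of one way to restructure this so nothing touches the SparkSession inside an RDD operation: collect the distinct zip codes to the driver and call comp in a plain loop (the paths and the write step are assumptions for illustration):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("per-zip-queries").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/path/to/input")   // placeholder path
    df.createOrReplaceTempView("df")

    def comp(zipcode: String): DataFrame =
      spark.sql(s"SELECT * FROM df WHERE zip_code = '$zipcode'")

    // Collecting the distinct zip codes keeps the per-zip query on the driver.
    val zips = df.select("zip_code").distinct().as[String].collect()
    zips.foreach { z =>
      comp(z).write.mode("overwrite").parquet(s"/path/to/output/zip=$z")
    }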

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Triones,Deng(vip.com)
Hi spark dev, I am using Spark 2 to write ORC files to HDFS. I have a question about the save mode. My use case is this: when I write data into HDFS and one task fails, I'd like the file that the task created to be deleted so the retry task can write all the data, that is to

question about the data frame save mode to make the data exactly one

2016-12-20 Thread Triones,Deng(vip.com)
Hi spark dev, I am using Spark 2 to write ORC files to HDFS. I have a question about the save mode. My use case is this: when I write data into HDFS and one task fails, I'd like the file that the task created to be deleted so the retry task can write all the data, that is to
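
For reference, a minimal sketch of setting a save mode on an ORC write (the paths are placeholders); note that SaveMode governs what happens when the target already exists, while cleanup of a failed task's partial output is normally handled by the output committer:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("orc-write").getOrCreate()
    val df = spark.read.parquet("/path/to/input")

    // Overwrite replaces any existing data at the target path; the other modes
    // are Append, ErrorIfExists (the default) and Ignore.
    df.write.mode(SaveMode.Overwrite).orc("/path/to/output")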

Reg: Any Dev member in and around Chennai / Tamilnadu

2016-12-20 Thread Sivanesan Govindaraj
Hi Dev, Sorry to bother you with a non-technical query. I wish to connect in person with any active contributor / committer in and around Chennai / Tamil Nadu. Is there a list of all committers' details in any location? Regs, Siva.

Re: Expand the Spark SQL programming guide?

2016-12-20 Thread Ricardo Almeida
The examples look great indeed, and they seem a good addition to the existing documentation. I understand the UDAF examples don't apply to Python, but is there any particular reason to skip the Python API altogether in this window functions documentation? On 20 December 2016 at 16:56, Jim Hughes

Re: Expand the Spark SQL programming guide?

2016-12-20 Thread Jim Hughes
Hi Anton, Your example and documentation look great! I left some comments suggesting a few additions, but the PR in its current state is a great improvement! Thanks, Jim On 12/18/2016 09:09 AM, Anton Okolnychyi wrote: Any comments/suggestions are more than welcome. Thanks, Anton

Re: Kafka Spark structured streaming latency benchmark.

2016-12-20 Thread Prashant Sharma
Hi Shixiong, Thanks for taking a look. I am trying to run and see whether making the ContextCleaner run more frequently and/or making it non-blocking will help. --Prashant On Tue, Dec 20, 2016 at 4:05 AM, Shixiong(Ryan) Zhu wrote: > Hey Prashant. Thanks for your codes. I did
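
For anyone following along, these appear to be the two ContextCleaner-related settings being referred to; the values below are only examples for experimentation, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cleaner-tuning")
      // How often the driver forces a GC so that weakly-referenced shuffle and
      // broadcast state gets queued for cleanup (default 30min).
      .config("spark.cleaner.periodicGC.interval", "5min")
      // Whether cleanup calls block the cleaning thread (default true); the
      // thread above considers making this non-blocking.
      .config("spark.cleaner.referenceTracking.blocking", "false")
      .getOrCreate()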

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-20 Thread Liang-Chi Hsieh
Hi Nick, The scope of the PR I submitted was reduced because we can't be sure it is really the root cause of the error you faced; you can check out the discussion on the PR. So for now I just change the assert in the code as shown in the PR. If you can get a repro, we can go back to see if it