ClassCastException for SerializedLambda

2019-03-29 Thread Koert Kuipers
Hi all, we are switching from Scala 2.11 to 2.12 with a Spark 2.4.1 release candidate, and so far this has been going pretty smoothly. However, we do see some new serialization errors related to Function1, Function2, etc. They look like this: ClassCastException: cannot assign instance of java.lang.
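
A minimal sketch of one commonly suggested workaround, assuming the error stems from Java-serializing a Scala 2.12 lambda (which compiles through java.lang.invoke.SerializedLambda): replace the lambda with a named function class so ordinary class serialization is used. The sc below is assumed to be an existing SparkContext.

    // Instead of rdd.map(x => x + 1), whose Scala 2.12 lambda round-trips
    // through SerializedLambda on the wire...
    class AddOne extends (Int => Int) with Serializable {
      def apply(x: Int): Int = x + 1
    }
    // ...pass a concrete Function1 instance, which serializes as a plain class.
    val result = sc.parallelize(1 to 10).map(new AddOne).collect()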

Spark generates corrupted Parquet files

2019-03-29 Thread Lian Jiang
Hi, occasionally Spark generates Parquet files of only 4 bytes whose entire content is "PAR1". ETL Spark jobs cannot handle such corrupted files and ignore the whole partition containing these poison-pill files, causing big data loss. Spark also generates 0-byte Parquet files, but they can be ha
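
A hedged sketch of two mitigations, assuming the files sit on an HDFS-compatible filesystem; the input path and size threshold are placeholders. Option 1 uses Spark's documented spark.sql.files.ignoreCorruptFiles flag; option 2 pre-filters files no larger than the bare "PAR1" magic before reading.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Option 1: skip files Spark cannot read instead of failing the job.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

    // Option 2: drop suspiciously small files up front. A valid Parquet file
    // carries a header, footer, and metadata, so > 8 bytes is a loose floor.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val healthy = fs.listStatus(new Path("hdfs:///etl/input"))
      .filter(s => s.isFile && s.getLen > 8)
      .map(_.getPath.toString)
    val df = spark.read.parquet(healthy: _*)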

Re: spark.submit.deployMode: cluster

2019-03-29 Thread Todd Nist
A little late, but have you looked at https://livy.incubator.apache.org/? It works well for us. -Todd On Thu, Mar 28, 2019 at 9:33 PM Jason Nerothin wrote: > Meant this one: https://docs.databricks.com/api/latest/jobs.html > > On Thu, Mar 28, 2019 at 5:06 PM Pat Ferrel wrote: > >> Thanks, are you
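
For reference, a minimal sketch of submitting a batch through Livy's REST API (POST /batches, per the Livy docs); the host, jar path, and main class below are placeholders.

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    val payload = """{"file": "hdfs:///jobs/my-app.jar", "className": "com.example.Main"}"""
    val conn = new URL("http://livy-host:8998/batches")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
    println(s"Livy responded with HTTP ${conn.getResponseCode}")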

Spark SQL API taking longer than DataFrame API.

2019-03-29 Thread neeraj bhadani
Hi Team, I am executing the same Spark code using the Spark SQL API and the DataFrame API; however, Spark SQL is taking longer than expected. PFB pseudo code. --- Case 1 : Spark SQL -
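
Since both APIs go through the same Catalyst optimizer, comparing the physical plans is the first diagnostic step; a sketch with placeholder table and column names:

    val df = spark.read.parquet("events.parquet")
    df.createOrReplaceTempView("events")

    val viaSql = spark.sql("SELECT category, COUNT(*) AS n FROM events GROUP BY category")
    val viaDf  = df.groupBy("category").count()

    // If the two physical plans match, the timing gap lies outside query
    // execution itself (e.g., view resolution, caching, or measurement noise).
    viaSql.explain(true)
    viaDf.explain(true)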

Dataset schema incompatibility bug when reading column partitioned data

2019-03-29 Thread Dávid Szakállas
We observed the following bug on Spark 2.4.0:

    scala> spark.createDataset(Seq((1, 2))).write.partitionBy("_1").parquet("foo.parquet")
    scala> val schema = StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))
    scala> spark.read.schema(schema).parquet("foo.parquet").as[(Int
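
A hedged reconstruction of the (truncated) repro, with a workaround that sidesteps positional mismatch; that partition-column placement is the root cause here is an assumption.

    import org.apache.spark.sql.types._
    import spark.implicits._

    spark.createDataset(Seq((1, 2))).write.partitionBy("_1").parquet("foo.parquet")

    val schema = StructType(Seq(
      StructField("_1", IntegerType),
      StructField("_2", IntegerType)))

    // Partition columns are reattached by the reader, so the resulting column
    // order may not match the tuple encoder's positional expectation.
    spark.read.schema(schema).parquet("foo.parquet").as[(Int, Int)].show()

    // Workaround: pin the column order explicitly before applying the encoder.
    spark.read.parquet("foo.parquet").select("_1", "_2").as[(Int, Int)].show()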

Re: How to extract data in parallel from RDBMS tables

2019-03-29 Thread Jason Nerothin
How many tables? What DB? On Fri, Mar 29, 2019 at 00:50 Surendra, Manchikanti < surendra.manchika...@gmail.com> wrote: > Hi Jason, > > Thanks for your reply, but I am looking for a way to extract all the tables > in a database in parallel. > > > On Thu, Mar 28, 2019 at 2:50 PM Jason Nerothin > w
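
A sketch of one approach, assuming a PostgreSQL-style information_schema; the URL, credentials, and output path are placeholders.

    val url = "jdbc:postgresql://db-host:5432/mydb"
    val props = new java.util.Properties()
    props.setProperty("user", "etl")
    props.setProperty("password", "secret")

    // Discover every table in the schema, then extract each one. Each JDBC read
    // is itself split across partitions; a parallel collection overlaps tables.
    val tables = spark.read
      .jdbc(url, "(SELECT table_name FROM information_schema.tables WHERE table_schema = 'public') t", props)
      .collect().map(_.getString(0))

    tables.par.foreach { t =>
      spark.read.jdbc(url, t, props).write.parquet(s"hdfs:///raw/$t")
    }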

Re: Where does the Driver run?

2019-03-29 Thread ayan guha
Have you tried Apache Livy? On Fri, 29 Mar 2019 at 9:32 pm, Jianneng Li wrote: > Hi Pat, > > Now that I understand your terminology better, the method I described was > actually closer to spark-submit than what you referred to as > "programmatically". You want to have SparkContext running in the

Re: Where does the Driver run?

2019-03-29 Thread Jianneng Li
Hi Pat, Now that I understand your terminology better, the method I described was actually closer to spark-submit than what you referred to as "programmatically". You want to have SparkContext running in the launcher program, and also the driver somehow running on the cluster, and unfortunately
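
For the launcher-plus-cluster-driver combination, Spark ships org.apache.spark.launcher.SparkLauncher; a sketch with placeholder paths and master URL:

    import org.apache.spark.launcher.SparkLauncher

    val handle = new SparkLauncher()
      .setAppResource("hdfs:///jobs/my-app.jar")
      .setMainClass("com.example.Main")
      .setMaster("yarn")
      .setDeployMode("cluster")   // the driver runs on the cluster, not in this JVM
      .startApplication()

    // The launcher keeps a handle for monitoring state transitions without
    // hosting the SparkContext itself.
    println(handle.getState)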

Re: Spark Profiler

2019-03-29 Thread jcdauchy
Hello Jack, You can also have a look at “Babar”; there is a nice “flame graph” feature too. I haven’t had the time to test it out. https://github.com/criteo/babar JC

Re: Spark Profiler

2019-03-29 Thread Hariharan
Hi Jack, You can try sparklens (https://github.com/qubole/sparklens). I think it won't give details at as low a level as you're looking for, but it can help you identify and remove performance bottlenecks. ~ Hariharan On Fri, Mar 29, 2019 at 12:01 AM bo yang wrote: > Yeah, these options are ve
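
For reference, sparklens hooks in through a Spark listener; a sketch assuming the listener class name from the sparklens README and that its jar is already on the driver classpath.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
    val spark = SparkSession.builder().config(conf).getOrCreate()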