Re: Serialization error - sql UDF related

2017-02-17 Thread vaquar khan
Hi Darshan, when you get org.apache.spark.SparkException: Task not serializable, it means that you are using a reference to an instance of a non-serializable class inside a transformation. Hope the following link will help.
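(A minimal sketch of the failure mode vaquar khan describes; the class and names are illustrative, not Darshan's actual code:)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical class: it does NOT extend Serializable.
class NameFixer {
  val suffix = "_dim"

  def run(spark: SparkSession): Unit = {
    import spark.implicits._
    // Referencing the field `suffix` inside the lambda captures `this`
    // (a NameFixer), so Spark must serialize the whole instance and fails:
    // val bad = udf((s: String) => s + suffix)

    // Copying the field into a local val first captures only the String:
    val localSuffix = suffix
    val ok = udf((s: String) => s + localSuffix)
    Seq("char_name").toDF("c").select(ok($"c")).show()
  }
}
```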

Serialization error - sql UDF related

2017-02-17 Thread Darshan Pandya
Hello, I am getting the famous serialization exception on running some code as below, val correctColNameUDF = udf(getNewColumnName(_: String, false: Boolean): String); val charReference: DataFrame = thinLong.select("char_name_id", "char_name").withColumn("columnNameInDimTable",

Re: Spark standalone cluster on EC2 error .. Checkpoint..

2017-02-17 Thread shyla deshpande
Thanks TD and Marco for the feedback. The directory referenced by SPARK_LOCAL_DIRS did not exist. After creating that directory, it worked. This was the first time I was trying to run Spark on a standalone cluster, so I missed it. Thanks On Fri, Feb 17, 2017 at 12:35 PM, Tathagata Das

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
One executor per Spark slave should be fine, right? I am not really sure what benefit one would get by starting more executors (JVMs) on one node. At the end of the day the JVM creates native/kernel threads through system calls, so if those threads are spawned by one or multiple processes I don't see much

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread Alex Kozlov
I found in some previous CDH versions that Spark starts only one executor per Spark slave, and DECREASING the executor-cores in standalone makes the total # of executors go up. Just my 2¢. On Fri, Feb 17, 2017 at 5:20 PM, kant kodali wrote: > Hi Satish, > > I am using spark

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
Hi Satish, I am using Spark 2.0.2. And no, I have not passed those variables because I didn't want to shoot in the dark. According to the documentation, it looks like SPARK_WORKER_CORES is the one which should do it. If not, can you please explain how these variables interact?

RE: question on SPARK_WORKER_CORES

2017-02-17 Thread Satish Lalam
Have you tried passing --executor-cores or --total-executor-cores as arguments, depending on the Spark version? From: kant kodali [mailto:kanth...@gmail.com] Sent: Friday, February 17, 2017 5:03 PM To: Alex Kozlov Cc: user @spark Subject: Re: question
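(For completeness, the same knobs can be set programmatically; a sketch for standalone mode, with property names as given in the Spark configuration docs, worth verifying against your version:)

```scala
import org.apache.spark.SparkConf

// Standalone mode: cap the total cores granted to the app and the cores
// per executor; with these values the master can place four 4-core executors.
val conf = new SparkConf()
  .setAppName("cores-demo")
  .set("spark.cores.max", "16")      // equivalent of --total-executor-cores 16
  .set("spark.executor.cores", "4")  // equivalent of --executor-cores 4
```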

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
Standalone. On Fri, Feb 17, 2017 at 5:01 PM, Alex Kozlov wrote: > What Spark mode are you running the program in? > > On Fri, Feb 17, 2017 at 4:55 PM, kant kodali wrote: > >> when I submit a job using spark shell I get something like this >> >> [Stage

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread Alex Kozlov
What Spark mode are you running the program in? On Fri, Feb 17, 2017 at 4:55 PM, kant kodali wrote: > when I submit a job using spark shell I get something like this > > [Stage 0:>(36814 + 4) / 220129] > > > Now all I want is to increase the number of parallel

question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
When I submit a job using spark-shell I get something like this: [Stage 0:>(36814 + 4) / 220129] Now all I want is to increase the number of parallel tasks running from 4 to 16, so I exported an env variable called SPARK_WORKER_CORES=16 in conf/spark-env.sh. I thought that should do
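(A quick sanity check from inside spark-shell, as a sketch: the running-task count in the progress bar is capped by the total cores the application was actually granted.)

```scala
// In spark-shell: total cores granted to this app, which bounds how many
// tasks can run at once (the "4" in "(36814 + 4) / 220129" above).
sc.defaultParallelism
```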

Re: Query data in subdirectories in Hive Partitions using Spark SQL

2017-02-17 Thread Yan Facai
Hi, Abdelfatah, How do you read these files? spark.read.parquet or spark.sql? Could you show some code? On Wed, Feb 15, 2017 at 8:47 PM, Ahmed Kamal Abdelfatah < ahmed.abdelfa...@careem.com> wrote: > Hi folks, > > > > How can I force spark sql to recursively get data stored in parquet format >
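(Not the asker's code, but one pattern that often helps when the parquet files sit one directory level below the partition directories is widening the path with a glob; a sketch with a hypothetical layout:)

```scala
// Hypothetical layout: /warehouse/tbl/dt=2017-02-17/extra_subdir/part-*.parquet
val df = spark.read
  .option("basePath", "/warehouse/tbl")   // keep dt=... recognized as a partition column
  .parquet("/warehouse/tbl/dt=*/*")       // glob reaches into the extra level
```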

Re: How to specify default value for StructField?

2017-02-17 Thread Yan Facai
I agree with Yong Zhang; perhaps Spark SQL with Hive could solve the problem: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables On Thu, Feb 16, 2017 at 12:42 AM, Yong Zhang wrote: > If it works under hive, do you try just create the DF from

Re: How to convert RDD to DF for this case -

2017-02-17 Thread Yan Facai
Hi, Basu, if all columns are separated by the delimiter "\t", the csv parser might be a better choice. For example:
```scala
spark.read
  .option("sep", "\t")
  .option("header", false)
  .option("inferSchema", true)
  .csv("/user/root/spark_demo/scala/data/Stations.txt")
```

How do I increase readTimeoutMillis parameter in Spark-shell?

2017-02-17 Thread kant kodali
How do I increase the readTimeoutMillis parameter in spark-shell? Because in the middle of CassandraCount, the job aborts with the following exception: java.io.IOException: Exception during execution of SELECT count(*) FROM "test"."hello" WHERE token("cid") > ? AND token("cid") <= ? ALLOW FILTERING:
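(In case it helps: the Cassandra connector exposes the driver's read timeout as a Spark property; the property name below is an assumption from memory of the connector 2.0 reference, so verify it against the connector docs:)

```scala
import org.apache.spark.SparkConf

// Assumed property name (spark-cassandra-connector ReadConf): raises the
// per-read socket timeout above the 120000 ms default.
val conf = new SparkConf()
  .set("spark.cassandra.read.timeout_ms", "300000")
// or at launch: spark-shell --conf spark.cassandra.read.timeout_ms=300000
```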

Executor tab values in Spark Application UI

2017-02-17 Thread satishl
I would like to understand the values in the Executors tab of the Spark application UI better. Are the values for Input, Shuffle Read, and Shuffle Write sums over all tasks across all stages? If yes, then it appears the value isn't much help while debugging. Or am I missing the point of these

Re: Spark standalone cluster on EC2 error .. Checkpoint..

2017-02-17 Thread Tathagata Das
Seems like an issue with the HDFS you are using for checkpointing. It's not able to write data properly. On Thu, Feb 16, 2017 at 2:40 PM, shyla deshpande wrote: > Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): > File >

Re: Spark on Mesos with Docker in bridge networking mode

2017-02-17 Thread Michael Gummelt
There's a JIRA here: https://issues.apache.org/jira/browse/SPARK-11638 I haven't had time to look at it. On Thu, Feb 16, 2017 at 11:00 AM, cherryii wrote: > I'm getting errors when I try to run my docker container in bridge > networking > mode on mesos. > Here is my spark

Re: Graphx Examples for ALS

2017-02-17 Thread Irving Duran
Not sure I follow your question. Do you want to use ALS or GraphX? Thank You, Irving Duran On Fri, Feb 17, 2017 at 7:07 AM, balaji9058 wrote: > Hi, > > Where can I find the ALS recommendation algorithm for a large data set? > > Please feel free to share your

Re: I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread kant kodali
Oops I think that should fix it. I am going to try it now.. Great catch! I feel like an idiot. On Fri, Feb 17, 2017 at 10:02 AM, Russell Spitzer wrote: > Great catch Anastasios! > > On Fri, Feb 17, 2017 at 9:59 AM Anastasios Zouzias > wrote: > >>

Re: I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread Russell Spitzer
Great catch Anastasios! On Fri, Feb 17, 2017 at 9:59 AM Anastasios Zouzias wrote: > Hey, > > Can you try with the 2.11 spark-cassandra-connector? You just reported > that you use spark-cassandra-connector*_2.10*-2.0.0-RC1.jar > > Best, > Anastasios > > On Fri, Feb 17, 2017 at

Re: I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread Anastasios Zouzias
Hey, Can you try with the 2.11 spark-cassandra-connector? You just reported that you use spark-cassandra-connector*_2.10*-2.0.0-RC1.jar Best, Anastasios On Fri, Feb 17, 2017 at 6:40 PM, kant kodali wrote: > Hi, > > > val df =
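(A build-side guard against this class of mismatch, as a sketch: with sbt, `%%` appends the project's Scala binary version to the artifact name, so the connector cannot silently be the _2.10 build on a _2.11 Spark:)

```scala
// build.sbt — %% expands to spark-cassandra-connector_2.11 when
// scalaVersion := "2.11.x", matching the Spark build in this thread.
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-RC1"
```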

I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread kant kodali
Hi, val df = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "hello", "keyspace" -> "test" )).load() This line works fine. I can see it actually pulled the table schema from Cassandra. However, when I do df.count I get the error below. I am using the following

Re: skewed data in join

2017-02-17 Thread Jon Gregg
It depends how you salt it. See slide 40 and onwards from a Spark Summit talk here: http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications The speakers use a mod8 integer salt appended to the end of the key; the salt that works best for you might be
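(A minimal salting sketch in the spirit of those slides; the DataFrame names and the salt width of 8 are illustrative:)

```scala
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`

// bigDF and smallDF are hypothetical inputs sharing a skewed "key" column.
// Big side: append a random salt 0..7 so a hot key spreads over 8 partitions.
val saltedBig = bigDF.withColumn("salted_key",
  concat($"key", lit("_"), (rand() * 8).cast("int").cast("string")))

// Small side: replicate each row once per salt value so every match survives.
val saltedSmall = smallDF
  .withColumn("salt", explode(array((0 until 8).map(lit): _*)))
  .withColumn("salted_key", concat($"key", lit("_"), $"salt".cast("string")))

val joined = saltedBig.join(saltedSmall, "salted_key")
```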

Re: How to convert RDD to DF for this case -

2017-02-17 Thread Aakash Basu
Hey Chris, Thanks for your quick help. Actually the dataset had issues; otherwise the logic I implemented was not wrong. I did this - 1) *V.Imp* – Creating a row by segregating columns after reading the tab-delimited file before converting into a DF: val stati = stat.map(x =>

Graphx Examples for ALS

2017-02-17 Thread balaji9058
Hi, Where can I find the ALS recommendation algorithm for a large data set? Please feel free to share your ideas/algorithms/logic to build a recommendation engine using Spark GraphX. Thanks in advance. Thanks, Balaji

Re: How to convert RDD to DF for this case -

2017-02-17 Thread Christophe Préaud
Hi Aakash, You can try this: import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StringType, StructField, StructType} val header = Array("col1", "col2", "col3", "col4") val schema = StructType(header.map(StructField(_, StringType, true))) val statRow = stat.map(line =>
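(Since the digest truncates the message, here is a self-contained version of the same idea; the tab split and the four-column take are assumptions about the thread's Stations.txt data:)

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val header = Array("col1", "col2", "col3", "col4")
val schema = StructType(header.map(StructField(_, StringType, true)))

// `stat` is the RDD[String] of tab-delimited lines from the thread.
val statRow = stat.map(line => Row.fromSeq(line.split("\t", -1).take(4).toSeq))
val df = spark.createDataFrame(statRow, schema)
```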

how to give hdfs file path as argument to spark-submit

2017-02-17 Thread nancy henry
Hi All, object Step1 { def main(args: Array[String]) = { val sparkConf = new SparkConf().setAppName("my-app") val sc = new SparkContext(sparkConf) val hiveSqlContext: HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
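(One common pattern, sketched under the assumption that the argument is an HQL script on HDFS; the truncated message does not confirm what the path points to:)

```scala
// Launch with: spark-submit --class Step1 app.jar hdfs:///user/me/query.hql
object Step1 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new org.apache.spark.SparkConf().setAppName("my-app")
    val sc = new org.apache.spark.SparkContext(sparkConf)
    val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    val hdfsPath = args(0)  // any hdfs:// URI works; textFile resolves it
    val query = sc.textFile(hdfsPath).collect().mkString("\n")
    hiveSqlContext.sql(query).show()
  }
}
```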