Re: Local spark context on an executor

2017-03-21 Thread ayan guha
For JDBC to work, you can start spark-submit with the appropriate jdbc driver jars (using --jars); the driver will then be available on the executors. For acquiring connections, create a singleton connection per executor. I think the dataframe's jdbc reader (sqlContext.read.jdbc) already takes care of
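A minimal sketch of the two pieces described above (jar name, URL, credentials, and table name are placeholders, not from the thread): the driver jar is shipped at submit time, and a lazily initialized object gives one connection per executor JVM.

    // Submitted with the driver jar on the classpath, e.g.:
    //   spark-submit --jars mysql-connector-java-5.1.40.jar ...
    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "dbuser")                     // placeholder credentials
    props.setProperty("password", "dbpass")
    props.setProperty("driver", "com.mysql.jdbc.Driver")

    // DataFrame JDBC reader: Spark manages the connections for the read itself.
    val df = spark.read.jdbc("jdbc:mysql://dbhost:3306/mydb", "my_table", props)

    // For hand-rolled JDBC work, a lazy val in an object yields one connection per executor JVM.
    object ExecutorConnection {
      lazy val conn: java.sql.Connection =
        java.sql.DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "dbuser", "dbpass")
    }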

Re: [Spark Streaming+Kafka][How-to]

2017-03-21 Thread OUASSAIDI, Sami
So it worked quite well with a coalesce. I was able to find a solution to my problem: although not directly handling the executor, a good workaround was to assign the desired partition to a specific stream through the assign strategy, coalesce to a single partition, and then repeat the same process for
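A minimal sketch of the assign-then-coalesce idea, assuming the spark-streaming-kafka-0-10 direct API (topic name, partition number, and Kafka parameters are placeholders):

    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val ssc = new StreamingContext(new SparkConf().setAppName("assign-coalesce"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-group")

    // One stream pinned to a single topic-partition via the Assign strategy ...
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Assign[String, String](Seq(new TopicPartition("my_topic", 0)), kafkaParams))

    // ... then coalesced so each batch's work lands in a single partition (one task).
    stream.foreachRDD { rdd =>
      rdd.map(_.value).coalesce(1).foreachPartition { records =>
        records.foreach(println)   // placeholder processing
      }
    }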

Spark Streaming questions, just 2

2017-03-21 Thread shyla deshpande
Hello all, I have a couple of spark streaming questions. Thanks. 1. In the case of stateful operations, the data is, by default, persisted in memory. Does in memory here mean MEMORY_ONLY? When is it removed from memory? 2. I do not see any documentation for spark.cleaner.ttl. Is this no

Re: Easily creating custom encoders

2017-03-21 Thread Koert Kuipers
see: https://issues.apache.org/jira/browse/SPARK-18122 On Tue, Mar 21, 2017 at 1:13 PM, Ashic Mahtab wrote: > I'm trying to easily create custom encoders for case classes having > "unfriendly" fields. I could just kryo the whole thing, but would like to > at least have a few

Re: Local spark context on an executor

2017-03-21 Thread Shashank Mandil
I am using spark to dump data from mysql into hdfs. The way I am doing this is by creating a spark dataframe with the metadata of the different mysql tables to dump from multiple mysql hosts, and then running a map over that dataframe to dump each mysql table's data into hdfs inside the executor. The

Re: Local spark context on an executor

2017-03-21 Thread ayan guha
What is your use case? I am sure there must be a better way to solve it. On Wed, Mar 22, 2017 at 9:34 AM, Shashank Mandil wrote: > Hi All, > > I am using spark in a yarn cluster mode. > When I run a yarn application it creates multiple executors on the hadoop >
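One common alternative to doing anything Spark-side on the executors (a sketch, not ayan's proposal; URLs, credentials, and paths are placeholders) is to keep the table list on the driver and let each dump be an ordinary distributed read/write:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("mysql-dump").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "dbpass")

    // (url, table) pairs live on the driver; each iteration launches a normal Spark job.
    val tables = Seq(
      ("jdbc:mysql://host1:3306/db1", "table_a"),
      ("jdbc:mysql://host2:3306/db2", "table_b"))

    tables.foreach { case (url, table) =>
      spark.read.jdbc(url, table, props)
        .write.mode("overwrite").parquet(s"/data/dump/$table")
    }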

[SparkSQL] Project using NamedExpression

2017-03-21 Thread Aviral Agarwal
Hi guys, I want to transform a Row using NamedExpression. Below is the code snippet that I am using: def apply(dataFrame: DataFrame, selectExpressions: java.util.List[String]): RDD[UnsafeRow] = { val exprArray = selectExpressions.map(s => Column(SqlParser.parseExpression(s)).named )
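The snippet above leans on Catalyst internals (SqlParser, UnsafeRow). A shorter route through the public API, shown here only as a sketch and not as a fix for the original code, lets Spark parse the expression strings itself:

    import scala.collection.JavaConverters._
    import org.apache.spark.sql.DataFrame

    // Each string in selectExpressions is parsed into a named expression by selectExpr.
    def project(dataFrame: DataFrame, selectExpressions: java.util.List[String]): DataFrame =
      dataFrame.selectExpr(selectExpressions.asScala: _*)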

Spark data frame map problem

2017-03-21 Thread Shashank Mandil
Hi All, I have a spark data frame which has 992 rows in it. When I run a map on this data frame I expect the map to work on all 992 rows. As the mapper runs on executors on the cluster, I did a distributed count of the number of rows the mapper is actually run on. dataframe.map(r
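A minimal sketch of one way to do such a distributed count (assumes Spark 2.x; spark.range stands in for the 992-row data frame): an accumulator makes the per-executor counts visible on the driver, and nothing is counted until an action actually runs the map.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("map-count").getOrCreate()
    import spark.implicits._

    val df = spark.range(992).toDF("id")                  // stand-in for the 992-row data frame
    val rowsSeen = spark.sparkContext.longAccumulator("rowsSeen")

    // The map alone is lazy; collect() (or any other action) is what actually runs it.
    df.map { r => rowsSeen.add(1); r.getLong(0) }.collect()

    println(rowsSeen.value)                               // 992 once the job has completed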

data cleaning and error routing

2017-03-21 Thread vincent gromakowski
Hi, In a context of ugly data, I am trying to find an efficient way to parse a kafka stream of CSV lines into a clean data model and route the lines in error to a specific topic. Generally I do this: 1. First, a map to split my lines on the separator character (";"). 2. Then a filter where I put all
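A minimal sketch of the split-then-route steps above (the record shape and field count are hypothetical; in the real job the input lines come from the Kafka stream and the same map/filter calls apply per batch):

    case class Record(id: String, amount: Double)

    // Left = raw line in error, Right = cleanly parsed record.
    def parse(line: String): Either[String, Record] = {
      val fields = line.split(";", -1)
      if (fields.length == 2)
        try Right(Record(fields(0), fields(1).toDouble))
        catch { case _: NumberFormatException => Left(line) }
      else Left(line)
    }

    // Stand-in input for the sketch.
    val lines  = Seq("1;10.5", "2;oops", "3")
    val parsed = lines.map(parse)
    val clean  = parsed.collect { case Right(r) => r }     // continue the pipeline with these
    val errors = parsed.collect { case Left(raw) => raw }  // publish these to the error topic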

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-21 Thread shyla deshpande
Thanks TD. On Tue, Mar 14, 2017 at 4:37 PM, Tathagata Das wrote: > This setting allows multiple spark jobs generated through multiple > foreachRDD to run concurrently, even if they are across batches. So output > op2 from batch X can run concurrently with op1 of batch X+1
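For reference, the setting under discussion is presumably spark.streaming.concurrentJobs (it does not appear in the official docs); a sketch of how it is typically set:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.streaming.concurrentJobs", "2")   // default is 1: one output op at a time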

Easily creating custom encoders

2017-03-21 Thread Ashic Mahtab
I'm trying to easily create custom encoders for case classes having "unfriendly" fields. I could just kryo the whole thing, but would like to at least have a few fields in the schema instead of one binary blob. For example, case class MyClass(id: UUID, items: Map[String, Double], name: String)
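For concreteness, the "kryo the whole thing" fallback mentioned above looks roughly like this (a sketch assuming Spark 2.x), and the printed schema shows exactly the single-binary-blob limitation the question is about:

    import java.util.UUID
    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    case class MyClass(id: UUID, items: Map[String, Double], name: String)

    val spark = SparkSession.builder().appName("custom-encoders").getOrCreate()

    // Kryo copes with the UUID and the Map, but the whole object becomes one binary column.
    implicit val myClassEncoder: Encoder[MyClass] = Encoders.kryo[MyClass]

    val ds = spark.createDataset(Seq(MyClass(UUID.randomUUID(), Map("a" -> 1.0), "x")))
    ds.printSchema()   // a single "value: binary" column -- no per-field schema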

Re: Merging Schema while reading Parquet files

2017-03-21 Thread Matt Deaver
You could create a one-time job that processes the historical data to match the updated format. On Tue, Mar 21, 2017 at 8:53 AM, Aditya Borde wrote: > Hello, > > I'm currently blocked with this issue: > > I have job "A" whose output is partitioned by one of the fields, "col1" >
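A minimal sketch of that one-time backfill (paths are placeholders): read the old, unpartitioned output once and rewrite it partitioned the same way job "A" writes now.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("backfill").getOrCreate()

    spark.read.parquet("/data/jobA/legacy")        // old output, not partitioned by col1
      .write.partitionBy("col1")
      .parquet("/data/jobA/partitioned")           // matches the new layout read by job "B"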

Merging Schema while reading Parquet files

2017-03-21 Thread Aditya Borde
Hello, I'm currently blocked with this issue: I have job "A" whose output is partitioned by one of the fields, "col1". Now job "B" reads the output of job "A". Here comes the problem: my job "A" output was previously not partitioned by "col1" (this is a recent change). But the thing is, now all
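Related, and only a sketch rather than a confirmed fix for this particular case: Parquet schema merging has to be requested explicitly at read time, since it is off by default.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-schema").getOrCreate()

    val df = spark.read
      .option("mergeSchema", "true")   // off by default; merging schemas is relatively expensive
      .parquet("/data/jobA")           // placeholder path covering both old and new output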

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-21 Thread Dongjin Lee
Hi Chetan, Sadly, you cannot; Spark is configured to ignore null values when writing JSON. (Check JacksonMessageWriter and find JsonInclude.Include.NON_NULL in the code.) If you want that functionality, it would be better to file an issue in JIRA. Best, Dongjin On Mon, Mar 20,
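Since there is no switch on the writer itself, a common workaround (a sketch, not Dongjin's suggestion; the output path is a placeholder) is to fill the nulls before writing so the keys at least survive:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("json-nulls").getOrCreate()
    import spark.implicits._

    val df = Seq((Some("a"), 1), (None, 2)).toDF("name", "n")

    // df.write.json would drop the "name" key for the second row; filling keeps it.
    df.na.fill("")                      // use fill(0) / fill(0.0) for numeric columns
      .write.json("/tmp/json_out")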