RE: Error building a self-contained Spark app

2016-03-04 Thread Jelez Raditchkov
OK, this is what I have:

    object SQLHiveContextSingleton {
      @transient private var instance: HiveContext = _
      def getInstance(sparkContext: SparkContext): HiveContext = {
        synchronized {
          if (instance == null || sparkContext.isStopped) {
            instance = new …
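Since the archive preview cuts the message off, here is a plausible completion of the snippet, following the singleton pattern from the Spark Streaming programming guide (the new HiveContext(sparkContext) call and the tail of getInstance are assumptions):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    object SQLHiveContextSingleton {
      // Driver-side HiveContext, created lazily and reused across batches.
      @transient private var instance: HiveContext = _

      def getInstance(sparkContext: SparkContext): HiveContext = {
        synchronized {
          // Rebuild the context if it was never created or its SparkContext stopped.
          if (instance == null || sparkContext.isStopped) {
            instance = new HiveContext(sparkContext) // assumed: the truncated line ends here
          }
          instance
        }
      }
    }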

FW: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

2016-03-04 Thread Jelez Raditchkov
From: je...@hotmail.com
To: yuzhih...@gmail.com
Subject: RE: How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
Date: Fri, 4 Mar 2016 14:09:20 -0800

The code below is from the sources; is this what you're asking about?

    class …
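For context, the pattern in the subject line comes from the Spark Streaming programming guide; a minimal sketch of how it is used inside a streaming job (the DStream and the one-column schema are hypothetical):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    def handleWords(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        // Returns the existing SQLContext, or lazily creates one on the driver.
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        import sqlContext.implicits._
        val df = rdd.toDF("word") // hypothetical single-column schema
        df.registerTempTable("words")
      }
    }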

Best way to merge files from streaming jobs

2016-03-04 Thread Jelez Raditchkov
My streaming job creates files on S3. The problem is that those files end up very small if I just write them to S3 directly, which is why I use coalesce() to reduce the number of files and make them larger. However, coalesce() shuffles data, and my job's processing time ends up higher than …
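A minimal sketch of the setup being described, with the trade-off in comments (the bucket, path, and partition count are hypothetical; coalesce(n) narrows partitions without a full shuffle, while repartition(n) forces one):

    import org.apache.spark.streaming.dstream.DStream

    def writeBatches(stream: DStream[String]): Unit = {
      stream.foreachRDD { (rdd, time) =>
        // Fewer output partitions => fewer, larger files per batch on S3.
        // The cost: less write parallelism, so each batch can take longer.
        rdd.coalesce(4) // hypothetical partition count
           .saveAsTextFile(s"s3n://my-bucket/output/batch-${time.milliseconds}")
      }
    }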

How to get the singleton instance of SQLContext/HiveContext: val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

2016-03-04 Thread Jelez Raditchkov
What is the best approach to using getOrCreate for a streaming job with HiveContext? For SQLContext, the recommended approach seems to be getOrCreate: https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations

    val sqlContext = …
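One caveat, from my reading of the Spark 1.6 API rather than from this thread: SQLContext.getOrCreate instantiates a plain SQLContext when none exists, so a job that needs Hive support usually falls back to a hand-rolled singleton like the SQLHiveContextSingleton shown in the first message above:

    import org.apache.spark.streaming.dstream.DStream

    def refreshTables(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        // Reuses the singleton object from the first message above.
        val hiveContext = SQLHiveContextSingleton.getInstance(rdd.sparkContext)
        hiveContext.sql("SHOW TABLES").show() // hypothetical query, just to exercise the context
      }
    }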

S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-04 Thread Jelez Raditchkov
I'm working on a streaming job that writes to S3 with DirectParquetOutputCommitter. I need to use partitionBy() and hence SaveMode.Append. Apparently, when using SaveMode.Append, Spark automatically falls back to the default Parquet output committer and ignores DirectParquetOutputCommitter. My problems are: 1. the …
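For reference, this is roughly how the direct committer was enabled in the Spark 1.x era (a sketch under stated assumptions: the SparkContext, DataFrame, partition column, and S3 path are hypothetical, and the committer's package moved between releases):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{DataFrame, SaveMode}

    def writePartitioned(sc: SparkContext, df: DataFrame): Unit = {
      // Opt in to the direct committer; in Spark 1.6 the class lived at
      // org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
      // (earlier releases used org.apache.spark.sql.parquet.DirectParquetOutputCommitter).
      sc.hadoopConfiguration.set(
        "spark.sql.parquet.output.committer.class",
        "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

      df.write
        .partitionBy("date")   // hypothetical partition column
        .mode(SaveMode.Append) // Append is what makes Spark ignore the direct committer
        .parquet("s3n://my-bucket/output")
    }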