Accumulator param not combining values

2016-04-15 Thread Connie Chen
Hi all, I wrote an AccumulatorParam but in my job it does not seem to be adding the values. When I tried with an Int accumulator in my job, the value was added correctly. object MapAccumulatorParam extends AccumulatorParam[Map[Long, Int]]{ > def zero(initialValue: Map[Long, Int] = Map.empty):
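For reference, a minimal sketch (not the poster's exact code) of an AccumulatorParam for Map[Long, Int]; the usual pitfall is an addInPlace that fails to merge per-key values, which leaves the accumulator looking unchanged:

import org.apache.spark.AccumulatorParam

// Sketch only: addInPlace must merge the two maps, summing values per key.
object MapAccumulatorParam extends AccumulatorParam[Map[Long, Int]] {
  def zero(initialValue: Map[Long, Int]): Map[Long, Int] = Map.empty

  def addInPlace(m1: Map[Long, Int], m2: Map[Long, Int]): Map[Long, Int] =
    (m1.keySet ++ m2.keySet).map { k =>
      k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))
    }.toMap
}

// Hypothetical usage:
// val acc = sc.accumulator(Map.empty[Long, Int])(MapAccumulatorParam)
// rdd.foreach(x => acc += Map(someId(x) -> 1))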

Re: Shuffle service fails to register driver - Spark - Mesos

2016-04-15 Thread Jo Voordeckers
Never mind, just figured out my problem, I was running: *deploy.ExternalShuffleService* instead of *deploy.mesos.MesosExternalShuffleService* - Jo Voordeckers On Fri, Apr 15, 2016 at 2:29 PM, Jo Voordeckers wrote: > Forgot to mention we're running Spark (Streaming)

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
Thanks! I got this to work. val csvRdd = sc.parallelize(data.split("\n")) val df = new com.databricks.spark.csv.CsvParser().withUseHeader(true).withInferSchema(true).csvRdd(sqlContext, csvRdd) > On Apr 15, 2016, at 1:14 PM, Hyukjin Kwon wrote: > > Hi, > > Would you try
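A self-contained sketch of the same approach, assuming spark-csv 1.x is on the classpath and an existing sc/sqlContext; the CSV string is illustrative:

import com.databricks.spark.csv.CsvParser

// Hypothetical CSV content held in a string rather than a file.
val data = "id,name,score\n1,alice,10\n2,bob,20"

val csvRdd = sc.parallelize(data.split("\n"))
val df = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)
  .csvRdd(sqlContext, csvRdd)

df.printSchema()
df.show()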

Re: Shuffle service fails to register driver - Spark - Mesos

2016-04-15 Thread Jo Voordeckers
Forgot to mention we're running Spark (Streaming) 1.5.1 - Jo Voordeckers On Fri, Apr 15, 2016 at 12:21 PM, Jo Voordeckers wrote: > Hi all, > > I've got mesos in coarse grained mode with dyn alloc, shuffle service > enabled and am running the shuffle service on every

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
Is this right? import com.databricks.spark.csv val csvRdd = data.flatMap(x => x.split("\n")) val df = new CsvParser().csvRdd(sqlContext, csvRdd, useHeader = true) Thanks, Ben > On Apr 15, 2016, at 1:14 PM, Hyukjin Kwon wrote: > > Hi, > > Would you try this codes

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Hyukjin Kwon
Hi, Would you try the code below? val csvRDD = ...your processing for csv rdd.. val df = new CsvParser().csvRdd(sqlContext, csvRDD, useHeader = true) Thanks! On 16 Apr 2016 1:35 a.m., "Benjamin Kim" wrote: > Hi Hyukjin, > > I saw that. I don’t know how to use it. I’m

Re: alter table add columns alternatives or hive refresh

2016-04-15 Thread Mich Talebzadeh
Looks plausible. Glad it helped. Personally I prefer ORC tables as they are arguably a better fit for columnar storage. Others may differ on this :) Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: alter table add columns alternatives or hive refresh

2016-04-15 Thread Maurin Lenglart
Hi, Following your answer I was able to make it work. FYI: Basically the solution is to manually create the table in Hive using a SQL "CREATE TABLE" command. When doing a saveAsTable, the Hive metastore doesn't get the schema info from the DataFrame. So now my flow is: * Create a dataframe * if it is the
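A rough sketch of that flow (table and column names are hypothetical; assumes a HiveContext): the table is created once with explicit DDL so the metastore has the schema, and subsequent writes insert into it rather than relying on saveAsTable:

// Hypothetical database, table and schema.
sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS my_db.events (
    |  id BIGINT,
    |  name STRING,
    |  value DOUBLE
    |) STORED AS PARQUET""".stripMargin)

// New columns can later be added through Hive DDL, keeping the metastore in sync:
// sqlContext.sql("ALTER TABLE my_db.events ADD COLUMNS (extra STRING)")

// Write the DataFrame into the existing table instead of calling saveAsTable.
df.write.mode("append").insertInto("my_db.events")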

Re: How many disks for spark_local_dirs?

2016-04-15 Thread Jan Rock
Hi, is it a physical server or AWS/Azure? What parameters are passed to the spark-shell command? Hadoop distro/version and Spark version? Kind Regards, Jan > On 15 Apr 2016, at 16:15, luca_guerra wrote: > > Hi, > I'm looking for a solution to improve my Spark cluster

Shuffle service fails to register driver - Spark - Mesos

2016-04-15 Thread Jo Voordeckers
Hi all, I've got Mesos in coarse-grained mode with dynamic allocation, shuffle service enabled, and am running the shuffle service on every Mesos slave. I'm assuming I misconfigured something on the scheduler service, any ideas? On my driver I see a few of these, I guess it's one for every executor :
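For context, a hedged sketch of the settings typically involved in this setup (values are illustrative); as noted elsewhere in this thread, the shuffle service itself must be the Mesos-specific one:

import org.apache.spark.SparkConf

// Illustrative settings only; actual values depend on the cluster.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")

// Each Mesos slave must also run
// org.apache.spark.deploy.mesos.MesosExternalShuffleService
// (not the generic deploy.ExternalShuffleService).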

Spark Standalone with SPARK_CLASSPATH in spark-env.sh and "spark.driver.userClassPathFirst"

2016-04-15 Thread Yong Zhang
Hi, I found a problem with using "spark.driver.userClassPathFirst" and SPARK_CLASSPATH in spark-env.sh in a Standalone environment, and want to confirm that it in fact has no good solution. We are running Spark 1.5.2 in standalone mode on a cluster. Since the cluster doesn't have the direct
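A hedged sketch of the settings involved (jar path is illustrative): user jars supplied explicitly via configuration, with the driver and executors asked to prefer them over Spark's own copies, instead of prepending them through SPARK_CLASSPATH in spark-env.sh:

import org.apache.spark.SparkConf

// Illustrative only; not a drop-in fix for the problem described above.
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
  .set("spark.jars", "/path/to/my-deps.jar") // instead of SPARK_CLASSPATH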

Re: Will nested field performance improve?

2016-04-15 Thread Michael Armbrust
> > If we expect fields nested in structs to always be much slower than flat > fields, then I would be keen to address that in our ETL pipeline with a > flattening step. If it's a known issue that we expect will be fixed in > upcoming releases, I'll hold off. > The difference might be even larger
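If flattening does turn out to be worthwhile, a minimal sketch (hypothetical column names) is just a select that promotes struct fields to top-level columns before writing Parquet:

import org.apache.spark.sql.functions.col

// Hypothetical schema: root |-- user: struct (id, name), |-- event_time
val flattened = df.select(
  col("user.id").alias("user_id"),
  col("user.name").alias("user_name"),
  col("event_time")
)
flattened.write.parquet("/path/to/flat_output") // illustrative path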

Re: Spark sql not pushing down timestamp range queries

2016-04-15 Thread Kiran Chitturi
Mich, I am also curious how Spark casts between different types in filters. For example, the conversion happens implicitly for the 'EqualTo' filter: scala> sqlContext.sql("SELECT * from events WHERE `registration` = > '2015-05-28'").explain() > > 16/04/15 11:44:15 INFO ParseDriver: Parsing
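One way to see what actually reaches the data source is to compare the physical plans for a string literal versus an explicit timestamp cast (a sketch; the table and column names are taken from the example above). The PushedFilters section of the physical plan shows which predicates were handed to the source:

// Implicit cast: the string literal is compared against the timestamp column.
sqlContext.sql(
  "SELECT * FROM events WHERE `registration` = '2015-05-28'").explain()

// Explicit cast, for comparison.
sqlContext.sql(
  "SELECT * FROM events WHERE `registration` >= CAST('2015-05-28' AS TIMESTAMP)").explain()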

Re: Logging in executors

2016-04-15 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1 On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas wrote: > Hi guys, > > any clue on this? Clearly the > spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the > executors. > > Thanks, >

Re: How do i get a spark instance to use my log4j properties

2016-04-15 Thread Demi Ben-Ari
Hi Steve, I wrote a blog post on the configuration of Spark that we've used, including the log4j.properties: http://progexc.blogspot.co.il/2014/12/spark-configuration-mess-solved.html (What we did was to distribute the relevant *log4j.properties* file to the same location on all of the slaves)
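A hedged sketch of what that usually amounts to in configuration (paths are illustrative); the file must actually exist at that path on every executor node, e.g. pre-distributed as in the blog post:

import org.apache.spark.SparkConf

// Illustrative only; the properties can equally be set in spark-defaults.conf.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties")
  .set("spark.driver.extraJavaOptions",
       "-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties")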

Re: Spark sql not pushing down timestamp range queries

2016-04-15 Thread Kiran Chitturi
Thanks Hyukjin for the suggestion. I will take a look at implementing Solr datasource with CatalystScan. ​

Re: JSON Usage

2016-04-15 Thread Benjamin Kim
Holden, If I were to use DataSets, then I would essentially do this: val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl) val messages = sqs.receiveMessage(receiveMessageRequest).getMessages() for (message <- messages.asScala) { val files =
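A rough sketch of turning such message bodies into a DataFrame, assuming each SQS message body is a single JSON document and that the sqs client, myQueueUrl, sc and sqlContext already exist as in the snippet above:

import scala.collection.JavaConverters._
import com.amazonaws.services.sqs.model.ReceiveMessageRequest

val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl)
val messages = sqs.receiveMessage(receiveMessageRequest).getMessages.asScala

// Each message body is assumed to hold one JSON document.
val jsonBodies = messages.map(_.getBody)
val df = sqlContext.read.json(sc.parallelize(jsonBodies))
df.printSchema()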

Re: How many disks for spark_local_dirs?

2016-04-15 Thread Mich Talebzadeh
Is that 32 CPUs or 32 cores? So in this configuration, assuming 32 cores, you have 1 worker with how much memory (deducting memory for the OS etc.) and 32 cores. What is the ratio of memory per core in this case? HTH Dr Mich Talebzadeh LinkedIn *

Re: EMR Spark log4j and metrics

2016-04-15 Thread Peter Halliday
I wonder if anyone can confirm whether Spark on YARN is the problem here, or is it how AWS has put it together? I'm wondering if Spark on YARN has problems with configuration files for the workers and driver? Peter Halliday On Thu, Apr 14, 2016 at 1:09 PM, Peter Halliday wrote:

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
Hi Hyukjin, I saw that. I don’t know how to use it. I’m still learning Scala on my own. Can you help me to start? Thanks, Ben > On Apr 15, 2016, at 8:02 AM, Hyukjin Kwon wrote: > > I hope it was not too late :). > > It is possible. > > Please check csvRdd api here, >

How many disks for spark_local_dirs?

2016-04-15 Thread luca_guerra
Hi, I'm looking for a solution to improve my Spark cluster performance. I have read from http://spark.apache.org/docs/latest/hardware-provisioning.html: "We recommend having 4-8 disks per node". I have tried with both one and two disks, but I have seen that with 2 disks the execution time is
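For reference, a hedged sketch of pointing Spark's shuffle/spill scratch space at several disks (mount points are illustrative); the same can be achieved with SPARK_LOCAL_DIRS in spark-env.sh:

import org.apache.spark.SparkConf

// One entry per physical disk so shuffle and spill I/O is spread across spindles.
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")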

Re: In-Memory Only Spark Shuffle

2016-04-15 Thread Hyukjin Kwon
This reminds me of this Jira, https://issues.apache.org/jira/browse/SPARK-3376 and this PR, https://github.com/apache/spark/pull/5403. AFAIK, it is not and won't be supported. On 2 Apr 2016 4:13 a.m., "slavitch" wrote: > Hello; > > I’m working on spark with very large memory

Spark RDD.take() generating duplicates for AvroData

2016-04-15 Thread Anoop Shiralige
Hi All, I have some avro data, which I am reading in the following way. Query : > val data = sc.newAPIHadoopFile(file, classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]], classOf[NullWritable]). map(_._1.datum) But, when I try to print the data, it is generating
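This kind of duplication is commonly caused by the Hadoop input format reusing the same record object. A hedged sketch of the usual workaround is to copy each datum before calling take()/collect() (assumes the same imports as the snippet above, plus Avro's GenericData):

import org.apache.avro.generic.{GenericData, GenericRecord}

// Deep-copy each record so the results don't all point at the same reused object.
val data = sc.newAPIHadoopFile(file,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  .map { case (key, _) =>
    val rec = key.datum()
    GenericData.get().deepCopy(rec.getSchema, rec)
  }

data.take(5).foreach(println)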

Re: How to stop hivecontext

2016-04-15 Thread Ted Yu
You can call stop() method. > On Apr 15, 2016, at 5:21 AM, ram kumar wrote: > > Hi, > I started hivecontext as, > > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc); > > I want to stop this sql context > > Thanks
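A minimal sketch of one common reading of this advice, assuming the goal is to shut Spark down entirely: stop the underlying SparkContext, which also releases the HiveContext built on top of it.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// ... use sqlContext ...
sc.stop()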

Re: Logging in executors

2016-04-15 Thread Carlos Rojas Matas
Hi guys, any clue on this? Clearly the spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the executors. Thanks, -carlos. On Wed, Apr 13, 2016 at 2:48 PM, Carlos Rojas Matas wrote: > Hi Yong, > > thanks for your response. As I said in my first email,

How to stop hivecontext

2016-04-15 Thread ram kumar
Hi, I started hivecontext as, val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc); I want to stop this sql context Thanks

Unable To access Hive From Spark

2016-04-15 Thread Amit Singh Hora
Hi All, I am trying to access Hive from Spark but am getting an exception: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- Code :- String logFile = "hdfs://hdp23ha/logs"; // Should be some file on

[Help]:Strange Issue :Debug Spark Dataframe code

2016-04-15 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10. Is there any other option apart from "explain(true)" to debug Spark DataFrame code? I am facing a strange issue. I have a lookup dataframe and am using it to join another dataframe on different columns. I am getting an *Analysis exception* in the third join. When
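Beyond explain(true), a common if unglamorous approach is to inspect the schema and plan after each intermediate join; a sketch with hypothetical DataFrame and column names:

import org.apache.spark.sql.functions.col

// Hypothetical DataFrames (df, lookup); check each step so the failing join can be pinned down.
val step1 = df.join(lookup, df("key1") === lookup("key1"))
step1.printSchema()
step1.explain(true)

// Re-using the same lookup DataFrame in several joins easily produces ambiguous
// column references (a frequent cause of AnalysisException); aliasing each use helps.
val lk2 = lookup.as("lk2")
val step2 = step1.join(lk2, step1("key2") === col("lk2.key2"))
step2.printSchema()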

Re: Spark sql not pushing down timestamp range queries

2016-04-15 Thread Mich Talebzadeh
Thanks Takeshi, I did check it. I believe you are referring to this statement "This is likely because we cast this expression weirdly to be compatible with Hive. Specifically I think this turns into, CAST(c_date AS STRING) >= "2016-01-01", and we don't push down casts down into data sources.

Will nested field performance improve?

2016-04-15 Thread James Aley
Hello, I'm trying to make a call on whether my team should invest time adding a step to "flatten" our schema as part of our ETL pipeline, to improve performance of interactive queries. Our data starts out life as Avro before being converted to Parquet, and so we follow the Avro idioms of creating

Re: Can this performance be improved?

2016-04-15 Thread Jörn Franke
You could use a different format, and a Dataset or DataFrame instead of an RDD. > On 14 Apr 2016, at 23:21, Bibudh Lahiri wrote: > > Hi, > As part of a larger program, I am extracting the distinct values of some > columns of an RDD with 100 million records and 4
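A rough sketch of what the DataFrame variant might look like (hypothetical column name and path; assumes the data can be loaded through a DataFrame source such as Parquet or spark-csv):

// Load through a DataFrame source instead of a raw RDD,
// then let Catalyst plan the per-column distinct.
val df = sqlContext.read.parquet("/path/to/records") // illustrative path
val distinctValues = df.select("col1").distinct()
distinctValues.show()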