Re: spark sql - reading data from sql tables having space in column names

2015-06-07 Thread Cheng Lian
You can use backticks to quote the column names. Cheng On 6/3/15 2:49 AM, David Mitchell wrote: I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space, such as "Executor Info" from the Spark logs. I suggest that we open a JIRA ticket to
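
A minimal sketch of the backtick quoting Cheng describes; the table and column names here are invented for illustration:

    // Hypothetical table "spark_logs" registered with a column named "Executor Info".
    // Backticks let the SQL parser accept the embedded space:
    val df = sqlContext.sql("SELECT `Executor Info` FROM spark_logs")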

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-07 Thread Don Drake
Thanks Cheng, we have a workaround in place for Spark 1.3 (removing the .metadata directory); good to know it will be resolved in 1.4. -Don On Sun, Jun 7, 2015 at 8:51 AM, Cheng Lian lian.cs@gmail.com wrote: This issue has been fixed recently in Spark 1.4
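
For anyone hitting the same issue, a sketch of the workaround Don mentions, assuming the summary metadata lives in a .metadata directory next to the data files as described above; the table path is a placeholder:

    # Remove the stale summary metadata so Spark 1.3 re-reads the schema
    # from the data files themselves (path is hypothetical):
    hadoop fs -rm -r /user/hive/warehouse/my_table/.metadata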

Re: Not understanding manually building EC2 cluster

2015-06-07 Thread Akhil
- Remove localhost from the conf/slaves file and add the slaves' private IPs. - Make sure the master and slave machines are in the same security group (this way all ports will be accessible to all machines). - In the conf/spark-env.sh file, place export SPARK_MASTER_IP=MASTER-NODES-PUBLIC-OR-PRIVATE-IP and
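
A sketch of the two files Akhil refers to; the IP addresses are placeholders:

    # conf/slaves -- one worker per line, private IPs
    10.0.0.11
    10.0.0.12

    # conf/spark-env.sh -- on the master (and copied to the slaves)
    export SPARK_MASTER_IP=10.0.0.10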

Re: Monitoring Spark Jobs

2015-06-07 Thread Otis Gospodnetić
Hi Sam, Have a look at Sematext's SPM for your Spark monitoring needs. If the problem is CPU, IO, Network, etc. as Akhil mentioned, you'll see that in SPM, too. As for the number of jobs running, you can see a chart with that at http://sematext.com/spm/integrations/spark-monitoring.html Otis --

Re: Running SparkSql against Hive tables

2015-06-07 Thread Cheng Lian
On 6/6/15 9:06 AM, James Pirz wrote: I am pretty new to Spark; using Spark 1.3.1, I am trying to use Spark SQL to run some SQL scripts on the cluster. I realized that for better performance, it is a good idea to use Parquet files. I have 2 questions regarding that: 1) If I wanna

Re: hiveContext.sql NullPointerException

2015-06-07 Thread patcharee
Hi, How can I expect HiveContext to work on the executor? If only the driver can see HiveContext, does it mean I have to collect all datasets (very large) to the driver and use HiveContext there? It will overload memory on the driver and fail. BR, Patcharee On 07 June 2015 11:51,

Re: Caching parquet table (with GZIP) on Spark 1.3.1

2015-06-07 Thread Cheng Lian
Is it possible that some Parquet files of this data set have a different schema from the others? Especially the ones reported in the exception messages. One way to confirm this is to use parquet-tools [1] to inspect these files: $ parquet-schema path-to-file Cheng [1]:
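
For instance, one way to compare two part files with the parquet-tools command above; the paths are placeholders:

    $ parquet-schema /data/my_table/part-00000.parquet > schema-0.txt
    $ parquet-schema /data/my_table/part-00042.parquet > schema-42.txt
    $ diff schema-0.txt schema-42.txt   # any output means the schemas diverge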

Monitoring Spark Jobs

2015-06-07 Thread SamyaMaiti
Hi All, I have a Spark SQL application to fetch data from Hive; on top of it I have an Akka layer to run multiple queries in parallel. *Please suggest a mechanism to figure out the number of Spark jobs running in the cluster at a given instant in time.* I need to do the above as I see the

Re: Avro or Parquet ?

2015-06-07 Thread Cheng Lian
Usually Parquet can be more efficient because of its columnar nature. Say your table has 10 columns but your join query only touches 3 of them: Parquet only reads those 3 columns from disk, while Avro must load all the data. Cheng On 6/5/15 3:00 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: We currently have data in
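
To make the column-pruning point concrete, a sketch with invented table and column names; only the three referenced columns are materialized from a Parquet scan, while an Avro scan of the same tables must deserialize entire records:

    // Hypothetical 10-column tables t1 and t2; only t1.a, t2.b and the
    // join key are read from the Parquet files on disk:
    val joined = sqlContext.sql(
      "SELECT t1.a, t2.b FROM t1 JOIN t2 ON t1.k = t2.k")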

Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian
Spark SQL supports Hive dynamic partitioning, so one possible workaround is to create a Hive table partitioned by zone, z, year, and month dynamically, and then insert the whole dataset into it directly. In 1.4, we also provide dynamic partitioning support for non-Hive environments, and you
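
A minimal HiveQL sketch of that workaround, with invented table and column names, assuming dynamic partitioning is switched on for the session:

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    CREATE TABLE events_by_zone (payload STRING)
    PARTITIONED BY (zone STRING, z INT, year INT, month INT);

    -- Hive derives each row's partition from the trailing columns:
    INSERT OVERWRITE TABLE events_by_zone PARTITION (zone, z, year, month)
    SELECT payload, zone, z, year, month FROM events_raw;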

Re: Spark ML decision list

2015-06-07 Thread Debasish Das
What is a decision list? An in-order traversal (or some other traversal) of a fitted decision tree? On Jun 5, 2015 1:21 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Is there an existing way in SparkML to convert a decision tree to a decision list? On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh

Re: Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-07 Thread Cheng Lian
For the following code: val df = sqlContext.parquetFile(path) `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code: val cdf = df.cache() `cdf` is also columnar but that's different from Parquet. When a DataFrame is cached,

Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-07 Thread Cheng Lian
Were you using HiveContext.setConf()? dfs.replication is a Hadoop configuration, but setConf() is only used to set Spark SQL-specific configurations. You may set it in your Hadoop core-site.xml instead. Cheng On 6/2/15 2:28 PM, Haopu Wang wrote: Hi, I'm trying to save SparkSQL DataFrame
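
As a sketch of an alternative, the property can also be set on the SparkContext's Hadoop configuration before writing; the output path is a placeholder and the 1.3-era saveAsParquetFile API is assumed:

    // dfs.replication is an HDFS setting, so it goes on the Hadoop
    // configuration rather than through HiveContext.setConf():
    sc.hadoopConfiguration.set("dfs.replication", "2")
    df.saveAsParquetFile("hdfs:///tmp/output.parquet")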

Re: NullPointerException SQLConf.setConf

2015-06-07 Thread Cheng Lian
Are you calling hiveContext.sql within an RDD.map closure or something similar? If so, the call actually happens on the executor side. However, HiveContext only exists on the driver side. Cheng On 6/4/15 3:45 PM, patcharee wrote: Hi, I am using Hive 0.14 and Spark 0.13. I got

Re: Monitoring Spark Jobs

2015-06-07 Thread Akhil Das
It could be a CPU, IO, or network bottleneck; you need to figure out where exactly it's choking. You can use monitoring utilities (like top) to understand it better. Thanks Best Regards On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi All, I have a Spark

Re: Problem reading Parquet from 1.2 to 1.3

2015-06-07 Thread Cheng Lian
This issue has been fixed recently in Spark 1.4 https://github.com/apache/spark/pull/6581 Cheng On 6/5/15 12:38 AM, Marcelo Vanzin wrote: I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread Cheng Lian
Interesting, just posted on another thread asking exactly the same question :) My answer there quoted below: For the following code: val df = sqlContext.parquetFile(path) `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code:

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Cody Koeninger
What is the code used to set up the Kafka stream? On Sat, Jun 6, 2015 at 3:23 PM, EH eas...@gmail.com wrote: And here is the Thread Dump, where it seems every worker is waiting for Executor #6 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to complete: Thread 41:

Re: Accumulator map

2015-06-07 Thread Akhil Das
Another approach would be to use ZooKeeper. If you have ZooKeeper running somewhere in the cluster, you can simply create a path like */dynamic-list* in it and then write objects/values to it; you can even create/access nested objects. Thanks Best Regards On Fri, Jun 5, 2015 at 7:06 PM,
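
A rough sketch of that idea using the Apache Curator client; the connection string, znode path, and payload are all placeholders:

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.retry.ExponentialBackoffRetry

    // Point this at your ZooKeeper ensemble (placeholder host):
    val client = CuratorFrameworkFactory.newClient(
      "zk-host:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    // Append an entry under the shared dynamic-list path:
    client.create().creatingParentsIfNeeded()
      .forPath("/dynamic-list/item-1", "some value".getBytes("UTF-8"))

    // Any node in the cluster can read the list back:
    val entries = client.getChildren.forPath("/dynamic-list")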

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Akhil Das
Which consumer are you using? If you can paste the complete code then maybe I can try reproducing it. Thanks Best Regards On Sun, Jun 7, 2015 at 1:53 AM, EH eas...@gmail.com wrote: And here is the Thread Dump, where it seems every worker is waiting for Executor #6 Thread 95:

Driver crash at the end with InvocationTargetException when running SparkPi

2015-06-07 Thread Dong Lei
Hi Spark users: After I submitted a SparkPi job to Spark, the driver crashed at the end of the job with the following log: WARN EventLoggingListener: Event log dir file:/d:/data/SparkWorker/work/driver-20150607200517-0002/logs/event does not exists, will newly create one. Exception in thread

FlatMap in DataFrame

2015-06-07 Thread dimple
Hi, I'm trying to write a custom transformer in Spark ML, and since that uses DataFrames, I am trying to use the flatMap function of the DataFrame class in Java. Can you share a simple example of how to use the flatMap function to do a word count on a single column of the DataFrame? Thanks. Dimple -- View

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread kiran lonikar
Thanks for replying twice :) I think I sent this question by email and somehow thought I did not send it, hence created the other one on the web interface. Let's retain this thread since you have provided more details here. Great, it confirms my intuition about DataFrame. It's similar to Shark

Examples of flatMap in dataFrame

2015-06-07 Thread Dimp Bhat
Hi, I'm trying to write a custom transformer in Spark ML, and since that uses DataFrames, I am trying to use the flatMap function of the DataFrame class in Java. Can you share a simple example of how to use the flatMap function to do a word count on a single column of the DataFrame? Thanks Dimple

Optimization module in Python mllib

2015-06-07 Thread martingoodson
Am I right in thinking that Python MLlib does not contain the optimization module? Are there plans to add this to the Python API? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Optimization-module-in-Python-mllib-tp23191.html Sent from the Apache Spark

Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian
Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on the driver side. However, the closure passed to RDD.foreach is executed on the executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee
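
A before/after sketch of the pattern Cheng describes, with invented table and column names (assumes an existing SparkContext sc):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))

    // Fails: the closure ships to executors, where hiveContext is null.
    rdd.foreach { case (k, _) =>
      hiveContext.sql(s"SELECT * FROM some_table WHERE key = '$k'")
    }

    // Works: keep the SQL on the driver and register the data as a table.
    import hiveContext.implicits._
    rdd.toDF("key", "value").registerTempTable("staging")
    hiveContext.sql("INSERT INTO TABLE target SELECT key, value FROM staging")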