RE: Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread Wang, Daoyuan
You can use EXPLAIN EXTENDED SELECT …. From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Tuesday, May 05, 2015 9:52 AM To: Cheng, Hao; Olivier Girardot; user Subject: Re: RE: Re: Re: sparksql running slow while joining_2_tables. As I know, broadcast join is automatically enabled
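
A minimal spark-shell sketch of that suggestion (the tables and join column here are hypothetical):

    val plans = sqlContext.sql(
      "EXPLAIN EXTENDED SELECT a.id, b.name FROM a JOIN b ON a.id = b.id")
    plans.collect().foreach(println)  // each row is one line of the plan output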

RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Can you print out the physical plan? EXPLAIN SELECT xxx… From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Monday, May 4, 2015 9:08 PM To: Olivier Girardot; user Subject: Re: Re: sparksql running slow while joining 2 tables. hi Olivier spark1.3.1, with java1.8.0.45 and add 2 pics

sparksql support hive view

2015-05-04 Thread luohui20001
guys, just to confirm, does sparksql support the hive view feature? Is that the LateralView in the hive language manual? thanks Thanks & Best regards! 罗辉 San.Luo

Re: sparksql support hive view

2015-05-04 Thread Michael Armbrust
to confirm, sparksql support hive feature view, is that the one LateralView https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView in hive language manual? thanks Thanks & Best regards! 罗辉 San.Luo

Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread Cheng, Hao
I assume you're using the DataFrame API within your application. sql("SELECT…").explain(true) From: Wang, Daoyuan Sent: Tuesday, May 5, 2015 10:16 AM To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user Subject: RE: Re: RE: Re: Re: sparksql running slow while joining_2_tables. You can use
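
A hedged sketch of the DataFrame-side equivalent (the query itself is hypothetical):

    val df = sqlContext.sql("SELECT a.id, b.name FROM a JOIN b ON a.id = b.id")
    df.explain(true)  // prints the parsed, analyzed, optimized, and physical plans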

Re: SparkSQL Nested structure

2015-05-04 Thread Michael Armbrust
You are looking for LATERAL VIEW explode https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode in HiveQL. On Mon, May 4, 2015 at 7:49 AM, Giovanni Paolo Gibilisco gibb...@gmail.com wrote: Hi, I'm trying to parse log files generated by Spark using SparkSQL
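
A minimal HiveQL sketch, assuming a HiveContext bound to hiveContext and a hypothetical table logs with an array column events:

    hiveContext.sql(
      "SELECT id, event FROM logs LATERAL VIEW explode(events) t AS event")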

RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Or, have you ever tried a broadcast join? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, May 5, 2015 8:33 AM To: luohui20...@sina.com; Olivier Girardot; user Subject: RE: Re: Re: sparksql running slow while joining 2 tables. Can you print out the physical plan? EXPLAIN SELECT xxx
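
One hedged way to encourage a broadcast join in this era of Spark; the threshold value is illustrative:

    // Tables whose estimated size (in bytes) is below this threshold are
    // broadcast to all executors instead of shuffled:
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
      (50 * 1024 * 1024).toString)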

Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread luohui20001
Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Cheng, Hao hao.ch...@intel.com To: Cheng, Hao hao.ch...@intel.com, luohui20...@sina.com luohui20...@sina.com, Olivier Girardot ssab...@gmail.com, user user@spark.apache.org Subject: RE: Re: sparksql running

Re: sparksql - HiveConf not found during task deserialization

2015-04-29 Thread Manku Timma
The issue is solved. There was a problem in my hive codebase. Once that was fixed, -Phive-provided spark is working fine against my hive jars. On 27 April 2015 at 08:00, Manku Timma manku.tim...@gmail.com wrote: Made some progress on this. Adding hive jars to the system classpath is needed.

Automatic Cache in SparkSQL

2015-04-27 Thread Wenlei Xie
Hi, I am trying to answer a simple query with SparkSQL over a Parquet file. When executing the query several times, the first run takes about 2s while later runs take 0.1s. By looking at the log file it seems the later runs don't load the data from disk. However, I didn't enable

Re: Understand the running time of SparkSQL queries

2015-04-27 Thread Akhil Das
Isn't it already available on the driver UI (that runs on 4040)? Thanks Best Regards On Mon, Apr 27, 2015 at 9:55 AM, Wenlei Xie wenlei@gmail.com wrote: Hi, I am wondering how should we understand the running time of SparkSQL queries? For example the physical query plan and the running

Re: Automatic Cache in SparkSQL

2015-04-27 Thread ayan guha
storage. Note if caching is done by spark it may be transient. On 28 Apr 2015 08:00, Wenlei Xie wenlei@gmail.com wrote: Hi, I am trying to answer a simple query with SparkSQL over the Parquet file. When execute the query several times, the first run will take about 2s while the later run

Re: sparksql - HiveConf not found during task deserialization

2015-04-26 Thread Manku Timma
Made some progress on this. Adding hive jars to the system classpath is needed. But looks like it needs to be towards the end of the system classes. Manually adding the hive classpath into Client.populateHadoopClasspath solved the issue. But a new issue has come up. It looks like some hive

Understand the running time of SparkSQL queries

2015-04-26 Thread Wenlei Xie
Hi, I am wondering how we should understand the running time of SparkSQL queries, for example the physical query plan and the running time of each stage? Is there any guide talking about this? Thank you! Best, Wenlei

Re: Creating a Row in SparkSQL 1.2 from ArrayList

2015-04-24 Thread Wenlei Xie
Use Object[] in Java just works :). On Fri, Apr 24, 2015 at 4:56 PM, Wenlei Xie wenlei@gmail.com wrote: Hi, I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java by using a List? It looks like ArrayList<Object> something; Row.create(something) will create a row

Re: sparksql - HiveConf not found during task deserialization

2015-04-24 Thread Manku Timma
Setting SPARK_CLASSPATH is triggering other errors. Not working. On 25 April 2015 at 09:16, Manku Timma manku.tim...@gmail.com wrote: Actually found the culprit. The JavaSerializerInstance.deserialize is called with a classloader (of type MutableURLClassLoader) which has access to all the

Re: sparksql - HiveConf not found during task deserialization

2015-04-24 Thread Manku Timma
Actually found the culprit. The JavaSerializerInstance.deserialize is called with a classloader (of type MutableURLClassLoader) which has access to all the hive classes. But internally it triggers a call to loadClass but with the default classloader. Below is the stacktrace (line numbers in the

Creating a Row in SparkSQL 1.2 from ArrayList

2015-04-24 Thread Wenlei Xie
Hi, I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java by using a List? It looks like ArrayList<Object> something; Row.create(something) will create a row with a single column (and the single column contains the array) Best, Wenlei

Re: SparkSQL performance

2015-04-22 Thread Michael Armbrust
:18 GMT+02:00 Michael Armbrust mich...@databricks.com: There is a cost to converting from JavaBeans to Rows and this code path has not been optimized. That is likely what you are seeing. On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote: SparkSQL optimizes better by column

Re: sparksql - HiveConf not found during task deserialization

2015-04-22 Thread Akhil Das
I see. Now try a slightly tricky approach: add the hive jar to the SPARK_CLASSPATH (in the conf/spark-env.sh file on all machines) and make sure that jar is available on all the machines in the cluster at the same path. Thanks Best Regards On Wed, Apr 22, 2015 at 11:24 AM, Manku Timma

Re: SparkSQL performance

2015-04-21 Thread Michael Armbrust
has not been optimized. That is likely what you are seeing. On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote: SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are not taking advantage of either. I am curious to know what goes in your

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
optimized. That is likely what you are seeing. On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote: SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are not taking advantage of either. I am curious to know what goes in your filter function

Re: sparksql - HiveConf not found during task deserialization

2015-04-21 Thread Manku Timma
Akhil, Thanks for the suggestions. I tried out sc.addJar, --jars, --conf spark.executor.extraClassPath and none of them helped. I added stuff into compute-classpath.sh. That did not change anything. I checked the classpath of the running executor and made sure that the hive jars are in that dir.

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
optimized. That is likely what you are seeing. On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote: SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are not taking advantage of either. I am curious to know what goes in your filter function

Re: SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
wondering why there is such a big gap in performance if it is just a filter. Internally, the relation files are mapped to a JavaBean. Could this different data representation (JavaBeans vs SparkSQL internal representation) lead to such a difference? Is there anything I could do to make the performance

Re: SparkSQL performance

2015-04-20 Thread ayan guha
SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are not taking advantage of either. I am curious to know what goes in your filter function, as you are not using a filter on the SQL side. Best Ayan On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo
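
A hedged sketch of moving the predicate to the DataFrame side so Catalyst can apply both optimizations (the Parquet path and column names are hypothetical):

    val people = sqlContext.parquetFile("people.parquet")
    // Expressed this way, Catalyst can prune to the needed columns and
    // push the predicate down into the Parquet reader:
    people.filter(people("age") > 21).select("name").explain(true)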

Re: SparkSQL performance

2015-04-20 Thread Michael Armbrust
There is a cost to converting from JavaBeans to Rows and this code path has not been optimized. That is likely what you are seeing. On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote: SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you

SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
presentation (JavaBeans vs SparkSQL internal representation) could lead to such difference? Is there anything I could do to make the performance get closer to the hard-coded option? Thanks in advance for any suggestions or ideas. Renato M.

Re: sparksql - HiveConf not found during task deserialization

2015-04-20 Thread Akhil Das
Can you try sc.addJar("/path/to/your/hive/jar"), i think it will resolve it. Thanks Best Regards On Mon, Apr 20, 2015 at 12:26 PM, Manku Timma manku.tim...@gmail.com wrote: Akhil, But the first case of creating HiveConf on the executor works fine (map case). Only the second case fails. I was

sparksql - HiveConf not found during task deserialization

2015-04-20 Thread Manku Timma
I am using spark-1.3 with hadoop-provided and hive-provided and hive-0.13.1 profiles. I am running a simple spark job on a yarn cluster by adding all hadoop2 and hive13 jars to the spark classpaths. If I remove the hive-provided while building spark, I don't face any issue. But with hive-provided

Re: sparksql - HiveConf not found during task deserialization

2015-04-20 Thread Akhil Das
Looks like a missing jar, try to print the classpath and make sure the hive jar is present. Thanks Best Regards On Mon, Apr 20, 2015 at 11:52 AM, Manku Timma manku.tim...@gmail.com wrote: I am using spark-1.3 with hadoop-provided and hive-provided and hive-0.13.1 profiles. I am running a

Re: sparksql - HiveConf not found during task deserialization

2015-04-20 Thread Manku Timma
Akhil, But the first case of creating HiveConf on the executor works fine (map case). Only the second case fails. I was suspecting some foul play with classloaders. On 20 April 2015 at 12:20, Akhil Das ak...@sigmoidanalytics.com wrote: Looks like a missing jar, try to print the classpath and

Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Using Spark 1.2.0. Tried to register an RDD and got: scala.MatchError: class java.util.Date (of class java.lang.Class) I see it was resolved in https://issues.apache.org/jira/browse/SPARK-2562 (included in 1.2.0) Anyone encountered this issue? Thanks, Lior
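
A hedged sketch of the common workaround, since Spark SQL's DateType maps to java.sql.Date rather than java.util.Date (the class and field names are hypothetical):

    case class Event(name: String, day: java.sql.Date)
    val events = sc.parallelize(Seq(
      Event("login", new java.sql.Date(System.currentTimeMillis()))))
    import sqlContext.createSchemaRDD  // Spark 1.2 implicit RDD-to-SchemaRDD conversion
    events.registerTempTable("events") // now queryable via sqlContext.sql(...)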

Re: Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Here's a code example: public class DateSparkSQLExample { public static void main(String[] args) { SparkConf conf = new SparkConf().setAppName("test").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); List<SomeObject> itemsList =

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-16 Thread Nathan McCarthy
...@quantium.com.au Cc: user@spark.apache.org Subject: Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0 Can you provide the JDBC connector jar version? Possibly the full JAR name

RE: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Nathan McCarthy
: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0 Can you provide your spark version? Thanks, Daoyuan From: Nathan McCarthy [mailto:nathan.mccar...@quantium.com.au] Sent: Wednesday, April 15, 2015 1:57 PM To: Nathan McCarthy; user@spark.apache.org Subject: Re: SparkSQL JDBC

[SparkSQL; Thriftserver] Help tracking missing 5 minutes

2015-04-15 Thread Yana Kadiyska
Hi Spark users, Trying to upgrade to Spark 1.2 and running into the following: seeing some very slow queries and wondering if someone can point me in the right direction for debugging. My Spark UI shows a job with duration 15s (see attached screenshot). Which would be great, but client side

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Nathan McCarthy
nathan.mccar...@quantium.com.au Date: Wednesday, 15 April 2015 11:49 pm To: Wang, Daoyuan daoyuan.w...@intel.com, user@spark.apache.org Subject: RE: SparkSQL

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread ๏̯͡๏
nathan.mccar...@quantium.com.au Date: Wednesday, 15 April 2015 1:57 pm To: user@spark.apache.org Subject: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0 Hi guys, Trying to use a Spark SQL context's .load("jdbc", …) method to create a DF from a JDBC data

RE: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-15 Thread Wang, Daoyuan
Can you provide your spark version? Thanks, Daoyuan From: Nathan McCarthy [mailto:nathan.mccar...@quantium.com.au] Sent: Wednesday, April 15, 2015 1:57 PM To: Nathan McCarthy; user@spark.apache.org Subject: Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0 Just an update

Re: SparkSQL + Parquet performance

2015-04-14 Thread Akhil Das
, Paolo Platter paolo.plat...@agilelab.it wrote: Hi all, is there anyone using SparkSQL + Parquet that has made a benchmark about storing parquet files on HDFS or on CFS (Cassandra File System)? What storage can improve performance of SparkSQL + Parquet? Thanks Paolo

SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-14 Thread Nathan McCarthy
Hi guys, Trying to use a Spark SQL context's .load("jdbc", …) method to create a DF from a JDBC data source. All seems to work well locally (master = local[*]), however as soon as we try and run on YARN we have problems. We seem to be running into problems with the class path and loading up the
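
For reference, a hedged sketch of the call in question (the URL, table, and driver values are hypothetical placeholders):

    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=me&password=secret",
      "dbtable" -> "orders",
      "driver"  -> "com.mysql.jdbc.Driver"))
    df.count()  // running an action also exercises the driver class on the executors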

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-14 Thread Nathan McCarthy
Subject: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0 Hi guys, Trying to use a Spark SQL context's .load("jdbc", …) method to create a DF from a JDBC data source. All seems to work well locally (master = local[*]), however as soon as we try and run on YARN we have problems. We

Re: The difference between SparkSql/DataFrame join and Rdd join

2015-04-08 Thread Hao Ren
finished, while the DF/SQL approach doesn't. Any idea? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join-and-Rdd-join-tp22407.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: The difference between SparkSql/DataFrame join and Rdd join

2015-04-08 Thread Michael Armbrust
, join, and then apply a new schema on the result RDD. This approach works, at least all tasks were finished, while the DF/SQL approach doesn't. Any idea? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join

The difference between SparkSql/DataFrame join and Rdd join

2015-04-07 Thread Hao Ren
, at least all tasks were finished, while the DF/SQL approach doesn't. Any idea? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join-and-Rdd-join-tp22407.html Sent from the Apache Spark User List mailing list

Re: The difference between SparkSql/DataFrame join and Rdd join

2015-04-07 Thread Michael Armbrust
a new schema on the result RDD. This approach works, at least all tasks were finished, while the DF/SQL approach doesn't. Any idea? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-differentce-between-SparkSql-DataFram-join-and-Rdd-join-tp22407.html Sent

How to get SparkSql results on a webpage on real time

2015-04-07 Thread Mukund Ranjan (muranjan)
Hi, I have written a scala object which can run queries on the messages I am receiving from Kafka. Now I have to show the results on some webpage or dashboard which can auto-refresh with new results. Any pointer on how I can do that? Thanks, Mukund

SparkSQL + Parquet performance

2015-04-06 Thread Paolo Platter
Hi all, is there anyone using SparkSQL + Parquet who has made a benchmark of storing parquet files on HDFS vs CFS (Cassandra File System)? Which storage can improve performance of SparkSQL + Parquet? Thanks Paolo

RE: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Michael, thanks for the response and looking forward to trying 1.3.1. From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 03, 2015 6:52 AM To: Haopu Wang Cc: user Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Dean Wampler
It failed to find the class org.apache.spark.sql.catalyst.ScalaReflection in the Spark SQL library. Make sure it's in the classpath and the version is correct, too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)

Error in SparkSQL/Scala IDE

2015-04-02 Thread Sathish Kumaran Vairavelu
Hi Everyone, I am getting the following error while registering a table using the Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1. import sqlContext.createSchemaRDD val empFile = sc.textFile("/tmp/emp.csv", 4).map(_.split(","))

[SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Hi, I want to rename an aggregation field using DataFrame API. The aggregation is done on a nested field. But I got below exception. Do you see the same issue and any workaround? Thank you very much! == Exception in thread main org.apache.spark.sql.AnalysisException: Cannot resolve

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Michael Armbrust
This is actually a problem with our use of Scala's reflection library. Unfortunately you need to load Spark SQL using the primordial classloader, otherwise you run into this problem. If anyone from the scala side can hint how we can tell scala.reflect which classloader to use when creating the

RE: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Felix Cheung
This is tracked by these JIRAs: https://issues.apache.org/jira/browse/SPARK-5947 https://issues.apache.org/jira/browse/SPARK-5948 From: denny.g@gmail.com Date: Wed, 1 Apr 2015 04:35:08 + Subject: Creating Partitioned Parquet Tables via SparkSQL To: user@spark.apache.org Creating

Re: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Denny Lee
: Wed, 1 Apr 2015 04:35:08 + Subject: Creating Partitioned Parquet Tables via SparkSQL To: user@spark.apache.org Creating Parquet tables via .saveAsTable is great but was wondering if there was an equivalent way to create partitioned parquet tables. Thanks!

Re: SparkSQL - Caching RDDs

2015-04-01 Thread Michael Armbrust
...@centurylink.com wrote: I am trying to integrate SparkSQL with a BI tool. My requirement is to query a Hive table very frequently from the BI tool. Is there a way to cache the Hive Table permanently in SparkSQL? I don't want to read the Hive table and cache it every time the query

SparkSQL - Caching RDDs

2015-04-01 Thread Venkat, Ankam
I am trying to integrate SparkSQL with a BI tool. My requirement is to query a Hive table very frequently from the BI tool. Is there a way to cache the Hive table permanently in SparkSQL? I don't want to read the Hive table and cache it every time the query is submitted from the BI tool. Thanks
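
One hedged approach is to cache the table once inside a long-running context such as the thrift server; "permanent" then means the lifetime of that server process (the table name is hypothetical):

    hiveContext.sql("CACHE TABLE my_hive_table")    // cache once
    // ...subsequent queries hit the in-memory columnar cache...
    hiveContext.sql("UNCACHE TABLE my_hive_table")  // release when done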

RE: SparkSql - java.util.NoSuchElementException: key not found: node when accessing JSON Array

2015-03-31 Thread java8964
You can use the HiveContext instead of SQLContext; it should support all of HiveQL, including LATERAL VIEW explode. SQLContext does not support that yet. BTW, nice coding format in the email. Yong Date: Tue, 31 Mar 2015 18:18:19 -0400 Subject: Re: SparkSql - java.util.NoSuchElementException
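
A hedged sketch of the suggested swap (the table and column names are hypothetical):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // HiveContext parses HiveQL, so LATERAL VIEW explode is accepted:
    hiveContext.sql(
      "SELECT doc_id, node FROM docs LATERAL VIEW explode(nodes) n AS node")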

SparkSql - java.util.NoSuchElementException: key not found: node when accessing JSON Array

2015-03-31 Thread Todd Nist
I am accessing ElasticSearch via elasticsearch-hadoop and attempting to expose it via SparkSQL. I am using spark 1.2.1, the latest supported by elasticsearch-hadoop, and "org.elasticsearch" % "elasticsearch-hadoop" % "2.1.0.BUILD-SNAPSHOT" of elasticsearch-hadoop. I'm encountering an issue when I

Re: SparkSql - java.util.NoSuchElementException: key not found: node when accessing JSON Array

2015-03-31 Thread Todd Nist
at 3:26 PM, Todd Nist tsind...@gmail.com wrote: I am accessing ElasticSearch via the elasticsearch-hadoop and attempting to expose it via SparkSQL. I am using spark 1.2.1, latest supported by elasticsearch-hadoop, and org.elasticsearch % elasticsearch-hadoop % 2.1.0.BUILD-SNAPSHOT of elasticsearch

Re: SparkSQL Timestamp query failure

2015-03-30 Thread anu
://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p22292.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
I'm using 1.0.4 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote: Hm, which version of Hadoop are you using? Actually there should also be a _metadata file together with _common_metadata. I was using Hadoop 2.4.1 btw. I'm not sure whether Hadoop

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Cheng Lian
Thanks for the information. Verified that the _common_metadata and _metadata files are missing in this case when using Hadoop 1.0.4. Would you mind opening a JIRA for this? Cheng On 3/27/15 2:40 PM, Pei-Lun Lee wrote: I'm using 1.0.4 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 2:32 PM, Cheng

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
JIRA ticket created at: https://issues.apache.org/jira/browse/SPARK-6581 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 7:03 PM, Cheng Lian lian.cs@gmail.com wrote: Thanks for the information. Verified that the _common_metadata and _metadata file are missing in this case when using Hadoop

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-26 Thread ๏̯͡๏
, 2015 1:44 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Registering custom UDAFs with HiveContext in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function) not UDTF (user defined type function). I appreciate if you could point me

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Cheng Lian
I couldn't reproduce this with the following spark-shell snippet:
    scala> import sqlContext.implicits._
    scala> Seq((1, 2)).toDF("a", "b")
    scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
    scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
The _common_metadata file is

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
Hi Cheng, on my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) produces:
    peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
    total 32
    -rwxrwxrwx 1 peilunlee staff   0 Mar 27 11:29 _SUCCESS*
    -rwxrwxrwx 1 peilunlee staff 272 Mar 27 11:29

Different behaviour of DateType in SparkSQL between 1.2 and 1.3

2015-03-26 Thread Wush Wu
Dear all, I am trying to upgrade Spark from 1.2 to 1.3 and switch the existing API for creating a SchemaRDD over to DataFrame. After testing, I noticed that the following behavior has changed: ``` import java.sql.Date import com.bridgewell.SparkTestUtils import org.apache.spark.rdd.RDD import

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-26 Thread Takeshi Yamamuro
/spark/pull/3247 From: shahab [mailto:shahab.mok...@gmail.com] Sent: Wednesday, March 11, 2015 1:44 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Registering custom UDAFs with HiveContext in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function) not UDTF (user defined

[SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Haopu Wang
Hi, I have a DataFrame object and I want to do types of aggregations like count, sum, variance, stddev, etc. DataFrame has DSL to do simple aggregations like count and sum. How about variance and stddev? Thank you for any suggestions!

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Corey Nolet
I would do sum of squares. This would allow you to keep an ongoing value as an associative operation (in an aggregator) and then calculate the variance and std deviation after the fact. On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang hw...@qilinsoft.com wrote: Hi, I have a DataFrame object and I
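
A hedged sketch of that idea with the Spark 1.3 DataFrame DSL, assuming a hypothetical DataFrame df with a Double column "x":

    import org.apache.spark.sql.functions._
    // One aggregation pass collects count, sum, and sum of squares:
    val r = df.agg(count(df("x")), sum(df("x")), sum(df("x") * df("x"))).first()
    val (n, s, sq) = (r.getLong(0).toDouble, r.getDouble(1), r.getDouble(2))
    val variance = sq / n - (s / n) * (s / n)  // population variance
    val stddev   = math.sqrt(variance)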

SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi, When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not. Is this expected behavior? And what is the benefit of _common_metadata? Will reading perform better when it is present? Thanks, -- Pei-Lun

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Denny Lee
Perhaps this email reference may be able to help from a DataFrame perspective: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang hw...@qilinsoft.com wrote:

Re: SparkSQL UDTs with Ordering

2015-03-24 Thread Patrick Woody
Awesome. yep - I have seen the warnings on UDTs, happy to keep up with the API changes :). Would this be a reasonable PR to toss up despite the API unstableness or would you prefer it to wait? Thanks -Pat On Tue, Mar 24, 2015 at 7:44 PM, Michael Armbrust mich...@databricks.com wrote: I'll

Re: SparkSQL UDTs with Ordering

2015-03-24 Thread Michael Armbrust
I'll caution that the UDTs are not a stable public interface yet. We'd like to do this someday, but currently this feature is mostly for MLlib as we have not finalized the API. Having an ordering could be useful, but I'll add that currently UDTs actually exist in serialized form so the ordering

SparkSQL UDTs with Ordering

2015-03-24 Thread Patrick Woody
Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-24 Thread Jon Chase
UDAFs with HiveContext in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function) not UDTF (user defined type function). I appreciate if you could point me to some starting point on UDAF development in Spark. Thanks Shahab On Tuesday, March

Re: Converting SparkSQL query to Scala query

2015-03-23 Thread Dean Wampler
...@yahoo.com wrote: I have a complex SparkSQL query of the nature select a.a, b.b, c.c from a,b,c where a.x = b.x and b.y = c.y How do I convert this efficiently into scala query of a.join(b,..,..) and so on. Can anyone help me with this? If my question needs more clarification, please

Converting SparkSQL query to Scala query

2015-03-23 Thread nishitd
I have a complex SparkSQL query of the nature select a.a, b.b, c.c from a,b,c where a.x = b.x and b.y = c.y How do I convert this efficiently into a Scala query of a.join(b,..,..) and so on. Can anyone help me with this? If my question needs more clarification, please let me know. -- View
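
A hedged sketch of the equivalent chained joins in the DataFrame API (Spark 1.3 syntax, assuming a, b, and c are DataFrames with the columns used below):

    val result = a.join(b, a("x") === b("x"))
                  .join(c, b("y") === c("y"))
                  .select(a("a"), b("b"), c("c"))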

how to cache table with OFF_HEAP storage level in SparkSQL thriftserver

2015-03-23 Thread LiuZeshan
Hi all: I have a Spark on YARN cluster (spark-1.3.0, hadoop-2.2.0) with hive-0.12.0 and tachyon-0.6.1. I start the SparkSQL thriftserver with start-thriftserver.sh and use beeline to connect to the thriftserver according to the Spark documentation. My question is: how to cache a table with a specified

RE: configure number of cached partition in memory on SparkSQL

2015-03-19 Thread Judy Nash
; user@spark.apache.org Subject: Re: configure number of cached partition in memory on SparkSQL Hi Judy, In the case of HadoopRDD and NewHadoopRDD, partition number is actually decided by the InputFormat used. And spark.sql.inMemoryColumnarStorage.batchSize is not related to partition number

Re: SparkSQL 1.3.0 JDBC data source issues

2015-03-19 Thread Pei-Lun Lee
JIRA and PR for first issue: https://issues.apache.org/jira/browse/SPARK-6408 https://github.com/apache/spark/pull/5087 On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am trying jdbc data source in spark sql 1.3.0 and found some issues. First, the syntax where

Re: HIVE SparkSQL

2015-03-18 Thread 宫勐
Hi: I need to count some game player events in the game, such as:
- How many players stay in game scene 1 (Save the Princess from a Dragon)
- Money they have paid in the last 5 min
- How many players pay money to go through this scene much more

sparksql native jdbc driver

2015-03-18 Thread sequoiadb
hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift server, is this correct? Thanks

Re: sparksql native jdbc driver

2015-03-18 Thread Arush Kharbanda
Yes, I have been using Spark SQL from the onset. Haven't found any other server for Spark SQL JDBC connectivity. On Wed, Mar 18, 2015 at 5:50 PM, sequoiadb mailing-list-r...@sequoiadb.com wrote: hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift

Re: sparksql native jdbc driver

2015-03-18 Thread Cheng Lian
Yes On 3/18/15 8:20 PM, sequoiadb wrote: hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift server, is this correct? Thanks

SparkSQL 1.3.0 JDBC data source issues

2015-03-18 Thread Pei-Lun Lee
Hi, I am trying the jdbc data source in spark sql 1.3.0 and found some issues. First, the syntax where str_col='value' will give an error for both postgresql and mysql:
    psql> create table foo(id int primary key, name text, age int);
    bash$ SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar

Re: HIVE SparkSQL

2015-03-18 Thread Jörn Franke
Hello, Depending on your needs, search technology such as SolrCloud or ElasticSearch may make more sense. If you go for the Cassandra solution you can use the Lucene text indexer... I am not sure if Hive or SparkSQL are very suitable for text. However, if you do not need text search then feel free

HIVE SparkSQL

2015-03-17 Thread 宫勐
Hi: I need to migrate a log analysis system from mysql + some C++ real-time compute framework to the Hadoop ecosystem. When I want to build a data warehouse, I don't know which one is the right choice: Cassandra? HIVE? Or just SparkSQL? There are few benchmarks for these systems. My

Re: configure number of cached partition in memory on SparkSQL

2015-03-16 Thread Cheng Lian
Hi Judy, In the case of HadoopRDD and NewHadoopRDD, partition number is actually decided by the InputFormat used. And spark.sql.inMemoryColumnarStorage.batchSize is not related to partition number; it controls the in-memory columnar batch size within a single partition. Also, what
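
A hedged sketch separating the two knobs (the DataFrame, table name, and values are illustrative):

    // Columnar batch size within each cached partition:
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
    // Partition count is controlled separately, e.g. by repartitioning before caching:
    df.repartition(16).registerTempTable("t")
    sqlContext.cacheTable("t")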

Re: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-11 Thread Michael Armbrust
) -- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Wednesday, March 11, 2015 8:25 AM To: Haopu Wang; user; d...@spark.apache.org Subject: RE: [SparkSQL] Reuse HiveContext to different Hive warehouse? I am not so sure if Hive supports changing the metastore after

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-11 Thread Haopu Wang
) From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Wednesday, March 11, 2015 8:25 AM To: Haopu Wang; user; d...@spark.apache.org Subject: RE: [SparkSQL] Reuse HiveContext to different Hive warehouse? I am not so sure if Hive supports changing the metastore after initialization; I guess

Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-10 Thread shahab
Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as a function in HiveContext, I could not find any documentation of how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of a developed UDAF class
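
One hedged route is Hive-style registration through a HiveContext (the jar path, function name, class, and table are hypothetical):

    hiveContext.sql("ADD JAR /path/to/my-udafs.jar")
    hiveContext.sql(
      "CREATE TEMPORARY FUNCTION my_avg AS 'com.example.hive.MyAvgUDAF'")
    hiveContext.sql("SELECT my_avg(value) FROM metrics")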

[SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
I'm using Spark 1.3.0 RC3 built with Hive support. In Spark Shell, I want to reuse the HiveContext instance with different warehouse locations. Below are the steps for my test (assume I have loaded a file into table src). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with

RE: Does anyone know how to deploy a custom UDAF jar file in SparkSQL?

2015-03-10 Thread Cheng, Hao
anyone know how to deploy a custom UDAF jar file in SparkSQL? Hi, Does anyone know how to deploy a custom UDAF jar file in SparkSQL? Where should I put the jar file so SparkSQL can pick it up and make it accessible for SparkSQL applications? I do not use spark-shell; instead I want to use

RE: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-10 Thread Cheng, Hao
: Tuesday, March 10, 2015 5:44 PM To: user@spark.apache.org Subject: Registering custom UDAFs with HiveContext in SparkSQL, how? Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as a function in HiveContext, I could not find any documentation of how UDAFs

Does anyone know how to deploy a custom UDAF jar file in SparkSQL?

2015-03-10 Thread shahab
Hi, Does anyone know how to deploy a custom UDAF jar file in SparkSQL? Where should I put the jar file so SparkSQL can pick it up and make it accessible for SparkSQL applications? I do not use spark-shell; instead I want to use it in a Spark application. best, /Shahab

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-10 Thread shahab
Sent: Tuesday, March 10, 2015 5:44 PM To: user@spark.apache.org Subject: Registering custom UDAFs with HiveContext in SparkSQL, how? Hi, I need to develop a couple of UDAFs and use them

RE: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-10 Thread Cheng, Hao
/pull/3247 From: shahab [mailto:shahab.mok...@gmail.com] Sent: Wednesday, March 11, 2015 1:44 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Registering custom UDAFs with HiveContext in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function) not UDTF
