Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread Yin Huai
Hello Arthur, You can do aggregations in SQL. How did you create LINEITEM? Thanks, Yin On Thu, Oct 23, 2014 at 8:54 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I got $TreeNodeException, few questions: Q1) How should I do aggregation in Spark? Can I use
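A minimal sketch of such a SQL aggregation, assuming LINEITEM follows the TPC-H schema and has already been registered as a table (the column names here are assumptions, not taken from the thread):

    // Assumes `sc` is the SparkContext from the spark shell.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val totals = sqlContext.sql(
      "SELECT L_RETURNFLAG, SUM(L_QUANTITY) FROM LINEITEM GROUP BY L_RETURNFLAG")
    totals.collect().foreach(println)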

Re: SchemaRDD Convert

2014-10-22 Thread Yin Huai
The implicit conversion function mentioned by Hao is createSchemaRDD in SQLContext/HiveContext. You can import it by doing val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Or new org.apache.spark.sql.hive.HiveContext(sc) for HiveContext import sqlContext.createSchemaRDD On Wed, Oct
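A self-contained sketch of how that import is typically used (the Person case class and the data are illustrative; registerTempTable is the Spark 1.1 name, registerAsTable on 1.0.x):

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Product] => SchemaRDD

    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
    people.registerTempTable("people") // the implicit conversion applies here
    sqlContext.sql("SELECT name FROM people WHERE age > 26").collect()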

Re: spark sql: sqlContext.jsonFile date type detection and performance

2014-10-21 Thread Yin Huai
Are there any specific issues you are facing? Thanks, Yin On Tue, Oct 21, 2014 at 4:00 PM, tridib tridib.sama...@live.com wrote: Any help? or comments? -- View this message in context:

Re: Spark SQL : sqlContext.jsonFile date type detection and performance

2014-10-21 Thread Yin Huai
that may help. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val schemaRDD = hiveContext.jsonFile(...) schemaRDD.registerTempTable("jsonTable") hiveContext.sql("SELECT CAST(columnName AS DATE) FROM jsonTable") Thanks, Yin On Tue, Oct 21, 2014 at 8:00 PM, Yin Huai huaiyin

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Yin Huai
Hi Tridib, For the second approach, can you attach the complete stack trace? Thanks, Yin On Mon, Oct 20, 2014 at 8:24 PM, Michael Armbrust mich...@databricks.com wrote: I think you are running into a bug that will be fixed by this PR: https://github.com/apache/spark/pull/2850 On Mon, Oct

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Yin Huai
/jira/browse/SPARK-4003 You can check PR https://github.com/apache/spark/pull/2850 . Thanks, Daoyuan *From:* Yin Huai [mailto:huaiyin@gmail.com] *Sent:* Tuesday, October 21, 2014 10:00 AM *To:* Michael Armbrust *Cc:* tridib; u...@spark.incubator.apache.org *Subject:* Re: spark

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-16 Thread Yin Huai
by the 2 partition columns, coll_def_id and seg_def_id. Output shows 29 rows, but that looks like it’s just counting the rows in the console output. Let me know if you need more information. Thanks -Terry From: Yin Huai huaiyin@gmail.com Date: Tuesday, October 14, 2014 at 6:29 PM

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Yin Huai
Hello Terry, How many columns does pqt_rdt_snappy have? Thanks, Yin On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu terry@smartfocus.com wrote: Hi Michael, That worked for me. At least I’m now further than I was. Thanks for the tip! -Terry From: Michael Armbrust

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-13 Thread Yin Huai
Question 1: Please check http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#hive-tables. Question 2: One workaround is to re-write it. You can use LEFT SEMI JOIN to implement the subquery with EXISTS and use LEFT OUTER JOIN + IS NULL to implement the subquery with NOT EXISTS. SELECT
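A sketch of the two rewrites with illustrative table and key names (the thread's own example is truncated above; assumes `hiveContext` is a HiveContext):

    // WHERE EXISTS (...) rewritten as LEFT SEMI JOIN:
    hiveContext.sql("SELECT a.* FROM a LEFT SEMI JOIN b ON a.id = b.id")

    // WHERE NOT EXISTS (...) rewritten as LEFT OUTER JOIN + IS NULL:
    hiveContext.sql(
      "SELECT a.* FROM a LEFT OUTER JOIN b ON a.id = b.id WHERE b.id IS NULL")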

Re: Is Array Of Struct supported in json RDDs? is it possible to query this?

2014-10-13 Thread Yin Huai
If you are using HiveContext, it should work in 1.1. Thanks, Yin On Mon, Oct 13, 2014 at 5:08 AM, shahab shahab.mok...@gmail.com wrote: Hello, Given the following structure, is it possible to query, e.g. session[0].id ? In general, is it possible to query Array Of Struct in json RDDs?
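A minimal sketch of querying an array of structs through HiveContext on 1.1 (the field names follow the question; the JSON line is illustrative):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val json = sc.parallelize(Seq("""{"session":[{"id":1},{"id":2}]}"""))
    hiveContext.jsonRDD(json).registerTempTable("events")
    hiveContext.sql("SELECT session[0].id FROM events").collect()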

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread Yin Huai
Hi Shahab, Can you try to use HiveContext? It should work in 1.1. For SQLContext, this issue was not fixed in 1.1 and you need to use the master branch at the moment. Thanks, Yin On Sun, Oct 12, 2014 at 5:20 PM, shahab shahab.mok...@gmail.com wrote: Hi, Apparently it is possible to query

Re: Spark SQL parser bug?

2014-10-13 Thread Yin Huai
Seems the wrong results you got were caused by the time zone. The time in java.sql.Timestamp(long time) means milliseconds since January 1, 1970, 00:00:00 *GMT*. A negative number is the number of milliseconds before January 1, 1970, 00:00:00 *GMT*. However, in ts='1970-01-01 00:00:00',
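A worked example of the GMT semantics described above; what gets printed depends on the JVM's default time zone, which is exactly the pitfall in this thread:

    import java.sql.Timestamp

    val epoch = new Timestamp(0L) // 0 ms since 1970-01-01 00:00:00 GMT
    println(epoch)                // rendered in the local time zone, e.g.
                                  // 1969-12-31 16:00:00.0 under US/Pacific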

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread Yin Huai
on youtube Easy JSON Data Manipulation in Spark), is it possible to perform aggregation-style queries, for example counting the number of attributes (considering that attributes in the schema are presented as an array), or any other type of aggregation? best, /Shahab On Mon, Oct 13, 2014 at 4:01 PM, Yin Huai

Re: Spark SQL parser bug?

2014-10-13 Thread Yin Huai
,ts#3], MapPartitionsRDD[22] at mapPartitions at basicOperators.scala:208 scala> s.collect res5: Array[org.apache.spark.sql.Row] = Array() Mohammed *From:* Yin Huai [mailto:huaiyin@gmail.com] *Sent:* Monday, October 13, 2014 7:19 AM *To:* Mohammed Guller *Cc:* Cheng, Hao

Re: partition size for initial read

2014-10-02 Thread Yin Huai
Hi Tamas, Can you try to set mapred.map.tasks and see if it works? Thanks, Yin On Thu, Oct 2, 2014 at 10:33 AM, Tamas Jambor jambo...@gmail.com wrote: That would work - I normally use hive queries through spark sql, I have not seen something like that there. On Thu, Oct 2, 2014 at 3:13
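One way to pass that setting through Spark SQL's Hive support, as a sketch (the value and table name are illustrative; whether it affects the initial read depends on the input format):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("SET mapred.map.tasks=100")
    hiveContext.sql("SELECT * FROM some_table")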

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-30 Thread Yin Huai
I think this problem has been fixed after the 1.1 release. Can you try the master branch? On Mon, Sep 29, 2014 at 10:06 PM, vdiwakar.malladi vdiwakar.mall...@gmail.com wrote: I'm using the latest version i.e. Spark 1.1.0 Thanks. -- View this message in context:

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread Yin Huai
What version of Spark did you use? Can you try the master branch? On Mon, Sep 29, 2014 at 1:52 PM, vdiwakar.malladi vdiwakar.mall...@gmail.com wrote: Thanks for your prompt response. Still on further note, I'm getting the exception while executing the query. SELECT data[0].name FROM people

Re: Spark SQL CLI

2014-09-22 Thread Yin Huai
Hi Gaurav, Can you put hive-site.xml in conf/ and try again? Thanks, Yin On Mon, Sep 22, 2014 at 4:02 PM, gtinside gtins...@gmail.com wrote: Hi, I have been using the spark shell to execute all SQLs. I am connecting to Cassandra, converting the data to JSON and then running queries on it,

Re: spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-18 Thread Yin Huai
Hello Andy, Will our JSON support in Spark SQL help your case? If your JSON files store one JSON object per line, you can use SQLContext.jsonFile to load them. If you want to pre-process these files, once you have an RDD[String] (one JSON object per String), you can use SQLContext.jsonRDD. In
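A sketch of both paths described above (the path and JSON shape are illustrative):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // One JSON object per line, read directly from a file:
    val fromFile = sqlContext.jsonFile("/path/to/objects.json")

    // Pre-process first, then hand an RDD[String] to jsonRDD:
    val raw = sc.textFile("/path/to/objects.json").map(_.trim).filter(_.nonEmpty)
    val fromRdd = sqlContext.jsonRDD(raw)
    fromRdd.printSchema()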

Re: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related? On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote: Thank you for pasting the steps, I will look at this, hopefully come out with a solution soon. -Original Message- From: linkpatrickliu

Re: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Yin Huai
I meant it may be a Hive bug since we also call Hive's drop table internally. On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai huaiyin@gmail.com wrote: Seems https://issues.apache.org/jira/browse/HIVE-5474 is related? On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote: Thank

Re: About SparkSQL 1.1.0 join between more than two table

2014-09-15 Thread Yin Huai
1.0.1 does not support outer joins (support was added in 1.1). Your query should be fine in 1.1. On Mon, Sep 15, 2014 at 5:35 AM, Yanbo Liang yanboha...@gmail.com wrote: Spark SQL can support SQL and HiveSQL, which use SQLContext and HiveContext respectively. As far as I know, SQLContext of Spark

Re: compiling spark source code

2014-09-13 Thread Yin Huai
Can you try sbt/sbt clean first? On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu yuzhih...@gmail.com wrote: bq. [error] File name too long It is not clear which file(s) loadfiles was loading. Is the filename in earlier part of the output ? Cheers On Sat, Sep 13, 2014 at 10:58 AM, kkptninja

Re: spark sql - create new_table as select * from table

2014-09-11 Thread Yin Huai
What is the schema of table? On Thu, Sep 11, 2014 at 4:30 PM, jamborta jambo...@gmail.com wrote: thanks. this was actually using hivecontext. -- View this message in context:

Re: Re: Spark SQL -- more than two tables for join

2014-09-11 Thread Yin Huai
1.0.1 does not support outer joins (support was added in 1.1). Can you try the 1.1 branch? On Wed, Sep 10, 2014 at 9:28 PM, boyingk...@163.com boyingk...@163.com wrote: Hi, Michael: I think Arthur.hk.chan arthur.hk.c...@gmail.com isn't here now, so I can show something: 1) my spark version is 1.0.1

Re: Problem Accessing Hive Table from hiveContext

2014-09-01 Thread Yin Huai
Hello Igor, Although Decimal is supported, Hive 0.12 does not support user definable precision and scale (it was introduced in Hive 0.13). Thanks, Yin On Sat, Aug 30, 2014 at 1:50 AM, Zitser, Igor igor.zit...@citi.com wrote: Hi All, New to spark and using Spark 1.0.2 and hive 0.12. If

Re: Spark SQL Parser error

2014-08-26 Thread Yin Huai
, In all three options, when I try to create a temporary function I get the ClassNotFoundException. What would be the issue here? Thanks and Regards, Sankar S. On Saturday, 23 August 2014, 0:53, Yin Huai huaiyin@gmail.com wrote: Hello Sankar, Add JAR in SQL is not supported at the moment

Re: unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-26 Thread Yin Huai
Hello Du, Can you check if there is a metastore directory in the place where you launched your program? If so, can you delete it and try again? Also, can you try HiveContext? LocalHiveContext is deprecated. Thanks, Yin On Mon, Aug 25, 2014 at 6:33 PM, Du Li l...@yahoo-inc.com.invalid wrote: Hi, I

Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Hi Sankar, You need to create an external table in order to specify the location of data (i.e. using CREATE EXTERNAL TABLE user1 LOCATION). You can take a look at this page https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable for
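A hedged sketch of the suggested DDL, with illustrative columns and location (on Spark 1.0.x the HiveContext entry point was hql; in 1.1, sql defaults to HiveQL):

    hiveContext.sql("""
      CREATE EXTERNAL TABLE user1 (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/data/user1'
    """)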

Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
the create external table command as well. I get the same error. Please help me to find the root cause. Thanks and Regards, Sankar S. On Friday, 22 August 2014, 22:43, Yin Huai huaiyin@gmail.com wrote: Hi Sankar, You need to create an external table in order to specify

Re: Spark SQL: Caching nested structures extremely slow

2014-08-21 Thread Yin Huai
I have not profiled this part. But, I think one possible cause is allocating an array for every inner struct for every row (every struct value is represented by a Spark SQL row). I will play with it later and see what I find. On Tue, Aug 19, 2014 at 9:01 PM, Evan Chan velvia.git...@gmail.com

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Yin Huai
If you want to filter the table name, you can use hc.sql("show tables").filter(row => !"test".equals(row.getString(0))). Seems making functionRegistry transient can fix the error. On Wed, Aug 20, 2014 at 8:53 PM, Vida Ha v...@databricks.com wrote: Hi, I doubt the broadcast variable is your

RE: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Yin Huai
PR is https://github.com/apache/spark/pull/2074. -- From: Yin Huai huaiyin@gmail.com Sent: ‎8/‎20/‎2014 10:56 PM To: Vida Ha v...@databricks.com Cc: tianyi tia...@asiainfo.com; Fengyun RAO raofeng...@gmail.com; user@spark.apache.org Subject: Re: Got

Re: Spark RuntimeException due to Unsupported datatype NullType

2014-08-19 Thread Yin Huai
Hi Rafeeq, I think the following part triggered the bug https://issues.apache.org/jira/browse/SPARK-2908: [{"href":null,"rel":"me"}] It has been fixed. Can you try Spark master and see if the error gets resolved? Thanks, Yin On Mon, Aug 11, 2014 at 3:53 AM, rafeeq s rafeeq.ec...@gmail.com wrote:

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira tracking this issue. On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo ce...@zephyrhealthinc.com wrote: Thanks, Zhan for the follow up. But, do you know how I am supposed to set that table name on the jobConf? I don't have

Re: spark error when distinct on more than one cloume

2014-08-19 Thread Yin Huai
Hi, The SQLParser used by SQLContext is pretty limited. Instead, can you try HiveContext? Thanks, Yin On Tue, Aug 19, 2014 at 7:57 AM, wan...@testbird.com wan...@testbird.com wrote: sql:SELECT app_id,COUNT(DISTINCT app_id, macaddr) cut from object group by app_id *Error Log* 14/08/19

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-13 Thread Yin Huai
not able to switch to a database other than the default one, for Yarn-client mode, it works fine. Thanks! Jenny On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai huaiyin@gmail.com wrote: Hi Jenny, Have you copied hive-site.xml to spark/conf directory? If not, can you put it in conf/ and try

Re: SparkSQL Hive partitioning support

2014-08-13 Thread Yin Huai
Hi Silvio, You can insert into a static partition via SQL statement. Dynamic partitioning is not supported at the moment. Thanks, Yin On Wed, Aug 13, 2014 at 2:03 PM, Michael Armbrust mich...@databricks.com wrote: This is not supported at the moment. There are no concrete plans at the
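A sketch of a static-partition insert of the kind described (the table, columns, and partition value are illustrative; assumes `hiveContext` is a HiveContext):

    hiveContext.sql("""
      INSERT OVERWRITE TABLE sales PARTITION (dt = '2014-08-13')
      SELECT id, amount FROM staging_sales
    """)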

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-12 Thread Yin Huai
.svl.ibm.com:8080</value> </property> <property> <name>hive.security.authorization.enabled</name> <value>true</value> </property> <property> <name>hive.security.authorization.createtable.owner.grants</name> <value>ALL</value> </property> </configuration> On Mon, Aug 11, 2014 at 4:29 PM, Yin Huai huaiyin

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Yin Huai
Hi Jenny, How's your metastore configured for both Hive and Spark SQL? Which metastore mode are you using (based on https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin )? Thanks, Yin On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao linlin200...@gmail.com wrote: you can

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Yin Huai
In case the link to PR 1819 is broken, here it is: https://github.com/apache/spark/pull/1819. On Sun, Aug 10, 2014 at 5:56 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Thanks Michael, I can try that too. I know you guys aren't in sales/marketing (thank G-d), but given all the hoopla

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
Hi Brad, It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon. Thanks, Yin On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I'm having a bit of trouble with nested data structures in pyspark

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
will have a better story to handle NullType columns ( https://issues.apache.org/jira/browse/SPARK-2695). But, we still will not expose NullType to users. On Thu, Aug 7, 2014 at 1:41 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: Thanks Yin! best, -Brad On Thu, Aug 7, 2014 at 1:39 PM, Yin

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
The PR is https://github.com/apache/spark/pull/1840. On Thu, Aug 7, 2014 at 1:48 PM, Yin Huai yh...@databricks.com wrote: Actually, the issue is if values of a field are always null (or this field is missing), we cannot figure out the data type. So, we use NullType (it is an internal data

Re: pyspark inferSchema

2014-08-05 Thread Yin Huai
Yes, 2376 has been fixed in master. Can you give it a try? Also, for inferSchema, because Python is dynamically typed, I agree with Davies to provide a way to scan a subset (or all) of the dataset to figure out the proper schema. We will take a look at it. Thanks, Yin On Tue, Aug 5, 2014 at

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Yin Huai
I tried jsonRDD(...).printSchema() and it worked. Seems the problem is when we take the data back to the Python side, SchemaRDD#javaToPython failed on your cases. I have created https://issues.apache.org/jira/browse/SPARK-2875 to track it. Thanks, Yin On Tue, Aug 5, 2014 at 9:20 PM, Brad

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Yin Huai
I have created https://issues.apache.org/jira/browse/SPARK-2775 to track it. On Thu, Jul 31, 2014 at 11:47 AM, Budde, Adam bu...@amazon.com wrote: I still see the same “Unresolved attributes” error when using hql + backticks. Here’s a code snippet that replicates this behavior: val

Re: Simple record matching using Spark SQL

2014-07-24 Thread Yin Huai
Hi Sarath, I will try to reproduce the problem. Thanks, Yin On Wed, Jul 23, 2014 at 11:32 PM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi Michael, Sorry for the delayed response. I'm using Spark 1.0.1 (pre-built version for hadoop 1). I'm running spark programs on

Re: Simple record matching using Spark SQL

2014-07-24 Thread Yin Huai
Hi Sarath, Have you tried the current branch 1.0? If not, can you give it a try and see if the problem can be resolved? Thanks, Yin On Thu, Jul 24, 2014 at 11:17 AM, Yin Huai yh...@databricks.com wrote: Hi Sarath, I will try to reproduce the problem. Thanks, Yin On Wed, Jul 23

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-23 Thread Yin Huai
Yes, https://issues.apache.org/jira/browse/SPARK-2576 is used to track it. On Wed, Jul 23, 2014 at 9:11 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Do we have a JIRA issue to track this? I think I've run into a similar issue. On Wed, Jul 23, 2014 at 1:12 AM, Yin Huai yh

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-22 Thread Yin Huai
On Tue, Jul 22, 2014 at 12:53 AM, Victor Sheng victorsheng...@gmail.com wrote: Hi, Yin Huai I test again with your snippet code. It works well in spark-1.0.1 Here is my code: val sqlContext = new org.apache.spark.sql.SQLContext(sc) case class Record(data_date: String, mobile

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-21 Thread Yin Huai
Hi Victor, Instead of importing sqlContext.createSchemaRDD, can you explicitly call sqlContext.createSchemaRDD(rdd) to create a SchemaRDD? For example, you have a case class Record. case class Record(data_date: String, mobile: String, create_time: String) Then, you create an RDD[Record] and
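The explicit form of that workaround, as a sketch (Record comes from the thread; the data is illustrative, and registerAsTable is the Spark 1.0.x name):

    case class Record(data_date: String, mobile: String, create_time: String)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val rdd = sc.parallelize(Seq(Record("2014-07-21", "555-0100", "12:00:00")))
    val schemaRDD = sqlContext.createSchemaRDD(rdd) // explicit call, no import needed
    schemaRDD.registerAsTable("records")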

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread Yin Huai
Instead of using union, can you try sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db").registerAsTable("parquetTable")? Then, var all = sql("select some_id, some_type, some_time from parquetTable").map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19 Thanks, Yin

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-19 Thread Yin Huai
Can you attach your code? Thanks, Yin On Sat, Jul 19, 2014 at 4:10 PM, chutium teng@gmail.com wrote: 160G parquet files (ca. 30 files, snappy compressed, made by cloudera impala) ca. 30 full table scan, took 3-5 columns out, then some normal scala operations like substring, groupby,

Re: Spark Streaming Json file groupby function

2014-07-16 Thread Yin Huai
Hi Srinivas, Seems the query you used is val results = sqlContext.sql("select type from table1"). However, table1 does not have a field called type. The schema of table1 is defined by the class definition of your case class Record (i.e., ID, name, score, and school are the fields of table1). Can you
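A sketch that queries fields the schema actually contains (the field names are taken from the thread; the data is illustrative):

    case class Record(ID: String, name: String, score: Int, school: String)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD
    val rdd = sc.parallelize(Seq(Record("1", "a", 90, "s1"), Record("2", "b", 80, "s2")))
    rdd.registerAsTable("table1")
    sqlContext.sql("SELECT school, COUNT(1) FROM table1 GROUP BY school").collect()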

Re: spark1.0.1 catalyst transform filter not push down

2014-07-14 Thread Yin Huai
Hi, queryPlan.baseLogicalPlan is not the plan used for execution. Actually, the baseLogicalPlan of a SchemaRDD (queryPlan in your case) is just the parsed plan (the parsed plan will be analyzed and then optimized; finally, a physical plan will be created). The plan shows up after you execute val

Re: Spark SQL : Join throws exception

2014-07-07 Thread Yin Huai
Hi Subacini, Just want to follow up on this issue. SPARK-2339 has been merged into the master and 1.0 branch. Thanks, Yin On Tue, Jul 1, 2014 at 2:00 PM, Yin Huai huaiyin@gmail.com wrote: Seems it is a bug. I have opened https://issues.apache.org/jira/browse/SPARK-2339 to track

Re: Spark SQL : Join throws exception

2014-07-01 Thread Yin Huai
Seems it is a bug. I have opened https://issues.apache.org/jira/browse/SPARK-2339 to track it. Thank you for reporting it. Yin On Tue, Jul 1, 2014 at 12:06 PM, Subacini B subac...@gmail.com wrote: Hi All, Running this join query sql(SELECT * FROM A_TABLE A JOIN B_TABLE B WHERE

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Yin Huai
Hi Durin, I guess that blank lines caused the problem (like Aaron said). Right now, jsonFile does not skip faulty lines. Can you first use sc.textFile to load the file as an RDD[String] and then use filter to filter out those blank lines (a code snippet can be found below)? val sqlContext = new
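A sketch of that workaround (the path is illustrative):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val nonBlank = sc.textFile("/path/to/data.json").filter(_.trim.nonEmpty)
    val schemaRDD = sqlContext.jsonRDD(nonBlank)
    schemaRDD.printSchema()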

Re: Problems with connecting Spark to Hive

2014-06-03 Thread Yin Huai
Hello Lars, Can you check the value of hive.security.authenticator.manager in hive-site.xml? I guess the value is org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator. This class was introduced in hive 0.13, but Spark SQL is based on hive 0.12 right now. Can you change the value of
