RE: Re: problem with spark thrift server

2015-04-23 Thread Cheng, Hao
Hi, can you describe a little bit how the ThriftServer crashed, or the steps to reproduce it? It's probably a bug in the ThriftServer. Thanks, From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk] Sent: Friday, April 24, 2015 9:55 AM To: Arush Kharbanda Cc: user Subject: Re: Re: problem wit

RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Can you print out the physical plan? EXPLAIN SELECT xxx… From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Monday, May 4, 2015 9:08 PM To: Olivier Girardot; user Subject: Re: Re: sparksql running slow while joining 2 tables. hi Olivier spark1.3.1, with java1.8.0.45 and add 2 pics
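
For reference, a minimal sketch of the suggested EXPLAIN usage, with made-up table and column names (assuming a HiveContext):

    // Prints the physical plan chosen for the join; t1/t2 are hypothetical.
    hiveContext.sql("EXPLAIN SELECT t1.c1 FROM t1 JOIN t2 ON t1.key = t2.key")
      .collect().foreach(println)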

RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Or, have you ever tried a broadcast join? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, May 5, 2015 8:33 AM To: luohui20...@sina.com; Olivier Girardot; user Subject: RE: Re: Re: sparksql running slow while joining 2 tables. Can you print out the physical plan? EXPLAIN SELECT xxx
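
In the Spark 1.3-era planner a broadcast join is chosen automatically when one side's estimated size is below spark.sql.autoBroadcastJoinThreshold; a hedged sketch of nudging it (the threshold value and table name are placeholders, and Hive tables need size statistics for the estimate):

    // Allow tables up to ~100MB to be broadcast instead of shuffled.
    hiveContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")
    // Collect the size statistics the planner relies on (Hive tables).
    hiveContext.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")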

Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread Cheng, Hao
I assume you're using the DataFrame API within your application. sql("SELECT…").explain(true) From: Wang, Daoyuan Sent: Tuesday, May 5, 2015 10:16 AM To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user Subject: RE: Re: RE: Re: Re: sparksql running slow while joining_2_tables. You can use

RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-05 Thread Cheng, Hao
, Hao; Wang, Daoyuan; Olivier Girardot; user Subject: Re: Re: sparksql running slow while joining_2_tables. Hi guys, attached the pics of the physical plan and logs. Thanks. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: "Cheng, Hao"

RE: question about sparksql caching

2015-05-14 Thread Cheng, Hao
You probably can try something like: val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key group by c1") df.cache() // Cache the result, but it's a lazy execution. df.registerTempTable("my_result") sqlContext.sql("select * from my_result where c1=1").collect // the cache
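
A fuller sketch of the caching pattern described above (Spark 1.3-era API; table and column names are from the example):

    val df = sqlContext.sql(
      "select c1, sum(c2) from T1, T2 where T1.key=T2.key group by c1")
    df.cache()                         // lazy: nothing is materialized yet
    df.registerTempTable("my_result")

    // The first query materializes the cache; later queries reuse it.
    sqlContext.sql("select * from my_result where c1=1").collect()
    sqlContext.sql("select * from my_result where c1=2").collect()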

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-14 Thread Cheng, Hao
Spark SQL just takes JDBC as a new data source, the same way we support loading data from a .csv or .json file. From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 2:30 PM To: User Subject: What's the advantage features of Spark SQL(JDBC) Hi All, Comparing direct
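
A minimal sketch of treating a JDBC table as just another data source, per the point above (Spark 1.3-era load API; URL and table name are placeholders):

    val jdbcDf = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=u&password=p",
      "dbtable" -> "student_info"))
    jdbcDf.registerTempTable("student_info")
    sqlContext.sql("select count(*) from student_info").show()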

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Cheng, Hao
Yes. From: Yi Zhang [mailto:zhangy...@yahoo.com] Sent: Friday, May 15, 2015 2:51 PM To: Cheng, Hao; User Subject: Re: What's the advantage features of Spark SQL(JDBC) @Hao, As you said, there is no advantage feature for JDBC, it just provides unified api to support different data sources.

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Forgot to import the implicit functions/classes? import sqlContext.implicits._ From: Rajdeep Dua [mailto:rajdeep@gmail.com] Sent: Monday, May 18, 2015 8:08 AM To: user@spark.apache.org Subject: InferredSchema Example in Spark-SQL Hi All, Was trying the Inferred Schema spark example http://sp

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Typo? Should be .toDF(), not .toRD() From: Ram Sriharsha [mailto:sriharsha@gmail.com] Sent: Monday, May 18, 2015 8:31 AM To: Rajdeep Dua Cc: user Subject: Re: InferredSchema Example in Spark-SQL you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring schema from the c
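
A minimal inferred-schema sketch combining both fixes from this thread, the implicits import and .toDF() (mirroring the programming-guide example; the input path is a placeholder):

    import sqlContext.implicits._

    case class Person(name: String, age: Int)
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()                 // compiles only with the implicits in scope
    people.registerTempTable("people")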

RE: Is Spark-1.0.0 not backward compatible with Shark-0.9.1 ?

2014-06-10 Thread Cheng, Hao
And if you want to use the SQL CLI (based on Catalyst) as it works in Shark, you can also check out https://github.com/amplab/shark/pull/337 :) This preview version doesn't require Hive to be set up in the cluster. (Don't forget to put the hive-site.xml under SHARK_HOME/conf as well.) Cheng Hao

RE: Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Cheng, Hao
st.fill(300)(Foo("c", 3)) sparkContext.makeRDD(rows).registerAsTable("foo") sql("select k,count(*) from foo group by k").collect res1: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300]) Cheng Hao From: Pei-Lun Lee [mailto:pl...@appier.com] Sent: Wedne
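
The preview truncates the snippet; a plausible reconstruction (the leading "st.fill" is presumably List.fill, and the 1.0-era implicit SchemaRDD conversion is assumed to be in scope, as in the hive/console helpers):

    case class Foo(k: String, v: Int)
    val rows = List.fill(100)(Foo("a", 1)) ++
               List.fill(200)(Foo("b", 2)) ++
               List.fill(300)(Foo("c", 3))
    sparkContext.makeRDD(rows).registerAsTable("foo")
    sql("select k, count(*) from foo group by k").collect()
    // res1: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])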

RE: Hive From Spark

2014-07-17 Thread Cheng, Hao
I couldn't reproduce the issue with latest master, but I found another bug of running this. https://github.com/apache/spark/pull/1475 Can you give more details about your env? -Original Message- From: JiajiaJing [mailto:jj.jing0...@gmail.com] Sent: Friday, July 18, 2014 8:48 AM To: u..

RE: Hive From Spark

2014-07-20 Thread Cheng, Hao
u...@spark.incubator.apache.org Subject: RE: Hive From Spark Hi Cheng Hao, Thank you very much for your reply. Basically, the program runs on Spark 1.0.0 and Hive 0.12.0 . Some setups of the environment are done by running "SPARK_HIVE=true sbt/sbt assembly/assembly", including t

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
This is a very interesting problem. SparkSQL supports the non-equi join, but it is very inefficient with large tables. One possible solution is to make both tables partitioned, with partition keys of (cast(ds as bigint) / 240); then within each partition of dataset1, you probably can writ
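
A sketch of the bucketing idea, assuming dataset1 and dataset2 are RDDs of records with a ds timestamp in seconds (the names and the 240-second window are illustrative): key both sides by a time bucket, replicate one side into neighboring buckets so boundary matches aren't lost, then apply the real predicate.

    val bucketed1 = dataset1.map(r => (r.ds / 240, r))
    val bucketed2 = dataset2.flatMap { r =>
      val k = r.ds / 240
      Seq(k - 1, k, k + 1).map(b => (b, r))   // cover bucket boundaries
    }
    val joined = bucketed1.join(bucketed2)
      .filter { case (_, (a, b)) => math.abs(a.ds - b.ds) <= 240 }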

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
Actually it's just a pseudo algorithm I described; you can do it with the Spark API. Hope the algorithm is helpful. -Original Message- From: durga [mailto:durgak...@gmail.com] Sent: Tuesday, July 22, 2014 11:56 AM To: u...@spark.incubator.apache.org Subject: RE: Joining by timestamp. Hi Chen,

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
Durga, you can start from the documents http://spark.apache.org/docs/latest/quick-start.html http://spark.apache.org/docs/latest/programming-guide.html -Original Message- From: durga [mailto:durgak...@gmail.com] Sent: Tuesday, July 22, 2014 12:45 PM To: u...@spark.incubator.apache.o

RE: SparkSQL can not use SchemaRDD from Hive

2014-07-28 Thread Cheng, Hao
In your code snippet, "sample" is actually a SchemaRDD, and a SchemaRDD binds to a certain SQLContext at runtime; I don't think we can manipulate/share a SchemaRDD across SQLContext instances. -Original Message- From: Kevin Jung [mailto:itsjb.j...@samsung.com] Sent: Tuesday, July

RE: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-28 Thread Cheng, Hao
I ran this before; actually the hive-site.xml works this way for me (the tricky part happens in the new HiveConf(classOf[SessionState])). Can you double check that the hive-site.xml can be loaded from the classpath? It is supposed to appear at the root of the classpath. -Original Message- From: nik

RE: Data from Mysql using JdbcRDD

2014-07-30 Thread Cheng, Hao
Probably you need to update the SQL like "SELECT * FROM student_info where id >= ? and id <= ?". -Original Message- From: srinivas [mailto:kusamsrini...@gmail.com] Sent: Thursday, July 31, 2014 6:55 AM To: u...@spark.incubator.apache.org Subject: Data from Mysql using JdbcRDD Hi, I am
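
A hedged sketch of the corrected JdbcRDD call, with placeholder connection details; the two '?' placeholders are bound to each partition's id range:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "pass"),
      "SELECT * FROM student_info WHERE id >= ? AND id <= ?",
      1, 10000, 4,                 // lowerBound, upperBound, numPartitions
      rs => rs.getString("name"))
    println(rdd.count())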

RE: Substring in Spark SQL

2014-08-04 Thread Cheng, Hao
From the log, I noticed the "substr" was added on July 15th; the 1.0.1 release should be earlier than that. The community is now working on releasing 1.1.0, and some of the performance improvements were also added. Probably you can try that for your benchmark. Cheng Hao

RE: Spark SQL Stackoverflow error

2014-08-14 Thread Cheng, Hao
I couldn't reproduce the exception; probably it's been solved in the latest code. From: Vishal Vibhandik [mailto:vishal.vibhan...@gmail.com] Sent: Thursday, August 14, 2014 11:17 AM To: user@spark.apache.org Subject: Spark SQL Stackoverflow error Hi, I tried running the sample sql code JavaSparkSQL bu

RE: Unsupported language features in query

2014-09-02 Thread Cheng, Hao
Currently SparkSQL doesn't support the row format/serde in CTAS. The workaround is to create the table first. -Original Message- From: centerqi hu [mailto:cente...@gmail.com] Sent: Tuesday, September 02, 2014 3:35 PM To: user@spark.apache.org Subject: Unsupported language features in query
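
A sketch of that workaround with made-up table names: declare the table (with its row format) first, then populate it with a separate INSERT instead of one CTAS:

    hiveContext.sql("""CREATE TABLE my_output (id INT, name STRING)
                       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'""")
    hiveContext.sql("INSERT OVERWRITE TABLE my_output SELECT id, name FROM my_input")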

RE: Unsupported language features in query

2014-09-02 Thread Cheng, Hao
[mailto:cente...@gmail.com] Sent: Tuesday, September 02, 2014 3:46 PM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Unsupported language features in query Thanks Cheng Hao. Is there a way to obtain the list of Hive statements Spark supports? Thanks 2014-09-02 15:39 GMT+08:00 Cheng, Hao : > Curren

RE: TimeStamp selection with SparkSQL

2014-09-04 Thread Cheng, Hao
, neither of those SQL dialects supports Date, only Timestamp. Cheng Hao From: Benjamin Zaitlen [mailto:quasi...@gmail.com] Sent: Friday, September 05, 2014 5:37 AM To: user@spark.apache.org Subject: TimeStamp selection with SparkSQL I may have missed this but is it possible to select on datetime in a
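
A minimal sketch of selecting on a datetime via a Timestamp cast instead (table and column names are hypothetical; assumes the 1.1-era HiveQL/SQL dialect):

    sqlContext.sql(
      "SELECT * FROM logs WHERE ts >= CAST('2014-09-01 00:00:00' AS timestamp)")
      .collect()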

RE: SchemaRDD - Parquet - "insertInto" makes many files

2014-09-04 Thread Cheng, Hao
Hive can launch another job with a strategy to merge the small files; probably we can also do that in a future release. From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, September 05, 2014 8:59 AM To: DanteSama Cc: u...@spark.incubator.apache.org Subject: Re: SchemaRDD - Parqu
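
Not mentioned in the thread, but a common interim workaround is to cut the number of output files by coalescing before the insert (the partition count and table name are illustrative):

    schemaRdd.coalesce(16).insertInto("my_parquet_table")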

RE: Spark SQL JDBC

2014-09-11 Thread Cheng, Hao
I copied the 3 datanucleus jars (datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar) to the folder lib/ manually, and it works for me. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 11:28 AM To: alexandria1101 Cc: u...@spark.incu

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread Cheng, Hao
What's your Spark / Hadoop version? And also the hive-site.xml? Most cases like that are caused by a Hadoop client jar that is incompatible with the Hadoop cluster. -Original Message- From: linkpatrickliu [mailto:linkpatrick...@live.com] Sent: Monday, September 15, 2014 2:35 PM To: u...@spark.incubat

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread Cheng, Hao
The Hadoop client jar should be assembled into the uber-jar, but (I suspect) it's probably not compatible with your Hadoop cluster. Can you also paste the Spark uber-jar name? Usually it will be under the path lib/spark-assembly-1.1.0-xxx-hadoopxxx.jar. -Original Message- From: linkpatrick

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-15 Thread Cheng, Hao
Sorry, I am not able to reproduce that. Can you try adding the following entries to the hive-site.xml? I know they have default values, but let's make them explicit. hive.server2.thrift.port hive.server2.thrift.bind.host hive.server2.authentication (NONE, KERBEROS, LDAP, PAM or CUSTOM) -Origi
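
A sketch of those entries made explicit in hive-site.xml, with default/placeholder values:

    <property>
      <name>hive.server2.thrift.port</name>
      <value>10000</value>
    </property>
    <property>
      <name>hive.server2.thrift.bind.host</name>
      <value>localhost</value>
    </property>
    <property>
      <name>hive.server2.authentication</name>
      <value>NONE</value>
    </property>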

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Cheng, Hao
Thank you for pasting the steps; I will look at this and hopefully come up with a solution soon. -Original Message- From: linkpatrickliu [mailto:linkpatrick...@live.com] Sent: Tuesday, September 16, 2014 3:17 PM To: u...@spark.incubator.apache.org Subject: RE: SparkSQL 1.1 hang when "DROP"

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Cheng, Hao
https://github.com/apache/spark/pull/2241), not sure if you can wait for this. ☺ From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Wednesday, September 17, 2014 1:50 AM To: Cheng, Hao Cc: linkpatrickliu; u...@spark.incubator.apache.org Subject: Re: SparkSQL 1.1 hang when "DROP" or "L

RE: problem with HiveContext inside Actor

2014-09-17 Thread Cheng, Hao
the HiveDriver will always get the null value when retrieving HiveConf. Cheng Hao From: Du Li [mailto:l...@yahoo-inc.com.INVALID] Sent: Thursday, September 18, 2014 7:51 AM To: user@spark.apache.org; d...@spark.apache.org Subject: problem with HiveContext inside Actor Hi, Wonder anybody

RE: Spark SQL parser bug?

2014-10-12 Thread Cheng, Hao
(1::2::Nil).map(i=> T(i.toString, new java.sql.Timestamp(i))) data.registerTempTable("x") val s = sqlContext.sql("select a from x where ts>='1970-01-01 00:00:00';") s.collect output: res1: Array[org.apache.spark.sql.Row] = Array([1], [2
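
The preview cuts off the setup; a plausible reconstruction of the repro (Spark 1.1-era shell, with the implicit SchemaRDD conversion in scope):

    // assumes: import sqlContext.createSchemaRDD   (1.1-era implicit)
    case class T(a: String, ts: java.sql.Timestamp)
    val data = sc.parallelize((1 :: 2 :: Nil).map(i => T(i.toString, new java.sql.Timestamp(i))))
    data.registerTempTable("x")
    val s = sqlContext.sql("select a from x where ts >= '1970-01-01 00:00:00';")
    s.collect()
    // res1: Array[org.apache.spark.sql.Row] = Array([1], [2])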

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Cheng, Hao
Seems like a bug in JavaSQLContext.getSchema(), which doesn't enumerate all of the data types supported by Catalyst. From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 11:44 PM To: Wang, Daoyuan; user@spark.apache.org Subject: RE: scala.MatchError: class java.sql.Timestamp sc

RE: SchemaRDD Convert

2014-10-22 Thread Cheng, Hao
You needn't do anything, the implicit conversion should do this for you. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L103 https://github.com/apache/spark/blob/2ac40da3f9fa6d45a59bb45b41606f1931ac5e81/sql/catalyst/src/main/scala/org/apac

RE: spark sql query optimization , and decision tree building

2014-10-22 Thread Cheng, Hao
Not sure how the k-d tree is used in MLlib, but keep in mind a SchemaRDD is just a normal RDD. Cheng Hao From: sanath kumar [mailto:sanath1...@gmail.com] Sent: Wednesday, October 22, 2014 12:58 PM To: user@spark.apache.org Subject: spark sql query optimization , and decision tree building Hi all

RE: Create table error from Hive in spark-assembly-1.0.2.jar

2014-10-26 Thread Cheng, Hao
Can you paste the hive-site.xml? Most times I meet this exception, it is because the JDBC driver for the hive metastore is not set correctly or the wrong driver classes are included in the assembly jar. By default, the assembly jar contains derby.jar, which is the embedded Derby JDBC driver. From: Jac

Support Hive 0.13 .1 in Spark SQL

2014-10-27 Thread Cheng, Hao
both. Sorry if I missed some discussion of Hive upgrading. Cheng Hao

Spark SQL Hive Version

2014-11-05 Thread Cheng, Hao
Hi, all, I noticed that when compiling SparkSQL with the profile "hive-0.13.1", it will fetch Hive version 0.13.1a under groupId "org.spark-project.hive"; what's the difference from the one under "org.apache.hive"? And where can I get the source code for re-compiling? Thanks, Cheng Hao

RE: [SQL] PERCENTILE is not working

2014-11-05 Thread Cheng, Hao
Which version are you using? I can reproduce that in the latest code, but with a different exception. I've filed a bug https://issues.apache.org/jira/browse/SPARK-4263; can you also add some information there? Thanks, Cheng Hao -Original Message- From: Kevin Paul [mailto:kevinp

RE: SparkSQL Timestamp query failure

2014-11-23 Thread Cheng, Hao
Can you try query like “SELECT timestamp, CAST(timestamp as string) FROM logs LIMIT 5”, I guess you probably ran into the timestamp precision or the timezone shifting problem. (And it’s not mandatory, but you’d better change the field name from “timestamp” to something else, as “timestamp” is t

RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Cheng, Hao
Are all of your join keys the same? And I guess the join types are all "Left" joins; https://github.com/apache/spark/pull/3362 is probably what you need. Also, SparkSQL doesn't support multiway join (and multiway broadcast join) currently; https://github.com/apache/spark/pull/3270 should be an

RE: Spark SQL performance and data size constraints

2014-11-26 Thread Cheng, Hao
Spark SQL doesn't support DISTINCT well currently; particularly in the case you described, it will lead all of the data to fall onto a single node and be kept in memory only. The dev community actually has solutions for this; it probably will be solved after the release of Spark 1.2. -Original

RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Cheng, Hao
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Thursday, November 27, 2014 10:24 PM To: Cheng, Hao Cc: user Subject: Re: Auto BroadcastJoin optimization failed in latest Spark Hi Hao, I'm using inner join as Broadcast join didn't work for left joins (thanks for the lin

RE: Spark SQL UDF returning a list?

2014-12-03 Thread Cheng, Hao
/pull/3595 ) b. It expects the function return type to be immutable.Seq[XX] for List, immutable.Map[X, X] for Map, scala.Product for Struct, and only Array[Byte] for binary. The Array[_] is not supported. Cheng Hao From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday, December 4
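
A minimal sketch of a UDF honoring that constraint, returning an immutable Seq (via toList) rather than an Array; the function and table names are made up:

    // Registers a UDF whose List return type maps to an array column.
    sqlContext.registerFunction("splitWords", (s: String) => s.split(" ").toList)
    sqlContext.sql("SELECT splitWords(line) FROM docs").collect()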

RE: Spark SQL with a sorted file

2014-12-03 Thread Cheng, Hao
You can try to write your own Relation with filter push down, or use the ParquetRelation2 as a workaround. (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala) Cheng Hao -Original Message- From: Jerry Raj [mailto:jerry

RE: Can HiveContext be used without using Hive?

2014-12-09 Thread Cheng, Hao
It works exactly like Create Table As Select (CTAS) in Hive. Cheng Hao From: Anas Mosaad [mailto:anas.mos...@incorta.com] Sent: Wednesday, December 10, 2014 11:59 AM To: Michael Armbrust Cc: Manoj Samel; user@spark.apache.org Subject: Re: Can HiveContext be used without using Hive? In that

RE: Why my SQL UDF cannot be registered?

2014-12-15 Thread Cheng, Hao
As the error log shows, you may need to register it as: sqlContext.registerFunction("toHour", toHour _) The "_" means you are passing the function as a parameter, not invoking it. Cheng Hao From: Xuelin Cao [mailto:xuelin...@yahoo.com.INVALID] Sent: Monday, December 15, 2014 5:28 PM To: User
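
A short sketch of the fix with a hypothetical toHour; the trailing underscore eta-expands the method into a function value instead of calling it:

    def toHour(ts: Long): Long = ts / 3600 % 24   // ts in seconds since epoch
    sqlContext.registerFunction("toHour", toHour _)
    sqlContext.sql("SELECT toHour(eventTime) FROM events").collect()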

RE: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Cheng, Hao
Hi, Lam, I can confirm this is a bug with the latest master, and I filed a jira issue for it: https://issues.apache.org/jira/browse/SPARK-4944. Hope to come up with a solution soon. Cheng Hao From: Jerry Lam [mailto:chiling...@gmail.com] Sent: Wednesday, December 24, 2014 4:26 AM To: user

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Cheng, Hao
I am wondering if we can provide a more friendly API, other than a configuration, for this purpose. What do you think, Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org

RE: Escape commas in file names

2014-12-25 Thread Cheng, Hao
multiple parquet files for API sqlContext.parquetFile; we need to think about how to support multiple paths in some other way. Cheng Hao From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, December 25, 2014 1:01 PM To: Daniel Siegmann Cc: user@spark.apache.org Subject: Re: Escape

RE: Implement customized Join for SparkSQL

2015-01-05 Thread Cheng, Hao
Can you paste the error log? From: Dai, Kevin [mailto:yun...@ebay.com] Sent: Monday, January 5, 2015 6:29 PM To: user@spark.apache.org Subject: Implement customized Join for SparkSQL Hi, All Suppose I want to join two tables A and B as follows: Select * from A join B on A.id = B.id A is a file

RE: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Cheng, Hao
The log showed it failed in parsing, so the typo stuff shouldn't be the root cause. BUT I couldn't reproduce that with the master branch. I did the test as follows: sbt/sbt -Phadoop-2.3.0 -Phadoop-2.3 -Phive -Phive-0.13.1 hive/console scala> sql("SELECT user_id FROM actions where conversion_aciton_id

RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-15 Thread Cheng, Hao
Hi, BB Ideally you can do the query like: select key, value.percent from mytable_data lateral view explode(audiences) f as key, value limit 3; But there is a bug in HiveContext: https://issues.apache.org/jira/browse/SPARK-5237 I am working on it now; hopefully I'll have a patch soon. Cheng

RE: Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Cheng, Hao
The Data Source API probably works for this purpose. It supports column pruning and predicate push down: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala Examples also can be found in the unit tests: https://github.com/apache/sp
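
A minimal sketch against the Spark 1.2-era sources API (class shapes and imports shifted in later releases, so treat this as illustrative): Spark hands buildScan only the columns the query needs plus the predicates it managed to push down, and re-evaluates anything the relation ignores.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext, StructType}
    import org.apache.spark.sql.sources.{Filter, PrunedFilteredScan}

    class MyRelation(override val sqlContext: SQLContext, mySchema: StructType)
      extends PrunedFilteredScan {

      override def schema: StructType = mySchema

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // e.g. translate GreaterThan("id", 10) into a range scan of the
        // underlying store, and project only requiredColumns.
        ???
      }
    }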

RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-17 Thread Cheng, Hao
Wow, glad to know that it works well. And sorry, the Jira is another issue, which is not the same as this case. From: Bagmeet Behera [mailto:bagme...@gmail.com] Sent: Saturday, January 17, 2015 12:47 AM To: Cheng, Hao Subject: Re: using hiveContext to select a nested Map-data-type from an

RE: SparkSQL 1.2.0 sources API error

2015-01-18 Thread Cheng, Hao
It seems the netty jar has an incompatible method signature. Can you check if there are different versions of the netty jar in your classpath? From: Walrus theCat [mailto:walrusthe...@gmail.com] Sent: Sunday, January 18, 2015 3:37 PM To: user@spark.apache.org Subject: Re: SparkSQL 1.2.0 sources A
