FW: Spark SQL 1.1.0: NPE when joining two cached tables

2014-09-22 Thread Haopu Wang
FWD to dev mailing list for help. From: Haopu Wang Sent: September 22, 2014 16:35 To: u...@spark.apache.org Subject: Spark SQL 1.1.0: NPE when joining two cached tables I have two data sets and want to join them on their first fields. Sample data are below: data set
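
For reference, a minimal sketch of the scenario described above, using the Spark 1.1-era SchemaRDD API; the table and field names are hypothetical, not the original poster's attachment.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class A(k: Int, v: String)
    case class B(k: Int, w: String)

    object CachedJoinRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cached-join"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

        sc.parallelize(Seq(A(1, "x"), A(2, "y"))).registerTempTable("a")
        sc.parallelize(Seq(B(1, "p"), B(3, "q"))).registerTempTable("b")

        // Cache both tables, then join on the first field; the reported NPE
        // occurred when both sides were cached.
        sqlContext.cacheTable("a")
        sqlContext.cacheTable("b")
        sqlContext.sql("SELECT a.k, a.v, b.w FROM a JOIN b ON a.k = b.k")
          .collect().foreach(println)
      }
    }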

Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I took a look at HashOuterJoin and it builds a hash table for both sides. This consumes quite a lot of memory when the partition is big, and it doesn't reduce the iteration over the streamed relation, right? Thanks!

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Haopu Wang
From: Liquan Pei [mailto:liquan...@gmail.com] Sent: September 30, 2014 12:31 To: Haopu Wang Cc: dev@spark.apache.org; user Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin? Hi Haopu, My understanding is that the hashtable on both left and right side

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Haopu Wang
Liquan, yes, for a full outer join, hash tables on both sides are more efficient. For a left/right outer join, it looks like one hash table should be enough. From: Liquan Pei [mailto:liquan...@gmail.com] Sent: September 30, 2014 18:34 To: Haopu Wang Cc: dev
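
A minimal sketch of the distinction being discussed, in plain Scala rather than Spark internals: for a left outer join it suffices to hash only the right (build) side and stream the left side through it, whereas a full outer join must also emit unmatched rows from the build side, which is why it needs to track both sides. Names and types here are illustrative only.

    // Build one hash table on the right side, stream the left side through it.
    def leftOuterHashJoin[K, L, R](
        left: Iterator[(K, L)],
        right: Iterable[(K, R)]): Iterator[(K, (L, Option[R]))] = {
      // Build phase: hash only the right relation.
      val buildSide: Map[K, Seq[R]] =
        right.groupBy(_._1).map { case (k, rs) => k -> rs.map(_._2).toSeq }
      // Probe phase: stream the left relation; no second hash table needed.
      left.flatMap { case (k, l) =>
        buildSide.get(k) match {
          case Some(rs) => rs.iterator.map(r => (k, (l, Some(r): Option[R])))
          case None     => Iterator((k, (l, None: Option[R])))
        }
      }
    }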

Do you know any Spark modeling tool?

2014-12-25 Thread Haopu Wang
Hi, I think a modeling tool may be helpful because sometimes it's hard/tricky to program Spark. I don't know if there is already such a tool. Thanks!

RE: HiveContext cannot be serialized

2015-02-16 Thread Haopu Wang
To: Michael Armbrust Cc: Haopu Wang; dev@spark.apache.org Subject: Re: HiveContext cannot be serialized I submitted a patch https://github.com/apache/spark/pull/4628 On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust mich...@databricks.com wrote: I was suggesting you mark the variable

HiveContext cannot be serialized

2015-02-16 Thread Haopu Wang
While investigating this issue (at the end of this email), I took a look at HiveContext's code and found this change (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782dda0fb63d#diff-ff50aea397a607b79df9bec6f2a841db): - @transient protected[hive] lazy val hiveconf = new
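
The pattern in question can be shown with a small sketch (a hypothetical class, not Spark's actual code): a @transient lazy val is excluded from serialization and rebuilt lazily on first access after deserialization, which is how a non-serializable member such as a Hive configuration can live inside a Serializable class.

    class QueryContext extends Serializable {
      // Not written out when the instance is serialized; re-created on
      // first access on the deserializing side.
      @transient lazy val conf: java.util.Properties = {
        val p = new java.util.Properties()
        p.setProperty("created.in.thread", Thread.currentThread().getName)
        p
      }
    }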

[SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
I'm using the Spark 1.3.0 RC3 build with Hive support. In the Spark Shell, I want to reuse the HiveContext instance with different warehouse locations. Below are the steps for my test (assume I have loaded a file into table src). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with
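
The test being described looks roughly like the following spark-shell session (paths are hypothetical); whether the second SET actually redirects tables created afterwards for an existing HiveContext is exactly the question of this thread.

    sqlContext.sql("SET hive.metastore.warehouse.dir=/tmp/warehouse1")
    sqlContext.sql("CREATE TABLE t1 AS SELECT * FROM src")  // expect /tmp/warehouse1/t1
    sqlContext.sql("SET hive.metastore.warehouse.dir=/tmp/warehouse2")
    sqlContext.sql("CREATE TABLE t2 AS SELECT * FROM src")  // does t2 land in /tmp/warehouse2?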

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application?
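
One common pattern (a sketch assuming Spark 1.2-era APIs and a socket source; names are illustrative) is to turn each micro-batch RDD into a SchemaRDD inside foreachRDD, register it as a table, and query it with SQL:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Event(user: String, value: Int)

    val ssc = new StreamingContext(sc, Seconds(10))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

    ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(a => Event(a(0), a(1).toInt))
      .foreachRDD { rdd =>
        rdd.registerTempTable("events") // each batch becomes a queryable table
        sqlContext.sql("SELECT user, SUM(value) FROM events GROUP BY user")
          .collect().foreach(println)
      }
    ssc.start()
    ssc.awaitTermination()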

RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you! From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, April 02, 2015 8:11 AM To: Haopu Wang Cc: user; dev@spark.apache.org Subject: Re: Can I call aggregate UDF in DataFrame? You totally can. https://github.com/apache/spark
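
The pointer above is truncated; as a sketch of what "you totally can" looks like with the Spark 1.3-era DataFrame API (column names hypothetical), aggregate functions are called through groupBy(...).agg(...):

    import org.apache.spark.sql.functions._

    val df = sqlContext.createDataFrame(Seq(("a", 1), ("a", 2), ("b", 3)))
      .toDF("key", "value")
    // sum/avg are aggregate functions from org.apache.spark.sql.functions.
    df.groupBy("key").agg(sum("value"), avg("value")).show()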

RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
[mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Yes it is thread safe, at least it's supposed to be. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4:43 PM To: user
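
The usage being asked about can be sketched as several threads sharing one SQLContext and issuing independent queries (table name hypothetical); this is the pattern the "supposed to be" answer refers to:

    val pool = java.util.concurrent.Executors.newFixedThreadPool(4)
    (0 until 4).foreach { i =>
      pool.submit(new Runnable {
        def run(): Unit = {
          // Same SQLContext instance, a different query per thread.
          val n = sqlContext.sql(s"SELECT * FROM src WHERE key % 4 = $i").count()
          println(s"thread $i -> $n rows")
        }
      })
    }
    pool.shutdown()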

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a Scala case class, it works (comparing either as String or as Date). But if the DataFrame is generated by specifying a schema for an RDD, it doesn't work. Below are the exception and test code.
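
The failing case can be sketched as follows (data and column names hypothetical): the schema declares a DateType column explicitly instead of letting it be inferred from a case class, and the filter compares against a java.sql.Date.

    import java.sql.Date
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("d", DateType, nullable = true)))

    val rowRDD = sc.parallelize(Seq(
      Row(1, Date.valueOf("2015-05-01")),
      Row(2, Date.valueOf("2015-05-08"))))

    val df = sqlContext.createDataFrame(rowRDD, schema)
    // This filter works when the DataFrame is inferred from a case class;
    // the thread reports an exception with an explicit schema like the above.
    df.filter(df("d") >= Date.valueOf("2015-05-05")).show()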

RE: [SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-17 Thread Haopu Wang
Can someone help? Thank you! From: Haopu Wang Sent: Monday, June 15, 2015 3:36 PM To: user; dev@spark.apache.org Subject: [SparkStreaming] NPE in DStreamCheckPointData.scala:125 I use the attached program to test checkpoint. It's quite simple. When I run

[SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-15 Thread Haopu Wang
I use the attached program to test checkpointing. It's quite simple. When I run the program a second time, it loads the checkpoint data, which is expected; however, I see an NPE in the driver log. Do you have any idea about the issue? I'm on Spark 1.4.0, thank you very much! == logs ==
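
The attached program is not reproduced in the archive; a sketch of the checkpoint pattern being tested (paths and source hypothetical) is:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "/tmp/cp"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpoint-test")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)
      ssc.socketTextStream("localhost", 9999).count().print()
      ssc
    }

    // The first run builds a fresh context; the second run (where the NPE
    // was observed) recovers the context from the checkpoint data.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()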

RE: Should I avoid "state" in a Spark application?

2016-06-12 Thread Haopu Wang
Can someone look at my questions? Thanks again! From: Haopu Wang Sent: June 12, 2016 16:40 To: u...@spark.apache.org Subject: Should I avoid "state" in a Spark application? I have a Spark application whose structure is below: var ts:
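
The message is truncated at the driver-side var, but the usual alternative to mutable driver state in Spark Streaming is to keep per-key state inside the framework, e.g. with updateStateByKey; a sketch under that assumption (key/value types illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("/tmp/cp") // required for stateful operations

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int]((newValues: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + newValues.sum))
    counts.print()

    ssc.start()
    ssc.awaitTermination()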