RE: Should I avoid "state" in a Spark application?

2016-06-12 Thread Haopu Wang
Can someone look at my questions? Thanks again! From: Haopu Wang Sent: June 12, 2016 16:40 To: u...@spark.apache.org Subject: Should I avoid "state" in a Spark application? I have a Spark application whose structure is below: var ts:
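The program is cut off after "var ts:", so the exact structure is unknown. As a rough sketch only (hypothetical names, a local socket source assumed), driver-side mutable state in a streaming job usually looks like the following; the var lives only in the driver JVM, executors never see updates to it, and it is not restored from a checkpoint.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DriverStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("driver-state-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical driver-side mutable state, updated once per batch on the driver.
    var ts: Long = 0L

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { (rdd, time) =>
      ts = time.milliseconds                       // runs on the driver
      println(s"batch at $ts had ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}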

RE: [SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-17 Thread Haopu Wang
Can someone help? Thank you! From: Haopu Wang Sent: Monday, June 15, 2015 3:36 PM To: user; dev@spark.apache.org Subject: [SparkStreaming] NPE in DStreamCheckPointData.scala:125 I use the attached program to test checkpoint. It's quite simple. When

[SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-15 Thread Haopu Wang
I use the attached program to test checkpoint. It's quite simple. When I run the program a second time, it loads the checkpoint data, which is expected; however, I see an NPE in the driver log. Do you have any idea about the issue? I'm on Spark 1.4.0, thank you very much! == logs == 15/
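The attached test program is not preserved in the archive. Below is a minimal sketch of the standard checkpoint/recovery pattern that the description matches (Spark 1.4-era API; the path, host, and port are made up); the first run builds a fresh context, and the second run restores it from the checkpoint directory, which is where the NPE was reported.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRepro {
  val checkpointDir = "/tmp/spark-checkpoint-test"   // hypothetical path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-repro").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)

    // All DStream setup has to happen inside the creating function.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Second run recovers the context (and DStream graph) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}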

RE: [SparkSQL] cannot filter by a DateType column

2015-05-10 Thread Haopu Wang
, 2015 2:41 AM To: Haopu Wang Cc: user; dev@spark.apache.org Subject: Re: [SparkSQL] cannot filter by a DateType column What version of Spark are you using? It appears that at least in master we are doing the conversion correctly, but it's possible older versions of applySchema do not. If you can

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a Scala case class, it works (comparing either as String or as Date). But if the DataFrame is generated by specifying a Schema to an RDD, it doesn't work. Below is the exception and test code. D
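The exception and test code are truncated above. The following is a hedged repro sketch of the two construction paths the thread contrasts (Spark 1.3-era API, made-up data): the first DataFrame comes from a case class, the second from an RDD[Row] plus an explicit schema containing a DateType column, with the same filter applied to both. The second path is the one that reportedly fails on the poster's version.

import java.sql.Date
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

object DateFilterSketch {
  case class Event(name: String, happened: Date)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("date-filter").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Path 1: DataFrame derived from a case class; filtering reportedly works.
    val fromCaseClass = sc.parallelize(Seq(Event("a", Date.valueOf("2015-05-01")))).toDF()
    fromCaseClass.filter(fromCaseClass("happened") > Date.valueOf("2015-04-30")).show()

    // Path 2: DataFrame built from an RDD[Row] plus an explicit schema.
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("happened", DateType)))
    val rows = sc.parallelize(Seq(Row("a", Date.valueOf("2015-05-01"))))
    val fromSchema = sqlContext.createDataFrame(rows, schema)
    fromSchema.filter(fromSchema("happened") > Date.valueOf("2015-04-30")).show()
  }
}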

RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Yes it is thread safe, at least it's supposed to be. -Original Message----- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4
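The answer above is that SQLContext is intended to be thread safe for concurrent queries. A small sketch of what that usage looks like (roughly the 1.3 API, made-up table and data): one shared SQLContext, several threads issuing read-only queries, and no concurrent configuration changes.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SharedSQLContextSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-sqlcontext").setMaster("local[4]"))
    val sqlContext = new SQLContext(sc)

    sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
      .registerTempTable("t")

    // One SQLContext shared by several threads; each thread only runs queries.
    val queries = (1 to 4).map { i =>
      Future { sqlContext.sql(s"SELECT count(*) FROM t WHERE id >= $i").collect() }
    }
    queries.foreach(f => println(Await.result(f, 1.minute).mkString(",")))
  }
}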

RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you! From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, April 02, 2015 8:11 AM To: Haopu Wang Cc: user; dev@spark.apache.org Subject: Re: Can I call aggregate UDF in DataFrame? You totally can. https://github.com/apache/spark

Can I call aggregate UDF in DataFrame?

2015-03-26 Thread Haopu Wang
Specifically there are only 5 aggregate functions in class org.apache.spark.sql.GroupedData: sum/max/min/mean/count. Can I plug in a function to calculate stddev? Thank you! - To unsubscribe, e-mail: dev-unsubscr...@spark.apache
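According to the follow-up in this thread, this is possible through the DataFrame API. One hedged workaround, assuming a build with Hive support like the other threads here, is to go through SQL on a HiveContext and use Hive's built-in stddev UDAFs (the table and data below are made up):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object StddevSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stddev-sketch").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val df = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0))).toDF("key", "value")
    df.registerTempTable("points")

    // Hive ships stddev/stddev_pop/stddev_samp UDAFs, so SQL on a HiveContext
    // gives an aggregate that GroupedData does not expose directly.
    hiveContext.sql("SELECT key, stddev_pop(value) FROM points GROUP BY key").show()
  }
}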

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
SELECT key,value FROM src") scala> output.saveAsTable("outputtable") From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Wednesday, March 11, 2015 8:25 AM To: Haopu Wang; user; dev@spark.apache.org Subject: RE: [SparkSQL] Reuse HiveContext to diffe

[SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
I'm using a Spark 1.3.0 RC3 build with Hive support. In the Spark Shell, I want to reuse the HiveContext instance across different warehouse locations. Below are the steps for my test (assume I have loaded a file into table "src"). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with
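The quoted reply above shows the spark-shell steps only in part. A sketch of the sequence being tested (warehouse paths are hypothetical): switch hive.metastore.warehouse.dir on the same HiveContext between two saveAsTable calls; whether the second table actually lands in the new location is exactly the question of this thread.

scala> sqlContext.setConf("hive.metastore.warehouse.dir", "/tmp/warehouse_a")  // hypothetical path
scala> val output = sqlContext.sql("SELECT key, value FROM src")
scala> output.saveAsTable("outputtable_a")
scala> sqlContext.setConf("hive.metastore.warehouse.dir", "/tmp/warehouse_b")  // second location
scala> sqlContext.sql("SELECT key, value FROM src").saveAsTable("outputtable_b")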

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application? Thank
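A common pattern at the time (Spark 1.2 API; SchemaRDD became DataFrame in 1.3) was to convert each batch's RDD into a SchemaRDD inside foreachRDD, register it as a temporary table, and query it with SQL. A minimal sketch with a made-up socket source and query:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSqlSketch {
  case class Word(text: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sql-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD (Spark 1.2 API)

    val words = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+")).map(Word)

    // Per batch: treat the RDD as a SchemaRDD, register it, and query it with SQL.
    words.foreachRDD { rdd =>
      rdd.registerTempTable("words")
      sqlContext.sql("SELECT text, count(*) FROM words GROUP BY text").collect()
        .foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}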

RE: HiveContext cannot be serialized

2015-02-16 Thread Haopu Wang
To: Michael Armbrust Cc: Haopu Wang; dev@spark.apache.org Subject: Re: HiveContext cannot be serialized I submitted a patch https://github.com/apache/spark/pull/4628 On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust wrote: I was suggesting you mark the variable that is holding the

HiveContext cannot be serialized

2015-02-16 Thread Haopu Wang
While investigating this issue (described at the end of this email), I took a look at HiveContext's code and found this change (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782dda0fb63d#diff-ff50aea397a607b79df9bec6f2a841db): - @transient protected[hive] lazy val hiveconf = new
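The suggestion later in the thread is to mark the variable holding the HiveContext as @transient so it is never pulled into serialized closures. A minimal sketch of that idea (hypothetical class and query, not the poster's actual code):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

class QueryRunner(@transient val sc: SparkContext) extends Serializable {

  // @transient: even if this object is serialized, the contexts are not shipped.
  @transient lazy val hiveContext = new HiveContext(sc)

  def run(): Array[String] = {
    val df = hiveContext.sql("SELECT key, value FROM src")
    // The closure below only captures a local, serializable value;
    // hiveContext itself is never referenced inside it.
    val prefix = "val="
    df.rdd.map(row => prefix + row.getString(1)).collect()
  }
}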

Do you know any Spark modeling tool?

2014-12-25 Thread Haopu Wang
Hi, I think a modeling tool may be helpful because sometimes it's hard/tricky to program Spark. I don't know if there is already such a tool. Thanks! - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional comma

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-07 Thread Haopu Wang
Liquan, yes, for full outer join, one hash table on both sides is more efficient. For the left/right outer join, it looks like one hash table should be enough. From: Liquan Pei [mailto:liquan...@gmail.com] Sent: September 30, 2014 18:34 To: Haopu Wang Cc: dev
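To make the asymmetry concrete, here is a plain-Scala sketch (not Spark's HashOuterJoin) of a left outer join that hashes only the build side and streams the other side; a full outer join would additionally have to track which build-side rows were matched, which is why it keeps state for both sides.

object LeftOuterJoinSketch {
  def leftOuterJoin[K, L, R](streamed: Iterator[(K, L)],
                             build: Seq[(K, R)]): Iterator[(K, (L, Option[R]))] = {
    // Hash table on one side only.
    val table: Map[K, Seq[R]] =
      build.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    // Stream the other side row by row.
    streamed.flatMap { case (k, l) =>
      table.get(k) match {
        case Some(rs) => rs.iterator.map(r => (k, (l, Option(r))))
        case None     => Iterator((k, (l, Option.empty[R])))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val left  = Iterator(("a", 1), ("b", 2))
    val right = Seq(("a", "x"), ("a", "y"))
    leftOuterJoin(left, right).foreach(println)
  }
}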

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
Thanks again! From: Liquan Pei [mailto:liquan...@gmail.com] Sent: September 30, 2014 12:31 To: Haopu Wang Cc: dev@spark.apache.org; user Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin? Hi Haopu, My understanding is that the hashtable on both left and

Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I took a look at HashOuterJoin and it builds a hash table for both sides. This consumes quite a lot of memory when the partition is big. And it doesn't reduce the iteration on the streamed relation, right? Thanks! - To unsubscrib

FW: Spark SQL 1.1.0: NPE when joining two cached tables

2014-09-22 Thread Haopu Wang
Forwarding to the dev mailing list for help. From: Haopu Wang Sent: September 22, 2014 16:35 To: u...@spark.apache.org Subject: Spark SQL 1.1.0: NPE when joining two cached tables I have two data sets and want to join them on their first fields. Sample data are below: data set
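The sample data is cut off in the archive, so the following is only a hypothetical repro sketch of the setup described (Spark 1.1-era API, made-up tables and values): two data sets registered as tables, both cached, then joined on the first field.

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object CachedJoinRepro {
  case class A(id: Int, name: String)
  case class B(id: Int, score: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cached-join-repro").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

    sc.parallelize(Seq(A(1, "a"), A(2, "b"))).registerTempTable("ta")
    sc.parallelize(Seq(B(1, 0.5), B(2, 0.7))).registerTempTable("tb")

    // Cache both tables, then join them on the first field.
    sqlContext.cacheTable("ta")
    sqlContext.cacheTable("tb")
    sqlContext.sql(
      "SELECT ta.id, ta.name, tb.score FROM ta JOIN tb ON ta.id = tb.id")
      .collect().foreach(println)
  }
}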