Re: scalac crash when compiling DataTypeConversions.scala

2014-10-27 Thread guoxu1231
Hi Stephen, I tried it again. To avoid the profile impact, I executed "mvn -DskipTests clean package" with Hadoop 1.0.4 by default, then opened IDEA and imported it as a Maven project, and I didn't choose any profile in the import wizard. Then "Make project" or "Re-build project" in IDEA, unfortun

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
Sure, let's still focus on the streaming simulation use case. It's a very useful problem to solve. If we're going to use the same Spark Streaming core for the simulation, the simplest way is to have globally sorted RDDs and use ssc.queueStream. Thus collecting the Key part to the driver is probab
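A minimal sketch of the queueStream approach mentioned above, assuming the data has already been split into per-window RDDs on the driver; the names simulateStream, windowRdds, and the one-second batch interval are illustrative assumptions, not part of the thread:

    import scala.collection.mutable
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Replay pre-bucketed, pre-sorted RDDs as a simulated stream, one per batch.
    def simulateStream(sc: SparkContext, windowRdds: Seq[RDD[(Long, String)]]): Unit = {
      val ssc = new StreamingContext(sc, Seconds(1))          // batch interval is an assumption
      val queue = mutable.Queue(windowRdds: _*)               // one queued RDD per simulated batch
      val stream = ssc.queueStream(queue, oneAtATime = true)  // dequeue buckets in order
      stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
      ssc.start()
      ssc.awaitTermination()
    }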

Re: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
Would you mind sharing the DDLs of all the involved tables? In what format are these tables stored? Is this issue specific to this query? I guess Hive, Shark, and Spark SQL all read from the same HDFS dataset? On 10/27/14 3:45 PM, lyf刘钰帆 wrote: Hi, I am using SparkSQL 1.1.0 with CDH 4.6.0 recently,

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
Yes, I understand what you want, but it may be hard to achieve without collecting back to the driver node. Besides, can we think of another way to do it? Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Monday, October 27, 2014 4:07 PM To: Shao, Saisai Cc: user@spark.apache.

Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
There's a small but annoying usability issue in HiveContext. By default, it creates a local metastore, which prevents other processes that use HiveContext from being launched from the same directory. How can I make the metastore local to each HiveContext? Is there an in-memory metastore configuration? /tmp/xx
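One hypothetical workaround, not confirmed in this thread, is to point each HiveContext at its own Derby metastore location so concurrent processes don't collide on ./metastore_db; the in-memory Derby URL below is an assumption:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    // Give each process a unique metastore so HiveContexts don't lock each other out.
    def isolatedHiveContext(sc: SparkContext): HiveContext = {
      val hc = new HiveContext(sc)
      hc.setConf("javax.jdo.option.ConnectionURL",
        s"jdbc:derby:memory:metastore_${java.util.UUID.randomUUID()};create=true")
      hc
    }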

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
Yeah, you're absolutely right, Saisai. My point is that we should allow this kind of logic on RDDs, say, transforming an RDD[(Key, Iterable[T])] into a Seq[(Key, RDD[T])]. Make sense? Jianshi On Mon, Oct 27, 2014 at 3:56 PM, Shao, Saisai wrote: > I think what you want is to make each bucket as a

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
I think what you want is to make each bucket a new RDD, as you mentioned in Pig syntax. gs = ORDER g BY group ASC, g.timestamp ASC // 'group' is the rounded timestamp for each bucket From my understanding, there's currently no such API in Spark to achieve this; maybe you have t

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
Ok, back to Scala code, I'm wondering why I cannot do this:
  data.groupBy(timestamp / window)
    .sortByKey()                              // no sort method available here
    .map(sc.parallelize(_._2.sortBy(_._1)))   // nested RDD, hmm...
    .collect()                                // returns Seq[RDD[(Timestamp, T)]]
Jianshi On Mon, Oct 27, 2014 at 3:24 PM
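A sketch of one way to get a sequence of per-bucket RDDs without nesting RDD operations inside a transformation: collect the grouped buckets to the driver and re-parallelize each one. This mirrors the collect()-based idea in the thread (and shares its scalability limits); the (Long, String) record shape and the names bucketsAsRdds, data, and window are assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Bucket records by time window, pull the buckets to the driver, and turn
    // each bucket back into its own RDD, ordered by window.
    def bucketsAsRdds(sc: SparkContext,
                      data: RDD[(Long, String)],   // (timestamp, payload)
                      window: Long): Seq[(Long, RDD[(Long, String)])] = {
      data.groupBy { case (ts, _) => ts / window }  // key = rounded time window
          .collect()                                // buckets now live on the driver
          .sortBy(_._1)                             // order buckets by window
          .map { case (bucket, recs) =>
            (bucket, sc.parallelize(recs.toSeq.sortBy(_._1)))
          }
          .toSeq
    }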

Re: Spark SQL configuration

2014-10-27 Thread Akhil Das
You will face problems if the Spark version isn't compatible with your Hadoop version. (Let's say you have Hadoop 2.x and you downloaded Spark pre-compiled against Hadoop 1.x; then it would be a problem.) Of course you can use Spark without specifying any Hadoop configuration unless you are trying t
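For reference, a build command of the kind the Spark documentation describes for matching a specific Hadoop version (the exact profile and version number below are illustrative):

    mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package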

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
You're absolutely right, it's not 'scalable' since I'm using collect(). However, it's important to have the RDDs ordered by the timestamp of the time window (groupBy puts data into the corresponding time window). It's fairly easy to do in Pig, but somehow I have no idea how to express it with RDDs... Somethi

Re: Spark optimization

2014-10-27 Thread Akhil Das
There is no tool to tweak a Spark cluster, but while writing the job you can consider the Tuning guidelines. Thanks Best Regards On Mon, Oct 27, 2014 at 3:14 AM, Morbious wrote: > I wonder if there is any tool to tweak Spark (worker and master)

Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-27 Thread Akhil Das
It works fine on my *Spark 1.1.0* Thanks Best Regards On Mon, Oct 27, 2014 at 12:22 AM, octavian.ganea wrote: > Hi Akhil, > > Please see this related message. > > http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-td17263.html > > I am curious if this works for you also. >

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
I think your solution may not scale as the data size increases, since you have to collect all your data back to the driver node, so the driver's memory usage will be a problem. Why not filter out specific time-range data as an RDD? After filtering over the whole time range, you will get a se
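A minimal sketch of this filtering idea, assuming (timestamp, payload) records and hypothetical names rddPerWindow, start, end, and window; only the window boundaries live on the driver:

    import org.apache.spark.rdd.RDD

    // Derive one filtered RDD per time window instead of collecting the data.
    def rddPerWindow(data: RDD[(Long, String)],
                     start: Long, end: Long, window: Long): Seq[RDD[(Long, String)]] = {
      (start until end by window).map { lo =>
        val hi = lo + window
        data.filter { case (ts, _) => ts >= lo && ts < hi }
      }
    }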
