RE: SQL with Spark Streaming

2015-03-12 Thread Huang, Jie
@Tobias,

According to my understanding, your approach is to register a series of tables 
by using transformWith, right? And then you can get a new DStream (i.e., a 
SchemaDStream), which consists of a sequence of SchemaRDDs.

Please correct me if my understanding is wrong.

Thank you & Best Regards,
Grace (Huang Jie)

From: Jason Dai [mailto:jason@gmail.com]
Sent: Wednesday, March 11, 2015 10:45 PM
To: Irfan Ahmad
Cc: Tobias Pfeiffer; Cheng, Hao; Mohit Anchlia; user@spark.apache.org; Shao, 
Saisai; Dai, Jason; Huang, Jie
Subject: Re: SQL with Spark Streaming

Sorry typo; should be https://github.com/intel-spark/stream-sql

Thanks,
-Jason

On Wed, Mar 11, 2015 at 10:19 PM, Irfan Ahmad 
ir...@cloudphysics.com wrote:
Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql


Irfan Ahmad
CTO | Co-Founder | CloudPhysics http://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 11, 2015 at 6:41 AM, Jason Dai 
jason@gmail.com wrote:
Yes, a previous prototype is available at 
https://github.com/Intel-bigdata/spark-streamsql, and a talk was given at last 
year's Spark Summit 
(http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark).

We are currently porting the prototype to use the latest DataFrame API, and 
will provide a stable version for people to try soon.

Thanks,
-Jason


On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer 
t...@preferred.jp wrote:
Hi,

On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao 
hao.ch...@intel.com wrote:
Intel has a prototype for doing this; SaiSai and Jason are the authors. 
Probably you can ask them for some materials.

The github repository is here: https://github.com/intel-spark/stream-sql

Also, what I did was write a wrapper class SchemaDStream that internally holds 
a DStream[Row] and a DStream[StructType] (the latter having just one element in 
every RDD) and then allows you to do
- operations SchemaRDD => SchemaRDD using 
`rowStream.transformWith(schemaStream, ...)`
- in particular, you can register this stream's data as a table this way
- and via a companion object with a method `fromSQL(sql: String): 
SchemaDStream` you can get a new stream from previously registered tables.

However, you are limited to batch-internal operations, i.e., you can't 
aggregate across batches.

I am not able to share the code at the moment, but will within the next few 
months. It is not very advanced code, though, and should be easy to replicate. 
Also, I have no idea about the performance of transformWith.

Tobias






Re: SQL with Spark Streaming

2015-03-11 Thread Tobias Pfeiffer
Hi,

On Thu, Mar 12, 2015 at 12:08 AM, Huang, Jie jie.hu...@intel.com wrote:

 According to my understanding, your approach is to register a series of
 tables by using transformWith, right? And then you can get a new DStream
 (i.e., a SchemaDStream), which consists of a sequence of SchemaRDDs.


Yep, it's basically the following:

// Imports assumed for the Spark 1.2.x SchemaRDD API.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SchemaRDD, StructType}
import org.apache.spark.streaming.dstream.DStream

case class SchemaDStream(sqlc: SQLContext,
                         dataStream: DStream[Row],
                         schemaStream: DStream[StructType]) {

  // Registers every micro-batch of this stream as temporary table `name`.
  def registerStreamAsTable(name: String): Unit = {
    foreachRDD(_.registerTempTable(name))
  }

  def foreachRDD(func: SchemaRDD => Unit): Unit = {
    def executeFunction(dataRDD: RDD[Row],
                        schemaRDD: RDD[StructType]): RDD[Unit] = {
      // The schema stream carries exactly one StructType per batch.
      val schema: StructType = schemaRDD.collect.head
      val dataWithSchema: SchemaRDD = sqlc.applySchema(dataRDD, schema)
      val result = func(dataWithSchema)
      schemaRDD.map(x => result) // return RDD[Unit]
    }
    // The trailing foreachRDD(_.count()) just forces evaluation.
    dataStream.transformWith(schemaStream,
      executeFunction _).foreachRDD(_.count())
  }

}
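
To give an idea of usage, here is a hypothetical sketch (the stream
variables and the table name are made up for illustration, and it is
untested):

// `rowStream: DStream[Row]` and `schemaStream: DStream[StructType]`
// are assumed to exist already.
val stream = SchemaDStream(sqlc, rowStream, schemaStream)
stream.registerStreamAsTable("events")
stream.foreachRDD { batch =>
  // `batch` is the current micro-batch as a SchemaRDD
  println(sqlc.sql("SELECT COUNT(*) FROM events").first())
}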

In a similar way you could add a `transform(func: SchemaRDD => SchemaRDD)`
method, roughly as sketched below. But as I said, I am not sure about
performance.
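
For illustration only, such a method might look roughly like this (a
hypothetical, untested sketch that follows the same pattern as
foreachRDD above and assumes `func` does not change the schema):

def transform(func: SchemaRDD => SchemaRDD): SchemaDStream = {
  def executeFunction(dataRDD: RDD[Row],
                      schemaRDD: RDD[StructType]): RDD[Row] = {
    val schema: StructType = schemaRDD.collect.head
    func(sqlc.applySchema(dataRDD, schema)) // a SchemaRDD is an RDD[Row]
  }
  // Reuses the old schema stream; a full version would derive the new
  // schema from the result of `func`.
  SchemaDStream(sqlc,
    dataStream.transformWith(schemaStream, executeFunction _),
    schemaStream)
}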

Tobias


Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Yes, a previous prototype is available at
https://github.com/Intel-bigdata/spark-streamsql, and a talk was given at
last year's Spark Summit (
http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark
).

We are currently porting the prototype to use the latest DataFrame API, and
will provide a stable version for people to try soon.

Thanks,
-Jason


On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote:

  Intel has a prototype for doing this; SaiSai and Jason are the authors.
 Probably you can ask them for some materials.


 The github repository is here: https://github.com/intel-spark/stream-sql

 Also, what I did was write a wrapper class SchemaDStream that internally
 holds a DStream[Row] and a DStream[StructType] (the latter having just one
 element in every RDD) and then allows you to do
 - operations SchemaRDD => SchemaRDD using
 `rowStream.transformWith(schemaStream, ...)`
 - in particular, you can register this stream's data as a table this way
 - and via a companion object with a method `fromSQL(sql: String):
 SchemaDStream` you can get a new stream from previously registered tables.

 However, you are limited to batch-internal operations, i.e., you can't
 aggregate across batches.

 I am not able to share the code at the moment, but will within the next
 few months. It is not very advanced code, though, and should be easy to
 replicate. Also, I have no idea about the performance of transformWith.

 Tobias




Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Sorry typo; should be https://github.com/intel-spark/stream-sql

Thanks,
-Jason

On Wed, Mar 11, 2015 at 10:19 PM, Irfan Ahmad ir...@cloudphysics.com
wrote:

 Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql


 *Irfan Ahmad*
 CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
 Best of VMworld Finalist
 Best Cloud Management Award
 NetworkWorld 10 Startups to Watch
 EMA Most Notable Vendor

 On Wed, Mar 11, 2015 at 6:41 AM, Jason Dai jason@gmail.com wrote:

 Yes, a previous prototype is available at
 https://github.com/Intel-bigdata/spark-streamsql, and a talk was given at
 last year's Spark Summit (
 http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark
 ).

 We are currently porting the prototype to use the latest DataFrame API,
 and will provide a stable version for people to try soon.

 Thanks,
 -Jason


 On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Hi,

 On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote:

  Intel has a prototype for doing this; SaiSai and Jason are the
 authors. Probably you can ask them for some materials.


 The github repository is here: https://github.com/intel-spark/stream-sql

 Also, what I did was write a wrapper class SchemaDStream that
 internally holds a DStream[Row] and a DStream[StructType] (the latter
 having just one element in every RDD) and then allows you to do
 - operations SchemaRDD => SchemaRDD using
 `rowStream.transformWith(schemaStream, ...)`
 - in particular, you can register this stream's data as a table this way
 - and via a companion object with a method `fromSQL(sql: String):
 SchemaDStream` you can get a new stream from previously registered tables.

 However, you are limited to batch-internal operations, i.e., you can't
 aggregate across batches.

 I am not able to share the code at the moment, but will within the next
 few months. It is not very advanced code, though, and should be easy to
 replicate. Also, I have no idea about the performance of transformWith.

 Tobias






Re: SQL with Spark Streaming

2015-03-11 Thread Irfan Ahmad
Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 11, 2015 at 6:41 AM, Jason Dai jason@gmail.com wrote:

 Yes, a previous prototype is available at
 https://github.com/Intel-bigdata/spark-streamsql, and a talk was given at
 last year's Spark Summit (
 http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark
 ).

 We are currently porting the prototype to use the latest DataFrame API,
 and will provide a stable version for people to try soon.

 Thanks,
 -Jason


 On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote:

  Intel has a prototype for doing this; SaiSai and Jason are the
 authors. Probably you can ask them for some materials.


 The github repository is here: https://github.com/intel-spark/stream-sql

 Also, what I did was write a wrapper class SchemaDStream that internally
 holds a DStream[Row] and a DStream[StructType] (the latter having just one
 element in every RDD) and then allows you to do
 - operations SchemaRDD => SchemaRDD using
 `rowStream.transformWith(schemaStream, ...)`
 - in particular, you can register this stream's data as a table this way
 - and via a companion object with a method `fromSQL(sql: String):
 SchemaDStream` you can get a new stream from previously registered tables.

 However, you are limited to batch-internal operations, i.e., you can't
 aggregate across batches.

 I am not able to share the code at the moment, but will within the next
 few months. It is not very advanced code, though, and should be easy to
 replicate. Also, I have no idea about the performance of transformWith.

 Tobias





Re: SQL with Spark Streaming

2015-03-10 Thread Tobias Pfeiffer
Hi,

On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote:

  Intel has a prototype for doing this; SaiSai and Jason are the authors.
 Probably you can ask them for some materials.


The github repository is here: https://github.com/intel-spark/stream-sql

Also, what I did was write a wrapper class SchemaDStream that internally
holds a DStream[Row] and a DStream[StructType] (the latter having just one
element in every RDD) and then allows you to do
- operations SchemaRDD => SchemaRDD using
`rowStream.transformWith(schemaStream, ...)`
- in particular, you can register this stream's data as a table this way
- and via a companion object with a method `fromSQL(sql: String):
SchemaDStream` you can get a new stream from previously registered tables.

However, you are limited to batch-internal operations, i.e., you can't
aggregate across batches.

I am not able to share the code at the moment, but will within the next
few months. It is not very advanced code, though, and should be easy to
replicate. Also, I have no idea about the performance of transformWith.

Tobias


RE: SQL with Spark Streaming

2015-03-10 Thread Cheng, Hao
Intel has a prototype for doing this; SaiSai and Jason are the authors. 
Probably you can ask them for some materials.

From: Mohit Anchlia [mailto:mohitanch...@gmail.com]
Sent: Wednesday, March 11, 2015 8:12 AM
To: user@spark.apache.org
Subject: SQL with Spark Streaming

Does Spark Streaming also support SQL? Something like how Esper does CEP.