RE: SQL with Spark Streaming
@Tobias, according to my understanding, your approach is to register a series of tables by using transformWith, right? And then you can get a new DStream (i.e., SchemaDStream), which consists of a sequence of SchemaRDDs, one per batch. Please correct me if my understanding is wrong.

Thank you.

Best Regards,
Grace (Huang Jie)
Re: SQL with Spark Streaming
Hi,

On Thu, Mar 12, 2015 at 12:08 AM, Huang, Jie <jie.hu...@intel.com> wrote:

> According to my understanding, your approach is to register a series of tables by using transformWith, right? And then you can get a new DStream (i.e., SchemaDStream), which consists of a sequence of SchemaRDDs.

Yep, it's basically the following:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.streaming.dstream.DStream

case class SchemaDStream(
    sqlc: SQLContext,
    dataStream: DStream[Row],
    schemaStream: DStream[StructType]) {

  // Register each batch's data as a temporary table under `name`.
  def registerStreamAsTable(name: String): Unit = {
    foreachRDD(_.registerTempTable(name))
  }

  def foreachRDD(func: SchemaRDD => Unit): Unit = {
    def executeFunction(dataRDD: RDD[Row],
                        schemaRDD: RDD[StructType]): RDD[Unit] = {
      val schema: StructType = schemaRDD.collect.head
      val dataWithSchema: SchemaRDD = sqlc.applySchema(dataRDD, schema)
      val result = func(dataWithSchema)
      schemaRDD.map(x => result)  // return RDD[Unit]
    }
    dataStream.transformWith(schemaStream, executeFunction _)
      .foreachRDD(_.count())  // force evaluation of each batch
  }
}
```

In a similar way you could add a `transform(func: SchemaRDD => SchemaRDD)` method (see the sketch below). But as I said, I am not sure about the performance.

Tobias
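For reference, such a `transform` method might look like the following. This is only a sketch following the same `transformWith` pattern as `foreachRDD` above, not Tobias's actual code; it invokes `func` a second time to derive the result schema, which is typically cheap since SchemaRDDs are lazy.

```scala
// Hypothetical addition to the SchemaDStream class above.
def transform(func: SchemaRDD => SchemaRDD): SchemaDStream = {
  // Re-attach the per-batch schema and apply the user function to the rows.
  def transformRows(dataRDD: RDD[Row],
                    schemaRDD: RDD[StructType]): RDD[Row] = {
    val schema = schemaRDD.collect.head
    func(sqlc.applySchema(dataRDD, schema))
  }
  // Build the one-element schema RDD of the result. `func` is invoked
  // again here, but only to obtain the schema; no Spark job is run.
  def transformSchema(dataRDD: RDD[Row],
                      schemaRDD: RDD[StructType]): RDD[StructType] = {
    val schema = schemaRDD.collect.head
    val resultSchema = func(sqlc.applySchema(dataRDD, schema)).schema
    schemaRDD.map(_ => resultSchema)
  }
  SchemaDStream(
    sqlc,
    dataStream.transformWith(schemaStream, transformRows _),
    dataStream.transformWith(schemaStream, transformSchema _))
}
```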
Re: SQL with Spark Streaming
Yes, a previous prototype is available at https://github.com/Intel-bigdata/spark-streamsql, and a talk was given at last year's Spark Summit (http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark).

We are currently porting the prototype to use the latest DataFrame API, and will provide a stable version for people to try soon.

Thanks,
-Jason
Re: SQL with Spark Streaming
Sorry, typo; it should be https://github.com/intel-spark/stream-sql

Thanks,
-Jason

On Wed, Mar 11, 2015 at 10:19 PM, Irfan Ahmad <ir...@cloudphysics.com> wrote:

> Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql
Re: SQL with Spark Streaming
Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql

Irfan Ahmad
CTO | Co-Founder | CloudPhysics (http://www.cloudphysics.com)
Best of VMworld Finalist | Best Cloud Management Award | NetworkWorld 10 Startups to Watch | EMA Most Notable Vendor
Re: SQL with Spark Streaming
Hi,

On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Intel has a prototype for doing this; SaiSai and Jason are the authors. Probably you can ask them for some materials.

The GitHub repository is here: https://github.com/intel-spark/stream-sql

Also, what I did was write a wrapper class SchemaDStream that internally holds a DStream[Row] and a DStream[StructType] (the latter having just one element in every RDD) and then allows you to:

- perform operations SchemaRDD => SchemaRDD using `rowStream.transformWith(schemaStream, ...)`,
- in particular, register this stream's data as a table that way, and
- get a new stream from previously registered tables via a companion object with a method `fromSQL(sql: String): SchemaDStream`.

However, you are limited to batch-internal operations, i.e., you can't aggregate across batches. I am not able to share the code at the moment, but will within the next few months. It is not very advanced code, though, and should be easy to replicate. Also, I have no idea about the performance of transformWith.

Tobias
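The `fromSQL` method itself is never shown in this thread. The following is one possible reconstruction under the conventions of the SchemaDStream class above; the `trigger` parameter is an assumption (some DStream in the same StreamingContext is needed to fire the query once per micro-batch), and the ordering relative to per-batch table registration is not addressed here.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical reconstruction of the companion object; not Tobias's code.
object SchemaDStream {
  def fromSQL(sqlc: SQLContext, trigger: DStream[_], sql: String): SchemaDStream = {
    // Run the query once per batch against the currently registered tables.
    val dataStream: DStream[Row] = trigger.transform { _ =>
      sqlc.sql(sql): RDD[Row]  // a SchemaRDD is an RDD[Row]
    }
    // One-element RDD carrying the result schema, matching the convention
    // of the schema stream above. `sql(...)` is lazy, so this only runs
    // analysis, not the query itself.
    val schemaStream: DStream[StructType] = trigger.transform { rdd =>
      rdd.sparkContext.parallelize(Seq(sqlc.sql(sql).schema), 1)
    }
    SchemaDStream(sqlc, dataStream, schemaStream)
  }
}
```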
RE: SQL with Spark Streaming
Intel has a prototype for doing this; SaiSai and Jason are the authors. Probably you can ask them for some materials.

From: Mohit Anchlia [mailto:mohitanch...@gmail.com]
Sent: Wednesday, March 11, 2015 8:12 AM
To: user@spark.apache.org
Subject: SQL with Spark Streaming

Does Spark Streaming also support SQL? Something like how Esper does CEP.
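For context, even without the Intel prototype, the basic pattern the rest of the thread builds on is running a Spark SQL query over each micro-batch. Below is a minimal sketch against the Spark 1.2-era API; the socket source, the Event schema, and the query are made up for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch (not the Intel prototype): run a SQL query over each
// micro-batch of a stream. Host/port, schema, and query are hypothetical.
object StreamingSqlSketch {
  case class Event(user: String, action: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSqlSketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sqlc = new SQLContext(ssc.sparkContext)
    import sqlc.createSchemaRDD  // implicit RDD[Product] => SchemaRDD

    // Hypothetical text source: lines of the form "user,action".
    val events = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(a => Event(a(0), a(1)))

    // Per batch: register the RDD as a temp table and query it with SQL.
    events.foreachRDD { rdd =>
      rdd.registerTempTable("events")
      sqlc.sql("SELECT action, COUNT(*) FROM events GROUP BY action")
          .collect()
          .foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that this gives batch-internal SQL only; as Tobias points out above, aggregating across batches needs a different mechanism (e.g. `updateStateByKey` or an external store).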