Re: Happy Diwali to those forum members who celebrate this great festival

2016-10-30 Thread Sivakumaran S
Thank you Dr Mich :)

Regards

Sivakumaran S

> On 30-Oct-2016, at 4:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Enjoy the festive season.
> 
> Regards,
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  



Re: Scala Vs Python

2016-09-02 Thread Sivakumaran S
Whatever benefits you may accrue from rapid prototyping and coding in Python will be offset by the time taken to convert it to run inside the JVM. This of course depends on the complexity of the DAG. I guess it is a matter of language preference.

Regards,

Sivakumaran S
> On 02-Sep-2016, at 8:58 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> From an outsider point of view nobody likes change :)
> 
> However, it appears to me that Scala is a rising star and if one learns it, 
> it is another iron in the fire so to speak. I believe as we progress in time 
> Spark is going to move away from Python. If you look at 2014 Databricks code 
> examples, they were mostly in Python. Now they are mostly in Scala for a 
> reason.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 September 2016 at 08:23, Jakob Odersky <ja...@odersky.com 
> <mailto:ja...@odersky.com>> wrote:
> Forgot to answer your question about feature parity of Python w.r.t. Spark's 
> different components
> I mostly work with scala so I can't say for sure but I think that all pre-2.0 
> features (that's basically everything except Structured Streaming) are on 
> par. Structured Streaming is a pretty new feature and Python support is 
> currently not available. The API is not final however and I reckon that 
> Python support will arrive once it gets finalized, probably in the next 
> version.
> 
> 



Re: How to convert List into json object / json Array

2016-08-30 Thread Sivakumaran S
Look at scala.util.parsing.json or the Jackson library for JSON manipulation. Also read
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
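For example, a minimal sketch using scala.util.parsing.json (the record list and field names below are made up for illustration):

import scala.util.parsing.json.{JSONArray, JSONObject}

object ListToJson {
  def main(args: Array[String]): Unit = {
    // A hypothetical list of records, each represented as a Map[String, Any]
    val readings: List[Map[String, Any]] = List(
      Map("id" -> 1, "temp" -> 12.5),
      Map("id" -> 2, "temp" -> 11.9)
    )
    // Wrap each Map in a JSONObject and the whole list in a JSONArray
    val jsonArray = JSONArray(readings.map(JSONObject(_)))
    println(jsonArray.toString()) // prints a JSON array string like [{"id" : 1, ...}, ...]
  }
}

Jackson (with the jackson-module-scala add-on) covers the same ground if you prefer that route.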

Regards,

Sivakumaran S

Re: Design patterns involving Spark

2016-08-28 Thread Sivakumaran S
Spark fits best for processing. Depending on the use case, though, you could expand its scope to moving data using the native connectors. The only thing that Spark is not is storage; connectors are available for most storage options.
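
As a rough illustration of that Input -> Data in Motion -> Processing -> Storage shape, here is a minimal batch sketch; the paths are placeholders, and the storage step uses the built-in Parquet writer, but a native connector (Cassandra, HBase, JDBC, ...) would slot into the same place:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipelineSketch"))
    val sqlContext = new SQLContext(sc)

    // Ingest: read raw JSON records (hypothetical input path)
    val events = sqlContext.read.json("hdfs:///data/incoming/events")

    // Process: the part Spark is actually built for
    val summary = events.groupBy("id").count()

    // Store: hand the result to a storage layer (Parquet here; a connector works the same way)
    summary.write.parquet("hdfs:///data/output/event_counts")

    sc.stop()
  }
}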

Regards,

Sivakumaran S



> On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
> 
> Hi,
> 
> There are design patterns that use Spark extensively. I am new to this area 
> so I would appreciate if someone explains where Spark fits in especially 
> within faster or streaming use case.
> 
> What are the best practices involving Spark. Is it always best to deploy it 
> for processing engine, 
> 
> For example when we have a pattern 
> 
> Input Data -> Data in Motion -> Processing -> Storage 
> 
> Where does Spark best fit in.
> 
> Thanking you 


Re: quick question

2016-08-25 Thread Sivakumaran S
A Spark job using a streaming context is an endless "while" loop until you kill it or specify when to stop. Initiate a TCP server before you start the stream processing
(https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_a_WebSocket_server_in_Java).

As long as your driver program is running, it will have a TCP server listening for connections on the port you specify (you can check with netstat).

And as long as your job is running, a client (a browser, in your case, running the dashboard code) will be able to connect to the TCP server running in your Spark job and receive the data that you write from it.

As per the WebSocket protocol, this connection stays open. Read this too:
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers

Once you have the data you need at the client, you just need to figure out a way to push it into the JavaScript object holding the data in your dashboard and refresh it. If it is an array, you can drop the oldest entry and append the latest. If it is a hash or a dictionary, you can just update the value.

I would suggest using JSON for server-client communication. It is easier to navigate JSON objects in Javascript :) But your requirements may vary.

This may help too:
http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/HomeWebsocket/WebsocketHome.html#section7
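
To make the push side concrete, here is a minimal sketch in Scala using a plain java.net.ServerSocket and line-delimited JSON (it skips the actual WebSocket handshake, so you would either put a small bridge in front of the browser or swap in a proper WebSocket library); the port and payload are made up:

import java.io.PrintWriter
import java.net.ServerSocket
import scala.collection.mutable.ArrayBuffer

// A tiny push server started from the driver before stream processing begins
object JsonPushServer {
  private val clients = ArrayBuffer[PrintWriter]()

  def start(port: Int): Unit = {
    val server = new ServerSocket(port)
    new Thread(new Runnable {
      def run(): Unit = while (true) {
        val socket = server.accept() // blocks until a client (the dashboard side) connects
        clients.synchronized { clients += new PrintWriter(socket.getOutputStream, true) }
      }
    }).start()
  }

  // Call this from foreachRDD with the formatted result of each batch
  def push(json: String): Unit = clients.synchronized { clients.foreach(_.println(json)) }
}

// In the driver: JsonPushServer.start(9999)
// In each batch:  JsonPushServer.push("""{"id":1,"avgTemp":12.3}""")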

Regards,

Sivakumaran S 




> On 25-Aug-2016, at 8:09 PM, kant kodali <kanth...@gmail.com> wrote:
> 
> Your assumption is right (thats what I intend to do). My driver code will be 
> in Java. The link sent by Kevin is a API reference to websocket. 
> I understand how websockets works in general but my question was more geared 
> towards seeing the end to end path on how front end dashboard 
> gets updated in realtime. when we collect the data back to the driver program 
> and finished writing data to websocket client the websocket connection
>  terminate right so 
> 
> 1) is Spark driver program something that needs to run for ever like a 
> typical server? if not,
> 2) then do we need to open a web socket connection each time when the task 
> terminates?
> 
> 
> 
> 
> 
> On Thu, Aug 25, 2016 6:06 AM, Sivakumaran S siva.kuma...@me.com 
> <mailto:siva.kuma...@me.com> wrote:
> I am assuming that you are doing some calculations over a time window. At the 
> end of the calculations (using RDDs or SQL), once you have collected the data 
> back to the driver program, you format the data in the way your client 
> (dashboard) requires it and write it to the websocket. 
> 
> Is your driver code in Python? The link Kevin has sent should start you off.
> 
> Regards,
> 
> Sivakumaran 
>> On 25-Aug-2016, at 11:53 AM, kant kodali <kanth...@gmail.com 
>> <mailto:kanth...@gmail.com>> wrote:
>> 
>> yes for now it will be Spark Streaming Job but later it may change.
>> 
>> 
>> 
>> 
>> 
>> On Thu, Aug 25, 2016 2:37 AM, Sivakumaran S siva.kuma...@me.com 
>> <mailto:siva.kuma...@me.com> wrote:
>> Is this a Spark Streaming job?
>> 
>> Regards,
>> 
>> Sivakumaran S
>> 
>> 
>>> @Sivakumaran when you say create a web socket object in your spark code I 
>>> assume you meant a spark "task" opening websocket 
>>> connection from one of the worker machines to some node.js server in that 
>>> case the websocket connection terminates after the spark 
>>> task is completed right ? and when new data comes in a new task gets 
>>> created and opens a new websocket connection again…is that how it should be
>> 
>>> On 25-Aug-2016, at 7:08 AM, kant kodali <kanth...@gmail.com 
>>> <mailto:kanth...@gmail.com>> wrote:
>>> 
>>> @Sivakumaran when you say create a web socket object in your spark code I 
>>> assume you meant a spark "task" opening websocket connection from one of 
>>> the worker machines to some node.js server in that case the websocket 
>>> connection terminates after the spark task is completed right ? and when 
>>> new data comes in a new task gets created and opens a new websocket 
>>> connection again…is that how it should be?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com 
>>> <mailto

Re: quick question

2016-08-25 Thread Sivakumaran S
I am assuming that you are doing some calculations over a time window. At the 
end of the calculations (using RDDs or SQL), once you have collected the data 
back to the driver program, you format the data in the way your client 
(dashboard) requires it and write it to the websocket. 

Is your driver code in Python? The link Kevin has sent should start you off.
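
If it turns out to be Scala, a minimal sketch of that driver-side step could look like this; it is untested, the field names, window length and push function are placeholders, and it assumes a DStream[String] of JSON called lines:

import org.apache.spark.streaming.Seconds
import org.apache.spark.sql.functions.avg

lines.window(Seconds(300), Seconds(60)).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
    val df = sqlContext.read.json(rdd)
    // Aggregate per id, then bring the (small) summary back to the driver
    val summary = df.groupBy("id").agg(avg("temp").as("avgTemp")).toJSON.collect()
    // Format as a single JSON array and hand it to whatever writes to the websocket
    val payload = summary.mkString("[", ",", "]")
    // pushToDashboard(payload)  // hypothetical function that writes to the websocket clients
  }
}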

Regards,

Sivakumaran 
> On 25-Aug-2016, at 11:53 AM, kant kodali <kanth...@gmail.com> wrote:
> 
> yes for now it will be Spark Streaming Job but later it may change.
> 
> 
> 
> 
> 
> On Thu, Aug 25, 2016 2:37 AM, Sivakumaran S siva.kuma...@me.com 
> <mailto:siva.kuma...@me.com> wrote:
> Is this a Spark Streaming job?
> 
> Regards,
> 
> Sivakumaran S
> 
> 
>> @Sivakumaran when you say create a web socket object in your spark code I 
>> assume you meant a spark "task" opening websocket 
>> connection from one of the worker machines to some node.js server in that 
>> case the websocket connection terminates after the spark 
>> task is completed right ? and when new data comes in a new task gets created 
>> and opens a new websocket connection again…is that how it should be
> 
>> On 25-Aug-2016, at 7:08 AM, kant kodali <kanth...@gmail.com 
>> <mailto:kanth...@gmail.com>> wrote:
>> 
>> @Sivakumaran when you say create a web socket object in your spark code I 
>> assume you meant a spark "task" opening websocket connection from one of the 
>> worker machines to some node.js server in that case the websocket connection 
>> terminates after the spark task is completed right ? and when new data comes 
>> in a new task gets created and opens a new websocket connection again…is 
>> that how it should be?
>> 
>> 
>> 
>> 
>> 
>> On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com 
>> <mailto:siva.kuma...@me.com> wrote:
>> You create a websocket object in your spark code and write your data to the 
>> socket. You create a websocket object in your dashboard code and receive the 
>> data in realtime and update the dashboard. You can use Node.js in your 
>> dashboard (socket.io <http://socket.io/>). I am sure there are other ways 
>> too.
>> 
>> Does that help?
>> 
>> Sivakumaran S
>> 
>>> On 25-Aug-2016, at 6:30 AM, kant kodali <kanth...@gmail.com 
>>> <mailto:kanth...@gmail.com>> wrote:
>>> 
>>> so I would need to open a websocket connection from spark worker machine to 
>>> where?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com 
>>> <mailto:kevin.r.mell...@gmail.com> wrote:
>>> In the diagram you referenced, a real-time dashboard can be created using 
>>> WebSockets. This technology essentially allows your web page to keep an 
>>> active line of communication between the client and server, in which case 
>>> you can detect and display new information without requiring any user input 
>>> of page refreshes. The link below contains additional information on this 
>>> concept, as well as links to several different implementations (based on 
>>> your programming language preferences).
>>> 
>>> https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API 
>>> <https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API>
>>> 
>>> Hope this helps!
>>> - Kevin
>>> 
>>> On Wed, Aug 24, 2016 at 3:52 PM, kant kodali <kanth...@gmail.com 
>>> <mailto:kanth...@gmail.com>> wrote:
>>> 
>>> -- Forwarded message --
>>> From: kant kodali <kanth...@gmail.com <mailto:kanth...@gmail.com>>
>>> Date: Wed, Aug 24, 2016 at 1:49 PM
>>> Subject: quick question
>>> To: d...@spark.apache.org <mailto:d...@spark.apache.org>, 
>>> us...@spark.apache.org <mailto:us...@spark.apache.org>
>>> 
>>> 
>>> 
>>> 
>>> In this picture what does "Dashboards" really mean? is there a open source 
>>> project which can allow me to push the results back to Dashboards such that 
>>> Dashboards are always in sync with real time updates? (a push based 
>>> solution is better than poll but i am open to whatever is possible given 
>>> the above picture)
>>> 
>> 
> 
> 



Re: quick question

2016-08-25 Thread Sivakumaran S
Is this a Spark Streaming job?

Regards,

Sivakumaran S


> @Sivakumaran when you say create a web socket object in your spark code I 
> assume you meant a spark "task" opening websocket 
> connection from one of the worker machines to some node.js server in that 
> case the websocket connection terminates after the spark 
> task is completed right ? and when new data comes in a new task gets created 
> and opens a new websocket connection again…is that how it should be

> On 25-Aug-2016, at 7:08 AM, kant kodali <kanth...@gmail.com> wrote:
> 
> @Sivakumaran when you say create a web socket object in your spark code I 
> assume you meant a spark "task" opening websocket connection from one of the 
> worker machines to some node.js server in that case the websocket connection 
> terminates after the spark task is completed right ? and when new data comes 
> in a new task gets created and opens a new websocket connection again…is that 
> how it should be?
> 
> 
> 
> 
> 
> On Wed, Aug 24, 2016 10:38 PM, Sivakumaran S siva.kuma...@me.com 
> <mailto:siva.kuma...@me.com> wrote:
> You create a websocket object in your spark code and write your data to the 
> socket. You create a websocket object in your dashboard code and receive the 
> data in realtime and update the dashboard. You can use Node.js in your 
> dashboard (socket.io <http://socket.io/>). I am sure there are other ways too.
> 
> Does that help?
> 
> Sivakumaran S
> 
>> On 25-Aug-2016, at 6:30 AM, kant kodali <kanth...@gmail.com 
>> <mailto:kanth...@gmail.com>> wrote:
>> 
>> so I would need to open a websocket connection from spark worker machine to 
>> where?
>> 
>> 
>> 
>> 
>> 
>> On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com 
>> <mailto:kevin.r.mell...@gmail.com> wrote:
>> In the diagram you referenced, a real-time dashboard can be created using 
>> WebSockets. This technology essentially allows your web page to keep an 
>> active line of communication between the client and server, in which case 
>> you can detect and display new information without requiring any user input 
>> of page refreshes. The link below contains additional information on this 
>> concept, as well as links to several different implementations (based on 
>> your programming language preferences).
>> 
>> https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API 
>> <https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API>
>> 
>> Hope this helps!
>> - Kevin
>> 
>> On Wed, Aug 24, 2016 at 3:52 PM, kant kodali <kanth...@gmail.com 
>> <mailto:kanth...@gmail.com>> wrote:
>> 
>> -- Forwarded message --
>> From: kant kodali <kanth...@gmail.com <mailto:kanth...@gmail.com>>
>> Date: Wed, Aug 24, 2016 at 1:49 PM
>> Subject: quick question
>> To: d...@spark.apache.org <mailto:d...@spark.apache.org>, 
>> us...@spark.apache.org <mailto:us...@spark.apache.org>
>> 
>> 
>> 
>> 
>> In this picture what does "Dashboards" really mean? is there a open source 
>> project which can allow me to push the results back to Dashboards such that 
>> Dashboards are always in sync with real time updates? (a push based solution 
>> is better than poll but i am open to whatever is possible given the above 
>> picture)
>> 
> 



Re: quick question

2016-08-24 Thread Sivakumaran S
You create a websocket object in your spark code and write your data to the 
socket. You create a websocket object in your dashboard code and receive the 
data in realtime and update the dashboard. You can use Node.js in your 
dashboard (socket.io). I am sure there are other ways too.

Does that help?

Sivakumaran S

> On 25-Aug-2016, at 6:30 AM, kant kodali <kanth...@gmail.com> wrote:
> 
> so I would need to open a websocket connection from spark worker machine to 
> where?
> 
> 
> 
> 
> 
> On Wed, Aug 24, 2016 8:51 PM, Kevin Mellott kevin.r.mell...@gmail.com 
> <mailto:kevin.r.mell...@gmail.com> wrote:
> In the diagram you referenced, a real-time dashboard can be created using 
> WebSockets. This technology essentially allows your web page to keep an 
> active line of communication between the client and server, in which case you 
> can detect and display new information without requiring any user input of 
> page refreshes. The link below contains additional information on this 
> concept, as well as links to several different implementations (based on your 
> programming language preferences).
> 
> https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API 
> <https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API>
> 
> Hope this helps!
> - Kevin
> 
> On Wed, Aug 24, 2016 at 3:52 PM, kant kodali <kanth...@gmail.com 
> <mailto:kanth...@gmail.com>> wrote:
> 
> -- Forwarded message --
> From: kant kodali <kanth...@gmail.com <mailto:kanth...@gmail.com>>
> Date: Wed, Aug 24, 2016 at 1:49 PM
> Subject: quick question
> To: d...@spark.apache.org <mailto:d...@spark.apache.org>, 
> us...@spark.apache.org <mailto:us...@spark.apache.org>
> 
> 
> 
> 
> In this picture what does "Dashboards" really mean? is there a open source 
> project which can allow me to push the results back to Dashboards such that 
> Dashboards are always in sync with real time updates? (a push based solution 
> is better than poll but i am open to whatever is possible given the above 
> picture)
> 



Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Sivakumaran S
I am testing with one partition now. I am using Kafka 0.9 and Spark 1.6.1 
(Scala 2.11). Just start with one topic first and then add more. I am not 
partitioning the topic.

HTH, 

Regards,

Sivakumaran

> On 10-Aug-2016, at 5:56 AM, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> 
> wrote:
> 
> Hi Siva,
> 
> Does topic  has partitions? which version of Spark you are using?
> 
> On Wed, Aug 10, 2016 at 2:38 AM, Sivakumaran S <siva.kuma...@me.com 
> <mailto:siva.kuma...@me.com>> wrote:
> Hi,
> 
> Here is a working example I did.
> 
> HTH
> 
> Regards,
> 
> Sivakumaran S
> 
> val topics = "test"
> val brokers = "localhost:9092"
> val topicsSet = topics.split(",").toSet
> val sparkConf = new 
> SparkConf().setAppName("KafkaWeatherCalc").setMaster("local") 
> //spark://localhost:7077 <>
> val sc = new SparkContext(sparkConf)
> val ssc = new StreamingContext(sc, Seconds(60))
> val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
> val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicsSet)
> messages.foreachRDD(rdd => {
>   if (rdd.isEmpty()) {
> println("Failed to get data from Kafka. Please check that the Kafka 
> producer is streaming data.")
> System.exit(-1)
>   }
>   val sqlContext = 
> org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
>   val weatherDF = sqlContext.read.json(rdd.map(_._2)).toDF()
>   //Process your DF as required here on
> }
> 
> 
> 
>> On 09-Aug-2016, at 9:47 PM, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com 
>> <mailto:diwakar.dhanusk...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> I am reading json messages from kafka . Topics has 2 partitions. When 
>> running streaming job using spark-submit, I could see that  val dataFrame = 
>> sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing 
>> something wrong here. Below is code .This environment is cloudera sandbox 
>> env. Same issue in hadoop production cluster mode except that it is 
>> restricted thats why tried to reproduce issue in Cloudera sandbox. Kafka 
>> 0.10 and  Spark 1.4.
>> 
>> val kafkaParams = Map[String, String]("bootstrap.servers" -> "localhost:9093,localhost:9092",
>>   "group.id" -> "xyz", "auto.offset.reset" -> "smallest")
>> val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
>> val ssc = new StreamingContext(conf, Seconds(1))
>> 
>> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
>> 
>> val topics = Set("gpp.minf")
>> val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
>> 
>> kafkaStream.foreachRDD( rdd => {
>>   if (rdd.count > 0) {
>>     val dataFrame = sqlContext.read.json(rdd.map(_._2))
>>     dataFrame.printSchema()
>>     //dataFrame.foreach(println)
>>   }
>> })
> 
> 



Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Sivakumaran S
Hi,

Here is a working example I did.

HTH

Regards,

Sivakumaran S

val topics = "test"
val brokers = "localhost:9092"
val topicsSet = topics.split(",").toSet
val sparkConf = new 
SparkConf().setAppName("KafkaWeatherCalc").setMaster("local") 
//spark://localhost:7077
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(60))
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, topicsSet)
messages.foreachRDD(rdd => {
  if (rdd.isEmpty()) {
println("Failed to get data from Kafka. Please check that the Kafka 
producer is streaming data.")
System.exit(-1)
  }
  val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
  val weatherDF = sqlContext.read.json(rdd.map(_._2)).toDF()
  //Process your DF as required here on
}



> On 09-Aug-2016, at 9:47 PM, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> 
> wrote:
> 
> Hi,
> 
> I am reading json messages from kafka . Topics has 2 partitions. When running 
> streaming job using spark-submit, I could see that  val dataFrame = 
> sqlContext.read.json(rdd.map(_._2)) executes indefinitely. Am I doing 
> something wrong here. Below is code .This environment is cloudera sandbox 
> env. Same issue in hadoop production cluster mode except that it is 
> restricted thats why tried to reproduce issue in Cloudera sandbox. Kafka 0.10 
> and  Spark 1.4.
> 
> val kafkaParams = Map[String, String]("bootstrap.servers" -> "localhost:9093,localhost:9092",
>   "group.id" -> "xyz", "auto.offset.reset" -> "smallest")
> val conf = new SparkConf().setMaster("local[3]").setAppName("topic")
> val ssc = new StreamingContext(conf, Seconds(1))
> 
> val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
> 
> val topics = Set("gpp.minf")
> val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
> 
> kafkaStream.foreachRDD( rdd => {
>   if (rdd.count > 0) {
>     val dataFrame = sqlContext.read.json(rdd.map(_._2))
>     dataFrame.printSchema()
>     //dataFrame.foreach(println)
>   }
> })



Re: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Sivakumaran S
Does it have anything to do with the fact that the mail address is displayed as "user @spark.apache.org"? There is a space before the '@'.
This is as received in my mail client.

Sivakumaran


> On 08-Aug-2016, at 7:42 PM, Chris Mattmann  wrote:
> 
> Weird!
> 
> 
> 
> 
> 
> On 8/8/16, 11:10 AM, "Sean Owen"  wrote:
> 
>> I also don't know what's going on with the "This post has NOT been
>> accepted by the mailing list yet" message, because actually the
>> messages always do post. In fact this has been sent to the list 4
>> times:
>> 
>> https://www.mail-archive.com/search?l=user%40spark.apache.org=dueckm=0=0
>> 
>> On Mon, Aug 8, 2016 at 3:03 PM, Chris Mattmann  wrote:
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 8/8/16, 2:03 AM, "matthias.du...@fiduciagad.de" 
>>>  wrote:
>>> 
 Hello,
 
 I write to you because I am not really sure whether I did everything right 
 when registering and subscribing to the spark user list.
 
 I posted the appended question to Spark User list after subscribing and 
 receiving the "WELCOME to user@spark.apache.org" mail from 
 "user-h...@spark.apache.org".
 But this post is still in state "This post has NOT been accepted by the 
 mailing list yet.".
 
 Is this because I forgot something to do or did something wrong with my 
 user account (dueckm)? Or is it because no member of the Spark User List 
 reacted to that post yet?
 
 Thanks a lot for yout help.
 
 Matthias
 
 Fiducia & GAD IT AG | www.fiduciagad.de
 AG Frankfurt a. M. HRB 102381 | Sitz der Gesellschaft: Hahnstr. 48, 60528 
 Frankfurt a. M. | USt-IdNr. DE 143582320
 Vorstand: Klaus-Peter Bruns (Vorsitzender), Claus-Dieter Toben (stv. 
 Vorsitzender),
 
 Jens-Olaf Bartels, Martin Beyer, Jörg Dreinhöfer, Wolfgang Eckert, Carsten 
 Pfläging, Jörg Staff
 Vorsitzender des Aufsichtsrats: Jürgen Brinkmann
 
 - Weitergeleitet von Matthias Dück/M/FAG/FIDUCIA/DE am 08.08.2016 
 10:57 -
 
 Von: dueckm 
 An: user@spark.apache.org
 Datum: 04.08.2016 13:27
 Betreff: Are join/groupBy operations with wide Java Beans using Dataset 
 API much slower than using RDD API?
 
 
 
 
 
 Hello,
 
 I built a prototype that uses join and groupBy operations via Spark RDD 
 API.
 Recently I migrated it to the Dataset API. Now it runs much slower than 
 with
 the original RDD implementation.
 Did I do something wrong here? Or is this a price I have to pay for the 
 more
 convienient API?
 Is there a known solution to deal with this effect (eg configuration via
 "spark.sql.shuffle.partitions" - but now could I determine the correct
 value)?
 In my prototype I use Java Beans with a lot of attributes. Does this slow
 down Spark-operations with Datasets?
 
 Here I have an simple example, that shows the difference:
 JoinGroupByTest.zip
 
 - I build 2 RDDs and join and group them. Afterwards I count and display 
 the
 joined RDDs.  (Method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD() )
 - When I do the same actions with Datasets it takes approximately 40 times
 as long (Methodd e.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).
 
 Thank you very much for your help.
 Matthias
 
 PS1: excuse me for sending this post more than once, but I am new to this
 mailing list and probably did something wrong when registering/subscribing,
 so my previous postings have not been accepted ...
 
 PS2: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
 RDD implementation, jobs 2/3 to Dataset):
 
 
 
 
 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27473.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org
 
>>> 
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
> 
> 
> -
> To unsubscribe e-mail: 

Re: Machine learning question (suing spark)- removing redundant factors while doing clustering

2016-08-08 Thread Sivakumaran S
Not an expert here, but the first step would be to devote some time to identifying which of these 112 factors are actually causative. Some domain knowledge of the data may be required. Then you can start off with PCA.
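
For instance, a minimal sketch with the spark.ml PCA transformer; the DataFrame df, its column names and k = 20 are assumptions for illustration, not a recommendation:

import org.apache.spark.ml.feature.{PCA, VectorAssembler}

// df is assumed to be a DataFrame whose 112 numeric columns are the factors
val assembled = new VectorAssembler()
  .setInputCols(df.columns)
  .setOutputCol("features")
  .transform(df)

// Project the 112-dimensional feature vector down to 20 principal components
val pcaModel = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(20)
  .fit(assembled)

val reduced = pcaModel.transform(assembled).select("pcaFeatures")

// Note: the components are linear combinations of the original factors, not a subset of them.
// pcaModel.pc (the loading matrix) shows how strongly each original factor contributes.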

HTH,

Regards,

Sivakumaran S
> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane@gmail.com> wrote:
> 
> Great question Rohit.  I am in my early days of ML as well and it would be 
> great if we get some idea on this from other experts on this group. 
> 
> I know we can reduce dimensions by using PCA, but i think that does not allow 
> us to understand which factors from the original are we using in the end. 
> 
> - Tony L.
> 
> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha <rohitchaddha1...@gmail.com 
> <mailto:rohitchaddha1...@gmail.com>> wrote:
> 
> I have a data-set where each data-point has 112 factors. 
> 
> I want to remove the factors which are not relevant, and say reduce to 20 
> factors out of these 112 and then do clustering of data-points using these 20 
> factors.
> 
> How do I do these and how do I figure out which of the 20 factors are useful 
> for analysis. 
> 
> I see SVD and PCA implementations, but I am not sure if these give which 
> elements are removed and which are remaining. 
> 
> Can someone please help me understand what to do here 
> 
> thanks,
> -Rohit 
> 
> 



Re: Help testing the Spark Extensions for the Apache Bahir 2.0.0 release

2016-08-07 Thread Sivakumaran S
Hi,

How can I help? 

regards,

Sivakumaran S
> On 06-Aug-2016, at 6:18 PM, Luciano Resende <luckbr1...@gmail.com> wrote:
> 
> Apache Bahir is voting it's 2.0.0 release based on Apache Spark 2.0.0. 
> 
> https://www.mail-archive.com/dev@bahir.apache.org/msg00312.html 
> <https://www.mail-archive.com/dev@bahir.apache.org/msg00312.html>
> 
> We appreciate any help reviewing/testing the release, which contains the 
> following Apache Spark extensions:
> 
> Akka DStream connector
> MQTT DStream connector
> Twitter DStream connector
> ZeroMQ DStream connector
> 
> MQTT Structured Streaming
> 
> Thanks in advance
> 
> -- 
> Luciano Resende
> http://twitter.com/lresende1975 <http://twitter.com/lresende1975>
> http://lresende.blogspot.com/ <http://lresende.blogspot.com/>


Re: Visualization of data analysed using spark

2016-07-31 Thread Sivakumaran S
Hi Tony,

If your requirement is browser-based plotting (real time or otherwise), you can load the data and display it in a browser using D3. Since D3 has very low-level plotting routines, you can look at C3 or Rickshaw (https://github.com/shutterstock/rickshaw), both of which provide a higher-level abstraction over D3 for plotting.
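
If you go the export route suggested further down in this thread, the Spark side can be as simple as dumping a (sampled) result to JSON that D3/C3 can fetch; a rough sketch, with the DataFrame, paths and sampling fraction made up:

// df is the DataFrame holding the analysed results; keep 1% without replacement
val sampled = df.sample(false, 0.01)

// Line-delimited JSON that a small web server can serve to the browser
sampled.coalesce(1).write.json("/tmp/dashboard/analysis_json")

// Or CSV via the spark-csv package, if Excel/RStudio suits you better
// sampled.write.format("com.databricks.spark.csv").option("header", "true").save("/tmp/dashboard/analysis_csv")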

HTH,

Regards,

Sivakumaran 

> On 31-Jul-2016, at 7:35 AM, Gourav Sengupta  wrote:
> 
> If you are using  Python, please try using Bokeh and its related stack. Most 
> of the people in this forum including guys at data bricks have not tried that 
> stack from Anaconda, its worth a try when you are visualizing data in big 
> data stack.
> 
> 
> Regards,
> Gourav
> 
> On Sat, Jul 30, 2016 at 10:25 PM, Rerngvit Yanggratoke 
> > 
> wrote:
> Since you already have an existing application (not starting from scratch), 
> the simplest way to visualize would be to export the data to a file (e.g., a 
> CSV file) and visualise using other tools, e.g., Excel, RStudio, Matlab, 
> Jupiter, Zeppelin, Tableu, Elastic Stack.
> The choice depends on your background and preferences of the technology. Note 
> that if you are dealing with a large dataset, you generally first should 
> apply sampling to the data. A good mechanism to sampling depends on your 
> application domain.
> 
> - Rerngvit
> > On 30 Jul 2016, at 21:45, Tony Lane  > > wrote:
> >
> > I am developing my analysis application by using spark (in eclipse as the 
> > IDE)
> >
> > what is a good way to visualize the data, taking into consideration i have 
> > multiple files which make up my spark application.
> >
> > I have seen some notebook demo's but not sure how to use my application 
> > with such notebooks.
> >
> > thoughts/ suggestions/ experiences -- please share
> >
> > -Tony
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 



Is spark-submit a single point of failure?

2016-07-22 Thread Sivakumaran S
Hello,

I have a spark streaming process on a cluster ingesting a realtime data stream 
from Kafka. The aggregated, processed output is written to Cassandra and also 
used for dashboard display.

My question is: if the node running the driver program fails, I am guessing that the entire process fails and has to be restarted. Is there any way to obviate this? Is my understanding correct that spark-submit, in its current form, is a Single Point of Vulnerability, much akin to the NameNode in HDFS?
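
One partial mitigation I am aware of (untested on my side) is to submit the driver in cluster mode with --supervise (standalone cluster mode) and make the streaming context recoverable from a checkpoint, roughly like the sketch below with a placeholder checkpoint path; I would like to confirm whether this is the recommended approach.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  val checkpointDir = "hdfs:///checkpoints/my-streaming-app" // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableApp")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir)
    // build the Kafka DStream, aggregation and Cassandra writes here
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, the context (and its state) is rebuilt from the checkpoint
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

// Submitted with something like:
//   spark-submit --deploy-mode cluster --supervise --class RecoverableApp my-app.jar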

regards

Sivakumaran S
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question on Spark shell

2016-07-11 Thread Sivakumaran S
That was my bad with the title. 

I am getting that output when I run my application, both from the IDE as well 
as in the console. 

I want the server logs itself displayed in the terminal from where I start the 
server. Right now, running the command ‘start-master.sh’ returns the prompt. I 
want the Spark logs as events occur (INFO, WARN, ERROR); like enabling debug 
mode wherein server output is printed to screen. 

I have to edit the log4j properties file, that much I have learnt so far. 
Should be able to hack it now. Thanks for the help. Guess just helping to frame 
the question was enough to find the answer :)
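
For reference, the part of conf/log4j.properties (copied from log4j.properties.template) that controls what gets printed to the console looks roughly like this:

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n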




> On 11-Jul-2016, at 6:57 PM, Anthony May <anthony...@gmail.com> wrote:
> 
> I see. The title of your original email was "Spark Shell" which is a Spark 
> REPL environment based on the Scala Shell, hence why I misunderstood you.
> 
> You should have the same output starting the application on the console. You 
> are not seeing any output?
> 
> On Mon, 11 Jul 2016 at 11:55 Sivakumaran S <siva.kuma...@me.com 
> <mailto:siva.kuma...@me.com>> wrote:
> I am running a spark streaming application using Scala in the IntelliJ IDE. I 
> can see the Spark output in the IDE itself (aggregation and stuff). I want 
> the spark server logging (INFO, WARN, etc) to be displayed in screen when I 
> start the master in the console. For example, when I start a kafka cluster, 
> the prompt is not returned and the debug log is printed to the terminal. I 
> want that set up with my spark server. 
> 
> I hope that explains my retrograde requirement :)
> 
> 
> 
>> On 11-Jul-2016, at 6:49 PM, Anthony May <anthony...@gmail.com 
>> <mailto:anthony...@gmail.com>> wrote:
>> 
>> Starting the Spark Shell gives you a Spark Context to play with straight 
>> away. The output is printed to the console.
>> 
>> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S <siva.kuma...@me.com 
>> <mailto:siva.kuma...@me.com>> wrote:
>> Hello,
>> 
>> Is there a way to start the spark server with the log output piped to 
>> screen? I am currently running spark in the standalone mode on a single 
>> machine.
>> 
>> Regards,
>> 
>> Sivakumaran
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> 
> 



Re: Question on Spark shell

2016-07-11 Thread Sivakumaran S
I am running a spark streaming application using Scala in the IntelliJ IDE. I 
can see the Spark output in the IDE itself (aggregation and stuff). I want the 
spark server logging (INFO, WARN, etc) to be displayed in screen when I start 
the master in the console. For example, when I start a kafka cluster, the 
prompt is not returned and the debug log is printed to the terminal. I want 
that set up with my spark server. 

I hope that explains my retrograde requirement :)



> On 11-Jul-2016, at 6:49 PM, Anthony May <anthony...@gmail.com> wrote:
> 
> Starting the Spark Shell gives you a Spark Context to play with straight 
> away. The output is printed to the console.
> 
> On Mon, 11 Jul 2016 at 11:47 Sivakumaran S <siva.kuma...@me.com 
> <mailto:siva.kuma...@me.com>> wrote:
> Hello,
> 
> Is there a way to start the spark server with the log output piped to screen? 
> I am currently running spark in the standalone mode on a single machine.
> 
> Regards,
> 
> Sivakumaran
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 



Question on Spark shell

2016-07-11 Thread Sivakumaran S
Hello,

Is there a way to start the spark server with the log output piped to screen? I 
am currently running spark in the standalone mode on a single machine. 

Regards,

Sivakumaran


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: problem extracting map from json

2016-07-07 Thread Sivakumaran S
Hi Michal,

Will an example help?



import scala.util.parsing.json._ // Requires the scala-parser-combinators library because it is no longer part of core Scala

val wbJSON = JSON.parseFull(weatherBox) // weatherBox is a JSON string; wbJSON is now an Option holding the parsed result
// Depending on the structure, now traverse through the object
val listWeatherReports = wbJSON.get.asInstanceOf[Map[String, Any]]
val cod: String = listWeatherReports.get("cod").get.asInstanceOf[String]



Regards,

Sivakumaran

> On 07-Jul-2016, at 1:18 PM, Michal Vince  wrote:
> 
> Hi guys
> 
> I`m trying to extract Map[String, Any] from json string, this works well in 
> any scala repl I tried, both scala 2.11 and 2.10 and using both json4s and 
> liftweb-json libraries, but if I try to do the same thing in spark-shell I`m 
> always getting No information known about type... exception
> 
> I`ved tried different versions of json4s and liftweb-json but with the same 
> result. I was wondering if anybody have idea what I might be doing wrong.
> 
> 
> I`m using spark 1.6.1 precompiled with mapr hadoop distro in scala 2.10.5
> 
> 
> 
> scala> import org.json4s._
> import org.json4s._
> 
> scala> import org.json4s.jackson.JsonMethods._
> import org.json4s.jackson.JsonMethods._
> 
> scala>
> 
> scala> implicit val formats = org.json4s.DefaultFormats
> formats: org.json4s.DefaultFormats.type = org.json4s.DefaultFormats$@46270641
> 
> scala> val json = parse(""" { "name": "joe", "children": [ { "name": "Mary", 
> "age": 5 }, { "name": "Mazy", "age": 3 } ] } """)
> json: org.json4s.JValue = JObject(List((name,JString(joe)), 
> (children,JArray(List(JObject(List((name,JString(Mary)), (age,JInt(5, 
> JObject(List((name,JString(Mazy)), (age,JInt(3)
> 
> 
> scala> json.extract[Map[String, Any]]
> org.json4s.package$MappingException: No information known about type
> at 
> org.json4s.Extraction$ClassInstanceBuilder.org$json4s$Extraction$ClassInstanceBuilder$$instantiate(Extraction.scala:465)
> at 
> org.json4s.Extraction$ClassInstanceBuilder$$anonfun$result$6.apply(Extraction.scala:491)
> at 
> org.json4s.Extraction$ClassInstanceBuilder$$anonfun$result$6.apply(Extraction.scala:488)
> at 
> org.json4s.Extraction$.org$json4s$Extraction$$customOrElse(Extraction.scala:500)
> at 
> org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:488)
> at org.json4s.Extraction$.extract(Extraction.scala:332)
> at 
> org.json4s.Extraction$$anonfun$extract$5.apply(Extraction.scala:316)
> at 
> org.json4s.Extraction$$anonfun$extract$5.apply(Extraction.scala:316)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.json4s.Extraction$.extract(Extraction.scala:316)
> at org.json4s.Extraction$.extract(Extraction.scala:42)
> at 
> org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:41)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:53)
> at $iwC$$iwC$$iwC$$iwC.(:55)
> at $iwC$$iwC$$iwC.(:57)
> at $iwC$$iwC.(:59)
> at $iwC.(:61)
> at (:63)
> at .(:67)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at 

Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Sivakumaran S
Arnauld,

You could aggregate the first table, then merge it with the second table (assuming they are similarly structured), and then carry out the second aggregation. Unless the data is very large, I don't see why you should persist it to disk. IMO, nested aggregation is more elegant and readable than a complex single stage.
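
A minimal batch-style sketch of that two-step shape, with made-up DataFrames (readings, devices) and column names:

import org.apache.spark.sql.functions.avg

// Stage 1: aggregate the first table
val perDevice = readings.groupBy("deviceId").agg(avg("temp").as("avgTemp"))

// Merge with the second, similarly structured table
val merged = perDevice.join(devices, "deviceId")

// Stage 2: aggregate again at a coarser level
val perRegion = merged.groupBy("region").agg(avg("avgTemp").as("regionAvgTemp"))
perRegion.show()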

Regards,

Sivakumaran


> On 07-Jul-2016, at 1:06 PM, Arnaud Bailly <arnaud.oq...@gmail.com> wrote:
> 
> It's aggregation at multiple levels in a query: first do some aggregation on 
> one tavle, then join with another table and do a second aggregation. I could 
> probably rewrite the query in such a way that it does aggregation in one pass 
> but that would obfuscate the purpose of the various stages.
> 
> Le 7 juil. 2016 12:55, "Sivakumaran S" <siva.kuma...@me.com 
> <mailto:siva.kuma...@me.com>> a écrit :
> Hi Arnauld,
> 
> Sorry for the doubt, but what exactly is multiple aggregation? What is the 
> use case?
> 
> Regards,
> 
> Sivakumaran
> 
> 
>> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <arnaud.oq...@gmail.com 
>> <mailto:arnaud.oq...@gmail.com>> wrote:
>> 
>> Hello,
>> 
>> I understand multiple aggregations over streaming dataframes is not 
>> currently supported in Spark 2.0. Is there a workaround? Out of the top of 
>> my head I could think of having a two stage approach: 
>>  - first query writes output to disk/memory using "complete" mode
>>  - second query reads from this output
>> 
>> Does this makes sense?
>> 
>> Furthermore, I would like to understand what are the technical hurdles that 
>> are preventing Spark SQL from implementing multiple aggregation right now? 
>> 
>> Thanks,
>> -- 
>> Arnaud Bailly
>> 
>> twitter: abailly
>> skype: arnaud-bailly
>> linkedin: http://fr.linkedin.com/in/arnaudbailly/ 
>> <http://fr.linkedin.com/in/arnaudbailly/>



Re: Multiple aggregations over streaming dataframes

2016-07-07 Thread Sivakumaran S
Hi Arnauld,

Sorry for the doubt, but what exactly is multiple aggregation? What is the use 
case?

Regards,

Sivakumaran


> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly  wrote:
> 
> Hello,
> 
> I understand multiple aggregations over streaming dataframes is not currently 
> supported in Spark 2.0. Is there a workaround? Out of the top of my head I 
> could think of having a two stage approach: 
>  - first query writes output to disk/memory using "complete" mode
>  - second query reads from this output
> 
> Does this makes sense?
> 
> Furthermore, I would like to understand what are the technical hurdles that 
> are preventing Spark SQL from implementing multiple aggregation right now? 
> 
> Thanks,
> -- 
> Arnaud Bailly
> 
> twitter: abailly
> skype: arnaud-bailly
> linkedin: http://fr.linkedin.com/in/arnaudbailly/ 
> 


Re: Python to Scala

2016-06-18 Thread Sivakumaran S
If you can identify a suitable java example in the spark directory, you can use 
that as a template and convert it to scala code using http://javatoscala.com/ 


Siva

> On 18-Jun-2016, at 6:27 AM, Aakash Basu  wrote:
> 
> I don't have a sound knowledge in Python and on the other hand we are working 
> on Spark on Scala, so I don't think it will be allowed to run PySpark along 
> with it, so the requirement is to convert the code to scala and use it. But 
> I'm finding it difficult.
> 
> Did not find a better forum for help than ours. Hence this mail.
> 
> On 18-Jun-2016 10:39 AM, "Stephen Boesch"  > wrote:
> What are you expecting us to do?  Yash provided a reasonable approach - based 
> on the info you had provided in prior emails.  Otherwise you can convert it 
> from python to spark - or find someone else who feels comfortable to do it.  
> That kind of inquiry would likelybe appropriate on a job board.
> 
> 
> 
> 2016-06-17 21:47 GMT-07:00 Aakash Basu  >:
> Hey,
> 
> Our complete project is in Spark on Scala, I code in Scala for Spark, though 
> am new, but I know it and still learning. But I need help in converting this 
> code to Scala. I've nearly no knowledge in Python, hence, requested the 
> experts here.
> 
> Hope you get me now.
> 
> Thanks,
> Aakash.
> 
> On 18-Jun-2016 10:07 AM, "Yash Sharma"  > wrote:
> You could use pyspark to run the python code on spark directly. That will cut 
> the effort of learning scala.
> 
> https://spark.apache.org/docs/0.9.0/python-programming-guide.html 
> 
> - Thanks, via mobile,  excuse brevity.
> 
> On Jun 18, 2016 2:34 PM, "Aakash Basu"  > wrote:
> Hi all,
> 
> I've a python code, which I want to convert to Scala for using it in a Spark 
> program. I'm not so well acquainted with python and learning scala now. Any 
> Python+Scala expert here? Can someone help me out in this please?
> 
> Thanks & Regards,
> Aakash.
> 
> 



Re: choice of RDD function

2016-06-16 Thread Sivakumaran S
Dear Jacek and Cody,


I receive a stream of JSON (exactly the objects shown below) once every 30 seconds from Kafka, as follows (I have changed my data source to include more fields):
{"windspeed":4.23,"pressure":1008.39,"location":"Dundee","latitude":56.5,"longitude":-2.96667,"id":2650752,"humidity":97.0,"temp":12.54,"winddirection":12.0003}
{"windspeed":4.23,"pressure":1008.39,"location":"Saint 
Andrews","latitude":56.338711,"longitude":-2.79902,"id":2638864,"humidity":97.0,"temp":12.54,"winddirection":12.0003}
{"windspeed":5.53,"pressure":1016.25,"location":"Arbroath","latitude":56.563171,"longitude":-2.58736,"id":2657215,"humidity":96.0,"temp":11.59,"winddirection":9.50031}
{"windspeed":4.73,"pressure":994.0,"location":"Aberdeen","latitude":57.143688,"longitude":-2.09814,"id":2657832,"humidity":1.0,"temp":0.0,"winddirection":357.0}
{"windspeed":6.13,"pressure":994.0,"location":"Peterhead","latitude":57.50584,"longitude":-1.79806,"id":2640351,"humidity":1.0,"temp":0.0,"winddirection":8.50031}

In my Spark app, I have set the batch duration to 60 seconds. Now, as per the 1.6.1 documentation, "Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String, or a JSON file." But what both of you pointed out is correct: it consumes the RDD twice, and I do not understand why. Below is a snap of the DAG.

I do not need stateful calculations, and I need to write this result to a database at a later stage. Any input to improve this solution is appreciated.
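
One thing I intend to try (an assumption on my part, not verified yet): supplying the JSON schema explicitly so that read.json does not need a separate pass over the RDD to infer it, e.g.

import org.apache.spark.sql.types._

val weatherSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("location", StringType),
  StructField("latitude", DoubleType),
  StructField("longitude", DoubleType),
  StructField("temp", DoubleType),
  StructField("humidity", DoubleType),
  StructField("pressure", DoubleType),
  StructField("windspeed", DoubleType),
  StructField("winddirection", DoubleType)
))

lines.foreachRDD { rdd =>
  val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
  val df = sqlContext.read.schema(weatherSchema).json(rdd) // no inference pass over the data
  df.registerTempTable("weather")
  sqlContext.sql("SELECT id, AVG(temp), AVG(windspeed) FROM weather GROUP BY id").show()
}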




Regards,

Siva

> On 16-Jun-2016, at 12:48 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
> 
> Hi Jacek and Cody,
> 
> First of all, thanks for helping me out.
> 
> I started with using combineByKey while testing with just one field. Of 
> course it worked fine, but I was worried that the code would become 
> unreadable if there were many fields. Which is why I shifted to sqlContext 
> because the code is comprehensible. Let me work out the stream statistics and 
> update you in a while. 
> 
> 
> 
> Regards,
> 
> Siva
> 
> 
> 
>> On 16-Jun-2016, at 11:29 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>> 
>> Rather
>> 
>> val df = sqlContext.read.json(rdd)
>> 
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>> 
>> 
>> On Wed, Jun 15, 2016 at 11:55 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
>>> Cody,
>>> 
>>> Are you referring to the  val lines = messages.map(_._2)?
>>> 
>>> Regards,
>>> 
>>> Siva
>>> 
>>>> On 15-Jun-2016, at 10:32 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>> 
>>>> Doesn't that result in consuming each RDD twice, in order to infer the
>>>> json schema?
>>>> 
>>>> On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S <siva.kuma...@me.com> 
>>>> wrote:
>>>>> Of course :)
>>>>> 
>>>>> object sparkStreaming {
>>>>> def main(args: Array[String]) {
>>>>>  StreamingExamples.setStreamingLogLevels() //Set reasonable logging
>>>>> levels for streaming if the user has not configured log4j.
>>>>>  val topics = "test"
>>>>>  val brokers = "localhost:9092"
>>>>>  val topicsSet = topics.split(",").toSet
>>>>>  val sparkConf = new
>>>>> SparkConf().setAppName("KafkaDroneCalc").setMaster("local")
>>>>> //spark://localhost:7077
>>>>>  val sc = new SparkContext(sparkConf)
>>>>>  val ssc = new StreamingContext(sc, Seconds(30))
>>>>>  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
>>>>>  val messages = KafkaUtils.createDirectStream[String, String,
>>>>> StringDecoder, StringDecoder] (ssc, kafkaParams, topicsSet)
>>>>>  val lines = messages.map(_._2)
>>>>>  val sqlContext = new org.apach

Re: choice of RDD function

2016-06-16 Thread Sivakumaran S
Hi Jacek and Cody,

First of all, thanks for helping me out.

I started with combineByKey while testing with just one field. Of course it worked fine, but I was worried that the code would become unreadable if there were many fields, which is why I shifted to sqlContext: the code is more comprehensible. Let me work out the stream statistics and update you in a while.



Regards,

Siva



> On 16-Jun-2016, at 11:29 AM, Jacek Laskowski <ja...@japila.pl> wrote:
> 
> Rather
> 
> val df = sqlContext.read.json(rdd)
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Wed, Jun 15, 2016 at 11:55 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
>> Cody,
>> 
>> Are you referring to the  val lines = messages.map(_._2)?
>> 
>> Regards,
>> 
>> Siva
>> 
>>> On 15-Jun-2016, at 10:32 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> 
>>> Doesn't that result in consuming each RDD twice, in order to infer the
>>> json schema?
>>> 
>>> On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S <siva.kuma...@me.com> wrote:
>>>> Of course :)
>>>> 
>>>> object sparkStreaming {
>>>> def main(args: Array[String]) {
>>>>   StreamingExamples.setStreamingLogLevels() //Set reasonable logging
>>>> levels for streaming if the user has not configured log4j.
>>>>   val topics = "test"
>>>>   val brokers = "localhost:9092"
>>>>   val topicsSet = topics.split(",").toSet
>>>>   val sparkConf = new
>>>> SparkConf().setAppName("KafkaDroneCalc").setMaster("local")
>>>> //spark://localhost:7077
>>>>   val sc = new SparkContext(sparkConf)
>>>>   val ssc = new StreamingContext(sc, Seconds(30))
>>>>   val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
>>>>   val messages = KafkaUtils.createDirectStream[String, String,
>>>> StringDecoder, StringDecoder] (ssc, kafkaParams, topicsSet)
>>>>   val lines = messages.map(_._2)
>>>>   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>>>   lines.foreachRDD( rdd => {
>>>> val df = sqlContext.read.json(rdd)
>>>> df.registerTempTable(“drone")
>>>> sqlContext.sql("SELECT id, AVG(temp), AVG(rotor_rpm),
>>>> AVG(winddirection), AVG(windspeed) FROM drone GROUP BY id").show()
>>>>   })
>>>>   ssc.start()
>>>>   ssc.awaitTermination()
>>>> }
>>>> }
>>>> 
>>>> I haven’t checked long running performance though.
>>>> 
>>>> Regards,
>>>> 
>>>> Siva
>>>> 
>>>> On 15-Jun-2016, at 5:02 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Good to hear so! Mind sharing a few snippets of your solution?
>>>> 
>>>> Pozdrawiam,
>>>> Jacek Laskowski
>>>> 
>>>> https://medium.com/@jaceklaskowski/
>>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>>> Follow me at https://twitter.com/jaceklaskowski
>>>> 
>>>> 
>>>> On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
>>>> 
>>>> Thanks Jacek,
>>>> 
>>>> Job completed!! :) Just used data frames and sql query. Very clean and
>>>> functional code.
>>>> 
>>>> Siva
>>>> 
>>>> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>>>> 
>>>> mapWithState
>>>> 
>>>> 
>>>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Cody, 

Are you referring to the  val lines = messages.map(_._2)? 

Regards,

Siva

> On 15-Jun-2016, at 10:32 PM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> Doesn't that result in consuming each RDD twice, in order to infer the
> json schema?
> 
> On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S <siva.kuma...@me.com> wrote:
>> Of course :)
>> 
>> object sparkStreaming {
>>  def main(args: Array[String]) {
>>StreamingExamples.setStreamingLogLevels() //Set reasonable logging
>> levels for streaming if the user has not configured log4j.
>>val topics = "test"
>>val brokers = "localhost:9092"
>>val topicsSet = topics.split(",").toSet
>>val sparkConf = new
>> SparkConf().setAppName("KafkaDroneCalc").setMaster("local")
>> //spark://localhost:7077
>>val sc = new SparkContext(sparkConf)
>>val ssc = new StreamingContext(sc, Seconds(30))
>>val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
>>val messages = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder] (ssc, kafkaParams, topicsSet)
>>val lines = messages.map(_._2)
>>val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>lines.foreachRDD( rdd => {
>>  val df = sqlContext.read.json(rdd)
>>  df.registerTempTable(“drone")
>>  sqlContext.sql("SELECT id, AVG(temp), AVG(rotor_rpm),
>> AVG(winddirection), AVG(windspeed) FROM drone GROUP BY id").show()
>>})
>>ssc.start()
>>ssc.awaitTermination()
>>  }
>> }
>> 
>> I haven’t checked long running performance though.
>> 
>> Regards,
>> 
>> Siva
>> 
>> On 15-Jun-2016, at 5:02 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>> 
>> Hi,
>> 
>> Good to hear so! Mind sharing a few snippets of your solution?
>> 
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>> 
>> 
>> On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
>> 
>> Thanks Jacek,
>> 
>> Job completed!! :) Just used data frames and sql query. Very clean and
>> functional code.
>> 
>> Siva
>> 
>> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>> 
>> mapWithState
>> 
>> 
>> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Of course :)

object sparkStreaming {
  def main(args: Array[String]) {
    StreamingExamples.setStreamingLogLevels() // Set reasonable logging levels for streaming if the user has not configured log4j.
    val topics = "test"
    val brokers = "localhost:9092"
    val topicsSet = topics.split(",").toSet
    val sparkConf = new SparkConf().setAppName("KafkaDroneCalc").setMaster("local") //spark://localhost:7077
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(30))
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
    val lines = messages.map(_._2)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    lines.foreachRDD( rdd => {
      val df = sqlContext.read.json(rdd)
      df.registerTempTable("drone")
      sqlContext.sql("SELECT id, AVG(temp), AVG(rotor_rpm), AVG(winddirection), AVG(windspeed) FROM drone GROUP BY id").show()
    })
    ssc.start()
    ssc.awaitTermination()
  }
}

I haven’t checked long running performance though. 

Regards,

Siva

> On 15-Jun-2016, at 5:02 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> 
> Hi,
> 
> Good to hear so! Mind sharing a few snippets of your solution?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
>> Thanks Jacek,
>> 
>> Job completed!! :) Just used data frames and sql query. Very clean and
>> functional code.
>> 
>> Siva
>> 
>> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>> 
>> mapWithState
>> 
>> 



Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Thanks Jacek,

Job completed!! :) Just used data frames and sql query. Very clean and 
functional code.

Siva

> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski  wrote:
> 
> mapWithState



choice of RDD function

2016-06-14 Thread Sivakumaran S
Dear friends,

I have set up Kafka 0.9.0.0, Spark 1.6.1 and Scala 2.10. My source is sending a 
json string periodically to a topic in kafka. I am able to consume this topic 
using Spark Streaming and print it. The schema of the source json is as 
follows: 

{ "id": 121156, "ht": 42, "rotor_rpm": 180, "temp": 14.2, "time": 146593512 }
{ "id": 121157, "ht": 18, "rotor_rpm": 110, "temp": 12.2, "time": 146593512 }
{ "id": 121156, "ht": 36, "rotor_rpm": 160, "temp": 14.4, "time": 146593513 }
{ "id": 121157, "ht": 19, "rotor_rpm": 120, "temp": 12.0, "time": 146593513 }
and so on.


In Spark streaming, I want to find the average of “ht” (height), “rotor_rpm” 
and “temp” for each “id". I also want to find the max and min of the same 
fields in the time window (300 seconds in this case). 

Q1. Can this be done using plain RDD and streaming functions or does it 
require Dataframes/SQL? There may be more fields added to the json at a later 
stage. There will be a lot of “id”s at a later stage.

Q2. If it can be done using either, which one would render to be more 
efficient and fast?

As of now, the entire set up is in a single laptop. 
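
For illustration, the kind of result I am after could probably be expressed with DataFrames roughly as follows, assuming the JSON for one window has already been read into a DataFrame df (untested sketch):

import org.apache.spark.sql.functions.{avg, max, min}

val stats = df.groupBy("id").agg(
  avg("ht"), max("ht"), min("ht"),
  avg("rotor_rpm"), max("rotor_rpm"), min("rotor_rpm"),
  avg("temp"), max("temp"), min("temp")
)
stats.show()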

Thanks in advance.

Regards,

Siva