Re: Spark as a service

2015-03-25 Thread Irfan Ahmad
You're welcome. How did it go?


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 25, 2015 at 7:53 AM, Ashish Mukherjee 
ashish.mukher...@gmail.com wrote:

 Thank you

 On Tue, Mar 24, 2015 at 8:40 PM, Irfan Ahmad ir...@cloudphysics.com
 wrote:

 Also look at the spark-kernel and spark job server projects.

 Irfan
 On Mar 24, 2015 5:03 AM, Todd Nist tsind...@gmail.com wrote:

 Perhaps this project, https://github.com/calrissian/spark-jetty-server,
 could help with your requirements.

 On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele 
 jeffrey.jed...@gmail.com wrote:

 I don't think there's a general approach to that - the use cases are
 just too different. If you really need it, you'll probably have to
 implement it yourself in the driver of your application.
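
 As a bare-bones illustration of that driver-embedded approach, here is a
 sketch using the JDK's built-in HTTP server; the port, the /sql path, and
 the in-scope sqlContext are assumptions, not a recommended design:

 import java.net.InetSocketAddress
 import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

 // Runs inside the Spark driver: accepts SQL text over HTTP and replies
 // with the collected result. Not production-grade - no auth, no limits.
 val server = HttpServer.create(new InetSocketAddress(8090), 0)
 server.createContext("/sql", new HttpHandler {
   override def handle(exchange: HttpExchange): Unit = {
     val query = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
     val result = sqlContext.sql(query).collect().mkString("\n")
     val bytes = result.getBytes("UTF-8")
     exchange.sendResponseHeaders(200, bytes.length)
     exchange.getResponseBody.write(bytes)
     exchange.close()
   }
 })
 server.start()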

 PS: Make sure to use the "reply to all" button so that the mailing list
 is included in your reply. Otherwise only I will get your mail.

 Regards,
 Jeff

 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee ashish.mukher...@gmail.com:

 Hi Jeffrey,

 Thanks. Yes, this resolves the SQL problem. My bad - I was looking for
 something which would work for Spark Streaming and other Spark jobs too,
 not just SQL.

 Regards,
 Ashish

 On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele 
 jeffrey.jed...@gmail.com wrote:

 Hi Ashish,
 this might be what you're looking for:


 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
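
 Once the Thrift server is running (sbin/start-thriftserver.sh), any Hive
 JDBC client can submit dynamically built SQL. A minimal sketch, assuming
 the default endpoint localhost:10000 and a hypothetical table my_table:

 import java.sql.DriverManager

 // The Thrift JDBC/ODBC server speaks the HiveServer2 protocol.
 Class.forName("org.apache.hive.jdbc.HiveDriver")
 val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")
 val stmt = conn.createStatement()
 val rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table") // my_table is hypothetical
 while (rs.next()) println(rs.getLong(1))
 rs.close(); stmt.close(); conn.close()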

 Regards,
 Jeff

 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee 
 ashish.mukher...@gmail.com:

 Hello,

 As of now, if I have to execute a Spark job, I need to create a jar
 and deploy it. If I need to run dynamically formed SQL from a Web
 application, is there any way of using SparkSQL in this manner? Perhaps
 through a Web Service or something similar.

 Regards,
 Ashish









Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-25 Thread Irfan Ahmad
Hmmm, this seems very Accumulo-specific, doesn't it? Not sure how to
help with that.


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Tue, Mar 24, 2015 at 4:09 PM, David Holiday dav...@annaisystems.com
wrote:

  hi all,

  got a vagrant image with spark notebook, spark, accumulo, and hadoop all
 running. from notebook I can manually create a scanner and pull test data
 from a table I created using one of the accumulo examples:

 val instanceNameS = "accumulo"
 val zooServersS = "localhost:2181"
 val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
 val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
 val auths = new Authorizations("exampleVis")
 val scanner = connector.createScanner("batchtest1", auths)

 scanner.setRange(new Range("row_00", "row_10"))
 for (entry: Entry[Key, Value] <- scanner) {
   println(entry.getKey + " is " + entry.getValue)
 }

 will give the first ten rows of table data. When I try to create the RDD
 like this:

 val rdd2 = sparkContext.newAPIHadoopRDD(
   new Configuration(),
   classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
   classOf[org.apache.accumulo.core.data.Key],
   classOf[org.apache.accumulo.core.data.Value]
 )

 I get an RDD returned to me that I can't do much with due to the following
 error:

 java.io.IOException: Input info has not been set. at
 org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
 at
 org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
 at
 org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at
 scala.Option.getOrElse(Option.scala:120) at
 org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at
 org.apache.spark.SparkContext.runJob(SparkContext.scala:1367) at
 org.apache.spark.rdd.RDD.count(RDD.scala:927)

 which totally makes sense in light of the fact that I haven't specified
 any parameters as to which table to connect with, what the auths are, etc.

 so my question is: what do I need to do from here to get those first ten
 rows of table data into my RDD?
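
 For reference, the "input info" the error refers to is set through static
 calls on the input format before the RDD is built. A sketch against the
 Accumulo 1.6-era mapreduce API (method names vary across versions):

 import org.apache.accumulo.core.client.ClientConfiguration
 import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
 import org.apache.accumulo.core.client.security.tokens.PasswordToken
 import org.apache.accumulo.core.data.{Key, Value}
 import org.apache.accumulo.core.security.Authorizations
 import org.apache.hadoop.mapreduce.Job

 // The Job is used only as a holder for the InputFormat configuration.
 val job = Job.getInstance()
 AccumuloInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
 AccumuloInputFormat.setZooKeeperInstance(job,
   new ClientConfiguration().withInstance("accumulo").withZkHosts("localhost:2181"))
 AccumuloInputFormat.setInputTableName(job, "batchtest1")
 AccumuloInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))

 val rdd2 = sparkContext.newAPIHadoopRDD(
   job.getConfiguration,
   classOf[AccumuloInputFormat],
   classOf[Key],
   classOf[Value])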



  DAVID HOLIDAY
  Software Engineer
  760 607 3300 | Office
  312 758 8385 | Mobile
  dav...@annaisystems.com broo...@annaisystems.com



 www.AnnaiSystems.com

  On Mar 19, 2015, at 11:25 AM, David Holiday dav...@annaisystems.com
 wrote:

  kk - I'll put something together and get back to you with more :-)

 DAVID HOLIDAY
  Software Engineer
  760 607 3300 | Office
  312 758 8385 | Mobile
  dav...@annaisystems.com broo...@annaisystems.com


 www.AnnaiSystems.com

  On Mar 19, 2015, at 10:59 AM, Irfan Ahmad ir...@cloudphysics.com wrote:

  Once you set up spark-notebook, it'll handle the submits for interactive
 work. Non-interactive work is not handled by it; for that, spark-kernel
 could be used.

  Give it a shot ... it only takes 5 minutes to get it running in
 local-mode.


  *Irfan Ahmad*
 CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com/
 Best of VMworld Finalist
  Best Cloud Management Award
  NetworkWorld 10 Startups to Watch
 EMA Most Notable Vendor

 On Thu, Mar 19, 2015 at 9:51 AM, David Holiday dav...@annaisystems.com
 wrote:

 hi all - thx for the alacritous replies! so regarding how to get things
 from notebook to spark and back, am I correct that spark-submit is the way
 to go?

 DAVID HOLIDAY
  Software Engineer
  760 607 3300 | Office
  312 758 8385 | Mobile
  dav...@annaisystems.com broo...@annaisystems.com


 www.AnnaiSystems.com

  On Mar 19, 2015, at 1:14 AM, Paolo Platter paolo.plat...@agilelab.it
 wrote:

   Yes, I would suggest spark-notebook too.
 It's very simple to setup and it's growing pretty fast.

 Paolo

 Sent from my Windows Phone
  --
 From: Irfan Ahmad ir...@cloudphysics.com
 Sent: 19/03/2015 04:05
 To: davidh dav...@annaisystems.com
 Cc: user@spark.apache.org
 Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?

  I forgot to mention that there are also Zeppelin and jove-notebook, but I
 haven't got any experience with those yet.


  *Irfan Ahmad*
 CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com/
 Best of VMworld Finalist
  Best Cloud Management Award
  NetworkWorld 10 Startups to Watch
 EMA Most Notable Vendor


Re: Spark as a service

2015-03-24 Thread Irfan Ahmad
Also look at the spark-kernel and spark job server projects.

Irfan




Re: Visualizing Spark Streaming data

2015-03-20 Thread Irfan Ahmad
Grafana allows pretty slick interactive use patterns, especially with
graphite as the back-end. In a multi-user environment, why not have each
user just build their own independent dashboards and name them under some
simple naming convention?
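
As a concrete example of the streaming-to-Graphite leg: the plaintext
protocol is just "metric.path value unix-timestamp" sent to port 2003. A
sketch - the host and the shape of the trendingTags stream are assumptions:

import java.io.PrintWriter
import java.net.Socket

// Push each batch's top tags to Graphite over the plaintext protocol.
trendingTags.foreachRDD { rdd => // hypothetical DStream[(String, Long)]
  val top = rdd.top(10)(Ordering.by(_._2))
  val socket = new Socket("graphite-host", 2003) // host is an assumption
  val out = new PrintWriter(socket.getOutputStream, true)
  val now = System.currentTimeMillis / 1000
  top.foreach { case (tag, count) => out.println(s"photos.tags.$tag $count $now") }
  out.close()
  socket.close()
}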


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Fri, Mar 20, 2015 at 1:06 AM, Harut Martirosyan 
harut.martiros...@gmail.com wrote:

 Hey Jeffrey.
 Thanks for reply.

 I already have something similar: I use Grafana and Graphite, and for
 simple metric streaming we've got everything set up right.

 My question is about interactive patterns. For instance, dynamically
 choosing an event to monitor, dynamically choosing a group-by field or any
 sort of filter, then viewing results. This is easy when you have one user,
 but if you have a team of analysts all specifying their own criteria, it
 becomes hard to manage them all.

 On 20 March 2015 at 12:02, Jeffrey Jedele jeffrey.jed...@gmail.com
 wrote:

 Hey Harut,
 I don't think there'll be any general practices, as this part heavily
 depends on your environment, skills, and what you want to achieve.

 If you don't have a general direction yet, I'd suggest you have a look
 at Elasticsearch+Kibana. It's very easy to set up, powerful, and therefore
 is getting a lot of traction currently.

 Regards,
 Jeff

 2015-03-20 8:43 GMT+01:00 Harut harut.martiros...@gmail.com:

 I'm trying to build a dashboard to visualize a stream of events coming from
 mobile devices.
 For example, I have an event called add_photo, from which I want to
 calculate trending tags for added photos for the last x minutes. Then I'd
 like to aggregate that by country, etc. I've built the streaming part,
 which reads from Kafka and calculates the needed results into appropriate
 RDDs; the question now is how to connect it to the UI.

 Are there any general practices on how to pass parameters to Spark from
 some custom-built UI, how to organize data retrieval, what intermediate
 storages to use, etc.?

 Thanks in advance.









 --
 RGRDZ Harut



Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-18 Thread Irfan Ahmad
Hi David,

W00t indeed and great questions. On the notebook front, there are two
options depending on what you are looking for. You can either go with
iPython 3 with Spark-kernel as a backend or you can use spark-notebook.
Both have interesting tradeoffs.

If you're looking for a single notebook platform for your data scientists
that has R and Python as well as a Spark shell, you'll likely want to go
with iPython + Spark-kernel. Downsides with the spark-kernel project are
that data visualization isn't quite there yet and it's early days for
documentation, blogs, etc. The upside is that R and Python work beautifully
and that the iPython committers are super-helpful.

If you are OK with a primarily Spark/Scala experience, then I suggest you go
with spark-notebook. Upsides are that the project is a little further
along, visualization support is better than spark-kernel's (though not as
good as iPython with Python), and the committer is awesome with help. The
downside is that you won't get R and Python.

FWIW: I'm using both at the moment!

Hope that helps.


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 18, 2015 at 5:45 PM, davidh dav...@annaisystems.com wrote:

 hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and
 scanning through this archive with only moderate success. in other words --
 my way of saying sorry if this is answered somewhere obvious and I missed
 it
 :-)

 i've been tasked with figuring out how to connect Notebook, Spark, and
 Accumulo together. The end user will do her work via notebook. thus far,
 I've successfully set up a Vagrant image containing Spark, Accumulo, and
 Hadoop. I was able to use some of the Accumulo example code to create a
 table populated with data, create a simple program in scala that, when
 fired
 off to Spark via spark-submit, connects to accumulo and prints the first
 ten
 rows of data in the table. so w00t on that - but now I'm left with more
 questions:

 1) I'm still stuck on what's considered 'best practice' in terms of hooking
 all this together. Let's say Sally, a user, wants to do some analytic work
 on her data. She pecks the appropriate commands into notebook and fires
 them
 off. how does this get wired together on the back end? Do I, from notebook,
 use spark-submit to send a job to spark and let spark worry about hooking
 into accumulo or is it preferable to create some kind of open stream
 between
 the two?

 2) if I want to extend spark's api, do I need to first submit an endless
 job
 via spark-submit that does something like what this gentleman describes
 http://blog.madhukaraphatak.com/extending-spark-api  ? is there an
 alternative (other than refactoring spark's source) that doesn't involve
 extending the api via a job submission?
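
 (Side note: if the goal is only to add convenience methods rather than new
 distributed behavior, the usual Scala pattern needs no job submission at
 all - an implicit wrapper over RDD. A sketch, which may differ from what
 the linked post does:)

 import org.apache.spark.rdd.RDD

 // Enrich the RDD API via an implicit conversion; purely client-side,
 // no Spark source changes and no separate job submission required.
 object RddExtensions {
   implicit class RichRDD[T](val rdd: RDD[T]) extends AnyVal {
     def tinySample(n: Int): Array[T] = rdd.take(n)
   }
 }

 // Usage: import RddExtensions._ and then call myRdd.tinySample(5)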

 Ultimately, what I'm looking for is help locating docs, blogs, etc. that
 may shed some light on this.

 t/y in advance!

 d







Re: SQL with Spark Streaming

2015-03-11 Thread Irfan Ahmad
Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 11, 2015 at 6:41 AM, Jason Dai jason@gmail.com wrote:

 Yes, a previous prototype is available
 https://github.com/Intel-bigdata/spark-streamsql, and a talk was given at
 last year's Spark Summit (
 http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark
 )

 We are currently porting the prototype to use the latest DataFrame API,
 and will provide a stable version for people to try soon.

 Thanks,
 -Jason


 On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote:

  Intel has a prototype for doing this; SaiSai and Jason are the
 authors. You can probably ask them for some materials.


 The github repository is here: https://github.com/intel-spark/stream-sql

 Also, what I did is write a wrapper class SchemaDStream that internally
 holds a DStream[Row] and a DStream[StructType] (the latter having just one
 element in every RDD) and then allows you to do:
 - operations SchemaRDD => SchemaRDD using
 `rowStream.transformWith(schemaStream, ...)`
 - in particular, you can register this stream's data as a table this way
 - and via a companion object with a method `fromSQL(sql: String):
 SchemaDStream` you can get a new stream from previously registered tables.

 However, you are limited to batch-internal operations, i.e., you can't
 aggregate across batches.

 I am not able to share the code at the moment, but will within the next
 months. It is not very advanced code, though, and should be easy to
 replicate. Also, I have no idea about the performance of transformWith.

 Tobias
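
 For anyone replicating the wrapper, the shape described above is roughly
 this (a sketch only - not Tobias's code; the per-batch temp-table
 registration and the Spark 1.3 DataFrame calls are assumptions):

 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Row, SQLContext}
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.streaming.dstream.DStream

 // Pairs a stream of rows with a stream carrying one schema element per batch.
 class SchemaDStream(sqlContext: SQLContext,
                     rowStream: DStream[Row],
                     schemaStream: DStream[StructType]) {

   // Register each batch's data as a temp table so that subsequent SQL
   // (e.g. a fromSQL companion method) can refer to it by name.
   def registerStreamAsTable(name: String): Unit = {
     rowStream.transformWith(schemaStream,
       (rows: RDD[Row], schemas: RDD[StructType]) => {
         val schema = schemas.first() // the schema RDD holds a single element
         sqlContext.createDataFrame(rows, schema).registerTempTable(name)
         rows
       }).foreachRDD(_ => ()) // force evaluation every batch
   }
 }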