Introducing Spark User Group in Korea & Question on creating non-software goods (stickers)

2016-04-01 Thread Kevin (Sangwoo) Kim
Hi all!

I'm Kevin, one of the contributors to Spark, and I'm organizing the Spark User Group
in Korea. We have 2,500 members in the community, and it's growing even
faster these days.
https://www.facebook.com/groups/sparkkoreauser/
Sorry, it's all Korean.

My co-organizer Sanghoon Lee (phoenixl...@gmail.com) is managing a big
event on April 14th. It will be an event of 300+ people, and we'll have
Doug Cutting (creator of Hadoop) and Dr. Kim, author of
DeepSpark: Spark-Based Deep Learning, at the event.
This is the page for the event (sorry, it's all in Korean again):
http://onoffmix.com/event/65057

My one question is: may we use the Spark logo to print some
stickers? (And where can we get the logo file if we may?)
I guess it would be a simple, small logo sticker, and of course it would be free
for participants of the event and Spark user groups.
The cost of the stickers will be sponsored by SK C, one of the big companies
in Korea and the one Mr. Lee works for, and the sponsorship is totally
non-commercial.

Any feedback toward a successful community & event is welcome!

Regards,
Kevin


Re: ClassNotFoundException

2015-03-16 Thread Kevin (Sangwoo) Kim
Hi Ralph,

It seems like the https://issues.apache.org/jira/browse/SPARK-6299 issue, which
I'm working on.
I submitted a PR for it; would you test it?

Regards,
Kevin

On Tue, Mar 17, 2015 at 1:11 AM Ralph Bergmann ra...@dasralph.de wrote:

 Hi,


 I want to try the JavaSparkPi example [1] on a remote Spark server but I
 get a ClassNotFoundException.

 When I run it locally it works, but not remotely.

 I added the spark-core lib as a dependency. Do I need more?

 Any ideas?

 Thanks Ralph


 [1] ...
 https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java
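
For context, the usual cause of this error against a remote master is that the
application's own classes never reach the executors: the built application jar
has to be shipped along, either through spark-submit or by handing it to the
SparkContext. A minimal Scala sketch, where the master URL and jar path are
placeholders and not taken from this thread:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: ship the application jar so remote executors can load its classes,
// a common fix for ClassNotFoundException when running against a remote master.
val conf = new SparkConf()
  .setAppName("JavaSparkPi")
  .setMaster("spark://remote-host:7077")       // placeholder master URL
  .setJars(Seq("target/sparkpi-example.jar"))  // placeholder path to the built jar
val sc = new SparkContext(conf)
// ... run the Pi computation here ...
sc.stop()

Submitting with spark-submit --class org.apache.spark.examples.JavaSparkPi
--master spark://remote-host:7077 target/sparkpi-example.jar distributes the
jar in the same way.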



Re: Use of nscala-time within spark-shell

2015-02-17 Thread Kevin (Sangwoo) Kim
Great, or you can just use nscala-time with scala 2.10!

On Tue Feb 17 2015 at 5:41:53 PM Hammam CHAMSI hscha...@hotmail.com wrote:

 Thanks Kevin for your reply,

 I downloaded the pre-built version and, as you said, the default Spark Scala
 version is 2.10. I'm now building Spark 1.2.1 with Scala 2.11. I'll share
 the results here.

 Regards,
 --
 From: kevin...@apache.org
 Date: Tue, 17 Feb 2015 01:10:09 +
 Subject: Re: Use of nscala-time within spark-shell
 To: hscha...@hotmail.com; user@spark.apache.org


 Which Scala version was used to build your Spark?
 It seems your nscala-time library is built for Scala 2.11,
 while the default Spark build uses Scala 2.10.


 On Tue Feb 17 2015 at 1:51:47 AM Hammam CHAMSI hscha...@hotmail.com
 wrote:

 Hi All,

 Thanks in advance for your help. I have timestamps which I need to convert
 to datetimes using Scala. A folder contains the three needed jar files:
 joda-convert-1.5.jar, joda-time-2.4.jar and nscala-time_2.11-1.8.0.jar.
 Using the Scala REPL and adding the jars (scala -classpath *.jar),
 I can use nscala-time as follows:

 scala> import com.github.nscala_time.time.Imports._
 import com.github.nscala_time.time.Imports._

 scala> import org.joda._
 import org.joda._

 scala> DateTime.now
 res0: org.joda.time.DateTime = 2015-02-12T15:51:46.928+01:00

 But when I try to use spark-shell:
 ADD_JARS=/home/scala_test_class/nscala-time_2.11-1.8.0.jar,/home/scala_test_class/joda-time-2.4.jar,/home/scala_test_class/joda-convert-1.5.jar
 /usr/local/spark/bin/spark-shell --master local --driver-memory 2g
 --executor-memory 2g --executor-cores 1

 It successfully imports the jars:

 scala> import com.github.nscala_time.time.Imports._
 import com.github.nscala_time.time.Imports._

 scala> import org.joda._
 import org.joda._

 but it fails when using them:
 scala> DateTime.now
 java.lang.NoSuchMethodError:
 scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
 at
 com.github.nscala_time.time.LowPriorityOrderingImplicits$class.ReadableInstantOrdering(Implicits.scala:69)

 at
 com.github.nscala_time.time.Imports$.ReadableInstantOrdering(Imports.scala:20)

 at
 com.github.nscala_time.time.OrderingImplicits$class.$init$(Implicits.scala:61)

 at com.github.nscala_time.time.Imports$.init(Imports.scala:20)
 at com.github.nscala_time.time.Imports$.clinit(Imports.scala)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:17)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
 at $iwC$$iwC$$iwC$$iwC.init(console:28)
 at $iwC$$iwC$$iwC.init(console:30)
 at $iwC$$iwC.init(console:32)
 at $iwC.init(console:34)
 at init(console:36)
 at .init(console:40)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
 at
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
 at
 org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
 at
 org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)

 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
 at
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
 at
 org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)

 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)

 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)

 at
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 

Re: Use of nscala-time within spark-shell

2015-02-17 Thread Kevin (Sangwoo) Kim
Then, why don't you use nscala-time_2.10-1.8.0.jar instead of
nscala-time_2.11-1.8.0.jar?
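
For illustration, the spark-shell invocation from the quoted mail might then
look like this with the Scala 2.10 build (a sketch only: the
/home/scala_test_class paths and the nscala-time_2.10-1.8.0.jar file are
assumed, and --jars is used in place of the ADD_JARS variable):

/usr/local/spark/bin/spark-shell --master local --driver-memory 2g \
  --executor-memory 2g --executor-cores 1 \
  --jars /home/scala_test_class/nscala-time_2.10-1.8.0.jar,/home/scala_test_class/joda-time-2.4.jar,/home/scala_test_class/joda-convert-1.5.jar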

On Tue Feb 17 2015 at 5:55:50 PM Hammam CHAMSI hscha...@hotmail.com wrote:

 I can use nscala-time with Scala, but my issue is that I can't use it
 within the spark-shell console! It gives me the error below.

 Thanks

 --
 From: kevin...@apache.org
 Date: Tue, 17 Feb 2015 08:50:04 +

 Subject: Re: Use of nscala-time within spark-shell
 To: hscha...@hotmail.com; kevin...@apache.org; user@spark.apache.org


 Great, or you can just use nscala-time with scala 2.10!

 On Tue Feb 17 2015 at 5:41:53 PM Hammam CHAMSI hscha...@hotmail.com
 wrote:

 Thanks Kevin for your reply,

 I downloaded the pre-built version and, as you said, the default Spark Scala
 version is 2.10. I'm now building Spark 1.2.1 with Scala 2.11. I'll share
 the results here.

 Regards,
 --
 From: kevin...@apache.org
 Date: Tue, 17 Feb 2015 01:10:09 +
 Subject: Re: Use of nscala-time within spark-shell
 To: hscha...@hotmail.com; user@spark.apache.org


 Which Scala version was used to build your Spark?
 It seems your nscala-time library is built for Scala 2.11,
 while the default Spark build uses Scala 2.10.


 On Tue Feb 17 2015 at 1:51:47 AM Hammam CHAMSI hscha...@hotmail.com
 wrote:

 Hi All,

 Thanks in advance for your help. I have timestamps which I need to convert
 to datetimes using Scala. A folder contains the three needed jar files:
 joda-convert-1.5.jar, joda-time-2.4.jar and nscala-time_2.11-1.8.0.jar.
 Using the Scala REPL and adding the jars (scala -classpath *.jar),
 I can use nscala-time as follows:

 scala> import com.github.nscala_time.time.Imports._
 import com.github.nscala_time.time.Imports._

 scala> import org.joda._
 import org.joda._

 scala> DateTime.now
 res0: org.joda.time.DateTime = 2015-02-12T15:51:46.928+01:00

 But when I try to use spark-shell:
 ADD_JARS=/home/scala_test_class/nscala-time_2.11-1.8.0.jar,/home/scala_test_class/joda-time-2.4.jar,/home/scala_test_class/joda-convert-1.5.jar
 /usr/local/spark/bin/spark-shell --master local --driver-memory 2g
 --executor-memory 2g --executor-cores 1

 It successfully imports the jars:

 scala> import com.github.nscala_time.time.Imports._
 import com.github.nscala_time.time.Imports._

 scala> import org.joda._
 import org.joda._

 but it fails when using them:
 scala> DateTime.now
 java.lang.NoSuchMethodError:
 scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
 at
 com.github.nscala_time.time.LowPriorityOrderingImplicits$class.ReadableInstantOrdering(Implicits.scala:69)

 at
 com.github.nscala_time.time.Imports$.ReadableInstantOrdering(Imports.scala:20)

 at
 com.github.nscala_time.time.OrderingImplicits$class.$init$(Implicits.scala:61)

 at com.github.nscala_time.time.Imports$.init(Imports.scala:20)
 at com.github.nscala_time.time.Imports$.clinit(Imports.scala)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:17)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
 at $iwC$$iwC$$iwC$$iwC.init(console:28)
 at $iwC$$iwC$$iwC.init(console:30)
 at $iwC$$iwC.init(console:32)
 at $iwC.init(console:34)
 at init(console:36)
 at .init(console:40)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
 at
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
 at
 org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
 at
 org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)

 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
 at
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
 at
 org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)

 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)

 at
 

Re: Use of nscala-time within spark-shell

2015-02-16 Thread Kevin (Sangwoo) Kim
Which Scala version was used to build your Spark?
It seems your nscala-time library is built for Scala 2.11,
while the default Spark build uses Scala 2.10.
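
As a side note not raised in this thread (where plain jars are added by hand):
if the dependency were managed with sbt, the %% operator would pick the
artifact built for the project's Scala binary version, which avoids this kind
of mismatch. A minimal sketch of such a build definition:

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "com.github.nscala-time" %% "nscala-time"  % "1.8.0", // resolves to nscala-time_2.10
  "joda-time"              %  "joda-time"    % "2.4",
  "org.joda"               %  "joda-convert" % "1.5"
)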


On Tue Feb 17 2015 at 1:51:47 AM Hammam CHAMSI hscha...@hotmail.com wrote:

 Hi All,

 Thanks in advance for your help. I have timestamps which I need to convert
 to datetimes using Scala. A folder contains the three needed jar files:
 joda-convert-1.5.jar, joda-time-2.4.jar and nscala-time_2.11-1.8.0.jar.
 Using the Scala REPL and adding the jars (scala -classpath *.jar),
 I can use nscala-time as follows:

 scala> import com.github.nscala_time.time.Imports._
 import com.github.nscala_time.time.Imports._

 scala> import org.joda._
 import org.joda._

 scala> DateTime.now
 res0: org.joda.time.DateTime = 2015-02-12T15:51:46.928+01:00

 But when I try to use spark-shell:
 ADD_JARS=/home/scala_test_class/nscala-time_2.11-1.8.0.jar,/home/scala_test_class/joda-time-2.4.jar,/home/scala_test_class/joda-convert-1.5.jar
 /usr/local/spark/bin/spark-shell --master local --driver-memory 2g
 --executor-memory 2g --executor-cores 1

 It successfully imports the jars:

 scala> import com.github.nscala_time.time.Imports._
 import com.github.nscala_time.time.Imports._

 scala> import org.joda._
 import org.joda._

 but it fails when using them:
 scala> DateTime.now
 java.lang.NoSuchMethodError:
 scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
 at
 com.github.nscala_time.time.LowPriorityOrderingImplicits$class.ReadableInstantOrdering(Implicits.scala:69)

 at
 com.github.nscala_time.time.Imports$.ReadableInstantOrdering(Imports.scala:20)

 at
 com.github.nscala_time.time.OrderingImplicits$class.$init$(Implicits.scala:61)

 at com.github.nscala_time.time.Imports$.init(Imports.scala:20)
 at com.github.nscala_time.time.Imports$.clinit(Imports.scala)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:17)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
 at $iwC$$iwC$$iwC$$iwC.init(console:28)
 at $iwC$$iwC$$iwC.init(console:30)
 at $iwC$$iwC.init(console:32)
 at $iwC.init(console:34)
 at init(console:36)
 at .init(console:40)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
 at
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
 at
 org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
 at
 org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)

 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
 at
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
 at
 org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)

 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)

 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)

 at
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 Your help is very aappreciated,

 Regards,

 Hammam



Re: Can spark job server be used to visualize streaming data?

2015-02-13 Thread Kevin (Sangwoo) Kim
I'm not very sure about CDH 5.3,
but Zeppelin now works with Spark 1.2, as spark-repl has been published
with Spark 1.2.1.
Please try again!

On Fri Feb 13 2015 at 3:55:19 PM Su She suhsheka...@gmail.com wrote:

 Thanks Kevin for the link. I have had issues trying to install Zeppelin, as
 I believe it is not yet supported for CDH 5.3 and Spark 1.2. Please
 correct me if I am mistaken.

 On Thu, Feb 12, 2015 at 7:33 PM, Kevin (Sangwoo) Kim kevin...@apache.org
 wrote:

 Apache Zeppelin also has a scheduler and then you can reload your chart
 periodically,
 Check it out:
 http://zeppelin.incubator.apache.org/docs/tutorial/tutorial.html




 On Fri Feb 13 2015 at 7:29:00 AM Silvio Fiorito 
 silvio.fior...@granturing.com wrote:

   One method I’ve used is to publish each batch to a message bus or
 queue with a custom UI listening on the other end, displaying the results
 in d3.js or some other app. As far as I’m aware there isn’t a tool that
 will directly take a DStream.
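
A minimal Scala sketch of that publish-each-batch idea; wordCounts and publish
below are placeholders for the word-count DStream and for whichever message bus
or backend client is chosen, since none is named in the thread:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Sketch only: push each micro-batch to an external sink that a custom UI reads.
def pushBatchesToUi(wordCounts: DStream[(String, Long)],
                    publish: (Long, Array[(String, Long)]) => Unit): Unit = {
  wordCounts.foreachRDD { (rdd: RDD[(String, Long)], time: Time) =>
    // keep only the 20 most frequent words of this micro-batch
    val top = rdd.top(20)(Ordering.by[(String, Long), Long](_._2))
    publish(time.milliseconds, top)  // hand the batch to the visualization backend
  }
}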

  Spark Notebook seems to have some support for updating graphs
 periodically. I haven’t used it myself yet so not sure how well it works.
 See here: https://github.com/andypetrella/spark-notebook

   From: Su She
 Date: Thursday, February 12, 2015 at 1:55 AM
 To: Felix C
 Cc: Kelvin Chu, user@spark.apache.org

 Subject: Re: Can spark job server be used to visualize streaming data?

   Hello Felix,

  I am already streaming in very simple data using Kafka (few messages /
 second, each record only has 3 columns...really simple, but looking to
 scale once I connect everything). I am processing it in Spark Streaming and
 am currently writing word counts to hdfs. So the part where I am confused
 is...

 Kafka Publishes Data -> Kafka Consumer/Spark Streaming Receives Data ->
 Spark Word Count -> *How do I visualize?*

  is there a viz tool that I can set up to visualize JavaPairDStreams?
 or do I have to write to hbase/hdfs first?

  Thanks!

 On Wed, Feb 11, 2015 at 10:39 PM, Felix C felixcheun...@hotmail.com
 wrote:

  What kind of data do you have? Kafka is a popular source to use with
 Spark Streaming.
 But Spark Streaming also supports reading from a file. It's called a basic
 source:

 https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers

 --- Original Message ---

 From: Su She suhsheka...@gmail.com
 Sent: February 11, 2015 10:23 AM
 To: Felix C felixcheun...@hotmail.com
 Cc: Kelvin Chu 2dot7kel...@gmail.com, user@spark.apache.org
 Subject: Re: Can spark job server be used to visualize streaming data?

   Thank you Felix and Kelvin. I think I'll definitely be using the k-means
 tools in MLlib.

  It seems the best way to stream data is by storing in hbase and then
 using an api in my viz to extract data? Does anyone have any thoughts on
 this?

   Thanks!


 On Tue, Feb 10, 2015 at 11:45 PM, Felix C felixcheun...@hotmail.com
 wrote:

  Checkout

 https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html

 In there are links to how that is done.


 --- Original Message ---

 From: Kelvin Chu 2dot7kel...@gmail.com
 Sent: February 10, 2015 12:48 PM
 To: Su She suhsheka...@gmail.com
 Cc: user@spark.apache.org
 Subject: Re: Can spark job server be used to visualize streaming data?

   Hi Su,

  Out of the box, no. But, I know people integrate it with Spark
 Streaming to do real-time visualization. It will take some work though.

  Kelvin

 On Mon, Feb 9, 2015 at 5:04 PM, Su She suhsheka...@gmail.com wrote:

  Hello Everyone,

  I was reading this blog post:
 http://homes.esat.kuleuven.be/~bioiuser/blog/a-d3-visualisation-from-spark-as-a-service/

  and was wondering if this approach can be taken to visualize
 streaming data...not just historical data?

  Thank you!

  -Suh








Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Kevin (Sangwoo) Kim
Apache Zeppelin also has a scheduler and then you can reload your chart
periodically,
Check it out:
http://zeppelin.incubator.apache.org/docs/tutorial/tutorial.html




On Fri Feb 13 2015 at 7:29:00 AM Silvio Fiorito 
silvio.fior...@granturing.com wrote:

   One method I’ve used is to publish each batch to a message bus or queue
 with a custom UI listening on the other end, displaying the results in
 d3.js or some other app. As far as I’m aware there isn’t a tool that will
 directly take a DStream.

  Spark Notebook seems to have some support for updating graphs
 periodically. I haven’t used it myself yet so not sure how well it works.
 See here: https://github.com/andypetrella/spark-notebook

   From: Su She
 Date: Thursday, February 12, 2015 at 1:55 AM
 To: Felix C
 Cc: Kelvin Chu, user@spark.apache.org

 Subject: Re: Can spark job server be used to visualize streaming data?

   Hello Felix,

  I am already streaming in very simple data using Kafka (few messages /
 second, each record only has 3 columns...really simple, but looking to
 scale once I connect everything). I am processing it in Spark Streaming and
 am currently writing word counts to hdfs. So the part where I am confused
 is...

 Kafka Publishes Data -> Kafka Consumer/Spark Streaming Receives Data ->
 Spark Word Count -> *How do I visualize?*

  is there a viz tool that I can set up to visualize JavaPairDStreams? or
 do I have to write to hbase/hdfs first?

  Thanks!

 On Wed, Feb 11, 2015 at 10:39 PM, Felix C felixcheun...@hotmail.com
 wrote:

  What kind of data do you have? Kafka is a popular source to use with
 Spark Streaming.
 But Spark Streaming also supports reading from a file. It's called a basic
 source:

 https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers

 --- Original Message ---

 From: Su She suhsheka...@gmail.com
 Sent: February 11, 2015 10:23 AM
 To: Felix C felixcheun...@hotmail.com
 Cc: Kelvin Chu 2dot7kel...@gmail.com, user@spark.apache.org
 Subject: Re: Can spark job server be used to visualize streaming data?

   Thank you Felix and Kelvin. I think I'll definitely be using the k-means
 tools in MLlib.

  It seems the best way to stream data is by storing in hbase and then
 using an api in my viz to extract data? Does anyone have any thoughts on
 this?

   Thanks!


 On Tue, Feb 10, 2015 at 11:45 PM, Felix C felixcheun...@hotmail.com
 wrote:

  Checkout

 https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html

 In there are links to how that is done.


 --- Original Message ---

 From: Kelvin Chu 2dot7kel...@gmail.com
 Sent: February 10, 2015 12:48 PM
 To: Su She suhsheka...@gmail.com
 Cc: user@spark.apache.org
 Subject: Re: Can spark job server be used to visualize streaming data?

   Hi Su,

  Out of the box, no. But, I know people integrate it with Spark
 Streaming to do real-time visualization. It will take some work though.

  Kelvin

 On Mon, Feb 9, 2015 at 5:04 PM, Su She suhsheka...@gmail.com wrote:

  Hello Everyone,

  I was reading this blog post:
 http://homes.esat.kuleuven.be/~bioiuser/blog/a-d3-visualisation-from-spark-as-a-service/

  and was wondering if this approach can be taken to visualize streaming
 data...not just historical data?

  Thank you!

  -Suh







Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread Kevin (Sangwoo) Kim
Great to hear you got a solution!!
Cheers!

Kevin

On Wed Jan 21 2015 at 11:13:44 AM jagaximo takuya_seg...@dwango.co.jp
wrote:

 Kevin (Sangwoo) Kim wrote
  If keys are not too many,
  You can do like this:
 
  val data = List(
    ("A", Set(1,2,3)),
    ("A", Set(1,2,4)),
    ("B", Set(1,2,3))
  )
  val rdd = sc.parallelize(data)
  rdd.persist()

  rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
  rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
  rdd.unpersist()

  ==
  data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
  rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
  res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
  res334: Long = 4
  res335: Long = 3
  res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
 
  Regards,
  Kevin

 Wow, got it! Good solution.
 Fortunately, I know which keys have large Sets, so I was able to adopt this
 approach.

 Thanks!








Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin (Sangwoo) Kim
In your code, you're combining large sets, like
(set1 ++ set2).size
which is not a good idea.

(rdd1 ++ rdd2).distinct
is an equivalent implementation and will be computed in a distributed manner.
I'm not very sure whether your computation on keyed sets is feasible to
transform into RDDs.

Regards,
Kevin
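
As a small illustration of the contrast above (rdd1, rdd2 and the sample values
are made up, not taken from the original code):

import org.apache.spark.rdd.RDD

// Driver-side version: (set1 ++ set2).size materializes both sets in a single
// JVM, which is what blows up for very large sets.
// Distributed version (using the shell's sc): the union and the de-duplication
// run across the cluster instead.
val rdd1: RDD[String] = sc.parallelize(Seq("a", "b", "c"))
val rdd2: RDD[String] = sc.parallelize(Seq("b", "c", "d"))
val uniqueCount = (rdd1 ++ rdd2).distinct().count()  // 4 with this toy data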


On Tue Jan 20 2015 at 1:57:52 PM Kevin Jung itsjb.j...@samsung.com wrote:

 As far as I know, the tasks before calling saveAsText are transformations,
 so they are lazily computed. Then the saveAsText action performs all the
 transformations and your Set[String] grows at that point. It creates a large
 collection if you have few keys, and this easily causes OOM when your
 executor memory and fraction settings are not suitable for computing this.
 If you want only collection counts by key, you can use countByKey(), or
 map() the RDD[(String, Set[String])] to RDD[(String, Long)] after creating the
 hoge RDD, so that reduceByKey collects only the counts per key.
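
A sketch of the map-then-reduceByKey suggestion above, with hoge standing for
the RDD[(String, Set[String])] from the original question. Note, as the
follow-up reply points out, this sums set sizes per key rather than counting
distinct strings:

import org.apache.spark.SparkContext._  // pair-RDD functions such as reduceByKey (pre-1.3 builds)

val sizesPerKey = hoge
  .map { case (key, set) => (key, set.size.toLong) }  // drop the sets, keep only their sizes
  .reduceByKey(_ + _)                                 // total (non-distinct) element count per key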







Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin (Sangwoo) Kim
If there are not too many keys,
you can do it like this:

val data = List(
  ("A", Set(1,2,3)),
  ("A", Set(1,2,4)),
  ("B", Set(1,2,3))
)
val rdd = sc.parallelize(data)
rdd.persist()

rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
rdd.unpersist()

==
data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
res334: Long = 4
res335: Long = 3
res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66

Regards,
Kevin



On Tue Jan 20 2015 at 2:53:22 PM jagaximo takuya_seg...@dwango.co.jp
wrote:

 What I want to do is get a unique count for each key, so using map() or
 countByKey() does not give a unique count (because duplicate strings are
 likely to be counted)...








ExceptionInInitializerError when using a class defined in REPL

2015-01-18 Thread Kevin (Sangwoo) Kim
Hi experts,

I'm getting an ExceptionInInitializerError when using a class defined in the REPL.
The code is something like this:

case class TEST(a: String)
sc.textFile(~~~).map(TEST(_)).count

The code above used to work well until yesterday, but suddenly, for some
reason, it doesn't work and fails with the error below.
I confirmed it still works in local mode.

I've been getting a headache working on this problem the whole weekend.
Any ideas?

environment:
aws ec2, s3
spark v1.1.1, hadoop 2.2

Attaching error logs:
===

15/01/18 13:54:22 INFO TaskSetManager: Lost task 0.19 in stage 0.0 (TID 19)
on executor ip-172-16-186-181.ap-northeast-1.compute.internal:
java.lang.ExceptionInInitializerError (null) [duplicate 5]
15/01/18 13:54:22 ERROR TaskSetManager: Task 0 in stage 0.0 failed 20
times; aborting job
15/01/18 13:54:22 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
15/01/18 13:54:22 INFO TaskSchedulerImpl: Cancelling stage 0
15/01/18 13:54:22 INFO DAGScheduler: Failed to run first at console:45
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 20 times, most recent failure: Lost task 0.19 in stage
0.0 (TID 19, ip-172-16-186-181.ap-northeast-1.compute.internal):
java.lang.ExceptionInInitializerError:
$iwC.init(console:6)
init(console:35)
.init(console:39)
.clinit(console)

$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DailyStatsChartBuilder$.fromCsv(console:41)

$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:43)

$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:43)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1079)
org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1079)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Re: Futures timed out during unpersist

2015-01-17 Thread Kevin (Sangwoo) Kim
The data size is about 300~400 GB; I'm using an 800 GB cluster and set driver
memory to 50 GB.

On Sat Jan 17 2015 at 6:01:46 PM Akhil Das ak...@sigmoidanalytics.com
wrote:

 What is the data size? Have you tried increasing the driver memory??

 Thanks
 Best Regards

 On Sat, Jan 17, 2015 at 1:01 PM, Kevin (Sangwoo) Kim kevin...@apache.org
 wrote:

 Hi experts,
 I got an error while unpersisting an RDD.
 Any ideas?

 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:103)
   at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:951)
   at org.apache.spark.rdd.RDD.unpersist(RDD.scala:168)





Futures timed out during unpersist

2015-01-16 Thread Kevin (Sangwoo) Kim
Hi experts,
I got an error while unpersisting an RDD.
Any ideas?

java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:103)
  at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:951)
  at org.apache.spark.rdd.RDD.unpersist(RDD.scala:168)