Re: Shuffle Files

2014-03-04 Thread Aniket Mokashi
From the BlockManager and ShuffleMapTask code, it writes under
spark.local.dir or java.io.tmpdir:

val diskBlockManager = new DiskBlockManager(shuffleBlockManager,
  conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")))
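
For example, a minimal sketch (the directory path below is just a placeholder,
not a recommendation) of overriding spark.local.dir when building the SparkConf:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ShuffleDirExample")
  .set("spark.local.dir", "/mnt/spark-tmp") // placeholder path; shuffle files go under here
val sc = new SparkContext(conf)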




On Mon, Mar 3, 2014 at 10:45 PM, Usman Ghani us...@platfora.com wrote:

 Where on the filesystem does spark write the shuffle files?




-- 
...:::Aniket:::... Quetzalco@tl


RE: Connection Refused When Running SparkPi Locally

2014-03-04 Thread Li, Rui
I've encountered similar problems.
Maybe you can try using the hostname or FQDN (rather than the IP address) of your
node for the master URI.
In my case, Akka picks the FQDN for the master URI, and the worker has to use
exactly the same string to connect.
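
A minimal sketch, assuming the master registered itself under the hypothetical
FQDN my-host.example.com, of passing exactly that string to the driver:

import org.apache.spark.{SparkConf, SparkContext}

// The FQDN below is a placeholder; it must match the master URL shown in the master's log/UI.
val conf = new SparkConf()
  .setMaster("spark://my-host.example.com:7077")
  .setAppName("SparkPi")
val sc = new SparkContext(conf)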

From: Benny Thompson [mailto:ben.d.tho...@gmail.com]
Sent: Saturday, March 01, 2014 10:18 AM
To: u...@spark.incubator.apache.org
Subject: Connection Refused When Running SparkPi Locally

I'm trying to run a simple execution of the SparkPi example.  I started the
master and one worker, then executed the job on my local cluster, but ended up
getting a sequence of errors all ending with

Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: /127.0.0.1:39398

I originally tried running my master and worker without configuration but ended
up with the same error.  I tried changing to 127.0.0.1 to test whether it was
just a firewall issue, since the server is locked down from the outside world.

My conf/spark-env.sh contains the following:
export SPARK_MASTER_IP=127.0.0.1

Here are the commands I run, in order:
1) sbin/start-master.sh (to start the master)
2) bin/spark-class org.apache.spark.deploy.worker.Worker 
spark://127.0.0.1:7077 --ip 127.0.0.1 --port  (in a 
different session on the same machine to start the slave)
3) bin/run-example org.apache.spark.examples.SparkPi 
spark://127.0.0.1:7077 (in a different session on the 
same machine to start the job)

I find it hard to believe that I'm locked down enough that running locally 
would cause problems.

Any help is greatly appreciated!

Thanks,
Benny


RE: Actors and sparkcontext actions

2014-03-04 Thread Suraj Satishkumar Sheth
Hi Ognen,
See if this helps. I was working on this :

// Imports added for completeness; Actor here is assumed to be scala.actors.Actor, given act().
import scala.actors.Actor
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class MyClass[T](sc: SparkContext, flag1: Boolean, rdd: RDD[T], hdfsPath:
String) extends Actor {

  def act(){
if(flag1) this.process()
else this.count
  }
  
  private def process(){
println(sc.textFile(hdfsPath).count)
//do the processing
  }
  
  private def count(){
   println(rdd.count)
   //do the counting
  }

}
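
A hedged usage sketch (the HDFS path is a placeholder, and this assumes Actor is
scala.actors.Actor, with sc and rdd already created):

val worker = new MyClass(sc, flag1 = true, rdd, "hdfs:///path/to/data")
worker.start() // start() runs act(), which calls process() because flag1 is true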

Thanks and Regards,
Suraj Sheth


-Original Message-
From: Ognen Duzlevski [mailto:og...@nengoiksvelzud.com] 
Sent: 27 February 2014 01:09
To: u...@spark.incubator.apache.org
Subject: Actors and sparkcontext actions

Can someone point me to a simple, short code example of creating a basic Actor 
that gets a context and runs an operation such as .textFile.count? 
I am trying to figure out how to create just a basic actor that gets a message 
like this:

case class Msg(filename:String, ctx: SparkContext)

and then something like this:

class HelloActor extends Actor {
 import context.dispatcher

 def receive = {
 case Msg(fn, ctx) => {
 // get the count here!
 // ctx.textFile(fn).count
 }
 case _ => println("huh?")
 }
}

Where I would want to do something like:

val conf = new
SparkConf().setMaster("spark://192.168.10.29:7077").setAppName("Hello").setSparkHome("/Users/maketo/plainvanilla/spark-0.9")
val sc = new SparkContext(conf)
val system = ActorSystem("mySystem")

val helloActor1 = system.actorOf(Props[HelloActor], name = "helloactor1")
helloActor1 ! new Msg("test.json", sc)
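
For completeness, a hedged sketch of what the receive case might look like once
filled in (assuming it is acceptable to run the count synchronously inside the actor):

 def receive = {
 case Msg(fn, ctx) => println(ctx.textFile(fn).count) // blocks the actor while the job runs
 case _ => println("huh?")
 }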

Thanks,
Ognen


Re: Problem with delete spark temp dir on spark 0.8.1

2014-03-04 Thread Akhil Das
Hi,

Try to clean your temp dir: System.getProperty("java.io.tmpdir")

Also, can you paste a longer stack trace?




Thanks
Best Regards


On Tue, Mar 4, 2014 at 2:55 PM, goi cto goi@gmail.com wrote:

 Hi,

 I am running a Spark Java program on a local machine. When I try to write
 the output to a file (RDD.saveAsTextFile) I am getting this exception:

 Exception in thread "Delete Spark temp dir" ...

 This is running on my local Windows machine.

 Any ideas?

 --
 Eran | CTO





-- 
Thanks
Best Regards


Re: Problem with delete spark temp dir on spark 0.8.1

2014-03-04 Thread goi cto
Exception in thread "delete Spark temp dir C:\Users\..."
java.io.IOException: failed to delete: C:\Users\...\simple-project-1.0.jar
 at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:495)
 at
org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:491)

I deleted my temp dir as suggested, and indeed all spark... directories were
deleted. After that I ran the program again and got the same error, and again a
spark-... directory containing simple-project-1.0.jar was left on the file
system.
I had no problem deleting this file once the program completed.

Eran


On Tue, Mar 4, 2014 at 11:36 AM, Akhil Das ak...@mobipulse.in wrote:

 Hi,

 Try to clean your temp dir: System.getProperty("java.io.tmpdir")

  Also, can you paste a longer stack trace?




 Thanks
 Best Regards


 On Tue, Mar 4, 2014 at 2:55 PM, goi cto goi@gmail.com wrote:

 Hi,

 I am running a Spark Java program on a local machine. When I try to write
 the output to a file (RDD.saveAsTextFile) I am getting this exception:

 Exception in thread "Delete Spark temp dir" ...

 This is running on my local Windows machine.

 Any ideas?

 --
 Eran | CTO





 --
 Thanks
 Best Regards




-- 
Eran | CTO


RDD Manipulation in Scala.

2014-03-04 Thread trottdw
Hello, I am using Spark with Scala and I am attempting to understand the
different filtering and mapping capabilities available.  I haven't found an
example of the specific task I would like to do.

I am trying to read in a tab-separated text file and filter specific entries.
I would like this filter to be applied to particular columns, not whole lines.
I was using the following to split the data, but attempts to filter by
column afterwards are not working.
-
   val data = sc.textFile("test_data.txt")
   var parsedData = data.map(_.split("\t").map(_.toString))
--

To give a more concrete example of my goal, suppose the data file is:
A1  A2  A3  A4
B1  B2  A3  A4
C1  A2  C2  C3


How would I filter the data based on the second column to return only those
entries which have A2 in column two, so that the resulting RDD would just
be:

A1  A2  A3  A4
C1  A2  C2  C3

Is there a convenient way to do this?  Any suggestions or assistance would
be appreciated.
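
(For reference, a minimal sketch of one way to do this, assuming tab-separated
lines and 0-based column indices:)

   val parsedData = data.map(_.split("\t"))
   val filtered = parsedData.filter(cols => cols.length > 1 && cols(1) == "A2")
   filtered.map(_.mkString("\t")).collect().foreach(println)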




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Manipulation-in-Scala-tp2285.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: RDD Manipulation in Scala.

2014-03-04 Thread trottdw
Thanks Sean, I think that is doing what I needed.  It was much simpler than
what I had been attempting.

Is it possible to do an OR-style filter, so that, for example, column 2
can be filtered on A2 and column 3 on A4?
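
(A hedged sketch of how an OR condition might look; parsedData is the split RDD
from the earlier message, and the column indices here are assumptions:)

   val filtered = parsedData.filter(cols =>
     cols.length > 2 && (cols(1) == "A2" || cols(2) == "A4"))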





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Manipulation-in-Scala-tp2285p2287.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Actors and sparkcontext actions

2014-03-04 Thread Debasish Das
Hi Ognen,

Any particular reason for choosing Scalatra over options like Play or Spray?

Is Scalatra much better at serving APIs, or is it due to its similarity with
Ruby's Sinatra?

Did you try the other options and then pick Scalatra?

Thanks.
Deb



On Tue, Mar 4, 2014 at 4:50 AM, Ognen Duzlevski og...@plainvanillagames.com
 wrote:

 Suraj, I posted to this list a link to my blog where I detail how to do a
 simple actor/sparkcontext thing with the added obstacle of it being within
 a Scalatra servlet.

 Thanks for the code!
 Ognen


 On 3/4/14, 3:20 AM, Suraj Satishkumar Sheth wrote:

 Hi Ognen,
 See if this helps. I was working on this :

 class MyClass[T](sc: SparkContext, flag1: Boolean, rdd: RDD[T],
 hdfsPath: String) extends Actor {

   def act(){
     if(flag1) this.process()
     else this.count
   }

   private def process(){
     println(sc.textFile(hdfsPath).count)
     //do the processing
   }

   private def count(){
     println(rdd.count)
     //do the counting
   }

 }

 Thanks and Regards,
 Suraj Sheth


 -Original Message-
 From: Ognen Duzlevski [mailto:og...@nengoiksvelzud.com]
 Sent: 27 February 2014 01:09
 To: u...@spark.incubator.apache.org
 Subject: Actors and sparkcontext actions

 Can someone point me to a simple, short code example of creating a basic
 Actor that gets a context and runs an operation such as .textFile.count?
 I am trying to figure out how to create just a basic actor that gets a
 message like this:

 case class Msg(filename:String, ctx: SparkContext)

 and then something like this:

 class HelloActor extends Actor {
   import context.dispatcher

   def receive = {
   case Msg(fn, ctx) => {
   // get the count here!
   // ctx.textFile(fn).count
   }
   case _ => println("huh?")
   }
 }

 Where I would want to do something like:

 val conf = new
 SparkConf().setMaster("spark://192.168.10.29:7077").setAppName("Hello").
 setSparkHome("/Users/maketo/plainvanilla/spark-0.9")
 val sc = new SparkContext(conf)
 val system = ActorSystem("mySystem")

 val helloActor1 = system.actorOf(Props[HelloActor], name =
 "helloactor1")
 helloActor1 ! new Msg("test.json", sc)

 Thanks,
 Ognen


 --
 Some people, when confronted with a problem, think "I know, I'll use
 regular expressions." Now they have two problems.
 -- Jamie Zawinski




Re: o.a.s.u.Vector instances for equality

2014-03-04 Thread Oleksandr Olgashko
Thanks.
Does it make sense to add an ==/equals method to Vector with this (or similar)
behavior?
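
In the meantime, a hedged sketch of a standalone helper (not an existing Spark
API) that wraps the element-wise comparison shown in the reply below:

import org.apache.spark.util.Vector

// eps and the tolerance-based check are assumptions; adjust as needed.
def vectorsAlmostEqual(a: Vector, b: Vector, eps: Double = 1e-6): Boolean =
  a.elements.length == b.elements.length &&
    a.elements.zip(b.elements).forall { case (x, y) => (x - y).abs < eps }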


2014-03-04 6:00 GMT+02:00 Shixiong Zhu zsxw...@gmail.com:

 Vector is an enhanced Array[Double]. You can compare it like
 Array[Double]. E.g.,

 scala> val v1 = Vector(1.0, 2.0)
 v1: org.apache.spark.util.Vector = (1.0, 2.0)

 scala> val v2 = Vector(1.0, 2.0)
 v2: org.apache.spark.util.Vector = (1.0, 2.0)

 scala> val exactResult = v1.elements.sameElements(v2.elements) // exact
 comparison
 exactResult: Boolean = true

 scala> val delta = 1E-6
 delta: Double = 1.0E-6

 scala> val inexactResult = v1.elements.length == v2.elements.length &&
 v1.elements.zip(v2.elements).forall { case (x, y) => (x - y).abs < delta }
 // inexact comparison
 inexactResult: Boolean = true

 Best Regards,
 Shixiong Zhu


 2014-03-04 4:23 GMT+08:00 Oleksandr Olgashko alexandrolg...@gmail.com:

 Hello. How should I best check two Vectors for equality?

 val a = new Vector(Array(1.0))
 val b = new Vector(Array(1.0))
 println(a == b)
 // false





Re: Actors and sparkcontext actions

2014-03-04 Thread Ognen Duzlevski

Deb,

On 3/4/14, 9:02 AM, Debasish Das wrote:

Hi Ognen,

Any particular reason for choosing Scalatra over options like Play or
Spray?

Is Scalatra much better at serving APIs, or is it due to its similarity
with Ruby's Sinatra?

Did you try the other options and then pick Scalatra?

Not really. I just happen to like Scalatra; it is easy to read and easy to
write. If it had not worked out, I would probably have gone with something
like Unfiltered.

Ognen


Re: Missing Spark URL after starting the master

2014-03-04 Thread Bin Wang
Hi Mayur,

I am using CDH4.6.0p0.26, and the latest Cloudera Spark parcel is Spark
0.9.0 CDH4.6.0p0.50.
As I mentioned, somehow the Cloudera Spark version doesn't contain the
run-example shell scripts. However, it is automatically configured and it
is pretty easy to set up across the cluster.

Thanks,
Bin


On Tue, Mar 4, 2014 at 10:59 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 I have it on a Cloudera VM:
 http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM
 Which version are you trying to set up on Cloudera? Also, which Cloudera
 version are you using?


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Mar 3, 2014 at 4:29 PM, Bin Wang binwang...@gmail.com wrote:

 Hi Ognen/Mayur,

 Thanks for the reply; it is good to know how easy it is to set up Spark
 on an AWS cluster.

 My situation is a bit different from yours: our company already has a
 cluster and it really doesn't make much sense not to use it. That is
 why I have been going through this. I really wish there were some
 tutorials teaching you how to set up a Spark cluster on a bare-metal CDH
 cluster, or some way to tweak the CDH Spark distribution so it is up to date.

 Ognen, of course it would be very helpful if you could 'history | grep
 spark... ' and document the work that you have done, since you've already
 made it!

 Bin



 On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski 
 og...@plainvanillagames.com wrote:

  I should add that in this setup you really do not need to look for the
 printout of the master node's IP - you set it yourself a priori. If anyone
 is interested, let me know, I can write it all up so that people can follow
 some set of instructions. Who knows, maybe I can come up with a set of
 scripts to automate it all...

 Ognen



 On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:

 I have a Standalone spark cluster running in an Amazon VPC that I set up
 by hand. All I did was provision the machines from a common AMI image (my
 underlying distribution is Ubuntu), I created a sparkuser on each machine
 and I have a /home/sparkuser/spark folder where I downloaded Spark. I did
 this on the master only, ran sbt/sbt assembly, and set up the
 conf/spark-env.sh to point to the master which is an IP address (in my case
 10.10.0.200, the port is the default 7077). I also set up the slaves file
 in the same subdirectory to have all 16 ip addresses of the worker nodes
 (in my case 10.10.0.201-216). After sbt/sbt assembly was done on master, I
 then did cd ~/; tar -czf spark.tgz spark/ and I copied the resulting tgz
 file to each worker using the same sparkuser account and unpacked the
 .tgz on each slave (this will effectively replicate everything from master
 to all slaves - you can script this so you don't do it by hand).

 Your AMI should have the distribution's version of Java and git
 installed by the way.

 All you have to do then is sparkuser@spark-master
 spark/sbin/start-all.sh (for 0.9, in 0.8.1 it is spark/bin/start-all.sh)
 and it will all automagically start :)

 All my Amazon nodes come with 4x400 Gb of ephemeral space which I have
 set up into a 1.6TB RAID0 array on each node and I am pooling this into an
 HDFS filesystem which is operated by a namenode outside the spark cluster
 while all the datanodes are the same nodes as the spark workers. This
 enables replication and extremely fast access since ephemeral is much
 faster than EBS or anything else on Amazon (you can do even better with SSD
 drives on this setup but it will cost ya).

 If anyone is interested I can document our pipeline setup - I came up
 with it myself and do not have a clue as to what the industry standards are
 since I could not find any written instructions anywhere online about how
 to set up a whole data analytics pipeline from the point of ingestion to
 the point of analytics (people don't want to share their secrets? or am I
 just in the dark and incapable of using Google properly?). My requirement
 was that I wanted this to run within a VPC for added security and
 simplicity, the Amazon security groups get really old quickly. Added bonus
 is that you can use a VPN as an entry into the whole system and your
 cluster instantly becomes local to you in terms of IPs etc. I use OpenVPN
 since I don't like Cisco nor Juniper (the only two options Amazon provides
 for their VPN gateways).

 Ognen


 On 3/3/14, 1:00 PM, Bin Wang wrote:

 Hi there,

 I have a CDH cluster set up, and I tried using the Spark parcel that comes
 with Cloudera Manager, but it turned out it doesn't even have the
 run-example shell command in the bin folder. Then I removed it from the
 cluster, cloned incubator-spark onto the name node of my cluster, and built
 from source there successfully with everything as default.

 I ran a few examples and everything seems to work fine in local mode.
 Then I am thinking about scaling it to my cluster, which is what the
 

Spark Streaming Maven Build

2014-03-04 Thread Bin Wang
Hi there,

I tried the Kafka WordCount example and it works perfectly, and the code is
pretty straightforward to understand.

Can anyone show me how to start my own Maven project based on the
KafkaWordCount example with minimum effort?

1. What should the pom file look like (including jar-plugin?
assembly-plugin? ...etc)
2. mvn install, or mvn clean install, or mvn install compile assembly:single?
3. After you have a jar file, how do you execute it instead of using
bin/run-example...?


To answer those who might ask what I have done:
(Here is a derivative of the KafkaWordCount example that I have written
to do an IP-count example, where the input data from Kafka is actually a JSON
string:
https://github.com/biwa7636/binwangREPO
I have had such bad luck getting it to work. So if anyone can copy the code of
the WordCount example, build it using Maven and get it working, please
share your pom and those Maven commands - I will be very grateful!)

Best regards and let me know if you need further info.

Bin