Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-18 Thread Raghavendra Pandey
If you are running Spark in local mode, executor parameters are not used as
there is no executor. You should set the corresponding driver parameter
instead for it to take effect.
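
For example (values and flags illustrative only), the heap for a local-mode run
has to be set on the driver before the JVM starts, e.g. on the command line:

  bin/spark-submit --master local[*] --driver-memory 24g ...

or in spark-defaults.conf:

  spark.driver.memory  24g

spark.executor.memory has no effect here since the tasks run inside the driver JVM.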

On Mon, Jan 19, 2015, 00:21 Sean Owen so...@cloudera.com wrote:

 OK. Are you sure the executor has the memory you think? -Xmx24g in
 its command line? It may be that for some reason your job is reserving
 an exceptionally large amount of non-heap memory. I am not sure that's
 to be expected with the ALS job though. Even if the settings work,
 consider using the explicit command line configuration.

 On Sat, Jan 17, 2015 at 12:49 PM, Antony Mayi antonym...@yahoo.com
 wrote:
  the values are for sure applied as expected - confirmed using the spark
 UI
  environment page...
 
  it comes from my defaults configured using
  'spark.yarn.executor.memoryOverhead=8192' (yes, now increased even
 more) in
  /etc/spark/conf/spark-defaults.conf and 'export
 SPARK_EXECUTOR_MEMORY=24G'
  in /etc/spark/conf/spark-env.sh

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: running a job on ec2 Initial job has not accepted any resources

2015-01-18 Thread Akhil Das
Just make sure both versions of Spark are the same (the one from which you are
submitting the job, and the one to which you are submitting it).
Another possible cause is firewall issues if you are submitting the job from
another network/remote machine.

Thanks
Best Regards

On Sun, Jan 18, 2015 at 6:02 PM, Grzegorz Dubicki 
grzegorz.dubi...@gmail.com wrote:

 Hi mehrdad,

 I seem to have the same issue as you wrote about here. Did you manage to
 resolve it?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/running-a-job-on-ec2-Initial-job-has-not-accepted-any-resources-tp20607p21218.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Join DStream With Other Datasets

2015-01-18 Thread Ji ZHANG
Hi Sean,

Thanks for your advice, a normal 'val' will suffice. But will it be
serialized and transferred every batch and every partition? That's why
broadcast exists, right?

For now I'm going to use 'val', but I'm still looking for a broadcast-way
solution.
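
In case it helps, here is a rough sketch of the plain reload-on-interval idea
(illustrative only; the file name and interval are placeholders, and the file
must be readable on every worker). Each executor JVM caches the list in a
singleton and re-reads it at most every refreshMs:

object SpamList {
  private val refreshMs = 60 * 1000L
  @volatile private var cached: Set[String] = Set.empty
  @volatile private var lastLoad = 0L

  def get(): Set[String] = {
    val now = System.currentTimeMillis()
    if (now - lastLoad > refreshMs) synchronized {
      if (now - lastLoad > refreshMs) {
        // stand-in loader: re-read the list from a local file on each worker
        cached = scala.io.Source.fromFile("spam.txt").getLines().toSet
        lastLoad = now
      }
    }
    cached
  }
}

// used inside the streaming job, e.g.:
// val filtered = words.filter(ip => !SpamList.get().contains(ip))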


On Sun, Jan 18, 2015 at 5:36 PM, Sean Owen so...@cloudera.com wrote:

 I think that this problem is not Spark-specific since you are simply side
 loading some data into memory. Therefore you do not need an answer that
 uses Spark.

 Simply load the data and then poll for an update each time it is accessed?
 Or some reasonable interval? This is just something you write in Java/Scala.
 On Jan 17, 2015 2:06 PM, Ji ZHANG zhangj...@gmail.com wrote:

 Hi,

 I want to join a DStream with some other dataset, e.g. join a click
 stream with a spam ip list. I can think of two possible solutions, one
 is use broadcast variable, and the other is use transform operation as
 is described in the manual.

 But the problem is the spam ip list will be updated outside of the
 spark streaming program, so how can it be noticed to reload the list?

 For broadcast variables, they are immutable.

 For transform operation, is it costly to reload the RDD on every
 batch? If it is, and I use RDD.persist(), does it mean I need to
 launch a thread to regularly unpersist it so that it can get the
 updates?

 Any ideas will be appreciated. Thanks.

 --
 Jerry

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Jerry


Re: maven doesn't build dependencies with Scala 2.11

2015-01-18 Thread Ted Yu
bq. there was no 2.11 Kafka available

That's right. Adding external/kafka module resulted in:

[ERROR] Failed to execute goal on project spark-streaming-kafka_2.11: Could
not resolve dependencies for project
org.apache.spark:spark-streaming-kafka_2.11:jar:1.3.0-SNAPSHOT: Could not
find artifact org.apache.kafka:kafka_2.11:jar:0.8.0 in central (
https://repo1.maven.org/maven2) - [Help 1]

Cheers

On Sun, Jan 18, 2015 at 10:41 AM, Sean Owen so...@cloudera.com wrote:

 I could be wrong, but I thought this was on purpose. At the time it
 was set up, there was no 2.11 Kafka available? or one of its
 dependencies wouldn't work with 2.11?

 But I'm not sure what the OP means by "maven doesn't build Spark's
 dependencies" because Ted indicates it does, and of course you can see
 that these artifacts are published.

 On Sun, Jan 18, 2015 at 2:46 AM, Ted Yu yuzhih...@gmail.com wrote:
  There're 3 jars under lib_managed/jars directory with and without
  -Dscala-2.11 flag.
 
  Difference between scala-2.10 and scala-2.11 profiles is that scala-2.10
  profile has the following:
    <modules>
      <module>external/kafka</module>
    </modules>
 
  FYI
 
  On Sat, Jan 17, 2015 at 4:07 PM, Ted Yu yuzhih...@gmail.com wrote:
 
  I did the following:
   1655  dev/change-version-to-2.11.sh
   1657  mvn -DHADOOP_PROFILE=hadoop-2.4 -Pyarn,hive -Phadoop-2.4
  -Dscala-2.11 -DskipTests clean package
 
  And mvn command passed.
 
  Did you see any cross-compilation errors ?
 
  Cheers
 
  BTW the two links you mentioned are consistent in terms of building for
  Scala 2.11
 
  On Sat, Jan 17, 2015 at 3:43 PM, Walrus theCat walrusthe...@gmail.com
  wrote:
 
  Hi,
 
  When I run this:
 
  dev/change-version-to-2.11.sh
  mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
 
  as per here, maven doesn't build Spark's dependencies.
 
  Only when I run:
 
  dev/change-version-to-2.11.sh
  sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests  clean package
 
  as gathered from here, do I get Spark's dependencies built without any
  cross-compilation errors.
 
  Question:
 
  - How can I make maven do this?
 
  - How can I specify the use of Scala 2.11 in my own .pom files?
 
  Thanks
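
(For reference, not from the original thread: depending on the published 2.11
artifacts from your own pom would look roughly like the snippet below; the
version number is only an example.)

  <properties>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>1.2.0</version>
  </dependency>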
 
 
 



RE: Streaming with Java: Expected ReduceByWindow to Return JavaDStream

2015-01-18 Thread Shao, Saisai
Hi Jeff,

From my understanding this seems more like a bug: since JavaDStreamLike is meant
for Java code, returning a Scala DStream is not reasonable. You can fix this by
submitting a PR, or I can help you fix it.

Thanks
Jerry

From: Jeff Nadler [mailto:jnad...@srcginc.com]
Sent: Monday, January 19, 2015 2:04 PM
To: user@spark.apache.org
Subject: Streaming with Java: Expected ReduceByWindow to Return JavaDStream


Can anyone tell me if my expectations are sane?

I'm trying to do a reduceByWindow using the 3-arg signature (not providing an 
inverse reduce function):

JavaDStream<whatevs> reducedStream = messages.reduceByWindow((x, y) ->
reduce(x, y), Durations.seconds(5), Durations.seconds(5));

This isn't building; looks like it's returning DStream not JavaDStream.

From JavaDStreamLike.scala, looks like this sig returns DStream, the 4-arg sig 
with the inverse reduce returns JavaDStream.

def reduceByWindow(
reduceFunc: (T, T) => T,
windowDuration: Duration,
slideDuration: Duration
  ): DStream[T] = {
  dstream.reduceByWindow(reduceFunc, windowDuration, slideDuration)
}

So I'm just a noob.  Is this a bug or am I missing something?

Thanks!

Jeff Nadler




RE: SparkSQL 1.2.0 sources API error

2015-01-18 Thread Cheng, Hao
It seems a netty jar with an incompatible method signature is being picked up. Can 
you check if there are different versions of the netty jar in your classpath?
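
One quick way to see which jar the class is actually loaded from (just a sketch;
run it in spark-shell on the driver, or inside a task to check the executors):

val src = classOf[org.jboss.netty.channel.socket.nio.NioWorkerPool]
  .getProtectionDomain.getCodeSource
// prints the jar path, or a hint that the class came from the bootstrap classpath
println(if (src == null) "bootstrap/unknown" else src.getLocation.toString)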

From: Walrus theCat [mailto:walrusthe...@gmail.com]
Sent: Sunday, January 18, 2015 3:37 PM
To: user@spark.apache.org
Subject: Re: SparkSQL 1.2.0 sources API error

I'm getting this also, with Scala 2.11 and Scala 2.10:

15/01/18 07:34:51 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/01/18 07:34:51 INFO Remoting: Starting remoting
15/01/18 07:34:51 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread 
[sparkDriver-akka.remote.default-remote-dispatcher-7] shutting down ActorSystem 
[sparkDriver]
java.lang.NoSuchMethodError: 
org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V
at 
akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:283)
at 
akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:240)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:161)
at 
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at scala.util.Success.flatMap(Try.scala:200)
at 
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:692)
at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:684)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at 
akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:684)
at 
akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:492)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
Shutting down remote daemon.
15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote 
daemon shut down; proceeding with flushing remote transports.
15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
Remoting shut down.
Exception in thread main java.util.concurrent.TimeoutException: Futures timed 
out after [1 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:180)
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
  

Re: SparkSQL 1.2.0 sources API error

2015-01-18 Thread Ted Yu
NioWorkerPool(Executor workerExecutor, int workerCount) was added in netty
3.5.4
https://github.com/netty/netty/blob/netty-3.5.4.Final/src/main/java/org/jboss/netty/channel/socket/nio/NioWorkerPool.java

If there is a netty jar in the classpath older than the above release, you
would see the exception.

Cheers

On Sun, Jan 18, 2015 at 4:49 PM, Cheng, Hao hao.ch...@intel.com wrote:

  It seems a netty jar with an incompatible method signature is being picked up. Can
 you check if there are different versions of the netty jar in your classpath?



 *From:* Walrus theCat [mailto:walrusthe...@gmail.com]
 *Sent:* Sunday, January 18, 2015 3:37 PM
 *To:* user@spark.apache.org
 *Subject:* Re: SparkSQL 1.2.0 sources API error



 I'm getting this also, with Scala 2.11 and Scala 2.10:

 15/01/18 07:34:51 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/01/18 07:34:51 INFO Remoting: Starting remoting
 15/01/18 07:34:51 ERROR actor.ActorSystemImpl: Uncaught fatal error from
 thread [sparkDriver-akka.remote.default-remote-dispatcher-7] shutting down
 ActorSystem [sparkDriver]
 java.lang.NoSuchMethodError:
 org.jboss.netty.channel.socket.nio.NioWorkerPool.<init>(Ljava/util/concurrent/Executor;I)V
 at
 akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:283)
 at
 akka.remote.transport.netty.NettyTransport.<init>(NettyTransport.scala:240)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
 at scala.util.Try$.apply(Try.scala:161)
 at
 akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
 at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
 at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
 at scala.util.Success.flatMap(Try.scala:200)
 at
 akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
 at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:692)
 at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:684)
 at
 scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at
 scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
 at
 akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:684)
 at
 akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:492)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Shutting down remote daemon.
 15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Remote daemon shut down; proceeding with flushing remote transports.
 15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
 Remoting shut down.
 Exception in thread main java.util.concurrent.TimeoutException: Futures
 timed out after [1 milliseconds]
 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at akka.remote.Remoting.start(Remoting.scala:180)
 at
 akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
 at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
 at 

Re: Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.

2015-01-18 Thread Josh Rosen
This looks like a bug in the master branch of Spark, related to some recent
changes to EventLoggingListener.  You can reproduce this bug on a fresh
Spark checkout by running

./bin/spark-shell --conf spark.eventLog.enabled=true --conf
spark.eventLog.dir=/tmp/nonexistent-dir

where /tmp/nonexistent-dir is a directory that doesn't exist and /tmp
exists.

It looks like older versions of EventLoggingListener would create the
directory if it didn't exist.  I think the issue here is that the
error-checking code is overzealous and catches some non-error conditions,
too; I've filed https://issues.apache.org/jira/browse/SPARK-5311 to
investigate this.

On Sun, Jan 18, 2015 at 1:59 PM, Ganon Pierce ganon.pie...@me.com wrote:

 I posted about the Application WebUI error (specifically application WebUI
 not the master WebUI generally) and have spent at least a few hours a day
 for over week trying to resolve it so I’d be very grateful for any
 suggestions. It is quite troubling that I appear to be the only one
 encountering this issue and I’ve tried to include everything here which
 might be relevant (sorry for the length). Please see the thread “Current
 Build Gives HTTP ERROR”
 https://www.mail-archive.com/user@spark.apache.org/msg18752.html to see
 specifics about the application webUI issue and the master log.


 Environment:

 I’m doing my spark builds and application programming in scala locally on
 my macbook pro in eclipse, using modified ec2 launch scripts to launch my
 cluster, uploading my spark builds and models to s3, and uploading
 applications to and submitting them from ec2. I’m using java 8 locally and
 also installing and using java 8 on my ec2 instances (which works with
 spark 1.2.0). I have a windows machine at home (macbook is work machine),
 but have not yet attempted to launch from there.


 Errors:

 I’ve built two different recent git versions of spark both multiple times,
 and when running applications both have produced an Application WebUI error
 and this exception:

 Exception in thread main java.lang.IllegalArgumentException: Log
 directory /tmp/spark-events does not exist.

 While both will display the master webUI just fine including
 running/completed applications, registered workers etc, when I try to
 access a running or completed application’s WebUI by clicking their
 respective link, I receive a server error. When I manually create the above
 log directory, the exception goes away, but the WebUI problem does not. I
 don’t have any strong evidence, but I suspect these errors and whatever is
 causing them are related.


 Why and How of Modifications to Launch Scripts for Installation of
 Unreleased Spark Versions:

 When using a prebuilt version of spark on my cluster everything works
 except the new methods I need, which I had previously added to my custom
 version of spark and used by building the spark-assembly.jar locally and
 then replacing the assembly file produced through the 1.1.0 ec2 launch
 scripts. However, since my pull request was accepted and can now be found
 in the apache/spark repository along with some additional features I’d like
 to use and because I’d like a more elegant permanent solution for launching
 a cluster and installing unreleased versions of spark to my ec2 clusters,
 I’ve modified the included ec2 launch scripts in this way (credit to gen
 tang here:
 https://www.mail-archive.com/user%40spark.apache.org/msg18761.html
 https://www.mail-archive.com/user@spark.apache.org/msg18761.html):

 1. Clone the most recent git version of spark
 2. Use the make-dist script
 3. Tar the dist folder and upload the resulting
 spark-1.3.0-snapshot-hadoop1.tgz to s3 and change file permissions
 4. Fork the mesos/spark-ec2 repository and modify the spark/init.sh script
 to do a wget of my hosted distribution instead of spark’s stable release
 5. Modify my spark_ec2.py script to point to my repository.
 6. Modify my spark_ec2.py script to install java 8 on my ec2 instances.
 (This works and does not produce the above stated errors when using a
 stable release like 1.2.0).


 Additional Possibly Related Info:

 As far as I can tell (I went through line by line), when I launch my
 recent build vs when I launch the most recent stable release the console
 prints almost identical INFO and WARNINGS except where you would expect
 things to be different e.g. version numbers. I’ve noted that after launch
 the prebuilt stable version does not have a /tmp/spark-events directory,
 but it is created when the application is launched, while it is never
 created in my build. Further, in my unreleased builds the application logs
 that I find are always stored as .inprogress files (when I set the logging
 directory to /root/ or add the /tmp/spark-events directory manually) even
 after completion, which I believe is supposed to change to .completed (or
 something similar) when the application finishes.


 Thanks for any help!




Re: No Output

2015-01-18 Thread Deep Pradhan
The error in the log file says:

*java.lang.OutOfMemoryError: GC overhead limit exceeded*

with certain task ID and the error repeats for further task IDs.

What could be the problem?

On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Updating the Spark version means setting up the entire cluster once more?
 Or can we update it in some other way?

 On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you paste the code? Also you can try updating your spark version.

 Thanks
 Best Regards

 On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I am using Spark-1.0.0 in a single node cluster. When I run a job with
 small data set it runs perfectly but when I use a data set of 350 KB, no
 output is being produced and when I try to run it the second time it is
 giving me an exception telling that SparkContext was shut down.
 Can anyone help me on this?

 Thank You






Re: running a job on ec2 Initial job has not accepted any resources

2015-01-18 Thread Grzegorz Dubicki
Hi mehrdad,

I seem to have the same issue as you wrote about here. Did you manage to
resolve it?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/running-a-job-on-ec2-Initial-job-has-not-accepted-any-resources-tp20607p21218.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Avoid broacasting huge variables

2015-01-18 Thread octavian.ganea
The singleton hack works very differently in Spark 1.2.0 (it does not work if
the program has multiple map-reduce jobs in the same program). I guess there
should be official documentation on how to have each machine/node do an
init step locally before executing any other instructions (e.g. loading
locally a very big object once at the beginning that can be used in all
further map jobs that will be assigned to that worker).



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Avoid-broacasting-huge-variables-tp14696p21220.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Directory / File Reading Patterns

2015-01-18 Thread Bob Tiernay
Also, I used the following pattern to extract information from a file path and 
add it to the output of a transformation:
https://gist.github.com/btiernay/1ad5e3dea08904fe07d9
You may find it useful as well.





Cheers,
Bob

From: btier...@hotmail.com
To: so...@cloudera.com; snu...@hortonworks.com
CC: user@spark.apache.org
Subject: RE: Directory / File Reading Patterns
Date: Sun, 18 Jan 2015 15:41:53 +




You may also want to keep an eye on SPARK-5182 / SPARK-5302 which may help if 
you are using Spark SQL. It should be noted that this is possible with 
HiveContext today.





Cheers,
Bob

Date: Sun, 18 Jan 2015 08:47:06 +
Subject: Re: Directory / File Reading Patterns
From: so...@cloudera.com
To: snu...@hortonworks.com
CC: user@spark.apache.org

I think that putting part of the data (only) in a filename is an anti-pattern, 
but we sometimes have to play these where they lie.
You can list all the directory paths containing the CSV files, map them each to 
RDDs with textFile, transform the RDDs to include info from the path, and then 
simply union them. 
This should be pretty fine performance wise even. 
On Jan 17, 2015 11:48 PM, Steve Nunez snu...@hortonworks.com wrote:





Hello Users,



I’ve got a real-world use case that seems common enough that its pattern would 
be documented somewhere, but I can’t find any references to a simple solution. 
The challenge is that data is getting dumped into a directory structure, and 
that directory structure
 itself contains features that I need in my model. For example:



bank_code
Trader
Day-1.csv
Day-2.csv
…



Each CSV file contains a list of all the trades made by that individual each 
day. The problem is that the bank & trader should be part of the feature set. 
I.e. we need the RDD to look like:
(bank, trader, day, list-of-trades)



Anyone got any elegant solutions for doing this?



Cheers,
- SteveN















  

Re: Reducer memory exceeded

2015-01-18 Thread Sean Owen
I think the problem is that you have a single object that is larger than
2GB and so fails to serialize to a byte array. I think it is best not to
design it this way, as you can't parallelize the combining of maps. You could go
all the way to emitting key-value pairs and using reduceByKey. There are solutions
in between these two as well.
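
A rough sketch of the key-value reformulation (illustrative only; the names and
types are stand-ins for whatever the job really uses):

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.2
import org.apache.spark.rdd.RDD

// Instead of building one huge TLongDoubleHashMap per partition and merging the
// maps (which pushes a single multi-GB object through the serializer), emit
// (key, value) pairs and let Spark combine them per key across the cluster.
def sumWeights(updates: RDD[(Long, Double)]): RDD[(Long, Double)] =
  updates.reduceByKey(_ + _)

// the driver then only pulls back what it needs, e.g.
// val sumW = sumWeights(updates).collectAsMap()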
On Jan 18, 2015 1:47 PM, octavian.ganea octavian.ga...@inf.ethz.ch
wrote:

 Hi,

 Please help me with this problem. I would really appreciate your help !

 I am using spark 1.2.0. I have a map-reduce job written in spark in the
 following way:

 val sumW = splittedTrainingDataRDD.map(localTrainingData => LocalSGD(w,
 localTrainingData, numeratorCtEta, numitorCtEta, regularizer,
 0.2)).reduce((w1,w2) => {w1.add(w2); w2.clear; w1})

 Here, w is a Trove TLongDoubleHashMap containing no more than 50 million
 elements (~15 GB in RAM). w1.add(w2) adds up the values of the same key,
 for each key of both maps.

 My initial configuration is:
 conf.set("spark.cores.max", "16")
 conf.set("spark.akka.frameSize", "10")
 conf.set("spark.executor.memory", "120g")
 conf.set("spark.reducer.maxMbInFlight", "10")
 conf.set("spark.storage.memoryFraction", "0.9")
 conf.set("spark.shuffle.file.buffer.kb", "1000")
 conf.set("spark.broadcast.factory",
 "org.apache.spark.broadcast.HttpBroadcastFactory")
 conf.set("spark.driver.maxResultSize", "120g")
 val sc = new SparkContext(conf)

 I am running this on a cluster with 8 machines, each machine has 16 cores
 and 130 GB RAM.

 My spark-env.sh contains:
  ulimit -n 20
  SPARK_JAVA_OPTS="-Xms120G -Xmx120G -XX:-UseGCOverheadLimit
 -XX:-UseCompressedOops"
  SPARK_DRIVER_MEMORY=120G


 The error I get is at the reducer above (the reducer above is in file
 called
 Learning.scala, line 313):

 15/01/18 14:35:52 ERROR ActorSystemImpl: Uncaught fatal error from thread
 [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem
 [sparkDriver]
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at
 java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at
 java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at

 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
 at

 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
 at
 java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at

 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at

 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:836)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
 at

 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 15/01/18 14:35:52 INFO DAGScheduler: Job 2 failed: reduce at
 Learning.scala:313, took 54.657239 s
 Exception in thread main org.apache.spark.SparkException: Job cancelled
 because SparkContext was shut down
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
 at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
 at

 org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
 at

 

RE: Directory / File Reading Patterns

2015-01-18 Thread Bob Tiernay
You may also want to keep an eye on SPARK-5182 / SPARK-5302 which may help if 
you are using Spark SQL. It should be noted that this is possible with 
HiveContext today.
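
As a rough sketch of the HiveContext route (names, schema and paths are all
assumptions; the day is still only in the file name, so it isn't covered here):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc from spark-shell or the application

// one string column per CSV line; bank/trader become partition columns
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS trades (trade_line STRING)
  PARTITIONED BY (bank STRING, trader STRING)
""")

// the layout is not the bank=.../trader=... form Hive auto-discovers, so each
// directory has to be registered explicitly:
hiveContext.sql("""
  ALTER TABLE trades ADD IF NOT EXISTS PARTITION (bank='bank1', trader='alice')
  LOCATION '/data/bank1/alice'
""")

// partition columns come back as ordinary columns in the result
val withKeys = hiveContext.sql("SELECT bank, trader, trade_line FROM trades")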





Cheers,
Bob

Date: Sun, 18 Jan 2015 08:47:06 +
Subject: Re: Directory / File Reading Patterns
From: so...@cloudera.com
To: snu...@hortonworks.com
CC: user@spark.apache.org

I think that putting part of the data (only) in a filename is an anti-pattern, 
but we sometimes have to play these where they lie.
You can list all the directory paths containing the CSV files, map them each to 
RDDs with textFile, transform the RDDs to include info from the path, and then 
simply union them. 
This should be pretty fine performance wise even. 
On Jan 17, 2015 11:48 PM, Steve Nunez snu...@hortonworks.com wrote:





Hello Users,



I’ve got a real-world use case that seems common enough that its pattern would 
be documented somewhere, but I can’t find any references to a simple solution. 
The challenge is that data is getting dumped into a directory structure, and 
that directory structure
 itself contains features that I need in my model. For example:



bank_code
Trader
Day-1.csv
Day-2.csv
…



Each CSV file contains a list of all the trades made by that individual each 
day. The problem is that the bank & trader should be part of the feature set. 
I.e. we need the RDD to look like:
(bank, trader, day, list-of-trades)



Anyone got any elegant solutions for doing this?



Cheers,
- SteveN














  

Reducer memory exceeded

2015-01-18 Thread octavian.ganea
Hi,

Please help me with this problem. I would really appreciate your help !

I am using spark 1.2.0. I have a map-reduce job written in spark in the
following way:

val sumW = splittedTrainingDataRDD.map(localTrainingData => LocalSGD(w,
localTrainingData, numeratorCtEta, numitorCtEta, regularizer,
0.2)).reduce((w1,w2) => {w1.add(w2); w2.clear; w1})

Here, w is a Trove TLongDoubleHashMap containing no more than 50 million
elements (~15 GB in RAM). w1.add(w2) adds up the values of the same key,
for each key of both maps.

My initial configuration is:
conf.set("spark.cores.max", "16")
conf.set("spark.akka.frameSize", "10")
conf.set("spark.executor.memory", "120g")
conf.set("spark.reducer.maxMbInFlight", "10")
conf.set("spark.storage.memoryFraction", "0.9")
conf.set("spark.shuffle.file.buffer.kb", "1000")
conf.set("spark.broadcast.factory",
"org.apache.spark.broadcast.HttpBroadcastFactory")
conf.set("spark.driver.maxResultSize", "120g")
val sc = new SparkContext(conf)

I am running this on a cluster with 8 machines, each machine has 16 cores
and 130 GB RAM.

My spark-env.sh contains:
 ulimit -n 20
 SPARK_JAVA_OPTS="-Xms120G -Xmx120G -XX:-UseGCOverheadLimit
-XX:-UseCompressedOops"
 SPARK_DRIVER_MEMORY=120G


The error I get is at the reducer above (the reducer above is in file called
Learning.scala, line 313):

15/01/18 14:35:52 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem
[sparkDriver]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:836)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
at
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/01/18 14:35:52 INFO DAGScheduler: Job 2 failed: reduce at
Learning.scala:313, took 54.657239 s
Exception in thread main org.apache.spark.SparkException: Job cancelled
because SparkContext was shut down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:701)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1428)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundPostStop(DAGScheduler.scala:1375)
at
akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at
akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at 

ExceptionInInitializerError when using a class defined in REPL

2015-01-18 Thread Kevin (Sangwoo) Kim
Hi experts,

I'm getting ExceptionInInitializerError when using a class defined in REPL.
Code is something like this:

case class TEST(a: String)
sc.textFile("~~~").map(TEST(_)).count

The code above used to work well until yesterday, but suddenly for some
reason it fails with the error below.
Confirmed it still works in local mode.

I've been getting a headache working on this problem the whole weekend.
Any ideas?

environment:
aws ec2, s3
spark v1.1.1, hadoop 2.2

Attaching error logs:
===

15/01/18 13:54:22 INFO TaskSetManager: Lost task 0.19 in stage 0.0 (TID 19)
on executor ip-172-16-186-181.ap-northeast-1.compute.internal:
java.lang.ExceptionInInitializerError (null) [duplicate 5]
15/01/18 13:54:22 ERROR TaskSetManager: Task 0 in stage 0.0 failed 20
times; aborting job
15/01/18 13:54:22 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
15/01/18 13:54:22 INFO TaskSchedulerImpl: Cancelling stage 0
15/01/18 13:54:22 INFO DAGScheduler: Failed to run first at console:45
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 20 times, most recent failure: Lost task 0.19 in stage
0.0 (TID 19, ip-172-16-186-181.ap-northeast-1.compute.internal):
java.lang.ExceptionInInitializerError:
$iwC.<init>(<console>:6)
<init>(<console>:35)
.<init>(<console>:39)
.<clinit>(<console>)

$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DailyStatsChartBuilder$.fromCsv(console:41)

$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:43)

$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:43)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1079)
org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1079)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Trying to find where Spark persists RDDs when run with YARN

2015-01-18 Thread Hemanth Yamijala
Hi,

I am trying to find where Spark persists RDDs when we call the persist()
api and executed under YARN. This is purely for understanding...

In my driver program, I wait indefinitely, so as to avoid any clean up
problems.

In the actual job, I roughly do the following:

JavaRDD<String> lines = context.textFile(args[0]);
lines.persist(StorageLevel.DISK_ONLY());
lines.collect();

When run with local executor, I can see that the files (like rdd_1_0) are
persisted under directories like
/var/folders/mt/51srrjc15wl3n829qkgnh2dmgp/T/spark-local-20150118201458-6147/15.

Where similarly can I find these under Yarn ?

Thanks
hemanth


Re: No Output

2015-01-18 Thread Deep Pradhan
Updating the Spark version means setting up the entire cluster once more?
Or can we update it in some other way?

On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you paste the code? Also you can try updating your spark version.

 Thanks
 Best Regards

 On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
 I am using Spark-1.0.0 in a single node cluster. When I run a job with
 small data set it runs perfectly but when I use a data set of 350 KB, no
 output is being produced and when I try to run it the second time it is
 giving me an exception telling that SparkContext was shut down.
 Can anyone help me on this?

 Thank You





How to create distributed matrixes from hive tables.

2015-01-18 Thread guxiaobo1982
Hi,


We have large datasets in the format of a Spark MLlib matrix, but they are 
pre-computed by Hive and stored inside Hive. My question is: can we create a 
distributed matrix such as IndexedRowMatrix directly from Hive tables, 
avoiding reading data from the Hive tables and feeding it into an empty matrix?


Regards
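
One possible shape for this (a sketch only; the table name, column layout and
row-index encoding are assumptions):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc = existing SparkContext

// assumed table: matrix_rows(row_id BIGINT, c0 DOUBLE, c1 DOUBLE, ..., cN DOUBLE)
val rows = hiveContext.sql("SELECT * FROM matrix_rows").map { row =>
  val values = (1 until row.length).map(row.getDouble).toArray
  IndexedRow(row.getLong(0), Vectors.dense(values))
}

// the matrix is built from the RDD without collecting anything to the driver
val matrix = new IndexedRowMatrix(rows)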

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-18 Thread octavian.ganea
The singleton hack works very differently in Spark 1.2.0 (it does not work if
the program has multiple map-reduce jobs in the same program). I guess there
should be official documentation on how to have each machine/node do an
init step locally before executing any other instructions (e.g. loading
locally a very big object once at the beginning that can be used in all
further map jobs that will be assigned to that worker).



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tp11048p21219.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Avoid broacasting huge variables

2015-01-18 Thread Sean Owen
Why do you say it does not work? The singleton pattern works the same as
ever. It is not a pattern that involves Spark.
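
For reference, the pattern in question is roughly this (a sketch only; the
loader, path and element type are placeholders, and the file must exist on
every node):

object BigResource {
  // initialized lazily, once per executor JVM, the first time a task touches it
  lazy val model: Array[Double] =
    scala.io.Source.fromFile("/local/path/on/each/node/model.txt")
      .getLines().map(_.toDouble).toArray
}

// inside any map/filter closure, across any number of jobs in the same application:
// rdd.map(x => x * BigResource.model(0))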
On Jan 18, 2015 12:57 PM, octavian.ganea octavian.ga...@inf.ethz.ch
wrote:

 The singleton hack works very differently in Spark 1.2.0 (it does not work if
 the program has multiple map-reduce jobs in the same program). I guess
 there
 should be official documentation on how to have each machine/node do an
 init step locally before executing any other instructions (e.g. loading
 locally a very big object once at the beginning that can be used in all
 further map jobs that will be assigned to that worker).



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Avoid-broacasting-huge-variables-tp14696p21220.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Join DStream With Other Datasets

2015-01-18 Thread Sean Owen
I think that this problem is not Spark-specific since you are simply side
loading some data into memory. Therefore you do not need an answer that
uses Spark.

Simply load the data and then poll for an update each time it is accessed?
Or some reasonable interval? This is just something you write in Java/Scala.
On Jan 17, 2015 2:06 PM, Ji ZHANG zhangj...@gmail.com wrote:

 Hi,

 I want to join a DStream with some other dataset, e.g. join a click
 stream with a spam ip list. I can think of two possible solutions, one
 is use broadcast variable, and the other is use transform operation as
 is described in the manual.

 But the problem is the spam ip list will be updated outside of the
 spark streaming program, so how can it be noticed to reload the list?

 For broadcast variables, they are immutable.

 For transform operation, is it costly to reload the RDD on every
 batch? If it is, and I use RDD.persist(), does it mean I need to
 launch a thread to regularly unpersist it so that it can get the
 updates?

 Any ideas will be appreciated. Thanks.

 --
 Jerry

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Join DStream With Other Datasets

2015-01-18 Thread Ji ZHANG
Hi,

After some experiments, there're three methods that work in this 'join
DStream with other dataset which is updated periodically'.

1. Create an RDD in transform operation

val words = ssc.socketTextStream("localhost", ).flatMap(_.split("_"))
val filtered = words transform { rdd =>
  val spam = ssc.sparkContext.textFile("spam.txt").collect.toSet
  rdd.filter(!spam(_))
}

The caveat is 'spam.txt' will be read in every batch.

2. Use a var for the broadcast variable...

  var bc = ssc.sparkContext.broadcast(getSpam)
  val filtered = words.filter(!bc.value(_))

  val pool = Executors.newSingleThreadScheduledExecutor
  pool.scheduleAtFixedRate(new Runnable {
def run(): Unit = {
  val obc = bc
  bc = ssc.sparkContext.broadcast(getSpam)
  obc.unpersist
}
  }, 0, 5, TimeUnit.SECONDS)

I'm surprised to come up with this solution, but I don't like var, and
the unpersist thing looks evil.

3. Use accumulator

val spam = ssc.sparkContext.accumulableCollection(getSpam.to[mutable.HashSet])
val filtered = words.filter(!spam.value(_))

def run(): Unit = {
  spam.setValue(getSpam.to[mutable.HashSet])
}

Now it looks less ugly...

Anyway, I still hope there's a better solution.

On Sun, Jan 18, 2015 at 2:12 AM, Jörn Franke jornfra...@gmail.com wrote:
 Can't you send a special event through spark streaming once the list is
 updated? So you have your normal events and a special reload event

 Le 17 janv. 2015 15:06, Ji ZHANG zhangj...@gmail.com a écrit :

 Hi,

 I want to join a DStream with some other dataset, e.g. join a click
 stream with a spam ip list. I can think of two possible solutions, one
 is use broadcast variable, and the other is use transform operation as
 is described in the manual.

 But the problem is the spam ip list will be updated outside of the
 spark streaming program, so how can it be noticed to reload the list?

 For broadcast variables, they are immutable.

 For transform operation, is it costly to reload the RDD on every
 batch? If it is, and I use RDD.persist(), does it mean I need to
 launch a thread to regularly unpersist it so that it can get the
 updates?

 Any ideas will be appreciated. Thanks.

 --
 Jerry

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Directory / File Reading Patterns

2015-01-18 Thread Sean Owen
I think that putting part of the data (only) in a filename is an
anti-pattern, but we sometimes have to play these where they lie.

You can list all the directory paths containing the CSV files, map them
each to RDDs with textFile, transform the RDDs to include info from the
path, and then simply union them.

This should be pretty fine performance wise even.
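
A minimal sketch of what that could look like (paths hardcoded for illustration;
in practice they would come from a FileSystem listing):

import org.apache.spark.{SparkConf, SparkContext}

object TradesByPath {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("trades-by-path"))

    val files = Seq(
      "/data/bank1/alice/Day-1.csv",
      "/data/bank1/alice/Day-2.csv",
      "/data/bank2/bob/Day-1.csv")

    // one small RDD per file, tagged with the features taken from its path
    val perFile = files.map { path =>
      val parts = path.split("/")
      val (bank, trader) = (parts(parts.length - 3), parts(parts.length - 2))
      val day = parts.last.stripSuffix(".csv")
      sc.textFile(path).map(trade => (bank, trader, day, trade))
    }

    // ...then simply union them into one RDD[(bank, trader, day, trade-line)]
    val trades = sc.union(perFile)
    trades.take(5).foreach(println)
  }
}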
On Jan 17, 2015 11:48 PM, Steve Nunez snu...@hortonworks.com wrote:

  Hello Users,

  I’ve got a real-world use case that seems common enough that its pattern
 would be documented somewhere, but I can’t find any references to a simple
 solution. The challenge is that data is getting dumped into a directory
 structure, and that directory structure itself contains features that I
 need in my model. For example:

  bank_code
 Trader
 Day-1.csv
 Day-2.csv
 …

 Each CSV file contains a list of all the trades made by that individual
 each day. The problem is that the bank & trader should be part of the
 feature set. I.e. we need the RDD to look like:
 (bank, trader, day, list-of-trades)

  Anyone got any elegant solutions for doing this?

  Cheers,
 - SteveN







Re: No Output

2015-01-18 Thread Akhil Das
You can try increasing the parallelism. Can you be more specific about the
task that you are doing? Maybe pasting the piece of code would help.
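
For example (a sketch only; the numbers are placeholders, not recommendations):

// sc = the job's SparkContext
// read the input with more partitions than the default...
val data = sc.textFile("input.txt", 64)

// ...or spread an existing RDD over more partitions
val spread = data.repartition(64)

// spark.default.parallelism can also be raised when building the context:
// new SparkConf().set("spark.default.parallelism", "64")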
On 18 Jan 2015 13:22, Deep Pradhan pradhandeep1...@gmail.com wrote:

 The error in the log file says:

 *java.lang.OutOfMemoryError: GC overhead limit exceeded*

 with certain task ID and the error repeats for further task IDs.

 What could be the problem?

 On Sun, Jan 18, 2015 at 2:45 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Updating the Spark version means setting up the entire cluster once more?
 Or can we update it in some other way?

 On Sat, Jan 17, 2015 at 3:22 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you paste the code? Also you can try updating your spark version.

 Thanks
 Best Regards

 On Sat, Jan 17, 2015 at 2:40 PM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 Hi,
 I am using Spark-1.0.0 in a single node cluster. When I run a job with
 small data set it runs perfectly but when I use a data set of 350 KB, no
 output is being produced and when I try to run it the second time it is
 giving me an exception telling that SparkContext was shut down.
 Can anyone help me on this?

 Thank You







Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-18 Thread Sean Owen
OK. Are you sure the executor has the memory you think? -Xmx24g in
its command line? It may be that for some reason your job is reserving
an exceptionally large amount of non-heap memory. I am not sure that's
to be expected with the ALS job though. Even if the settings work,
consider using the explicit command line configuration.
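
For example, something along these lines on the spark-submit command line makes
the values explicit and easy to verify (flags illustrative, values from the thread):

  spark-submit --master yarn-cluster \
    --executor-memory 24g \
    --conf spark.yarn.executor.memoryOverhead=8192 \
    ... your ALS application ...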

On Sat, Jan 17, 2015 at 12:49 PM, Antony Mayi antonym...@yahoo.com wrote:
 the values are for sure applied as expected - confirmed using the spark UI
 environment page...

 it comes from my defaults configured using
 'spark.yarn.executor.memoryOverhead=8192' (yes, now increased even more) in
 /etc/spark/conf/spark-defaults.conf and 'export SPARK_EXECUTOR_MEMORY=24G'
 in /etc/spark/conf/spark-env.sh

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Maven out of memory error

2015-01-18 Thread Sean Owen
Oh: are you running the tests with a different profile setting than
what the last assembly was built with? This particular test depends on
those matching. Not 100% sure that's the problem, but it's a good guess.

On Sat, Jan 17, 2015 at 4:54 PM, Ted Yu yuzhih...@gmail.com wrote:
 The test passed here:

 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/1215/consoleFull

 It passed locally with the following command:

 mvn -DHADOOP_PROFILE=hadoop-2.4 -Phadoop-2.4 -Pyarn -Phive test
 -Dtest=JavaAPISuite

 FYI

 On Sat, Jan 17, 2015 at 8:23 AM, Andrew Musselman
 andrew.mussel...@gmail.com wrote:

 Failing for me and another team member on the command line, for what it's
 worth.

  On Jan 17, 2015, at 2:39 AM, Sean Owen so...@cloudera.com wrote:
 
  Hm, this test hangs for me in IntelliJ. It could be a real problem,
  and a combination of a) just recently actually enabling Java tests, b)
  recent updates to the complicated Guava shading situation.
 
  The manifestation of the error usually suggests that something totally
  failed to start (because of, say, class incompatibility errors, etc.)
  Thus things hang and time out waiting for the dead component. It's
  sometimes hard to get answers from the embedded component that dies
  though.
 
  That said, it seems to pass on the command line. For example my recent
  Jenkins job shows it passes:
 
  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25682/consoleFull
 
  I'll try to uncover more later this weekend. Thoughts welcome though.
 
  On Fri, Jan 16, 2015 at 8:26 PM, Andrew Musselman
  andrew.mussel...@gmail.com wrote:
  Thanks Ted, got farther along but now have a failing test; is this a
  known
  issue?
 
  ---
  T E S T S
  ---
  Running org.apache.spark.JavaAPISuite
  Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
  123.462 sec
   FAILURE! - in org.apache.spark.JavaAPISuite
  testGuavaOptional(org.apache.spark.JavaAPISuite)  Time elapsed: 106.5
  sec
   ERROR!
  org.apache.spark.SparkException: Job aborted due to stage failure:
  Master
  removed our application: FAILED
 at
 
  org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199)
 at
 
  org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188)
 at
 
  org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187)
 at
 
  scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
  scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 
  org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187)
 at
 
  org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
 at
 
  org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
 at scala.Option.foreach(Option.scala:236)
 at
 
  org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
 at
 
  org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at
 
  org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at
 
  akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at
  scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 
  scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 
  scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 
  scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 
  Running org.apache.spark.JavaJdbcRDDSuite
  Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846
  sec -
  in org.apache.spark.JavaJdbcRDDSuite
 
  Results :
 
 
  Tests in error:
   JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage
  failure:
  Maste...
 
  On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com wrote:
 
  Can you try doing this before running mvn ?
 
   export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
 
  What OS are you using ?
 
  Cheers
 
  On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman
  andrew.mussel...@gmail.com wrote:
 
  Just got the latest from Github and tried running `mvn test`; is this
  error common and do you have any advice on fixing it?
 
  

Re: maven doesn't build dependencies with Scala 2.11

2015-01-18 Thread Sean Owen
I could be wrong, but I thought this was on purpose. At the time it
was set up, there was no 2.11 Kafka available, or one of its
dependencies wouldn't work with 2.11?

But I'm not sure what the OP means by "maven doesn't build Spark's
dependencies", because Ted indicates it does, and of course you can see
that these artifacts are published.
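
For the second question in the original post, a minimal sketch of depending on
the published 2.11 artifacts from your own pom (the Spark version shown is only
an example; use whatever release you target):

  <properties>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>1.2.0</version>
  </dependency>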

On Sun, Jan 18, 2015 at 2:46 AM, Ted Yu yuzhih...@gmail.com wrote:
 There're 3 jars under lib_managed/jars directory with and without
 -Dscala-2.11 flag.

 Difference between scala-2.10 and scala-2.11 profiles is that scala-2.10
 profile has the following:
    <modules>
      <module>external/kafka</module>
    </modules>

 FYI

 On Sat, Jan 17, 2015 at 4:07 PM, Ted Yu yuzhih...@gmail.com wrote:

 I did the following:
  1655  dev/change-version-to-2.11.sh
  1657  mvn -DHADOOP_PROFILE=hadoop-2.4 -Pyarn,hive -Phadoop-2.4
 -Dscala-2.11 -DskipTests clean package

 And mvn command passed.

 Did you see any cross-compilation errors ?

 Cheers

 BTW the two links you mentioned are consistent in terms of building for
 Scala 2.11

 On Sat, Jan 17, 2015 at 3:43 PM, Walrus theCat walrusthe...@gmail.com
 wrote:

 Hi,

 When I run this:

 dev/change-version-to-2.11.sh
 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

 as per here, maven doesn't build Spark's dependencies.

 Only when I run:

 dev/change-version-to-2.11.sh
 sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests  clean package

 as gathered from here, do I get Spark's dependencies built without any
 cross-compilation errors.

 Question:

 - How can I make maven do this?

 - How can I specify the use of Scala 2.11 in my own .pom files?

 Thanks




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-18 Thread Nicholas Chammas
Nathan,

I posted a bunch of questions for you as a comment on your question
http://stackoverflow.com/q/28002443/877069 on Stack Overflow. If you
answer them (don't forget to @ping me) I may be able to help you.

Nick

On Sat Jan 17 2015 at 3:49:54 PM gen tang gen.tan...@gmail.com wrote:

 Hi,

  This is because 'ssh-ready' in the ec2 script means that all the instances
  are in the 'running' state and all of their status checks are OK. In other
  words, the instances are ready to download and install software, just as EMR
  is ready for bootstrap actions.
  Before, the script repeatedly printed information showing that we were
  waiting for every instance to launch, which was quite ugly, so the output was
  changed to the single waiting message you see.
  However, you can use ssh to connect to an instance even while it is still
  pending. If you wait patiently a little longer, the script will finish
  launching the cluster.
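
  A rough way to check the same thing by hand, assuming the AWS CLI is
  installed and configured (the region and instance id are placeholders):

    aws ec2 describe-instance-status --region us-west-2 --instance-ids i-xxxxxxxx

  'ssh-ready' corresponds to the instance state being 'running' and both
  status checks reporting 'ok'.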

 Cheers
 Gen


 On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy nathan.mur...@gmail.com
 wrote:

 Originally posted here:
 http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script

 I'm trying to launch a standalone Spark cluster using its pre-packaged
 EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

 ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k
 key-pair -i identity-file.pem -r us-west-2 -s 3 launch test
 Setting up security groups...
 Searching for existing cluster test...
 Spark AMI: ami-ae6e0d9e
 Launching instances...
 Launched 3 slaves in us-west-2c, regid = r-b___6
 Launched master in us-west-2c, regid = r-0__0
 Waiting for all instances in cluster to enter 'ssh-ready'
 state..

  Yet I can SSH into these instances without complaint:

 ubuntu@machine:~$ ssh -i identity-file.pem root@master-ip
 Last login: Day MMM DD HH:mm:ss 20YY from
 c-AA-BBB--DDD.eee1.ff.provider.net

__|  __|_  )
_|  ( /   Amazon Linux AMI
   ___|\___|___|

 https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
 There are 59 security update(s) out of 257 total update(s) available
 Run sudo yum update to apply all updates.
 Amazon Linux version 2014.09 is available.
  [root@ip-internal ~]$

 I'm trying to figure out if this is a problem in AWS or with the Spark
  scripts. I've never had this issue until recently.


 --
 Nathan Murthy // 713.884.7110 (mobile) // @natemurthy





R: Spark Streaming with Kafka

2015-01-18 Thread Eduardo Alfaia
I have the same issue.

- Original message -
From: Rasika Pohankar rasikapohan...@gmail.com
Sent: 18/01/2015 18:48
To: user@spark.apache.org
Subject: Spark Streaming with Kafka

I am using Spark Streaming to process data received through Kafka. The Spark 
version is 1.2.0. I have written the code in Java and am compiling it using 
sbt. The program runs, receives data from Kafka, and processes it as well. 
But it suddenly stops receiving data after some time (so far it has run for 
about an hour receiving data from Kafka, and then it always stops receiving). 
The program continues to run; it only stops receiving data. After a while it 
sometimes starts receiving again and sometimes doesn't, so I stop the program 
and start it again.

Earlier I was using Spark 1.0.0. I upgraded to check whether the problem was in 
that version, but it still happens after upgrading.


Is this a known issue? Can someone please help.


Thanking you.







View this message in context: Spark Streaming with Kafka
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-- 
Privacy notice: http://www.unibs.it/node/8155


Re: Trying to find where Spark persists RDDs when run with YARN

2015-01-18 Thread Sean Owen
These will be under the working directory of the YARN container
running the executor. I don't have it handy but think it will also be
a spark-local or similar directory.
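
A rough way to look for them on a NodeManager host; the path below is only a
placeholder for whatever yarn.nodemanager.local-dirs is set to in yarn-site.xml:

  # NM_LOCAL is whatever yarn.nodemanager.local-dirs points to on this node
  NM_LOCAL=/tmp/hadoop-yarn/nm-local-dir
  # executor scratch space is created under the application's container dirs
  find "$NM_LOCAL/usercache" -type d -name 'spark-local-*'
  # persisted partitions show up inside those as files named like rdd_1_0
  find "$NM_LOCAL/usercache" -name 'rdd_*'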

On Sun, Jan 18, 2015 at 2:50 PM, Hemanth Yamijala yhema...@gmail.com wrote:
 Hi,

 I am trying to find where Spark persists RDDs when we call the persist() API
 and run under YARN. This is purely for understanding...

 In my driver program, I wait indefinitely, so as to avoid any clean up
 problems.

 In the actual job, I roughly do the following:

 JavaRDD<String> lines = context.textFile(args[0]);
 lines.persist(StorageLevel.DISK_ONLY());
 lines.collect();

 When run with local executor, I can see that the files (like rdd_1_0) are
 persisted under directories like
 /var/folders/mt/51srrjc15wl3n829qkgnh2dmgp/T/spark-local-20150118201458-6147/15.

 Where, similarly, can I find these under YARN?

 Thanks
 hemanth

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Row similarities

2015-01-18 Thread Pat Ferrel
Right, done with matrix blocks. Seems like a lot of duplicate effort, but 
that’s the way of OSS sometimes. 

I didn’t see transpose in the Jira. Are there plans for transpose and 
rowSimilarity without transpose? The latter seems easier than columnSimilarity 
in the general/naive case. Thresholds could also be used there.

A threshold for downsampling is going to be extremely hard to use in practice, 
but I’m not sure what the threshold applies to, so I must have skimmed too fast. If the 
threshold were some number of sigmas it would be more usable and could extend 
to non-cosine similarities. Would add a non-trivial step to the calc, of 
course. But without it I’ve never seen a threshold actually used effectively 
and I’ve tried. It’s so dependent on the dataset.

Observation as a question: One of the reasons I use the Mahout Spark R-like DSL 
is for row and column similarity (used in cooccurrence recommenders), and you 
can count on a robust full linear algebra implementation. Once you have full 
linear algebra on top of an optimizer many of the factorization methods are 
dead simple. Seems like MLlib has chosen to do optimized higher level 
algorithms first instead of full linear algebra first?
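
In the meantime, a minimal sketch of getting row similarities by hand-transposing a 
small RowMatrix and calling columnSimilarities (spark-shell style, assuming an 
existing SparkContext sc; the data and the 0.1 DIMSUM threshold are placeholders):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val mat = new RowMatrix(sc.parallelize(Seq(
    Vectors.dense(1.0, 0.0, 2.0),
    Vectors.dense(0.0, 3.0, 4.0))))

  // Collect and transpose locally -- only reasonable for small matrices.
  val rows = mat.rows.collect()
  val nCols = mat.numCols().toInt
  val transposedRows = (0 until nCols).map(j => Vectors.dense(rows.map(_(j))))
  val transposed = new RowMatrix(sc.parallelize(transposedRows))

  // Column similarities of the transpose are row similarities of the original;
  // the argument is the DIMSUM sampling threshold.
  val rowSims = transposed.columnSimilarities(0.1)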

On Jan 17, 2015, at 6:27 PM, Reza Zadeh r...@databricks.com wrote:

We're focused on providing block matrices, which makes transposition simple: 
https://issues.apache.org/jira/browse/SPARK-3434

On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel p...@occamsmachete.com wrote:
In the Mahout Spark R-like DSL [A’A] and [AA’] doesn’t actually do a 
transpose—it’s optimized out. Mahout has had a stand alone row matrix transpose 
since day 1 and supports it in the Spark version. Can’t really do matrix 
algebra without it even though it’s often possible to optimize it away. 

Row similarity with LLR is much simpler than cosine since you only need 
non-zero sums for column, row, and matrix elements so rowSimilarity is 
implemented in Mahout for Spark. Full blown row similarity including all the 
different similarity methods (long since implemented in hadoop mapreduce) 
hasn’t been moved to spark yet.

Yep, rows are not covered in the blog, my mistake. Too bad; it has a lot of uses 
and can at the very least be optimized for output matrix symmetry.

On Jan 17, 2015, at 11:44 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:

Yeah okay, thanks.

On Jan 17, 2015, at 11:15 AM, Reza Zadeh r...@databricks.com wrote:

 Pat, columnSimilarities is what that blog post is about, and is already part 
 of Spark 1.2.
 
 rowSimilarities in a RowMatrix is a little more tricky because you can't 
 transpose a RowMatrix easily, and is being tracked by this JIRA: 
  https://issues.apache.org/jira/browse/SPARK-4823
 
  Andrew, sometimes (not always) it's OK to transpose a RowMatrix; if, for 
  example, the number of rows in your RowMatrix is less than 1m, you can 
  transpose it and then use columnSimilarities.
 
 
  On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel p...@occamsmachete.com wrote:
 BTW it looks like row and column similarities (cosine based) are coming to 
 MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the 
 master yet. Does anyone know the status?
 
 See: 
  https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
 
 Also the method for computation reduction (make it less than O(n^2)) seems 
 rooted in cosine. A different computation reduction method is used in the 
 Mahout code tied to LLR. Seems like we should get these together.
  
  On Jan 17, 2015, at 9:37 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:
 
 Excellent, thanks Pat.
 
  On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Mahout’s Spark implementation of rowsimilarity is in the Scala 
 SimilarityAnalysis class. It actually does either row or column similarity 
 but only supports LLR at present. It does [AA’] for columns or [A’A] for 
 rows first then calculates the distance (LLR) for non-zero elements. This is 
 a major optimization for sparse matrices. As I recall the old hadoop code 
 only did this for half the matrix since it’s symmetric but that optimization 
 isn’t in the current code because the downsampling is done as LLR is 
 calculated, so the entire similarity matrix is never actually calculated 
 unless you disable downsampling. 
 
 The primary use is for recommenders but I’ve used it (in the test suite) for 
 row-wise text token similarity too.  
 
  On Jan 17, 2015, at 9:00 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:
 
 Yeah that's the kind of thing I'm looking 

Spark Streaming with Kafka

2015-01-18 Thread Rasika Pohankar
I am using Spark Streaming to process data received through Kafka. The
Spark version is 1.2.0. I have written the code in Java and am compiling it
using sbt. The program runs, receives data from Kafka, and processes it as
well. But it suddenly stops receiving data after some time (so far it has run
for about an hour receiving data from Kafka, and then it always stops
receiving). The program continues to run; it only stops receiving data. After
a while it sometimes starts receiving again and sometimes doesn't, so I stop
the program and start it again.
Earlier I was using Spark 1.0.0. I upgraded to check whether the problem was in
that version, but it still happens after upgrading.

Is this a known issue? Can someone please help.

Thanking you.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-tp21222.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.

2015-01-18 Thread Ganon Pierce
I posted about the Application WebUI error (specifically the application WebUI, not 
the master WebUI generally) and have spent at least a few hours a day for over a 
week trying to resolve it, so I’d be very grateful for any suggestions. It is 
quite troubling that I appear to be the only one encountering this issue, and 
I’ve tried to include everything here which might be relevant (sorry for the 
length). Please see the thread “Current Build Gives HTTP ERROR” 
https://www.mail-archive.com/user@spark.apache.org/msg18752.html to see 
specifics about the application WebUI issue and the master log.


Environment:

I’m doing my spark builds and application programming in scala locally on my 
macbook pro in eclipse, using modified ec2 launch scripts to launch my cluster, 
uploading my spark builds and models to s3, and uploading applications to and 
submitting them from ec2. I’m using java 8 locally and also installing and 
using java 8 on my ec2 instances (which works with spark 1.2.0). I have a 
windows machine at home (macbook is work machine), but have not yet attempted 
to launch from there.


Errors:

I’ve built two different recent git versions of Spark, each multiple times, and 
when running applications both have produced an Application WebUI error and 
this exception: 

Exception in thread main java.lang.IllegalArgumentException: Log directory 
/tmp/spark-events does not exist.

While both will display the master webUI just fine including running/completed 
applications, registered workers etc, when I try to access a running or 
completed application’s WebUI by clicking their respective link, I receive a 
server error. When I manually create the above log directory, the exception 
goes away, but the WebUI problem does not. I don’t have any strong evidence, 
but I suspect these errors and whatever is causing them are related. 
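
That directory is Spark's default spark.eventLog.dir, so one sketch of a 
workaround, assuming event logging is wanted at all, is to point it explicitly at 
a directory or HDFS path that exists via conf/spark-defaults.conf (the path below 
is only an example), or simply to create the directory on the driver node before 
submitting:

  spark.eventLog.enabled  true
  spark.eventLog.dir      file:///tmp/spark-events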


Why and How of Modifications to Launch Scripts for Installation of Unreleased 
Spark Versions:

When using a prebuilt version of spark on my cluster everything works except 
the new methods I need, which I had previously added to my custom version of 
spark and used by building the spark-assembly.jar locally and then replacing 
the assembly file produced through the 1.1.0 ec2 launch scripts. However, since 
my pull request was accepted and can now be found in the apache/spark 
repository along with some additional features I’d like to use and because I’d 
like a more elegant permanent solution for launching a cluster and installing 
unreleased versions of spark to my ec2 clusters, I’ve modified the included ec2 
launch scripts in this way (credit to gen tang here: 
https://www.mail-archive.com/user@spark.apache.org/msg18761.html):

1. Clone the most recent git version of spark
2. Use the make-dist script 
3. Tar the dist folder and upload the resulting 
spark-1.3.0-snapshot-hadoop1.tgz to s3 and change file permissions
4. Fork the mesos/spark-ec2 repository and modify the spark/init.sh script to 
do a wget of my hosted distribution instead of spark’s stable release
5. Modify my spark_ec2.py script to point to my repository.
6. Modify my spark_ec2.py script to install java 8 on my ec2 instances. (This 
works and does not produce the above stated errors when using a stable release 
like 1.2.0).
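
A condensed sketch of steps 1-3, assuming the make-distribution.sh script at the 
repo root and the same profiles used elsewhere on this list (the upload in step 3 
can be done with whatever S3 tool you prefer):

  git clone https://github.com/apache/spark.git && cd spark
  ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn
  # then upload the resulting .tgz to S3 and make it readable by the launch script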


Additional Possibly Related Info:

As far as I can tell (I went through line by line), when I launch my recent 
build vs when I launch the most recent stable release the console prints almost 
identical INFO and WARNINGS except where you would expect things to be 
different e.g. version numbers. I’ve noted that after launch the prebuilt 
stable version does not have a /tmp/spark-events directory, but it is created 
when the application is launched, while it is never created in my build. 
Further, in my unreleased builds the application logs that I find are always 
stored as .inprogress files (when I set the logging directory to /root/ or add 
the /tmp/spark-events directory manually) even after completion, which I 
believe is supposed to change to .completed (or something similar) when the 
application finishes.


Thanks for any help!



Re: Maven out of memory error

2015-01-18 Thread Ted Yu
Yes.
That could be the cause.

On Sun, Jan 18, 2015 at 11:47 AM, Sean Owen so...@cloudera.com wrote:

 Oh: are you running the tests with a different profile setting than
 what the last assembly was built with? this particular test depends on
 those matching. Not 100% sure that's the problem, but a good guess.

 On Sat, Jan 17, 2015 at 4:54 PM, Ted Yu yuzhih...@gmail.com wrote:
  The test passed here:
 
 
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/1215/consoleFull
 
  It passed locally with the following command:
 
  mvn -DHADOOP_PROFILE=hadoop-2.4 -Phadoop-2.4 -Pyarn -Phive test
  -Dtest=JavaAPISuite
 
  FYI
 
  On Sat, Jan 17, 2015 at 8:23 AM, Andrew Musselman
  andrew.mussel...@gmail.com wrote:
 
  Failing for me and another team member on the command line, for what
 it's
  worth.
 
   On Jan 17, 2015, at 2:39 AM, Sean Owen so...@cloudera.com wrote:
  
   Hm, this test hangs for me in IntelliJ. It could be a real problem,
   and a combination of a) just recently actually enabling Java tests, b)
   recent updates to the complicated Guava shading situation.
  
   The manifestation of the error usually suggests that something totally
   failed to start (because of, say, class incompatibility errors, etc.)
   Thus things hang and time out waiting for the dead component. It's
   sometimes hard to get answers from the embedded component that dies
   though.
  
   That said, it seems to pass on the command line. For example my recent
   Jenkins job shows it passes:
  
  
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25682/consoleFull
  
   I'll try to uncover more later this weekend. Thoughts welcome though.
  
   On Fri, Jan 16, 2015 at 8:26 PM, Andrew Musselman
   andrew.mussel...@gmail.com wrote:
   Thanks Ted, got farther along but now have a failing test; is this a
   known
   issue?
  
   ---
   T E S T S
   ---
   Running org.apache.spark.JavaAPISuite
   Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
   123.462 sec
FAILURE! - in org.apache.spark.JavaAPISuite
   testGuavaOptional(org.apache.spark.JavaAPISuite)  Time elapsed: 106.5
   sec
ERROR!
   org.apache.spark.SparkException: Job aborted due to stage failure:
   Master
   removed our application: FAILED
  at
  
   org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199)
  at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188)
  at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187)
  at
  
  
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at
   scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at
  
  
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187)
  at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
  at
  
  
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
  at scala.Option.foreach(Option.scala:236)
  at
  
  
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
  at
  
  
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399)
  at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
  at
  
  
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
  at akka.actor.ActorCell.invoke(ActorCell.scala:487)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
  at akka.dispatch.Mailbox.run(Mailbox.scala:220)
  at
  
  
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at
   scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at
  
  
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at
  
  
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at
  
  
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  
   Running org.apache.spark.JavaJdbcRDDSuite
   Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846
   sec -
   in org.apache.spark.JavaJdbcRDDSuite
  
   Results :
  
  
   Tests in error:
JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage
   failure:
   Maste...
  
   On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com
 wrote:
  
   Can you try doing this before running mvn ?
  
    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
  
   What OS