Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
i just looked at my dependencies in sbt, and when using cdh4.5.0
dependencies i see that hadoop-client pulls in jboss netty (via zookeeper)
and asm 3.x (via jersey-server). so somehow these exclusion rules are not
working anymore? i will look into sbt-pom-reader a bit to try to understand
what's happening
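
for reference, exclusion rules of the kind being discussed normally look something
like this in an sbt build (a sketch only -- the cdh version string and the excluded
module coordinates here are assumptions, not taken from the spark build itself):

  // keep jboss netty and asm 3.x out of the hadoop-client dependency tree
  libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.5.0")
    .exclude("org.jboss.netty", "netty")
    .exclude("asm", "asm")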


On Mon, Jul 28, 2014 at 8:45 PM, Patrick Wendell pwend...@gmail.com wrote:

 All of the scripts we use to publish Spark releases are in the Spark
 repo itself, so you could follow these as a guideline. The publishing
 process in Maven is similar to in SBT:


 https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65

 On Mon, Jul 28, 2014 at 12:39 PM, Koert Kuipers ko...@tresata.com wrote:
  ah ok thanks. guess i am gonna read up about maven-release-plugin then!
 
 
  On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote:
 
  This is not something you edit yourself. The Maven release plugin
  manages setting all this. I think virtually everything you're worried
  about is done for you by this plugin.
 
  Maven requires artifacts to set a version and it can't inherit one. I
  feel like I understood the reason this is necessary at one point.
 
  On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers ko...@tresata.com
 wrote:
   and if i want to change the version, it seems i have to change it in
 all
   23
   pom files? mhhh. is it mandatory for these sub-project pom files to
   repeat
   that version info? useful?
  
   spark$ grep 1.1.0-SNAPSHOT * -r  | wc -l
   23
  
  
  
   On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com
   wrote:
  
   hey we used to publish spark inhouse by simply overriding the
 publishTo
   setting. but now that we are integrated in SBT with maven i cannot
 find
   it
   anymore.
  
   i tried looking into the pom file, but after reading 1144 lines of
 xml
   i
   1) havent found anything that looks like publishing
   2) i feel somewhat sick too
   3) i am considering alternative careers to developing...
  
   where am i supposed to look?
   thanks for your help!
  
  
 
 



how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
hey we used to publish spark inhouse by simply overriding the publishTo
setting. but now that we are integrated in SBT with maven i cannot find it
anymore.

i tried looking into the pom file, but after reading 1144 lines of xml i
1) havent found anything that looks like publishing
2) i feel somewhat sick too
3) i am considering alternative careers to developing...

where am i supposed to look?
thanks for your help!
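
the publishTo override referred to above is, in a plain sbt build, just a setting
along these lines (a sketch; the repository name, url and credentials path are
placeholders, and this is not how the maven-driven spark build itself publishes):

  publishTo := Some("in-house releases" at "https://repo.example.com/releases")
  credentials += Credentials(Path.userHome / ".ivy2" / ".credentials")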


Re: how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
and if i want to change the version, it seems i have to change it in all 23
pom files? mhhh. is it mandatory for these sub-project pom files to repeat
that version info? useful?

spark$ grep 1.1.0-SNAPSHOT * -r  | wc -l
23



On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote:

 hey we used to publish spark inhouse by simply overriding the publishTo
 setting. but now that we are integrated in SBT with maven i cannot find it
 anymore.

 i tried looking into the pom file, but after reading 1144 lines of xml i
 1) havent found anything that looks like publishing
 2) i feel somewhat sick too
 3) i am considering alternative careers to developing...

 where am i supposed to look?
 thanks for your help!



Re: how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
ah ok thanks. guess i am gonna read up about maven-release-plugin then!


On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote:

 This is not something you edit yourself. The Maven release plugin
 manages setting all this. I think virtually everything you're worried
 about is done for you by this plugin.

 Maven requires artifacts to set a version and it can't inherit one. I
 feel like I understood the reason this is necessary at one point.

 On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers ko...@tresata.com wrote:
  and if i want to change the version, it seems i have to change it in all
 23
  pom files? mhhh. is it mandatory for these sub-project pom files to
 repeat
  that version info? useful?
 
  spark$ grep 1.1.0-SNAPSHOT * -r  | wc -l
  23
 
 
 
  On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
  hey we used to publish spark inhouse by simply overriding the publishTo
  setting. but now that we are integrated in SBT with maven i cannot find
 it
  anymore.
 
  i tried looking into the pom file, but after reading 1144 lines of xml i
  1) havent found anything that looks like publishing
  2) i feel somewhat sick too
  3) i am considering alternative careers to developing...
 
  where am i supposed to look?
  thanks for your help!
 
 



graphx cached partitions wont go away

2014-07-26 Thread Koert Kuipers
i have graphx queries running inside a service where i collect the results
to the driver and do not hold any references to the rdds involved in the
queries. my assumption was that with the references gone spark would go and
remove the cached rdds from memory (note, i did not cache them, graphx did).

yet they hang around...

is my understanding of how the ContextCleaner works incorrect? or could it
be that graphx holds some references internally to rdds, preventing garbage
collection? maybe even circular references?


using shapeless in spark to optimize data layout in memory

2014-07-23 Thread Koert Kuipers
hello all,
in case anyone is interested, i just wrote a short blog about using
shapeless in spark to optimize data layout in memory.

blog is here:
http://tresata.com/tresata-open-sources-spark-columnar

code is here:
https://github.com/tresata/spark-columnar


Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Koert Kuipers
but be aware that spark-defaults.conf is only used if you use spark-submit
On Jul 17, 2014 4:29 PM, Zongheng Yang zonghen...@gmail.com wrote:

 One way is to set this in your conf/spark-defaults.conf:

 spark.executor.extraLibraryPath /path/to/native/lib

 The key is documented here:
 http://spark.apache.org/docs/latest/configuration.html

 On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman
 eric.d.fried...@gmail.com wrote:
  I used to use SPARK_LIBRARY_PATH to specify the location of native libs
  for lzo compression when using spark 0.9.0.
 
  The references to that environment variable have disappeared from the
 docs
  for
  spark 1.0.1 and it's not clear how to specify the location for lzo.
 
  Any guidance?
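
for applications that are not launched through spark-submit (where, as noted above,
spark-defaults.conf is ignored), the same key can be set programmatically on the
SparkConf before the context is created; the library path below is a placeholder:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("lzo-app")
    .set("spark.executor.extraLibraryPath", "/path/to/native/lib")
  val sc = new SparkContext(conf)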



Re: spark ui on yarn

2014-07-13 Thread Koert Kuipers
my yarn environment does have less memory for the executors.

i am checking if the RDDs are cached by calling sc.getRDDStorageInfo, which
shows an RDD as fully cached in memory, yet it does not show up in the UI
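
the check described above is just a loop over sc.getRDDStorageInfo; a minimal sketch
(assuming the 1.0 RDDInfo fields name, numCachedPartitions, numPartitions and memSize,
with sc being the live SparkContext):

  sc.getRDDStorageInfo.foreach { info =>
    println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
      s"partitions cached, ${info.memSize} bytes in memory")
  }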


On Sun, Jul 13, 2014 at 1:49 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 The UI code is the same in both, but one possibility is that your
 executors were given less memory on YARN. Can you check that? Or otherwise,
 how do you know that some RDDs were cached?

 Matei

 On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote:

 hey shuo,
 so far all stage links work fine for me.

 i did some more testing, and it seems kind of random what shows up on the
 gui and what does not. some partially cached RDDs make it to the GUI, while
 some fully cached ones do not. I have not been able to detect a pattern.

 is the codebase for the gui different in standalone than in yarn-client
 mode?


 On Sat, Jul 12, 2014 at 3:34 AM, Shuo Xiang shuoxiang...@gmail.com
 wrote:

 Hi Koert,
   Just curious did you find any information like CANNOT FIND ADDRESS
 after clicking into some stage? I've seen similar problems due to lost of
 executors.

 Best,



 On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers ko...@tresata.com wrote:

 I just tested a long lived application (that we normally run in
 standalone mode) on yarn in client mode.

 it looks to me like cached rdds are missing in the storage tab of the ui.

 accessing the rdd storage information via the spark context shows rdds
 as fully cached but they are missing on storage page.

 spark 1.0.0







Re: spark ui on yarn

2014-07-12 Thread Koert Kuipers
hey shuo,
so far all stage links work fine for me.

i did some more testing, and it seems kind of random what shows up on the
gui and what does not. some partially cached RDDs make it to the GUI, while
some fully cached ones do not. I have not been able to detect a pattern.

is the codebase for the gui different in standalone than in yarn-client
mode?


On Sat, Jul 12, 2014 at 3:34 AM, Shuo Xiang shuoxiang...@gmail.com wrote:

 Hi Koert,
   Just curious did you find any information like CANNOT FIND ADDRESS
 after clicking into some stage? I've seen similar problems due to lost of
 executors.

 Best,



 On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers ko...@tresata.com wrote:

 I just tested a long lived application (that we normally run in
 standalone mode) on yarn in client mode.

 it looks to me like cached rdds are missing in the storage tab of the ui.

 accessing the rdd storage information via the spark context shows rdds as
 fully cached but they are missing on storage page.

 spark 1.0.0





spark ui on yarn

2014-07-11 Thread Koert Kuipers
I just tested a long lived application (that we normally run in standalone
mode) on yarn in client mode.

it looks to me like cached rdds are missing in the storage tab of the ui.

accessing the rdd storage information via the spark context shows rdds as
fully cached but they are missing on storage page.

spark 1.0.0


Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
not sure I understand why unifying how you submit app for different
platforms and dynamic configuration cannot be part of SparkConf and
SparkContext?

for classpath a simple script similar to hadoop classpath that shows what
needs to be added should be sufficient.

on spark standalone I can launch a program just fine with just SparkConf
and SparkContext. not on yarn, so the spark-launch script must be doing a
few things extra there I am missing... which makes things more difficult
because I am not sure its realistic to expect every application that needs
to run something on spark to be launched using spark-submit.
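
the standalone pattern being described is just a plain jvm main that builds its own
SparkConf and SparkContext; a minimal sketch (master url and jar path are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  object StandaloneLaunch extends App {
    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://master-host:7077")
      .setJars(Seq("/path/to/my-app-assembly.jar"))
    val sc = new SparkContext(conf)

    println(sc.parallelize(1 to 1000).count())
    sc.stop()
  }
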
 On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:

 It fulfills a few different functions. The main one is giving users a
 way to inject Spark as a runtime dependency separately from their
 program and make sure they get exactly the right version of Spark. So
 a user can bundle an application and then use spark-submit to send it
 to different types of clusters (or using different versions of Spark).

 It also unifies the way you bundle and submit an app for Yarn, Mesos,
 etc... this was something that became very fragmented over time before
 this was added.

 Another feature is allowing users to set configuration values
 dynamically rather than compile them inside of their program. That's
 the one you mention here. You can choose to use this feature or not.
 If you know your configs are not going to change, then you don't need
 to set them with spark-submit.


 On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com
 wrote:
  What is the purpose of spark-submit? Does it do anything outside of
  the standard val conf = new SparkConf ... val sc = new SparkContext
  ... ?



Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
did you explicitly cache the rdd? we cache rdds and share them between jobs
just fine within one context in spark 1.0.x. but we do not use the ooyala
job server...


On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote:

 Hi,

 I using spark 1.0.0  , using Ooyala Job Server, for a low latency query
 system. Basically a long running context is created, which enables to run
 multiple jobs under the same context, and hence sharing of the data.

 It was working fine in 0.9.1. However in spark 1.0 release, the RDD's
 created and cached by a Job-1 gets cleaned up by BlockManager (can see log
 statements saying cleaning up RDD) and so the cached RDD's are not
 available
 for Job-2, though Both Job-1 and Job-2 are running under same spark
 context.

 I tried using the spark.cleaner.referenceTracking = false setting, how-ever
 this causes the issue that unpersisted RDD's are not cleaned up properly,
 and occupying the Spark's memory..


 Had anybody faced issue like this before? If so, any advice would be
 greatly
 appreicated.


 Also is there any way, to mark an RDD as being used under a context, event
 though the job using that had been finished (so subsequent jobs can use
 that
 RDD).


 Thanks,
 Prem






Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
we simply hold on to the reference to the rdd after it has been cached. so
we have a single Map[String, RDD[X]] for cached RDDs for the application
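
a minimal sketch of that pattern: a registry of strong references so the cached rdds
are never garbage collected by the ContextCleaner between jobs (names are placeholders):

  import scala.collection.mutable
  import org.apache.spark.rdd.RDD

  object RddRegistry {
    private val cached = mutable.Map.empty[String, RDD[_]]

    def register[T](name: String, rdd: RDD[T]): RDD[T] =
      cached.getOrElseUpdate(name, rdd.cache()).asInstanceOf[RDD[T]]

    def lookup[T](name: String): Option[RDD[T]] =
      cached.get(name).map(_.asInstanceOf[RDD[T]])
  }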


On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote:

 Hi,

 Yes . I am  caching the RDD's by calling cache method..


 May i ask, how you are sharing RDD's across jobs in same context? By the
 RDD
 name. I tried printing the RDD's of the Spark context, and when the
 referenceTracking is enabled, i get empty list after the clean up.

 Thanks,
 Prem







Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
sandy, that makes sense. however i had trouble doing programmatic execution
on yarn in client mode as well. the application-master in yarn came up but
then bombed because it was looking for jars that dont exist (it was looking
in the original file paths on the driver side, which are not available on
the yarn node). my guess is that spark-submit is changing some settings
(perhaps preparing the distributed cache and modifying settings
accordingly), which makes it harder to run things programmatically. i could
be wrong however. i gave up debugging and resorted to using spark-submit
for now.



On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Spark still supports the ability to submit jobs programmatically without
 shell scripts.

 Koert,
 The main reason that the unification can't be a part of SparkContext is
 that YARN and standalone support deploy modes where the driver runs in a
 managed process on the cluster.  In this case, the SparkContext is created
 on a remote node well after the application is launched.


 On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote:

 One another +1. For me it's a question of embedding. With
 SparkConf/SparkContext I can easily create larger projects with Spark as a
 separate service (just like MySQL and JDBC, for example). With spark-submit
 I'm bound to Spark as a main framework that defines how my application
 should look like. In my humble opinion, using Spark as embeddable library
 rather than main framework and runtime is much easier.




 On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote:

 +1 as well for being able to submit jobs programmatically without using
 shell script.

 we also experience issues of submitting jobs programmatically without
 using spark-submit. In fact, even in the Hadoop World, I rarely used
 hadoop jar to submit jobs in shell.



 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com
 wrote:

 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.

 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.

 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in
 using
  spark-submit versus SparkConf/SparkContext if you are willing to set
 your
  own config?
 
  If there are any gaps, +1 on having parity within
 SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically.
 In
  theory, we could shell out to spark-submit but it's not the best
 option for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not
 knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com
 wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that
 shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just
 SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be
 doing a
  few things extra there I am missing... which makes things more
 difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  It fulfills a few different functions. The main one is giving users
 a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark.
 So
  a user can bundle an application and then use spark-submit to send
 it
  to different types of clusters (or using different versions of
 Spark).
 
  It also unifies the way you bundle and submit an app for Yarn,
 Mesos,
  etc... this was something that became very fragmented over time
 before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program. That's
  the one you mention here. You can choose to use this feature or not.
  If you know your configs are not going to change, then you don't
 need
  to set them with spark-submit.
 
 
  On Wed, Jul 9, 2014 at 10:22 AM, Robert James 
 srobertja...@gmail.com
  wrote:
   What is the purpose of spark-submit? Does it do anything outside
 of
   the standard val conf = new SparkConf ... val sc = new
 SparkContext
   ... ?
 
 
 
 
  --
 
  SUREN HIRAMAN, VP TECHNOLOGY
  Velos
  Accelerating Machine Learning
 
  440 NINTH

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread Koert Kuipers
do you control your cluster and spark deployment? if so, you can try to
rebuild with jetty 9.x


On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter 
martingammelsae...@gmail.com wrote:

 Digging a bit more I see that there is yet another jetty instance that
 is causing the problem, namely the BroadcastManager has one. I guess
 this one isn't very wise to disable... It might very well be that the
 WebUI is a problem as well, but I guess the code doesn't get far
 enough. Any ideas on how to solve this? Spark seems to use jetty
 8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source
 of the problem. Any ideas?

 On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter
 martingammelsae...@gmail.com wrote:
  Hi!
 
  I am building a web frontend for a Spark app, allowing users to input
  sql/hql and get results back. When starting a SparkContext from within
  my server code (using jetty/dropwizard) I get the error
 
  java.lang.NoSuchMethodError:
  org.eclipse.jetty.server.AbstractConnector: method <init>()V not found
 
  when Spark tries to fire up its own jetty server. This does not happen
  when running the same code without my web server. This is probably
  fixable somehow(?) but I'd like to disable the webUI as I don't need
  it, and ideally I would like to access that information
  programatically instead, allowing me to embed it in my own web
  application.
 
  Is this possible?
 
  --
  Best regards,
  Martin Gammelsæter



 --
 Mvh.
 Martin Gammelsæter
 92209139



Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-07 Thread Koert Kuipers
you could only do the deep check if the hashcodes are the same and design
hashcodes that do not take all elements into account.

the alternative seems to be putting cache statements all over graphx, as is
currently the case, which is trouble for any long lived application where
caching is carefully managed. I think? I am currently forced to do
unpersists on vertices after almost every intermediate graph
transformation, or accept my rdd cache getting polluted
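
the unpersist dance being described looks roughly like this (a sketch; `step` stands
in for whatever intermediate graphx transformation is applied):

  import scala.reflect.ClassTag
  import org.apache.spark.graphx.Graph

  def nextIteration[VD: ClassTag, ED: ClassTag](
      g: Graph[VD, ED], step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
    val next = step(g).cache()
    next.vertices.count()                 // materialize before dropping the parent
    g.unpersistVertices(blocking = false) // drop the old graph's cached partitions
    g.edges.unpersist(blocking = false)
    next
  }
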
On Jul 7, 2014 12:03 AM, Ankur Dave ankurd...@gmail.com wrote:

 Well, the alternative is to do a deep equality check on the index arrays,
 which would be somewhat expensive since these are pretty large arrays (one
 element per vertex in the graph). But, in case the reference equality check
 fails, it actually might be a good idea to do the deep check before
 resorting to the slow code path.

 Ankur http://www.ankurdave.com/



tiers of caching

2014-07-07 Thread Koert Kuipers
i noticed that some algorithms such as graphx liberally cache RDDs for
efficiency, which makes sense. however it can also leave a long trail of
unused yet cached RDDs, that might push other RDDs out of memory.

in a long-lived spark context i would like to decide which RDDs stick
around. would it make sense to create tiers of caching, to distinguish
explicitly cached RDDs by the application from RDDs that are temporary
cached by algos, so as to make sure these temporary caches don't push
application RDDs out of memory?


Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Koert Kuipers
spark has a setting to put user jars in front of classpath, which should do
the trick.
however i had no luck with this. see here:

https://issues.apache.org/jira/browse/SPARK-1863
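
if memory serves, the setting referred to above is the (experimental)
spark.files.userClassPathFirst flag; a sketch of turning it on, with the caveat from
SPARK-1863 that it did not work reliably at the time:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("user-jars-first")
    .set("spark.files.userClassPathFirst", "true")
  val sc = new SparkContext(conf)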



On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com wrote:

 spark-submit includes a spark-assembly uber jar, which has older
 versions of many common libraries.  These conflict with some of the
 dependencies we need.  I have been racking my brain trying to find a
 solution (including experimenting with ProGuard), but haven't been
 able to: when we use spark-submit, we get NoMethodErrors, even though
 the code compiles fine, because the runtime classes are different than
 the compile time classes!

 Can someone recommend a solution? We are using scala, sbt, and
 sbt-assembly, but are happy using another tool (please provide
 instructions how to).



acl for spark ui

2014-07-07 Thread Koert Kuipers
i was testing using the acl for spark ui in secure mode on yarn in client
mode.

it works great. my spark 1.0.0 configuration has:
spark.authenticate = true
spark.ui.acls.enable = true
spark.ui.view.acls = koert
spark.ui.filters =
org.apache.hadoop.security.authentication.server.AuthenticationFilter
spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params=type=kerberos,kerberos.principal=HTTP/mybox@MYDOMAIN
,kerberos.keytab=/some/keytab

i confirmed that i can access the ui from firefox after doing kinit.

however i also saw this in the logs of my driver program:
2014-07-07 17:21:56 DEBUG server.Server: RESPONSE /broadcast_0  401
handled=true

and

2014-07-07 17:21:56 DEBUG server.Server: REQUEST
/jars/somejar-assembly-0.1-SNAPSHOT.jar on BlockingHttpConnection@3d6396f5
,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParse\
r{s=-5,l=10,c=0},r=1
2014-07-07 17:21:56 DEBUG server.Server: RESPONSE
/jars/somejar-assembly-0.1-SNAPSHOT.jar  401 handled=true

what does this mean? is the webserver also responsible for handing out
other stuff such as broadcast variables and jars, and is this now being
rejected by my servlet filter? thats not good... the 401 response is
exactly the same one i see when i try to access the website after kdestroy.
for example:

2014-07-07 17:35:08 DEBUG server.AuthenticationFilter: Request [
http://mybox:5001/] triggering authentication
2014-07-07 17:35:08 DEBUG server.Server: RESPONSE /  401 handled=true


Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-06 Thread Koert Kuipers
probably a dumb question, but why is reference equality used for the
indexes?


On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave ankurd...@gmail.com wrote:

 When joining two VertexRDDs with identical indexes, GraphX can use a fast
 code path (a zip join without any hash lookups). However, the check for
 identical indexes is performed using reference equality.

 Without caching, two copies of the index are created. Although the two
 indexes are structurally identical, they fail reference equality, and so
 GraphX mistakenly uses the slow path involving a hash lookup per joined
 element.

 I'm working on a patch https://github.com/apache/spark/pull/1297 that
 attempts an optimistic zip join with per-element fallback to hash lookups,
 which would improve this situation.

 Ankur http://www.ankurdave.com/




Re: taking top k values of rdd

2014-07-05 Thread Koert Kuipers
hey nick,
you are right. i didnt explain myself well and my code example was wrong...
i am keeping a priority-queue with k items per partition (using
com.twitter.algebird.mutable.PriorityQueueMonoid.build to limit the sizes
of the queues).
but this still means i am sending k items per partition to my driver, so k
x p, while i only need k.
thanks! koert
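
the per-partition top-k described above, spelled out (a sketch; the algebird
PriorityQueueMonoid constructor and build signature are from memory, and sc is
assumed to be a live SparkContext):

  import java.util.PriorityQueue
  import com.twitter.algebird.mutable.PriorityQueueMonoid

  val k = 100
  val qMonoid = new PriorityQueueMonoid[Long](k)

  val rdd = sc.parallelize(1L to 1000000L)
  // one bounded queue per partition; k x p queues are then merged on the driver
  val topK: PriorityQueue[Long] = rdd
    .mapPartitions(items => Iterator.single(qMonoid.build(items)), preservesPartitioning = false)
    .reduce(qMonoid.plus)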



On Sat, Jul 5, 2014 at 1:21 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 To make it efficient in your case you may need to do a bit of custom code
 to emit the top k per partition and then only send those to the driver. On
 the driver you can just top k the combined top k from each partition
 (assuming you have (object, count) for each top k list).

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers ko...@tresata.com wrote:

 my initial approach to taking top k values of a rdd was using a
 priority-queue monoid. along these lines:

  rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) },
 false).reduce(monoid.plus)

 this works fine, but looking at the code for reduce it first reduces
 within a partition (which doesnt help me) and then sends the results to the
 driver where these again get reduced. this means that for every partition
 the (potentially very bulky) priorityqueue gets shipped to the driver.

 my driver is client side, not inside cluster, and i cannot change this,
 so this shipping to driver of all these queues can be expensive.

 is there a better way to do this? should i try a shuffle first to
 reduce the partitions to the minimal amount (since number of queues shipped
 is equal to number of partitions)?

 is there a way to reduce to a single item RDD, so the queues stay inside
 cluster and i can retrieve the final result with RDD.first?





Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-05 Thread Koert Kuipers
thanks for replying. why is joining two vertexrdds without caching slow?
what is recomputed unnecessarily?
i am not sure what is different here from joining 2 regular RDDs (where
nobody seems to recommend to cache before joining i think...)


On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave ankurd...@gmail.com wrote:

 Oh, I just read your message more carefully and noticed that you're
 joining a regular RDD with a VertexRDD. In that case I'm not sure why the
 warning is occurring, but it might be worth caching both operands
 (graph.vertices and the regular RDD) just to be sure.

 Ankur http://www.ankurdave.com/




Re: MLLib : Math on Vector and Matrix

2014-07-02 Thread Koert Kuipers
i did the second option: re-implemented .toBreeze as .breeze using pimp
classes
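
a sketch of what such pimp (enrich-my-library) classes can look like, going through
the public toArray accessors rather than the private[mllib] toBreeze (the column-major
layout assumption for matrices should be double checked against your mllib version):

  import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
  import org.apache.spark.mllib.linalg.{Matrix => SparkMatrix, Vector => SparkVector}

  object BreezeConversions {
    implicit class RichVector(val v: SparkVector) {
      def breeze: BDV[Double] = new BDV(v.toArray)
    }
    implicit class RichMatrix(val m: SparkMatrix) {
      def breeze: BDM[Double] = new BDM(m.numRows, m.numCols, m.toArray)
    }
  }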


On Wed, Jul 2, 2014 at 5:00 PM, Thunder Stumpges thunder.stump...@gmail.com
 wrote:

 I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of
 code working with internals of MLLib. One of the big changes was the move
 from the old jblas.Matrix to the Vector/Matrix classes included in MLLib.

 However I don't see how we're supposed to use them for ANYTHING other than
 a container for passing data to the included APIs... how do we do any math
 on them? Looking at the internal code, there are quite a number of
 private[mllib] declarations including access to the Breeze representations
 of the classes.

 Was there a good reason this was not exposed? I could see maybe not
 wanting to expose the 'toBreeze' function which would tie it to the breeze
 implementation, however it would be nice to have the various mathematics
 wrapped at least.

 Right now I see no way to code any vector/matrix math without moving my
 code namespaces into org.apache.spark.mllib or duplicating the code in
 'toBreeze' in my own util functions. Not very appealing.

 What are others doing?
 thanks,
 Thunder




why is toBreeze private everywhere in mllib?

2014-07-01 Thread Koert Kuipers
its kind of handy to be able to convert stuff to breeze... is there some
other way i am supposed to access that functionality?


Re: Spark's Hadooop Dependency

2014-06-25 Thread Koert Kuipers
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % versionSpark % "provided"
    exclude("org.apache.hadoop", "hadoop-client"),
  "org.apache.hadoop" % "hadoop-client" % versionHadoop % "provided"
)


On Wed, Jun 25, 2014 at 11:26 AM, Robert James srobertja...@gmail.com
wrote:

 To add Spark to a SBT project, I do:
   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"

 How do I make sure that the spark version which will be downloaded
 will depend on, and use, Hadoop 2, and not Hadoop 1?

 Even with a line:
 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"

 I still see SBT downloading Hadoop 1:

 [debug] == resolving dependencies

 org.apache.spark#spark-core_2.10;1.0.0-org.apache.hadoop#hadoop-client;1.0.4
 [compile-master(*)]
 [debug] dependency descriptor has been mediated: dependency:
 org.apache.hadoop#hadoop-client;2.4.0 {compile=[default(compile)]} =
 dependency: org.apache.hadoop#hadoop-client;1.0.4
 {compile=[default(compile)]}



graphx Joining two VertexPartitions with different indexes is slow.

2014-06-25 Thread Koert Kuipers
lately i am seeing a lot of this warning in graphx:
org.apache.spark.graphx.impl.ShippableVertexPartitionOps: Joining two
VertexPartitions with different indexes is slow.

i am using Graph.outerJoinVertices to join in data from a regular RDD (that
is co-partitioned). i would like this operation to be fast, since i use it
frequently. should i be doing something different?
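
the usage in question, with both operands cached first (the workaround suggested
elsewhere in this thread); the graph and attribute types below are placeholders:

  import org.apache.spark.graphx.{Graph, VertexId}
  import org.apache.spark.rdd.RDD

  def joinAttributes(graph: Graph[Int, Int],
                     extra: RDD[(VertexId, String)]): Graph[(Int, Option[String]), Int] = {
    graph.vertices.cache()
    extra.cache()
    graph.outerJoinVertices(extra) { (vid, attr, opt) => (attr, opt) }
  }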


Re: Using Spark as web app backend

2014-06-24 Thread Koert Kuipers
run your spark app in client mode together with a spray rest service, that
the front end can talk to
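
a minimal sketch of that setup, based on spray's standard SimpleRoutingApp example
(endpoint, port and the hdfs path are placeholders; one long-lived SparkContext is
shared by all requests, with the master assumed to be provided externally, e.g. via
spark-submit or spark.master):

  import akka.actor.ActorSystem
  import org.apache.spark.{SparkConf, SparkContext}
  import spray.routing.SimpleRoutingApp

  object SparkRestService extends App with SimpleRoutingApp {
    implicit val system = ActorSystem("spark-rest")

    val sc = new SparkContext(new SparkConf().setAppName("spark-rest-backend"))

    startServer(interface = "0.0.0.0", port = 8080) {
      path("count") {
        get {
          complete {
            // every request runs a spark job on the shared context
            sc.textFile("hdfs:///data/events").count().toString
          }
        }
      }
    }
  }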


On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Hi all,

 So far, I run my spark jobs with spark-shell or spark-submit command. I'd
 like to go further and I wonder how to use spark as a backend of a web
 application. Specificaly, I want a frontend application ( build with nodejs
 )  to communicate with spark on the backend, so that every query from the
 frontend is rooted to spark. And the result from Spark are sent back to the
 frontend.
 Does some of you already experiment this kind of architecture ?


 Cheers,


 Jaonary



Re: trying to understand yarn-client mode

2014-06-20 Thread Koert Kuipers
thanks! i will try that.
i guess what i am most confused about is why the executors are trying to
retrieve the jars directly using the info i provided to add jars to my
spark context. i mean, thats bound to fail no? i could be on a different
machine (so my file:// path isn't going to work for them), or i could have the
jars in a directory that is only readable by me.

how come the jars are not just shipped to yarn as part of the job submittal?

i am worried i am supposed to put the jars in a central location and yarn
is going to fetch them from there, leading to jars in yet another place
such as on hdfs which i find pretty messy.
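
following the suggestion quoted below (making the scheme explicit so the driver knows
it has to serve the jar to executors itself), a sketch using the jar path from the
error message:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("yarn-client-test")
    .setMaster("yarn-client")
    .setJars(Seq("file:///home/koert/test-assembly-0.1-SNAPSHOT.jar"))
  val sc = new SparkContext(conf)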


On Thu, Jun 19, 2014 at 2:54 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Coincidentally, I just ran into the same exception. What's probably
 happening is that you're specifying some jar file in your job as an
 absolute local path (e.g. just
 /home/koert/test-assembly-0.1-SNAPSHOT.jar), but your Hadoop config
 has the default FS set to HDFS.

 So your driver does not know that it should tell executors to download
 that file from the driver.

 If you specify the jar with the file: scheme that should solve the
 problem.

 On Thu, Jun 19, 2014 at 10:22 AM, Koert Kuipers ko...@tresata.com wrote:
  i am trying to understand how yarn-client mode works. i am not using
  Application application_1403117970283_0014 failed 2 times due to AM
  Container for appattempt_1403117970283_0014_02 exited with exitCode:
  -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does
 not
  exist
  .Failing this attempt.. Failing the application.


 --
 Marcelo



spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
i noticed that when i submit a job to yarn it mistakenly tries to upload
files to local filesystem instead of hdfs. what could cause this?

in spark-env.sh i have HADOOP_CONF_DIR set correctly (and spark-submit does
find yarn), and my core-site.xml has a fs.defaultFS that is hdfs, not local
filesystem.

thanks! koert


Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
 yeah sure see below. i strongly suspect its something i misconfigured
causing yarn to try to use local filesystem mistakenly.

*

[koert@cdh5-yarn ~]$ /usr/local/lib/spark/bin/spark-submit --class
org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3
--executor-cores 1
hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar 10
14/06/20 12:54:40 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/06/20 12:54:40 INFO RMProxy: Connecting to ResourceManager at
cdh5-yarn.tresata.com/192.168.1.85:8032
14/06/20 12:54:41 INFO Client: Got Cluster metric info from
ApplicationsManager (ASM), number of NodeManagers: 1
14/06/20 12:54:41 INFO Client: Queue info ... queueName: root.default,
queueCurrentCapacity: 0.0, queueMaxCapacity: -1.0,
  queueApplicationCount = 0, queueChildQueueCount = 0
14/06/20 12:54:41 INFO Client: Max mem capabililty of a single resource in
this cluster 8192
14/06/20 12:54:41 INFO Client: Preparing Local resources
14/06/20 12:54:41 WARN BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
14/06/20 12:54:41 INFO Client: Uploading
hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar to
file:/home/koert/.sparkStaging/application_1403201750110_0060/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar
14/06/20 12:54:43 INFO Client: Setting up the launch environment
14/06/20 12:54:43 INFO Client: Setting up container launch context
14/06/20 12:54:43 INFO Client: Command for starting the Spark
ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx512m,
-Djava.io.tmpdir=$PWD/tmp, -Dspark.akka.retry.wait=\3\,
-Dspark.storage.blockManagerTimeoutIntervalMs=\12\,
-Dspark.storage.blockManagerHeartBeatMs=\12\,
-Dspark.app.name=\org.apache.spark.examples.SparkPi\,
-Dspark.akka.frameSize=\1\, -Dspark.akka.timeout=\3\,
-Dspark.worker.timeout=\3\,
-Dspark.akka.logLifecycleEvents=\true\,
-Dlog4j.configuration=log4j-spark-container.properties,
org.apache.spark.deploy.yarn.ApplicationMaster, --class,
org.apache.spark.examples.SparkPi, --jar ,
hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar,
--args  '10' , --executor-memory, 1024, --executor-cores, 1,
--num-executors , 3, 1, LOG_DIR/stdout, 2, LOG_DIR/stderr)
14/06/20 12:54:43 INFO Client: Submitting application to ASM
14/06/20 12:54:43 INFO YarnClientImpl: Submitted application
application_1403201750110_0060
14/06/20 12:54:44 INFO Client: Application report from ASM:
 application identifier: application_1403201750110_0060
 appId: 60
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: N/A
 appQueue: root.koert
 appMasterRpcPort: -1
 appStartTime: 1403283283505
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED
 appTrackingUrl:
http://cdh5-yarn.tresata.com:8088/proxy/application_1403201750110_0060/
 appUser: koert
14/06/20 12:54:45 INFO Client: Application report from ASM:
 application identifier: application_1403201750110_0060
 appId: 60
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: N/A
 appQueue: root.koert
 appMasterRpcPort: -1
 appStartTime: 1403283283505
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED
 appTrackingUrl:
http://cdh5-yarn.tresata.com:8088/proxy/application_1403201750110_0060/
 appUser: koert
14/06/20 12:54:46 INFO Client: Application report from ASM:
 application identifier: application_1403201750110_0060
 appId: 60
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: N/A
 appQueue: root.koert
 appMasterRpcPort: -1
 appStartTime: 1403283283505
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED
 appTrackingUrl:
http://cdh5-yarn.tresata.com:8088/proxy/application_1403201750110_0060/
 appUser: koert
14/06/20 12:54:47 INFO Client: Application report from ASM:
 application identifier: application_1403201750110_0060
 appId: 60
 clientToAMToken: null
 appDiagnostics: Application application_1403201750110_0060 failed 2
times due to AM Container for appattempt_1403201750110_0060_02 exited
with  exitCode: -1000 due to: File
file:/home/koert/.sparkStaging/application_1403201750110_0060/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar
does not exist
.Failing this attempt.. Failing the application.
 appMasterHost: N/A
 appQueue: root.koert
 appMasterRpcPort: -1
 appStartTime: 1403283283505
 yarnAppState: FAILED
 distributedFinalState: FAILED
 appTrackingUrl:
cdh5-yarn.tresata.com:8088/cluster/app/application_1403201750110_0060
 appUser: koert




On Fri, Jun 20, 2014 at 12:42 PM, Marcelo Vanzin van...@cloudera.com
wrote:

 Hi Koert,

 Could you provide more details? Job arguments, log messages, errors, etc.

 On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote:
  i noticed

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
ok solved it. as it happened in spark/conf i also had a file called
core.site.xml (with some tachyon related stuff in it) so thats why it
ignored /etc/hadoop/conf/core-site.xml




On Fri, Jun 20, 2014 at 3:24 PM, Koert Kuipers ko...@tresata.com wrote:

 i put some logging statements in yarn.Client and that confirms its using
 local filesystem:
 14/06/20 15:20:33 INFO Client: fs.defaultFS is file:///

 so somehow fs.defaultFS is not being picked up from
 /etc/hadoop/conf/core-site.xml, but spark does correctly pick up
 yarn.resourcemanager.hostname from /etc/hadoop/conf/yarn-site.xml

 strange!


 On Fri, Jun 20, 2014 at 1:26 PM, Koert Kuipers ko...@tresata.com wrote:

 in /etc/hadoop/conf/core-site.xml:
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://cdh5-yarn.tresata.com:8020</value>
    </property>


 also hdfs seems the default:
 [koert@cdh5-yarn ~]$ hadoop fs -ls /
 Found 5 items
 drwxr-xr-x   - hdfs supergroup  0 2014-06-19 12:31 /data
 drwxrwxrwt   - hdfs supergroup  0 2014-06-20 12:17 /lib
 drwxrwxrwt   - hdfs supergroup  0 2014-06-18 14:58 /tmp
 drwxr-xr-x   - hdfs supergroup  0 2014-06-18 15:02 /user
 drwxr-xr-x   - hdfs supergroup  0 2014-06-18 14:59 /var

 and in my spark-site.env:
 export HADOOP_CONF_DIR=/etc/hadoop/conf



 On Fri, Jun 20, 2014 at 1:04 PM, bc Wong bcwal...@cloudera.com wrote:

 Koert, is there any chance that your fs.defaultFS isn't setup right?


 On Fri, Jun 20, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com
 wrote:

  yeah sure see below. i strongly suspect its something i misconfigured
 causing yarn to try to use local filesystem mistakenly.

 *

 [koert@cdh5-yarn ~]$ /usr/local/lib/spark/bin/spark-submit --class
 org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3
 --executor-cores 1
 hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar 10
 14/06/20 12:54:40 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/06/20 12:54:40 INFO RMProxy: Connecting to ResourceManager at
 cdh5-yarn.tresata.com/192.168.1.85:8032
 14/06/20 12:54:41 INFO Client: Got Cluster metric info from
 ApplicationsManager (ASM), number of NodeManagers: 1
 14/06/20 12:54:41 INFO Client: Queue info ... queueName: root.default,
 queueCurrentCapacity: 0.0, queueMaxCapacity: -1.0,
   queueApplicationCount = 0, queueChildQueueCount = 0
 14/06/20 12:54:41 INFO Client: Max mem capabililty of a single resource
 in this cluster 8192
 14/06/20 12:54:41 INFO Client: Preparing Local resources
 14/06/20 12:54:41 WARN BlockReaderLocal: The short-circuit local reads
 feature cannot be used because libhadoop cannot be loaded.
 14/06/20 12:54:41 INFO Client: Uploading
 hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar to
 file:/home/koert/.sparkStaging/application_1403201750110_0060/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar
 14/06/20 12:54:43 INFO Client: Setting up the launch environment
 14/06/20 12:54:43 INFO Client: Setting up container launch context
 14/06/20 12:54:43 INFO Client: Command for starting the Spark
 ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx512m,
 -Djava.io.tmpdir=$PWD/tmp, -Dspark.akka.retry.wait=\3\,
 -Dspark.storage.blockManagerTimeoutIntervalMs=\12\,
 -Dspark.storage.blockManagerHeartBeatMs=\12\, 
 -Dspark.app.name=\org.apache.spark.examples.SparkPi\,
 -Dspark.akka.frameSize=\1\, -Dspark.akka.timeout=\3\,
 -Dspark.worker.timeout=\3\,
 -Dspark.akka.logLifecycleEvents=\true\,
 -Dlog4j.configuration=log4j-spark-container.properties,
 org.apache.spark.deploy.yarn.ApplicationMaster, --class,
 org.apache.spark.examples.SparkPi, --jar ,
 hdfs://cdh5-yarn/lib/spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar,
 --args  '10' , --executor-memory, 1024, --executor-cores, 1,
 --num-executors , 3, 1, LOG_DIR/stdout, 2, LOG_DIR/stderr)
 14/06/20 12:54:43 INFO Client: Submitting application to ASM
 14/06/20 12:54:43 INFO YarnClientImpl: Submitted application
 application_1403201750110_0060
 14/06/20 12:54:44 INFO Client: Application report from ASM:
  application identifier: application_1403201750110_0060
  appId: 60
  clientToAMToken: null
  appDiagnostics:
  appMasterHost: N/A
  appQueue: root.koert
  appMasterRpcPort: -1
  appStartTime: 1403283283505
  yarnAppState: ACCEPTED
  distributedFinalState: UNDEFINED
  appTrackingUrl:
 http://cdh5-yarn.tresata.com:8088/proxy/application_1403201750110_0060/
  appUser: koert
 14/06/20 12:54:45 INFO Client: Application report from ASM:
  application identifier: application_1403201750110_0060
  appId: 60
  clientToAMToken: null
  appDiagnostics:
  appMasterHost: N/A
  appQueue: root.koert
  appMasterRpcPort: -1
  appStartTime: 1403283283505
  yarnAppState: ACCEPTED
  distributedFinalState: UNDEFINED
  appTrackingUrl:
 http://cdh5

Re: Running Spark alongside Hadoop

2014-06-20 Thread Koert Kuipers
for development/testing i think its fine to run them side by side as you
suggested, using spark standalone. just be realistic about what size data
you can load with limited RAM.


On Fri, Jun 20, 2014 at 3:43 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 The ideal way to do that is to use a cluster manager like Yarn & mesos.
 You can control how much resources to give to which node etc.
 You should be able to run both together in standalone mode however you may
 experience varying latency & performance in the cluster as both MR & spark
 demand resources from same machines etc.


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Jun 20, 2014 at 3:41 PM, Sameer Tilak ssti...@live.com wrote:

 Dear Spark users,

 I have a small 4 node Hadoop cluster. Each node is a VM -- 4 virtual
 cores, 8GB memory and 500GB disk. I am currently running Hadoop on it. I
 would like to run Spark (in standalone mode) along side Hadoop on the same
 nodes. Given the configuration of my nodes, will that work? Does anyone has
 any experience in terms of stability and performance of running Spark and
 Hadoop on somewhat resource-constrained nodes.  I was looking at the Spark
 documentation and there is a way to configure memory and cores for the and
 worker nodes and memory for the master node: SPARK_WORKER_CORES,
 SPARK_WORKER_MEMORY, SPARK_DAEMON_MEMORY. Any recommendations on how to
 share resource between HAdoop and Spark?







Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Koert Kuipers
still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark
standalone.

for example if i have a akka timeout setting that i would like to be
applied to every piece of the spark framework (so spark master, spark
workers, spark executor sub-processes, spark-shell, etc.). i used to do
that with SPARK_JAVA_OPTS. now i am unsure.

SPARK_DAEMON_JAVA_OPTS works for the master and workers, but not for the
spark-shell i think? i tried using SPARK_DAEMON_JAVA_OPTS, and it does not
seem that useful. for example for a worker it does not apply the settings
to the executor sub-processes, while for SPARK_JAVA_OPTS it does do that.
so seems like SPARK_JAVA_OPTS is my only way to change settings for the
executors, yet its deprecated?


On Wed, Jun 11, 2014 at 10:59 PM, elyast lukasz.jastrzeb...@gmail.com
wrote:

 Hi,

 I tried to use SPARK_JAVA_OPTS in spark-env.sh as well as conf/java-opts
 file to set additional java system properties. In this case I could connect
 to tachyon without any problem.

 However when I tried setting executor and driver extraJavaOptions in
 spark-defaults.conf it doesn't.

 I suspect the root cause may be following:

 SparkSubmit doesn't fork additional JVM to actually run either driver or
 executor process and additional system properties are set after JVM is
 created and other classes are loaded. It may happen that Tachyon CommonConf
 class is already being loaded and since its Singleton it won't pick up and
 changes to system properties.

 Please let me know what do u think.

 Can I use conf/java-opts ? since it's not really documented anywhere?

 Best regards
 Lukasz






Re: trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
db tsai,
if in yarn-cluster mode the driver runs inside yarn, how can you do a
rdd.collect and bring the results back to your application?


On Thu, Jun 19, 2014 at 2:33 PM, DB Tsai dbt...@stanford.edu wrote:

 We are submitting the spark job in our tomcat application using
 yarn-cluster mode with great success. As Kevin said, yarn-client mode
 runs driver in your local JVM, and it will have really bad network
 overhead when one do reduce action which will pull all the result from
 executor to your local JVM. Also, since you can only have one spark
 context object in one JVM, it will be tricky to run multiple spark
 jobs concurrently with yarn-client mode.

 This is how we submit spark job with yarn-cluster mode. Please use the
 current master code, otherwise, after the job is finished, spark will
 kill the JVM and exit your app.

 We setup the configuration of spark in a scala map.

    def getArgsFromConf(conf: Map[String, String]): Array[String] = {
      Array[String](
        "--jar", conf.get("app.jar").getOrElse(""),
        "--addJars", conf.get("spark.addJars").getOrElse(""),
        "--class", conf.get("spark.mainClass").getOrElse(""),
        "--num-executors", conf.get("spark.numWorkers").getOrElse("1"),
        "--driver-memory", conf.get("spark.masterMemory").getOrElse("1g"),
        "--executor-memory", conf.get("spark.workerMemory").getOrElse("1g"),
        "--executor-cores", conf.get("spark.workerCores").getOrElse("1"))
    }

    System.setProperty("SPARK_YARN_MODE", "true")
    val sparkConf = new SparkConf
    val args = getArgsFromConf(conf)
    new Client(new ClientArguments(args, sparkConf), hadoopConfig,
      sparkConf).run

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Thu, Jun 19, 2014 at 11:22 AM, Kevin Markey kevin.mar...@oracle.com
 wrote:
  Yarn client is much like Spark client mode, except that the executors are
  running in Yarn containers managed by the Yarn resource manager on the
  cluster instead of as Spark workers managed by the Spark master.  The
 driver
  executes as a local client in your local JVM.  It communicates with the
  workers on the cluster.  Transformations are scheduled on the cluster by
 the
  driver's logic.  Actions involve communication between local driver and
  remote cluster executors.  So, there is some additional network overhead,
  especially if the driver is not co-located on the cluster.  In
 yarn-cluster
  mode -- in contrast, the driver is executed as a thread in a Yarn
  application master on the cluster.
 
  In either case, the assembly JAR must be available to the application on
 the
  cluster.  Best to copy it to HDFS and specify its location by exporting
 its
  location as SPARK_JAR.
 
  Kevin Markey
 
 
  On 06/19/2014 11:22 AM, Koert Kuipers wrote:
 
  i am trying to understand how yarn-client mode works. i am not using
  spark-submit, but instead launching a spark job from within my own
  application.
 
  i can see my application contacting yarn successfully, but then in yarn i
  get an immediate error:
 
  Application application_1403117970283_0014 failed 2 times due to AM
  Container for appattempt_1403117970283_0014_02 exited with exitCode:
  -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does
 not
  exist
  .Failing this attempt.. Failing the application.
 
  why is yarn trying to fetch my jar, and why as a local file? i would
 expect
  the jar to be send to yarn over the wire upon job submission?
 
 



Re: trying to understand yarn-client mode

2014-06-19 Thread Koert Kuipers
okay. since for us the main purpose is to retrieve (small) data i guess i
will stick to yarn client mode. thx


On Thu, Jun 19, 2014 at 3:19 PM, DB Tsai dbt...@stanford.edu wrote:

 Currently, we save the result in HDFS, and read it back in our
 application. Since Client.run is a blocking call, it's easy to do it in
 this way.

 We are now working on using akka to bring back the result to app
 without going through the HDFS, and we can use akka to track the log,
 and stack trace, etc.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers ko...@tresata.com wrote:
  db tsai,
  if in yarn-cluster mode the driver runs inside yarn, how can you do a
  rdd.collect and bring the results back to your application?
 
 
  On Thu, Jun 19, 2014 at 2:33 PM, DB Tsai dbt...@stanford.edu wrote:
 
  We are submitting the spark job in our tomcat application using
  yarn-cluster mode with great success. As Kevin said, yarn-client mode
  runs driver in your local JVM, and it will have really bad network
  overhead when one do reduce action which will pull all the result from
  executor to your local JVM. Also, since you can only have one spark
  context object in one JVM, it will be tricky to run multiple spark
  jobs concurrently with yarn-client mode.
 
  This is how we submit spark job with yarn-cluster mode. Please use the
  current master code, otherwise, after the job is finished, spark will
  kill the JVM and exit your app.
 
  We setup the configuration of spark in a scala map.
 
     def getArgsFromConf(conf: Map[String, String]): Array[String] = {
       Array[String](
         "--jar", conf.get("app.jar").getOrElse(""),
         "--addJars", conf.get("spark.addJars").getOrElse(""),
         "--class", conf.get("spark.mainClass").getOrElse(""),
         "--num-executors", conf.get("spark.numWorkers").getOrElse("1"),
         "--driver-memory", conf.get("spark.masterMemory").getOrElse("1g"),
         "--executor-memory", conf.get("spark.workerMemory").getOrElse("1g"),
         "--executor-cores", conf.get("spark.workerCores").getOrElse("1"))
     }

     System.setProperty("SPARK_YARN_MODE", "true")
     val sparkConf = new SparkConf
     val args = getArgsFromConf(conf)
     new Client(new ClientArguments(args, sparkConf), hadoopConfig,
       sparkConf).run
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Thu, Jun 19, 2014 at 11:22 AM, Kevin Markey kevin.mar...@oracle.com
 
  wrote:
   Yarn client is much like Spark client mode, except that the executors
   are
   running in Yarn containers managed by the Yarn resource manager on the
   cluster instead of as Spark workers managed by the Spark master.  The
   driver
   executes as a local client in your local JVM.  It communicates with
 the
   workers on the cluster.  Transformations are scheduled on the cluster
 by
   the
   driver's logic.  Actions involve communication between local driver
 and
   remote cluster executors.  So, there is some additional network
   overhead,
   especially if the driver is not co-located on the cluster.  In
   yarn-cluster
   mode -- in contrast, the driver is executed as a thread in a Yarn
   application master on the cluster.
  
   In either case, the assembly JAR must be available to the application
 on
   the
   cluster.  Best to copy it to HDFS and specify its location by
 exporting
   its
   location as SPARK_JAR.
  
   Kevin Markey
  
  
   On 06/19/2014 11:22 AM, Koert Kuipers wrote:
  
   i am trying to understand how yarn-client mode works. i am not using
   spark-submit, but instead launching a spark job from within my own
   application.
  
   i can see my application contacting yarn successfully, but then in
 yarn
   i
   get an immediate error:
  
   Application application_1403117970283_0014 failed 2 times due to AM
   Container for appattempt_1403117970283_0014_02 exited with
 exitCode:
   -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar
 does
   not
   exist
   .Failing this attempt.. Failing the application.
  
   why is yarn trying to fetch my jar, and why as a local file? i would
   expect
   the jar to be send to yarn over the wire upon job submission?
  
  
 
 



Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Koert Kuipers
for a jvm application its not very appealing to me to use spark-submit.
my application uses hadoop, so i should use hadoop jar, and my
application uses spark, so it should use spark-submit. if i add a piece
of code that uses some other system there will be yet another suggested way
to launch it. thats not very scalable, since i can only launch it one way
in the end...


On Thu, Jun 19, 2014 at 4:58 PM, Andrew Or and...@databricks.com wrote:

 Hi Koert and Lukasz,

 The recommended way of not hard-coding configurations in your application
 is through conf/spark-defaults.conf as documented here:
 http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
 However, this is only applicable to
 spark-submit, so this may not be useful to you.

 Depending on how you launch your Spark applications, you can workaround
 this by manually specifying these configs as -Dspark.x=y
 in your java command to launch Spark. This is actually how SPARK_JAVA_OPTS
 used to work before 1.0. Note that spark-submit does
 essentially the same thing, but sets these properties programmatically by
 reading from the conf/spark-defaults.conf file and calling
 System.setProperty("spark.x", "y").

 Note that spark.executor.extraJavaOpts is not intended for spark
 configuration (see http://spark.apache.org/docs/latest/configuration.html
 ).
 SPARK_DAEMON_JAVA_OPTS, as you pointed out, is for Spark daemons like the
 standalone master, worker, and the history server;
 it is also not intended for spark configurations to be picked up by Spark
 executors and drivers. In general, any reference to java opts
 in any variable or config refers to java options, as the name implies, not
 Spark configuration. Unfortunately, it just so happened that we
 used to mix the two in the same environment variable before 1.0.

 Is there a reason you're not using spark-submit? Is it for legacy reasons?
 As of 1.0, most changes to launching Spark applications
 will be done through spark-submit, so you may miss out on relevant new
 features or bug fixes.

 Andrew



 2014-06-19 7:41 GMT-07:00 Koert Kuipers ko...@tresata.com:

 still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark
 standalone.

 for example if i have a akka timeout setting that i would like to be
 applied to every piece of the spark framework (so spark master, spark
 workers, spark executor sub-processes, spark-shell, etc.). i used to do
 that with SPARK_JAVA_OPTS. now i am unsure.

 SPARK_DAEMON_JAVA_OPTS works for the master and workers, but not for the
 spark-shell i think? i tried using SPARK_DAEMON_JAVA_OPTS, and it does not
 seem that useful. for example for a worker it does not apply the settings
 to the executor sub-processes, while for SPARK_JAVA_OPTS it does do that.
 so seems like SPARK_JAVA_OPTS is my only way to change settings for the
 executors, yet its deprecated?


 On Wed, Jun 11, 2014 at 10:59 PM, elyast lukasz.jastrzeb...@gmail.com
 wrote:

 Hi,

 I tried to use SPARK_JAVA_OPTS in spark-env.sh as well as conf/java-opts
 file to set additional java system properties. In this case I could
 connect
 to tachyon without any problem.

 However when I tried setting executor and driver extraJavaOptions in
 spark-defaults.conf it doesn't.

 I suspect the root cause may be the following:

 SparkSubmit doesn't fork additional JVM to actually run either driver or
 executor process and additional system properties are set after JVM is
 created and other classes are loaded. It may happen that the Tachyon
 CommonConf class is already being loaded, and since it's a Singleton it
 won't pick up any changes to system properties.

 Please let me know what you think.

 Can I use conf/java-opts ? since it's not really documented anywhere?

 Best regards
 Lukasz



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/little-confused-about-SPARK-JAVA-OPTS-alternatives-tp5798p7448.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: life of an executor

2014-05-20 Thread Koert Kuipers
if they are tied to the spark context, then why can the subprocess not be
started up with the extra jars (sc.addJars) already on class path? this way
a switch like user-jars-first would be a simple rearranging of the class
path for the subprocess, and the messing with classloaders that is
currently done in executor (which forces people to use reflection in
certain situations and is broken if you want user jars first) would be
history
On May 20, 2014 1:07 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long do they live for? as long as the worker/slave? or are they tied
 to the sparkcontext and live/die with it?

 thx





Re: life of an executor

2014-05-20 Thread Koert Kuipers
just for my clarification: off heap cannot be java objects, correct? so we
are always talking about serialized off-heap storage?
On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 That's one the main motivation in using Tachyon ;)
 http://tachyon-project.org/

 It gives off heap in-memory caching. And starting Spark 0.9, you can cache
 any RDD in Tachyon just by specifying the appropriate StorageLevel.
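
 (A minimal sketch, assuming the OFF_HEAP storage level name used in the
 1.0-era releases:)

 import org.apache.spark.storage.StorageLevel

 val data = sc.parallelize(1 to 1000000)
 data.persist(StorageLevel.OFF_HEAP)   // serialized, Tachyon-backed off-heap cache
 data.count()                          // an action materializes the cached blocks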

 TD




 On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.comwrote:

 I guess it needs to be this way to benefit from caching of RDDs in
 memory. It would be nice however if the RDD cache can be dissociated from
 the JVM heap so that in cases where garbage collection is difficult to
 tune, one could choose to discard the JVM and run the next operation in a
 new one.


 On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long do they live for? as long as the worker/slave? or are they tied
 to the sparkcontext and live/die with it?

 thx







Re: life of an executor

2014-05-20 Thread Koert Kuipers
interesting, so it sounds to me like spark is forced to choose between the
ability to add jars during lifetime and the ability to run tasks with user
classpath first (which important for the ability to run jobs on spark
clusters not under your control, so for the viability of 3rd party spark
apps)


On Tue, May 20, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote:

 One issue is that new jars can be added during the lifetime of a
 SparkContext, which can mean after executors are already started. Off-heap
 storage is always serialized, correct.
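
 (A quick sketch of that sequence; the app name and jar path are hypothetical:)

 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setAppName("my-app"))  // executors start
 // ... later, at any point during the application's lifetime:
 sc.addJar("/path/to/extra.jar")   // hypothetical path, shipped to running executors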


 On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote:

 just for my clarification: off heap cannot be java objects, correct? so
 we are always talking about serialized off-heap storage?
 On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 That's one the main motivation in using Tachyon ;)
 http://tachyon-project.org/

 It gives off heap in-memory caching. And starting Spark 0.9, you can
 cache any RDD in Tachyon just by specifying the appropriate StorageLevel.

 TD




 On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.comwrote:

 I guess it needs to be this way to benefit from caching of RDDs in
 memory. It would be nice however if the RDD cache can be dissociated from
 the JVM heap so that in cases where garbage collection is difficult to
 tune, one could choose to discard the JVM and run the next operation in a
 new one.


 On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia 
 matei.zaha...@gmail.com wrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long do they live for? as long as the worker/slave? or are they
 tied to the sparkcontext and live/die with it?

 thx








Re: File present but file not found exception

2014-05-19 Thread Koert Kuipers
why does it need to be a local file? why not do some filter ops on the hdfs
file and save to hdfs, from where you can create an rdd?

you can read a small file in on the driver program and use sc.parallelize to
turn it into an RDD
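
(A minimal sketch of that second suggestion; the path is hypothetical and the
file must be small enough to comfortably fit on the driver:)

import scala.io.Source

val lines = Source.fromFile("/home/user/small-input.txt").getLines().toList
val rdd = sc.parallelize(lines)
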
On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote:

 I found that if a file is present in all the nodes in the given path in
 localFS, then reading is possible.

 But is there a way to read if the file is present only in certain nodes ??
 [There should be a way !!]

  NEED: Wanted to do some filter ops on an HDFS file, create a local file of
  the result, create an RDD out of it and operate on that.

 Is there any way out ??

 Thanks in advance !




 On Fri, May 9, 2014 at 12:18 AM, Sai Prasanna ansaiprasa...@gmail.comwrote:

 Hi Everyone,

 I think all are pretty busy, the response time in this group has slightly
 increased.

  But anyways, this is a pretty silly problem that I could not get past.

 I have a file in my localFS, but when i try to create an RDD out of it,
  tasks fail and a file not found exception is thrown in the log files.

  var file = sc.textFile("file:///home/sparkcluster/spark/input.txt")
  file.top(1)

  input.txt exists in the above folder but still Spark couldn't find it. Some
 parameters need to be set ??

 Any help is really appreciated. Thanks !!





life of an executor

2014-05-19 Thread Koert Kuipers
from looking at the source code i see executors run in their own jvm
subprocesses.

how long do they live for? as long as the worker/slave? or are they tied to
the sparkcontext and live/die with it?

thx


cant get tests to pass anymore on master

2014-05-16 Thread Koert Kuipers
i used to be able to get all tests to pass.

with java 6 and sbt i get PermGen errors (no matter how high i make the
PermGen). so i have given up on that.

with java 7 i see 1 error in a bagel test and a few in streaming tests. any
ideas? see the error in BagelSuite below.

[info] - large number of iterations *** FAILED *** (10 seconds, 105
milliseconds)
[info]   The code passed to failAfter did not complete within 10 seconds.
(BagelSuite.scala:85)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at
org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250)
[info]   at
org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250)
[info]   at
org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:282)
[info]   at
org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:246)
[info]   at org.apache.spark.bagel.BagelSuite.failAfter(BagelSuite.scala:32)
[info]   at
org.apache.spark.bagel.BagelSuite$$anonfun$3.apply$mcV$sp(BagelSuite.scala:85)
[info]   at
org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85)
[info]   at
org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85)
[info]   at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
[info]   at
org.apache.spark.bagel.BagelSuite.withFixture(BagelSuite.scala:32)


Re: cant get tests to pass anymore on master

2014-05-16 Thread Koert Kuipers
i tried on a few different machines, including a server, all same ubuntu
and same java, and got same errors. i also tried modifying the timeouts in
the unit tests and it did not help.

ok i will try blowing away local maven repo and do clean.


On Thu, May 15, 2014 at 12:49 PM, Sean Owen so...@cloudera.com wrote:

 Since the error concerns a timeout -- is the machine slowish?

 What about blowing away everything in your local maven repo, do a
 clean, etc. to rule out environment issues?

 I'm on OS X here FWIW.

 On Thu, May 15, 2014 at 5:24 PM, Koert Kuipers ko...@tresata.com wrote:
  yeah sure. it is ubuntu 12.04 with jdk1.7.0_40
  what else is relevant that i can provide?
 
 
  On Thu, May 15, 2014 at 12:17 PM, Sean Owen so...@cloudera.com wrote:
 
  FWIW I see no failures. Maybe you can say more about your environment,
  etc.
 
  On Wed, May 7, 2014 at 10:01 PM, Koert Kuipers ko...@tresata.com
 wrote:
   i used to be able to get all tests to pass.
  
   with java 6 and sbt i get PermGen errors (no matter how high i make
 the
   PermGen). so i have given up on that.
  
   with java 7 i see 1 error in a bagel test and a few in streaming
 tests.
   any
   ideas? see the error in BagelSuite below.
  
   [info] - large number of iterations *** FAILED *** (10 seconds, 105
   milliseconds)
   [info]   The code passed to failAfter did not complete within 10
   seconds.
   (BagelSuite.scala:85)
   [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
   [info]   at
  
  
 org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250)
   [info]   at
  
  
 org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250)
   [info]   at
  
 org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:282)
   [info]   at
   org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:246)
   [info]   at
   org.apache.spark.bagel.BagelSuite.failAfter(BagelSuite.scala:32)
   [info]   at
  
  
 org.apache.spark.bagel.BagelSuite$$anonfun$3.apply$mcV$sp(BagelSuite.scala:85)
   [info]   at
  
 org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85)
   [info]   at
  
 org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85)
   [info]   at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
   [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
   [info]   at
   org.apache.spark.bagel.BagelSuite.withFixture(BagelSuite.scala:32)
  
 
 



Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
after removing all class parameters of type Path from my code, i tried
again. different but related error when i set
spark.files.userClassPathFirst=true

now i dont even use FileInputFormat directly. HadoopRDD does...

14/05/16 12:17:17 ERROR Executor: Exception in task ID 45
java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/FileInputFormat
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at
org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at
org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:51)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:57)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1610)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1481)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1331)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)



On Thu, May 15, 2014 at 3:03 PM, Koert Kuipers ko...@tresata.com wrote:

 when i set spark.files.userClassPathFirst=true, i get java serialization
 errors in my tasks, see below. when i set userClassPathFirst back to its
 default of false, the serialization errors are gone. my spark.serializer is
 KryoSerializer.

 the class org.apache.hadoop.fs.Path is in the spark assembly jar, but not
 in my task jars (the ones i added to the SparkConf). so looks like the
 ClosureSerializer is having trouble with this class once the
 ChildExecutorURLClassLoader is used? thats me just guessing.

 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 1.0:5 failed 4 times, most recent failure:
 Exception failure in TID 31 on host node05.tresata.com:
 java.lang.NoClassDefFoundError: org/apache/hadoop/fs/Path
 java.lang.Class.getDeclaredConstructors0(Native Method)
 java.lang.Class.privateGetDeclaredConstructors(Class.java:2398)
 java.lang.Class.getDeclaredConstructors(Class.java:1838)

 java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1697)
 java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:50)
 java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:203)
 java.security.AccessController.doPrivileged(Native Method)

 java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:200)
 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:556)

 java.io.ObjectInputStream.readNonProxyDesc

writing my own RDD

2014-05-16 Thread Koert Kuipers
in writing my own RDD i ran into a few issues with respect to stuff being
private in spark.

in compute i would like to return an iterator that respects task killing
(as HadoopRDD does), but the mechanics for that are inside the private
InterruptibleIterator. also the exception i am supposed to throw
(TaskKilledException) is private to spark.
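
(A rough sketch of what such a compute method would like to do; at the time of
this thread InterruptibleIterator and TaskKilledException were private to
spark, which is exactly the problem described above, so this is illustrative
only:)

import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class MyRDD(sc: SparkContext, data: Seq[Int]) extends RDD[Int](sc, Nil) {
  // a single trivial partition, just to keep the sketch self-contained
  override def getPartitions: Array[Partition] =
    Array(new Partition { override def index: Int = 0 })

  // wrap the underlying iterator so the task responds to kill requests,
  // the same way HadoopRDD does
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    new InterruptibleIterator(context, data.iterator)
}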


Re: cant get tests to pass anymore on master

2014-05-16 Thread Koert Kuipers
yeah sure. it is ubuntu 12.04 with jdk1.7.0_40
what else is relevant that i can provide?


On Thu, May 15, 2014 at 12:17 PM, Sean Owen so...@cloudera.com wrote:

 FWIW I see no failures. Maybe you can say more about your environment, etc.

 On Wed, May 7, 2014 at 10:01 PM, Koert Kuipers ko...@tresata.com wrote:
  i used to be able to get all tests to pass.
 
  with java 6 and sbt i get PermGen errors (no matter how high i make the
  PermGen). so i have given up on that.
 
  with java 7 i see 1 error in a bagel test and a few in streaming tests.
 any
  ideas? see the error in BagelSuite below.
 
  [info] - large number of iterations *** FAILED *** (10 seconds, 105
  milliseconds)
  [info]   The code passed to failAfter did not complete within 10 seconds.
  (BagelSuite.scala:85)
  [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
  [info]   at
 
 org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250)
  [info]   at
 
 org.scalatest.concurrent.Timeouts$$anonfun$failAfter$1.apply(Timeouts.scala:250)
  [info]   at
  org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:282)
  [info]   at
  org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:246)
  [info]   at
 org.apache.spark.bagel.BagelSuite.failAfter(BagelSuite.scala:32)
  [info]   at
 
 org.apache.spark.bagel.BagelSuite$$anonfun$3.apply$mcV$sp(BagelSuite.scala:85)
  [info]   at
  org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85)
  [info]   at
  org.apache.spark.bagel.BagelSuite$$anonfun$3.apply(BagelSuite.scala:85)
  [info]   at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
  [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
  [info]   at
  org.apache.spark.bagel.BagelSuite.withFixture(BagelSuite.scala:32)
 



Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
well, i modified ChildExecutorURLClassLoader to also delegate to
parentClassloader if NoClassDefFoundError is thrown... now i get yet
another error. i am clearly missing something with these classloaders. such
nasty stuff... giving up for now. just going to have to not use
spark.files.userClassPathFirst=true for now, until i have more time to look
at this.

14/05/16 13:58:59 ERROR Executor: Exception in task ID 3
java.lang.ClassCastException: cannot assign instance of scala.None$ to
field org.apache.spark.rdd.RDD.checkpointData of type scala.Option in
instance of MyRDD
at
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
at
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1995)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:60)



On Fri, May 16, 2014 at 1:46 PM, Koert Kuipers ko...@tresata.com wrote:

 after removing all class parameters of type Path from my code, i tried
 again. different but related error when i set
 spark.files.userClassPathFirst=true

 now i dont even use FileInputFormat directly. HadoopRDD does...

 14/05/16 12:17:17 ERROR Executor: Exception in task ID 45
 java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/FileInputFormat
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
 at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at
 org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
 org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:51)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:57)
 at
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1610)
 at
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
 at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1481)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1331)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913

Re: java serialization errors with spark.files.userClassPathFirst=true

2014-05-16 Thread Koert Kuipers
i do not think the current solution will work. i tried writing a version of
ChildExecutorURLClassLoader that does have a proper parent and has a
modified loadClass to reverse the order of parent and child in finding
classes, and that seems to work, but now classes like SparkEnv are loaded
by the child and somehow this means the companion objects are reset or
something like that because i get NPEs.
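
(A rough sketch of the child-first loading order being discussed, not Spark's
actual ChildExecutorURLClassLoader: the user jars are tried first and the
parent is the fallback, so classes like org.apache.hadoop.mapred.FileInputFormat
stay visible:)

import java.net.{URL, URLClassLoader}

class UserFirstClassLoader(urls: Array[URL], parent: ClassLoader)
  extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    val alreadyLoaded = findLoadedClass(name)
    val c =
      if (alreadyLoaded != null) alreadyLoaded
      else {
        try findClass(name)   // look in the user jars first
        catch {
          // fall back to the normal parent-first path for everything else
          case _: ClassNotFoundException => super.loadClass(name, resolve)
        }
      }
    if (resolve) resolveClass(c)
    c
  }
}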


On Fri, May 16, 2014 at 3:54 PM, Koert Kuipers ko...@tresata.com wrote:

 ok i think the issue is visibility: a classloader can see all classes
 loaded by its parent classloader. but userClassLoader does not have a
 parent classloader, so its not able to see any classes that parentLoader
 is responsible for. in my case userClassLoader is trying to get
 AvroInputFormat which probably somewhere statically references
 FileInputFormat, which is invisible to userClassLoader.


 On Fri, May 16, 2014 at 3:32 PM, Koert Kuipers ko...@tresata.com wrote:

 ok i put lots of logging statements in the ChildExecutorURLClassLoader.
 this is what i see:

 * the urls for userClassLoader are correct and includes only my one jar.

 * for one class that only exists in my jar i see it gets loaded correctly
 using userClassLoader

 * for a class that exists in both my jar and spark kernel it tries to use
 userClassLoader and ends up with a NoClassDefFoundError. the class is
 org.apache.avro.mapred.AvroInputFormat and the NoClassDefFoundError is for
 org.apache.hadoop.mapred.FileInputFormat (which the parentClassLoader is
 responsible for since it is not in my jar). i currently catch this
 NoClassDefFoundError and call parentClassLoader.loadClass but thats clearly
 not a solution since it loads the wrong version.



 On Fri, May 16, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote:

 well, i modified ChildExecutorURLClassLoader to also delegate to
 parentClassloader if NoClassDefFoundError is thrown... now i get yet
 another error. i am clearly missing something with these classloaders. such
 nasty stuff... giving up for now. just going to have to not use
 spark.files.userClassPathFirst=true for now, until i have more time to look
 at this.

 14/05/16 13:58:59 ERROR Executor: Exception in task ID 3
 java.lang.ClassCastException: cannot assign instance of scala.None$ to
 field org.apache.spark.rdd.RDD.checkpointData of type scala.Option in
 instance of MyRDD
 at
 java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
 at
 java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1995)

 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 at
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
 scala.collection.immutable.$colon$colon.readObject(List.scala:362)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1891)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 at
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:60)



 On Fri, May 16, 2014 at 1:46 PM, Koert Kuipers ko...@tresata.comwrote:

 after removing all class parameters of type Path from my code, i tried
 again. different but related error when i set
 spark.files.userClassPathFirst=true

 now i dont even use FileInputFormat directly. HadoopRDD does...

 14/05/16 12:17:17 ERROR Executor: Exception in task ID 45
 java.lang.NoClassDefFoundError: org/apache/hadoop/mapred

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Koert Kuipers
hey patrick,
i have a SparkConf i can add them to. i was looking for a way to do this
where they are not hardwired within scala, which is what SPARK_JAVA_OPTS
used to do.
i guess if i just set -Dspark.akka.frameSize=1 on my java app launch
then it will get picked up by the SparkConf too right?
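
(a quick sketch of that check, assuming the default SparkConf constructor,
which loads spark.* system properties:)

import org.apache.spark.SparkConf

// launched with: java -Dspark.akka.frameSize=1 ...
val conf = new SparkConf()
println(conf.get("spark.akka.frameSize"))   // reflects the -D value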


On Wed, May 14, 2014 at 2:54 PM, Patrick Wendell pwend...@gmail.com wrote:

 Just wondering - how are you launching your application? If you want
 to set values like this the right way is to add them to the SparkConf
 when you create a SparkContext.

 val conf = new SparkConf().set("spark.akka.frameSize",
 "1").setAppName(...).setMaster(...)
 val sc = new SparkContext(conf)

 - Patrick

 On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers ko...@tresata.com wrote:
  i have some settings that i think are relevant for my application. they
 are
  spark.akka settings so i assume they are relevant for both executors and
 my
  driver program.
 
  i used to do:
  SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1
 
  now this is deprecated. the alternatives mentioned are:
  * some spark-submit settings which are not relevant to me since i do not
 use
  spark-submit (i launch spark jobs from an existing application)
  * spark.executor.extraJavaOptions to set -X options. i am not sure what
 -X
  options are, but it doesnt sound like what i need, since its only for
  executors
  * SPARK_DAEMON_OPTS to set java options for standalone daemons (i.e.
 master,
  worker), that sounds like i should not use it since i am trying to change
  settings for an app, not a daemon.
 
  am i missing the correct setting to use?
  should i do -Dspark.akka.frameSize=1 on my application launch
 directly,
  and then also set spark.executor.extraJavaOptions? so basically repeat
 it?



little confused about SPARK_JAVA_OPTS alternatives

2014-05-14 Thread Koert Kuipers
i have some settings that i think are relevant for my application. they are
spark.akka settings so i assume they are relevant for both executors and my
driver program.

i used to do:
SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1

now this is deprecated. the alternatives mentioned are:
* some spark-submit settings which are not relevant to me since i do not
use spark-submit (i launch spark jobs from an existing application)
* spark.executor.extraJavaOptions to set -X options. i am not sure what -X
options are, but it doesnt sound like what i need, since its only for
executors
* SPARK_DAEMON_OPTS to set java options for standalone daemons (i.e.
master, worker), that sounds like i should not use it since i am trying to
change settings for an app, not a daemon.

am i missing the correct setting to use?
should i do -Dspark.akka.frameSize=1 on my application launch directly,
and then also set spark.executor.extraJavaOptions? so basically repeat it?


Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
resending... my email somehow never made it to the user list.


On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote:

 in writing my own RDD i ran into a few issues with respect to stuff being
 private in spark.

 in compute i would like to return an iterator that respects task killing
 (as HadoopRDD does), but the mechanics for that are inside the private
 InterruptibleIterator. also the exception i am supposed to throw
 (TaskKilledException) is private to spark.



Re: writing my own RDD

2014-05-11 Thread Koert Kuipers
will do
On May 11, 2014 6:44 PM, Aaron Davidson ilike...@gmail.com wrote:

 You got a good point there, those APIs should probably be marked as
 @DeveloperAPI. Would you mind filing a JIRA for that (
 https://issues.apache.org/jira/browse/SPARK)?


 On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers ko...@tresata.com wrote:

 resending... my email somehow never made it to the user list.


 On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote:

 in writing my own RDD i ran into a few issues with respect to stuff
 being private in spark.

 in compute i would like to return an iterator that respects task killing
 (as HadoopRDD does), but the mechanics for that are inside the private
 InterruptibleIterator. also the exception i am supposed to throw
 (TaskKilledException) is private to spark.






Re: os buffer cache does not cache shuffle output file

2014-05-10 Thread Koert Kuipers
yes it seems broken. i got only a few emails in last few days


On Fri, May 9, 2014 at 7:24 AM, wxhsdp wxh...@gmail.com wrote:

 is there something wrong with the mailing list? very few people see my
 thread



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/os-buffer-cache-does-not-cache-shuffle-output-file-tp5478p5521.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: performance improvement on second operation...without caching?

2014-05-03 Thread Koert Kuipers
Hey Matei,
Not sure i understand that. These are 2 separate jobs. So the second job
takes advantage of the fact that there is map output left somewhere on disk
from the first job, and re-uses that?


On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.comwrote:

 Hi Diana,

 Apart from these reasons, in a multi-stage job, Spark saves the map output
 files from map stages to the filesystem, so it only needs to rerun the last
 reduce stage. This is why you only saw one stage executing. These files are
 saved for fault recovery but they speed up subsequent runs.

 Matei

 On May 3, 2014, at 5:21 PM, Patrick Wendell pwend...@gmail.com wrote:

 Ethan,

 What you said is actually not true, Spark won't cache RDD's unless you ask
 it to.

 The observation here - that running the same job can speed up
 substantially even without caching - is common. This is because other
 components in the stack are performing caching and optimizations. Two that
 can make a huge difference are:

 1. The OS buffer cache. Which will keep recently read disk blocks in
 memory.
 2. The Java just-in-time compiler (JIT) which will use runtime profiling
 to significantly speed up execution speed.

 These can make a huge difference if you are running the same job
 over-and-over. And there are other things like the OS network stack
 increasing TCP windows and so forth. These will all improve response time
 as a spark program executes.


 On Fri, May 2, 2014 at 9:27 AM, Ethan Jewett esjew...@gmail.com wrote:

 I believe Spark caches RDDs it has memory for regardless of whether you
 actually call the 'cache' method on the RDD. The 'cache' method just tips
 off Spark that the RDD should have higher priority. At least, that is my
 experience and it seems to correspond with your experience and with my
 recollection of other discussions on this topic on the list. However, going
 back and looking at the programming guide, this is not the way the
 cache/persist behavior is described. Does the guide need to be updated?


 On Fri, May 2, 2014 at 9:04 AM, Diana Carroll dcarr...@cloudera.comwrote:

 I'm just Posty McPostalot this week, sorry folks! :-)

 Anyway, another question today:
 I have a bit of code that is pretty time consuming (pasted at the end of
 the message):
 It reads in a bunch of XML files, parses them, extracts some data in a
 map, counts (using reduce), and then sorts.   All stages are executed when
 I do a final operation (take).  The first stage is the most expensive: on
 first run it takes 30s to a minute.

 I'm not caching anything.

 When I re-execute that take at the end, I expected it to re-execute all
 the same stages, and take approximately the same amount of time, but it
 didn't.  The second take executes only a single stage, which runs
 very fast: the whole operation takes less than 1 second (down from 5
 minutes!)

 While this is awesome (!) I don't understand it.  If I'm not caching
 data, why would I see such a marked performance improvement on subsequent
 execution?

 (or is this related to the known .9.1 bug about sortByKey executing an
 action when it shouldn't?)

 Thanks,
 Diana

 # load XML files containing device activation records.
 # Find the most common device models activated
 import xml.etree.ElementTree as ElementTree

 # Given a partition containing multi-line XML, parse the contents.
 # Return an iterator of activation Elements contained in the partition
 def getactivations(fileiterator):
     s = ''
     for i in fileiterator: s = s + str(i)
     filetree = ElementTree.fromstring(s)
     return filetree.getiterator('activation')

 # Get the model name from a device activation record
 def getmodel(activation):
     return activation.find('model').text

 filename = "hdfs://localhost/user/training/activations/*.xml"

 # parse each partition as a file into an activation XML record
 activations = sc.textFile(filename)
 activationTrees = activations.mapPartitions(lambda xml: getactivations(xml))
 models = activationTrees.map(lambda activation: getmodel(activation))

 # count and sort activations by model
 topmodels = models.map(lambda model: (model,1))\
     .reduceByKey(lambda v1,v2: v1+v2)\
     .map(lambda (model,count): (count,model))\
     .sortByKey(ascending=False)

 # display the top 10 models
 for (count,model) in topmodels.take(10):
     print "Model %s (%s)" % (model,count)

 # repeat!
 for (count,model) in topmodels.take(10):
     print "Model %s (%s)" % (model,count)







Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-05-02 Thread Koert Kuipers
not sure why applying concat to reference.conf didn't work for you. since
it simply concatenates the files, the key akka.version should be preserved.
we had the same situation for a while without issues.
On May 1, 2014 8:46 PM, Shivani Rao raoshiv...@gmail.com wrote:

 Hello Koert,

 That did not work. I specified it in my email already. But I figured a way
 around it  by excluding akka dependencies

 Shivani


 On Tue, Apr 29, 2014 at 12:37 PM, Koert Kuipers ko...@tresata.com wrote:

 you need to merge reference.conf files and its no longer an issue.

 see the Build file for spark itself:
   case "reference.conf" => MergeStrategy.concat


 On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.comwrote:

 Hello folks,

 I was going to post this question to spark user group as well. If you
 have any leads on how to solve this issue please let me know:

 I am trying to build a basic spark project (spark depends on akka) and I
 am trying to create a fatjar using sbt assembly. The goal is to run the
 fatjar via commandline as follows:
  java -cp path to my spark fatjar mainclassname

 I encountered deduplication errors in the following akka libraries
 during sbt assembly
 akka-remote_2.10-2.2.3.jar with
 akka-remote_2.10-2.2.3-shaded-protobuf.jar
  akka-actor_2.10-2.2.3.jar with akka-actor_2.10-2.2.3-shaded-protobuf.jar

 I resolved them by using MergeStrategy.first and that helped with a
 successful compilation of the sbt assembly command. But for some or the
 other configuration parameter in the akka kept throwing up with the
 following message

 Exception in thread main com.typesafe.config.ConfigException$Missing:
 No configuration setting found for key

 I then used MergeStrategy.concat for reference.conf and I started
 getting this repeated error

 Exception in thread main com.typesafe.config.ConfigException$Missing:
 No configuration setting found for key 'akka.version'.

 I noticed that akka.version is only in the akka-actor jars and not in
 the akka-remote. The resulting reference.conf (in my final fat jar) does
 not contain akka.version either. So the strategy is not working.

 There are several things I could try

 a) Use the following dependency https://github.com/sbt/sbt-proguard
 b) Write a build.scala to handle merging of reference.conf

 https://spark-project.atlassian.net/browse/SPARK-395

 http://letitcrash.com/post/21025950392/howto-sbt-assembly-vs-reference-conf

 c) Create a reference.conf by merging all akka configurations and then
 passing it in my java -cp command as shown below

 java -cp jar-name -DConfig.file=config

 The main issue is that if I run the spark jar as sbt run there are no
 errors in accessing any of the akka configuration parameters. It is only
 when I run it via command line (java -cp jar-name classname) that I
 encounter the error.

 Which of these is a long term fix to akka issues? For now, I removed the
 akka dependencies and that solved the problem, but I know that is not a
 long term solution

 Regards,
 Shivani

 --
 Software Engineer
 Analytics Engineering Team@ Box
 Mountain View, CA





 --
 Software Engineer
 Analytics Engineering Team@ Box
 Mountain View, CA



Re: Storage information about an RDD from the API

2014-04-29 Thread Koert Kuipers
SparkContext.getRDDStorageInfo
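
(A quick sketch; the field names are the ones exposed by RDDInfo in this era:)

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id} '${info.name}': " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memSize=${info.memSize}, diskSize=${info.diskSize}")
}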


On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth 
andras.nem...@lynxanalytics.com wrote:

 Hi,

 Is it possible to know from code about an RDD if it is cached, and more
 precisely, how many of its partitions are cached in memory and how many are
 cached on disk? I know I can get the storage level, but I also want to know
 the current actual caching status. Knowing memory consumption would also be
 awesome. :)

 Basically what I'm looking for is the information on the storage tab of
 the UI, but accessible from the API.

 Thanks,
 Andras



Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-04-29 Thread Koert Kuipers
you need to merge reference.conf files and its no longer an issue.

see the Build file for spark itself:
  case "reference.conf" => MergeStrategy.concat
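
(A sketch of the corresponding sbt-assembly setting, assuming the 2014-era
plugin syntax and that assemblySettings / AssemblyKeys are already in scope;
adjust to your plugin version:)

mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case "reference.conf" => MergeStrategy.concat   // concatenate akka reference.conf files
    case x => old(x)                                // keep the default for everything else
  }
}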


On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.com wrote:

 Hello folks,

 I was going to post this question to spark user group as well. If you have
 any leads on how to solve this issue please let me know:

 I am trying to build a basic spark project (spark depends on akka) and I
 am trying to create a fatjar using sbt assembly. The goal is to run the
 fatjar via commandline as follows:
  java -cp path to my spark fatjar mainclassname

 I encountered deduplication errors in the following akka libraries during
 sbt assembly
 akka-remote_2.10-2.2.3.jar with akka-remote_2.10-2.2.3-shaded-protobuf.jar
  akka-actor_2.10-2.2.3.jar with akka-actor_2.10-2.2.3-shaded-protobuf.jar

 I resolved them by using MergeStrategy.first and that helped with a
 successful compilation of the sbt assembly command. But for some or the
 other configuration parameter in the akka kept throwing up with the
 following message

 Exception in thread main com.typesafe.config.ConfigException$Missing:
 No configuration setting found for key

 I then used MergeStrategy.concat for reference.conf and I started
 getting this repeated error

 Exception in thread main com.typesafe.config.ConfigException$Missing: No
 configuration setting found for key 'akka.version'.

 I noticed that akka.version is only in the akka-actor jars and not in the
 akka-remote. The resulting reference.conf (in my final fat jar) does not
 contain akka.version either. So the strategy is not working.

 There are several things I could try

 a) Use the following dependency https://github.com/sbt/sbt-proguard
 b) Write a build.scala to handle merging of reference.conf

 https://spark-project.atlassian.net/browse/SPARK-395
 http://letitcrash.com/post/21025950392/howto-sbt-assembly-vs-reference-conf

 c) Create a reference.conf by merging all akka configurations and then
 passing it in my java -cp command as shown below

 java -cp jar-name -DConfig.file=config

 The main issue is that if I run the spark jar as sbt run there are no
 errors in accessing any of the akka configuration parameters. It is only
 when I run it via command line (java -cp jar-name classname) that I
 encounter the error.

 Which of these is a long term fix to akka issues? For now, I removed the
 akka dependencies and that solved the problem, but I know that is not a
 long term solution

 Regards,
 Shivani

 --
 Software Engineer
 Analytics Engineering Team@ Box
 Mountain View, CA



Re: ui broken in latest 1.0.0

2014-04-19 Thread Koert Kuipers
got it. makes sense. i am surprised it worked before...
On Apr 18, 2014 9:12 PM, Andrew Or and...@databricks.com wrote:

 Hi Koert,

 I've tracked down what the bug is. The caveat is that each StageInfo only
 keeps around the RDDInfo of the last RDD associated with the Stage. More
 concretely, if you have something like

 sc.parallelize(1 to 1000).persist.map(i => (i, i)).count()

 This creates two RDDs within one Stage, and the persisted RDD doesn't show
 up on the UI because it is not the last RDD of this stage. I filed a JIRA
 for this here: https://issues.apache.org/jira/browse/SPARK-1538.

 Thanks again for reporting this. I will push out a fix shortly.
 Andrew


 On Tue, Apr 8, 2014 at 1:30 PM, Koert Kuipers ko...@tresata.com wrote:

 our one cached RDD in this run has id 3




 *** onStageSubmitted **
 rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(2 - RDD 2 (2) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


 *** onTaskEnd **
 _rddInfoMap: Map(2 - RDD 2 (2) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
 storageStatusList: List(StorageStatus(BlockManagerId(driver,
 192.168.3.169, 34330, 0),579325132,Map()))


 *** onStageCompleted **
 _rddInfoMap: Map()



 *** onStageSubmitted **
 rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(7 - RDD 7 (7) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


 *** updateRDDInfo **


 *** onTaskEnd **

 _rddInfoMap: Map(7 - RDD 7 (7) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
  storageStatusList: List(StorageStatus(BlockManagerId(driver,
 192.168.3.169, 34330, 0),579325132,Map(rdd_3_0 -
 BlockStatus(StorageLevel(false, true, false, true, 1),19944,0,0


 *** onStageCompleted **
 _rddInfoMap: Map()



 On Tue, Apr 8, 2014 at 4:20 PM, Koert Kuipers ko...@tresata.com wrote:

 1) at the end of the callback

 2) yes we simply expose sc.getRDDStorageInfo to the user via REST

 3) yes exactly. we define the RDDs at startup, all of them are cached.
 from that point on we only do calculations on these cached RDDs.

 i will add some more println statements for storageStatusList



 On Tue, Apr 8, 2014 at 4:01 PM, Andrew Or and...@databricks.com wrote:

 Hi Koert,

 Thanks for pointing this out. However, I am unable to reproduce this
 locally. It seems that there is a discrepancy between what the
 BlockManagerUI and the SparkContext think is persisted. This is strange
 because both sources ultimately derive this information from the same place
 - by doing sc.getExecutorStorageStatus. I have a couple of questions for
 you:

 1) In your print statements, do you print them in the beginning or at
 the end of each callback? It would be good to keep them at the end, since
 in the beginning the data structures have not been processed yet.
 2) You mention that you get the RDD info through your own API. How do
 you get this information? Is it through sc.getRDDStorageInfo?
 3) What did your application do to produce this behavior? Did you make
 an RDD, persist it once, and then use it many times afterwards or something
 similar?

 It would be super helpful if you could also print out what
 StorageStatusListener's storageStatusList looks like by the end of each
 onTaskEnd. I will continue to look into this on my side, but do let me know
 once you have any updates.

 Andrew


 On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers ko...@tresata.comwrote:

 yet at same time i can see via our own api:

 storageInfo: {
 diskSize: 0,
 memSize: 19944,
 numCachedPartitions: 1,
 numPartitions: 1
 }



 On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.comwrote:

 i put some println statements in BlockManagerUI

 i have RDDs that are cached in memory. I see this:


 *** onStageSubmitted **
 rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false,
 false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(2 - RDD 2 (2) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i tried again with latest master, which includes commit below, but ui page
still shows nothing on storage tab.
koert



commit ada310a9d3d5419e101b24d9b41398f609da1ad3
Author: Andrew Or andrewo...@gmail.com
Date:   Mon Mar 31 23:01:14 2014 -0700

[Hot Fix #42] Persisted RDD disappears on storage page if re-used

If a previously persisted RDD is re-used, its information disappears
from the Storage page.

This is because the tasks associated with re-using the RDD do not
report the RDD's blocks as updated (which is correct). On stage submit,
however, we overwrite any existing

Author: Andrew Or andrewo...@gmail.com

Closes #281 from andrewor14/ui-storage-fix and squashes the following
commits:

408585a [Andrew Or] Fix storage UI bug



On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote:

 got it thanks


 On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:

 This is fixed in https://github.com/apache/spark/pull/281. Please try
 again with the latest master. -Xiangrui

 On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com wrote:
  i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days
 ago
  (apr 5) that the application detail ui no longer shows any RDDs on the
  storage tab, despite the fact that they are definitely cached.
 
  i am running spark in standalone mode.





assumption that lib_managed is present

2014-04-08 Thread Koert Kuipers
when i start spark-shell i now see

ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or
directory

we do not package a lib_managed with our spark build (never did). maybe the
logic in compute-classpath.sh that searches for datanucleus should check
for the existence of lib_managed before doing ls on it?


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
note that for a cached rdd in the spark shell it all works fine. but
something is going wrong with the spark-shell in our applications that
extensively cache and re-use RDDs


On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote:

 i tried again with latest master, which includes commit below, but ui page
 still shows nothing on storage tab.
 koert



 commit ada310a9d3d5419e101b24d9b41398f609da1ad3
 Author: Andrew Or andrewo...@gmail.com
 Date:   Mon Mar 31 23:01:14 2014 -0700

 [Hot Fix #42] Persisted RDD disappears on storage page if re-used

 If a previously persisted RDD is re-used, its information disappears
 from the Storage page.

 This is because the tasks associated with re-using the RDD do not
 report the RDD's blocks as updated (which is correct). On stage submit,
 however, we overwrite any existing

 Author: Andrew Or andrewo...@gmail.com

 Closes #281 from andrewor14/ui-storage-fix and squashes the following
 commits:

 408585a [Andrew Or] Fix storage UI bug



 On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote:

 got it thanks


 On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:

 This is fixed in https://github.com/apache/spark/pull/281. Please try
 again with the latest master. -Xiangrui

 On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com wrote:
  i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days
 ago
  (apr 5) that the application detail ui no longer shows any RDDs on
 the
  storage tab, despite the fact that they are definitely cached.
 
  i am running spark in standalone mode.






Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
sorry, i meant to say: note that for a cached rdd in the spark shell it all
works fine. but something is going wrong with the SPARK-APPLICATION-UI in
our applications that extensively cache and re-use RDDs


On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote:

 note that for a cached rdd in the spark shell it all works fine. but
 something is going wrong with the spark-shell in our applications that
 extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote:

 i tried again with latest master, which includes commit below, but ui
 page still shows nothing on storage tab.
  koert



 commit ada310a9d3d5419e101b24d9b41398f609da1ad3
 Author: Andrew Or andrewo...@gmail.com
 Date:   Mon Mar 31 23:01:14 2014 -0700

 [Hot Fix #42] Persisted RDD disappears on storage page if re-used

 If a previously persisted RDD is re-used, its information disappears
 from the Storage page.

 This is because the tasks associated with re-using the RDD do not
 report the RDD's blocks as updated (which is correct). On stage submit,
 however, we overwrite any existing

 Author: Andrew Or andrewo...@gmail.com

 Closes #281 from andrewor14/ui-storage-fix and squashes the following
 commits:

 408585a [Andrew Or] Fix storage UI bug



 On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote:

 got it thanks


 On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:

 This is fixed in https://github.com/apache/spark/pull/281. Please try
 again with the latest master. -Xiangrui

 On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com
 wrote:
  i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
 days ago
  (apr 5) that the application detail ui no longer shows any RDDs on
 the
  storage tab, despite the fact that they are definitely cached.
 
  i am running spark in standalone mode.







Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i call an action after cache, and i can see that the RDDs are fully
cached using context.getRDDStorageInfo which we expose via our own api.

i did not run make-distribution.sh, we have our own scripts to build a
distribution. however if your question is if i correctly deployed the
latest build, let me check to be sure.


On Tue, Apr 8, 2014 at 12:43 PM, Xiangrui Meng men...@gmail.com wrote:

 That commit did work for me. Could you confirm the following:

 1) After you called cache(), did you make any actions like count() or
 reduce()? If you don't materialize the RDD, it won't show up in the
 storage tab.

 2) Did you run ./make-distribution.sh after you switched to the current
 master?

 Xiangrui

 On Tue, Apr 8, 2014 at 9:33 AM, Koert Kuipers ko...@tresata.com wrote:
  i tried again with latest master, which includes commit below, but ui
 page
  still shows nothing on storage tab.
  koert
 
 
 
  commit ada310a9d3d5419e101b24d9b41398f609da1ad3
  Author: Andrew Or andrewo...@gmail.com
  Date:   Mon Mar 31 23:01:14 2014 -0700
 
  [Hot Fix #42] Persisted RDD disappears on storage page if re-used
 
  If a previously persisted RDD is re-used, its information disappears
  from the Storage page.
 
  This is because the tasks associated with re-using the RDD do not
 report
  the RDD's blocks as updated (which is correct). On stage submit,
 however, we
  overwrite any existing
 
  Author: Andrew Or andrewo...@gmail.com
 
  Closes #281 from andrewor14/ui-storage-fix and squashes the following
  commits:
 
  408585a [Andrew Or] Fix storage UI bug
 
 
 
  On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote:
 
  got it thanks
 
 
  On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:
 
  This is fixed in https://github.com/apache/spark/pull/281. Please try
  again with the latest master. -Xiangrui
 
  On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com
 wrote:
   i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
 days
   ago
   (apr 5) that the application detail ui no longer shows any RDDs on
   the
   storage tab, despite the fact that they are definitely cached.
  
   i am running spark in standalone mode.
 
 
 



Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i am definitely using latest


On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote:

 That commit fixed the exact problem you described. That is why I want to
 confirm that you switched to the master branch. bin/spark-shell doesn't
 detect code changes, so you need to run ./make-distribution.sh to
 re-compile Spark first. -Xiangrui


 On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote:

 sorry, i meant to say: note that for a cached rdd in the spark shell it
 all works fine. but something is going wrong with the SPARK-APPLICATION-UI
 in our applications that extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote:

 note that for a cached rdd in the spark shell it all works fine. but
 something is going wrong with the spark-shell in our applications that
 extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.comwrote:

 i tried again with latest master, which includes commit below, but ui
 page still shows nothing on storage tab.
  koert



 commit ada310a9d3d5419e101b24d9b41398f609da1ad3
 Author: Andrew Or andrewo...@gmail.com
 Date:   Mon Mar 31 23:01:14 2014 -0700

 [Hot Fix #42] Persisted RDD disappears on storage page if re-used

 If a previously persisted RDD is re-used, its information
 disappears from the Storage page.

 This is because the tasks associated with re-using the RDD do not
 report the RDD's blocks as updated (which is correct). On stage submit,
 however, we overwrite any existing

 Author: Andrew Or andrewo...@gmail.com

 Closes #281 from andrewor14/ui-storage-fix and squashes the
 following commits:

 408585a [Andrew Or] Fix storage UI bug



 On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.comwrote:

 got it thanks


 On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.comwrote:

 This is fixed in https://github.com/apache/spark/pull/281. Please try
 again with the latest master. -Xiangrui

 On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com
 wrote:
  i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
 days ago
  (apr 5) that the application detail ui no longer shows any RDDs
 on the
  storage tab, despite the fact that they are definitely cached.
 
  i am running spark in standalone mode.









Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i put some println statements in BlockManagerUI

i have RDDs that are cached in memory. I see this:


*** onStageSubmitted **
rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(2 -> RDD 2 (2) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


*** onTaskEnd **
Map(2 -> RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B)


*** onStageCompleted **
Map()

*** onStageSubmitted **
rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(7 -> RDD 7 (7) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)

*** onTaskEnd **
Map(7 -> RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B)

*** onStageCompleted **
Map()


The storage levels you see here are never the ones of my RDDs, and
apparently updateRDDInfo never gets called (i had a println in there too).
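
for anyone who wants to reproduce this without patching spark, roughly the
same information can be dumped from a listener registered on the context (a
sketch only, assuming an existing SparkContext sc and the SparkListener
developer api; callback signatures differ a bit across versions):

  import org.apache.spark.scheduler._

  // minimal listener that dumps the driver's view of rdd storage at the end
  // of each callback, similar to the printlns above
  val debugListener = new SparkListener {
    override def onStageSubmitted(e: SparkListenerStageSubmitted) = dump("onStageSubmitted")
    override def onTaskEnd(e: SparkListenerTaskEnd) = dump("onTaskEnd")
    override def onStageCompleted(e: SparkListenerStageCompleted) = dump("onStageCompleted")
    private def dump(where: String) {
      println("*** " + where + " ***")
      sc.getRDDStorageInfo.foreach(println)
    }
  }
  sc.addSparkListener(debugListener)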


On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers ko...@tresata.com wrote:

 yes i am definitely using latest


 On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote:

 That commit fixed the exact problem you described. That is why I want to
 confirm that you switched to the master branch. bin/spark-shell doesn't
 detect code changes, so you need to run ./make-distribution.sh to
 re-compile Spark first. -Xiangrui


 On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote:

 sorry, i meant to say: note that for a cached rdd in the spark shell it
 all works fine. but something is going wrong with the SPARK-APPLICATION-UI
 in our applications that extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.comwrote:

 note that for a cached rdd in the spark shell it all works fine. but
 something is going wrong with the spark-shell in our applications that
 extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.comwrote:

 i tried again with latest master, which includes commit below, but ui
 page still shows nothing on storage tab.
  koert



 commit ada310a9d3d5419e101b24d9b41398f609da1ad3
 Author: Andrew Or andrewo...@gmail.com
 Date:   Mon Mar 31 23:01:14 2014 -0700

 [Hot Fix #42] Persisted RDD disappears on storage page if re-used

 If a previously persisted RDD is re-used, its information
 disappears from the Storage page.

 This is because the tasks associated with re-using the RDD do not
 report the RDD's blocks as updated (which is correct). On stage submit,
 however, we overwrite any existing

 Author: Andrew Or andrewo...@gmail.com

 Closes #281 from andrewor14/ui-storage-fix and squashes the
 following commits:

 408585a [Andrew Or] Fix storage UI bug



 On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.comwrote:

 got it thanks


 On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.comwrote:

 This is fixed in https://github.com/apache/spark/pull/281. Please
 try
 again with the latest master. -Xiangrui

 On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com
 wrote:
  i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
 days ago
  (apr 5) that the application detail ui no longer shows any RDDs
 on the
  storage tab, despite the fact that they are definitely cached.
 
  i am running spark in standalone mode.










Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yet at same time i can see via our own api:

storageInfo: {
diskSize: 0,
memSize: 19944,
numCachedPartitions: 1,
numPartitions: 1
}



On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote:

 i put some println statements in BlockManagerUI

 i have RDDs that are cached in memory. I see this:


 *** onStageSubmitted **
 rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
 CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
 B; DiskSize: 0.0 B
 _rddInfoMap: Map(2 - RDD 2 (2) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


 *** onTaskEnd **
 Map(2 - RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
 CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
 B; DiskSize: 0.0 B)


 *** onStageCompleted **
 Map()

 *** onStageSubmitted **
 rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1);
 CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
 B; DiskSize: 0.0 B
 _rddInfoMap: Map(7 - RDD 7 (7) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)

 *** onTaskEnd **
 Map(7 - RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1);
 CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
 B; DiskSize: 0.0 B)

 *** onStageCompleted **
 Map()


 The storagelevels you see here are never the ones of my RDDs. and
 apparently updateRDDInfo never gets called (i had println in there too).


 On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers ko...@tresata.com wrote:

 yes i am definitely using latest


 On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote:

 That commit fixed the exact problem you described. That is why I want to
 confirm that you switched to the master branch. bin/spark-shell doesn't
 detect code changes, so you need to run ./make-distribution.sh to
 re-compile Spark first. -Xiangrui


 On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote:

 sorry, i meant to say: note that for a cached rdd in the spark shell it
 all works fine. but something is going wrong with the SPARK-APPLICATION-UI
 in our applications that extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.comwrote:

 note that for a cached rdd in the spark shell it all works fine. but
 something is going wrong with the spark-shell in our applications that
 extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.comwrote:

 i tried again with latest master, which includes commit below, but ui
 page still shows nothing on storage tab.
  koert



 commit ada310a9d3d5419e101b24d9b41398f609da1ad3
 Author: Andrew Or andrewo...@gmail.com
 Date:   Mon Mar 31 23:01:14 2014 -0700

 [Hot Fix #42] Persisted RDD disappears on storage page if re-used

 If a previously persisted RDD is re-used, its information
 disappears from the Storage page.

 This is because the tasks associated with re-using the RDD do not
 report the RDD's blocks as updated (which is correct). On stage submit,
 however, we overwrite any existing

 Author: Andrew Or andrewo...@gmail.com

 Closes #281 from andrewor14/ui-storage-fix and squashes the
 following commits:

 408585a [Andrew Or] Fix storage UI bug



 On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.comwrote:

 got it thanks


 On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.comwrote:

 This is fixed in https://github.com/apache/spark/pull/281. Please
 try
 again with the latest master. -Xiangrui

 On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers ko...@tresata.com
 wrote:
  i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
 days ago
  (apr 5) that the application detail ui no longer shows any RDDs
 on the
  storage tab, despite the fact that they are definitely cached.
 
  i am running spark in standalone mode.











Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
1) at the end of the callback

2) yes we simply expose sc.getRDDStorageInfo to the user via REST

3) yes exactly. we define the RDDs at startup, all of them are cached. from
that point on we only do calculations on these cached RDDs.

i will add some more println statements for storageStatusList



On Tue, Apr 8, 2014 at 4:01 PM, Andrew Or and...@databricks.com wrote:

 Hi Koert,

 Thanks for pointing this out. However, I am unable to reproduce this
 locally. It seems that there is a discrepancy between what the
 BlockManagerUI and the SparkContext think is persisted. This is strange
 because both sources ultimately derive this information from the same place
 - by doing sc.getExecutorStorageStatus. I have a couple of questions for
 you:

 1) In your print statements, do you print them in the beginning or at the
 end of each callback? It would be good to keep them at the end, since in
 the beginning the data structures have not been processed yet.
 2) You mention that you get the RDD info through your own API. How do you
 get this information? Is it through sc.getRDDStorageInfo?
 3) What did your application do to produce this behavior? Did you make an
 RDD, persist it once, and then use it many times afterwards or something
 similar?

 It would be super helpful if you could also print out what
 StorageStatusListener's storageStatusList looks like by the end of each
 onTaskEnd. I will continue to look into this on my side, but do let me know
 once you have any updates.

 Andrew


 On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers ko...@tresata.com wrote:

 yet at same time i can see via our own api:

 storageInfo: {
 diskSize: 0,
 memSize: 19944,
 numCachedPartitions: 1,
 numPartitions: 1
 }



 On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote:

 i put some println statements in BlockManagerUI

 i have RDDs that are cached in memory. I see this:


 *** onStageSubmitted **
 rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(2 - RDD 2 (2) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


 *** onTaskEnd **
 Map(2 - RDD 2 (2) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B)


 *** onStageCompleted **
 Map()

 *** onStageSubmitted **
 rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(7 - RDD 7 (7) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)

 *** onTaskEnd **
 Map(7 - RDD 7 (7) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B)

 *** onStageCompleted **
 Map()


 The storagelevels you see here are never the ones of my RDDs. and
 apparently updateRDDInfo never gets called (i had println in there too).


 On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers ko...@tresata.com wrote:

 yes i am definitely using latest


 On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote:

 That commit fixed the exact problem you described. That is why I want
 to confirm that you switched to the master branch. bin/spark-shell doesn't
 detect code changes, so you need to run ./make-distribution.sh to
 re-compile Spark first. -Xiangrui


 On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.comwrote:

 sorry, i meant to say: note that for a cached rdd in the spark shell
 it all works fine. but something is going wrong with the
 SPARK-APPLICATION-UI in our applications that extensively cache and 
 re-use
 RDDs


 On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.comwrote:

 note that for a cached rdd in the spark shell it all works fine. but
 something is going wrong with the spark-shell in our applications that
 extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.comwrote:

 i tried again with latest master, which includes commit below, but
 ui page still shows nothing on storage tab.
  koert



 commit ada310a9d3d5419e101b24d9b41398f609da1ad3
 Author: Andrew Or andrewo...@gmail.com
 Date:   Mon Mar 31 23:01:14 2014 -0700

 [Hot Fix #42] Persisted RDD disappears on storage page if
 re-used

 If a previously persisted RDD is re-used, its information
 disappears from the Storage page.

 This is because

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
our one cached RDD in this run has id 3



*** onStageSubmitted **
rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(2 -> RDD 2 (2) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


*** onTaskEnd **
_rddInfoMap: Map(2 -> RDD 2 (2) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
storageStatusList: List(StorageStatus(BlockManagerId(driver,
192.168.3.169, 34330, 0),579325132,Map()))


*** onStageCompleted **
_rddInfoMap: Map()


*** onStageSubmitted **
rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(7 -> RDD 7 (7) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


*** updateRDDInfo **


*** onTaskEnd **
_rddInfoMap: Map(7 -> RDD 7 (7) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
storageStatusList: List(StorageStatus(BlockManagerId(driver,
192.168.3.169, 34330, 0),579325132,Map(rdd_3_0 ->
BlockStatus(StorageLevel(false, true, false, true, 1),19944,0,0


*** onStageCompleted **
_rddInfoMap: Map()
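
so the block for rdd 3 (rdd_3_0) is clearly present in the storage status,
just not in _rddInfoMap. the equivalent driver-side check would look roughly
like this (a sketch; the StorageStatus field names are from memory and may
differ slightly between versions):

  // list the cached rdd blocks per block manager, like storageStatusList above
  sc.getExecutorStorageStatus.foreach { status =>
    val rddBlocks = status.blocks.keys.filter(_.toString.startsWith("rdd_"))
    println(status.blockManagerId + " -> " + rddBlocks.mkString(", "))
  }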



On Tue, Apr 8, 2014 at 4:20 PM, Koert Kuipers ko...@tresata.com wrote:

 1) at the end of the callback

 2) yes we simply expose sc.getRDDStorageInfo to the user via REST

 3) yes exactly. we define the RDDs at startup, all of them are cached.
 from that point on we only do calculations on these cached RDDs.

 i will add some more println statements for storageStatusList



 On Tue, Apr 8, 2014 at 4:01 PM, Andrew Or and...@databricks.com wrote:

 Hi Koert,

 Thanks for pointing this out. However, I am unable to reproduce this
 locally. It seems that there is a discrepancy between what the
 BlockManagerUI and the SparkContext think is persisted. This is strange
 because both sources ultimately derive this information from the same place
 - by doing sc.getExecutorStorageStatus. I have a couple of questions for
 you:

 1) In your print statements, do you print them in the beginning or at the
 end of each callback? It would be good to keep them at the end, since in
 the beginning the data structures have not been processed yet.
 2) You mention that you get the RDD info through your own API. How do you
 get this information? Is it through sc.getRDDStorageInfo?
 3) What did your application do to produce this behavior? Did you make an
 RDD, persist it once, and then use it many times afterwards or something
 similar?

 It would be super helpful if you could also print out what
 StorageStatusListener's storageStatusList looks like by the end of each
 onTaskEnd. I will continue to look into this on my side, but do let me know
 once you have any updates.

 Andrew


 On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers ko...@tresata.com wrote:

 yet at same time i can see via our own api:

 storageInfo: {
 diskSize: 0,
 memSize: 19944,
 numCachedPartitions: 1,
 numPartitions: 1
 }



 On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote:

 i put some println statements in BlockManagerUI

 i have RDDs that are cached in memory. I see this:


 *** onStageSubmitted **
 rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(2 - RDD 2 (2) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


 *** onTaskEnd **
 Map(2 - RDD 2 (2) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B)


 *** onStageCompleted **
 Map()

 *** onStageSubmitted **
 rddInfo: RDD 7 (7) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(7 - RDD 7 (7) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B

RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
any reason why RDDInfo suddenly became private in SPARK-1132?

we are using it to show users status of rdds


Re: RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
ok yeah we are using StageInfo and TaskInfo too...


On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote:

 Hi Koert,

 Other users have expressed interest for us to expose similar classes too
 (i.e. StageInfo, TaskInfo). In the newest release, they will be available
 as part of the developer API. The particular PR that will change this is:
 https://github.com/apache/spark/pull/274.

 Cheers,
 Andrew


 On Mon, Apr 7, 2014 at 5:05 PM, Koert Kuipers ko...@tresata.com wrote:

 any reason why RDDInfo suddenly became private in SPARK-1132?

 we are using it to show users status of rdds





Re: Generic types and pair RDDs

2014-04-01 Thread Koert Kuipers
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) :
RDD[(K, Int)] = {
rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
  }
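
the reason this works, as far as i understand it: the implicit conversion
that adds join (rddToPairRDDFunctions) needs ClassTags for the key and value
types, so with an unconstrained K the compiler has nothing to supply and the
conversion never kicks in. quick usage sketch with made-up data (assuming a
SparkContext sc):

  val a = sc.parallelize(Seq(("x", 1), ("y", 2)))
  val b = sc.parallelize(Seq(("x", 10), ("y", 20)))
  joinTest(a, b).collect()   // Array((x,11), (y,22))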


On Tue, Apr 1, 2014 at 4:55 PM, Daniel Siegmann daniel.siegm...@velos.iowrote:

 When my tuple type includes a generic type parameter, the pair RDD
 functions aren't available. Take for example the following (a join on two
 RDDs, taking the sum of the values):

 def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]) :
 RDD[(String, Int)] = {
 rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
 }

 That works fine, but lets say I replace the type of the key with a generic
 type:

 def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]) : RDD[(K, Int)]
 = {
 rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }
 }

 This latter function gets the compiler error value join is not a member
 of org.apache.spark.rdd.RDD[(K, Int)].

 The reason is probably obvious, but I don't have much Scala experience.
 Can anyone explain what I'm doing wrong?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io W: www.velos.io



Re: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Koert Kuipers
i have found that i am unable to build/test spark with sbt and java6, but
using java7 works (and it compiles with java target version 1.6 so binaries
are usable from java 6)


On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan manku.ti...@outlook.comwrote:

 Thanks for the reply. It turns out that my ubuntu-in-vagrant had 512MB of
 ram only. Increasing it to 1024MB allowed the assembly to finish
 successfully. Peak usage was around 780MB.

 --
 To: user@spark.apache.org
 From: vboylin1...@gmail.com
 Subject: Re: unable to build spark - sbt/sbt: line 50: killed
 Date: Sat, 22 Mar 2014 20:03:28 +0800


  Large memory is needed to build Spark; I think you should make Xmx larger,
 2g for example.
  --
 From: Bharath Bhushan manku.ti...@outlook.com
 Sent: 2014/3/22 12:50
 To: user@spark.apache.org
 Subject: unable to build spark - sbt/sbt: line 50: killed

 I am getting the following error when trying to build spark. I tried
 various sizes for the -Xmx and other memory related arguments to the java
 command line, but the assembly command still fails.

 $ sbt/sbt assembly
 ...
 [info] Compiling 298 Scala sources and 17 Java sources to
 /vagrant/spark-0.9.0-incubating-bin-hadoop2/core/target/scala-2.10/classes...
 sbt/sbt: line 50: 10202 Killed  java -Xmx1900m
 -XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=256m -jar ${JAR} $@

 Versions of software:
 Spark: 0.9.0 (hadoop2 binary)
 Scala: 2.10.3
 Ubuntu: Ubuntu 12.04.4 LTS - Linux vagrant-ubuntu-precise-64
 3.2.0-54-generic
 Java: 1.6.0_45 (oracle java 6)

 I can still use the binaries in bin/ but I was just trying to check if
 sbt/sbt assembly works fine.

 -- Thanks



Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Koert Kuipers
i dont know anything about arm clusters but it looks great. what are
the specs? the nodes have no local disk at all?


On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi chan...@gmail.comwrote:

 Hi all,

 We are a small team doing a research on low-power (and low-cost) ARM
 clusters. We built a 20-node ARM cluster that be able to start Hadoop.
 But as all of you've known, Hadoop is performing on-disk operations,
 so it's not suitable for a constraint machine powered by ARM.

 We then switched to Spark and had to say wow!!

 Spark / HDFS enables us to crush Wikipedia articles (of year 2012) of
 size 34GB in 1h50m. We have identified the bottleneck and it's our
 100M network.

 Here's the cluster:
 https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png

 And this is what we got from Spark's shell:
 https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png

 I think it's the first ARM cluster that can process a non-trivial size
 of Big Data.
 (Please correct me if I'm wrong)
 I really want to thank the Spark team that makes this possible !!

 Best regards,

 -chanwit

 --
 Chanwit Kaewkasi
 linkedin.com/in/chanwit



Re: trying to understand job cancellation

2014-03-19 Thread Koert Kuipers
on spark 1.0.0 SNAPSHOT this seems to work. at least so far i have seen no
issues yet.
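
for anyone else hitting this thread later, the pattern we use looks roughly
like this (a sketch only; someCachedRdd is a placeholder and error handling
is left out):

  import java.util.UUID

  val groupId = UUID.randomUUID.toString
  sc.setJobGroup(groupId, "interactive query for user x")

  // run one or more jobs under this group
  val result = someCachedRdd.map(_ * 2).count()

  // from another thread (e.g. a rest api call), cancel everything in the group
  sc.cancelJobGroup(groupId)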


On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers ko...@tresata.com wrote:

 its 0.9 snapshot from january running in standalone mode.

 have these fixed been merged into 0.9?


 On Thu, Mar 6, 2014 at 12:45 AM, Matei Zaharia matei.zaha...@gmail.comwrote:

 Which version of Spark is this in, Koert? There might have been some
 fixes more recently for it.

 Matei

 On Mar 5, 2014, at 5:26 PM, Koert Kuipers ko...@tresata.com wrote:

 Sorry I meant to say: seems the issue is shared RDDs between a job that
 got cancelled and a later job.

 However even disregarding that I have the other issue that the active
 task of the cancelled job hangs around forever, not doing anything
 On Mar 5, 2014 7:29 PM, Koert Kuipers ko...@tresata.com wrote:

 yes jobs on RDDs that were not part of the cancelled job work fine.

 so it seems the issue is the cached RDDs that are shared between the
 cancelled job and the jobs after that.


 On Wed, Mar 5, 2014 at 7:15 PM, Koert Kuipers ko...@tresata.com wrote:

 well, the new jobs use existing RDDs that were also used in the job
 that got killed.

 let me confirm that new jobs that use completely different RDDs do not
 get killed.



 On Wed, Mar 5, 2014 at 7:00 PM, Mayur Rustagi 
 mayur.rust...@gmail.comwrote:

 Quite unlikely, as job ids are given in an incremental fashion, so your
 future job ids are not likely to be killed if your group id is not repeated.
 I guess the issue is something else.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers ko...@tresata.comwrote:

 i did that. my next job gets a random new group job id (a uuid).
 however that doesnt seem to stop the job from getting sucked into the
 cancellation


 On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:

 You can randomize job groups as well, to secure yourself against
 termination.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers ko...@tresata.comwrote:

 got it. seems like i better stay away from this feature for now..


 On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:

 One issue is that job cancellation is posted on the event loop, so it's
 possible that subsequent jobs submitted to the job queue may beat the job
 cancellation event, and hence the job cancellation event may end up closing
 them too.
 So there's definitely a race condition you are risking even if not
 running into.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers 
 ko...@tresata.comwrote:

 SparkContext.cancelJobGroup


 On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi 
 mayur.rust...@gmail.com wrote:

 How do you cancel the job. Which API do you use?

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
  @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers ko...@tresata.com
  wrote:

 i also noticed that jobs (with a new JobGroupId) which i run
 after this use which use the same RDDs get very confused. i see 
 lots of
 cancelled stages and retries that go on forever.


 On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers 
 ko...@tresata.com wrote:

 i have a running job that i cancel while keeping the spark
 context alive.

 at the time of cancellation the active stage is 14.

 i see in logs:
 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to
 cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl:
 Cancelling stage 10
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl:
 Cancelling stage 14
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14
 was cancelled
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove
 TaskSet 14.0 from pool x
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl:
 Cancelling stage 13
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl:
 Cancelling stage 12
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl:
 Cancelling stage 11
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl:
 Cancelling stage 15

 so far it all looks good. then i get a lot of messages like
 this:
 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring
 update with state FINISHED from TID 883 because its task set is 
 gone
 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring
 update with state KILLED from TID 888 because its task set is gone

 after this stage 14 hangs around in active stages, without any
 sign of progress or cancellation. it just sits there forever, 
 stuck.
 looking at the logs of the executors confirms this. they task 
 seem to be
 still running, but nothing

combining operations elegantly

2014-03-13 Thread Koert Kuipers
not that long ago there was a nice example on here about how to combine
multiple operations on a single RDD. so basically if you want to do a
count() and something else, how to roll them into a single job. i think
patrick wendell gave the examples.

i cant find them anymore. patrick, can you please repost? thanks!
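
in the meantime, the trick i know of for folding a count() into another
action in a single pass is an accumulator (just a sketch, not necessarily the
example patrick posted; rdd stands for any numeric RDD and sc for an existing
SparkContext):

  // single pass: count via an accumulator while computing a sum with reduce
  val count = sc.accumulator(0L)
  val sum = rdd.map { x => count += 1L; x }.reduce(_ + _)
  // note: accumulator updates from re-executed tasks can be counted twice,
  // so treat the count as approximate if tasks fail and get retried
  println("sum = " + sum + ", count = " + count.value)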


Re: Sbt Permgen

2014-03-10 Thread Koert Kuipers
hey sandy, i think that pull request is not relevant to the 0.9 branch i am using

switching to java 7 for sbt/sbt test made it work. not sure why...


On Sun, Mar 9, 2014 at 11:44 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 There was an issue related to this fixed recently:
 https://github.com/apache/spark/pull/103


 On Sun, Mar 9, 2014 at 8:40 PM, Koert Kuipers ko...@tresata.com wrote:

 edit last line of sbt/sbt, after which i run:
 sbt/sbt test


 On Sun, Mar 9, 2014 at 10:24 PM, Sean Owen so...@cloudera.com wrote:

 How are you specifying these args?
 On Mar 9, 2014 8:55 PM, Koert Kuipers ko...@tresata.com wrote:

 i just checkout out the latest 0.9

 no matter what java options i use in sbt/sbt (i tried -Xmx6G
 -XX:MaxPermSize=2000m -XX:ReservedCodeCacheSize=300m) i keep getting errors
 java.lang.OutOfMemoryError: PermGen space when running the tests.

 curiously i managed to run the tests with the default dependencies, but
 with cdh4.5.0 mr1 dependencies i always hit the dreaded Permgen space 
 issue.

 Any suggestions?






computation slows down 10x because of cached RDDs

2014-03-10 Thread Koert Kuipers
hello all,
i am observing a strange result. i have a computation that i run on a
cached RDD in spark-standalone. it typically takes about 4 seconds.

but when other RDDs that are not relevant to the computation at hand are
cached in memory (in same spark context), the computation takes 40 seconds
or more.

the problem seems to be GC time, which goes from milliseconds to tens of
seconds.

note that my issue is not that memory is full. i have cached about 14G in
RDDs with 66G available across workers for the application. also my
computation did not push any cached RDD out of memory.

any ideas?


Re: computation slows down 10x because of cached RDDs

2014-03-10 Thread Koert Kuipers
hey matei,
it happens repeatedly.

we are currently runnning on java 6 with spark 0.9.

i will add -XX:+PrintGCDetails and collect details, and also look into java
7 G1. thanks






On Mon, Mar 10, 2014 at 6:27 PM, Matei Zaharia matei.zaha...@gmail.comwrote:

 Does this happen repeatedly if you keep running the computation, or just
 the first time? It may take time to move these Java objects to the old
 generation the first time you run queries, which could lead to a GC pause
 that also slows down the small queries.

 If you can run with -XX:+PrintGCDetails in your Java options, it would
 also be good to see what percent of each GC generation is used.

 The concurrent mark-and-sweep GC -XX:+UseConcMarkSweepGC or the G1 GC in
 Java 7 (-XX:+UseG1GC) might also avoid these pauses by GCing concurrently
 with your application threads.

 Matei

 On Mar 10, 2014, at 3:18 PM, Koert Kuipers ko...@tresata.com wrote:

 hello all,
 i am observing a strange result. i have a computation that i run on a
 cached RDD in spark-standalone. it typically takes about 4 seconds.

 but when other RDDs that are not relevant to the computation at hand are
 cached in memory (in same spark context), the computation takes 40 seconds
 or more.

 the problem seems to be GC time, which goes from milliseconds to tens of
 seconds.

 note that my issue is not that memory is full. i have cached about 14G in
 RDDs with 66G available across workers for the application. also my
 computation did not push any cached RDD out of memory.

 any ideas?





Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
i also noticed that jobs (with a new JobGroupId) which i run after this and
which use the same RDDs get very confused. i see lots of cancelled stages
and retries that go on forever.


On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers ko...@tresata.com wrote:

 i have a running job that i cancel while keeping the spark context alive.

 at the time of cancellation the active stage is 14.

 i see in logs:
 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group
 3a25db23-2e39-4497-b7ab-b26b2a976f9c
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
 cancelled
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0
 from pool x
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15

 so far it all looks good. then i get a lot of messages like this:
 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with
 state FINISHED from TID 883 because its task set is gone
 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with
 state KILLED from TID 888 because its task set is gone

 after this stage 14 hangs around in active stages, without any sign of
 progress or cancellation. it just sits there forever, stuck. looking at the
 logs of the executors confirms this. they task seem to be still running,
 but nothing is happening. for example (by the time i look at this its 4:58
 so this tasks hasnt done anything in 15 mins):

 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
 14/03/04 16:43:16 INFO Executor: Finished task ID 943
 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
 14/03/04 16:43:16 INFO Executor: Finished task ID 945
 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66

 not sure what to make of this. any suggestions? best, koert



Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
got it. seems like i better stay away from this feature for now..


On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 One issue is that job cancellation is posted on the event loop, so it's possible
 that subsequent jobs submitted to the job queue may beat the job cancellation
 event, and hence the job cancellation event may end up closing them too.
 So there's definitely a race condition you are risking even if not running
 into.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers ko...@tresata.com wrote:

 SparkContext.cancelJobGroup


 On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 How do you cancel the job. Which API do you use?

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
  @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers ko...@tresata.com wrote:

 i also noticed that jobs (with a new JobGroupId) which i run after this
 use which use the same RDDs get very confused. i see lots of cancelled
 stages and retries that go on forever.


 On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers ko...@tresata.comwrote:

 i have a running job that i cancel while keeping the spark context
 alive.

 at the time of cancellation the active stage is 14.

 i see in logs:
 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job
 group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
 10
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
 14
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
 cancelled
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet
 14.0 from pool x
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
 13
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
 12
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
 11
 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage
 15

 so far it all looks good. then i get a lot of messages like this:
 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update
 with state FINISHED from TID 883 because its task set is gone
 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update
 with state KILLED from TID 888 because its task set is gone

 after this stage 14 hangs around in active stages, without any sign of
 progress or cancellation. it just sits there forever, stuck. looking at 
 the
 logs of the executors confirms this. they task seem to be still running,
 but nothing is happening. for example (by the time i look at this its 4:58
 so this tasks hasnt done anything in 15 mins):

 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is
 1007
 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to
 driver
 14/03/04 16:43:16 INFO Executor: Finished task ID 943
 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is
 1007
 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to
 driver
 14/03/04 16:43:16 INFO Executor: Finished task ID 945
 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66

 not sure what to make of this. any suggestions? best, koert








Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
If you need quick response re-use your spark context between queries and
cache rdds in memory
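
the pattern looks roughly like this (a sketch; the master url, path and query
are made up, and the long-lived service around it is up to you):

  // build the context once at service startup and keep it around
  val sc = new SparkContext("spark://master:7077", "low-latency-queries")

  // load and cache the data once; the first action pays the loading cost
  val events = sc.textFile("hdfs:///data/events").cache()
  events.count()

  // every later query hits the in-memory data and returns quickly
  def queryCount(keyword: String): Long = events.filter(_.contains(keyword)).count()
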
On Mar 3, 2014 12:42 AM, polkosity polkos...@gmail.com wrote:

 Thanks for the advice Mayur.

 I thought I'd report back on the performance difference...  Spark
 standalone
 mode has executors processing at capacity in under a second :)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
yes, tachyon is in memory serialized, which is not as fast as cached in
memory in spark (not serialized). the difference really depends on your job
type.
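
concretely the difference is just the storage level you persist with (a
sketch; rdd is a placeholder, and whether the tachyon-backed level exists and
what it is called depends on the spark version):

  import org.apache.spark.storage.StorageLevel

  // pick exactly one level per rdd; spark does not let you change it later
  rdd.persist(StorageLevel.MEMORY_ONLY)          // deserialized jvm objects, fastest to read back

  // alternatives (shown as comments since a level can only be set once):
  // rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory: smaller, but deserialized on every read
  // rdd.persist(StorageLevel.OFF_HEAP)          // off-heap / tachyon-backed in some versions, also serialized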



On Mon, Mar 3, 2014 at 7:10 PM, polkosity polkos...@gmail.com wrote:

 Thats exciting!  Will be looking into that, thanks Andrew.

 Related topic, has anyone had any experience running Spark on Tachyon
 in-memory filesystem, and could offer their views on using it?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2265.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


