sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Bharath Bhushan
I am facing a weird failure where "sbt/sbt assembly" shows a lot of SSL 
certificate errors for repo.maven.apache.org. Is anyone else facing the same 
problems? Any idea why this is happening? Yesterday I was able to successfully 
run it.

Loading https://repo.maven.apache.org shows an invalid cert chain.

—
Thanks

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Sean Owen
I'm also seeing this. It also was working for me previously AFAIK.

The proximate cause is my well-intentioned change that uses HTTPS to
access all artifact repos. The default for Maven Central before would have
been HTTP. While it's a good idea to use HTTPS, it may run into
complications.

I see:

https://issues.sonatype.org/browse/MVNCENTRAL-377
https://issues.apache.org/jira/browse/INFRA-7363

... which suggest that actually this isn't supported, and should return
404.

Then I see:
http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4
... that suggests you have to pay for HTTPS access to Maven Central?

But, we both seem to have found it working (even after the change above)
and it does not return 404 now.

The Maven build works, still, but if you look carefully, it's actually only
because it eventually falls back to HTTP for Maven Central artifacts.

So I think the thing to do is simply back off to HTTP for Maven Central
only. That's unfortunate because there's a small but non-trivial security
issue here in downloading artifacts without security.

Any brighter ideas? I'll open a supplementary PR if so to adjust this.


Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Andrew, this should be fixed in 0.9.1, assuming it is the same hash
collision error we found there.

Kane, is it possible your bigger data is corrupt, such that any
operations on it fail?


On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote:

 FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and
 have it disabled now. The specific error behavior was that a join would
 consistently return one count of rows with spill enabled and another count
 with it disabled.

 Sent from my mobile phone
 On Mar 22, 2014 1:52 PM, Kane kane.ist...@gmail.com wrote:

 But i was wrong - map also fails on big file and setting
 spark.shuffle.spill
 doesn't help. Map fails with the same error.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




error loading large files in PySpark 0.9.0

2014-03-23 Thread Jeremy Freeman
Hi all,

Hitting a mysterious error loading large text files, specific to PySpark
0.9.0.

In PySpark 0.8.1, this works:

data = sc.textFile("path/to/myfile")
data.count()

But in 0.9.0, it stalls. There are indications of completion up to:

14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
(progress: 15/537)
14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)

And then this repeats indefinitely:

14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
runningTasks: 144
14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
runningTasks: 144

Always stalls at the same place. There's nothing in stderr on the workers,
but in stdout there are several of these messages:

INFO PythonRDD: stdin writer to Python finished early

So perhaps the real error is being suppressed as in
https://spark-project.atlassian.net/browse/SPARK-1025

Data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k
characters per row. Running on a private cluster with 10 nodes, 100GB / 16
cores each, Python v 2.7.6.

I doubt the data is corrupted as it works fine in Scala in 0.8.1 and 0.9.0,
and in PySpark in 0.8.1. Happy to post the file, but it should repro for
anything with these dimensions. It *might* be specific to long strings: I
don't see it with fewer characters (10k) per row, but I also don't see it
with many fewer rows but the same number of characters per row.
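
For anyone trying to reproduce this, a scaled-down file of the same shape can be generated with a short script. This is only an illustrative sketch: the helper name and the (much smaller) default dimensions are made up, not Jeremy's exact data.

```python
# Generate a file of rows of space-separated numbers, scaled down from
# the ~300k rows x ~50k characters per row described above.
import random

def write_test_file(path, n_rows=300, n_cols=50):
    # Each row is n_cols space-separated random numbers, mimicking
    # the long-string rows that seem to trigger the stall.
    with open(path, "w") as f:
        for _ in range(n_rows):
            row = " ".join(str(random.random()) for _ in range(n_cols))
            f.write(row + "\n")

write_test_file("/tmp/repro.txt")
```

Scaling n_rows and n_cols back up toward the reported dimensions would be needed to actually hit the bug.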

Happy to try and provide more info / help debug!

-- Jeremy



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: combining operations elegantly

2014-03-23 Thread Richard Siebeling
Hi Koert, Patrick,

do you already have an elegant solution to combine multiple operations on a
single RDD?
Say for example that I want to do a sum over one column, a count and an
average over another column,

thanks in advance,
Richard


On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling rsiebel...@gmail.com wrote:

 Patrick, Koert,

 I'm also very interested in these examples, could you please post them if
 you find them?
 thanks in advance,
 Richard


 On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers ko...@tresata.com wrote:

 not that long ago there was a nice example on here about how to combine
 multiple operations on a single RDD. so basically if you want to do a
 count() and something else, how to roll them into a single job. i think
 patrick wendell gave the examples.

 i cant find them anymore patrick can you please repost? thanks!





Re: distinct on huge dataset

2014-03-23 Thread Kane
Yes, there was an error in the data; after fixing it, count fails with an Out
of Memory error.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Debasish Das
I am getting these weird errors which I have not seen before:

[error] Server access Error: handshake alert:  unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
[info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
[error] Server access Error: handshake alert:  unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.transaction/1.1.1.v201105210645/javax.transaction-1.1.1.v201105210645.orbit
[info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
[error] Server access Error: handshake alert:  unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.mail.glassfish/1.4.1.v201005082020/javax.mail.glassfish-1.4.1.v201005082020.orbit
[info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
[error] Server access Error: handshake alert:  unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.activation/1.1.0.v201105071233/javax.activation-1.1.0.v201105071233.orbit

Is it also due to the SSL error ?

Thanks.
Deb



On Sun, Mar 23, 2014 at 6:08 AM, Sean Owen so...@cloudera.com wrote:

 I'm also seeing this. It also was working for me previously AFAIK.

 The proximate cause is my well-intentioned change that uses HTTPS to
 access all artifact repos. The default for Maven Central before would have
 been HTTP. While it's a good idea to use HTTPS, it may run into
 complications.

 I see:

 https://issues.sonatype.org/browse/MVNCENTRAL-377
 https://issues.apache.org/jira/browse/INFRA-7363

 ... which suggest that actually this isn't supported, and should return
 404.

 Then I see:

 http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4
 ... that suggests you have to pay for HTTPS access to Maven Central?

 But, we both seem to have found it working (even after the change above)
 and it does not return 404 now.

 The Maven build works, still, but if you look carefully, it's actually
 only because it eventually falls back to HTTP for Maven Central artifacts.

 So I think the thing to do is simply back off to HTTP for Maven Central
 only. That's unfortunate because there's a small but non-trivial security
 issue here in downloading artifacts without security.

 Any brighter ideas? I'll open a supplementary PR if so to adjust this.




Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Aaron Davidson
These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209

The orbit errors are quite possibly due to using https instead of http,
whether or not the SSL cert was bad. Let us know if they go away with
reverting to http.


On Sun, Mar 23, 2014 at 11:48 AM, Debasish Das debasish.da...@gmail.com wrote:

 I am getting these weird errors which I have not seen before:

 [error] Server access Error: handshake alert:  unrecognized_name url=
 https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
 [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
 [error] Server access Error: handshake alert:  unrecognized_name url=
 https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.transaction/1.1.1.v201105210645/javax.transaction-1.1.1.v201105210645.orbit
 [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
 [error] Server access Error: handshake alert:  unrecognized_name url=
 https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.mail.glassfish/1.4.1.v201005082020/javax.mail.glassfish-1.4.1.v201005082020.orbit
 [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
 [error] Server access Error: handshake alert:  unrecognized_name url=
 https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.activation/1.1.0.v201105071233/javax.activation-1.1.0.v201105071233.orbit

 Is it also due to the SSL error ?

 Thanks.
 Deb



 On Sun, Mar 23, 2014 at 6:08 AM, Sean Owen so...@cloudera.com wrote:

 I'm also seeing this. It also was working for me previously AFAIK.

 The proximate cause is my well-intentioned change that uses HTTPS to
 access all artifact repos. The default for Maven Central before would have
 been HTTP. While it's a good idea to use HTTPS, it may run into
 complications.

 I see:

 https://issues.sonatype.org/browse/MVNCENTRAL-377
 https://issues.apache.org/jira/browse/INFRA-7363

 ... which suggest that actually this isn't supported, and should return
 404.

 Then I see:

 http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4
 ... that suggests you have to pay for HTTPS access to Maven Central?

 But, we both seem to have found it working (even after the change above)
 and it does not return 404 now.

 The Maven build works, still, but if you look carefully, it's actually
 only because it eventually falls back to HTTP for Maven Central artifacts.

 So I think the thing to do is simply back off to HTTP for Maven Central
 only. That's unfortunate because there's a small but non-trivial security
 issue here in downloading artifacts without security.

 Any brighter ideas? I'll open a supplementary PR if so to adjust this.





No space left on device exception

2014-03-23 Thread Ognen Duzlevski

Hello,

I have a weird error showing up when I run a job on my Spark cluster. 
The version of spark is 0.9 and I have 3+ GB free on the disk when this 
error shows up. Any ideas what I should be looking for?


[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 
167.0:3 failed 4 times (most recent failure: Exception failure: 
java.io.FileNotFoundException: 
/tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left 
on device))
org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 
times (most recent failure: Exception failure: 
java.io.FileNotFoundException: 
/tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left 
on device))
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)

at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)


Thanks!
Ognen


Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
On some systems, /tmp is an in-memory tmpfs file system with its own size
limit. It's possible that this limit has been exceeded. You might try
running the "df" command to check the free space of /tmp, or of the root
filesystem if /tmp isn't listed separately.
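
The same check can be done programmatically. A generic sketch (plain Python, not Spark-specific) using os.statvfs, which reports both free space and free inodes, similar to "df -h" and "df -i":

```python
# Report free space and free inodes for a path, similar to what
# "df -h" and "df -i" show. Pure Python; no Spark needed.
import os

def disk_status(path):
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_frsize   # space available to non-root users
    total_inodes = st.f_files                # total inodes on the filesystem
    free_inodes = st.f_favail                # inodes available to non-root users
    return free_bytes, total_inodes, free_inodes

free_b, total_i, free_i = disk_status("/tmp")
print("free: %.1f GB, inodes free: %d / %d" % (free_b / 1e9, free_i, total_i))
```

Note that inode exhaustion can make writes fail with "No space left on device" even when plenty of bytes are free, which turns out to be relevant later in this thread.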

3 GB also seems pretty low for the remaining free space of a disk. If your
disk size is in the TB range, it's possible that the last couple GB have
issues when being allocated due to fragmentation or reclamation policies.


On Sun, Mar 23, 2014 at 3:06 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:

 Hello,

 I have a weird error showing up when I run a job on my Spark cluster. The
 version of spark is 0.9 and I have 3+ GB free on the disk when this error
 shows up. Any ideas what I should be looking for?

 [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
 167.0:3 failed 4 times (most recent failure: Exception failure:
 java.io.FileNotFoundException: 
 /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127
 (No space left on device))
 org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times
 (most recent failure: Exception failure: java.io.FileNotFoundException:
 /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left
 on device))
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$
 apache$spark$scheduler$DAGScheduler$$abortStage$1.
 apply(DAGScheduler.scala:1028)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$
 apache$spark$scheduler$DAGScheduler$$abortStage$1.
 apply(DAGScheduler.scala:1026)
 at scala.collection.mutable.ResizableArray$class.foreach(
 ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$
 scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$
 processEvent$10.apply(DAGScheduler.scala:619)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$
 processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.processEvent(
 DAGScheduler.scala:619)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$
 $anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)

 Thanks!
 Ognen



Re: combining operations elegantly

2014-03-23 Thread Patrick Wendell
Hey All,

I think the old thread is here:
https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J

The method proposed in that thread is to create a utility class for
doing single-pass aggregations. Using Algebird is a pretty good way to
do this and is a bit more flexible since you don't need to create a
new utility each time you want to do this.

In Spark 1.0 and later you will be able to do this more elegantly with
the schema support:
myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
Average('duration) as 'duration)

and it will use a single pass automatically... but that's not quite
released yet :)

- Patrick
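
The one-pass tuple trick discussed in this thread can be sketched without Spark or Algebird at all: map each record to a tuple of partial aggregates and combine tuples pairwise. This is a plain-Python illustration of the pattern, not Spark API.

```python
# One pass over the data computing count, min, max, and sum together:
# map each element to a tuple of partial aggregates, then combine
# tuples pairwise -- the same shape as the Algebird example below.
from functools import reduce

def combine(a, b):
    return (a[0] + b[0],       # count
            min(a[1], b[1]),   # min
            max(a[2], b[2]),   # max
            a[3] + b[3])       # sum

data = range(1, 11)
result = reduce(combine, ((1, x, x, x) for x in data))
print(result)  # (10, 1, 10, 55)
```

In Spark the same combine function would be passed to reduce() on the mapped RDD; Algebird just spares you writing combine by hand.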




On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers ko...@tresata.com wrote:
 i currently typically do something like this:

 scala> val rdd = sc.parallelize(1 to 10)
 scala> import com.twitter.algebird.Operators._
 scala> import com.twitter.algebird.{Max, Min}
 scala> rdd.map{ x => (
      |   1L,
      |   Min(x),
      |   Max(x),
      |   x
      | )}.reduce(_ + _)
 res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int],
 Int) = (10,Min(1),Max(10),55)

 however for this you need twitter algebird dependency. without that you have
 to code the reduce function on the tuples yourself...

 another example with 2 columns, where i do conditional count for first
 column, and simple sum for second:
 scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
      |   if (x > 5) 1 else 0,
      |   y
      | )}.reduce(_ + _)
 res3: (Int, Int) = (5,155)



 On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling rsiebel...@gmail.com
 wrote:

 Hi Koert, Patrick,

 do you already have an elegant solution to combine multiple operations on
 a single RDD?
 Say for example that I want to do a sum over one column, a count and an
 average over another column,

 thanks in advance,
 Richard


 On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling rsiebel...@gmail.com
 wrote:

 Patrick, Koert,

 I'm also very interested in these examples, could you please post them if
 you find them?
 thanks in advance,
 Richard


 On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers ko...@tresata.com wrote:

 not that long ago there was a nice example on here about how to combine
 multiple operations on a single RDD. so basically if you want to do a
 count() and something else, how to roll them into a single job. i think
 patrick wendell gave the examples.

 i cant find them anymore patrick can you please repost? thanks!






Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski

On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp 
if /tmp is full. Actually it's recommended to have multiple local 
disks and set it to a comma-separated list of directories, one per disk.
Matei, does the number of tasks/partitions in a transformation influence 
something in terms of disk space consumption? Or inode consumption?


Thanks,
Ognen


Re: Problem with SparkR

2014-03-23 Thread Shivaram Venkataraman
Hi

Thanks for reporting this. It'll be great if you can check a couple of
things:

1. Are you trying to use this with Hadoop2 by any chance ? There was an
incompatible ASM version bug that we fixed for Hadoop2
https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it,
but I just want to check if the same error is cropping up again.

2. Is there a stack trace that follows the IncompatibleClassChangeError ?
If so could you attach that ? The error message indicates there is some
incompatibility between class versions and having a more detailed stack
trace might help us track this down.

Thanks
Shivaram





On Sun, Mar 23, 2014 at 4:48 PM, Jacques Basaldúa jacq...@dybot.com wrote:

  I am really interested in using Spark from R and have tried to use
 SparkR, but always get the same error.



 This is how I installed:



  - I successfully installed Spark version  0.9.0 with Scala  2.10.3
 (OpenJDK 64-Bit Server VM, Java 1.7.0_45)

I can run examples from spark-shell and Python



  - I installed the R package devtools and installed SparkR using:



  - library(devtools)

  - install_github("amplab-extras/SparkR-pkg", subdir="pkg")



   This compiled the package successfully.



 When I try to run the package



 E.g.,

   library(SparkR)

   sc <- sparkR.init(master="local")   // so far the program runs
  fine



   rdd <- parallelize(sc, 1:10)  // This returns the following error



   Error in .jcall(getJRDD(rdd), "Ljava/util/List;", "collect") :

   java.lang.IncompatibleClassChangeError:
 org/apache/spark/util/InnerClosureFinder



 No matter how I try to use the sc (I have tried all the examples) I always
 get an error.



 Any ideas?



 Jacques.



is it possible to access the inputsplit in Spark directly?

2014-03-23 Thread hwpstorage
Hello,
In Spark we can use *newAPIHadoopRDD* to access different distributed
systems like HDFS, HBase, and MongoDB via different InputFormats.
Is it possible to access the *InputSplit* in Spark directly? Spark can
cache data in local memory. Performing local computation/aggregation on
the local InputSplit could speed up the whole performance.

Thanks a lot


Re: error loading large files in PySpark 0.9.0

2014-03-23 Thread Matei Zaharia
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your 
SparkContext? It tries to serialize that many objects together at a time, which 
might be too much. By default the batchSize is 1024.

Matei
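
To see why the batch size matters with very long rows, here is a rough sketch of batched serialization in plain Python. This mimics the idea behind PySpark's batchSize (objects pickled in groups, so peak memory per batch scales with batch size times object size); it is not PySpark's actual serializer code.

```python
# Rough sketch of batched serialization: objects are pickled in groups
# of batch_size, so the peak memory for one batch scales with
# batch_size * object size. With 50k-character rows and a batch of
# 1024, one batch is ~50MB of strings before pickling even starts.
import pickle
from itertools import islice

def serialize_batched(items, batch_size):
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield pickle.dumps(batch)   # one blob per batch

rows = ["x" * 100 for _ in range(50)]   # stand-in for long text rows
blobs = list(serialize_batched(rows, batch_size=10))
print(len(blobs))  # 5 blobs: 50 rows, 10 per batch
```

Lowering batch_size trades serialization overhead for a smaller per-batch memory footprint, which is the effect of passing a smaller batchSize to SparkContext.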

On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:

 Hi all,
 
 Hitting a mysterious error loading large text files, specific to PySpark
 0.9.0.
 
 In PySpark 0.8.1, this works:
 
 data = sc.textFile(path/to/myfile)
 data.count()
 
 But in 0.9.0, it stalls. There are indications of completion up to:
 
 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
 (progress: 15/537)
 14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)
 
 And then this repeats indefinitely
 
 14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
 runningTasks: 144
 14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
 runningTasks: 144
 
 Always stalls at the same place. There's nothing in stderr on the workers,
 but in stdout there are several of these messages:
 
 INFO PythonRDD: stdin writer to Python finished early
 
 So perhaps the real error is being suppressed as in
 https://spark-project.atlassian.net/browse/SPARK-1025
 
 Data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k
 characters per row. Running on a private cluster with 10 nodes, 100GB / 16
 cores each, Python v 2.7.6.
 
 I doubt the data is corrupted as it works fine in Scala in 0.8.1 and 0.9.0,
 and in PySpark in 0.8.1. Happy to post the file, but it should repro for
 anything with these dimensions. It *might* be specific to long strings: I
 don't see it with fewer characters (10k) per row, but I also don't see it
 with many fewer rows but the same number of characters per row.
 
 Happy to try and provide more info / help debug!
 
 -- Jeremy
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Bharath Bhushan
I don’t see the errors anymore. Thanks Aaron.

On 24-Mar-2014, at 12:52 am, Aaron Davidson ilike...@gmail.com wrote:

 These errors should be fixed on master with Sean's PR: 
 https://github.com/apache/spark/pull/209
 
 The orbit errors are quite possibly due to using https instead of http, 
 whether or not the SSL cert was bad. Let us know if they go away with 
 reverting to http.
 
 
 On Sun, Mar 23, 2014 at 11:48 AM, Debasish Das debasish.da...@gmail.com 
 wrote:
 I am getting these weird errors which I have not seen before:
 
 [error] Server access Error: handshake alert:  unrecognized_name 
 url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
 [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
 [error] Server access Error: handshake alert:  unrecognized_name 
 url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.transaction/1.1.1.v201105210645/javax.transaction-1.1.1.v201105210645.orbit
 [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
 [error] Server access Error: handshake alert:  unrecognized_name 
 url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.mail.glassfish/1.4.1.v201005082020/javax.mail.glassfish-1.4.1.v201005082020.orbit
 [info] Resolving org.eclipse.jetty.orbit#jetty-orbit;1 ...
 [error] Server access Error: handshake alert:  unrecognized_name 
 url=https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.activation/1.1.0.v201105071233/javax.activation-1.1.0.v201105071233.orbit
 
 Is it also due to the SSL error ?
 
 Thanks.
 Deb
 
 
 
 On Sun, Mar 23, 2014 at 6:08 AM, Sean Owen so...@cloudera.com wrote:
 I'm also seeing this. It also was working for me previously AFAIK.
 
 The proximate cause is my well-intentioned change that uses HTTPS to access 
 all artifact repos. The default for Maven Central before would have been 
 HTTP. While it's a good idea to use HTTPS, it may run into complications.
 
 I see:
 
 https://issues.sonatype.org/browse/MVNCENTRAL-377 
 https://issues.apache.org/jira/browse/INFRA-7363
 
 ... which suggest that actually this isn't supported, and should return 404. 
 
 Then I see:
 http://blog.sonatype.com/2012/10/now-available-ssl-connectivity-to-central/#.Uy7Yuq1_tj4
 ... that suggests you have to pay for HTTPS access to Maven Central?
 
 But, we both seem to have found it working (even after the change above) and 
 it does not return 404 now.
 
 The Maven build works, still, but if you look carefully, it's actually only 
 because it eventually falls back to HTTP for Maven Central artifacts.
 
 So I think the thing to do is simply back off to HTTP for Maven Central only. 
 That's unfortunate because there's a small but non-trivial security issue 
 here in downloading artifacts without security.
 
 Any brighter ideas? I'll open a supplementary PR if so to adjust this.
 
 
 



Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Aaron, thanks for replying. I am very much puzzled as to what is going 
on. A job that used to run on the same cluster is failing with this 
mysterious message about not having enough disk space when in fact I can 
see through "watch df -h" that the free space is always hovering around 
3+GB on the disk and the free inodes are at 50% (this is on master). I 
went through each slave and the spark/work/app*/stderr and stdout and 
spark/logs/*out files and no mention of too many open files failures on 
any of the slaves nor on the master :(


Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle stage and 
post-shuffle), there are P^2 files created. 
With spark.shuffle.consolidateFiles turned on, we would instead create 
only P files. Disk space consumption is largely unaffected, however, 
by the number of partitions, unless each partition is particularly small.


You might look at the actual executors' logs, as it's possible that 
this error was caused by an earlier exception, such as too many open 
files.
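
To make the file-count arithmetic above concrete, here is a trivial sketch using Aaron's simplified figures (P^2 files without consolidation, P with it); the example value of P is made up:

```python
# Shuffle file counts per Aaron's description above: P^2 files without
# consolidation, only P files with spark.shuffle.consolidateFiles on.
def shuffle_files(p, consolidate=False):
    return p if consolidate else p * p

print(shuffle_files(1000))                    # 1000000
print(shuffle_files(1000, consolidate=True))  # 1000
```

A million tiny shuffle files can exhaust inodes long before disk bytes run out, which matches the 100% inode utilization found later in this thread.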



On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:


On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other than
/tmp if /tmp is full. Actually it's recommended to have multiple
local disks and set to to a comma-separated list of directories,
one per disk.

Matei, does the number of tasks/partitions in a transformation
influence something in terms of disk space consumption? Or inode
consumption?

Thanks,
Ognen





Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Bleh, strike that, one of my slaves was at 100% inode utilization on the 
file system. It was /tmp/spark* leftovers that apparently did not get 
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up 
/tmp/spark* regularly.
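
Such a cleanup could be sketched as follows. This is only an illustration: the 2-day age threshold and path pattern are assumptions, and care is needed not to delete the directories of jobs that are still running.

```python
# Remove /tmp/spark-* directories untouched for more than max_age_days.
# Illustrative only: the threshold and glob pattern are guesses, and
# directories belonging to running jobs must not be deleted.
import glob
import os
import shutil
import time

def clean_spark_tmp(tmp_glob="/tmp/spark-*", max_age_days=2):
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for d in glob.glob(tmp_glob):
        if os.path.isdir(d) and os.path.getmtime(d) < cutoff:
            shutil.rmtree(d, ignore_errors=True)
            removed.append(d)
    return removed

clean_spark_tmp()
```

An equivalent find/rm one-liner in cron would do the same job; the point is simply to age out leftovers from failed or interrupted runs.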


Thanks (and sorry for the noise)!
Ognen

On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
Aaron, thanks for replying. I am very much puzzled as to what is going 
on. A job that used to run on the same cluster is failing with this 
mysterious message about not having enough disk space when in fact I 
can see through watch df -h that the free space is always hovering 
around 3+GB on the disk and the free inodes are at 50% (this is on 
master). I went through each slave and the spark/work/app*/stderr and 
stdout and spark/logs/*out files and no mention of too many open files 
failures on any of the slaves nor on the master :(


Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle stage and 
post-shuffle), there are P^2 files created. 
With spark.shuffle.consolidateFiles turned on, we would instead 
create only P files. Disk space consumption is largely unaffected, 
however, by the number of partitions, unless each partition is 
particularly small.


You might look at the actual executors' logs, as it's possible that 
this error was caused by an earlier exception, such as too many open 
files.



On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:


On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere other
than /tmp if /tmp is full. Actually it's recommended to have
multiple local disks and set to to a comma-separated list of
directories, one per disk.

Matei, does the number of tasks/partitions in a transformation
influence something in terms of disk space consumption? Or inode
consumption?

Thanks,
Ognen





--
A distributed system is one in which the failure of a computer you didn't even know 
existed can render your own computer unusable
-- Leslie Lamport



Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
Thanks for bringing this up; 100% inode utilization is an issue I haven't
seen raised before, and it raises another issue which is not on our
current roadmap for state cleanup (cleaning up data which was not fully
cleaned up from a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:

  Bleh, strike that, one of my slaves was at 100% inode utilization on the
 file system. It was /tmp/spark* leftovers that apparently did not get
 cleaned up properly after failed or interrupted jobs.
 Mental note - run a cron job on all slaves and master to clean up
 /tmp/spark* regularly.

 Thanks (and sorry for the noise)!
 Ognen


 On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

 Aaron, thanks for replying. I am very much puzzled as to what is going on.
 A job that used to run on the same cluster is failing with this mysterious
 message about not having enough disk space when in fact I can see through
 watch df -h that the free space is always hovering around 3+GB on the
 disk and the free inodes are at 50% (this is on master). I went through
 each slave and the spark/work/app*/stderr and stdout and spark/logs/*out
 files and no mention of too many open files failures on any of the slaves
 nor on the master :(

 Thanks
 Ognen

 On 3/23/14, 8:38 PM, Aaron Davidson wrote:

 By default, with P partitions (for both the pre-shuffle stage and
 post-shuffle), there are P^2 files created.
 With spark.shuffle.consolidateFiles turned on, we would instead create only
 P files. Disk space consumption is largely unaffected, however, by the
 number of partitions, unless each partition is particularly small.

  You might look at the actual executors' logs, as it's possible that this
 error was caused by an earlier exception, such as too many open files.


 On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
 og...@plainvanillagames.com wrote:

  On 3/23/14, 5:49 PM, Matei Zaharia wrote:

 You can set spark.local.dir to put this data somewhere other than /tmp if
 /tmp is full. Actually it's recommended to have multiple local disks and
 set to to a comma-separated list of directories, one per disk.

  Matei, does the number of tasks/partitions in a transformation influence
 something in terms of disk space consumption? Or inode consumption?

 Thanks,
 Ognen



 --
 A distributed system is one in which the failure of a computer you didn't 
 even know existed can render your own computer unusable
 -- Leslie Lamport




Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
I would love to work on this (and other) stuff if I can bother someone 
with questions offline or on a dev mailing list.

Ognen

On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up. 100% inode utilization is an issue I 
haven't seen raised before, and it points to another gap that is not 
on our current roadmap: state cleanup (cleaning up data which was 
not fully cleaned up from a crashed process).



On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski 
og...@plainvanillagames.com wrote:


Bleh, strike that, one of my slaves was at 100% inode utilization
on the file system. It was /tmp/spark* leftovers that apparently
did not get cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.

Thanks (and sorry for the noise)!
Ognen


On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

Aaron, thanks for replying. I am very much puzzled as to what is
going on. A job that used to run on the same cluster is failing
with this mysterious message about not having enough disk space
when in fact I can see through watch df -h that the free space
is always hovering around 3+GB on the disk and the free inodes
are at 50% (this is on master). I went through each slave and the
spark/work/app*/stderr and stdout and spark/logs/*out files and
no mention of too many open files failures on any of the slaves
nor on the master :(

Thanks
Ognen

On 3/23/14, 8:38 PM, Aaron Davidson wrote:

By default, with P partitions (for both the pre-shuffle stage
and post-shuffle), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would instead
create only P files. Disk space consumption is largely unaffected
by the number of partitions, however, unless each partition is
particularly small.

You might look at the actual executors' logs, as it's possible
that this error was caused by an earlier exception, such as too
many open files.


On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:

On 3/23/14, 5:49 PM, Matei Zaharia wrote:

You can set spark.local.dir to put this data somewhere
other than /tmp if /tmp is full. Actually it's recommended
to have multiple local disks and set it to a
comma-separated list of directories, one per disk.

Matei, does the number of tasks/partitions in a
transformation influence something in terms of disk space
consumption? Or inode consumption?

Thanks,
Ognen





-- 
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable

-- Leslie Lamport



--
No matter what they ever do to us, we must always act for the love of our people 
and the earth. We must not react out of hatred against those who have no sense.
-- John Trudell


How many partitions is my RDD split into?

2014-03-23 Thread Nicholas Chammas
Hey there fellow Dukes of Data,

How can I tell how many partitions my RDD is split into?

I'm interested in knowing because, from what I gather, having a good number
of partitions is good for performance. If I'm looking to understand how my
pipeline is performing, say for a parallelized write out to HDFS, knowing
how many partitions an RDD has would be a good thing to check.

Is that correct?

I could not find an obvious method or property to see how my RDD is
partitioned. Instead, I devised the following thingy:

def f(idx, itr): yield idx

rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the
answer I'm looking for. Derp. :)
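
For anyone curious what the mapPartitionsWithIndex trick is actually
counting, here is a Spark-free sketch of how a driver might slice a list
into numSlices partitions. This is plain Python illustrating the idea, not
PySpark's actual slicing code:

```python
def slice_data(data, num_slices):
    """Split data into num_slices roughly equal partitions
    (a sketch of the idea, not PySpark's real implementation)."""
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

partitions = slice_data([1, 2, 3, 4], 4)
print(len(partitions))  # prints 4 -- the count the index trick recovers
```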

So in summary, should I care about how finely my RDDs are partitioned? And
how would I check on that?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How many partitions is my RDD split into?

2014-03-23 Thread Mark Hamstra
It's much simpler: rdd.partitions.size


On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Hey there fellow Dukes of Data,

 How can I tell how many partitions my RDD is split into?

 I'm interested in knowing because, from what I gather, having a good
 number of partitions is good for performance. If I'm looking to understand
 how my pipeline is performing, say for a parallelized write out to HDFS,
 knowing how many partitions an RDD has would be a good thing to check.

 Is that correct?

 I could not find an obvious method or property to see how my RDD is
 partitioned. Instead, I devised the following thingy:

 def f(idx, itr): yield idx

 rdd = sc.parallelize([1, 2, 3, 4], 4)
 rdd.mapPartitionsWithIndex(f).count()

 Frankly, I'm not sure what I'm doing here, but this seems to give me the
 answer I'm looking for. Derp. :)

 So in summary, should I care about how finely my RDDs are partitioned? And
 how would I check on that?

 Nick


 --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
As Mark said you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work but the performance will
improve if you coalesce the partitions using rdd.coalesce().

This can happen for example if you do a highly selective filter on an
RDD. For instance, you filter out one day of data from a dataset of a
year.

- Patrick
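
The coalesce advice can be illustrated without a cluster: after a highly
selective filter, most partitions end up empty, and merging them cuts
per-partition overhead. A plain-Python sketch of the idea (Spark's actual
coalesce groups partitions differently; the year-of-data example and the
round-robin grouping here are just illustrative):

```python
def coalesce(partitions, num_partitions):
    """Merge a list of partitions into num_partitions larger ones --
    a sketch of what rdd.coalesce() does logically, not its real code."""
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged

# 365 daily partitions, but a filter kept only one day's records.
year = [[("2014-03-23", x) for x in range(100)] if d == 81 else []
        for d in range(365)]
small = coalesce(year, 4)
print([len(p) for p in small])  # -> [0, 100, 0, 0]
```

Going from 365 mostly empty tasks down to 4 is exactly the kind of
post-filter cleanup Patrick describes.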

On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra m...@clearstorydata.com wrote:
 It's much simpler: rdd.partitions.size


 On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:

 Hey there fellow Dukes of Data,

 How can I tell how many partitions my RDD is split into?

 I'm interested in knowing because, from what I gather, having a good
 number of partitions is good for performance. If I'm looking to understand
 how my pipeline is performing, say for a parallelized write out to HDFS,
 knowing how many partitions an RDD has would be a good thing to check.

 Is that correct?

 I could not find an obvious method or property to see how my RDD is
 partitioned. Instead, I devised the following thingy:

 def f(idx, itr): yield idx

 rdd = sc.parallelize([1, 2, 3, 4], 4)
 rdd.mapPartitionsWithIndex(f).count()

 Frankly, I'm not sure what I'm doing here, but this seems to give me the
 answer I'm looking for. Derp. :)

 So in summary, should I care about how finely my RDDs are partitioned? And
 how would I check on that?

 Nick


 
 View this message in context: How many partitions is my RDD split into?
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.
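
Ognen's mental note (watch inode usage, cron-clean /tmp/spark* leftovers)
can be scripted. A minimal sketch, assuming Linux and that spark* directories
under /tmp older than a day are safe to delete; check the path and age
threshold against your own deployment before trusting it:

```python
import glob
import os
import shutil
import time

def inode_usage(path="/tmp"):
    """Fraction of inodes used on the filesystem holding path."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems report no inode counts
        return 0.0
    return 1.0 - st.f_ffree / st.f_files

def clean_spark_tmp(root="/tmp", max_age_hours=24):
    """Remove root/spark* leftovers older than max_age_hours."""
    cutoff = time.time() - max_age_hours * 3600
    for d in glob.glob(os.path.join(root, "spark*")):
        if os.path.getmtime(d) < cutoff:
            shutil.rmtree(d, ignore_errors=True)

if __name__ == "__main__":
    print("inode usage on /tmp: %.0f%%" % (100 * inode_usage("/tmp")))
```

Run from cron on each slave and the master, it would have caught the 100%
inode condition before jobs started failing.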

On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:
 I would love to work on this (and other) stuff if I can bother someone with
 questions offline or on a dev mailing list.
 Ognen


 On 3/23/14, 10:04 PM, Aaron Davidson wrote:

 Thanks for bringing this up. 100% inode utilization is an issue I haven't
 seen raised before, and it points to another gap that is not on our current
 roadmap: state cleanup (cleaning up data which was not fully cleaned up
 from a crashed process).


 On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
 og...@plainvanillagames.com wrote:

 Bleh, strike that, one of my slaves was at 100% inode utilization on the
 file system. It was /tmp/spark* leftovers that apparently did not get
 cleaned up properly after failed or interrupted jobs.
 Mental note - run a cron job on all slaves and master to clean up
 /tmp/spark* regularly.

 Thanks (and sorry for the noise)!
 Ognen


 On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:

 Aaron, thanks for replying. I am very much puzzled as to what is going on.
 A job that used to run on the same cluster is failing with this mysterious
 message about not having enough disk space when in fact I can see through
 watch df -h that the free space is always hovering around 3+GB on the disk
 and the free inodes are at 50% (this is on master). I went through each
 slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
 and no mention of too many open files failures on any of the slaves nor on
 the master :(

 Thanks
 Ognen

 On 3/23/14, 8:38 PM, Aaron Davidson wrote:

 By default, with P partitions (for both the pre-shuffle stage and
 post-shuffle), there are P^2 files created. With
 spark.shuffle.consolidateFiles turned on, we would instead create only P
 files. Disk space consumption is largely unaffected by the number of
 partitions, however, unless each partition is particularly small.

 You might look at the actual executors' logs, as it's possible that this
 error was caused by an earlier exception, such as too many open files.


 On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
 og...@plainvanillagames.com wrote:

 On 3/23/14, 5:49 PM, Matei Zaharia wrote:

 You can set spark.local.dir to put this data somewhere other than /tmp if
 /tmp is full. Actually it's recommended to have multiple local disks and set
 it to a comma-separated list of directories, one per disk.

 Matei, does the number of tasks/partitions in a transformation influence
 something in terms of disk space consumption? Or inode consumption?

 Thanks,
 Ognen



 --
 A distributed system is one in which the failure of a computer you didn't
 even know existed can render your own computer unusable
 -- Leslie Lamport



 --
 No matter what they ever do to us, we must always act for the love of our
 people and the earth. We must not react out of hatred against those who have
 no sense.
 -- John Trudell