Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Vibhor Banga
Some inputs will be really helpful.

Thanks,
-Vibhor


On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case where the HBase table grows larger than the
 RAM available in the cluster, will the application fail, or will there
 be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




-- 
Vibhor Banga
Software Development Engineer
Flipkart Internet Pvt. Ltd., Bangalore


Re: Using Spark on Data size larger than Memory size

2014-05-31 Thread Mayur Rustagi
Clearly there will be an impact on performance, but frankly it depends on
what you are trying to achieve with the dataset.
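
For context, a common way to make that kind of workload degrade gracefully is
to persist the RDD with a storage level that spills to disk. A minimal sketch,
assuming an existing SparkContext sc, the standard HBase TableInputFormat, and
a hypothetical table name:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.storage.StorageLevel

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table name

val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// MEMORY_AND_DISK spills cached partitions that do not fit in RAM to local disk
// instead of dropping them, so the job keeps running, just more slowly.
hbaseRDD.persist(StorageLevel.MEMORY_AND_DISK)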

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com
wrote:

 Some inputs will be really helpful.

 Thanks,
 -Vibhor


 On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com
 wrote:

 Hi all,

 I am planning to use Spark with HBase, where I generate an RDD by reading
 data from an HBase table.

 I want to know: in the case where the HBase table grows larger than the
 RAM available in the cluster, will the application fail, or will there
 be an impact on performance?

 Any thoughts in this direction will be helpful and are welcome.

 Thanks,
 -Vibhor




 --
 Vibhor Banga
 Software Development Engineer
 Flipkart Internet Pvt. Ltd., Bangalore




Re: Failed to remove RDD error

2014-05-31 Thread Mayur Rustagi
You can increase your Akka timeout, which should give you some more life. Are
you running out of memory, by any chance?
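
For reference, a minimal sketch of raising that timeout when building the
streaming context (property name per the 0.9.x configuration docs; the app
name and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-streaming-job")      // placeholder app name
  .set("spark.akka.askTimeout", "120")    // seconds; the default is 30
val ssc = new StreamingContext(conf, Seconds(10))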


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Sat, May 31, 2014 at 6:52 AM, Michael Chang m...@tellapart.com wrote:

 I'm running some Kafka streaming Spark contexts (on 0.9.1), and they
 seem to be dying after 10 or so minutes with a lot of these errors. I
 can't really tell what's going on here, except that maybe the driver is
 unresponsive somehow? Has anyone seen this before?

 14/05/31 01:13:30 ERROR BlockManagerMaster: Failed to remove RDD 12635

 akka.pattern.AskTimeoutException: Timed out

 at
 akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)

 at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)

 at
 scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:691)

 at
 scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:688)

 at
 akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)

 at
 akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)

 at
 akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)

 at
 akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)

 at java.lang.Thread.run(Thread.java:744)

 Thanks,

 Mike





Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-31 Thread Xiangrui Meng
The documentation you looked at is not official, though it is from
@pwendell's website. It was for the Spark SQL release. Please find the
official documentation here:

http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm

It contains a working example showing how to construct LabeledPoint
and use it for training.
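
For reference, the construction looks like this in the Scala API (the pyspark
API mirrors it with pyspark.mllib.regression.LabeledPoint); a minimal sketch
with made-up data:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Each training example is a label plus a feature vector, not a bare array.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))))

val model = SVMWithSGD.train(training, 20)  // 20 iterations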

Best,
Xiangrui

On Fri, May 30, 2014 at 5:10 AM, jamborta jambo...@gmail.com wrote:
 thanks for the reply. I am definitely running 1.0.0; I set it up manually.

 To answer my own question, I found out from the examples that it needs a
 new data type called LabeledPoint instead of a numpy array.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-MLlib-examples-don-t-work-with-Spark-1-0-0-tp6546p6579.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Create/shutdown objects before/after RDD use (or: Non-serializable classes)

2014-05-31 Thread Xiangrui Meng
Hi Tobias,

One hack you can try is:

rdd.mapPartitions(iter => {
  val x = new X()
  iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty }
})

Best,
Xiangrui

On Thu, May 29, 2014 at 11:38 PM, Tobias Pfeiffer t...@preferred.jp wrote:
 Hi,

 I want to use an object x in my RDD processing as follows:

 val x = new X()
 rdd.map(row => x.doSomethingWith(row))
 println(rdd.count())
 x.shutdown()

 Now the problem is that X is non-serializable, so while this works
 locally, it does not work in a cluster setup. I thought I could do

 rdd.mapPartitions(iter => {
   val x = new X()
   val result = iter.map(row => x.doSomethingWith(row))
   x.shutdown()
   result
 })

 to create an instance of X locally, but obviously x.shutdown() is
 called before the first row is processed.

 How can I specify these node-local setup/teardown functions or how do
 I deal in general with non-serializable classes?

 Thanks
 Tobias


Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread prabeesh k
Hi,

Scenario: read data from HDFS, apply a Hive query on it, and write the result
back to HDFS.

Schema creation, querying, and saveAsTextFile are working fine in the
following modes:

   - local mode
   - mesos cluster with a single node
   - spark cluster with multiple nodes

Schema creation and querying are also working fine on a multi-node mesos
cluster. But while trying to write back to HDFS using saveAsTextFile, the
following error occurs:

14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4
(mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
14/05/30 10:16:35 INFO DAGScheduler: Executor lost:
201405291518-3644595722-5050-17933-1 (epoch 148)

Let me know your thoughts regarding this.

Regards,
prabeesh


Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
Hi there, Patrick. Thanks for the reply...

It wouldn't surprise me that AWS Ubuntu has Python 2.7. Ubuntu is cool like
that. :-)

Alas, the Amazon Linux AMI (2014.03.1) does not, and it's the very first
one on the recommended instance list. (Ubuntu is #4, after Amazon, RedHat,
and SUSE.) So users such as myself who deliberately pick the most
Amazon-ish, obvious first choice find they picked the wrong one.

But that's trivial compared to the failure of the cluster to come up,
apparently due to the master's http configuration. Any help on that would
be much appreciated... it's giving me serious grief.



On Sat, May 31, 2014 at 1:37 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hi Jeremy,

 That's interesting, I don't think anyone has ever reported an issue
 running these scripts due to Python incompatibility, but they may require
 Python 2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that
 might be a good place to start. But if there is a straightforward way to
 make them compatible with 2.6 we should do that.

 For r3.large, we can add that to the script. It's a newer type. Any
 interest in contributing this?

 - Patrick

 On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com
 wrote:


 Hi there! I'm relatively new to the list, so sorry if this is a repeat:

 I just wanted to mention there are still problems with the EC2 scripts.
 Basically, they don't work.

 First, if you run the scripts on Amazon's own suggested version of Linux,
 they break because Amazon installs Python 2.6.9 and the scripts use a
 couple of Python 2.7 commands. I have to sudo yum install python27, and
 then edit the spark-ec2 shell script to use that specific version.
 Annoying, but minor.

 (the base python command isn't upgraded to 2.7 on many systems,
 apparently because it would break yum)

 The second minor problem is that the script doesn't know about the
 r3.large servers... also easily fixed by adding them to the spark_ec2.py
 script. Minor.

 The big problem is that after the EC2 cluster is provisioned, installed,
 set up, and everything, it fails to start up the webserver on the master.
 Here's the tail of the log:

 Starting GANGLIA gmond:[  OK  ]
 Shutting down GANGLIA gmond:   [FAILED]
 Starting GANGLIA gmond:[  OK  ]
 Connection to ec2-54-183-82-48.us-west-1.compute.amazonaws.com closed.
 Shutting down GANGLIA gmond:   [FAILED]
 Starting GANGLIA gmond:[  OK  ]
 Connection to ec2-54-183-82-24.us-west-1.compute.amazonaws.com closed.
 Shutting down GANGLIA gmetad:  [FAILED]
 Starting GANGLIA gmetad:   [  OK  ]
 Stopping httpd:[FAILED]
 Starting httpd: httpd: Syntax error on line 153 of
 /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into
 server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object
 file: No such file or directory
[FAILED]

 Basically, the AMI you have chosen does not seem to have a full install
 of apache, and is missing several modules that are referred to in the
 httpd.conf file that is installed. The full list of missing modules is:

 authn_alias_module modules/mod_authn_alias.so
 authn_default_module modules/mod_authn_default.so
 authz_default_module modules/mod_authz_default.so
 ldap_module modules/mod_ldap.so
 authnz_ldap_module modules/mod_authnz_ldap.so
 disk_cache_module modules/mod_disk_cache.so

 Alas, even if these modules are commented out, the server still fails to
 start.

 [root@ip-172-31-11-193 ~]$ service httpd start
 Starting httpd: AH00534: httpd: Configuration error: No MPM loaded.

 That means Spark 1.0.0 clusters on EC2 are Dead-On-Arrival when run
 according to the instructions. Sorry.

 Any suggestions on how to proceed? I'll keep trying to fix the webserver,
 but (a) changes to httpd.conf get blown away by resume, and (b) anything
 I do has to be redone every time I provision another cluster. Ugh.

 --
 Jeremy Lee  BCompSci(Hons)
   The Unorthodox Engineers








-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should
give an exception with the reason for the task failure. Also in the
future, for this type of e-mail please only e-mail the user@ list
and not both lists.

- Patrick

On Sat, May 31, 2014 at 3:22 AM, prabeesh k prabsma...@gmail.com wrote:
 Hi,

 Scenario: read data from HDFS, apply a Hive query on it, and write the
 result back to HDFS.

 Schema creation, querying, and saveAsTextFile are working fine in the
 following modes:

 local mode
 mesos cluster with single node
 spark cluster with multi node

 Schema creation and querying are also working fine on a multi-node mesos
 cluster. But while trying to write back to HDFS using saveAsTextFile, the
 following error occurs:

  14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4
 (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
 14/05/30 10:16:35 INFO DAGScheduler: Executor lost:
 201405291518-3644595722-5050-17933-1 (epoch 148)

 Let me know your thoughts regarding this.

 Regards,
 prabeesh


Re: How can I dispose an Accumulator?

2014-05-31 Thread Patrick Wendell
Hey There,

You can remove an accumulator by just letting it go out of scope, and it
will be garbage collected. For broadcast variables we actually store
extra information for them, so we provide hooks for users to remove the
associated state. There is no such need for accumulators, though.
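
A minimal sketch of the difference, assuming an existing SparkContext sc:

val acc = sc.accumulator(0)              // nothing to clean up: drop the reference and it is GC'd
val bc  = sc.broadcast(Map(1 -> "one"))  // broadcast data is tracked on the executors...
bc.unpersist()                           // ...so Spark exposes an explicit hook to release it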

- Patrick

On Thu, May 29, 2014 at 2:13 AM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
 Hi,



 How can I dispose an Accumulator?

 It has no method like 'unpersist()' which Broadcast provides.



 Thanks.




Re: Spark hook to create external process

2014-05-31 Thread Patrick Wendell
Currently, an executor always runs in its own JVM, so it should be
possible to just use some static initialization to, e.g., launch a
sub-process and set up a bridge with which to communicate.

This would be a fairly advanced use case, however.
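
A rough sketch of that pattern, assuming some existing rdd; the helper binary
path and the bridge protocol are placeholders, not anything Spark provides:

// One helper process per executor JVM, started lazily on first use.
object ExternalBridge {
  lazy val process: Process = {
    val p = new ProcessBuilder("/path/to/helper").start()  // placeholder helper binary
    sys.addShutdownHook { p.destroy() }                    // tear down with the executor JVM
    p
  }
}

rdd.map { row =>
  val helper = ExternalBridge.process  // first use on each executor spawns the helper once
  // ... talk to the helper over stdin/stdout or a socket here ...
  row
}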

- Patrick



On Thu, May 29, 2014 at 8:39 PM, ansriniv ansri...@gmail.com wrote:
 Hi Matei,

 Thanks for the reply.

 I would like to avoid having to spawn these external processes every time
 during the processing of the task, to reduce task latency. I'd like these to
 be pre-spawned as much as possible - tying them to the lifecycle of the
 corresponding threadpool thread would simplify management for me.

 Also, during processing, some back-and-forth communication is required
 between the Spark executor thread and its associated external process.

 For these two reasons, pipe() wouldn't meet my requirement.

 Is there any hook in the ThreadPoolExecutor created by the Spark Executor to
 plug in my own ThreadFactory?

 Thanks
 Anand



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hook-to-create-external-process-tp6526p6552.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Patrick Wendell
 1. ctx is an instance of JavaSQLContext, but the textFile method is called
 as a member of ctx.
 According to the API, JavaSQLContext does not have such a member, so I'm
 guessing this should be sc instead.

Yeah, I think you are correct.

 2. In that same code example the object sqlCtx is referenced, but it is
 never instantiated in the code.
 should this be ctx?

Also correct.

I think it would be good to be consistent and always have ctx refer
to a JavaSparkContext and have sqlCtx refer to a JavaSQLContext.

Any interest in creating a pull request for this? We'd be happy to
accept the change.

- Patrick


Re: getPreferredLocations

2014-05-31 Thread Patrick Wendell
 1) Is there a guarantee that a partition will only be processed on a node
 which is in the getPreferredLocations set of nodes returned by the RDD ?

No, there isn't; by default Spark may schedule in a non-preferred
location after `spark.locality.wait` has expired.

http://spark.apache.org/docs/latest/configuration.html#scheduling

If you want this to be treated as a hard constraint, you can set
spark.locality.wait to a very high value. Keep in mind, though, that this
will starve your job if all of the preferred locations are on nodes that
are not alive.
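
A minimal sketch of that setting (the app name is a placeholder; the value is
in milliseconds, so pick something much larger than any task):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-constrained-job")  // placeholder name
  .set("spark.locality.wait", "86400000")  // one day, effectively "wait forever"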


 2) I am implementing this custom RDD in Java and plan to extend JavaRDD.
 However, I don't see a getPreferredLocations method in

There is currently no support for writing custom RDD classes in Java; at
the least, you'd need to write some of the internals in Scala and then
write a Java wrapper.
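
For what it's worth, the Scala side of such a custom RDD would look roughly
like this (class and host names are illustrative):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class MyStorageRDD(sc: SparkContext) extends RDD[String](sc, Nil) {
  // One Partition per chunk of the underlying storage.
  override def getPartitions: Array[Partition] = ???
  // Read the records belonging to the given partition.
  override def compute(split: Partition, context: TaskContext): Iterator[String] = ???
  // Hostnames where this partition's data lives.
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq("host1.example.com")
}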

Can I ask, though: what is the reason you are trying to write a custom
RDD? This is usually a somewhat advanced use case, e.g. if you are
writing a Spark integration with a new storage system or something
like that.


Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Patrick Wendell
I think there are a few ways to do this... the simplest one might be to
manually build a set of comma-separated paths that excludes the bad file,
and pass that to textFile().

When you call textFile(), under the hood it passes your filename
string to hadoopFile(), which calls setInputPaths() on the Hadoop
FileInputFormat.

http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,
org.apache.hadoop.fs.Path...)

I think this can accept a comma-separated list of paths.

So you could do something like this (this is pseudo-code):
files = fs.listStatus("s3n://bucket/stuff/*.gz")
files = files.filter(not the bad file)
fileStr = files.map(f => f.getPath.toString).mkString(",")

sc.textFile(fileStr)...
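
A slightly more concrete version of that sketch, using the Hadoop FileSystem
API on the driver (the bucket and bad-file names are placeholders):

import org.apache.hadoop.fs.Path

val pattern = new Path("s3n://bucket/stuff/*.gz")  // placeholder bucket/path
val fs = pattern.getFileSystem(sc.hadoopConfiguration)
val goodFiles = fs.globStatus(pattern)
  .map(_.getPath.toString)
  .filterNot(_.endsWith("bad.gz"))                 // placeholder bad file
sc.textFile(goodFiles.mkString(","))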

- Patrick




On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 YES, your hunches were correct. I've identified at least one file among
 the hundreds I'm processing that is indeed not a valid gzip file.

 Does anyone know of an easy way to exclude a specific file or files when
 calling sc.textFile() on a pattern? e.g. Something like: 
 sc.textFile('s3n://bucket/stuff/*.gz', exclude: 's3n://bucket/stuff/bad.gz')


 On Wed, May 21, 2014 at 11:50 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Thanks for the suggestions, people. I will try to hone in on which
 specific gzipped files, if any, are actually corrupt.

 Michael,

 I'm using Hadoop 1.0.4, which I believe is the default version that gets
 deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281
 https://issues.apache.org/jira/browse/HADOOP-5281, affects Hadoop
 0.18.0 and is fixed in 0.20.0 and is also related to gzip compression. I
 know there is some funkiness in how Hadoop is versioned, so I'm not sure if
 this issue is relevant to 1.0.4.

 Were you able to resolve your issue by changing your version of Hadoop?
 How did you do that?

 Nick


 On Wed, May 21, 2014 at 11:38 PM, Andrew Ash and...@andrewash.com
 wrote:

 One thing you can try is to pull each file out of S3 and decompress with
 gzip -d to see if it works.  I'm guessing there's a corrupted .gz file
 somewhere in your path glob.

 Andrew


 On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com
 wrote:

 Hi Nick,

 Which version of Hadoop are you using with Spark?  I spotted an issue
 with the built-in GzipDecompressor while doing something similar with
 Hadoop 1.0.4, all my Gzip files were valid and tested yet certain files
 blew up from Hadoop/Spark.

 The following JIRA ticket goes into more detail
 https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all
 Hadoop releases prior to 1.2.X

 MC




  *Michael Cutler*
 Founder, CTO


 * Mobile: +44 789 990 7847 Email:   mich...@tumra.com
 mich...@tumra.com Web: tumra.com
 http://tumra.com/?utm_source=signatureutm_medium=email *
 *Visit us at our offices in Chiswick Park http://goo.gl/maps/abBxq*
 *Registered in England  Wales, 07916412. VAT No. 130595328*


 This email and any files transmitted with it are confidential and may
 also be privileged. It is intended only for the person to whom it is
 addressed. If you have received this email in error, please inform the
 sender immediately. If you are not the intended recipient you must not
 use, disclose, copy, print, distribute or rely on this email.


 On 21 May 2014 14:26, Madhu ma...@madhu.com wrote:

 Can you identify a specific file that fails?
 There might be a real bug here, but I have found gzip to be reliable.
 Every time I have run into a bad header error with gzip, I had a
 non-gzip
 file with the wrong extension for whatever reason.




 -
 Madhu
 https://www.linkedin.com/in/msiddalingaiah
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.








Re: Trouble with EC2

2014-05-31 Thread Matei Zaharia
What instance types did you launch on?

Sometimes you also get a bad individual machine from EC2. It might help to 
remove the node it’s complaining about from the conf/slaves file.

Matei

On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote:

 Hey Folks, 
 
 I'm really having quite a bit of trouble getting Spark running on EC2. I'm
 not using the scripts at https://github.com/apache/spark/tree/master/ec2
 because I'd like to know how everything works. But I'm going a little crazy.
 I think that something about the networking configuration must be messed up,
 but I'm at a loss. Shortly after starting the cluster, I get a lot of this:
 
 14/05/30 18:03:22 INFO master.Master: Registering worker 
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:22 INFO master.Master: Registering worker 
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker 
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:03:23 INFO master.Master: Registering worker 
 ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 INFO actor.LocalActorRef: Message 
 [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
 Actor[akka://sparkMaster/deadLetters] to 
 Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
  was not delivered. [5] dead letters encountered. This logging can be turned 
 off or adjusted with configuration settings 'akka.log-dead-letters' and 
 'akka.log-dead-letters-during-shutdown'.
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
 failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by: 
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
 failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by: 
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
 ]
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 INFO master.Master: 
 akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
 removing it.
 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
 [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] - 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
 failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
 akka.remote.EndpointAssociationException: Association failed with 
 [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
 Caused by: 
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
 Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485



Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Nicholas Chammas
That's a neat idea. I'll try that out.


On Sat, May 31, 2014 at 2:45 PM, Patrick Wendell pwend...@gmail.com wrote:

 I think there are a few ways to do this... the simplest one might be to
 manually build a set of comma-separated paths that excludes the bad file,
 and pass that to textFile().

 When you call textFile() under the hood it is going to pass your filename
 string to hadoopFile() which calls setInputPaths() on the hadoop
 FileInputformat.


 http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,
 org.apache.hadoop.fs.Path...)

 I think this can accept a comma-separate list of paths.

 So you could do something like this (this is pseudo-code):
 files = fs.listStatus("s3n://bucket/stuff/*.gz")
 files = files.filter(not the bad file)
 fileStr = files.map(f => f.getPath.toString).mkString(",")

 sc.textFile(fileStr)...

 - Patrick




 On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 YES, your hunches were correct. I’ve identified at least one file among
 the hundreds I’m processing that is indeed not a valid gzip file.

 Does anyone know of an easy way to exclude a specific file or files when
 calling sc.textFile() on a pattern? e.g. Something like: 
 sc.textFile('s3n://bucket/stuff/*.gz', exclude: 's3n://bucket/stuff/bad.gz')


 On Wed, May 21, 2014 at 11:50 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Thanks for the suggestions, people. I will try to hone in on which
 specific gzipped files, if any, are actually corrupt.

 Michael,

 I’m using Hadoop 1.0.4, which I believe is the default version that gets
 deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281
 https://issues.apache.org/jira/browse/HADOOP-5281, affects Hadoop
 0.18.0 and is fixed in 0.20.0 and is also related to gzip compression. I
 know there is some funkiness in how Hadoop is versioned, so I’m not sure if
 this issue is relevant to 1.0.4.

 Were you able to resolve your issue by changing your version of Hadoop?
 How did you do that?

 Nick


 On Wed, May 21, 2014 at 11:38 PM, Andrew Ash and...@andrewash.com
 wrote:

 One thing you can try is to pull each file out of S3 and decompress
 with gzip -d to see if it works.  I'm guessing there's a corrupted .gz
 file somewhere in your path glob.

 Andrew


 On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com
 wrote:

 Hi Nick,

 Which version of Hadoop are you using with Spark?  I spotted an issue
 with the built-in GzipDecompressor while doing something similar with
 Hadoop 1.0.4, all my Gzip files were valid and tested yet certain files
 blew up from Hadoop/Spark.

 The following JIRA ticket goes into more detail
 https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all
 Hadoop releases prior to 1.2.X

 MC




  *Michael Cutler*
 Founder, CTO


 * Mobile: +44 789 990 7847 Email:   mich...@tumra.com
 mich...@tumra.com Web: tumra.com
 http://tumra.com/?utm_source=signatureutm_medium=email *
 *Visit us at our offices in Chiswick Park http://goo.gl/maps/abBxq*
 *Registered in England  Wales, 07916412. VAT No. 130595328*


 This email and any files transmitted with it are confidential and may
 also be privileged. It is intended only for the person to whom it is
 addressed. If you have received this email in error, please inform the
 sender immediately. If you are not the intended recipient you must
 not use, disclose, copy, print, distribute or rely on this email.


 On 21 May 2014 14:26, Madhu ma...@madhu.com wrote:

 Can you identify a specific file that fails?
 There might be a real bug here, but I have found gzip to be reliable.
 Every time I have run into a bad header error with gzip, I had a
 non-gzip
 file with the wrong extension for whatever reason.




 -
 Madhu
 https://www.linkedin.com/in/msiddalingaiah
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.









hadoopRDD stalls reading entire directory

2014-05-31 Thread Russell Jurney
I'm running the following code to load an entire directory of Avros using
hadoopRDD.

val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*"

// Set up the path for the job via a Hadoop JobConf
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setJobName("Test Scala Job")
FileInputFormat.setInputPaths(jobConf, input)

val rdd = sc.hadoopRDD(
  jobConf,
  classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  1)


It successfully loads a single file, but when I load an entire directory, I
get this:

scala> rdd.first

14/05/31 17:03:01 INFO mapred.FileInputFormat: Total input paths to process
: 17
14/05/31 17:03:02 INFO spark.SparkContext: Starting job: first at
console:43
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Got job 0 (first at
console:43) with 1 output partitions (allowLocal=true)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Final stage: Stage 0 (first
at console:43)

14/05/31 17:03:02 INFO scheduler.DAGScheduler: Parents of final stage:
List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Missing parents: List()

14/05/31 17:03:02 INFO scheduler.DAGScheduler: Computing the requested
partition locally

14/05/31 17:03:02 INFO rdd.HadoopRDD: Input split:
hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864

14/05/31 17:03:02 INFO spark.SparkContext: Job finished: first at
console:43, took 0.43242113 s

14/05/31 17:03:02 INFO spark.SparkContext: Starting job: first at
console:43
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Got job 1 (first at
console:43) with 16 output partitions (allowLocal=true)

14/05/31 17:03:02 INFO scheduler.DAGScheduler: Final stage: Stage 1 (first
at console:43)

14/05/31 17:03:02 INFO scheduler.DAGScheduler: Parents of final stage:
List()

14/05/31 17:03:02 INFO scheduler.DAGScheduler: Missing parents: List()

14/05/31 17:03:02 INFO scheduler.DAGScheduler: Submitting Stage 1
(HadoopRDD[0] at hadoopRDD at console:40), which has no missing parents

14/05/31 17:03:02 INFO scheduler.DAGScheduler: Submitting 16 missing tasks
from Stage 1 (HadoopRDD[0] at hadoopRDD at console:40)

14/05/31 17:03:02 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0
with 16 tasks

14/05/31 17:03:17 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory

14/05/31 17:03:32 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory

...many times...


And never finishes. What should I do?
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


can not access app details on ec2

2014-05-31 Thread wxhsdp
hi, all
  I launched a spark cluster on EC2 with spark version v1.0.0-rc3. Everything
goes well except that I cannot access the application details on the web UI:
I just click on the application name, but there is no response. Has anyone
met this before? Is this a bug?

  thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/can-not-access-app-details-on-ec2-tp6635.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


spark 1.0.0 on yarn

2014-05-31 Thread Xu (Simon) Chen
Hi all,

I tried a couple of ways, but couldn't get it to work.

The following seems to be what the online document (
http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting:
SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client

The help info of spark-shell seems to suggest --master yarn
--deploy-mode cluster.

But either way, I am seeing the following messages:
14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /
0.0.0.0:8032
14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

My guess is that spark-shell is trying to talk to the resource manager to
set up the spark master/worker nodes; I am not sure where 0.0.0.0:8032 came
from, though. I am running CDH5 with two resource managers in HA mode. Their
IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.

Any ideas? Thanks.
-Simon


Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Yadid Ayzenberg

Yep, I just issued a pull request.

Yadid


On 5/31/14, 1:25 PM, Patrick Wendell wrote:

1. ctx is an instance of JavaSQLContext but the textFile method is called as
a member of ctx.
According to the API, JavaSQLContext does not have such a member, so I'm
guessing this should be sc instead.

Yeah, I think you are correct.


2. In that same code example the object sqlCtx is referenced, but it is
never instantiated in the code.
should this be ctx?

Also correct.

I think it would be good to be consistent and always have ctx refer
to a JavaSparkContext and have sqlCtx refer to a JavaSQLContext.

Any interest in creating a pull request for this? We'd be happy to
accept the change.

- Patrick




Spark on EC2

2014-05-31 Thread superback
Hi, 
I am trying to run an example on AMAZON EC2 and have successfully
set up one cluster with two nodes on EC2. However, when I was testing an
example using the following command, 

./run-example org.apache.spark.examples.GroupByTest spark://`hostname`:7077

I got the following warnings and errors. Can anyone help me solve this
problem? Thanks very much!

46781 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial
job has not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient memory
61544 [spark-akka.actor.default-dispatcher-3] ERROR
org.apache.spark.deploy.client.AppClient$ClientActor - All masters are
unresponsive! Giving up.
61544 [spark-akka.actor.default-dispatcher-3] ERROR
org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Spark
cluster looks dead, giving up.
61546 [spark-akka.actor.default-dispatcher-3] INFO
org.apache.spark.scheduler.TaskSchedulerImpl - Remove TaskSet 0.0 from pool 
61549 [main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to run
count at GroupByTest.scala:50
Exception in thread main org.apache.spark.SparkException: Job aborted:
Spark cluster looks down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-31 Thread Jeremy Lee
It's been another day of spinning up dead clusters...

I thought I'd finally worked out what everyone else knew - don't use the
default AMI - but I've now run through all of the official quick-start
linux releases and I'm none the wiser:

Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
Provisions servers, connects, installs, but the webserver on the master
will not start

Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
Spot instance requests are not supported for this AMI.

SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
Not tested - costs 10x more for spot instances, not economically viable.

Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
Provisions servers, but git is not pre-installed, so the cluster setup
fails.

Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
Provisions servers, but git is not pre-installed, so the cluster setup
fails.

Have I missed something? What AMIs are people using? I've just gone back
through the archives, and I'm seeing a lot of "I can't get EC2 to work" and
not a single "My EC2 has post-install issues".

The quickstart page says you "...can have a spark cluster up and running in
five minutes," but it's been three days for me so far. I'm about to bite the
bullet and start building my own AMIs from scratch... if anyone can save me
from that, I'd be most grateful.

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers