Re: Using Spark on Data size larger than Memory size
Some inputs will be really helpful.

Thanks,
-Vibhor

On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga vibhorba...@gmail.com wrote:

Hi all,

I am planning to use Spark with HBase, where I generate an RDD by reading data from an HBase table. I want to know: in the case when the size of the HBase table grows larger than the size of RAM available in the cluster, will the application fail, or will there be an impact on performance?

Any thoughts in this direction will be helpful and are welcome.

Thanks,
-Vibhor

--
Vibhor Banga
Software Development Engineer
Flipkart Internet Pvt. Ltd., Bangalore
Re: Using Spark on Data size larger than Memory size
Clearly there will be an impact on performance, but frankly it depends on what you are trying to achieve with the dataset.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Sat, May 31, 2014 at 11:45 AM, Vibhor Banga vibhorba...@gmail.com wrote: Some inputs will be really helpful. Thanks, -Vibhor
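To make the tradeoff concrete, here is a back-of-envelope sketch (my own numbers, not from the thread) of how much data Spark 1.x can actually cache: by default only a fraction of executor memory (spark.storage.memoryFraction, 0.6 in Spark 1.0) holds cached RDD blocks, and partitions that don't fit are recomputed or spilled to disk depending on the storage level, rather than simply failing the job.

```python
# Back-of-envelope cache-capacity estimate. The cluster sizes and the
# 120 GB table size are made-up illustrative numbers.
num_executors = 10
executor_mem_gb = 8.0
storage_fraction = 0.6          # spark.storage.memoryFraction default in 1.0

cache_capacity_gb = num_executors * executor_mem_gb * storage_fraction
hbase_table_gb = 120.0

# If this is False, expect recomputation or disk spill (and hence slower
# jobs), not outright failure -- assuming a suitable storage level.
fits_in_memory = hbase_table_gb <= cache_capacity_gb
```

With these numbers only 48 GB of a 120 GB table would fit in cache, so performance degrades but the job can still run.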
Re: Failed to remove RDD error
You can increase your Akka timeout; that should give you some more life. Are you running out of memory by any chance?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Sat, May 31, 2014 at 6:52 AM, Michael Chang m...@tellapart.com wrote:

I'm running some Kafka streaming Spark contexts (on 0.9.1), and they seem to be dying after 10 or so minutes with a lot of these errors. I can't really tell what's going on here, except that maybe the driver is unresponsive somehow? Has anyone seen this before?

14/05/31 01:13:30 ERROR BlockManagerMaster: Failed to remove RDD 12635
akka.pattern.AskTimeoutException: Timed out
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118)
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:691)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:688)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455)
at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407)
at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411)
at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
at java.lang.Thread.run(Thread.java:744)

Thanks,
Mike
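For reference, a sketch of the knobs involved, set via SparkConf or system properties in 0.9.x (the exact property names and defaults are worth double-checking against the 0.9.1 configuration docs; values are in seconds, and the numbers below are only illustrative):

```
spark.akka.timeout      200
spark.akka.askTimeout   120
```

Note that if the driver or executors are memory-pressured, long GC pauses can produce exactly these ask timeouts, so checking heap usage first is worthwhile.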
Re: pyspark MLlib examples don't work with Spark 1.0.0
The documentation you looked at is not official, though it is from @pwendell's website. It was for the Spark SQL release. Please find the official documentation here:

http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm

It contains a working example showing how to construct a LabeledPoint and use it for training.

Best,
Xiangrui

On Fri, May 30, 2014 at 5:10 AM, jamborta jambo...@gmail.com wrote: Thanks for the reply. I am definitely running 1.0.0, I set it up manually. To answer my question, I found out from the examples that it would need a new data type called LabeledPoint instead of a numpy array.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-MLlib-examples-don-t-work-with-Spark-1-0-0-tp6546p6579.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
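For readers skimming the archive, the gist of the fix: MLlib's Python trainers take an RDD of LabeledPoint (from pyspark.mllib.regression), not bare numpy arrays. The snippet below is a plain-Python illustration of the (label, features) shape only; the namedtuple is a hypothetical stand-in, not the real pyspark class.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.mllib.regression.LabeledPoint,
# showing the shape MLlib's trainers expect: a label plus a feature vector.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# one training example: label 1.0 with a 3-dimensional feature vector
point = LabeledPoint(1.0, [0.0, 2.0, 1.5])
```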
Re: Create/shutdown objects before/after RDD use (or: Non-serializable classes)
Hi Tobias,

One hack you can try is:

rdd.mapPartitions(iter => {
  val x = new X()
  iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty }
})

Best,
Xiangrui

On Thu, May 29, 2014 at 11:38 PM, Tobias Pfeiffer t...@preferred.jp wrote:

Hi,

I want to use an object x in my RDD processing as follows:

val x = new X()
rdd.map(row => x.doSomethingWith(row))
println(rdd.count())
x.shutdown()

Now the problem is that X is non-serializable, so while this works locally, it does not work in a cluster setup. I thought I could do

rdd.mapPartitions(iter => {
  val x = new X()
  val result = iter.map(row => x.doSomethingWith(row))
  x.shutdown()
  result
})

to create an instance of X locally, but obviously x.shutdown() is called before the first row is processed. How can I specify these node-local setup/teardown functions, or how do I deal in general with non-serializable classes?

Thanks
Tobias
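The same setup/teardown-per-partition idea, sketched in plain Python (X and do_something_with are made-up stand-ins). A generator mirrors the lazy `iter.map(...) ++ { x.shutdown(); Iterator.empty }` trick: the expensive object is created when the partition starts and shut down only after the last row has been consumed, with the same caveat that teardown never runs if the iterator isn't fully drained.

```python
class X:
    """Stand-in for the non-serializable resource in the thread."""
    def __init__(self):
        self.closed = False

    def do_something_with(self, row):
        return row * 2

    def shutdown(self):
        self.closed = True

def process_partition(rows, created=None):
    # Runs once per partition (what mapPartitions would invoke on the
    # executor): construct X locally, stream rows through it, tear down.
    x = X()
    if created is not None:
        created.append(x)        # exposed only so the behavior is checkable
    for row in rows:
        yield x.do_something_with(row)
    x.shutdown()                 # reached only after the last row is consumed

created = []
out = list(process_partition(iter([1, 2, 3]), created))
```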
Unable to execute saveAsTextFile on multi node mesos
Hi,

Scenario: read data from HDFS, apply a Hive query to it, and write the result back to HDFS.

Schema creation, querying and saveAsTextFile are working fine in the following modes:
- local mode
- mesos cluster with single node
- spark cluster with multi node

Schema creation and querying are working fine with a mesos multi-node cluster. But while trying to write back to HDFS using saveAsTextFile, the following error occurs:

14/05/30 10:16:35 INFO DAGScheduler: The failed fetch was from Stage 4 (mapPartitionsWithIndex at Operator.scala:333); marking it for resubmission
14/05/30 10:16:35 INFO DAGScheduler: Executor lost: 201405291518-3644595722-5050-17933-1 (epoch 148)

Let me know your thoughts regarding this.

Regards,
prabeesh
Re: Yay for 1.0.0! EC2 Still has problems.
Hi there, Patrick. Thanks for the reply...

It wouldn't surprise me that AWS Ubuntu has Python 2.7. Ubuntu is cool like that. :-) Alas, the Amazon Linux AMI (2014.03.1) does not, and it's the very first one on the recommended instance list. (Ubuntu is #4, after Amazon, RedHat, SUSE.) So, users such as myself who deliberately pick the most Amazon-ish obvious first choice find they picked the wrong one.

But that's trivial compared to the failure of the cluster to come up, apparently due to the master's http configuration. Any help on that would be much appreciated... it's giving me serious grief.

On Sat, May 31, 2014 at 1:37 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi Jeremy,

That's interesting. I don't think anyone has ever reported an issue running these scripts due to Python incompatibility, but they may require Python 2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that might be a good place to start. But if there is a straightforward way to make them compatible with 2.6 we should do that.

For r3.large, we can add that to the script. It's a newer type. Any interest in contributing this?

- Patrick

On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote:

Hi there! I'm relatively new to the list, so sorry if this is a repeat: I just wanted to mention there are still problems with the EC2 scripts. Basically, they don't work.

First, if you run the scripts on Amazon's own suggested version of Linux, they break because Amazon installs Python 2.6.9 and the scripts use a couple of Python 2.7 commands. I have to sudo yum install python27, and then edit the spark-ec2 shell script to use that specific version. Annoying, but minor. (The base python command isn't upgraded to 2.7 on many systems, apparently because it would break yum.)

The second minor problem is that the script doesn't know about the r3.large servers... also easily fixed by adding to the spark_ec2.py script. Minor.

The big problem is that after the EC2 cluster is provisioned, installed, set up, and everything, it fails to start up the webserver on the master. Here's the tail of the log:

Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-183-82-48.us-west-1.compute.amazonaws.com closed.
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-183-82-24.us-west-1.compute.amazonaws.com closed.
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory [FAILED]

Basically, the AMI you have chosen does not seem to have a full install of apache, and is missing several modules that are referred to in the httpd.conf file that is installed. The full list of missing modules is:

authn_alias_module modules/mod_authn_alias.so
authn_default_module modules/mod_authn_default.so
authz_default_module modules/mod_authz_default.so
ldap_module modules/mod_ldap.so
authnz_ldap_module modules/mod_authnz_ldap.so
disk_cache_module modules/mod_disk_cache.so

Alas, even if these modules are commented out, the server still fails to start:

root@ip-172-31-11-193 ~]$ service httpd start
Starting httpd: AH00534: httpd: Configuration error: No MPM loaded.

That means Spark 1.0.0 clusters on EC2 are Dead-On-Arrival when run according to the instructions. Sorry.

Any suggestions on how to proceed? I'll keep trying to fix the webserver, but (a) changes to httpd.conf get blown away by resume, and (b) anything I do has to be redone every time I provision another cluster. Ugh.

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
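A sketch of the stop-gap Jeremy mentions: commenting out the LoadModule lines for the missing modules so httpd at least gets past the "cannot open shared object file" error (as he notes, a "No MPM loaded" error may still follow, so this is only the first step, and it would need re-applying on each provisioned cluster).

```python
# Module basenames taken from the list in the message above.
MISSING = {
    "mod_authn_alias.so", "mod_authn_default.so", "mod_authz_default.so",
    "mod_ldap.so", "mod_authnz_ldap.so", "mod_disk_cache.so",
}

def comment_out_missing(conf_text):
    """Return httpd.conf text with LoadModule lines that reference a
    missing .so file commented out, leaving everything else untouched."""
    out = []
    for line in conf_text.splitlines():
        s = line.strip()
        if s.startswith("LoadModule") and any(m in s for m in MISSING):
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out)

# Tiny invented sample of a conf file, for illustration:
conf = ("LoadModule authn_alias_module modules/mod_authn_alias.so\n"
        "LoadModule mime_module modules/mod_mime.so")
fixed = comment_out_missing(conf)
```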
Re: Unable to execute saveAsTextFile on multi node mesos
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure.

Also, in the future, for this type of e-mail please only e-mail the user@ list and not both lists.

- Patrick

On Sat, May 31, 2014 at 3:22 AM, prabeesh k prabsma...@gmail.com wrote: Hi, scenario: Read data from HDFS and apply hive query on it and the result is written back to HDFS.
Re: How can I dispose an Accumulator?
Hey there,

You can remove an accumulator by just letting it go out of scope, and it will be garbage collected. For broadcast variables we actually store extra information, so we provide hooks for users to remove the associated state. There is no such need for accumulators, though.

- Patrick

On Thu, May 29, 2014 at 2:13 AM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: Hi, How can I dispose of an Accumulator? It has no method like 'unpersist()', which Broadcast provides. Thanks.
Re: Spark hook to create external process
Currently, an executor always runs in its own JVM, so it should be possible to just use some static initialization to e.g. launch a sub-process and set up a bridge with which to communicate. This would be a fairly advanced use case, however.

- Patrick

On Thu, May 29, 2014 at 8:39 PM, ansriniv ansri...@gmail.com wrote:

Hi Matei,

Thanks for the reply. I would like to avoid having to spawn these external processes every time during the processing of the task, to reduce task latency. I'd like these to be pre-spawned as much as possible - tying them to the lifecycle of the corresponding threadpool thread would simplify management for me. Also, during processing, some back-and-forth communication is required between the Spark executor thread and its associated external process. For these 2 reasons, pipe() wouldn't meet my requirement.

Is there any hook in the ThreadPoolExecutor created by the Spark Executor to plug in my own ThreadFactory?

Thanks
Anand

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hook-to-create-external-process-tp6526p6552.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
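Patrick's static-initialization suggestion, sketched in Python rather than Scala/Java (`ExternalBridge` and its spawn logic are hypothetical): a lazily created, per-process singleton that tasks could use to reach a pre-spawned external process, constructed at most once per executor JVM no matter how many tasks run.

```python
import threading

class ExternalBridge:
    """Per-process handle to a (hypothetical) pre-spawned external process."""
    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        # A real bridge might call subprocess.Popen([...]) here and keep
        # stdin/stdout pipes open for back-and-forth communication.
        self.spawned = True

    @classmethod
    def get(cls):
        # Double-checked locking: concurrent tasks in the same executor
        # process all share the single bridge instance.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

# Every task-side call returns the same pre-initialized bridge:
a = ExternalBridge.get()
b = ExternalBridge.get()
```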
Re: possible typos in spark 1.0 documentation
1. ctx is an instance of JavaSQLContext but the textFile method is called as a member of ctx. According to the API, JavaSQLContext does not have such a member, so I'm guessing this should be sc instead.

Yeah, I think you are correct.

2. In that same code example the object sqlCtx is referenced, but it is never instantiated in the code. Should this be ctx?

Also correct. I think it would be good to be consistent and always have ctx refer to a JavaSparkContext and have sqlCtx refer to a JavaSQLContext.

Any interest in creating a pull request for this? We'd be happy to accept the change.

- Patrick
Re: getPreferredLocations
1) Is there a guarantee that a partition will only be processed on a node which is in the getPreferredLocations set of nodes returned by the RDD?

No there isn't; by default Spark may schedule in a non-preferred location after `spark.locality.wait` has expired.

http://spark.apache.org/docs/latest/configuration.html#scheduling

If you want to have the behavior that this is treated as a constraint, you can turn spark.locality.wait up to a very high value. Keep in mind though, this will starve your job if all of the preferred locations are on nodes that are not alive.

2) I am implementing this custom RDD in Java and plan to extend JavaRDD. However, I don't see a getPreferredLocations method in

There is not currently support for writing custom RDD classes in Java. Or at least, you'd need to write some of the internals in Scala and then you could write a Java wrapper. Can I ask though - what is the reason you are trying to write a custom RDD? This is usually a somewhat advanced use case, e.g. if you are writing a Spark integration with a new storage system or something like that.
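For reference, the setting Patrick mentions, in spark-defaults.conf style (Spark 1.0; the value is in milliseconds, and the large number below is only an illustration of "very high"):

```
# wait essentially forever for a preferred location before falling back
spark.locality.wait   100000000
```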
Re: count()-ing gz files gives java.io.IOException: incorrect header check
I think there are a few ways to do this... the simplest one might be to manually build a set of comma-separated paths that excludes the bad file, and pass that to textFile().

When you call textFile(), under the hood it is going to pass your filename string to hadoopFile(), which calls setInputPaths() on the Hadoop FileInputFormat.

http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path...)

I think this can accept a comma-separated list of paths. So you could do something like this (this is pseudo-code):

files = fs.listStatus("s3n://bucket/stuff/*.gz")
files = files.filter(not the bad file)
fileStr = files.map(f => f.getPath.toString).mkString(",")
sc.textFile(fileStr) ...

- Patrick

On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

YES, your hunches were correct. I've identified at least one file among the hundreds I'm processing that is indeed not a valid gzip file. Does anyone know of an easy way to exclude a specific file or files when calling sc.textFile() on a pattern? e.g. something like:

sc.textFile('s3n://bucket/stuff/*.gz, exclude:s3n://bucket/stuff/bad.gz')

On Wed, May 21, 2014 at 11:50 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Thanks for the suggestions, people. I will try to home in on which specific gzipped files, if any, are actually corrupt.

Michael, I'm using Hadoop 1.0.4, which I believe is the default version that gets deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281 https://issues.apache.org/jira/browse/HADOOP-5281, affects Hadoop 0.18.0, is fixed in 0.20.0, and is also related to gzip compression. I know there is some funkiness in how Hadoop is versioned, so I'm not sure if this issue is relevant to 1.0.4. Were you able to resolve your issue by changing your version of Hadoop? How did you do that?

Nick

On Wed, May 21, 2014 at 11:38 PM, Andrew Ash and...@andrewash.com wrote:

One thing you can try is to pull each file out of S3 and decompress with gzip -d to see if it works. I'm guessing there's a corrupted .gz file somewhere in your path glob.

Andrew

On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote:

Hi Nick,

Which version of Hadoop are you using with Spark? I spotted an issue with the built-in GzipDecompressor while doing something similar with Hadoop 1.0.4; all my gzip files were valid and tested, yet certain files blew up from Hadoop/Spark. The following JIRA ticket goes into more detail https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all Hadoop releases prior to 1.2.X.

MC

*Michael Cutler*
Founder, CTO
Mobile: +44 789 990 7847
Email: mich...@tumra.com
Web: tumra.com
*Visit us at our offices in Chiswick Park http://goo.gl/maps/abBxq*
*Registered in England & Wales, 07916412. VAT No. 130595328*

This email and any files transmitted with it are confidential and may also be privileged. It is intended only for the person to whom it is addressed. If you have received this email in error, please inform the sender immediately. If you are not the intended recipient you must not use, disclose, copy, print, distribute or rely on this email.

On 21 May 2014 14:26, Madhu ma...@madhu.com wrote:

Can you identify a specific file that fails? There might be a real bug here, but I have found gzip to be reliable. Every time I have run into a bad header error with gzip, I had a non-gzip file with the wrong extension for whatever reason.

- Madhu
https://www.linkedin.com/in/msiddalingaiah

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
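Patrick's pseudo-code, as a runnable plain-Python sketch of just the string-building part (the bucket paths are invented; in a real job the list would come from fs.listStatus and the final string would go to sc.textFile):

```python
# Hypothetical result of listing s3n://bucket/stuff/*.gz, with one
# known-bad archive to exclude.
paths = [
    "s3n://bucket/stuff/a.gz",
    "s3n://bucket/stuff/bad.gz",
    "s3n://bucket/stuff/c.gz",
]
bad = {"s3n://bucket/stuff/bad.gz"}

# Comma-separated form accepted by FileInputFormat.setInputPaths().
file_str = ",".join(p for p in paths if p not in bad)
# sc.textFile(file_str)   # hand the filtered list to Spark
```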
Re: Trouble with EC2
What instance types did you launch on? Sometimes you also get a bad individual machine from EC2. It might help to remove the node it's complaining about from the conf/slaves file.

Matei

On May 30, 2014, at 11:18 AM, PJ$ p...@chickenandwaffl.es wrote:

Hey folks,

I'm really having quite a bit of trouble getting Spark running on EC2. I'm not using the scripts at https://github.com/apache/spark/tree/master/ec2 because I'd like to know how everything works. But I'm going a little crazy. I think that something about the networking configuration must be messed up, but I'm at a loss. Shortly after starting the cluster, I get a lot of this:

14/05/30 18:03:22 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:22 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:23 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:03:23 INFO master.Master: Registering worker ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
]
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
]
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 INFO master.Master: akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, removing it.
14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
]
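Matei's suggestion of dropping the flaky node, sketched as a plain-Python edit of conf/slaves (the hostname is the one the log above complains about; the file contents here are invented):

```python
def remove_slave(slaves_text, bad_host):
    """Return conf/slaves content without any line mentioning bad_host."""
    return "\n".join(line for line in slaves_text.splitlines()
                     if bad_host not in line)

slaves = "ip-10-100-184-45.ec2.internal\nip-10-100-75-70.ec2.internal"
cleaned = remove_slave(slaves, "ip-10-100-75-70.ec2.internal")
```

After writing the cleaned list back to conf/slaves, restart the cluster so the master stops trying to associate with the removed worker.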
Re: count()-ing gz files gives java.io.IOException: incorrect header check
That's a neat idea. I'll try that out.

On Sat, May 31, 2014 at 2:45 PM, Patrick Wendell pwend...@gmail.com wrote:

I think there are a few ways to do this... the simplest one might be to manually build a set of comma-separated paths that excludes the bad file, and pass that to textFile().
hadoopRDD stalls reading entire directory
I'm running the following code to load an entire directory of Avros using hadoopRDD.

val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*"

// Set up the path for the job via a Hadoop JobConf
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setJobName("Test Scala Job")
FileInputFormat.setInputPaths(jobConf, input)
val rdd = sc.hadoopRDD(
  jobConf,
  classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
  classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]],
  classOf[org.apache.hadoop.io.NullWritable],
  1)

It successfully loads a single file, but when I load an entire directory, I get this:

scala> rdd.first
14/05/31 17:03:01 INFO mapred.FileInputFormat: Total input paths to process : 17
14/05/31 17:03:02 INFO spark.SparkContext: Starting job: first at <console>:43
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Got job 0 (first at <console>:43) with 1 output partitions (allowLocal=true)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Final stage: Stage 0 (first at <console>:43)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Computing the requested partition locally
14/05/31 17:03:02 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864
14/05/31 17:03:02 INFO spark.SparkContext: Job finished: first at <console>:43, took 0.43242113 s
14/05/31 17:03:02 INFO spark.SparkContext: Starting job: first at <console>:43
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:43) with 16 output partitions (allowLocal=true)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Final stage: Stage 1 (first at <console>:43)
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Missing parents: List()
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Submitting Stage 1 (HadoopRDD[0] at hadoopRDD at <console>:40), which has no missing parents
14/05/31 17:03:02 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 1 (HadoopRDD[0] at hadoopRDD at <console>:40)
14/05/31 17:03:02 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 16 tasks
14/05/31 17:03:17 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/05/31 17:03:32 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...many times...

And it never finishes. What should I do?

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
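The repeated "Initial job has not accepted any resources" warning usually means either no workers have registered with the master or the job is requesting more memory/cores than any worker offers. A spark-defaults.conf style sketch of the knobs to check (property names from the Spark 1.0 configuration docs; the values are only examples, to be sized below what each worker actually has):

```
spark.executor.memory   2g
spark.cores.max         8
```

Checking the master's web UI for registered workers and their free memory is the quickest way to see which of the two cases applies.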
can not access app details on ec2
Hi all,

I launched a spark cluster on EC2 with Spark version v1.0.0-rc3. Everything goes well except that I cannot access application details on the web UI: I just click on the application name, but there's no response. Has anyone met this before? Is this a bug? Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-not-access-app-details-on-ec2-tp6635.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
spark 1.0.0 on yarn
Hi all,

I tried a couple of ways, but couldn't get it to work. The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting:

SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client

The help info of spark-shell seems to be suggesting --master yarn --deploy-mode cluster. But either way, I am seeing the following messages:

14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

My guess is that spark-shell is trying to talk to the resource manager to set up spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from, though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.

Any ideas? Thanks.

-Simon
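A quick sanity check, sketched in plain Python (the sample XML below is invented): list the resource-manager address properties actually present in yarn-site.xml. If the file isn't really being read from HADOOP_CONF_DIR/YARN_CONF_DIR, YARN clients fall back to the default 0.0.0.0:8032 seen in the log above.

```python
import xml.etree.ElementTree as ET

def rm_addresses(yarn_site_xml):
    """Return {property-name: value} for resourcemanager address keys."""
    root = ET.fromstring(yarn_site_xml)
    out = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", "")
        if "resourcemanager" in name and "address" in name:
            out[name] = prop.findtext("value", "")
    return out

# Invented HA-style yarn-site.xml fragment for illustration:
sample = """<configuration>
  <property><name>yarn.resourcemanager.address.rm1</name><value>rm1.example.com:8032</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>"""
addrs = rm_addresses(sample)
```

If the dictionary comes back empty for the real file, the HA resource-manager addresses aren't where the client is looking, which would explain the 0.0.0.0:8032 fallback.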
Re: possible typos in spark 1.0 documentation
Yep, I just issued a pull request.

Yadid

On 5/31/14, 1:25 PM, Patrick Wendell wrote:

Any interest in creating a pull request for this? We'd be happy to accept the change.

- Patrick
Spark on EC2
Hi,

I am trying to run an example on Amazon EC2 and have successfully set up one cluster with two nodes on EC2. However, when I was testing an example using the following command,

./run-example org.apache.spark.examples.GroupByTest spark://`hostname`:7077

I got the following warnings and errors. Can anyone help me solve this problem? Thanks very much!

46781 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
61544 [spark-akka.actor.default-dispatcher-3] ERROR org.apache.spark.deploy.client.AppClient$ClientActor - All masters are unresponsive! Giving up.
61544 [spark-akka.actor.default-dispatcher-3] ERROR org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Spark cluster looks dead, giving up.
61546 [spark-akka.actor.default-dispatcher-3] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Remove TaskSet 0.0 from pool
61549 [main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to run count at GroupByTest.scala:50
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Yay for 1.0.0! EC2 Still has problems.
It's been another day of spinning up dead clusters... I thought I'd finally worked out what everyone else knew - don't use the default AMI - but I've now run through all of the official quick-start Linux releases and I'm none the wiser:

Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit): provisions servers, connects, installs, but the webserver on the master will not start.

Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419: spot instance requests are not supported for this AMI.

SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f: not tested - costs 10x more for spot instances, not economically viable.

Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3: provisions servers, but git is not pre-installed, so the cluster setup fails.

Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f: provisions servers, but git is not pre-installed, so the cluster setup fails.

Have I missed something? What AMIs are people using? I've just gone back through the archives, and I'm seeing a lot of "I can't get EC2 to work" and not a single "My EC2 has post-install issues".

The quickstart page says you can have a spark cluster up and running in five minutes. But it's been three days for me so far. I'm about to bite the bullet and start building my own AMIs from scratch... if anyone can save me from that, I'd be most grateful.

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers