hi
Hi,
Can someone help me with the following error that I faced while setting up a single-node Spark framework?

karthik@karthik-OptiPlex-9020:~/spark-1.0.0$ MASTER=spark://localhost:7077 sbin/spark-shell
bash: sbin/spark-shell: No such file or directory
karthik@karthik-OptiPlex-9020:~/spark-1.0.0$ MASTER=spark://localhost:7077 bin/spark-shell
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
14/06/23 10:44:53 INFO spark.SecurityManager: Changing view acls to: karthik
14/06/23 10:44:53 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(karthik)
14/06/23 10:44:53 INFO spark.HttpServer: Starting HTTP Server
14/06/23 10:44:53 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/06/23 10:44:53 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:39588
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_05)
Type in expressions to have them evaluated.
Type :help for more information.
14/06/23 10:44:55 INFO spark.SecurityManager: Changing view acls to: karthik
14/06/23 10:44:55 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(karthik)
14/06/23 10:44:55 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/06/23 10:44:55 INFO Remoting: Starting remoting
14/06/23 10:44:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@karthik-OptiPlex-9020:50294]
14/06/23 10:44:55 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@karthik-OptiPlex-9020:50294]
14/06/23 10:44:55 INFO spark.SparkEnv: Registering MapOutputTracker
14/06/23 10:44:55 INFO spark.SparkEnv: Registering BlockManagerMaster
14/06/23 10:44:55 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140623104455-3297
14/06/23 10:44:55 INFO storage.MemoryStore: MemoryStore started with capacity 294.6 MB.
14/06/23 10:44:55 INFO network.ConnectionManager: Bound socket to port 60264 with id = ConnectionManagerId(karthik-OptiPlex-9020,60264)
14/06/23 10:44:55 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/06/23 10:44:55 INFO storage.BlockManagerInfo: Registering block manager karthik-OptiPlex-9020:60264 with 294.6 MB RAM
14/06/23 10:44:55 INFO storage.BlockManagerMaster: Registered BlockManager
14/06/23 10:44:55 INFO spark.HttpServer: Starting HTTP Server
14/06/23 10:44:55 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/06/23 10:44:55 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:38307
14/06/23 10:44:55 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.0.1.61:38307
14/06/23 10:44:55 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-082a44f6-e877-48cc-8ab7-1bcbcf8136b0
14/06/23 10:44:55 INFO spark.HttpServer: Starting HTTP Server
14/06/23 10:44:55 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/06/23 10:44:55 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:58745
14/06/23 10:44:56 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/06/23 10:44:56 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/06/23 10:44:56 INFO ui.SparkUI: Started SparkUI at http://karthik-OptiPlex-9020:4040
14/06/23 10:44:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/06/23 10:44:56 INFO client.AppClient$ClientActor: Connecting to master spark://localhost:7077...
14/06/23 10:44:56 INFO repl.SparkILoop: Created spark context..
14/06/23 10:44:56 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@localhost:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@localhost:7077]
Spark context available as sc.
scala>
14/06/23 10:44:56 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@localhost:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@localhost:7077]
14/06/23 10:44:56 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@localhost:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@localhost:7077]
14/06/23 10:44:56 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@localhost:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@localhost:7077]
14/06/23 10:45:16 INFO client.AppClient$ClientActor: Connecting to master spark://localhost:7077...
14/06/23 10:45:16 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@localhost:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@localhost:7077]
14/06/23 10:45:16 WARN client.AppClient$ClientActor: Could
Scheduling in spark
Hi, I am a postgraduate student, new to Spark. I want to understand how the Spark scheduler works. I have only a theoretical understanding of the DAG scheduler and the underlying task scheduler. I want to know: given a job submitted to the framework, how does the scheduling happen after the DAG scheduler phase? Can someone help me with how to proceed along these lines? I have some exposure to Hadoop schedulers and tools like the Mumak simulator for experiments. Can someone please tell me how to perform scheduler simulations on Spark? Thanks in advance
Hi
Hi, I have this doubt: I understand that each Java process runs on a different JVM instance. Now, if I have a single executor on my machine and run several Java processes, then there will be several JVM instances running. PROCESS_LOCAL means the data is located on the same JVM as the task that is launched. But the memory associated with the entire executor is the same. Then, how does this memory get distributed across the JVMs? I mean, how is this memory associated with multiple JVMs? Thank you!!! -karthik
StorageLevel error.
Hi,
Can someone help me with the following error:

scala> val rdd = sc.parallelize(Array(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> rdd.persist(StorageLevel.MEMORY_ONLY)
<console>:15: error: not found: value StorageLevel
              rdd.persist(StorageLevel.MEMORY_ONLY)
                          ^

Thank you!!!
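The "not found: value StorageLevel" error in the shell usually just means the StorageLevel object has not been imported. A minimal sketch of the same session with the import added (assuming a plain spark-shell, where sc is already defined):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(Array(1, 2, 3, 4))
rdd.persist(StorageLevel.MEMORY_ONLY)   // or StorageLevel.MEMORY_ONLY_2 for two in-memory replicas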
Replicate RDDs
Hi, I have a three-node Spark cluster. I restricted the resources per application by setting the appropriate parameters, and I could run two applications simultaneously. Now, I want to replicate an RDD and run two applications simultaneously. Can someone help me with how to go about doing this? I replicated an RDD of size 1354MB over this cluster. The web UI shows that it is replicated twice. But when I go to the storage details, the two partitions, each of size ~677MB, are stored on the same node. All the other nodes do not contain any partitions. Can someone tell me where I am going wrong? Thank you!! -karthik
operations on replicated RDD
Hi, An RDD replicated by an application is owned by only that application; no other application can share it. Then, what is the motive behind providing the RDD replication feature? What operations can be performed on the replicated RDD? Thank you!!! -karthik
RDDs
Hi, Can someone tell me what kind of operations can be performed on a replicated RDD? What are the use cases of a replicated RDD? One basic doubt that has been bothering me for a long time: what is the difference between an application and a job in Spark parlance? I am confused because of the Hadoop jargon. Thank you
Fwd: RDDs
-- Forwarded message --
From: rapelly kartheek kartheek.m...@gmail.com
Date: Thu, Sep 4, 2014 at 11:49 AM
Subject: Re: RDDs
To: Liu, Raymond raymond@intel.com

Thank you Raymond. I am clearer now. So, if an RDD is replicated over multiple nodes (say, two sets of nodes, since it is a collection of chunks), can we run two jobs concurrently and separately on these two sets of nodes?

On Thu, Sep 4, 2014 at 11:38 AM, Liu, Raymond raymond@intel.com wrote:
Actually, a replicated RDD and a parallel job on the same RDD are two unrelated concepts. A replicated RDD just stores its data on multiple nodes; it helps with HA and provides a better chance for data locality. It is still one RDD, not two separate RDDs. As for running two jobs on the same RDD, it doesn't matter whether the RDD is replicated or not: you can always do it if you wish to.
Best Regards,
Raymond Liu

-Original Message-
From: Kartheek.R [mailto:kartheek.m...@gmail.com]
Sent: Thursday, September 04, 2014 1:24 PM
To: u...@spark.incubator.apache.org
Subject: RE: RDDs

Thank you Raymond and Tobias. Yeah, I am very clear about what I was asking; I was talking about a replicated RDD only. Now that I've got my understanding about jobs and applications validated, I wanted to know whether we can replicate an RDD and run two jobs (that need the same RDD) of an application in parallel.
-Karthik

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p13416.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
question on replicate() in blockManager.scala
Hi,

var cachedPeers: Seq[BlockManagerId] = null
private def replicate(blockId: String, data: ByteBuffer, level: StorageLevel) {
  val tLevel = StorageLevel(level.useDisk, level.useMemory, level.deserialized, 1)
  if (cachedPeers == null) {
    cachedPeers = master.getPeers(blockManagerId, level.replication - 1)
  }
  for (peer: BlockManagerId <- cachedPeers) {
    val start = System.nanoTime
    data.rewind()
    logDebug("Try to replicate BlockId " + blockId + " once; The size of the data is " + data.limit() + " Bytes. To node: " + peer)
    // syncPutBlock ships a PutBlock message (the block bytes plus the target storage level)
    // to the peer's ConnectionManager; this is the call that actually writes the replica
    // on the remote node.
    if (!BlockManagerWorker.syncPutBlock(PutBlock(blockId, data, tLevel),
        new ConnectionManagerId(peer.host, peer.port))) {
      logError("Failed to call syncPutBlock to " + peer)
    }
    logDebug("Replicated BlockId " + blockId + " once used " + (System.nanoTime - start) / 1e6 + " s; The size of the data is " + data.limit() + " bytes.")
  }
}

I get the flow of this code, but I don't find any method being called for actually writing the data into the set of peers chosen for replication. Where exactly is the replication happening?

Thank you!!
-Karthik
replicated rdd storage problem
Hi, Whenever I replicate an RDD, I find that it gets replicated only on one node. I have a 3-node cluster. I set rdd.persist(StorageLevel.MEMORY_ONLY_2) in my application. The web UI shows that it is replicated twice, but the RDD storage details show that it is replicated only once, and only on one node. Can someone tell me where I am going wrong??? regards -Karthik
How to profile a spark application
Hi, Can someone tell me how to profile a Spark application? -Karthik
Re: How to profile a spark application
Thank you Ted. regards Karthik On Mon, Sep 8, 2014 at 3:33 PM, Ted Yu yuzhih...@gmail.com wrote: See https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Sep 8, 2014, at 2:48 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, Can someone tell me how to profile a spark application. -Karthik
Re: How to profile a spark application
hi Ted, Where do I find the licence keys that I need to copy to the licences directory. Thank you!! On Mon, Sep 8, 2014 at 8:25 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Thank you Ted. regards Karthik On Mon, Sep 8, 2014 at 3:33 PM, Ted Yu yuzhih...@gmail.com wrote: See https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Sep 8, 2014, at 2:48 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, Can someone tell me how to profile a spark application. -Karthik
compiling spark source code
Hi, Can someone please tell me how to compile the Spark source code so that my changes to the source take effect? I was trying to ship the jars to all the slaves, but in vain. -Karthik
Re: compiling spark source code
I have been doing that, but none of my modifications to the code are being reflected after compiling.

On Thu, Sep 11, 2014 at 10:45 PM, Daniil Osipov daniil.osi...@shazam.com wrote:
In the spark source folder, execute `sbt/sbt assembly`

On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com wrote:
Hi, Can someone please tell me how to compile the Spark source code so that my changes to the source take effect? I was trying to ship the jars to all the slaves, but in vain.
-Karthik
File operations on spark
Hi, I am trying to perform read/write file operations in Spark by creating a Writable object, but I am not able to write to a file. The data concerned is not an RDD. Can someone please tell me how to perform read/write file operations on non-RDD data in Spark? Regards, karthik
File I/O in spark
Hi,
I am trying to perform some read/write file operations in Spark. Somehow I am neither able to write to a file nor read.

import java.io._
val writer = new PrintWriter(new File("test.txt"))
writer.write("Hello Scala")

Can someone please tell me how to perform file I/O in Spark.
Re: File I/O in spark
Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I see that the file gets created in the master node. But, there wont be any data written to it. On Mon, Sep 15, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote: Is this code running in an executor? You need to make sure the file is accessible on ALL executors. One way to do that is to use a distributed filesystem like HDFS or GlusterFS. On Mon, Sep 15, 2014 at 8:51 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to perform some read/write file operations in spark. Somehow I am neither able to write to a file nor read. import java.io._ val writer = new PrintWriter(new File(test.txt )) writer.write(Hello Scala) Can someone please tell me how to perform file I/O in spark.
Re: File I/O in spark
The file gets created on the fly. So I dont know how to make sure that its accessible to all nodes. On Mon, Sep 15, 2014 at 10:10 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I see that the file gets created in the master node. But, there wont be any data written to it. On Mon, Sep 15, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote: Is this code running in an executor? You need to make sure the file is accessible on ALL executors. One way to do that is to use a distributed filesystem like HDFS or GlusterFS. On Mon, Sep 15, 2014 at 8:51 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to perform some read/write file operations in spark. Somehow I am neither able to write to a file nor read. import java.io._ val writer = new PrintWriter(new File(test.txt )) writer.write(Hello Scala) Can someone please tell me how to perform file I/O in spark.
Re: File I/O in spark
I came across these APIs in one the scala tutorials over the net. On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi mohitja...@gmail.com wrote: But the above APIs are not for HDFS. On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I see that the file gets created in the master node. But, there wont be any data written to it. On Mon, Sep 15, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote: Is this code running in an executor? You need to make sure the file is accessible on ALL executors. One way to do that is to use a distributed filesystem like HDFS or GlusterFS. On Mon, Sep 15, 2014 at 8:51 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to perform some read/write file operations in spark. Somehow I am neither able to write to a file nor read. import java.io._ val writer = new PrintWriter(new File(test.txt )) writer.write(Hello Scala) Can someone please tell me how to perform file I/O in spark.
Re: File I/O in spark
Can you please direct me to the right way of doing this. On Mon, Sep 15, 2014 at 10:18 PM, rapelly kartheek kartheek.m...@gmail.com wrote: I came across these APIs in one the scala tutorials over the net. On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi mohitja...@gmail.com wrote: But the above APIs are not for HDFS. On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I see that the file gets created in the master node. But, there wont be any data written to it. On Mon, Sep 15, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote: Is this code running in an executor? You need to make sure the file is accessible on ALL executors. One way to do that is to use a distributed filesystem like HDFS or GlusterFS. On Mon, Sep 15, 2014 at 8:51 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to perform some read/write file operations in spark. Somehow I am neither able to write to a file nor read. import java.io._ val writer = new PrintWriter(new File(test.txt )) writer.write(Hello Scala) Can someone please tell me how to perform file I/O in spark.
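For what it's worth, two hedged observations on the snippet above: java.io.PrintWriter buffers its output, so without writer.flush() or writer.close() the file can stay empty even locally; and in any case it writes to the local filesystem of whichever node runs the code, not to HDFS. A minimal sketch of writing a small string to HDFS with the Hadoop FileSystem API instead (the namenode URI and path here are assumptions, not from the thread):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Connect to the (assumed) HDFS namenode and create a file visible to every node.
val fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration())
val out = fs.create(new Path("/user/karthik/test.txt"))
out.writeBytes("Hello Scala")
out.close()   // closing flushes the data to HDFS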
rsync problem
Hi,
I'd made some modifications to the Spark source code on the master and propagated them to the slaves using rsync. I used this command:

rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory

This worked perfectly. But I wanted to rsync to all the slaves simultaneously, so I appended the other slaves like this:

rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory username@slave2:path username@slave3:path

and so on. But this didn't work. Anyway, for now, I did it individually for each node. Can someone give me the right syntax?

Secondly, after this rsync, I find that my cluster has become tremendously slow!!! Sometimes the cluster just shuts down and no job execution happens. Can someone throw some light on this aspect?

thank you
Karthik
Re: rsync problem
Hi Tobias, I've copied the files from master to all the slaves. On Fri, Sep 19, 2014 at 1:37 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: This worked perfectly. But, I wanted to simultaneously rsync all the slaves. So, added the other slaves as following: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname :path/to/destdirectory username@slave2:path username@slave3:path and so on. The rsync man page says rsync [OPTION...] SRC... [USER@]HOST:DEST so as I understand your command, you have copied a lot of files from various hosts to username@slave3:path. I don't think rsync can copy to various locations at once. Tobias
Re: rsync problem
Regarding "you have copied a lot of files from various hosts to username@slave3:path": no, only from one node to all the other nodes...

On Fri, Sep 19, 2014 at 1:45 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
Hi Tobias, I've copied the files from the master to all the slaves.

On Fri, Sep 19, 2014 at 1:37 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
This worked perfectly. But I wanted to rsync to all the slaves simultaneously, so I added the other slaves as follows: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory username@slave2:path username@slave3:path and so on.
The rsync man page says rsync [OPTION...] SRC... [USER@]HOST:DEST, so as I understand your command, you have copied a lot of files from various hosts to username@slave3:path. I don't think rsync can copy to various locations at once.
Tobias
Fwd: rsync problem
-- Forwarded message -- From: rapelly kartheek kartheek.m...@gmail.com Date: Fri, Sep 19, 2014 at 1:51 PM Subject: Re: rsync problem To: Tobias Pfeiffer t...@preferred.jp any idea why the cluster is dying down??? On Fri, Sep 19, 2014 at 1:47 PM, rapelly kartheek kartheek.m...@gmail.com wrote: , * you have copied a lot of files from various hosts to username@slave3:path* only from one node to all the other nodes... On Fri, Sep 19, 2014 at 1:45 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi Tobias, I've copied the files from master to all the slaves. On Fri, Sep 19, 2014 at 1:37 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Sep 19, 2014 at 5:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: This worked perfectly. But, I wanted to simultaneously rsync all the slaves. So, added the other slaves as following: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname :path/to/destdirectory username@slave2:path username@slave3:path and so on. The rsync man page says rsync [OPTION...] SRC... [USER@]HOST:DEST so as I understand your command, you have copied a lot of files from various hosts to username@slave3:path. I don't think rsync can copy to various locations at once. Tobias
Re: rsync problem
Thank you Soumya Simantha and Tobias. I've deleted the contents of the work folder in all the nodes. Now its working perfectly as it was before. Thank you Karthik On Fri, Sep 19, 2014 at 4:46 PM, Soumya Simanta soumya.sima...@gmail.com wrote: One possible reason is maybe that the checkpointing directory $SPARK_HOME/work is rsynced as well. Try emptying the contents of the work folder on each node and try again. On Fri, Sep 19, 2014 at 4:53 AM, rapelly kartheek kartheek.m...@gmail.com wrote: I * followed this command:rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:* *path/to/destdirectory. Anyway, for now, I did it individually for each node.* I have copied to each node at a time individually using the above command. So, I guess the copying may not contain any mixture of files. Also, as of now, I am not facing any MethodNotFound exceptions. But, there is no job execution taking place. After sometime, one by one, each goes down and the cluster shuts down. On Fri, Sep 19, 2014 at 2:15 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Sep 19, 2014 at 5:17 PM, rapelly kartheek kartheek.m...@gmail.com wrote: , * you have copied a lot of files from various hosts to username@slave3:path* only from one node to all the other nodes... I don't think rsync can do that in one command as you described. My guess is that now you have a wild mixture of jar files all across your cluster which will lead to fancy exceptions like MethodNotFound etc., that's maybe why your cluster is not working correctly. Tobias
Re: rsync problem
Hi,
This is the command I am using for submitting my application, SimpleApp:

./bin/spark-submit --class org.apache.spark.examples.SimpleApp --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar /text-data

On Thu, Sep 25, 2014 at 6:52 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi, I assume you unintentionally did not reply to the list, so I'm adding it back to CC. How do you submit your job to the cluster?
Tobias

On Thu, Sep 25, 2014 at 2:21 AM, rapelly kartheek kartheek.m...@gmail.com wrote:
How do I find out whether a node in the cluster is a master or a slave? Till now I was thinking that the slaves file under the conf folder makes the difference, along with SPARK_MASTER_IP in the spark-env.sh file. What else differentiates a slave from the master?

On Wed, Sep 24, 2014 at 10:46 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
The job execution is taking place perfectly. Previously, all my print statements used to be stored in the spark/work/*/stdout file. But now, after doing the rsync, I find that none of the print statements are getting reflected in the stdout file under the work folder. When I go to the code, I find the statements in the code, but they are not reflected in the stdout file as before. Can you please tell me where I went wrong? All I want is to see my modification in the code getting reflected in the output.

On Wed, Sep 24, 2014 at 10:22 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
Hi, I have a very important and fundamental doubt: I have rsynced the entire spark folder from the master to all slaves in the cluster. When I execute a job, it works perfectly. But when I rsync the entire spark folder of the master to all the slaves, am I not sending the master configurations to all the slaves and making the slaves behave like the master? First of all, is it correct to rsync the entire spark folder? And if I change only one file, then how do I rsync it to all?

On Fri, Sep 19, 2014 at 8:44 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
Thank you Soumya Simanta and Tobias. I've deleted the contents of the work folder on all the nodes. Now it is working perfectly, as it was before.
Thank you
Karthik

On Fri, Sep 19, 2014 at 4:46 PM, Soumya Simanta soumya.sima...@gmail.com wrote:
One possible reason is maybe that the checkpointing directory $SPARK_HOME/work is rsynced as well. Try emptying the contents of the work folder on each node and try again.

On Fri, Sep 19, 2014 at 4:53 AM, rapelly kartheek kartheek.m...@gmail.com wrote:
"I followed this command: rsync -avL --progress path/to/spark-1.0.0 username@destinationhostname:path/to/destdirectory. Anyway, for now, I did it individually for each node."
I have copied to each node individually, one at a time, using the above command. So I guess the copying may not contain any mixture of files. Also, as of now, I am not facing any MethodNotFound exceptions. But there is no job execution taking place. After some time, one by one, each node goes down and the cluster shuts down.

On Fri, Sep 19, 2014 at 2:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Fri, Sep 19, 2014 at 5:17 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
Regarding "you have copied a lot of files from various hosts to username@slave3:path": only from one node to all the other nodes...
I don't think rsync can do that in one command as you described.
My guess is that now you have a wild mixture of jar files all across your cluster which will lead to fancy exceptions like MethodNotFound etc., that's maybe why your cluster is not working correctly. Tobias
Rdd repartitioning
Hi,
I was facing GC overhead errors while executing an application with 570MB of data (with RDD replication). In order to fix the heap errors, I repartitioned the RDD to 10 partitions:

val logData = sc.textFile("hdfs:/text_data/text data.txt").persist(StorageLevel.MEMORY_ONLY_2)
val parts = logData.coalesce(10, true)
println(parts.partitions.length)

But the problem is, the web UI still shows the number of partitions as 5, while the print statement outputs 10. I even tried repartition(), but I face the same problem. Also, does the web UI show the storage details of each partition twice when I replicate the RDD? Because I see that the web UI displays each partition only once while it says 2x replicated. Can someone help me out with this!!! -Karthik
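One thing that may explain the web UI showing 5 partitions: coalesce and repartition return a new RDD, and the RDD that was persisted (and is therefore the one described on the storage page) is logData, which still has its original partitioning. A hedged sketch that persists the repartitioned RDD instead (same path as above; sc is assumed to be the application's SparkContext):

import org.apache.spark.storage.StorageLevel

val logData = sc.textFile("hdfs:/text_data/text data.txt")
// repartition(10) is equivalent to coalesce(10, shuffle = true)
val parts = logData.repartition(10).persist(StorageLevel.MEMORY_ONLY_2)
println(parts.partitions.length)   // 10; the storage tab should now describe this RDD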
How to convert a non-rdd data to rdd.
Hi,
I am trying to write a String that is not an RDD to HDFS. This data is a variable in the Spark scheduler code. None of the Spark file operations are working because my data is not an RDD. So, I tried using SparkContext.parallelize(data). But it throws an error:

[error] /home/karthik/spark-1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala:265: not found: value SparkContext
[error] SparkContext.parallelize(result)
[error] ^
[error] one error found

I realized that this data is part of the scheduler, so the SparkContext would not have been created yet. Any help in writing scheduler variable data to HDFS is appreciated!!
-Karthik
Re: How to convert a non-rdd data to rdd.
It's a variable in the spark-1.0.0/*/storage/BlockManagerMaster.scala class: the data returned by the askDriverWithReply() method for the getPeers() request. Basically, it is a Seq[ArrayBuffer]: ArraySeq(ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s1, 34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s2, 34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s2, 34625, 0)), ArrayBuffer(BlockManagerId(1, s1, 47006, 0), BlockManagerId(0, s2, 34625, 0)), ArrayBuffer(BlockManagerId(driver, karthik, 51051, 0), BlockManagerId(1, s1, 47006, 0)))

On Sun, Oct 12, 2014 at 12:59 PM, @Sanjiv Singh [via Apache Spark User List] ml-node+s1001560n16231...@n3.nabble.com wrote:
Hi Karthik, Can you provide us more detail on the dataset 'data' that you wanted to parallelize with SparkContext.parallelize(data)?
Regards, Sanjiv Singh
Mob : +091 9990-447-339

On Sun, Oct 12, 2014 at 11:45 AM, rapelly kartheek [hidden email] wrote:
Hi, I am trying to write a String that is not an RDD to HDFS. This data is a variable in the Spark scheduler code. None of the Spark file operations are working because my data is not an RDD. So, I tried using SparkContext.parallelize(data). But it throws an error: [error] /home/karthik/spark-1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala:265: not found: value SparkContext [error] SparkContext.parallelize(result) [error] ^ [error] one error found. I realized that this data is part of the scheduler, so the SparkContext would not have been created yet. Any help in writing scheduler variable data to HDFS is appreciated!!
-Karthik

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-convert-a-non-rdd-data-to-rdd-tp16230p16231.html
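A hedged aside on the compile error itself: parallelize is an instance method on SparkContext, not a companion-object method, so SparkContext.parallelize(result) cannot compile anywhere; it needs a live context, which does not exist inside BlockManagerMaster. A minimal sketch of what the call looks like when a context is available (the app name, placeholder data, and output path here are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// assumes the master URL is supplied externally, e.g. by spark-submit
val sc = new SparkContext(new SparkConf().setAppName("peer-dump"))
// stand-in for the stringified Seq[ArrayBuffer[BlockManagerId]] described above
val data = Seq("BlockManagerId(1, s1, 47006, 0)", "BlockManagerId(0, s1, 34625, 0)")
sc.parallelize(data).saveAsTextFile("hdfs://localhost:9000/tmp/peer-dump")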
Rdd replication
Hi, I am trying to understand the RDD replication code. In the process, I frequently execute one Spark application whenever I make a change to the code, to see the effect. My problem is that after a set of repeated executions of the same application, I find that my cluster behaves unusually. Ideally, when I replicate an RDD twice, the web UI displays each partition twice in the RDD storage info tab. But sometimes I find that it displays each partition only once. Also, when it is replicated only once, each partition gets displayed twice. This happens frequently. Can someone throw some light in this regard?
Read a HDFS file from Spark source code
Hi, I am trying to access a file in HDFS from the Spark source code. Basically, I am tweaking the Spark source code and need to access a file in HDFS from within it. I am really not able to figure out how to go about doing this. Can someone please help me out in this regard. Thank you!! Karthik
Re: Read a HDFS file from Spark source code
Hi Sean, I was following this link; http://mund-consulting.com/Blog/Posts/file-operations-in-HDFS-using-java.aspx But, I was facing FileSystem ambiguity error. I really don't have any idea as to how to go about doing this. Can you please help me how to start off with this? On Wed, Nov 12, 2014 at 11:26 AM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: Instead of a file path, use a HDFS URI. For example: (In Python) data = sc.textFile(hdfs://localhost/user/someuser/data) On Wed, Nov 12, 2014 at 10:12 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to access a file in HDFS from spark source code. Basically, I am tweaking the spark source code. I need to access a file in HDFS from the source code of the spark. I am really not understanding how to go about doing this. Can someone please help me out in this regard. Thank you!! Karthik
Read a HDFS file from Spark using HDFS API
Hi, I am trying to read an HDFS file from the Spark scheduler code. I could find how to do HDFS reads/writes in Java, but I need to access HDFS from Spark using Scala. Can someone please help me in this regard.
Re: Read a HDFS file from Spark using HDFS API
I'll just try out with object Akhil provided. There was no problem working in shell with sc.textFile. Thank you Akhil and Tri. On Fri, Nov 14, 2014 at 9:21 PM, Akhil Das ak...@sigmoidanalytics.com wrote: [image: Inline image 1] Thanks Best Regards On Fri, Nov 14, 2014 at 9:18 PM, Bui, Tri tri@verizonwireless.com.invalid wrote: It should be val file = sc.textFile(hdfs:///localhost:9000/sigmoid/input.txt) 3 “///” Thanks Tri *From:* rapelly kartheek [mailto:kartheek.m...@gmail.com] *Sent:* Friday, November 14, 2014 9:42 AM *To:* Akhil Das; user@spark.apache.org *Subject:* Re: Read a HDFS file from Spark using HDFS API No. I am not accessing hdfs from either shell or a spark application. I want to access from spark Scheduler code. I face an error when I use sc.textFile() as SparkContext wouldn't have been created yet. So, error says: sc not found. On Fri, Nov 14, 2014 at 9:07 PM, Akhil Das ak...@sigmoidanalytics.com wrote: like this? val file = sc.textFile(hdfs://localhost:9000/sigmoid/input.txt) Thanks Best Regards On Fri, Nov 14, 2014 at 9:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I am trying to read a HDFS file from Spark scheduler code. I could find how to write hdfs read/writes in java. But I need to access hdfs from spark using scala. Can someone please help me in this regard.
Re: Read a HDFS file from Spark using HDFS API
Hi Akhil, I face error: not found : value URI On Fri, Nov 14, 2014 at 9:29 PM, rapelly kartheek kartheek.m...@gmail.com wrote: I'll just try out with object Akhil provided. There was no problem working in shell with sc.textFile. Thank you Akhil and Tri. On Fri, Nov 14, 2014 at 9:21 PM, Akhil Das ak...@sigmoidanalytics.com wrote: [image: Inline image 1] Thanks Best Regards On Fri, Nov 14, 2014 at 9:18 PM, Bui, Tri tri@verizonwireless.com.invalid wrote: It should be val file = sc.textFile(hdfs:///localhost:9000/sigmoid/input.txt) 3 “///” Thanks Tri *From:* rapelly kartheek [mailto:kartheek.m...@gmail.com] *Sent:* Friday, November 14, 2014 9:42 AM *To:* Akhil Das; user@spark.apache.org *Subject:* Re: Read a HDFS file from Spark using HDFS API No. I am not accessing hdfs from either shell or a spark application. I want to access from spark Scheduler code. I face an error when I use sc.textFile() as SparkContext wouldn't have been created yet. So, error says: sc not found. On Fri, Nov 14, 2014 at 9:07 PM, Akhil Das ak...@sigmoidanalytics.com wrote: like this? val file = sc.textFile(hdfs://localhost:9000/sigmoid/input.txt) Thanks Best Regards On Fri, Nov 14, 2014 at 9:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I am trying to read a HDFS file from Spark scheduler code. I could find how to write hdfs read/writes in java. But I need to access hdfs from spark using scala. Can someone please help me in this regard.
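The "not found: value URI" error suggests a missing import java.net.URI. For reading an HDFS file from scheduler code, where no SparkContext exists yet, a heavily hedged sketch using the Hadoop FileSystem API directly (the namenode URI and path are assumptions taken from the example in this thread):

import java.net.URI
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration())
val in = fs.open(new Path("/sigmoid/input.txt"))   // returns an FSDataInputStream
Source.fromInputStream(in).getLines().foreach(println)
in.close()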
How to access application name in the spark framework code.
Hi,
When I submit a Spark application like this:

./bin/spark-submit --class org.apache.spark.examples.SparkKMeans --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar /k-means 4 0.001

which part of the Spark framework code deals with the name of the application? Basically, I want to access the name of the application in the Spark scheduler code. Can someone please tell me where I should look for the code that deals with the name of the currently executing application (say, SparkKMeans)? Thank you.
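A hedged pointer: the application name (set by setAppName inside the application, or by spark-submit's --name option) is stored in the application's SparkConf under the key spark.app.name, and SparkContext.appName reads it from there. So scheduler-side code that already has a SparkConf in scope could read it as in the sketch below; the function name and the idea that a SparkConf is available at that point are assumptions.

import org.apache.spark.SparkConf

// `conf` stands for whatever SparkConf instance the surrounding scheduler class holds.
def currentAppName(conf: SparkConf): String =
  conf.get("spark.app.name", "<unknown>")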
[no subject]
Hi,
I've been fiddling with the spark/*/storage/BlockManagerMasterActor.getPeers() definition, in the context of blockManagerMaster.askDriverWithReply() sending a GetPeers() request.

1) I couldn't understand what 'selfIndex' is used for.

2) Also, I tried modifying the 'peers' array by eliminating some BlockManagerIds and passed the modified array to the tabulate method. The application gets executed, but I find that blockManagerMaster.askDriverWithReply() receives a sequence of BlockManagerIds that includes the ones I had eliminated. For example, my original 'peers' array contained 5 BlockManagerIds: BlockManagerId(2, s2, 39997, 0), BlockManagerId(1, s4, 35874, 0), BlockManagerId(3, s1, 33738, 0), BlockManagerId(0, s3, 38207, 0), BlockManagerId(driver, karthik, 34388, 0). I modified it to peers1, having 3 BlockManagerIds: BlockManagerId(2, s2, 39997, 0), BlockManagerId(1, s4, 35874, 0), BlockManagerId(3, s1, 33738, 0). Then I passed this modified peers1 array for the sequence conversion:

Array.tabulate[BlockManagerId](size) { i => peers1((selfIndex + i + 1) % peers1.length) }.toSeq

But when /storage/blockManagerMaster.askDriverWithReply() gets the result, it contains the BlockManagerIds that I eliminated on purpose. Can someone please make me understand how this Seq[BlockManagerId] is constructed? Thank you!
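On question (1), here is a hedged reconstruction (paraphrased from memory of the 1.0 code base, not verbatim) of what getPeers in BlockManagerMasterActor roughly does. selfIndex is simply the position of the requesting block manager's own id in the peers array, and the tabulate starts at selfIndex + 1 so that a node never picks itself as a replication target:

private def getPeers(blockManagerId: BlockManagerId, size: Int): Seq[BlockManagerId] = {
  val peers: Array[BlockManagerId] = blockManagerInfo.keySet.toArray
  val selfIndex = peers.indexOf(blockManagerId)   // where the caller itself sits in the array
  if (selfIndex == -1) {
    throw new SparkException("Self index for " + blockManagerId + " not found")
  }
  // return `size` peers, walking the array circularly starting just after the caller
  Array.tabulate[BlockManagerId](size) { i => peers((selfIndex + i + 1) % peers.length) }.toSeq
}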
java.io.IOException: Filesystem closed
Hi,
I face the following exception when I submit a Spark application. The log file shows:

14/12/02 11:52:58 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689)
    at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1668)
    at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1629)
    at org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1614)
    at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:120)
    at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:158)
    at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:158)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.util.FileLogger.flush(FileLogger.scala:158)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:87)
    at org.apache.spark.scheduler.EventLoggingListener.onJobEnd(EventLoggingListener.scala:112)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:52)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:52)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
    at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
    at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:52)
    at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
    at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)

Someone please help me resolve this!!
Thanks
Re: java.io.IOException: Filesystem closed
Sorry for the delayed response. Please find my application attached. On Tue, Dec 2, 2014 at 12:04 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What is the application that you are submitting? Looks like you might have invoked fs inside the app and then closed it within it. Thanks Best Regards On Tue, Dec 2, 2014 at 11:59 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I face the following exception when submit a spark application. The log file shows: 14/12/02 11:52:58 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1668) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1629) at org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1614) at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:120) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:158) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:158) at scala.Option.foreach(Option.scala:236) at org.apache.spark.util.FileLogger.flush(FileLogger.scala:158) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:87) at org.apache.spark.scheduler.EventLoggingListener.onJobEnd(EventLoggingListener.scala:112) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:52) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:52) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:52) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) Someone please help me resolve this!! Thanks SimpleApp001.scala Description: Binary data - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: java.io.IOException: Filesystem closed
But, somehow, if I run this application for the second time, I find that the application gets executed and the results are out regardless of the same errors in logs. On Tue, Dec 2, 2014 at 2:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Your code seems to have a lot of threads and i think you might be invoking sc.stop before those threads get finished. Thanks Best Regards On Tue, Dec 2, 2014 at 12:04 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What is the application that you are submitting? Looks like you might have invoked fs inside the app and then closed it within it. Thanks Best Regards On Tue, Dec 2, 2014 at 11:59 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I face the following exception when submit a spark application. The log file shows: 14/12/02 11:52:58 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1668) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1629) at org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1614) at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:120) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:158) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:158) at scala.Option.foreach(Option.scala:236) at org.apache.spark.util.FileLogger.flush(FileLogger.scala:158) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:87) at org.apache.spark.scheduler.EventLoggingListener.onJobEnd(EventLoggingListener.scala:112) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:52) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:52) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:52) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) Someone please help me resolve this!! Thanks
Re: java.io.IOException: Filesystem closed
Does the sparkContext shuts down itself by default even if I dont mention specifically in my code?? Because, I ran the application without sc.context(), still I get file system closed error along with correct output. On Tue, Dec 2, 2014 at 2:20 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It could be because those threads are finishing quickly. Thanks Best Regards On Tue, Dec 2, 2014 at 2:19 PM, rapelly kartheek kartheek.m...@gmail.com wrote: But, somehow, if I run this application for the second time, I find that the application gets executed and the results are out regardless of the same errors in logs. On Tue, Dec 2, 2014 at 2:08 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Your code seems to have a lot of threads and i think you might be invoking sc.stop before those threads get finished. Thanks Best Regards On Tue, Dec 2, 2014 at 12:04 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What is the application that you are submitting? Looks like you might have invoked fs inside the app and then closed it within it. Thanks Best Regards On Tue, Dec 2, 2014 at 11:59 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I face the following exception when submit a spark application. The log file shows: 14/12/02 11:52:58 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1668) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1629) at org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1614) at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:120) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:158) at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:158) at scala.Option.foreach(Option.scala:236) at org.apache.spark.util.FileLogger.flush(FileLogger.scala:158) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:87) at org.apache.spark.scheduler.EventLoggingListener.onJobEnd(EventLoggingListener.scala:112) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:52) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:52) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:52) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) Someone please help me resolve this!! Thanks
Necessity for rdd replication.
Hi, I was just thinking about the necessity for RDD replication. One category could be something like a large number of threads requiring the same RDD. Even though a single RDD can be shared by multiple threads belonging to the same application, I believe we can extract better parallelism if the RDD is replicated; am I right? I am eager to know if there are any real-life applications or any other scenarios which force an RDD to be replicated. Can someone please throw some light on the necessity for RDD replication. Thank you
Profiling a spark application.
Hi, I want to find the time taken for replicating an RDD in a Spark cluster, along with the computation time on the replicated RDD. Can someone please suggest some ideas? Thank you
Storage Locations of an rdd
Hi, I need to find the storage locations (node IDs) of each partition of a replicated RDD in Spark. I mean, if an RDD is replicated twice, I want to find, for each partition, the two nodes where it is stored. The Spark web UI has a page that depicts the data distribution of each RDD, but I cannot really make out what it displays. Can someone please throw some light in this regard? Thank you Karthik
Storage Locations of an rdd
Hi, I need to find the storage locations (node IDs) of each partition of a replicated RDD in Spark. I mean, if an RDD is replicated twice, I want to find, for each partition, the two nodes where it is stored. The Spark web UI has a page that depicts the data distribution of each RDD, but I need to know the first and second location of each partition of the replicated RDD. Can someone please throw some light in this regard? Thank you Karthik
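For completeness, a heavily hedged sketch of one way to list per-partition block locations from the driver, using internal Spark 1.x classes (RDDBlockId, BlockManagerMaster); these are not public, stable interfaces, and the names are assumptions based on that code base. `rdd` stands for an RDD that has already been persisted and materialized (e.g. with persist(...) followed by count()).

import org.apache.spark.SparkEnv
import org.apache.spark.storage.RDDBlockId

val master = SparkEnv.get.blockManager.master
(0 until rdd.partitions.length).foreach { i =>
  // one BlockManagerId per stored replica of this partition
  val locs = master.getLocations(RDDBlockId(rdd.id, i))
  println("partition " + i + " -> " + locs.mkString(", "))
}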
Spark profiler
Hi, I want to find the time taken for replicating an RDD in a Spark cluster, along with the computation time on the replicated RDD. Can someone please suggest a suitable Spark profiler? Thank you
NullPointerException
Hi,
I get the following exception when I submit a Spark application that calculates the frequency of characters in a file. Especially when I increase the size of the data, I face this problem.

Exception in thread Thread-47 org.apache.spark.SparkException: Job aborted due to stage failure: Task 11.0:10 failed 4 times, most recent failure: Exception failure in TID 295 on host s1: java.lang.NullPointerException
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any help? Thank you!
Re: NullPointerException
spark-1.0.0 On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates the frequency of characters in a file. Especially, when I increase the size of data, I face this problem. Exception in thread Thread-47 org.apache.spark.SparkException: Job aborted due to stage failure: Task 11.0:10 failed 4 times, most recent failure: Exception failure in TID 295 on host s1: java.lang.NullPointerException org.apache.spark.storage.BlockManager.org $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786) org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752) org.apache.spark.storage.BlockManager.put(BlockManager.scala:574) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Any help? Thank you!
Fwd: NullPointerException
-- Forwarded message --
From: rapelly kartheek kartheek.m...@gmail.com
Date: Thu, Jan 1, 2015 at 12:05 PM
Subject: Re: NullPointerException
To: Josh Rosen rosenvi...@gmail.com, user@spark.apache.org

spark-1.0.0

On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote:
Which version of Spark are you using?

On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
Hi, I get the following exception when I submit a Spark application that calculates the frequency of characters in a file. In particular, it shows up when I increase the size of the data.
Re: NullPointerException
Ok. Let me try it out on a newer version. Thank you!!

On Thu, Jan 1, 2015 at 12:17 PM, Josh Rosen rosenvi...@gmail.com wrote:
It looks like 'null' might be selected as a block replication peer?
https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L786
I know that we fixed some replication bugs in newer versions of Spark (such as https://github.com/apache/spark/pull/2366), so it's possible that this issue would be resolved by updating. Can you try re-running your job with a newer Spark version to see whether you still see the same error?

On Wed, Dec 31, 2014 at 10:35 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
spark-1.0.0

On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote:
Which version of Spark are you using?

On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote:
Hi, I get the following exception when I submit a Spark application that calculates the frequency of characters in a file. In particular, it shows up when I increase the size of the data.
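If upgrading is not immediately possible, one workaround to consider, suggested here as an assumption drawn from the stack trace rather than anything confirmed in the thread, is to cache without replication so that BlockManager.replicate is never invoked. A minimal sketch (hypothetical input path):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object NoReplicationWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NoReplicationWorkaround"))

    // Hypothetical path, for illustration only.
    val data = sc.textFile("hdfs:///user/karthik/input.txt")

    // MEMORY_ONLY has replication factor 1 and so never enters the
    // BlockManager.replicate() code path, unlike MEMORY_ONLY_2.
    val cached = data.persist(StorageLevel.MEMORY_ONLY)
    println(cached.count())

    sc.stop()
  }
}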
Spark-1.2.0 build error
Hi, I get the following error when I build Spark using sbt:

[error] Nonzero exit code (128): git clone https://github.com/ScrapCodes/sbt-pom-reader.git /home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader
[error] Use 'last' for the full log.

Any help please?
Re: UnknownhostException : home
Yes yes.. the hadoop/etc/hadoop/hdfs-site.xml file has a path like: hdfs://home/...

On Mon, Jan 19, 2015 at 3:21 PM, Sean Owen so...@cloudera.com wrote:
I bet somewhere you have a path like hdfs://home/... which would suggest that 'home' is a hostname, when I imagine you mean it as a root directory.

On Mon, Jan 19, 2015 at 9:33 AM, Rapelly Kartheek kartheek.m...@gmail.com wrote:
Hi, I get the following exception when I run my application:

karthik@karthik:~/spark-1.2.0$ ./bin/spark-submit --class org.apache.spark.examples.SimpleApp001 --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar out1.txt
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: home
UnknownhostException : home
Hi, I get the following exception when I run my application:

karthik@karthik:~/spark-1.2.0$ ./bin/spark-submit --class org.apache.spark.examples.SimpleApp001 --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar out1.txt
log4j:WARN No such property [target] in org.apache.log4j.FileAppender.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: home
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:237)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:141)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:569)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:512)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:142)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:366)
    at org.apache.spark.util.FileLogger.<init>(FileLogger.scala:90)
    at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:63)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:352)
    at org.apache.spark.examples.SimpleApp001$.main(SimpleApp001.scala:13)
    at org.apache.spark.examples.SimpleApp001.main(SimpleApp001.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: home
    ... 20 more

I couldn't trace the cause of this exception. Any help in this regard? Thanks
Re: UnknownhostException : home
Actually, I don't have any entry in my /etc/hosts file with the hostname "home". In fact, I didn't use this hostname anywhere. Then why is it trying to resolve it?

On Mon, Jan 19, 2015 at 3:15 PM, Ashish paliwalash...@gmail.com wrote:
It's not able to resolve "home" to an IP. Assuming it's your local machine, add an entry like the following to your /etc/hosts file (use sudo to edit the file) and then run the program again:

127.0.0.1 home

On Mon, Jan 19, 2015 at 3:03 PM, Rapelly Kartheek kartheek.m...@gmail.com wrote:
Hi, I get the following exception when I run my application:

karthik@karthik:~/spark-1.2.0$ ./bin/spark-submit --class org.apache.spark.examples.SimpleApp001 --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar out1.txt
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: home

--
thanks
ashish
Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
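As a small illustration of what "not able to resolve home to an IP" means (this snippet is not from the thread, just a local check), resolving the literal hostname "home" fails with the same UnknownHostException unless something like the /etc/hosts entry above maps it:

import java.net.InetAddress

object ResolveCheck {
  def main(args: Array[String]): Unit = {
    // Throws java.net.UnknownHostException unless "home" is mapped,
    // e.g. via an /etc/hosts line such as "127.0.0.1 home".
    println(InetAddress.getByName("home").getHostAddress)
  }
}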
Re: UnknownhostException : home
Yeah... I made that mistake in spark/conf/spark-defaults.conf when setting spark.eventLog.dir. Now it works. Thank you!
Karthik

On Mon, Jan 19, 2015 at 3:29 PM, Sean Owen so...@cloudera.com wrote:
Sorry, to be clear, you need to write hdfs:///home/ Note three slashes; there is an empty host between the 2nd and 3rd. This is true of most URI schemes with a host.

On Mon, Jan 19, 2015 at 9:56 AM, Rapelly Kartheek kartheek.m...@gmail.com wrote:
Yes yes.. the hadoop/etc/hadoop/hdfs-site.xml file has a path like: hdfs://home/...

On Mon, Jan 19, 2015 at 3:21 PM, Sean Owen so...@cloudera.com wrote:
I bet somewhere you have a path like hdfs://home/... which would suggest that 'home' is a hostname, when I imagine you mean it as a root directory.

On Mon, Jan 19, 2015 at 9:33 AM, Rapelly Kartheek kartheek.m...@gmail.com wrote:
Hi, I get the following exception when I run my application:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: home
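To see why the two-slash form treats "home" as a hostname while the three-slash form treats it as part of the path, a quick check with java.net.URI behaves as follows. This is an illustration only; the paths and the spark-defaults.conf line in the final comment are hypothetical, not taken from the thread.

import java.net.URI

object HdfsUriCheck {
  def main(args: Array[String]): Unit = {
    val twoSlashes   = new URI("hdfs://home/karthik/spark-events")
    val threeSlashes = new URI("hdfs:///home/karthik/spark-events")

    // With two slashes, "home" is parsed as the authority (hostname),
    // which is what Hadoop then tries, and fails, to resolve.
    println(s"host=${twoSlashes.getHost}, path=${twoSlashes.getPath}")     // host=home, path=/karthik/spark-events
    println(s"host=${threeSlashes.getHost}, path=${threeSlashes.getPath}") // host=null, path=/home/karthik/spark-events

    // A corrected spark-defaults.conf entry would therefore look like
    // (hypothetical path):
    //   spark.eventLog.dir hdfs:///home/karthik/spark-events
  }
}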
Re: Problem with building spark-1.2.0
Yes, this proxy problem is resolved.

Regarding "how your build refers to https://github.com/ScrapCodes/sbt-pom-reader.git - I don't see this repo in the project code base": I manually downloaded the sbt-pom-reader directory and moved it into the .sbt/0.13/staging/*/ directory. But now I face the following:

karthik@s4:~/spark-1.2.0$ SPARK_HADOOP_VERSION = 2.3.0 sbt/sbt assembly
Using /usr/lib/jvm/java-7-oracle as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /home/karthik/spark-1.2.0/project/project
[info] Loading project definition from /home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Updating {file:/home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project/}sbt-pom-reader-build...
[info] Resolving com.typesafe.sbt#sbt-ghpages;0.5.2 ...

Could you please tell me how to build stand-alone spark-1.2.0 with sbt correctly?

On Mon, Jan 12, 2015 at 4:21 PM, Sean Owen so...@cloudera.com wrote:
The problem is there in the logs. When it went to clone some code, something went wrong with the proxy:

Received HTTP code 407 from proxy after CONNECT

Probably you have an HTTP proxy and you have not authenticated. It's specific to your environment. Although it's unrelated, I'm curious how your build refers to https://github.com/ScrapCodes/sbt-pom-reader.git - I don't see this repo in the project code base.

On Mon, Jan 12, 2015 at 9:09 AM, Kartheek.R kartheek.m...@gmail.com wrote:
Hi, This is what I am trying to do:

karthik@s4:~/spark-1.2.0$ SPARK_HADOOP_VERSION=2.3.0 sbt/sbt clean
Using /usr/lib/jvm/java-7-oracle as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from /home/karthik/spark-1.2.0/project/project
Cloning into '/home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader'...
fatal: unable to access 'https://github.com/ScrapCodes/sbt-pom-reader.git/': Received HTTP code 407 from proxy after CONNECT
java.lang.RuntimeException: Nonzero exit code (128): git clone https://github.com/ScrapCodes/sbt-pom-reader.git
Re: Problem with building spark-1.2.0
yeah.. but none of the sites open.

On Sun, Jan 4, 2015 at 10:35 PM, Ted Yu yuzhih...@gmail.com wrote:
Have you used Google to find some way of accessing github :-)

On Jan 4, 2015, at 8:46 AM, Kartheek.R kartheek.m...@gmail.com wrote:
The problem is that my network is not able to access github.com for cloning some dependencies, as github is blocked in India. What are the other possible ways around this problem? Thank you!

On Sun, Jan 4, 2015 at 9:45 PM, Rapelly Kartheek wrote:
Hi, I get the following error when I build spark-1.2.0 using sbt:

[error] Nonzero exit code (128): git clone https://github.com/ScrapCodes/sbt-pom-reader.git /home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader
[error] Use 'last' for the full log.

Any help please? Thanks