Re: Using spark streaming to load data from Kafka to HDFS
Last time I checked, Camus doesn't support storing data as Parquet, which is a deal breaker for me. Otherwise it works well for my Kafka topics with low data volume. I am currently using Spark Streaming to ingest data, generate semi-realtime stats, publish them to a dashboard, and dump the full dataset into HDFS as Parquet at a longer interval. One problem is that storing Parquet is sometimes time consuming, and that causes delays in my regular stats-generating tasks. I am thinking of splitting my streaming job into two, one for Parquet output and one for stats generation, but obviously this would consume the data from Kafka twice. -Simon

On Wednesday, May 6, 2015, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Because using Spark Streaming looks a lot simpler. What's the difference between Camus and Spark Streaming for this case? Why does Camus excel? Rendy

On Wed, May 6, 2015 at 2:15 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Also, Kafka has a Hadoop consumer API for doing such things; please refer to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi

2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com: Why not try https://github.com/linkedin/camus - Camus is a Kafka to HDFS pipeline.

On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use Spark Streaming to load data from Kafka to HDFS? What are the concerns in doing this? There is no processing to be done by Spark, only storing data from Kafka into HDFS for storage and further Spark processing. Rendy
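Simon's worry about consuming the topic twice can be avoided by fanning a single consumer out to both sinks. Here is a minimal pure-Python sketch of that idea (no Spark or Kafka involved; the record format, `consume_once`, and both sink names are invented for illustration):

```python
from collections import deque

def consume_once(records, handlers):
    # feed each record from a single source to every handler,
    # so the source (e.g. a Kafka topic) is read only one time
    for rec in records:
        for h in handlers:
            h(rec)

stats = deque()   # fast path: feeds the dashboard
archive = []      # slow path: buffer for the bulk Parquet dump

consume_once(
    ["a,1", "b,2", "a,3"],                      # made-up records
    [lambda r: stats.append(r.split(",")[0]),   # stats sink
     archive.append],                           # archival sink
)
```

The trade-off is that a slow archival sink can now back-pressure the stats path, which is exactly the coupling Simon observed; decoupling them with a buffer between the two handlers is one way out that still reads Kafka once.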
Re: access hdfs file name in map()
Hi Roberto, Ultimately, the info you need is set here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69 Being a Spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as HadoopRDDWithEnv, which takes an additional parameter (varName) in the constructor, then overrode the compute() function to return something like split.getPipeEnvVars().getOrElse(varName, "") + "|" + value.toString() as the value. This is obviously less general and makes certain assumptions about the input data. You also need to write several wrappers in SparkContext, so that you can do something like sc.textFileWithEnv("hdfs path", "mapreduce_map_input_file"). I was hoping to do something like sc.textFile(hdfs_path).pipe("/usr/bin/awk '{print \"${mapreduce_map_input_file}\",$0}'"), but this gives me some weird Kryo buffer overflow exception... Haven't gotten a chance to look into the details yet. -Simon

On Fri, Aug 1, 2014 at 7:38 AM, Roberto Torella roberto.tore...@gmail.com wrote: Hi Simon, I'm trying to do the same but I'm quite lost. How did you do that? (Too direct? :) Thanks and ciao, r-

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/access-hdfs-file-name-in-map-tp6551p11160.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
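For readers without a custom HadoopRDD, the effect being described - tagging each record with the name of the file it came from, as Hive's INPUT__FILE__NAME does - can be sketched in plain Python. No Spark here; `lines_with_filename` and the file names are invented for the demo:

```python
import glob
import os
import tempfile

def lines_with_filename(pattern):
    # yield (filename, line) pairs, mimicking what Hive's
    # INPUT__FILE__NAME (or a HadoopRDD subclass) exposes per record
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                yield os.path.basename(path), line.rstrip("\n")

# demo on two temporary CSV files (names chosen for illustration)
d = tempfile.mkdtemp()
for name, body in [("2013-01-01.csv", "a,1"), ("2013-01-02.csv", "b,2")]:
    with open(os.path.join(d, name), "w") as f:
        f.write(body + "\n")

tagged = list(lines_with_filename(os.path.join(d, "*.csv")))
# each record now carries its source file name alongside the line
```

This is useful when, as in the thread, metadata like a date is encoded only in the file name and not in the rows themselves.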
Task progress in ipython?
I am pretty happy using pyspark with the IPython notebook. The only issue is that I need to look at the console output or the Spark UI to track task progress. I wonder if anyone has thought of, or better yet written, something to display progress bars on the same page when I evaluate a cell in ipynb? I know the Spark UI is just one window switch and refresh away, but I want something more integrated :) Thanks..
performance difference between spark-shell and spark-submit
Hi all, I implemented a transformation on HDFS files with Spark, first tested in spark-shell (with yarn). I then implemented essentially the same logic as a Spark program (Scala), built a jar file, and used spark-submit to execute it on my yarn cluster. The weird thing is that the spark-submit approach is almost 3x as slow (500s vs 1500s). I am curious why... I am essentially writing a benchmarking program to test the performance of Spark in various settings, so my Spark program has a Benchmark abstract class, a trait for some common things, and an actual class to perform one specific benchmark. My Spark main creates an instance of my benchmark class and executes something like benchmark1.run(), which in turn kicks off the Spark context, performs the data manipulation, etc. I wonder if such constructs introduced some overhead compared to direct manipulation commands in spark-shell. Thanks. -Simon
Re: cache spark sql parquet file in memory?
Is there a way to start Tachyon on top of a yarn cluster?

On Jun 7, 2014 2:11 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: I was also thinking of using Tachyon to store parquet files - maybe tomorrow I will give it a try as well.

2014-06-07 20:01 GMT+02:00 Michael Armbrust mich...@databricks.com: Not a stupid question! I would like to be able to do this. For now, you might try writing the data to Tachyon http://tachyon-project.org/ instead of HDFS. This is untested though; please report any issues you run into. Michael

On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com wrote: This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, thereby making Spark SQL queries faster. Thanks. -Simon
cache spark sql parquet file in memory?
This might be a stupid question... but it seems that saveAsParquetFile() writes everything back to HDFS. I am wondering if it is possible to cache parquet-format intermediate results in memory, thereby making Spark SQL queries faster. Thanks. -Simon
spark worker and yarn memory
I am slightly confused about the --executor-memory setting. My yarn cluster has a maximum container memory of 8192MB. When I specify --executor-memory 8G in my spark-shell, no container can be started at all. It only works when I lower the executor memory to 7G. But then, on yarn, I see 2 containers per node, using 16G of memory. And on the Spark UI, it shows that each worker has 4GB of memory, rather than 7. Can someone explain the relationship among the numbers I see here? Thanks.
compress in-memory cache?
I have a working set larger than available memory, thus I am hoping to turn on RDD compression so that I can store more in memory. Strangely, it made no difference: the number of cached partitions, fraction cached, and size in memory remain the same. Any ideas? I confirmed that RDD compression wasn't on before and was on for the second test: scala> sc.getConf.getAll foreach println ... (spark.rdd.compress,true) ... I haven't tried lzo vs snappy, but my guess is that either one should provide at least some benefit.. Thanks. -Simon
Re: compress in-memory cache?
Thanks.. it works now. -Simon

On Thu, Jun 5, 2014 at 10:47 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Have you set the persistence level of the RDD to MEMORY_ONLY_SER (http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)? If you're calling cache(), the default persistence level is MEMORY_ONLY, so that setting will have no impact.

On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon) Chen xche...@gmail.com wrote: I have a working set larger than available memory, thus I am hoping to turn on RDD compression so that I can store more in memory. Strangely, it made no difference: the number of cached partitions, fraction cached, and size in memory remain the same. Any ideas? I confirmed that RDD compression wasn't on before and was on for the second test: scala> sc.getConf.getAll foreach println ... (spark.rdd.compress,true) ... I haven't tried lzo vs snappy, but my guess is that either one should provide at least some benefit.. Thanks. -Simon
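Why MEMORY_ONLY_SER matters here: spark.rdd.compress only applies to serialized bytes, and serialized, repetitive records compress well. A small stdlib-only illustration of the effect (pickle and zlib merely stand in for Spark's serializer and codec; the sample data is made up):

```python
import pickle
import zlib

# highly repetitive records, the kind that compress well once serialized
records = [("2013-01-01", float(i % 10)) for i in range(10000)]

raw = pickle.dumps(records)        # stand-in for MEMORY_ONLY_SER bytes
compressed = zlib.compress(raw)    # stand-in for spark.rdd.compress

ratio = len(compressed) / len(raw)
# compression has something to act on only because the data is
# serialized first; deserialized in-memory objects are never compressed
```

With MEMORY_ONLY there are no serialized bytes to compress, which is why the cache sizes did not change until the persistence level was switched.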
Re: spark worker and yarn memory
Nice explanation... Thanks!

On Thu, Jun 5, 2014 at 5:50 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Xu, As crazy as it might sound, this all makes sense. There are a few different quantities at play here:
* the heap size of the executor (controlled by --executor-memory)
* the amount of memory Spark requests from yarn (the heap size plus 384 MB to account for fixed memory costs outside of the heap)
* the amount of memory yarn grants to the container (yarn rounds up to the nearest multiple of yarn.scheduler.minimum-allocation-mb or yarn.scheduler.fair.increment-allocation-mb, depending on the scheduler used)
* the amount of memory Spark uses for caching on each executor, which is spark.storage.memoryFraction (default 0.6) of the executor heap size

So, with --executor-memory 8g, Spark requests 8g + 384m from yarn, which doesn't fit into its container max. With --executor-memory 7g, Spark requests 7g + 384m from yarn, which fits into its container max and gets rounded up to 8g by the yarn scheduler. 7g is still used as the executor heap size, and 0.6 of this is about 4g, shown as the cache space in the Spark UI. -Sandy

On Jun 5, 2014, at 9:44 AM, Xu (Simon) Chen xche...@gmail.com wrote: I am slightly confused about the --executor-memory setting. My yarn cluster has a maximum container memory of 8192MB. When I specify --executor-memory 8G in my spark-shell, no container can be started at all. It only works when I lower the executor memory to 7G. But then, on yarn, I see 2 containers per node, using 16G of memory. And on the Spark UI, it shows that each worker has 4GB of memory, rather than 7. Can someone explain the relationship among the numbers I see here? Thanks.
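Sandy's arithmetic can be written down directly. A small sketch of the calculation (the 1024 MB minimum allocation is an assumed scheduler setting, not something stated in the thread):

```python
import math

def yarn_grant(executor_mem_mb, container_max_mb,
               min_alloc_mb=1024, overhead_mb=384, memory_fraction=0.6):
    # Spark asks YARN for heap + fixed overhead; YARN rounds the grant
    # up to a multiple of its minimum allocation; the cache gets
    # memoryFraction of the heap. min_alloc_mb=1024 is an assumed
    # scheduler setting for illustration.
    requested = executor_mem_mb + overhead_mb
    if requested > container_max_mb:
        return None  # request rejected: no container can start
    granted = math.ceil(requested / min_alloc_mb) * min_alloc_mb
    cache_mb = executor_mem_mb * memory_fraction
    return granted, cache_mb

# --executor-memory 8G: 8192 + 384 > 8192, so nothing starts
# --executor-memory 7G: 7168 + 384 = 7552, rounded up to 8192 by YARN,
#                       while the cache is 0.6 * 7168 MB, i.e. ~4.2 GB
```

This reproduces all three observations in the thread: the 8G request failing outright, yarn reporting 8G per container, and the Spark UI showing roughly 4G of cache per worker.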
Re: Join : Giving incorrect result
Maybe your two workers have different assembly jar files? I just ran into a similar problem where my spark-shell was using a different jar file than my workers - I got really confusing results.

On Jun 4, 2014 8:33 AM, Ajay Srivastava a_k_srivast...@yahoo.com wrote: Hi, I am doing a join of two RDDs which gives different results (counting the number of records) each time I run the code on the same input. The input files are large enough to be divided into two splits. When the program runs on two workers with a single core assigned to each, the output is consistent and looks correct. But when a single worker is used with two or more cores, the result seems to be random: every time, the count of joined records is different. Does this sound like a defect, or do I need to take care of something while using join? I am using spark-0.9.1. Regards, Ajay
Re: access hdfs file name in map()
N/M.. I wrote a HadoopRDD subclass and appended one env field of the HadoopPartition to the value in the compute() function. It worked pretty well. Thanks!

On Jun 4, 2014 12:22 AM, Xu (Simon) Chen xche...@gmail.com wrote: I don't quite get it.. mapPartitionsWithIndex takes a function that maps an integer index and an iterator to another iterator. How does that help with retrieving the HDFS file name? I am obviously missing some context.. Thanks.

On May 30, 2014 1:28 AM, Aaron Davidson ilike...@gmail.com wrote: Currently there is no way to do this using textFile(). However, you could pretty straightforwardly define your own subclass of HadoopRDD [1] in order to get access to this information (likely using mapPartitionsWithIndex to look up the InputSplit for a particular partition). Note that sc.textFile() is just a convenience function to construct a new HadoopRDD [2]. [1] HadoopRDD: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93 [2] sc.textFile(): https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456

On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hello, a quick question about using Spark to parse text-format CSV files stored on HDFS. I have something very simple: sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p => ("XXX", p(0), p(2))) Here, I want to replace "XXX" with a string which is the current CSV filename for the line. This is needed since some information may be encoded in the file name, like a date. In Hive, I am able to define an external table and use INPUT__FILE__NAME as a column in queries. I wonder if Spark has something similar. Thanks! -Simon
Re: access hdfs file name in map()
I don't quite get it.. mapPartitionsWithIndex takes a function that maps an integer index and an iterator to another iterator. How does that help with retrieving the HDFS file name? I am obviously missing some context.. Thanks.

On May 30, 2014 1:28 AM, Aaron Davidson ilike...@gmail.com wrote: Currently there is no way to do this using textFile(). However, you could pretty straightforwardly define your own subclass of HadoopRDD [1] in order to get access to this information (likely using mapPartitionsWithIndex to look up the InputSplit for a particular partition). Note that sc.textFile() is just a convenience function to construct a new HadoopRDD [2]. [1] HadoopRDD: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93 [2] sc.textFile(): https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456

On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hello, a quick question about using Spark to parse text-format CSV files stored on HDFS. I have something very simple: sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p => ("XXX", p(0), p(2))) Here, I want to replace "XXX" with a string which is the current CSV filename for the line. This is needed since some information may be encoded in the file name, like a date. In Hive, I am able to define an external table and use INPUT__FILE__NAME as a column in queries. I wonder if Spark has something similar. Thanks! -Simon
Re: spark 1.0.0 on yarn
OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon

On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote: That helped a bit... Now I have a different failure: the startup process is stuck in an infinite loop outputting the following message: 14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1401672868277 yarnAppState: ACCEPTED I am using the hadoop 2 prebuilt package. Probably it doesn't have the latest yarn client. -Simon

On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com wrote: As a debugging step, does it work if you use a single resource manager with the key yarn.resourcemanager.address instead of using two named resource managers? I wonder if somehow the YARN client can't detect this multi-master set-up.

On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark-shell and interact with workers spawned on my yarn cluster. So in my /opt/hadoop/conf/yarn-site.xml, I have:

...
<property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>controller-1.mycomp.com:23140</value>
</property>
...
<property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>controller-2.mycomp.com:23140</value>
</property>
...

And the other usual stuff. So spark 1.0 is launched like this: Spark Command: java -cp ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class org.apache.spark.repl.Main I do see /opt/hadoop/conf included, but I'm not sure it's the right place. Thanks.. -Simon

On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess; it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml, do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being set up correctly. - Patrick

On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hi all, I tried a couple of ways, but couldn't get it to work.. The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client The help info of spark-shell seems to suggest --master yarn --deploy-mode cluster. But either way, I am seeing the following messages: 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) My guess is that spark-shell is trying to talk to the resource manager to set up spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from, though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up. Any ideas? Thanks. -Simon
pyspark problems on yarn (job not parallelized, and Py4JJavaError)
Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows:

IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G

When I create a notebook, I can see workers being created, and indeed I see the Spark UI running on my client machine on port 4040. I have the following simple script:

import pyspark
data = sc.textFile("hdfs://test/tmp/data/*").cache()
oneday = data.map(lambda line: line.split(",")) \
             .map(lambda f: (f[0], float(f[1]))) \
             .filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02") \
             .map(lambda t: (parser.parse(t[0]), t[1]))
oneday.take(1)

By executing this, I see that my client machine (where ipython is launched) is reading all the data from HDFS and producing the result of take(1), rather than my worker nodes... When I do data.count(), things blow up altogether, but I do see in the error message something like this: Error from python worker: /usr/bin/python: No module named pyspark Am I supposed to install pyspark on every worker node? Thanks. -Simon
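As a side note, the date-range filter in the script can be checked on plain Python lists, independent of Spark (strptime stands in for dateutil's parser.parse here, and the sample rows are made up):

```python
from datetime import datetime

# made-up sample rows in the same "date,value" shape as the HDFS data
rows = ["2012-12-31,5.0", "2013-01-01,1.5", "2013-01-01,2.5", "2013-01-02,9.0"]

# same pipeline as the pyspark script, on a plain list: split, convert,
# keep one day via lexicographic string comparison, then parse the date
oneday = [
    (datetime.strptime(d, "%Y-%m-%d"), v)
    for d, v in ((f[0], float(f[1]))
                 for f in (line.split(",") for line in rows))
    if "2013-01-01" <= d < "2013-01-02"
]
```

The string comparison works because ISO-8601 dates sort lexicographically in chronological order, so only the 2013-01-01 rows survive the filter.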
Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)
1) Yes, sc.parallelize(range(10)).count() has the same error.
2) The files seem to be correct.
3) I have trouble at this step ("ImportError: No module named pyspark"), but I do seem to have the files in the jar:

$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark
$ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
pyspark/ pyspark/rddsampler.py pyspark/broadcast.py pyspark/serializers.py pyspark/java_gateway.py pyspark/resultiterable.py pyspark/accumulators.py pyspark/sql.py pyspark/__init__.py pyspark/daemon.py pyspark/context.py pyspark/cloudpickle.py pyspark/join.py pyspark/tests.py pyspark/files.py pyspark/conf.py pyspark/rdd.py pyspark/storagelevel.py pyspark/statcounter.py pyspark/shell.py pyspark/worker.py

4) All my nodes should be running Java 7, so this is probably not related.
5) I'll do it in a bit.

Any ideas on 3)? Thanks. -Simon

On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote: Hi Simon, You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try: 1) Run a simple pyspark shell with yarn-client, and do sc.parallelize(range(10)).count() to see if you get the same error. 2) If so, check whether your assembly jar is compiled correctly. Run "jar -tf path/to/assembly/jar pyspark" and "jar -tf path/to/assembly/jar py4j" to see if the files are there. For Py4J, you need both the python files and the Java class files. 3) If the files are there, try running a simple python shell (not the pyspark shell) with the assembly jar on the PYTHONPATH: PYTHONPATH=/path/to/assembly/jar python, then import pyspark. 4) If that works, try it on every worker node. If it doesn't work, there is probably something wrong with your jar. There is a known issue for PySpark on YARN: jars built with Java 7 cannot be properly opened by Java 6. I would either verify that the JAVA_HOME set on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply build your jar with Java 6: cd /path/to/spark/home; JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0 5) You can check out http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application, which has more detailed information about how to debug an application running on YARN in general. In my experience, the steps outlined there are quite useful. Let me know if you get it working (or not). Cheers, Andrew

2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com: Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I create a notebook, I can see workers being created, and indeed I see the Spark UI running on my client machine on port 4040. I have the following simple script: import pyspark data = sc.textFile("hdfs://test/tmp/data/*").cache() oneday = data.map(lambda line: line.split(",")).map(lambda f: (f[0], float(f[1]))).filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").map(lambda t: (parser.parse(t[0]), t[1])) oneday.take(1) By executing this, I see that my client machine (where ipython is launched) is reading all the data from HDFS and producing the result of take(1), rather than my worker nodes... When I do data.count(), things blow up altogether, but I do see in the error message something like this: Error from python worker: /usr/bin/python: No module named pyspark Am I supposed to install pyspark on every worker node? Thanks. -Simon
Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)
So, I did specify SPARK_JAR in my pyspark prog. I also checked the workers; the jar file seems to be distributed and included in the classpath correctly. I think the problem is likely at step 3.. I build my jar file with maven, like this: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean package Anything that I might have missed? Thanks. -Simon

On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen xche...@gmail.com wrote: 1) Yes, sc.parallelize(range(10)).count() has the same error. 2) The files seem to be correct. 3) I have trouble at this step ("ImportError: No module named pyspark"), but I do seem to have the files in the jar:

$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark
$ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
pyspark/ pyspark/rddsampler.py pyspark/broadcast.py pyspark/serializers.py pyspark/java_gateway.py pyspark/resultiterable.py pyspark/accumulators.py pyspark/sql.py pyspark/__init__.py pyspark/daemon.py pyspark/context.py pyspark/cloudpickle.py pyspark/join.py pyspark/tests.py pyspark/files.py pyspark/conf.py pyspark/rdd.py pyspark/storagelevel.py pyspark/statcounter.py pyspark/shell.py pyspark/worker.py

4) All my nodes should be running Java 7, so this is probably not related. 5) I'll do it in a bit. Any ideas on 3)? Thanks. -Simon

On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote: Hi Simon, You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try: 1) Run a simple pyspark shell with yarn-client, and do sc.parallelize(range(10)).count() to see if you get the same error. 2) If so, check whether your assembly jar is compiled correctly. Run "jar -tf path/to/assembly/jar pyspark" and "jar -tf path/to/assembly/jar py4j" to see if the files are there. For Py4J, you need both the python files and the Java class files. 3) If the files are there, try running a simple python shell (not the pyspark shell) with the assembly jar on the PYTHONPATH: PYTHONPATH=/path/to/assembly/jar python, then import pyspark. 4) If that works, try it on every worker node. If it doesn't work, there is probably something wrong with your jar. There is a known issue for PySpark on YARN: jars built with Java 7 cannot be properly opened by Java 6. I would either verify that the JAVA_HOME set on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply build your jar with Java 6: cd /path/to/spark/home; JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0 5) You can check out http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application, which has more detailed information about how to debug an application running on YARN in general. In my experience, the steps outlined there are quite useful. Let me know if you get it working (or not). Cheers, Andrew

2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com: Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I create a notebook, I can see workers being created, and indeed I see the Spark UI running on my client machine on port 4040. I have the following simple script: import pyspark data = sc.textFile("hdfs://test/tmp/data/*").cache() oneday = data.map(lambda line: line.split(",")).map(lambda f: (f[0], float(f[1]))).filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").map(lambda t: (parser.parse(t[0]), t[1])) oneday.take(1) By executing this, I see that my client machine (where ipython is launched) is reading all the data from HDFS and producing the result of take(1), rather than my worker nodes... When I do data.count(), things blow up altogether, but I do see in the error message something like this: Error from python worker: /usr/bin/python: No module named pyspark Am I supposed to install pyspark on every worker node? Thanks. -Simon
Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)
I asked several people; no one seems to believe that we can do this: PYTHONPATH=/path/to/assembly/jar python, then import pyspark. The following pull request did mention something about generating a zip file for all python-related modules: https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html I've tested that zipped modules can at least be imported via zipimport. Any ideas? -Simon

On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote: Hi Simon, You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try: 1) Run a simple pyspark shell with yarn-client, and do sc.parallelize(range(10)).count() to see if you get the same error. 2) If so, check whether your assembly jar is compiled correctly. Run "jar -tf path/to/assembly/jar pyspark" and "jar -tf path/to/assembly/jar py4j" to see if the files are there. For Py4J, you need both the python files and the Java class files. 3) If the files are there, try running a simple python shell (not the pyspark shell) with the assembly jar on the PYTHONPATH: PYTHONPATH=/path/to/assembly/jar python, then import pyspark. 4) If that works, try it on every worker node. If it doesn't work, there is probably something wrong with your jar. There is a known issue for PySpark on YARN: jars built with Java 7 cannot be properly opened by Java 6. I would either verify that the JAVA_HOME set on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply build your jar with Java 6: cd /path/to/spark/home; JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0 5) You can check out http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application, which has more detailed information about how to debug an application running on YARN in general. In my experience, the steps outlined there are quite useful. Let me know if you get it working (or not). Cheers, Andrew

2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com: Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I create a notebook, I can see workers being created, and indeed I see the Spark UI running on my client machine on port 4040. I have the following simple script: import pyspark data = sc.textFile("hdfs://test/tmp/data/*").cache() oneday = data.map(lambda line: line.split(",")).map(lambda f: (f[0], float(f[1]))).filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").map(lambda t: (parser.parse(t[0]), t[1])) oneday.take(1) By executing this, I see that my client machine (where ipython is launched) is reading all the data from HDFS and producing the result of take(1), rather than my worker nodes... When I do data.count(), things blow up altogether, but I do see in the error message something like this: Error from python worker: /usr/bin/python: No module named pyspark Am I supposed to install pyspark on every worker node? Thanks. -Simon
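On the zipimport point: Python can import straight out of a zip archive placed on sys.path, and a jar is just a zip, which is why PYTHONPATH=assembly.jar can work when the python packages sit at the archive root. A stdlib-only demonstration (the package name, file name, and version string are all invented for the demo):

```python
import importlib
import os
import sys
import tempfile
import zipfile

# build a tiny "assembly" archive containing a python package, then
# import it straight off sys.path; the .jar extension is irrelevant,
# zipimport only cares that the file is a valid zip archive
zpath = os.path.join(tempfile.mkdtemp(), "fake_assembly.jar")
with zipfile.ZipFile(zpath, "w") as z:
    z.writestr("mypkg/__init__.py", "VERSION = '1.0.0'\n")

sys.path.insert(0, zpath)
mypkg = importlib.import_module("mypkg")
```

If the same experiment fails with a real assembly jar, one plausible cause is that the python files are not at the archive root or the zip central directory is not readable by zipimport (e.g. a 64-bit zip written by newer jar tools), which matches the symptoms described in this thread.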
Re: spark 1.0.0 on yarn
I built my new package like this: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean package Spark-shell is working now, but pyspark is still broken. I reported the problem on a different thread. Please take a look if you can... Desperately need ideas.. Thanks. -Simon On Mon, Jun 2, 2014 at 2:47 PM, Patrick Wendell pwend...@gmail.com wrote: Okay I'm guessing that our upstreaming Hadoop2 package isn't new enough to work with CDH5. We should probably clarify this in our downloads. Thanks for reporting this. What was the exact string you used when building? Also which CDH-5 version are you building against? On Mon, Jun 2, 2014 at 8:11 AM, Xu (Simon) Chen xche...@gmail.com wrote: OK, rebuilding the assembly jar file with cdh5 works now... Thanks.. -Simon On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote: That helped a bit... Now I have a different failure: the start up process is stuck in an infinite loop outputting the following message: 14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1401672868277 yarnAppState: ACCEPTED I am using the hadoop 2 prebuild package. Probably it doesn't have the latest yarn client. -Simon On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com wrote: As a debugging step, does it work if you use a single resource manager with the key yarn.resourcemanager.address instead of using two named resource managers? I wonder if somehow the YARN client can't detect this multi-master set-up. On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark-shell and interact with workers spawned on my yarn cluster. So in my /opt/hadoop/conf/yarn-site.xml, I have: ... property nameyarn.resourcemanager.address.rm1/name valuecontroller-1.mycomp.com:23140/value /property ... 
<property> <name>yarn.resourcemanager.address.rm2</name> <value>controller-2.mycomp.com:23140</value> </property> ... And the other usual stuff. So spark 1.0 is launched like this: Spark Command: java -cp ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class org.apache.spark.repl.Main I do see /opt/hadoop/conf included, but not sure it's the right place. Thanks.. -Simon On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess, it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml, do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being set up correctly. - Patrick On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hi all, I tried a couple ways, but couldn't get it to work.. The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client Help info of spark-shell seems to be suggesting --master yarn --deploy-mode cluster. But either way, I am seeing the following messages: 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032.
Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) My guess is that spark-shell is trying to talk to the resource manager to set up spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up. Any ideas? Thanks. -Simon
Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)
OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html And my jar file has 70011 files. Fantastic.. On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen xche...@gmail.com wrote: I asked several people, no one seems to believe that we can do this: $ PYTHONPATH=/path/to/assembly/jar python >>> import pyspark The following pull request did mention something about generating a zip file for all python related modules: https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html I've tested that zipped modules can at least be imported via zipimport. Any ideas? -Simon On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote: Hi Simon, You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try: 1) Run a simple pyspark shell with yarn-client, and do sc.parallelize(range(10)).count() to see if you get the same error 2) If so, check if your assembly jar is compiled correctly. Run $ jar -tf path/to/assembly/jar pyspark $ jar -tf path/to/assembly/jar py4j to see if the files are there. For Py4j, you need both the python files and the Java class files.
I would either verify that the JAVA_HOME set on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply build your jar with Java 6: $ cd /path/to/spark/home $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0 5) You can check out http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application, which has more detailed information about how to debug running an application on YARN in general. In my experience, the steps outlined there are quite useful. Let me know if you get it working (or not). Cheers, Andrew 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com: Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I create a notebook, I can see workers being created and indeed I see the spark UI running on my client machine on port 4040. I have the following simple script: import pyspark data = sc.textFile("hdfs://test/tmp/data/*").cache() oneday = data.map(lambda line: line.split(",")).\ map(lambda f: (f[0], float(f[1]))).\ filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\ map(lambda t: (parser.parse(t[0]), t[1])) oneday.take(1) By executing this, I see that it is my client machine (where ipython is launched) that is reading all the data from HDFS and producing the result of take(1), rather than my worker nodes... When I do data.count(), things would blow up altogether. But I do see in the error message something like this: Error from python worker: /usr/bin/python: No module named pyspark Am I supposed to install pyspark on every worker node? Thanks. -Simon
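Andrew's step 2) above can also be scripted locally. The sketch below (stdlib only; the tiny demo archive stands in for a real assembly jar, whose path you would substitute) checks that the jar both lists a pyspark/ package and is readable by Python's zipimport machinery:

```python
import os
import tempfile
import zipfile
import zipimport

def check_jar_for_package(jar_path, package="pyspark"):
    """Return True if the archive lists `package` and zipimport can open it."""
    with zipfile.ZipFile(jar_path) as zf:
        has_pkg = any(n.startswith(package + "/") for n in zf.namelist())
    try:
        # zipimporter raises ZipImportError on archives it cannot read
        zipimport.zipimporter(jar_path)
    except zipimport.ZipImportError:
        return False
    return has_pkg

# Demo on a tiny stand-in for the assembly jar:
demo_jar = os.path.join(tempfile.mkdtemp(), "assembly.zip")
with zipfile.ZipFile(demo_jar, "w") as zf:
    zf.writestr("pyspark/__init__.py", "version = 'demo'\n")
print(check_jar_for_package(demo_jar))  # -> True
```

If this returns False for the real jar on a worker node, the jar is either missing the python files or unreadable by that node's zip reader.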
Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)
Nope... didn't try java 6. The standard installation guide didn't say anything about java 7 and suggested doing -DskipTests for the build.. http://spark.apache.org/docs/latest/building-with-maven.html So, I didn't see the warning message... On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell pwend...@gmail.com wrote: Are you building Spark with Java 6 or Java 7? Java 6 uses the extended Zip format and Java 7 uses Zip64. I think we've tried to add some build warnings if Java 7 is used, for this reason: https://github.com/apache/spark/blob/master/make-distribution.sh#L102 Any luck if you use JDK 6 to compile? On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen xche...@gmail.com wrote: OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html And my jar file has 70011 files. Fantastic.. On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen xche...@gmail.com wrote: I asked several people, no one seems to believe that we can do this: $ PYTHONPATH=/path/to/assembly/jar python >>> import pyspark The following pull request did mention something about generating a zip file for all python related modules: https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html I've tested that zipped modules can at least be imported via zipimport. Any ideas? -Simon On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote: Hi Simon, You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try: 1) Run a simple pyspark shell with yarn-client, and do sc.parallelize(range(10)).count() to see if you get the same error 2) If so, check if your assembly jar is compiled correctly. Run $ jar -tf path/to/assembly/jar pyspark $ jar -tf path/to/assembly/jar py4j to see if the files are there. For Py4j, you need both the python files and the Java class files.
3) If the files are there, try running a simple python shell (not pyspark shell) with the assembly jar on the PYTHONPATH: $ PYTHONPATH=/path/to/assembly/jar python >>> import pyspark 4) If that works, try it on every worker node. If it doesn't work, there is probably something wrong with your jar. There is a known issue for PySpark on YARN - jars built with Java 7 cannot be properly opened by Java 6. I would either verify that the JAVA_HOME set on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply build your jar with Java 6: $ cd /path/to/spark/home $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0 5) You can check out http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application, which has more detailed information about how to debug running an application on YARN in general. In my experience, the steps outlined there are quite useful. Let me know if you get it working (or not). Cheers, Andrew 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com: Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G When I create a notebook, I can see workers being created and indeed I see the spark UI running on my client machine on port 4040. I have the following simple script: import pyspark data = sc.textFile("hdfs://test/tmp/data/*").cache() oneday = data.map(lambda line: line.split(",")).\ map(lambda f: (f[0], float(f[1]))).\ filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\ map(lambda t: (parser.parse(t[0]), t[1])) oneday.take(1) By executing this, I see that it is my client machine (where ipython is launched) that is reading all the data from HDFS and producing the result of take(1), rather than my worker nodes... When I do data.count(), things would blow up altogether.
But I do see in the error message something like this: Error from python worker: /usr/bin/python: No module named pyspark Am I supposed to install pyspark on every worker node? Thanks. -Simon
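The 70011-file jar earlier in this thread is the tell for Patrick's Java 6 vs Java 7 point: the classic zip format tops out at 65,535 entries, beyond which an archive must be written as Zip64, which Java 6's zip reader cannot open. A sketch of that check (stdlib only; the small demo archive is made up, and a real run would point at the actual assembly jar):

```python
import os
import tempfile
import zipfile

CLASSIC_ZIP_MAX_ENTRIES = 65535  # pre-Zip64 limit on entry count

def needs_zip64(jar_path):
    """True if the archive has more entries than the classic zip format
    allows, i.e. it had to be written as Zip64 and Java 6 cannot read it."""
    with zipfile.ZipFile(jar_path) as zf:
        return len(zf.infolist()) > CLASSIC_ZIP_MAX_ENTRIES

# Demo with a tiny archive well under the limit:
demo = os.path.join(tempfile.mkdtemp(), "small.zip")
with zipfile.ZipFile(demo, "w") as zf:
    for i in range(3):
        zf.writestr("f%d.txt" % i, "x")
print(needs_zip64(demo))  # -> False
```

A 70011-entry jar would report True here, matching the behavior described in the linked python-list thread.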
Re: spark 1.0.0 on yarn
Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark-shell and interact with workers spawned on my yarn cluster. So in my /opt/hadoop/conf/yarn-site.xml, I have: ... <property> <name>yarn.resourcemanager.address.rm1</name> <value>controller-1.mycomp.com:23140</value> </property> ... <property> <name>yarn.resourcemanager.address.rm2</name> <value>controller-2.mycomp.com:23140</value> </property> ... And the other usual stuff. So spark 1.0 is launched like this: Spark Command: java -cp ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class org.apache.spark.repl.Main I do see /opt/hadoop/conf included, but not sure it's the right place. Thanks.. -Simon On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess, it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml, do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being set up correctly. - Patrick On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hi all, I tried a couple ways, but couldn't get it to work..
The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client Help info of spark-shell seems to be suggesting --master yarn --deploy-mode cluster. But either way, I am seeing the following messages: 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) My guess is that spark-shell is trying to talk to resource manager to setup spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up. Any ideas? Thanks. -Simon
Re: spark 1.0.0 on yarn
That helped a bit... Now I have a different failure: the startup process is stuck in an infinite loop outputting the following message: 14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1401672868277 yarnAppState: ACCEPTED I am using the hadoop 2 prebuilt package. Probably it doesn't have the latest yarn client. -Simon On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com wrote: As a debugging step, does it work if you use a single resource manager with the key yarn.resourcemanager.address instead of using two named resource managers? I wonder if somehow the YARN client can't detect this multi-master set-up. On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com wrote: Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark-shell and interact with workers spawned on my yarn cluster. So in my /opt/hadoop/conf/yarn-site.xml, I have: ... <property> <name>yarn.resourcemanager.address.rm1</name> <value>controller-1.mycomp.com:23140</value> </property> ... <property> <name>yarn.resourcemanager.address.rm2</name> <value>controller-2.mycomp.com:23140</value> </property> ... And the other usual stuff. So spark 1.0 is launched like this: Spark Command: java -cp ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class org.apache.spark.repl.Main I do see /opt/hadoop/conf included, but not sure it's the right place. Thanks..
-Simon On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess, it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml, do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being set up correctly. - Patrick On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com wrote: Hi all, I tried a couple ways, but couldn't get it to work.. The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client Help info of spark-shell seems to be suggesting --master yarn --deploy-mode cluster. But either way, I am seeing the following messages: 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) My guess is that spark-shell is trying to talk to the resource manager to set up spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up. Any ideas? Thanks. -Simon
spark 1.0.0 on yarn
Hi all, I tried a couple ways, but couldn't get it to work.. The following seems to be what the online document (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting: SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client Help info of spark-shell seems to be suggesting --master yarn --deploy-mode cluster. But either way, I am seeing the following messages: 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) My guess is that spark-shell is trying to talk to the resource manager to set up spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from though. I am running CDH5 with two resource managers in HA mode. Their IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up. Any ideas? Thanks. -Simon
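One quick sanity check for the symptom in this thread: confirm that the yarn-site.xml being handed to Spark actually defines resource manager addresses, since an empty lookup is what produces the 0.0.0.0:8032 default seen in the log. A stdlib-only sketch (the demo config below mirrors the HA setup quoted in this thread and is illustrative only):

```python
import tempfile
import xml.etree.ElementTree as ET

def rm_addresses(yarn_site_path):
    """Return every yarn.resourcemanager.address* property as {name: value}."""
    out = {}
    for prop in ET.parse(yarn_site_path).getroot().iter("property"):
        name = prop.findtext("name", "")
        if name.startswith("yarn.resourcemanager.address"):
            out[name] = prop.findtext("value", "")
    return out

# Demo with an HA-style config like the one in this thread:
demo = tempfile.NamedTemporaryFile(mode="w", suffix=".xml", delete=False)
demo.write("""<configuration>
  <property><name>yarn.resourcemanager.address.rm1</name><value>controller-1.mycomp.com:23140</value></property>
  <property><name>yarn.resourcemanager.address.rm2</name><value>controller-2.mycomp.com:23140</value></property>
</configuration>""")
demo.close()
print(rm_addresses(demo.name))
```

Running this against the file that HADOOP_CONF_DIR/YARN_CONF_DIR points at tells you whether the client could possibly find the HA addresses; it does not, of course, tell you whether Spark's YARN client understands the .rm1/.rm2 suffixed keys, which is the multi-master question Patrick raises.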
access hdfs file name in map()
Hello, A quick question about using spark to parse text-format CSV files stored on hdfs. I have something very simple: sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p => ("XXX", p(0), p(2))) Here, I want to replace "XXX" with a string, which is the current csv filename for the line. This is needed since some information may be encoded in the file name, like date. In Hive, I am able to define an external table and use INPUT__FILE__NAME as a column in queries. I wonder if spark has something similar. Thanks! -Simon
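For what it's worth, Spark 1.0 added sc.wholeTextFiles(path), which returns (filename, content) pairs and covers this case when each file fits in memory. The per-line shape being asked for can be sketched in pure Python (the data and function name here are made up; this just mimics Hive's INPUT__FILE__NAME tagging outside of Spark):

```python
def tag_lines_with_filename(files):
    """files: {csv_filename: [lines]} -- emit (filename, col0, col2) per line,
    the tuple shape the question's map() is trying to produce."""
    for fname, lines in files.items():
        for line in lines:
            p = line.split(",")
            yield (fname, p[0], p[2])

# Demo: a date encoded in the file name, as described in the question.
demo = {"2013-01-01.csv": ["a,1,x", "b,2,y"]}
print(list(tag_lines_with_filename(demo)))
# -> [('2013-01-01.csv', 'a', 'x'), ('2013-01-01.csv', 'b', 'y')]
```

In Spark itself the equivalent would flatMap over the wholeTextFiles pairs, splitting each file's content into lines and tagging them with the key.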