Re: Using spark streaming to load data from Kafka to HDFS

2015-08-22 Thread Xu (Simon) Chen
Last time I checked, Camus doesn't support storing data as parquet, which
is a deal breaker for me. Otherwise it works well for my Kafka topics with
low data volume.

I am currently using spark streaming to ingest data, generate semi-realtime
stats and publish them to a dashboard, and dump the full dataset into HDFS as
parquet at a longer interval. One problem is that storing parquet is
sometimes time consuming, and that causes delays in my regular
stats-generating tasks. I am thinking of splitting my streaming job into
two, one for parquet output and one for stats generation, but obviously
this would consume the data from Kafka twice.
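For reference, here is a minimal sketch of keeping a single Kafka consumer while serving both outputs: cache each micro-batch inside foreachRDD so the stats path and the HDFS dump reuse the same data. This assumes the Spark 1.x streaming APIs; the topic, ZooKeeper quorum and output path below are hypothetical placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hdfs"), Seconds(60))
    // Hypothetical Kafka connection details; replace with your own.
    val lines = KafkaUtils.createStream(ssc, "zk1:2181", "ingest-group", Map("events" -> 2)).map(_._2)

    lines.foreachRDD { (rdd, time) =>
      rdd.cache() // reuse the same micro-batch for both outputs
      // 1) cheap stats for the dashboard
      println(s"batch $time: ${rdd.count()} records")
      // 2) dump the full batch to HDFS (plain text here; parquet would go through SQLContext)
      rdd.saveAsTextFile(s"hdfs://test/ingest/batch-${time.milliseconds}")
      rdd.unpersist()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}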

-Simon

On Wednesday, May 6, 2015, Rendy Bambang Junior rendy.b.jun...@gmail.com
wrote:

 Because using spark streaming looks a lot simpler. What's the
 difference between Camus and Spark Streaming for this case? Why does Camus excel?

 Rendy

 On Wed, May 6, 2015 at 2:15 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

 Also Kafka has a Hadoop consumer API for doing such things, please refer
 to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi


 2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com:

 Why not try https://github.com/linkedin/camus - Camus is a Kafka-to-HDFS
 pipeline.

 On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote:

 Hi all,

 I am planning to load data from Kafka to HDFS. Is it normal to use
 spark streaming to load data from Kafka to HDFS? What are the concerns with
 doing this?

 There is no processing to be done by Spark, only storing data from Kafka to
 HDFS for storage and further Spark processing.

 Rendy







Re: access hdfs file name in map()

2014-08-01 Thread Xu (Simon) Chen
Hi Roberto,

Ultimately, the info you need is set here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69

Being a spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as
HadoopRDDWithEnv, which takes an additional parameter (varName) in the
constructor, then overrides the compute() function to return something like
split.getPipeEnvVars.getOrElse(varName, "") + "|" + value.toString()
as the value. This obviously is less general and makes certain assumptions
about the input data. Also, you need to write several wrappers in
SparkContext, so that you can do something like
sc.textFileWithEnv("hdfs path", "mapreduce_map_input_file").
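For readers looking for a less invasive route, here is a minimal sketch that keeps the stock HadoopRDD and pulls the file name out of the InputSplit instead. It assumes a Spark version where HadoopRDD.mapPartitionsWithInputSplit is available; the path is hypothetical.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

// sc.hadoopFile returns a HadoopRDD under the hood, which exposes the InputSplit.
val raw = sc.hadoopFile("hdfs://test/path/*", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])

val withFileName = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
    val file = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (file, line.toString) } // prepend the source file to each line
  }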

I was hoping to do something like
sc.textFile(hdfs_path).pipe("/usr/bin/awk '{print \"${mapreduce_map_input_file}\",$0}'").
This gives me some weird Kryo buffer overflow exception... Haven't had a chance
to look into the details yet.

-Simon



On Fri, Aug 1, 2014 at 7:38 AM, Roberto Torella roberto.tore...@gmail.com
wrote:

 Hi Simon,

 I'm trying to do the same but I'm quite lost.

 How did you do that? (Too direct? :)


 Thanks and ciao,
 r-



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/access-hdfs-file-name-in-map-tp6551p11160.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Task progress in ipython?

2014-06-26 Thread Xu (Simon) Chen
I am pretty happy with using pyspark with the ipython notebook. The only issue
is that I need to look at the console output or the spark UI to track task
progress. I wonder if anyone has thought about, or better yet written, something
to display progress bars on the same page when I evaluate a cell in ipynb?

I know the spark UI is just one window switch and refresh away, but I want
something more integrated :)

Thanks..
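One possible building block, sketched on the Scala side only: this uses the SparkListener API, and wiring the output into an IPython notebook cell would still need extra plumbing, so treat it as a rough starting point rather than a finished integration.

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

// Crude console progress: count finished tasks per stage as they complete,
// instead of switching over to the Spark UI.
class ConsoleProgressListener extends SparkListener {
  @volatile private var finished = 0
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    finished = 0
    println(s"stage ${stageSubmitted.stageInfo.stageId}: ${stageSubmitted.stageInfo.numTasks} tasks submitted")
  }
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    finished += 1
    println(s"stage ${taskEnd.stageId}: $finished tasks finished")
  }
}

sc.addSparkListener(new ConsoleProgressListener)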


performance difference between spark-shell and spark-submit

2014-06-09 Thread Xu (Simon) Chen
Hi all,

I implemented a transformation on hdfs files with spark. I first tested it in
spark-shell (with yarn), then implemented essentially the same logic as a
spark program (scala), built a jar file and used spark-submit to execute it
on my yarn cluster. The weird thing is that the spark-submit approach is almost
3x as slow (500s vs 1500s). I am curious why...

I am essentially writing a benchmarking program to test the performance of
spark in various settings, so my spark program has a Benchmark abstract
class, a trait for some common things, and an actual class to perform one
specific benchmark. My spark main creates an instance of my benchmark class
and executes something like benchmark1.run(), which in turn kicks off the spark
context, performs the data manipulation, etc. I wonder if such constructs
introduce some overhead - compared to running the manipulation commands directly
in spark-shell.

Thanks.
-Simon


Re: cache spark sql parquet file in memory?

2014-06-07 Thread Xu (Simon) Chen
Is there a way to start tachyon on top of a yarn cluster?
 On Jun 7, 2014 2:11 PM, Marek Wiewiorka marek.wiewio...@gmail.com
wrote:

 I was also thinking of using tachyon to store parquet files - maybe
 tomorrow I will give a try as well.


 2014-06-07 20:01 GMT+02:00 Michael Armbrust mich...@databricks.com:

 Not a stupid question!  I would like to be able to do this.  For now, you
 might try writing the data to tachyon http://tachyon-project.org/
 instead of HDFS.  This is untested though, please report any issues you run
 into.

 Michael


 On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:

 This might be a stupid question... but it seems that saveAsParquetFile()
 writes everything back to HDFS. I am wondering if it is possible to cache
 parquet-format intermediate results in memory, and thereby make spark
 sql queries faster.

 Thanks.
 -Simon






cache spark sql parquet file in memory?

2014-06-06 Thread Xu (Simon) Chen
This might be a stupid question... but it seems that saveAsParquetFile()
writes everything back to HDFS. I am wondering if it is possible to cache
parquet-format intermediate results in memory, and thereby make spark
sql queries faster.

Thanks.
-Simon
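A minimal sketch of keeping the intermediate results in memory without writing them out at all, using Spark SQL's in-memory table cache (Spark 1.0 API assumed - registerAsTable was later renamed registerTempTable; the path is a hypothetical placeholder):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load (or build) a SchemaRDD, register it, and cache it as a columnar in-memory table.
val results = sqlContext.parquetFile("hdfs://test/tmp/results.parquet")
results.registerAsTable("results")
sqlContext.cacheTable("results")

// Subsequent queries hit the in-memory columnar cache instead of HDFS.
sqlContext.sql("SELECT COUNT(*) FROM results").collect()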


spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
I am slightly confused about the --executor-memory setting. My yarn
cluster has a maximum container memory of 8192MB.

When I specify --executor-memory 8G in my spark-shell, no container can
be started at all. It only works when I lower the executor memory to 7G.
But then, on yarn, I see 2 containers per node, using 16G of memory.

Then on the spark UI, it shows that each worker has 4GB of memory, rather
than 7.

Can someone explain the relationship among the numbers I see here?

Thanks.


compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
I have a working set larger than available memory, thus I am hoping to turn
on rdd compression so that I can store more in-memory. Strangely it made no
difference. The number of cached partitions, fraction cached, and size in
memory remain the same. Any ideas?

I confirmed that rdd compression wasn't on before and it was on for the
second test.

scala> sc.getConf.getAll foreach println
...
(spark.rdd.compress,true)
...

I haven't tried lzo vs snappy, but my guess is that either one should
provide at least some benefit..

Thanks.
-Simon


Re: compress in-memory cache?

2014-06-05 Thread Xu (Simon) Chen
Thanks.. it works now.

-Simon


On Thu, Jun 5, 2014 at 10:47 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Have you set the persistence level of the RDD to MEMORY_ONLY_SER (
 http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)?
 If you're calling cache, the default persistence level is MEMORY_ONLY so
 that setting will have no impact.


 On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon) Chen xche...@gmail.com wrote:

 I have a working set larger than available memory, thus I am hoping to
 turn on rdd compression so that I can store more in-memory. Strangely it
 made no difference. The number of cached partitions, fraction cached, and
 size in memory remain the same. Any ideas?

 I confirmed that rdd compression wasn't on before and it was on for the
 second test.

 scala> sc.getConf.getAll foreach println
 ...
 (spark.rdd.compress,true)
 ...

 I haven't tried lzo vs snappy, but my guess is that either one should
 provide at least some benefit..

 Thanks.
 -Simon
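For reference, a minimal sketch of the change Nick describes above - persist with a serialized storage level so that spark.rdd.compress actually applies (the path is a hypothetical placeholder):

import org.apache.spark.storage.StorageLevel

// spark.rdd.compress only affects serialized blocks, so use MEMORY_ONLY_SER
// instead of the default MEMORY_ONLY that cache() gives you.
val data = sc.textFile("hdfs://test/tmp/data/*")
data.persist(StorageLevel.MEMORY_ONLY_SER)
data.count() // materialize the compressed, serialized cache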





Re: spark worker and yarn memory

2014-06-05 Thread Xu (Simon) Chen
Nice explanation... Thanks!


On Thu, Jun 5, 2014 at 5:50 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Xu,

 As crazy as it might sound, this all makes sense.

 There are a few different quantities at play here:
 * the heap size of the executor (controlled by --executor-memory)
 * the amount of memory spark requests from yarn (the heap size plus
 384 mb to account for fixed memory costs outside of the heap)
 * the amount of memory yarn grants to the container (yarn rounds up to
 the nearest multiple of yarn.scheduler.minimum-allocation-mb or
 yarn.scheduler.fair.increment-allocation-mb, depending on the
 scheduler used)
 * the amount of memory spark uses for caching on each executor, which
 is spark.storage.memoryFraction (default 0.6) of the executor heap
 size

 So, with --executor-memory 8g, spark requests 8g + 384m from yarn,
 which doesn't fit into its container max.  With --executor-memory 7g,
 Spark requests 7g + 384m from yarn, which fits into its container max.
  This gets rounded up to 8g by the yarn scheduler.  7g is still used
 as the executor heap size, and .6 of this is about 4g, shown as the
 cache space in the spark UI.

 -Sandy

  On Jun 5, 2014, at 9:44 AM, Xu (Simon) Chen xche...@gmail.com wrote:
 
  I am slightly confused about the --executor-memory setting. My yarn
 cluster has a maximum container memory of 8192MB.
 
  When I specify --executor-memory 8G in my spark-shell, no container
 can be started at all. It only works when I lower the executor memory to
 7G. But then, on yarn, I see 2 container per node, using 16G of memory.
 
  Then on the spark UI, it shows that each worker has 4GB of memory,
 rather than 7.
 
  Can someone explain the relationship among the numbers I see here?
 
  Thanks.



Re: Join : Giving incorrect result

2014-06-04 Thread Xu (Simon) Chen
Maybe your two workers have different assembly jar files?

I just ran into a similar problem where my spark-shell was using a different
jar file than my workers - I got really confusing results.
On Jun 4, 2014 8:33 AM, Ajay Srivastava a_k_srivast...@yahoo.com wrote:

 Hi,

 I am doing a join of two RDDs which gives different results (counting the
 number of records) each time I run this code on the same input.

 The input files are large enough to be divided into two splits. When the
 program runs on two workers with a single core assigned to each, the output is
 consistent and looks correct. But when a single worker is used with two or
 more cores, the result seems to be random. Every time, the count of
 joined records is different.

 Does this sound like a defect, or do I need to take care of something while
 using join? I am using spark-0.9.1.

 Regards
 Ajay



Re: access hdfs file name in map()

2014-06-04 Thread Xu (Simon) Chen
N/M.. I wrote a HadoopRDD subclass and appended one env field of the
HadoopPartition to the value in the compute() function. It worked pretty well.

Thanks!
On Jun 4, 2014 12:22 AM, Xu (Simon) Chen xche...@gmail.com wrote:

 I don't quite get it..

 mapPartitionsWithIndex takes a function that maps an integer index and an
 iterator to another iterator. How does that help with retrieving the hdfs
 file name?

 I am obviously missing some context..

 Thanks.
  On May 30, 2014 1:28 AM, Aaron Davidson ilike...@gmail.com wrote:

 Currently there is not a way to do this using textFile(). However, you
 could pretty straightforwardly define your own subclass of HadoopRDD [1] in
 order to get access to this information (likely using
 mapPartitionsWithIndex to look up the InputSplit for a particular
 partition).

 Note that sc.textFile() is just a convenience function to construct a new
 HadoopRDD [2].

 [1] HadoopRDD:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93
 [2] sc.textFile():
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456


 On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:

 Hello,

 A quick question about using spark to parse text-format CSV files stored
 on hdfs.

 I have something very simple:
 sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
 ("XXX", p(0), p(2)))

 Here, I want to replace XXX with a string, which is the current csv
 filename for the line. This is needed since some information may be encoded
 in the file name, like date.

 In hive, I am able to define an external table and use INPUT__FILE__NAME
 as a column in queries. I wonder if spark has something similar.

 Thanks!
 -Simon





Re: access hdfs file name in map()

2014-06-03 Thread Xu (Simon) Chen
I don't quite get it..

mapPartitionsWithIndex takes a function that maps an integer index and an
iterator to another iterator. How does that help with retrieving the hdfs
file name?

I am obviously missing some context..

Thanks.
 On May 30, 2014 1:28 AM, Aaron Davidson ilike...@gmail.com wrote:

 Currently there is not a way to do this using textFile(). However, you
 could pretty straightforwardly define your own subclass of HadoopRDD [1] in
 order to get access to this information (likely using
 mapPartitionsWithIndex to look up the InputSplit for a particular
 partition).

 Note that sc.textFile() is just a convenience function to construct a new
 HadoopRDD [2].

 [1] HadoopRDD:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93
 [2] sc.textFile():
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456


 On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:

 Hello,

 A quick question about using spark to parse text-format CSV files stored
 on hdfs.

 I have something very simple:
 sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
 ("XXX", p(0), p(2)))

 Here, I want to replace XXX with a string, which is the current csv
 filename for the line. This is needed since some information may be encoded
 in the file name, like date.

 In hive, I am able to define an external table and use INPUT__FILE__NAME
 as a column in queries. I wonder if spark has something similar.

 Thanks!
 -Simon





Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
OK, rebuilding the assembly jar file with cdh5 works now...
Thanks..

-Simon


On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com wrote:

 That helped a bit... Now I have a different failure: the start up process
 is stuck in an infinite loop outputting the following message:

 14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application
 report from ASM:
  appMasterRpcPort: -1
  appStartTime: 1401672868277
  yarnAppState: ACCEPTED

 I am using the hadoop 2 prebuilt package. Probably it doesn't have the
 latest yarn client.

 -Simon




 On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 As a debugging step, does it work if you use a single resource manager
 with the key yarn.resourcemanager.address instead of using two named
 resource managers? I wonder if somehow the YARN client can't detect
 this multi-master set-up.

 On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
  Note that everything works fine in spark 0.9, which is packaged in
 CDH5: I
  can launch a spark-shell and interact with workers spawned on my yarn
  cluster.
 
  So in my /opt/hadoop/conf/yarn-site.xml, I have:
  ...
   <property>
   <name>yarn.resourcemanager.address.rm1</name>
   <value>controller-1.mycomp.com:23140</value>
   </property>
   ...
   <property>
   <name>yarn.resourcemanager.address.rm2</name>
   <value>controller-2.mycomp.com:23140</value>
   </property>
  ...
 
  And the other usual stuff.
 
  So spark 1.0 is launched like this:
  Spark Command: java -cp
 
 ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
  -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
  org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
 --class
  org.apache.spark.repl.Main
 
  I do see /opt/hadoop/conf included, but not sure it's the right place.
 
  Thanks..
  -Simon
 
 
 
  On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  I would agree with your guess, it looks like the yarn library isn't
  correctly finding your yarn-site.xml file. If you look in
  yarn-site.xml do you definitely see the resource manager
  address/addresses?
 
  Also, you can try running this command with
  SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
  set-up correctly.
 
  - Patrick
 
  On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
  wrote:
   Hi all,
  
   I tried a couple ways, but couldn't get it to work..
  
   The following seems to be what the online document
   (http://spark.apache.org/docs/latest/running-on-yarn.html) is
   suggesting:
  
  
 SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
   YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
  
   Help info of spark-shell seems to be suggesting --master yarn
   --deploy-mode
   cluster.
  
   But either way, I am seeing the following messages:
   14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager
 at
   /0.0.0.0:8032
   14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
   0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
   RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
 SECONDS)
   14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
   0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
   RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
 SECONDS)
  
   My guess is that spark-shell is trying to talk to resource manager to
   setup
   spark master/worker nodes - I am not sure where 0.0.0.0:8032 came
 from
   though. I am running CDH5 with two resource managers in HA mode.
 Their
   IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
   HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
  
   Any ideas? Thanks.
   -Simon
 
 





pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Hi folks,

I have a weird problem when using pyspark with yarn. I started ipython as
follows:

IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors
4 --executor-memory 4G

When I create a notebook, I can see workers being created and indeed I see
spark UI running on my client machine on port 4040.

I have the following simple script:

import pyspark
data = sc.textFile("hdfs://test/tmp/data/*").cache()
oneday = data.map(lambda line: line.split(",")).\
  map(lambda f: (f[0], float(f[1]))).\
  filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
  map(lambda t: (parser.parse(t[0]), t[1]))
oneday.take(1)


By executing this, I see that it is my client machine (where ipython is
launched) that is reading all the data from HDFS and producing the result of
take(1), rather than my worker nodes...

When I do data.count(), things would blow up altogether. But I do see in
the error message something like this:


Error from python worker:
  /usr/bin/python: No module named pyspark




Am I supposed to install pyspark on every worker node?


Thanks.

-Simon


Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
1) yes, that sc.parallelize(range(10)).count() has the same error.

2) the files seem to be correct

3) I have trouble at this step, ImportError: No module named pyspark,
but I seem to have the files in the jar file:

$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark

$ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
pyspark/
pyspark/rddsampler.py
pyspark/broadcast.py
pyspark/serializers.py
pyspark/java_gateway.py
pyspark/resultiterable.py
pyspark/accumulators.py
pyspark/sql.py
pyspark/__init__.py
pyspark/daemon.py
pyspark/context.py
pyspark/cloudpickle.py
pyspark/join.py
pyspark/tests.py
pyspark/files.py
pyspark/conf.py
pyspark/rdd.py
pyspark/storagelevel.py
pyspark/statcounter.py
pyspark/shell.py
pyspark/worker.py


4) All my nodes should be running java 7, so probably this is not related.
5) I'll do it in a bit.

Any ideas on 3)?

Thanks.
-Simon



On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote:

 Hi Simon,

 You shouldn't have to install pyspark on every worker node. In YARN mode,
 pyspark is packaged into your assembly jar and shipped to your executors
 automatically. This seems like a more general problem. There are a few
 things to try:

 1) Run a simple pyspark shell with yarn-client, and do
 sc.parallelize(range(10)).count() to see if you get the same error
 2) If so, check if your assembly jar is compiled correctly. Run

 $ jar -tf path/to/assembly/jar pyspark
 $ jar -tf path/to/assembly/jar py4j

 to see if the files are there. For Py4j, you need both the python files
 and the Java class files.

 3) If the files are there, try running a simple python shell (not pyspark
 shell) with the assembly jar on the PYTHONPATH:

 $ PYTHONPATH=/path/to/assembly/jar python
 >>> import pyspark

 4) If that works, try it on every worker node. If it doesn't work, there
 is probably something wrong with your jar.

 There is a known issue for PySpark on YARN - jars built with Java 7 cannot
 be properly opened by Java 6. I would either verify that the JAVA_HOME set
 on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV),
 or simply build your jar with Java 6:

 $ cd /path/to/spark/home
 $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
 2.3.0-cdh5.0.0

 5) You can check out
 http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
 which has more detailed information about how to debug running an
 application on YARN in general. In my experience, the steps outlined there
 are quite useful.

 Let me know if you get it working (or not).

 Cheers,
 Andrew



 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com:

 Hi folks,

 I have a weird problem when using pyspark with yarn. I started ipython as
 follows:

 IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
 --num-executors 4 --executor-memory 4G

 When I create a notebook, I can see workers being created and indeed I
 see spark UI running on my client machine on port 4040.

 I have the following simple script:
 
 import pyspark
 data = sc.textFile("hdfs://test/tmp/data/*").cache()
 oneday = data.map(lambda line: line.split(",")).\
   map(lambda f: (f[0], float(f[1]))).\
   filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
   map(lambda t: (parser.parse(t[0]), t[1]))
 oneday.take(1)
 

 By executing this, I see that it is my client machine (where ipython is
 launched) is reading all the data from HDFS, and produce the result of
 take(1), rather than my worker nodes...

 When I do data.count(), things would blow up altogether. But I do see
 in the error message something like this:
 

 Error from python worker:
   /usr/bin/python: No module named pyspark

 


 Am I supposed to install pyspark on every worker node?


 Thanks.

 -Simon





Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
So, I did specify SPARK_JAR in my pyspark program. I also checked the workers;
it seems that the jar file is distributed and included in the classpath
correctly.

I think the problem is likely at step 3..

I build my jar file with maven, like this:
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean
package

Anything that I might have missed?

Thanks.
-Simon


On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen xche...@gmail.com wrote:

 1) yes, that sc.parallelize(range(10)).count() has the same error.

 2) the files seem to be correct

 3) I have trouble at this step, ImportError: No module named pyspark
 but I seem to have files in the jar file:
 
 $ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
  import pyspark
 Traceback (most recent call last):
   File stdin, line 1, in module
 ImportError: No module named pyspark

 $ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
 pyspark/
 pyspark/rddsampler.py
 pyspark/broadcast.py
 pyspark/serializers.py
 pyspark/java_gateway.py
 pyspark/resultiterable.py
 pyspark/accumulators.py
 pyspark/sql.py
 pyspark/__init__.py
 pyspark/daemon.py
 pyspark/context.py
 pyspark/cloudpickle.py
 pyspark/join.py
 pyspark/tests.py
 pyspark/files.py
 pyspark/conf.py
 pyspark/rdd.py
 pyspark/storagelevel.py
 pyspark/statcounter.py
 pyspark/shell.py
 pyspark/worker.py
 

 4) All my nodes should be running java 7, so probably this is not related.
 5) I'll do it in a bit.

 Any ideas on 3)?

 Thanks.
 -Simon



 On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote:

 Hi Simon,

 You shouldn't have to install pyspark on every worker node. In YARN mode,
 pyspark is packaged into your assembly jar and shipped to your executors
 automatically. This seems like a more general problem. There are a few
 things to try:

 1) Run a simple pyspark shell with yarn-client, and do
 sc.parallelize(range(10)).count() to see if you get the same error
 2) If so, check if your assembly jar is compiled correctly. Run

 $ jar -tf path/to/assembly/jar pyspark
 $ jar -tf path/to/assembly/jar py4j

 to see if the files are there. For Py4j, you need both the python files
 and the Java class files.

 3) If the files are there, try running a simple python shell (not pyspark
 shell) with the assembly jar on the PYTHONPATH:

 $ PYTHONPATH=/path/to/assembly/jar python
  import pyspark

 4) If that works, try it on every worker node. If it doesn't work, there
 is probably something wrong with your jar.

 There is a known issue for PySpark on YARN - jars built with Java 7
 cannot be properly opened by Java 6. I would either verify that the
 JAVA_HOME set on all of your workers points to Java 7 (by setting
 SPARK_YARN_USER_ENV), or simply build your jar with Java 6:

 $ cd /path/to/spark/home
 $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
 2.3.0-cdh5.0.0

 5) You can check out
 http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
 which has more detailed information about how to debug running an
 application on YARN in general. In my experience, the steps outlined there
 are quite useful.

 Let me know if you get it working (or not).

 Cheers,
 Andrew



 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com:

 Hi folks,

 I have a weird problem when using pyspark with yarn. I started ipython
 as follows:

 IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
 --num-executors 4 --executor-memory 4G

 When I create a notebook, I can see workers being created and indeed I
 see spark UI running on my client machine on port 4040.

 I have the following simple script:
 
 import pyspark
 data = sc.textFile("hdfs://test/tmp/data/*").cache()
 oneday = data.map(lambda line: line.split(",")).\
   map(lambda f: (f[0], float(f[1]))).\
   filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
   map(lambda t: (parser.parse(t[0]), t[1]))
 oneday.take(1)
 

 By executing this, I see that it is my client machine (where ipython is
 launched) is reading all the data from HDFS, and produce the result of
 take(1), rather than my worker nodes...

 When I do data.count(), things would blow up altogether. But I do see
 in the error message something like this:
 

 Error from python worker:
   /usr/bin/python: No module named pyspark

 


 Am I supposed to install pyspark on every worker node?


 Thanks.

 -Simon






Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
I asked several people; no one seems to believe that we can do this:
$ PYTHONPATH=/path/to/assembly/jar python
>>> import pyspark

This following pull request did mention something about generating a zip
file for all python related modules:
https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html

I've tested that zipped modules can at least be imported via zipimport.

Any ideas?

-Simon



On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote:

 Hi Simon,

 You shouldn't have to install pyspark on every worker node. In YARN mode,
 pyspark is packaged into your assembly jar and shipped to your executors
 automatically. This seems like a more general problem. There are a few
 things to try:

 1) Run a simple pyspark shell with yarn-client, and do
 sc.parallelize(range(10)).count() to see if you get the same error
 2) If so, check if your assembly jar is compiled correctly. Run

 $ jar -tf path/to/assembly/jar pyspark
 $ jar -tf path/to/assembly/jar py4j

 to see if the files are there. For Py4j, you need both the python files
 and the Java class files.

 3) If the files are there, try running a simple python shell (not pyspark
 shell) with the assembly jar on the PYTHONPATH:

 $ PYTHONPATH=/path/to/assembly/jar python
  import pyspark

 4) If that works, try it on every worker node. If it doesn't work, there
 is probably something wrong with your jar.

 There is a known issue for PySpark on YARN - jars built with Java 7 cannot
 be properly opened by Java 6. I would either verify that the JAVA_HOME set
 on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV),
 or simply build your jar with Java 6:

 $ cd /path/to/spark/home
 $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
 2.3.0-cdh5.0.0

 5) You can check out
 http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
 which has more detailed information about how to debug running an
 application on YARN in general. In my experience, the steps outlined there
 are quite useful.

 Let me know if you get it working (or not).

 Cheers,
 Andrew



 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com:

 Hi folks,

 I have a weird problem when using pyspark with yarn. I started ipython as
 follows:

 IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
 --num-executors 4 --executor-memory 4G

 When I create a notebook, I can see workers being created and indeed I
 see spark UI running on my client machine on port 4040.

 I have the following simple script:
 
 import pyspark
 data = sc.textFile("hdfs://test/tmp/data/*").cache()
 oneday = data.map(lambda line: line.split(",")).\
   map(lambda f: (f[0], float(f[1]))).\
   filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
   map(lambda t: (parser.parse(t[0]), t[1]))
 oneday.take(1)
 

 By executing this, I see that it is my client machine (where ipython is
 launched) is reading all the data from HDFS, and produce the result of
 take(1), rather than my worker nodes...

 When I do data.count(), things would blow up altogether. But I do see
 in the error message something like this:
 

 Error from python worker:
   /usr/bin/python: No module named pyspark

 


 Am I supposed to install pyspark on every worker node?


 Thanks.

 -Simon





Re: spark 1.0.0 on yarn

2014-06-02 Thread Xu (Simon) Chen
I built my new package like this:
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean
package

Spark-shell is working now, but pyspark is still broken. I reported the
problem on a different thread. Please take a look if you can... Desperately
need ideas..

Thanks.
-Simon


On Mon, Jun 2, 2014 at 2:47 PM, Patrick Wendell pwend...@gmail.com wrote:

 Okay I'm guessing that our upstreaming Hadoop2 package isn't new
 enough to work with CDH5. We should probably clarify this in our
 downloads. Thanks for reporting this. What was the exact string you
 used when building? Also which CDH-5 version are you building against?

 On Mon, Jun 2, 2014 at 8:11 AM, Xu (Simon) Chen xche...@gmail.com wrote:
  OK, rebuilding the assembly jar file with cdh5 works now...
  Thanks..
 
  -Simon
 
 
  On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
 
  That helped a bit... Now I have a different failure: the start up
 process
  is stuck in an infinite loop outputting the following message:
 
  14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application
  report from ASM:
  appMasterRpcPort: -1
  appStartTime: 1401672868277
  yarnAppState: ACCEPTED
 
  I am using the hadoop 2 prebuilt package. Probably it doesn't have the
  latest yarn client.
 
  -Simon
 
 
 
 
  On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  As a debugging step, does it work if you use a single resource manager
  with the key yarn.resourcemanager.address instead of using two named
  resource managers? I wonder if somehow the YARN client can't detect
  this multi-master set-up.
 
  On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com
  wrote:
   Note that everything works fine in spark 0.9, which is packaged in
   CDH5: I
   can launch a spark-shell and interact with workers spawned on my yarn
   cluster.
  
   So in my /opt/hadoop/conf/yarn-site.xml, I have:
   ...
    <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>controller-1.mycomp.com:23140</value>
    </property>
    ...
    <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>controller-2.mycomp.com:23140</value>
    </property>
   ...
  
   And the other usual stuff.
  
   So spark 1.0 is launched like this:
   Spark Command: java -cp
  
  
 ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
   -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
   org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
   --class
   org.apache.spark.repl.Main
  
   I do see /opt/hadoop/conf included, but not sure it's the right
   place.
  
   Thanks..
   -Simon
  
  
  
   On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com
   wrote:
  
   I would agree with your guess, it looks like the yarn library isn't
   correctly finding your yarn-site.xml file. If you look in
   yarn-site.xml do you definitely see the resource manager
   address/addresses?
  
   Also, you can try running this command with
   SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
   set-up correctly.
  
   - Patrick
  
   On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
 
   wrote:
Hi all,
   
I tried a couple ways, but couldn't get it to work..
   
The following seems to be what the online document
(http://spark.apache.org/docs/latest/running-on-yarn.html) is
suggesting:
   
   
   
 SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
   
Help info of spark-shell seems to be suggesting --master yarn
--deploy-mode
cluster.
   
But either way, I am seeing the following messages:
14/06/01 00:33:20 INFO client.RMProxy: Connecting to
 ResourceManager
at
/0.0.0.0:8032
14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
SECONDS)
14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
SECONDS)
   
My guess is that spark-shell is trying to talk to resource manager
to
setup
spark master/worker nodes - I am not sure where 0.0.0.0:8032 came
from
though. I am running CDH5 with two resource managers in HA mode.
Their
IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
   
Any ideas

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
OK, my colleague found this:
https://mail.python.org/pipermail/python-list/2014-May/671353.html

And my jar file has 70011 files. Fantastic..




On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen xche...@gmail.com wrote:

 I asked several people, no one seems to believe that we can do this:
 $ PYTHONPATH=/path/to/assembly/jar python
  import pyspark

 This following pull request did mention something about generating a zip
 file for all python related modules:
 https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html

 I've tested that zipped modules can as least be imported via zipimport.

 Any ideas?

 -Simon



 On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote:

 Hi Simon,

 You shouldn't have to install pyspark on every worker node. In YARN mode,
 pyspark is packaged into your assembly jar and shipped to your executors
 automatically. This seems like a more general problem. There are a few
 things to try:

 1) Run a simple pyspark shell with yarn-client, and do
 sc.parallelize(range(10)).count() to see if you get the same error
 2) If so, check if your assembly jar is compiled correctly. Run

 $ jar -tf path/to/assembly/jar pyspark
 $ jar -tf path/to/assembly/jar py4j

 to see if the files are there. For Py4j, you need both the python files
 and the Java class files.

 3) If the files are there, try running a simple python shell (not pyspark
 shell) with the assembly jar on the PYTHONPATH:

 $ PYTHONPATH=/path/to/assembly/jar python
  import pyspark

 4) If that works, try it on every worker node. If it doesn't work, there
 is probably something wrong with your jar.

 There is a known issue for PySpark on YARN - jars built with Java 7
 cannot be properly opened by Java 6. I would either verify that the
 JAVA_HOME set on all of your workers points to Java 7 (by setting
 SPARK_YARN_USER_ENV), or simply build your jar with Java 6:

 $ cd /path/to/spark/home
 $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
 2.3.0-cdh5.0.0

 5) You can check out
 http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
 which has more detailed information about how to debug running an
 application on YARN in general. In my experience, the steps outlined there
 are quite useful.

 Let me know if you get it working (or not).

 Cheers,
 Andrew



 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com:

 Hi folks,

 I have a weird problem when using pyspark with yarn. I started ipython
 as follows:

 IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
 --num-executors 4 --executor-memory 4G

 When I create a notebook, I can see workers being created and indeed I
 see spark UI running on my client machine on port 4040.

 I have the following simple script:
 
 import pyspark
 data = sc.textFile("hdfs://test/tmp/data/*").cache()
 oneday = data.map(lambda line: line.split(",")).\
   map(lambda f: (f[0], float(f[1]))).\
   filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
   map(lambda t: (parser.parse(t[0]), t[1]))
 oneday.take(1)
 

 By executing this, I see that it is my client machine (where ipython is
 launched) is reading all the data from HDFS, and produce the result of
 take(1), rather than my worker nodes...

 When I do data.count(), things would blow up altogether. But I do see
 in the error message something like this:
 

 Error from python worker:
   /usr/bin/python: No module named pyspark

 


 Am I supposed to install pyspark on every worker node?


 Thanks.

 -Simon






Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Xu (Simon) Chen
Nope... didn't try java 6. The standard installation guide didn't say
anything about java 7 and suggested to do -DskipTests for the build..
http://spark.apache.org/docs/latest/building-with-maven.html

So, I didn't see the warning message...


On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell pwend...@gmail.com wrote:

 Are you building Spark with Java 6 or Java 7. Java 6 uses the extended
 Zip format and Java 7 uses Zip64. I think we've tried to add some
 build warnings if Java 7 is used, for this reason:

 https://github.com/apache/spark/blob/master/make-distribution.sh#L102

 Any luck if you use JDK 6 to compile?


 On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
  OK, my colleague found this:
  https://mail.python.org/pipermail/python-list/2014-May/671353.html
 
  And my jar file has 70011 files. Fantastic..
 
 
 
 
  On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
 
  I asked several people, no one seems to believe that we can do this:
  $ PYTHONPATH=/path/to/assembly/jar python
   import pyspark
 
  This following pull request did mention something about generating a zip
  file for all python related modules:
  https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
 
  I've tested that zipped modules can as least be imported via zipimport.
 
  Any ideas?
 
  -Simon
 
 
 
  On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com
 wrote:
 
  Hi Simon,
 
  You shouldn't have to install pyspark on every worker node. In YARN
 mode,
  pyspark is packaged into your assembly jar and shipped to your
 executors
  automatically. This seems like a more general problem. There are a few
  things to try:
 
  1) Run a simple pyspark shell with yarn-client, and do
  sc.parallelize(range(10)).count() to see if you get the same error
  2) If so, check if your assembly jar is compiled correctly. Run
 
  $ jar -tf path/to/assembly/jar pyspark
  $ jar -tf path/to/assembly/jar py4j
 
  to see if the files are there. For Py4j, you need both the python files
  and the Java class files.
 
  3) If the files are there, try running a simple python shell (not
 pyspark
  shell) with the assembly jar on the PYTHONPATH:
 
  $ PYTHONPATH=/path/to/assembly/jar python
   import pyspark
 
  4) If that works, try it on every worker node. If it doesn't work,
 there
  is probably something wrong with your jar.
 
  There is a known issue for PySpark on YARN - jars built with Java 7
  cannot be properly opened by Java 6. I would either verify that the
  JAVA_HOME set on all of your workers points to Java 7 (by setting
  SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
 
  $ cd /path/to/spark/home
  $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
  2.3.0-cdh5.0.0
 
  5) You can check out
 
 http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application
 ,
  which has more detailed information about how to debug running an
  application on YARN in general. In my experience, the steps outlined
 there
  are quite useful.
 
  Let me know if you get it working (or not).
 
  Cheers,
  Andrew
 
 
 
  2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com:
 
  Hi folks,
 
  I have a weird problem when using pyspark with yarn. I started ipython
  as follows:
 
  IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
  --num-executors 4 --executor-memory 4G
 
  When I create a notebook, I can see workers being created and indeed I
  see spark UI running on my client machine on port 4040.
 
  I have the following simple script:
  
  import pyspark
  data = sc.textFile("hdfs://test/tmp/data/*").cache()
  oneday = data.map(lambda line: line.split(",")).\
    map(lambda f: (f[0], float(f[1]))).\
    filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
    map(lambda t: (parser.parse(t[0]), t[1]))
  oneday.take(1)
  
 
  By executing this, I see that it is my client machine (where ipython
 is
  launched) is reading all the data from HDFS, and produce the result of
  take(1), rather than my worker nodes...
 
  When I do data.count(), things would blow up altogether. But I do
 see
  in the error message something like this:
  
 
  Error from python worker:
/usr/bin/python: No module named pyspark
 
  
 
 
  Am I supposed to install pyspark on every worker node?
 
 
  Thanks.
 
  -Simon
 
 
 
 



Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
Note that everything works fine in spark 0.9, which is packaged in CDH5: I
can launch a spark-shell and interact with workers spawned on my yarn
cluster.

So in my /opt/hadoop/conf/yarn-site.xml, I have:
...
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>controller-1.mycomp.com:23140</value>
</property>
...
<property>
<name>yarn.resourcemanager.address.rm2</name>
<value>controller-2.mycomp.com:23140</value>
</property>
...

And the other usual stuff.

So spark 1.0 is launched like this:
Spark Command: java -cp
::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
-XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
--class org.apache.spark.repl.Main

I do see /opt/hadoop/conf included, but not sure it's the right place.

Thanks..
-Simon



On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 I would agree with your guess, it looks like the yarn library isn't
 correctly finding your yarn-site.xml file. If you look in
 yarn-site.xml do you definitely see the resource manager
 address/addresses?

 Also, you can try running this command with
 SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
 set-up correctly.

 - Patrick

 On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
  Hi all,
 
  I tried a couple ways, but couldn't get it to work..
 
  The following seems to be what the online document
  (http://spark.apache.org/docs/latest/running-on-yarn.html) is
 suggesting:
 
 SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
  YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
 
  Help info of spark-shell seems to be suggesting --master yarn
 --deploy-mode
  cluster.
 
  But either way, I am seeing the following messages:
  14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at
  /0.0.0.0:8032
  14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
  0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
  RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
  14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
  0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
  RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
 
  My guess is that spark-shell is trying to talk to resource manager to
 setup
  spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from
  though. I am running CDH5 with two resource managers in HA mode. Their
  IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
  HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
 
  Any ideas? Thanks.
  -Simon



Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
That helped a bit... Now I have a different failure: the start up process
is stuck in an infinite loop outputting the following message:

14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
 appMasterRpcPort: -1
 appStartTime: 1401672868277
 yarnAppState: ACCEPTED

I am using the hadoop 2 prebuilt package. Probably it doesn't have the
latest yarn client.

-Simon




On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com wrote:

 As a debugging step, does it work if you use a single resource manager
 with the key yarn.resourcemanager.address instead of using two named
 resource managers? I wonder if somehow the YARN client can't detect
 this multi-master set-up.

 On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen xche...@gmail.com
 wrote:
  Note that everything works fine in spark 0.9, which is packaged in CDH5:
 I
  can launch a spark-shell and interact with workers spawned on my yarn
  cluster.
 
  So in my /opt/hadoop/conf/yarn-site.xml, I have:
  ...
  <property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>controller-1.mycomp.com:23140</value>
  </property>
  ...
  <property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>controller-2.mycomp.com:23140</value>
  </property>
  ...
 
  And the other usual stuff.
 
  So spark 1.0 is launched like this:
  Spark Command: java -cp
 
 ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
  -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
  org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
 --class
  org.apache.spark.repl.Main
 
  I do see /opt/hadoop/conf included, but not sure it's the right place.
 
  Thanks..
  -Simon
 
 
 
  On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  I would agree with your guess, it looks like the yarn library isn't
  correctly finding your yarn-site.xml file. If you look in
   yarn-site.xml do you definitely see the resource manager
  address/addresses?
 
  Also, you can try running this command with
  SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
  set-up correctly.
 
  - Patrick
 
  On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen xche...@gmail.com
  wrote:
   Hi all,
  
   I tried a couple ways, but couldn't get it to work..
  
   The following seems to be what the online document
   (http://spark.apache.org/docs/latest/running-on-yarn.html) is
   suggesting:
  
  
 SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
   YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
  
   Help info of spark-shell seems to be suggesting --master yarn
   --deploy-mode
   cluster.
  
   But either way, I am seeing the following messages:
   14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager
 at
   /0.0.0.0:8032
   14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
   0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
   RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
 SECONDS)
   14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
   0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
   RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
 SECONDS)
  
   My guess is that spark-shell is trying to talk to resource manager to
   setup
   spark master/worker nodes - I am not sure where 0.0.0.0:8032 came
 from
   though. I am running CDH5 with two resource managers in HA mode. Their
   IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
   HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
  
   Any ideas? Thanks.
   -Simon
 
 



spark 1.0.0 on yarn

2014-05-31 Thread Xu (Simon) Chen
Hi all,

I tried a couple ways, but couldn't get it to work..

The following seems to be what the online document (
http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting:
SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client

Help info of spark-shell seems to be suggesting --master yarn
--deploy-mode cluster.

But either way, I am seeing the following messages:
14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at /
0.0.0.0:8032
14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

My guess is that spark-shell is trying to talk to resource manager to setup
spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from
though. I am running CDH5 with two resource managers in HA mode. Their
IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.

Any ideas? Thanks.
-Simon


access hdfs file name in map()

2014-05-29 Thread Xu (Simon) Chen
Hello,

A quick question about using spark to parse text-format CSV files stored on
hdfs.

I have something very simple:
sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
("XXX", p(0), p(2)))

Here, I want to replace XXX with a string, which is the current csv
filename for the line. This is needed since some information may be encoded
in the file name, like date.

In hive, I am able to define an external table and use INPUT__FILE__NAME as
a column in queries. I wonder if spark has something similar.

Thanks!
-Simon