On Fri, Aug 8, 2014 at 9:12 AM, Baoqiang Cao <bqcaom...@gmail.com> wrote:
> Hi There
>
> I ran into a problem and can’t find a solution.
>
> I was running bin/pyspark < ../python/wordcount.py
you could use bin/spark-submit ../python/wordcount.py

> The wordcount.py is here:
>
> ========================================
> import sys
> from operator import add
>
> from pyspark import SparkContext
>
> datafile = '/mnt/data/m1.txt'
>
> sc = SparkContext()
> outfile = datafile + '.freq'
> lines = sc.textFile(datafile, 1)
> counts = lines.flatMap(lambda x: x.split(' ')) \
>               .map(lambda x: (x, 1)) \
>               .reduceByKey(add)
> output = counts.collect()
>
> outf = open(outfile, 'w')
>
> for (word, count) in output:
>     outf.write(word.encode('utf-8') + '\t' + str(count) + '\n')
> outf.close()
> ========================================
>
> The error message is here:
>
> 14/08/08 16:01:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 0)
> java.io.FileNotFoundException:
> /tmp/spark-local-20140808160150-d36b/12/shuffle_0_0_468 (Too many open files)

This message means that the Spark JVM has reached the maximum number of
open files; there is an fd leak somewhere. Unfortunately, I cannot
reproduce this problem. What version of Spark are you running?

>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>         at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
>         at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
>         at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
>         at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
>
> The m1.txt is about 4G, and I have >120GB RAM and used -Xmx120GB. It is on
> Ubuntu. Any help please?
>
> Best
> Baoqiang Cao
> Blog: http://baoqiang.org
> Email: bqcaom...@gmail.com
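In the meantime, two things worth checking, as a sketch only (this assumes
Spark 1.x with the hash shuffle and a typical Ubuntu soft limit of 1024
open files; it is not a confirmed diagnosis of your setup): the
per-process file-descriptor limit, and the spark.shuffle.consolidateFiles
setting, which makes the hash shuffle open far fewer files per executor.
From Python, before creating the SparkContext:

========================================
# Sketch only: inspect this process's open-file limit and enable
# shuffle-file consolidation. spark.shuffle.consolidateFiles is a real
# Spark 1.x setting; whether it avoids this particular leak is an
# assumption, not a confirmed fix.
import resource

from pyspark import SparkConf, SparkContext

# The hash shuffle writes one file per (map task, reducer) pair, which
# can exhaust a 1024-descriptor soft limit quickly on a large input.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open files: soft limit=%d, hard limit=%d' % (soft, hard))

conf = (SparkConf()
        .setAppName('wordcount')
        .set('spark.shuffle.consolidateFiles', 'true'))
sc = SparkContext(conf=conf)
========================================

You can also raise the limit shell-wide with "ulimit -n" (or persistently
in /etc/security/limits.conf) before launching spark-submit, which is the
usual first remedy for "Too many open files" even when there is no leak.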