Re: rdd.saveAsTextFile blows up
I ported the same code to Scala. No problems. But in PySpark, this fails
consistently:

    ctx = SQLContext(sc)
    pf = ctx.parquetFile("...")
    rdd = pf.map(lambda x: x)
    crdd = ctx.inferSchema(rdd)
    crdd.saveAsParquetFile("...")

If I do

    rdd = sc.parallelize(["hello", "world"])
    rdd.saveAsTextFile(...)

it works. Ideas?

> On Jul 24, 2014, at 11:05 PM, Akhil Das wrote:
>
> Most likely you are closing the connection with HDFS. Can you paste the
> piece of code that you are executing?
>
> We were having a similar problem when we closed the FileSystem object in
> our code.
>
> Thanks
> Best Regards
>
>> On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman wrote:
>> I'm trying to run a simple pipeline using PySpark, version 1.0.1.
>>
>> I've created an RDD over a parquetFile and am mapping the contents with
>> a transformer function and now wish to write the data out to HDFS.
>>
>> All of the executors fail with the same stack trace (below).
>>
>> I do get a directory on HDFS, but it's empty except for a file named
>> _temporary.
>>
>> Any ideas?
>>
>> [stack trace snipped; quoted in full in the original message below]
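One isolation test worth trying (paths elided as above; this assumes the
1.0.x pyspark shell, where sc is already defined): write the SchemaRDD back
out directly, skipping the Python map. If this succeeds while the map()
version fails, the trigger is the round-trip through the Python worker
(PythonRDD), not Parquet I/O itself.

    # Sketch of an isolation test, not a fix: no Python map() between the
    # read and the write, so rows never leave the JVM.
    from pyspark.sql import SQLContext

    ctx = SQLContext(sc)         # sc: the shell's existing SparkContext
    pf = ctx.parquetFile("...")  # same input path as the failing job
    pf.saveAsParquetFile("...")  # write directly to a new output path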
Re: rdd.saveAsTextFile blows up
Most likely you are closing the connection with HDFS. Can you paste the
piece of code that you are executing?

We were having a similar problem when we closed the FileSystem object in
our code.

Thanks
Best Regards

On Thu, Jul 24, 2014 at 11:00 PM, Eric Friedman wrote:
> I'm trying to run a simple pipeline using PySpark, version 1.0.1.
>
> I've created an RDD over a parquetFile and am mapping the contents with
> a transformer function and now wish to write the data out to HDFS.
>
> All of the executors fail with the same stack trace (below).
>
> I do get a directory on HDFS, but it's empty except for a file named
> _temporary.
>
> Any ideas?
>
> [stack trace snipped; quoted in full in the original message below]
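To make the failure mode concrete: Hadoop's FileSystem.get() returns a
shared, cached instance per (scheme, authority, user), so calling close()
on "your" handle closes everyone's, and any later read or write through
that client dies with "Filesystem closed". A sketch of the anti-pattern,
shown from PySpark via the private _jvm/_jsc gateways purely for
illustration (the same applies to JVM code; /some/path is a placeholder):

    # Anti-pattern sketch: closing the cached, shared FileSystem instance.
    # sc._jvm and sc._jsc are private PySpark internals.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

    fs.exists(Path("/some/path"))  # fine: uses the shared instance
    fs.close()                     # do NOT do this: every other user of
                                   # the cached instance now sees it closed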
rdd.saveAsTextFile blows up
I'm trying to run a simple pipeline using PySpark, version 1.0.1.

I've created an RDD over a parquetFile and am mapping the contents with a
transformer function and now wish to write the data out to HDFS.

All of the executors fail with the same stack trace (below).

I do get a directory on HDFS, but it's empty except for a file named
_temporary.

Any ideas?

java.io.IOException (java.io.IOException: Filesystem closed)
org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:735)
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readFully(DataInputStream.java:169)
parquet.hadoop.ParquetFileReader$Chunk.<init>(ParquetFileReader.java:369)
parquet.hadoop.ParquetFileReader$Chunk.<init>(ParquetFileReader.java:362)
parquet.hadoop.ParquetFileReader.readColumnChunkPages(ParquetFileReader.java:411)
parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:349)
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:293)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
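A mitigation that is sometimes suggested for "Filesystem closed" errors is
to disable Hadoop's FileSystem cache, so that a close() in one component
cannot invalidate the handle other readers share. A sketch, not verified
against this exact setup (Spark copies spark.hadoop.* settings into the
Hadoop Configuration it builds; the trade-off is a fresh connection per
FileSystem.get()):

    # Workaround sketch: give every caller of FileSystem.get() its own
    # HDFS client instead of the shared cached one.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("parquet-roundtrip")  # hypothetical app name
            .set("spark.hadoop.fs.hdfs.impl.disable.cache", "true"))
    sc = SparkContext(conf=conf)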