One thing to note: it would be good to add an explicit file system scheme to the output path (i.e. "file:///var/..." instead of "/var/..."), especially when you do have HDFS running, because in that case the data might be written to HDFS rather than your local file system if Spark found Hadoop configuration files when starting the application.
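
For example, a minimal sketch of the difference (assuming a DataFrame named df; the path is only illustrative):

// With an explicit scheme the data always lands on the local file system,
// regardless of any Hadoop configuration Spark picked up:
df.write.parquet("file:///var/data/example/pq")

// Without a scheme the path is resolved against fs.defaultFS,
// which may point to HDFS:
// df.write.parquet("/var/data/example/pq")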

Cheng

On 8/11/15 11:12 PM, saif.a.ell...@wellsfargo.com wrote:

I confirm that it works,

I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450

Saif

*From:* Ellafi, Saif A.
*Sent:* Tuesday, August 11, 2015 12:01 PM
*To:* Ellafi, Saif A.; deanwamp...@gmail.com
*Cc:* user@spark.apache.org
*Subject:* RE: Parquet without hadoop: Possible?

Sorry, I provided bad information. This example worked fine with reduced parallelism.

It seems my problem has to do with something specific to the real data frame at the reading point.
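
(For illustration only, a "reduced parallelism" write in this setup might look like the following; the partition count, heap size, and path are placeholders rather than the values actually used:)

// Fewer output partitions means fewer concurrent Parquet writers,
// each of which allocates its own column buffers:
data.coalesce(4).write.parquet("file:///var/data/example/pq")

// Alternatively, launch the shell with a larger driver heap:
// spark-1.4.1-bin-hadoop2.6/bin/spark-shell --driver-memory 4g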

Saif

*From:* saif.a.ell...@wellsfargo.com
*Sent:* Tuesday, August 11, 2015 11:49 AM
*To:* deanwamp...@gmail.com
*Cc:* user@spark.apache.org
*Subject:* RE: Parquet without hadoop: Possible?

I am launching my spark-shell:

spark-1.4.1-bin-hadoop2.6/bin/spark-shell

15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support)..

SQL context available as sqlContext.

scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF

scala> data.write.parquet("/var/data/Saif/pq")

Then I get a million errors:

15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
    at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
    at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
    at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>(ColumnChunkPageWriteStore.java:68)
    at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>(ColumnChunkPageWriteStore.java:48)
    at parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
    at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
    at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
    at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
    at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
    at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
    at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
    at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
    at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
    at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
    at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
...
...
15/08/11 09:46:10 ERROR DefaultWriterContainer: Task attempt attempt_201508110946_0000_m_000011_0 aborted.
15/08/11 09:46:10 ERROR Executor: Exception in task 31.0 in stage 0.0 (TID 31)
org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:191)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
...

*From:* Dean Wampler [mailto:deanwamp...@gmail.com]
*Sent:* Tuesday, August 11, 2015 11:39 AM
*To:* Ellafi, Saif A.
*Cc:* user@spark.apache.org
*Subject:* Re: Parquet without hadoop: Possible?

It should work fine. I have an example script here: https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala (Spark 1.4.X)
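
For reference, a minimal spark-shell sketch of the full round trip on the local file system (the path is just an example):

// No separate Hadoop installation is needed; the prebuilt
// spark-1.4.1-bin-hadoop2.6 binaries already bundle the Hadoop
// client classes the Parquet writer uses.
val df = sc.parallelize(1 to 100).toDF("n")
df.write.parquet("file:///tmp/example-pq")
val back = sqlContext.read.parquet("file:///tmp/example-pq")
back.show()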

What does "I am failing to do so" mean?


Dean Wampler, Ph.D.

Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)

Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>

http://polyglotprogramming.com

On Tue, Aug 11, 2015 at 9:28 AM, saif.a.ell...@wellsfargo.com wrote:

Hi all,

I don't have any Hadoop FS installed in my environment, but I would like to store DataFrames in Parquet files. I am failing to do so. If this is possible, does anyone have any pointers?

Thank you,

Saif

