One thing to note: it would be good to add an explicit file system
scheme to the output path (i.e. "file:///var/..." instead of
"/var/..."), especially when you do have HDFS running. Otherwise, if
Spark picks up Hadoop configuration files when starting the
application, the data may be written to HDFS rather than to your local
file system.
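For example, in spark-shell (a minimal sketch; the paths are only
illustrative):

scala> val df = sc.parallelize(1 to 8).toDF("n")
scala> // With the explicit scheme the data always lands on the local file system;
scala> // a bare "/var/..." path is resolved against fs.defaultFS, which may be HDFS.
scala> df.write.parquet("file:///tmp/pq-demo")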
Cheng
On 8/11/15 11:12 PM, saif.a.ell...@wellsfargo.com wrote:
I confirm that it works.
I was just having this issue:
https://issues.apache.org/jira/browse/SPARK-8450
Saif
*From:* Ellafi, Saif A.
*Sent:* Tuesday, August 11, 2015 12:01 PM
*To:* Ellafi, Saif A.; deanwamp...@gmail.com
*Cc:* user@spark.apache.org
*Subject:* RE: Parquet without hadoop: Possible?
Sorry, I provided bad information. This example worked fine with
reduced parallelism.
It seems my problem has to do with something specific to the real
data frame at the point where it is read.
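For reference, one way to reduce the write parallelism (a sketch; the
partition count of 2 is arbitrary):

scala> // Fewer partitions means fewer concurrent Parquet writers, each of which
scala> // allocates its own in-memory column buffers on the heap.
scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1), 2).toDF
scala> data.write.parquet("file:///tmp/pq-reduced")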
Saif
*From:* saif.a.ell...@wellsfargo.com
*Sent:* Tuesday, August 11, 2015 11:49 AM
*To:* deanwamp...@gmail.com
*Cc:* user@spark.apache.org
*Subject:* RE: Parquet without hadoop: Possible?
I am launching my spark-shell:
spark-1.4.1-bin-hadoop2.6/bin/spark-shell
15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive
support)..
SQL context available as sqlContext.
scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF
scala> data.write.parquet("/var/data/Saif/pq")
Then I get a million errors:
15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:01 INFO CodecPool: Got brand-new compressor [.gz]
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:09 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
15/08/11 09:46:07 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.lang.OutOfMemoryError: Java heap space
at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>(ColumnChunkPageWriteStore.java:68)
at parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>(ColumnChunkPageWriteStore.java:48)
at parquet.hadoop.ColumnChunkPageWriteStore.getPageWriter(ColumnChunkPageWriteStore.java:215)
at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:67)
at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/08/11 09:46:08 ERROR InsertIntoHadoopFsRelation: Aborting task.
...
...
.
15/08/11 09:46:10 ERROR DefaultWriterContainer: Task attempt
attempt_201508110946_0000_m_000011_0 aborted.
15/08/11 09:46:10 ERROR Executor: Exception in task 31.0 in stage 0.0
(TID 31)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:191)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
...
*From:* Dean Wampler [mailto:deanwamp...@gmail.com]
*Sent:* Tuesday, August 11, 2015 11:39 AM
*To:* Ellafi, Saif A.
*Cc:* user@spark.apache.org
*Subject:* Re: Parquet without hadoop: Possible?
It should work fine. I have an example script here:
https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala
(Spark 1.4.X)
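A stripped-down version of the same idea, runnable directly in
spark-shell (a sketch against Spark 1.4; the path and column name are
just placeholders):

scala> import sqlContext.implicits._
scala> // No separate Hadoop/HDFS installation is needed; the spark-1.4.1-bin-hadoop2.6
scala> // build already bundles the libraries Parquet I/O requires.
scala> val df = sc.parallelize(Seq(1, 2, 3, 4)).toDF("value")
scala> df.write.parquet("file:///tmp/parquet-demo")
scala> sqlContext.read.parquet("file:///tmp/parquet-demo").show()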
What does "I am failing to do so" mean?
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
On Tue, Aug 11, 2015 at 9:28 AM, saif.a.ell...@wellsfargo.com wrote:
Hi all,
I don't have any Hadoop file system installed in my environment, but I
would like to store data frames in Parquet files. I am failing to do
so; if this is possible, does anyone have any pointers?
Thank you,
Saif