I checked and realised that the schemas of the files differ: some fields are 
missing, and some fields have the same name but different types.

How may I overcome this issue?
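
For reference, this is roughly how I compared the two schemas (a minimal sketch; 
only the two file paths below are from my setup, the field-by-field diff is generic):

df1 = spark.read.parquet("hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")
df2 = spark.read.parquet("hdfs://xxx/20170719/part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")

fields1 = {f.name: f.dataType for f in df1.schema.fields}
fields2 = {f.name: f.dataType for f in df2.schema.fields}

# Fields present in only one of the files
print("only in file 1:", sorted(set(fields1) - set(fields2)))
print("only in file 2:", sorted(set(fields2) - set(fields1)))

# Fields present in both files but with different types
for name in sorted(set(fields1) & set(fields2)):
    if fields1[name] != fields2[name]:
        print("type conflict:", name, fields1[name], "vs", fields2[name])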

Get Outlook for Android<https://aka.ms/ghei36>

________________________________
From: pandees waran <pande...@gmail.com>
Sent: Sunday, July 30, 2017 7:12:55 PM
To: serkan taş
Cc: user@spark.apache.org
Subject: Re: Spark parquet file read problem !

I have encountered a similar error when the schemas/datatypes conflict between 
two parquet files. Are you sure that the two individual files have the same 
structure with matching datatypes? If not, you have to fix this by enforcing 
default values for the missing fields to make the structures and data types 
identical.
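
Something along these lines (a minimal PySpark sketch; the column names 
"event_id" and "extra" are hypothetical stand-ins for your conflicting and 
missing fields):

from pyspark.sql.functions import lit
from pyspark.sql.types import LongType, StringType

df1 = spark.read.parquet("hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")
df2 = spark.read.parquet("hdfs://xxx/20170719/part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")

# Cast the field that has the same name but different types to one common type
df1 = df1.withColumn("event_id", df1["event_id"].cast(LongType()))
df2 = df2.withColumn("event_id", df2["event_id"].cast(LongType()))

# Add the missing field to the file that lacks it, with a default value
df2 = df2.withColumn("extra", lit(None).cast(StringType()))

# Align the column order and union (union is positional in Spark 2.x)
cols = df1.columns
merged = df1.select(cols).union(df2.select(cols))

Once the two frames are identical in structure, you can also write them back 
out so that reading the folder works cleanly afterwards.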

Sent from my iPhone

On Jul 30, 2017, at 8:11 AM, serkan taş 
<serkan_...@hotmail.com<mailto:serkan_...@hotmail.com>> wrote:

Hi,

I have a problem while reading parquet files located in hdfs.


If I read the files individually, nothing is wrong and I can get the file content.

parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet")

and

parquetFile = spark.read.parquet("hdfs://xxx/20170719/part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet")


But when I try to read the folder like this:

parquetFile = spark.read.parquet("hdfs://xxx/20170719/")


I get the exception below:

Note: only these two files are in the folder. Please find the parquet files 
attached.


Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 
5, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: 
Can not read value at 1 in block 0 in file 
hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet
       at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
       at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
       at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
       at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
       at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
       at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
       at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
       at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
       at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
       at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
       at scala.collection.Iterator$class.foreach(Iterator.scala:893)
       at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
       at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
       at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
       at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
       at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
       at 
org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setLong(SpecificInternalRow.scala:295)
       at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setLong(ParquetRowConverter.scala:164)
       at 
org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addLong(ParquetRowConverter.scala:86)
       at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:274)
       at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:371)
       at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
       at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
       ... 16 more

Driver stacktrace:
       at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
       at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
       at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
       at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
       at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
       at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
       at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
       at scala.Option.foreach(Option.scala:257)
       at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
       at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
       at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
       at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
       at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
       at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
       at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441)
       at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
       at py4j.Gateway.invoke(Gateway.java:280)
       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
       at py4j.commands.CallCommand.execute(CallCommand.java:79)
       at py4j.GatewayConnection.run(GatewayConnection.java:214)
       at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
at 1 in block 0 in file 
hdfs://xxx/20170719/part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet
       at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
       at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
       at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
       at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
       at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
       at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
       at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
       at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
       at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
       at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
       at scala.collection.Iterator$class.foreach(Iterator.scala:893)
       at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
       at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
       at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
       at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
       at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
       at 
org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setLong(SpecificInternalRow.scala:295)
       at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setLong(ParquetRowConverter.scala:164)
       at 
org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addLong(ParquetRowConverter.scala:86)
       at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.writeValue(ColumnReaderImpl.java:274)
       at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:371)
       at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
       at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
       ... 16 more



<part-00000-17262890-a56c-42e2-b299-2b10354da184.snappy.parquet>
<part-00000-3a9c226f-4fef-44b8-996b-115a2408c746.snappy.parquet>
