[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2019-05-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-9442:

Labels: bulk-closed  (was: )

> java.lang.ArithmeticException: / by zero when reading Parquet
> -------------------------------------------------------------
>
> Key: SPARK-9442
> URL: https://issues.apache.org/jira/browse/SPARK-9442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: DB Tsai
>Priority: Major
>  Labels: bulk-closed
>
> I am counting how many records are in my nested parquet file, which has this schema:
> {code}
> scala> u1aTesting.printSchema
> root
>  |-- profileId: long (nullable = true)
>  |-- country: string (nullable = true)
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- videoId: long (nullable = true)
>  |||-- date: long (nullable = true)
>  |||-- label: double (nullable = true)
>  |||-- weight: double (nullable = true)
>  |||-- features: vector (nullable = true)
> {code}
> The nested data array contains around 10k records, each parquet file is 
> around 600MB, and the total size is around 120GB. 
> I am doing a simple count:
> {code}
> scala> u1aTesting.count
> parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
> file 
> hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: / by zero
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>   ... 21 more
> {code}
> BTW, not all the tasks fail; some of them are successful. 
> Another note: by explicitly looping through the data to count, it works.
> {code}
> sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 
> 1L).reduce((x, y) => x + y) 
> {code}
> I think some metadata in the parquet files may be corrupted. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9442:
---
Description: 
I am counting how many records are in my nested parquet file, which has this schema:

{code}
scala> u1aTesting.printSchema
root
 |-- profileId: long (nullable = true)
 |-- country: string (nullable = true)
 |-- data: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- videoId: long (nullable = true)
 |||-- date: long (nullable = true)
 |||-- label: double (nullable = true)
 |||-- weight: double (nullable = true)
 |||-- features: vector (nullable = true)
{code}

The nested data array contains around 10k records, each parquet file is around 
600MB, and the total size is around 120GB.

I am doing a simple count:

{code}
scala> u1aTesting.count

parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
file 
hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
... 21 more
{code}

BTW, not all the tasks fail; some of them are successful. 

Another note: by explicitly looping through the data to count, it works.
{code}
sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
{code}
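
Since not all the tasks fail, a minimal sketch along these lines (assuming the same spark-shell session, with sc, sqlContext, hdfsPath, and date already in scope) could count each part file separately to narrow down which files trigger the exception:

{code}
// Hypothetical isolation loop, not part of the original job: try counting each
// part file on its own and report which ones fail to decode.
import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path(hdfsPath + s"/testing/u1snappy/${date}/")
val fs = dir.getFileSystem(sc.hadoopConfiguration)
fs.listStatus(dir)
  .map(_.getPath)
  .filter(_.getName.endsWith(".parquet"))
  .foreach { p =>
    try {
      println(s"OK   $p -> ${sqlContext.read.load(p.toString).count()} rows")
    } catch {
      case e: Exception => println(s"FAIL $p -> ${e.getMessage}")
    }
  }
{code}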

I think some metadata in the parquet files may be corrupted. 
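
To check that suspicion, one possible sketch (not verified as the actual cause) reads the footer of the file named in the stack trace, using the same parquet.hadoop classes that appear in the trace, and prints each row group's metadata:

{code}
// Hypothetical footer inspection, assuming the failing file from the trace is
// still available: print per-row-group row counts and sizes from the footer to
// see whether any of the metadata looks implausible.
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val file = new Path("hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/" +
  "part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet")
val footer = ParquetFileReader.readFooter(sc.hadoopConfiguration, file)
footer.getBlocks.asScala.zipWithIndex.foreach { case (block, i) =>
  println(s"row group $i: rows=${block.getRowCount}, totalBytes=${block.getTotalByteSize}")
}
{code}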
