[jira] [Comment Edited] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-18 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547444#comment-16547444
 ] 

Romeo Kienzer edited comment on SPARK-24828 at 7/18/18 6:25 AM:


Dear [~q79969786] - thanks a lot - now it works - complete code below

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet/part-0-1ff92a81-68c8-446b-a54e-a042a8fd7f1e.snappy.parquet')

df_hat = df \
    .withColumn("label_tmp", df["label"].cast(LongType())) \
    .drop('label') \
    .withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)

 

 

Marking as resolved
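
The temporary-column sequence above (add label_tmp, drop label, rename back) can probably be collapsed, because withColumn replaces a column when the given name already exists. A minimal sketch, assuming the same label column; the helper name cast_label_to_long is hypothetical and not part of the original comment:

```python
def cast_label_to_long(df, column="label"):
    """Return a DataFrame whose `column` is cast to a 64-bit integer.

    Hypothetical helper: withColumn() called with an existing column
    name replaces that column, so no temporary column is needed.
    """
    # "long" is the string form of LongType accepted by Column.cast().
    return df.withColumn(column, df[column].cast("long"))
```

With this, df_hat = cast_label_to_long(df) would stand in for the three chained calls. One difference to be aware of: the replaced column keeps its original position, whereas drop-and-rename moves it to the end.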


was (Author: romeokienzler):
Dear [~q79969786] - thanks a lot - now it works - complete code below

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet/part-0-1ff92a81-68c8-446b-a54e-a042a8fd7f1e.snappy.parquet')

df_hat = df \
    .withColumn("label_tmp", df["label"].cast(LongType())) \
    .drop('label') \
    .withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)

 

Marking as resolved

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
> Reporter: Romeo Kienzer
> Priority: Minor
> Attachments: a2_m2.parquet.zip, image-2018-07-18-13-57-21-148.png
>
>
> As requested by [~hyukjin.kwon], here is a new issue; a related issue can be 
> found here.
>  
> Reading the attached parquet file, which was written by one Spark 
> installation, with an installation obtained directly from 
> [http://spark.apache.org/downloads.html] yields the following exception:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
> scala.MatchError: [1.0,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
> 18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>      at 
> 

[jira] [Comment Edited] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-18 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547444#comment-16547444
 ] 

Romeo Kienzer edited comment on SPARK-24828 at 7/18/18 6:24 AM:


Dear [~q79969786] - thanks a lot - now it works - complete code below

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet/part-0-1ff92a81-68c8-446b-a54e-a042a8fd7f1e.snappy.parquet')

df_hat = df \
    .withColumn("label_tmp", df["label"].cast(LongType())) \
    .drop('label') \
    .withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)

 

Marking as resolved


was (Author: romeokienzler):
Dear [~q79969786] - thanks a lot - now it works - complete code below

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet/part-0-1ff92a81-68c8-446b-a54e-a042a8fd7f1e.snappy.parquet')

df_hat = df \
    .withColumn("label_tmp", df["label"].cast(LongType())) \
    .drop('label') \
    .withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)

 

Marking as resolved


[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-18 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547444#comment-16547444
 ] 

Romeo Kienzer commented on SPARK-24828:
---

Dear [~q79969786] - thanks a lot - now it works - complete code below

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet/part-0-1ff92a81-68c8-446b-a54e-a042a8fd7f1e.snappy.parquet')

df_hat = df \
    .withColumn("label_tmp", df["label"].cast(LongType())) \
    .drop('label') \
    .withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)

 

Marking as resolved


[jira] [Resolved] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-18 Thread Romeo Kienzer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romeo Kienzer resolved SPARK-24828.
---
Resolution: Won't Fix

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
> Reporter: Romeo Kienzer
> Priority: Minor
> Attachments: a2_m2.parquet.zip, image-2018-07-18-13-57-21-148.png
>
>
> As requested by [~hyukjin.kwon], here is a new issue; a related issue can be 
> found here.
>  
> Reading the attached parquet file, which was written by one Spark 
> installation, with an installation obtained directly from 
> [http://spark.apache.org/downloads.html] yields the following exception:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
> scala.MatchError: [1.0,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
> 18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>      at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  
> The file is attached [^a2_m2.parquet.zip]
>  
> The following code reproduces the error:
> df = spark.read.parquet('a2_m2.parquet')
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> 

[jira] [Comment Edited] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547418#comment-16547418
 ] 

Romeo Kienzer edited comment on SPARK-24828 at 7/18/18 5:53 AM:


Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've created the following code casting Integer to Long on 
the label field - but it results in the same error

 

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_hat = df \
    .withColumn("label_tmp", df["label"].cast(LongType())) \
    .drop('label') \
    .withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)
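
That the cast changes nothing is consistent with the trace: the exception is raised in OnHeapColumnVector.getInt while the Parquet values are being decoded, so a value never reaches the cast. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the part files under a2_m2.parquet do not all store label with the same type. A rough sketch for checking this on a local filesystem; read_part_schemas and group_parts_by_schema are hypothetical helper names, and spark is assumed to be an existing SparkSession:

```python
import glob
import os
from collections import defaultdict


def read_part_schemas(spark, directory):
    """Read each part file on its own and record its column types.

    Assumes `directory` is a local path containing part-*.parquet files.
    """
    schemas = {}
    for path in sorted(glob.glob(os.path.join(directory, "part-*.parquet"))):
        fields = spark.read.parquet(path).schema.fields
        schemas[path] = {f.name: f.dataType.simpleString() for f in fields}
    return schemas


def group_parts_by_schema(schemas):
    """Group part-file paths by their (column, type) signature."""
    groups = defaultdict(list)
    for path, columns in schemas.items():
        groups[tuple(sorted(columns.items()))].append(path)
    return dict(groups)
```

Calling group_parts_by_schema(read_part_schemas(spark, 'a2_m2.parquet')) and finding more than one group would suggest that the directory mixes files written with different types for the same column.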


was (Author: romeokienzler):
Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've created the following code casting Integer to Long on 
the label field - but it results in the same error

 

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_cast = df.withColumn("label_tmp", df["label"].cast(LongType()))
df_hat = df_cast.drop('label').withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)


[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547418#comment-16547418
 ] 

Romeo Kienzer commented on SPARK-24828:
---

Dear [~q79969786] - thanks for pointing this out.

[~hyukjin.kwon] - I've created the following code casting Integer to Long on 
the label field - but it results in the same error

 

from pyspark.sql.types import LongType

df = spark.read.parquet('a2_m2.parquet')
df_cast = df.withColumn("label_tmp", df["label"].cast(LongType()))
df_hat = df_cast.drop('label').withColumnRenamed('label_tmp', 'label')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df_hat)


[jira] [Commented] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546602#comment-16546602
 ] 

Romeo Kienzer commented on SPARK-24828:
---

[~q79969786]  [x] done


[jira] [Updated] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-17 Thread Romeo Kienzer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romeo Kienzer updated SPARK-24828:
--
Attachment: a2_m2.parquet.zip

> Incompatible parquet formats - java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> -
>
> Key: SPARK-24828
> URL: https://issues.apache.org/jira/browse/SPARK-24828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Environment for creating the parquet file:
> IBM Watson Studio Apache Spark Service, V2.1.2
> Environment for reading the parquet file:
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> MacOSX 10.13.3 (17D47)
> Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
> http://spark.apache.org/downloads.html
>Reporter: Romeo Kienzer
>Priority: Minor
> Attachments: a2_m2.parquet.zip
>
>
> As requested by [~hyukjin.kwon], here is a new issue. The related issue can be found 
> here
>  
> Reading the attached parquet file, written by one Spark installation, with 
> an installation obtained directly from 
> [http://spark.apache.org/downloads.html] yields the following exceptions:
>  
> 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
>  scala.MatchError: [1.0,null] (of class 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
>  java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
>      at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>      at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>      at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>      at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>      at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>      at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
>      at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>      at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>      at org.apache.spark.scheduler.Task.run(Task.scala:99)
>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>      at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>      at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>      at java.lang.Thread.run(Thread.java:748)
>  
> The file is attached [^a2_m2.parquet.zip]
>  
> The following code reproduces the error:
> df = spark.read.parquet('a2_m2.parquet')
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> binEval = 

[jira] [Updated] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romeo Kienzer updated SPARK-24828:
--
Description: 
As requested by [~hyukjin.kwon], here is a new issue. The related issue can be found 
here

 

Reading the attached parquet file, written by one Spark installation, with 
an installation obtained directly from [http://spark.apache.org/downloads.html] 
yields the following exceptions:

 

18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
 scala.MatchError: [1.0,null] (of class 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
     at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
     at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
     at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
     at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
     at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
     at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
     at org.apache.spark.scheduler.Task.run(Task.scala:99)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
 18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
 java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
     at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
     at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
     at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
     at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
     at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
     at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
     at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
     at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
     at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
     at org.apache.spark.scheduler.Task.run(Task.scala:99)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)

 

The file is attached [^a2_m2.parquet.zip]

 

The following code reproduces the error:

df = spark.read.parquet('a2_m2.parquet')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df)

  was:
As requested by [~hyukjin.kwon], here is a new issue. The related issue can be found 
here #3

 

Reading the attached parquet file, written by one Spark installation, with 
an installation obtained directly from [http://spark.apache.org/downloads.html] 
yields the following exceptions:

 

18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
scala.MatchError: [1.0,null] (of class 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at 

[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546059#comment-16546059
 ] 

Romeo Kienzer commented on SPARK-17557:
---

Dear [~hyukjin.kwon] - I've done so - new issue is SPARK-24828

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Major
> Attachments: a2_m2.parquet.zip
>
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)
Romeo Kienzer created SPARK-24828:
-

 Summary: Incompatible parquet formats - 
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
 Key: SPARK-24828
 URL: https://issues.apache.org/jira/browse/SPARK-24828
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: Environment for creating the parquet file:

IBM Watson Studio Apache Spark Service, V2.1.2

Environment for reading the parquet file:

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

MacOSX 10.13.3 (17D47)

Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
http://spark.apache.org/downloads.html
Reporter: Romeo Kienzer


As requested by [~hyukjin.kwon], here is a new issue. The related issue can be found 
here #3

 

Reading the attached parquet file, written by one Spark installation, with 
an installation obtained directly from [http://spark.apache.org/downloads.html] 
yields the following exceptions:

 

18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
scala.MatchError: [1.0,null] (of class 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
    at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
    at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
    at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

 

The file is attached [^a2_m2.parquet.zip]

 

The following code reproduces the error:

df = spark.read.parquet('a2_m2.parquet')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df)
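
The first trace, scala.MatchError: [1.0,null], is the error a Scala pattern match raises when no case fits the value. Assuming the evaluator maps each (prediction, label) row through a pattern that requires two non-null doubles (an assumption about MulticlassClassificationEvaluator.scala:79, not confirmed in this thread), a row with a null label would fall through. A minimal Python model of that behaviour, with hypothetical names throughout:

```python
class MatchError(Exception):
    """Stand-in for scala.MatchError (not a real Spark or Python class)."""


def to_prediction_and_label(row):
    """Accept only rows made of two floats, mimicking a pattern such as
    `case Row(prediction: Double, label: Double)` (assumed, see lead-in)."""
    prediction, label = row
    if isinstance(prediction, float) and isinstance(label, float):
        return prediction, label
    # Render the unmatched row the way the trace prints it.
    shown_label = "null" if label is None else label
    raise MatchError(f"[{prediction},{shown_label}]")


# One well-formed row and one row whose label was read as null.
rows = [(1.0, 1.0), (1.0, None)]
for row in rows:
    try:
        print(to_prediction_and_label(row))
    except MatchError as err:
        print("MatchError:", err)
```

If this model is right, checking the label column for nulls before calling evaluate would show whether the MatchError is a side effect of the column being read incorrectly.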






[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545557#comment-16545557
 ] 

Romeo Kienzer commented on SPARK-17557:
---

[~jayadevan.m] [~hyukjin.kwon] can you please re-open? You can easily reproduce 
the error with the following parquet file

 

[^a2_m2.parquet.zip]

 

and the following code in pyspark 2.1.2, 2.1.3, 2.3.0

df = spark.read.parquet('a2_m2.parquet')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df)
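
For context on the UnsupportedOperationException this code triggers: in parquet-mr, the base class org.apache.parquet.column.Dictionary rejects every decodeToX call by default and reports the concrete class name as the message, while each subclass overrides only the decoder for its own physical type. The Python model below is an illustrative sketch of that dispatch, not Parquet or Spark code; all class and function names are simplified stand-ins.

```python
class UnsupportedOperation(Exception):
    """Stand-in for java.lang.UnsupportedOperationException."""


class Dictionary:
    """Simplified model of a Parquet dictionary: every decoder fails by default."""

    def decode_to_int(self, dictionary_id):
        # The message is only the concrete class name, as in the traces.
        raise UnsupportedOperation(type(self).__name__)

    def decode_to_long(self, dictionary_id):
        raise UnsupportedOperation(type(self).__name__)


class PlainLongDictionary(Dictionary):
    """Dictionary for a column physically stored as 64-bit integers."""

    def __init__(self, values):
        self.values = list(values)

    def decode_to_long(self, dictionary_id):
        return self.values[dictionary_id]


def read_value(dictionary, dictionary_id, expected_type):
    """Choose the decoder from the type the reader expects, not from the file."""
    if expected_type == "int":
        return dictionary.decode_to_int(dictionary_id)
    return dictionary.decode_to_long(dictionary_id)


labels = PlainLongDictionary([0, 1])
print(read_value(labels, 1, "long"))
try:
    read_value(labels, 1, "int")
except UnsupportedOperation as err:
    print("UnsupportedOperation:", err)
```

Read this way, a trace that shows OnHeapColumnVector.getInt calling Dictionary.decodeToInt on a PlainLongDictionary suggests that the column type the reader expects and the type stored in the file disagree.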

 

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Major
> Attachments: a2_m2.parquet.zip
>
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}






[jira] [Updated] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romeo Kienzer updated SPARK-17557:
--
Attachment: a2_m2.parquet.zip

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Major
> Attachments: a2_m2.parquet.zip
>
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}


