[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2019-05-28 Thread Michael Heuer (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850003#comment-16850003 ]

Michael Heuer commented on SPARK-25588:
---

Note that a fix for this issue has been merged upstream in Parquet:

[https://github.com/apache/parquet-mr/pull/560]

I don't know when a Parquet release containing this fix will be made available, 
nor how soon such a release could be picked up by Spark.
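
The underlying constraint is Avro's rule that a named type may be defined only 
once per schema. A minimal sketch of just that Avro behavior, independent of 
Parquet (the record name "list" below is illustrative, chosen to match the 
error message, not taken from the report):

{code:scala}
import org.apache.avro.Schema

// Defining the record name "list" twice in one schema fails with
// SchemaParseException: Can't redefine: list -- the same collision
// parquet-avro hits when two LIST columns map to identically named records.
val doubled =
  """{"type": "record", "name": "Outer", "fields": [
    |  {"name": "a", "type": {"type": "record", "name": "list", "fields": []}},
    |  {"name": "b", "type": {"type": "record", "name": "list", "fields": []}}
    |]}""".stripMargin

new Schema.Parser().parse(doubled)  // throws org.apache.avro.SchemaParseException
{code}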

> SchemaParseException: Can't redefine: list when reading from Parquet
> 
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
> Reporter: Michael Heuer
> Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, 
> generate Java classes from the Avro schema using the avro-maven-plugin, and 
> generate Scala Products from the Avro schema using our own code generation 
> library.
> In the code path demonstrated by the following unit test, we write out to 
> Parquet and read back in using an RDD of Avro-generated Java classes and then 
> write out to Parquet and read back in using a Dataset of Avro-generated Scala 
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
>     val reads = sc.loadAlignments(testFile("small.sam"))
>
>     def checkSave(variants: VariantRDD) {
>       val tempPath = tmpLocation(".adam")
>       variants.saveAsParquet(tempPath)
>       assert(sc.loadVariants(tempPath).rdd.count === 20)
>     }
>
>     val variants: VariantRDD = reads.transmute[Variant, VariantProduct, VariantRDD](
>       (rdd: RDD[AlignmentRecord]) => {
>         rdd.map(AlignmentRecordRDDSuite.varFn)
>       })
>     checkSave(variants)
>
>     val sqlContext = SQLContext.getOrCreate(sc)
>     import sqlContext.implicits._
>
>     val variantsDs: VariantRDD = reads.transmuteDataset[Variant, VariantProduct, VariantRDD](
>       (ds: Dataset[AlignmentRecordProduct]) => {
>         ds.map(r => {
>           VariantProduct.fromAvro(
>             AlignmentRecordRDDSuite.varFn(r.toAvro))
>         })
>       })
>     checkSave(variantsDs)
>   }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note the schemas in Parquet are different:
> RDD code path
> {noformat}
> $ parquet-tools schema /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
>     repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
>     repeated binary array (UTF8);
>   }
>   optional group annotation {
>     optional binary ancestralAllele (UTF8);
>     optional int32 alleleCount;
>     optional int32 readDepth;
>     optional int32 forwardReadDepth;
>     optional int32 reverseReadDepth;
>     optional int32 referenceReadDepth;
>     optional int32 referenceForwardReadDepth;
>     optional int32 referenceReverseReadDepth;
>     optional float alleleFrequency;
>     optional binary cigar (UTF8);
>     optional boolean dbSnp;
>     optional boolean hapMap2;
>     optional boolean hapMap3;
>     optional boolean validated;
>     optional boolean thousandGenomes;
>     optional boolean somatic;
>     required group transcriptEffects (LIST) {
>       repeated group array {
>         optional binary alternateAllele (UTF8);
>         required group effects (LIST) {
>           repeated binary array (UTF8);
>         }
>         optional binary geneName (UTF8);
>         optional binary geneId (UTF8);
>         optional binary featureType (UTF8);
>         optional binary featureId (UTF8);
>         optional binary biotype (UTF8);
>         optional int32 rank;
>         optional int32 total;
>         optional binary genomicHgvs (UTF8);
>         optional binary transcriptHgvs (UTF8);
>         optional binary proteinHgvs (UTF8);
>         optional int32 cdnaPosition;
>         optional int32 cdnaLength;
>         optional int32 cdsPosition;
>         optional int32 cdsLength;
>         optional int32 proteinPosition;
>         optional int32 proteinLength;
>         optional int32 distance;
>         required group messages (LIST) {
>           repeated binary array (ENUM);
>         }
>       }
>     }
>     required group attributes (MAP) {
>       repeated group map (MAP_KEY_VALUE) {
>         required binary key (UTF8);
>         required binary value (UTF8);
>       }
>     }
>   }
> }
> {noformat}

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2019-02-06 Thread Nandor Kollar (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761734#comment-16761734 ]

Nandor Kollar commented on SPARK-25588:
---

[~rdakshin] the stack trace you get is unrelated to this Jira; it appears to be 
a completely different issue. It looks like you have the wrong version of 
parquet-format on your classpath. Spark 2.4 depends on Parquet 1.10.0, which 
requires parquet-format 2.4.0 (pulled in as a transitive dependency). Could you 
make sure that you have the correct version of parquet-format on your classpath?
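
One quick way to verify, from a spark-shell on the affected classpath (a 
diagnostic sketch; org.apache.parquet.format.CompressionCodec is a standard 
parquet-format class, named here for illustration):

{code:scala}
// Print which jar parquet-format classes are actually loaded from; for
// Spark 2.4 this should point at a parquet-format-2.4.0 jar.
val codec = Class.forName("org.apache.parquet.format.CompressionCodec")
println(codec.getProtectionDomain.getCodeSource.getLocation)

// The BROTLI constant in [~rdakshin]'s stack trace lives in this enum;
// an older parquet-format jar on the classpath will be missing it.
println(codec.getEnumConstants.mkString(", "))
{code}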


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2019-01-19 Thread Dakshin Rajavel (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747343#comment-16747343 ]

Dakshin Rajavel commented on SPARK-25588:
-

[~heuermh] - I upgraded Spark to 2.4 since I wanted to use AWS EKS. Reading a 
Parquet file causes this issue. Do you know when this issue will be fixed?

Here is the error stack trace:

{noformat}
Exception in thread "main" java.lang.NoSuchFieldError: BROTLI
  at org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions$.<init>(ParquetOptions.scala:80)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions$.<clinit>(ParquetOptions.scala)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.<init>(ParquetOptions.scala:55)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.<init>(ParquetOptions.scala:39)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:164)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
  at scala.Option.orElse(Option.scala:289)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:643)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:627)
{noformat}


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-11-04 Thread Wenchen Fan (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674689#comment-16674689 ]

Wenchen Fan commented on SPARK-25588:
-

This looks like a Parquet bug in its interaction with Avro. I don't know what 
we can do on the Spark side, but we can upgrade Parquet once this bug is fixed 
upstream.
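
For context on the interaction: the RDD code path (parquet-avro's writer) 
produces the two-level list layout shown in the description, while Spark's 
Dataset writer emits the standard three-level layout, which looks roughly like 
this (illustrative shape, not taken from the truncated report above):

{noformat}
optional group names (LIST) {
  repeated group list {
    optional binary element (UTF8);
  }
}
{noformat}

When parquet-avro converts a schema of the second shape back to Avro on read, 
it can generate more than one nested record with the same name, and Avro 
rejects the second definition with SchemaParseException: Can't redefine: list 
(see PARQUET-1409, linked below).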


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-11-03 Thread antonkulaga (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674066#comment-16674066 ]

antonkulaga commented on SPARK-25588:
-

Any updates on this? This bug blocks the ADAM library, and hence most 
bioinformaticians using Spark.


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-22 Thread Michael Heuer (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659990#comment-16659990 ]

Michael Heuer commented on SPARK-25588:
---

> Can you try to downgrade parquet and see if the problem goes away?

Downgrading via dependency exclusion and override in ADAM does not appear to 
work. Do you mean downgrading the Parquet 1.10.0 dependency in the Spark build?


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-17 Thread Wenchen Fan (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654493#comment-16654493 ]

Wenchen Fan commented on SPARK-25588:
-

Sounds like the problem is caused by the Parquet upgrade in 2.4. Can you try to 
downgrade Parquet and see if the problem goes away?
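
For anyone who wants to run that experiment in a downstream build, a sketch 
(sbt 1.x syntax, hypothetical module list; ADAM itself builds with Maven, where 
the analogue is a dependencyManagement pin plus exclusions):

{code:scala}
// build.sbt: force one Parquet version across the whole classpath while
// testing whether the 1.10.0 upgrade introduced the regression.
dependencyOverrides ++= Seq(
  "org.apache.parquet" % "parquet-column" % "1.8.2",
  "org.apache.parquet" % "parquet-hadoop" % "1.8.2",
  "org.apache.parquet" % "parquet-avro"   % "1.8.2"
)
{code}

As [~heuermh] notes above, downgrading from the downstream side did not resolve 
the problem in ADAM's case.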


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-17 Thread Michael Heuer (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654400#comment-16654400 ]

Michael Heuer commented on SPARK-25588:
---

[~Gengliang.Wang] The unit test provided is only an attempt to reproduce the 
actual error, which happens downstream in ADAM.  In ADAM, we have been 
struggling with Spark's conflicting Parquet and Avro dependencies for many 
versions.  Our most recent workaround is to pin parquet-avro to version 1.8.1 
and exclude all its transitive dependencies.  This workaround worked for 2.3.2, 
thus I gave the last RC a non-binding +1.

[https://github.com/bigdatagenomics/adam/blob/master/pom.xml#L520]


That workaround does not work for 2.4.0, as this pinned version 1.8.1 conflicts 
at runtime with version 1.10.0 brought in by Spark.
{noformat}
$ mvn test
...
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: BROTLI
  at org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)
  at org.bdgenomics.adam.rdd.JavaSaveArgs$.$lessinit$greater$default$4(GenomicRDD.scala:78){noformat}

Removing the pinned version and the dependency exclusions, which brings the 
build dependency up to 1.10.0, reproduces the error reported here in our unit 
tests under both Spark 2.4.0 and Spark 2.3.2.


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-16 Thread Gengliang Wang (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651207#comment-16651207 ]

Gengliang Wang commented on SPARK-25588:


[~heuermh] I checked out the Spark code at tag v2.3.1 and ran your case with:
{noformat}
./build/sbt "; clean; project sql; testOnly *Spark25588Suite"
{noformat}
I can still reproduce the error. Can you confirm that the case works for 2.3.1?




[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-16 Thread Gengliang Wang (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651180#comment-16651180 ]

Gengliang Wang commented on SPARK-25588:


I saw a similar issue reported against Parquet 1.10: 
https://jira.apache.org/jira/browse/PARQUET-1409


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-15 Thread Gengliang Wang (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651152#comment-16651152 ]

Gengliang Wang commented on SPARK-25588:


[~srowen] I tried the test case on branch-2.3, which uses Avro 1.7.7, and the 
error can still be reproduced. It does not seem related to the upgrade from 
Avro 1.7.7 to 1.8.2.



[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-15 Thread Apache Spark (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651081#comment-16651081 ]

Apache Spark commented on SPARK-25588:
--

User 'heuermh' has created a pull request for this issue:
https://github.com/apache/spark/pull/22742


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-15 Thread Michael Heuer (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651080#comment-16651080 ]

Michael Heuer commented on SPARK-25588:
---

Created pull request [https://github.com/apache/spark/pull/22742] with a 
failing unit test that demonstrates this issue.


[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-11 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646027#comment-16646027
 ] 

Michael Heuer commented on SPARK-25588:
---

Here is a generic unit test that demonstrates the issue.
{code:scala}
import com.google.common.collect.Lists
import org.apache.avro.Schema
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.{ AvroParquetInputFormat, AvroReadSupport }
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.parquet.hadoop.util.ContextUtil
import org.apache.spark.sql.SQLContext
import org.bdgenomics.adam.util.ADAMFunSuite

case class Inner(
  names: Seq[String] = Seq())

case class Middle(
  inners: Seq[Inner] = Seq())

case class Outer(
  middle: Option[Middle] = None)

class Spark25588Suite extends ADAMFunSuite {

  sparkTest("write dataset out as parquet read in as rdd fails") {
    val spark = SQLContext.getOrCreate(sc)
    import spark.implicits._

    val inner = Inner(Seq("name0", "name1"))
    val middle = Middle(Seq(inner))
    val outer = Outer(Some(middle))
    val dataset = sc.parallelize(Seq(outer)).toDS()

    // write out from dataset to parquet
    val tempPath = tmpLocation(".parquet")
    dataset.toDF().write.format("parquet").save(tempPath)

    // read parquet in through SQL works ok
    val roundtrip = spark.read.parquet(tempPath).as[Outer]
    assert(roundtrip.first != null)

    // read parquet in as RDD fails
    val job = Job.getInstance(sc.hadoopConfiguration)
    val conf = ContextUtil.getConfiguration(job)
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Outer]])

    val innerSchema = Schema.createRecord("Inner", null, null, false)
    innerSchema.setFields(Lists.newArrayList(new Schema.Field("names",
      Schema.createArray(Schema.create(Schema.Type.STRING)), null, null)))

    val middleSchema = Schema.createRecord("Middle", null, null, false)
    middleSchema.setFields(Lists.newArrayList(new Schema.Field("inners",
      Schema.createArray(innerSchema), null, null)))

    val outerSchema = Schema.createRecord("Outer", null, null, false)
    outerSchema.setFields(Lists.newArrayList(new Schema.Field("middle",
      Schema.createUnion(Lists.newArrayList(Schema.create(Schema.Type.NULL),
        middleSchema)), null, null)))

    AvroParquetInputFormat.setAvroReadSchema(job, outerSchema)

    val records = sc.newAPIHadoopFile(tempPath,
      classOf[ParquetInputFormat[Outer]],
      classOf[Void],
      classOf[Outer],
      conf)

    assert(records.first != null)
  }
}
{code}
I'll see if I can migrate it over to the Spark test framework tomorrow morning.
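
For reference, the schema-conversion step where the exception surfaces can be
exercised without Spark at all. The sketch below is illustrative only: the
schema literal approximates the nested-list layout Spark writes for the case
classes above, and it assumes parquet-avro 1.10.0 on the classpath.
AvroSchemaConverter and MessageTypeParser are the actual parquet-mr APIs;
converting the file schema back to Avro is what AvroReadSupport.prepareForRead
does internally, and with affected parquet-avro versions rendering the
converted schema is where "Can't redefine: list" is reported (see
PARQUET-1441).
{code:scala}
import org.apache.parquet.avro.AvroSchemaConverter
import org.apache.parquet.schema.MessageTypeParser

object SchemaConversionSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative nested-list layout, approximating what Spark writes for
    // Outer(Middle(Seq(Inner(Seq[String])))) above; not the exact file schema.
    val parquetSchema = MessageTypeParser.parseMessageType(
      """message Outer {
        |  optional group middle {
        |    optional group inners (LIST) {
        |      repeated group list {
        |        optional group element {
        |          optional group names (LIST) {
        |            repeated group list {
        |              optional binary element (UTF8);
        |            }
        |          }
        |        }
        |      }
        |    }
        |  }
        |}""".stripMargin)

    // This mirrors what AvroReadSupport.prepareForRead does internally.
    // Depending on the parquet-avro version and the exact list layout, the
    // toString call either prints the converted schema or fails with
    // SchemaParseException: Can't redefine: list (PARQUET-1441).
    val avroSchema = new AvroSchemaConverter().convert(parquetSchema)
    println(avroSchema.toString(true))
  }
}
{code}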

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-10 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645919#comment-16645919
 ] 

Sean Owen commented on SPARK-25588:
---

Is this likely due to the update from Avro 1.7.x to 1.8.x? I would say so,
except it seems that Avro 1.8.1 + Parquet 1.8 works? Spark is on Parquet 1.10
and Avro 1.8.2, and that combination triggers the issue. It sounds like we
don't know of an Avro 1.8 + Parquet 1.10 combination that works in this
respect.

I know we want to update to Avro 1.8 and Parquet 1.10 for other reasons.
SPARK-24771 suggests that some incompatibility is inevitable with the Avro
upgrade. But it was an update that matches what ADAM uses, 1.8.x. So I'm kind
of confused as to why that caused a problem.

What changes would resolve this? Version changes in Avro or Parquet?
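
For anyone experimenting with those combinations, here is a minimal sbt sketch
for pinning a given Avro/Parquet pair; this is illustrative only (ADAM itself
builds with Maven), and the versions are just the pair discussed above.
{code:scala}
// build.sbt (sbt 1.x): pin one Avro/Parquet combination for testing.
libraryDependencies ++= Seq(
  "org.apache.avro"    % "avro"         % "1.8.2",
  "org.apache.parquet" % "parquet-avro" % "1.10.0"
)

// Force these versions across the transitive graph so that Spark and the
// application code agree on which Parquet/Avro classes load at runtime.
dependencyOverrides ++= Seq(
  "org.apache.avro"    % "avro"         % "1.8.2",
  "org.apache.parquet" % "parquet-avro" % "1.10.0"
)
{code}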

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645867#comment-16645867
 ] 

Wenchen Fan commented on SPARK-25588:
-

[~heuermh] is there a way to demonstrate the problem without the ADAM context?
I took a look at your test case and still have no idea what's going on.

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-10 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645295#comment-16645295
 ] 

Michael Heuer commented on SPARK-25588:
---

I've reported an issue against Parquet with additional investigation:
https://issues.apache.org/jira/browse/PARQUET-1441

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-10 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645294#comment-16645294
 ] 

Michael Heuer commented on SPARK-25588:
---

[~cloud_fan] I'm curious as to how this issue should be triaged with regard to
the 2.4.0 release(s). As far as I'm concerned, this is a regression, in that
Spark did not adequately consider what might happen when it updated the Avro
and Parquet dependency versions.

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-08 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642200#comment-16642200
 ] 

Michael Heuer commented on SPARK-25588:
---

I moved the version back to 2.3.2, see
[https://github.com/bigdatagenomics/adam/pull/2055]

and created this more succinct failing unit test
[https://github.com/bigdatagenomics/adam/blob/2551654a284a4efba70aff3a2efa8f5e29bb8ea3/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/Issue2058Suite.scala]

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-04 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638226#comment-16638226
 ] 

Michael Heuer commented on SPARK-25588:
---

> Looking at the stack trace, it seems like we are using Spark RDD API to read 
> something via the parquet lib with avro read support.

The unit test shows two code paths: one uses the Spark RDD API, which works
fine; the other uses the Spark SQL API, which worked fine with 2.3.1 and now
breaks with 2.4.0.


> Is it possible to provide some code that other people can also reproduce the 
> bug locally?

Agreed, I'm still working on this;
https://issues.apache.org/jira/browse/SPARK-25587 was an attempt at reproducing
this issue that uncovered a different one.


> BTW is it possible that ADAM has some problem with avro 1.8.x?

ADAM has had a dependency on Avro 1.8.x for a long time; rather, there was a
1.8 vs 1.7 internal conflict present in Spark at runtime that caused trouble.

With Avro 1.8.1 and Parquet 1.8.x dependencies in ADAM, building against Spark
2.4.0 results in a runtime error:
{noformat}
*** RUN ABORTED ***
  java.lang.NoSuchFieldError: BROTLI
  at org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)
{noformat}
With Avro 1.8.2 and Parquet 1.10.0 dependencies in ADAM, building against Spark
2.4.0, we run into this issue.
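
That NoSuchFieldError is consistent with mixed Parquet jars rather than an Avro
problem: CompressionCodecName from a newer parquet-mr references BROTLI, which
older parquet-format jars do not define. A quick way to see which jars actually
win at runtime is to print where each class was loaded from; the sketch below
is illustrative, not ADAM code.
{code:scala}
object ParquetClasspathCheck {
  def main(args: Array[String]): Unit = {
    val classes = Seq(
      classOf[org.apache.avro.Schema],                                  // avro
      classOf[org.apache.parquet.hadoop.metadata.CompressionCodecName], // parquet-mr
      classOf[org.apache.parquet.format.CompressionCodec]               // parquet-format
    )
    classes.foreach { c =>
      // getCodeSource may be null for bootstrap classes; these come from jars.
      val location = Option(c.getProtectionDomain.getCodeSource)
        .map(_.getLocation.toString)
        .getOrElse("<bootstrap>")
      println(s"${c.getName} -> $location")
    }
  }
}
{code}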

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-04 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638151#comment-16638151
 ] 

Wenchen Fan commented on SPARK-25588:
-

The code snippet is a little hard to understand without context (e.g. what's
`loadAlignments`? what's `transmute`?). Looking at the stack trace, it seems
like we are using the Spark RDD API to read something via the Parquet library
with Avro read support.

Is it possible to provide some code so that other people can also reproduce the
bug locally?

[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-04 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638155#comment-16638155
 ] 

Wenchen Fan commented on SPARK-25588:
-

BTW is it possible that ADAM has some problem with Avro 1.8.x? We upgraded Avro
to 1.8 in the 2.4 release: https://issues.apache.org/jira/browse/SPARK-24771
