[jira] [Commented] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850003#comment-16850003 ]

Michael Heuer commented on SPARK-25588:
---

Note: a fix for this issue has been merged upstream in Parquet: [https://github.com/apache/parquet-mr/pull/560]. I don't know when a Parquet release containing this fix will be made available, nor how soon such a Parquet release could be picked up by Spark.

> SchemaParseException: Can't redefine: list when reading from Parquet
>
> Key: SPARK-25588
> URL: https://issues.apache.org/jira/browse/SPARK-25588
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Environment: Spark version 2.3.2
> Reporter: Michael Heuer
> Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema, generate Java classes from the Avro schema using the avro-maven-plugin, and generate Scala Products from the Avro schema using our own code generation library.
>
> In the code path demonstrated by the following unit test, we write out to Parquet and read back in using an RDD of Avro-generated Java classes, and then write out to Parquet and read back in using a Dataset of Avro-generated Scala Products.
> {code:scala}
> sparkTest("transform reads to variant rdd") {
>   val reads = sc.loadAlignments(testFile("small.sam"))
>
>   def checkSave(variants: VariantRDD) {
>     val tempPath = tmpLocation(".adam")
>     variants.saveAsParquet(tempPath)
>     assert(sc.loadVariants(tempPath).rdd.count === 20)
>   }
>
>   val variants: VariantRDD = reads.transmute[Variant, VariantProduct, VariantRDD](
>     (rdd: RDD[AlignmentRecord]) => {
>       rdd.map(AlignmentRecordRDDSuite.varFn)
>     })
>   checkSave(variants)
>
>   val sqlContext = SQLContext.getOrCreate(sc)
>   import sqlContext.implicits._
>
>   val variantsDs: VariantRDD = reads.transmuteDataset[Variant, VariantProduct, VariantRDD](
>     (ds: Dataset[AlignmentRecordProduct]) => {
>       ds.map(r => {
>         VariantProduct.fromAvro(
>           AlignmentRecordRDDSuite.varFn(r.toAvro))
>       })
>     })
>   checkSave(variantsDs)
> }
> {code}
>
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
>
> Note the schemas in Parquet are different.
>
> RDD code path:
> {noformat}
> $ parquet-tools schema /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
>     repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
>     repeated binary array (UTF8);
>   }
>   optional group annotation {
>     optional binary ancestralAllele (UTF8);
>     optional int32 alleleCount;
>     optional int32 readDepth;
>     optional int32 forwardReadDepth;
>     optional int32 reverseReadDepth;
>     optional int32 referenceReadDepth;
>     optional int32 referenceForwardReadDepth;
>     optional int32 referenceReverseReadDepth;
>     optional float alleleFrequency;
>     optional binary cigar (UTF8);
>     optional boolean dbSnp;
>     optional boolean hapMap2;
>     optional boolean hapMap3;
>     optional boolean validated;
>     optional boolean thousandGenomes;
>     optional boolean somatic;
>     required group transcriptEffects (LIST) {
>       repeated group array {
>         optional binary alternateAllele (UTF8);
>         required group effects (LIST) {
>           repeated binary array (UTF8);
>         }
>         optional binary geneName (UTF8);
>         optional binary geneId (UTF8);
>         optional binary featureType (UTF8);
>         optional binary featureId (UTF8);
>         optional binary biotype (UTF8);
>         optional int32 rank;
>         optional int32 total;
>         optional binary genomicHgvs (UTF8);
>         optional binary transcriptHgvs (UTF8);
>         optional binary proteinHgvs (UTF8);
>         optional int32 cdnaPosition;
>         optional int32 cdnaLength;
>         optional int32 cdsPosition;
>         optional int32 cdsLength;
>         optional int32 proteinPosition;
>         optional int32 proteinLength;
>         optional int32 distance;
>         required group messages (LIST) {
>           repeated binary array (ENUM);
>         }
>       }
>     }
>     required group
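The failing round trip comes down to a name-uniqueness rule: when parquet-avro (before the fix in apache/parquet-mr PR 560, noted above) converts a Parquet schema like this one back to Avro, it reuses the same synthetic record name for every LIST field's element type, and Avro rejects defining the same name twice. The following is a minimal stand-alone sketch of that rule only; it is illustrative and is not the actual Avro or parquet-avro code:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of why a schema with two LIST fields (e.g. names and
// filtersFailed above) can trigger "SchemaParseException: Can't redefine:
// list": Avro requires record names to be unique within a schema, but the old
// converter synthesized the same name for every list element record.
public class RedefineSketch {
    private final Set<String> defined = new HashSet<>();

    // Registers a record name, failing the way Avro's parser does on a repeat.
    void define(String name) {
        if (!defined.add(name)) {
            throw new IllegalStateException("Can't redefine: " + name);
        }
    }

    public static void main(String[] args) {
        RedefineSketch schema = new RedefineSketch();
        schema.define("Variant");
        schema.define("list"); // element record synthesized for the first LIST field
        try {
            schema.define("list"); // second LIST field reuses the same synthetic name
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints: Can't redefine: list
        }
    }
}
```

The upstream fix takes the other approach implied by this sketch: give each list element record a distinct (or properly namespaced) name so no redefinition occurs.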
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761734#comment-16761734 ]

Nandor Kollar commented on SPARK-25588:
---

[~rdakshin] the stack trace you get is unrelated to this Jira; it seems to be a completely different issue. It looks like you have a wrong version of parquet-format on your classpath. Spark 2.4 depends on Parquet 1.10.0, which requires parquet-format 2.4.0 (pulled in as a transitive dependency). Could you make sure that you have the correct version of parquet-format on your classpath?
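One way to act on this classpath check is to ask the JVM where it resolves a given class from. A hedged sketch follows; the parquet-format class name in the comment is only an example probe target and may differ in your build, so the runnable part uses a JDK class:

```java
// Sketch: resolve where a class comes from on the classpath, useful for
// verifying which jar supplies parquet-format (or any other) classes.
public class WhichJar {
    // Returns the URL of the .class resource, or null if the class is absent.
    static String locate(String className) {
        java.net.URL url = ClassLoader.getSystemClassLoader()
                .getResource(className.replace('.', '/') + ".class");
        return url == null ? null : url.toString();
    }

    public static void main(String[] args) {
        // With Spark on the classpath you would probe a parquet-format class,
        // e.g. "org.apache.parquet.format.CompressionCodec" (example target);
        // here we demonstrate on a JDK class so the sketch runs anywhere.
        System.out.println(locate("java.util.ArrayList"));
    }
}
```

A null result means the class is missing entirely; a jar path pointing at an unexpected version is the mismatch described above.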
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16747343#comment-16747343 ]

Dakshin Rajavel commented on SPARK-25588:
---

[~heuermh] I upgraded Spark to 2.4 since I wanted to use AWS EKS. Reading a Parquet file causes this issue. Do you know when this issue will be fixed? Here is the error stack trace:
{noformat}
Exception in thread "main" java.lang.NoSuchFieldError: BROTLI
	at org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions$.(ParquetOptions.scala:80)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions$.(ParquetOptions.scala)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.(ParquetOptions.scala:55)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOptions.(ParquetOptions.scala:39)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.inferSchema(ParquetFileFormat.scala:164)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
	at scala.Option.orElse(Option.scala:289)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:643)
	at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:627)
{noformat}
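A NoSuchFieldError like the BROTLI failure above generally means the calling code was compiled against a newer enum (here, CompressionCodecName from a newer parquet-hadoop) than the one the classpath actually loads. A hedged sketch of a reflective probe that detects the missing constant instead of crashing at link time; the enum and constant names come from the stack trace, but the runnable part uses a JDK enum so it stands alone:

```java
// Sketch: probe an enum for a constant reflectively. A linked field
// reference (CompressionCodecName.BROTLI) fails with NoSuchFieldError when
// the loaded enum predates the constant; a reflective lookup lets you detect
// and report the version skew instead.
public class EnumProbe {
    static boolean hasConstant(Class<? extends Enum<?>> enumClass, String name) {
        for (Enum<?> constant : enumClass.getEnumConstants()) {
            if (constant.name().equals(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // With Parquet on the classpath you would probe
        // org.apache.parquet.hadoop.metadata.CompressionCodecName for "BROTLI".
        System.out.println(hasConstant(java.time.DayOfWeek.class, "MONDAY")); // true
        System.out.println(hasConstant(java.time.DayOfWeek.class, "BROTLI")); // false
    }
}
```

If the probe returns false for "BROTLI", an older parquet-hadoop (pre-1.10) is shadowing the one Spark 2.4 expects.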
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674689#comment-16674689 ]

Wenchen Fan commented on SPARK-25588:
---

This looks like a Parquet bug in its interaction with Avro. I don't know what we can do on the Spark side, but we can upgrade Parquet once this bug is fixed upstream.
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674066#comment-16674066 ]

antonkulaga commented on SPARK-25588:
---

Any updates on this? This bug blocks the ADAM library and hence blocks most bioinformaticians using Spark.
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659990#comment-16659990 ]

Michael Heuer commented on SPARK-25588:
---

> Can you try to downgrade parquet and see if the problem goes away?

Downgrading by dependency exclusion and override in ADAM does not appear to work. Do you mean downgrading the Parquet 1.10.0 dependency in the Spark build?
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654493#comment-16654493 ]

Wenchen Fan commented on SPARK-25588:
---

Sounds like the problem is caused by the Parquet upgrade in 2.4. Can you try to downgrade Parquet and see if the problem goes away?
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654400#comment-16654400 ]

Michael Heuer commented on SPARK-25588:
---

[~Gengliang.Wang] The unit test provided is only an attempt to reproduce the actual error, which happens downstream in ADAM.

In ADAM, we have been struggling with Spark's conflicting Parquet and Avro dependencies for many versions. Our most recent workaround is to pin parquet-avro to version 1.8.1 and exclude all of its transitive dependencies. This workaround worked for 2.3.2, thus I gave the last RC a non-binding +1.

[https://github.com/bigdatagenomics/adam/blob/master/pom.xml#L520]

That workaround does not work for 2.4.0, as the pinned version 1.8.1 conflicts at runtime with version 1.10.0 brought in by Spark:
{noformat}
$ mvn test
...
*** RUN ABORTED ***
java.lang.NoSuchFieldError: BROTLI
	at org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31)
	at org.bdgenomics.adam.rdd.JavaSaveArgs$.$lessinit$greater$default$4(GenomicRDD.scala:78)
{noformat}
Removing the pinned version and the dependency exclusions, bringing the build dependency version to 1.10.0, results in the error reported here in our unit tests under Spark 2.4.0. Doing the same also results in the error reported here in our unit tests under Spark 2.3.2.
> In the code path demonstrated by the following unit test, we write out to > Parquet and read back in using an RDD of Avro-generated Java classes and then > write out to Parquet and read back in using a Dataset of Avro-generated Scala > Products. > {code:scala} > sparkTest("transform reads to variant rdd") { > val reads = sc.loadAlignments(testFile("small.sam")) > def checkSave(variants: VariantRDD) { > val tempPath = tmpLocation(".adam") > variants.saveAsParquet(tempPath) > assert(sc.loadVariants(tempPath).rdd.count === 20) > } > val variants: VariantRDD = reads.transmute[Variant, VariantProduct, > VariantRDD]( > (rdd: RDD[AlignmentRecord]) => { > rdd.map(AlignmentRecordRDDSuite.varFn) > }) > checkSave(variants) > val sqlContext = SQLContext.getOrCreate(sc) > import sqlContext.implicits._ > val variantsDs: VariantRDD = reads.transmuteDataset[Variant, > VariantProduct, VariantRDD]( > (ds: Dataset[AlignmentRecordProduct]) => { > ds.map(r => { > VariantProduct.fromAvro( > AlignmentRecordRDDSuite.varFn(r.toAvro)) > }) > }) > checkSave(variantsDs) > } > {code} > https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540 > Note the schema in Parquet are different: > RDD code path > {noformat} > $ parquet-tools schema > /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet > message org.bdgenomics.formats.avro.Variant { > optional binary contigName (UTF8); > optional int64 start; > optional int64 end; > required group names (LIST) { > repeated binary array (UTF8); > } > optional boolean splitFromMultiAllelic; > optional binary referenceAllele (UTF8); > optional binary alternateAllele (UTF8); > optional double quality; > optional boolean filtersApplied; > optional boolean filtersPassed; > required group filtersFailed (LIST) { > repeated binary array (UTF8); > } > optional group annotation { > optional binary ancestralAllele (UTF8); > optional 
int32 alleleCount; > optional int32 readDepth; > optional int32 forwardReadDepth; > optional int32 reverseReadDepth; > optional int32 referenceReadDepth; > optional int32 referenceForwardReadDepth; > optional int32 referenceReverseReadDepth; > optional float alleleFrequency; > optional binary cigar (UTF8); > optional boolean dbSnp; > optional boolean hapMap2; > optional boolean hapMap3; > optional boolean validated; > optional boolean thousandGenomes; > optional boolean somatic; > required group transcriptEffects (LIST) { > repeated group array { > optional binary
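The pinning workaround described above can be sketched as follows. This is a hypothetical sbt rendering for illustration only: ADAM's actual build is Maven (see the pom.xml link above), and the exclusion set shown here is an assumption, not the project's real exclusion list.

```scala
// Hypothetical build.sbt sketch of the ADAM workaround (the real build is
// Maven). Pin parquet-avro to 1.8.1 and strip its transitive Parquet/Avro
// jars so that only the versions shipped with Spark end up on the runtime
// classpath. The exclusion set below is illustrative.
libraryDependencies += ("org.apache.parquet" % "parquet-avro" % "1.8.1")
  .excludeAll(
    ExclusionRule(organization = "org.apache.parquet", name = "parquet-column"),
    ExclusionRule(organization = "org.apache.parquet", name = "parquet-hadoop"),
    ExclusionRule(organization = "org.apache.avro")
  )
```

As the comment notes, this only holds together while the pinned 1.8.1 API stays binary-compatible with whatever Parquet version Spark puts on the classpath; the {{NoSuchFieldError: BROTLI}} above is exactly that compatibility breaking against Parquet 1.10.0 in Spark 2.4.0.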
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651207#comment-16651207 ] Gengliang Wang commented on SPARK-25588:
[~heuermh] I checked out the Spark code at tag v2.3.1 and ran your case with:
{noformat}
./build/sbt "; clean; project sql; testOnly *Spark25588Suite"
{noformat}
I can still reproduce the error. Can you confirm that the case works for 2.3.1?
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651180#comment-16651180 ] Gengliang Wang commented on SPARK-25588:
Saw a similar issue in Parquet 1.10: https://jira.apache.org/jira/browse/PARQUET-1409
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651152#comment-16651152 ] Gengliang Wang commented on SPARK-25588:
[~srowen] I tried the test case in branch-2.3, which uses Avro 1.7.7, and it can still be reproduced there. So the issue does not seem to be related to the upgrade from Avro 1.7.7 to 1.8.2.
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651081#comment-16651081 ] Apache Spark commented on SPARK-25588:
User 'heuermh' has created a pull request for this issue: https://github.com/apache/spark/pull/22742
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651080#comment-16651080 ] Michael Heuer commented on SPARK-25588:
---
Created pull request [https://github.com/apache/spark/pull/22742] with a failing unit test that demonstrates this issue.
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646027#comment-16646027 ] Michael Heuer commented on SPARK-25588:
---
Here is a generic unit test that demonstrates the issue.
{code:scala}
import com.google.common.collect.Lists
import org.apache.avro.Schema
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.{ AvroParquetInputFormat, AvroReadSupport }
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.parquet.hadoop.util.ContextUtil
import org.apache.spark.sql.SQLContext
import org.bdgenomics.adam.util.ADAMFunSuite

case class Inner(
  names: Seq[String] = Seq())

case class Middle(
  inners: Seq[Inner] = Seq())

case class Outer(
  middle: Option[Middle] = None)

class Spark25588Suite extends ADAMFunSuite {
  sparkTest("write dataset out as parquet read in as rdd fails") {
    val spark = SQLContext.getOrCreate(sc)
    import spark.implicits._

    val inner = Inner(Seq("name0", "name1"))
    val middle = Middle(Seq(inner))
    val outer = Outer(Some(middle))
    val dataset = sc.parallelize(Seq(outer)).toDS()

    // write out from dataset to parquet
    val tempPath = tmpLocation(".parquet")
    dataset.toDF().write.format("parquet").save(tempPath)

    // read parquet in through SQL works ok
    val roundtrip = spark.read.parquet(tempPath).as[Outer]
    assert(roundtrip.first != null)

    // read parquet in as RDD fails
    val job = Job.getInstance(sc.hadoopConfiguration)
    val conf = ContextUtil.getConfiguration(job)
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Outer]])

    val innerSchema = Schema.createRecord("Inner", null, null, false)
    innerSchema.setFields(Lists.newArrayList(new Schema.Field("names",
      Schema.createArray(Schema.create(Schema.Type.STRING)), null, null)))
    val middleSchema = Schema.createRecord("Middle", null, null, false)
    middleSchema.setFields(Lists.newArrayList(new Schema.Field("inners",
      Schema.createArray(innerSchema), null, null)))
    val outerSchema = Schema.createRecord("Outer", null, null, false)
    outerSchema.setFields(Lists.newArrayList(new Schema.Field("middle",
      Schema.createUnion(Lists.newArrayList(Schema.create(Schema.Type.NULL), middleSchema)),
      null, null)))
    AvroParquetInputFormat.setAvroReadSchema(job, outerSchema)

    val records = sc.newAPIHadoopFile(tempPath,
      classOf[ParquetInputFormat[Outer]],
      classOf[Void],
      classOf[Outer],
      conf)
    assert(records.first != null)
  }
}
{code}
I'll see if I can migrate it over to the Spark test framework tomorrow morning.
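The exception in the issue title can be modeled without Parquet or Spark on the classpath. Avro keeps a single table of fully-qualified named types and rejects any second definition of the same name; as I understand PARQUET-1409, parquet-avro's conversion of nested Parquet LIST groups could generate an Avro record literally named {{list}} (or {{array}}) more than once, tripping that check. The sketch below is a deliberately simplified stand-in for that bookkeeping, not Avro's actual parser; {{NameRegistry}} and {{RedefineDemo}} are hypothetical names.

```scala
// Simplified model of Avro's named-type bookkeeping (hypothetical; the real
// logic lives in org.apache.avro.Schema). Each record/enum/fixed name may be
// defined only once; a Parquet-to-Avro conversion that maps every nested LIST
// element group to a record named "list" registers the name twice and fails.
import scala.collection.mutable

final class NameRegistry {
  private val defined = mutable.Set[String]()
  def define(fullName: String): Unit =
    if (!defined.add(fullName))
      throw new IllegalStateException(s"Can't redefine: $fullName")
}

object RedefineDemo {
  def main(args: Array[String]): Unit = {
    val names = new NameRegistry
    names.define("list") // first nested element group: accepted
    try {
      names.define("list") // second LIST column maps to the same name
    } catch {
      case e: IllegalStateException =>
        println(e.getMessage) // prints: Can't redefine: list
    }
  }
}
```

The upstream fix referenced elsewhere in this thread (apache/parquet-mr pull request 560) avoids the collision by giving the generated nested records distinct fully-qualified names, so no name is registered twice.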
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645919#comment-16645919 ] Sean Owen commented on SPARK-25588:
---
Is this likely due to the update from Avro 1.7.x to 1.8.x? I would say so, except that Avro 1.8.1 + Parquet 1.8 seems to work, while Spark is on Parquet 1.10 and Avro 1.8.2 and that combination triggers the issue. It sounds like we don't know of an Avro 1.8 and Parquet 1.10 combination that works in this respect. I know we want to update to Avro 1.8 and Parquet 1.10 for other reasons, and SPARK-24771 suggests that some incompatibility is inevitable with the Avro upgrade. But that update matches what ADAM uses, 1.8.x, so I'm confused about why it caused a problem. What changes would resolve this: version changes in Avro or in Parquet?
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645867#comment-16645867 ] Wenchen Fan commented on SPARK-25588: --- [~heuermh] is there a way to demonstrate the problem without ADAM context? I took a look at your test case and still have no idea what's going on.
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645295#comment-16645295 ] Michael Heuer commented on SPARK-25588: --- I've reported an issue against Parquet with additional investigation https://issues.apache.org/jira/browse/PARQUET-1441
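The "Can't redefine" message comes from Avro's schema parser, which requires each named type to be defined only once within a namespace; a schema converter that emits the same record name (e.g. `list`) for two differently-shaped list fields trips that rule. A toy Python sketch of the rule, with hypothetical names and structure (this is not the actual Avro or parquet-mr code):

```python
# Toy illustration of Avro's named-type uniqueness rule (hypothetical code,
# not the real Avro parser): registering two different record shapes under
# the same name is rejected, while re-registering an identical shape is fine.

class SchemaParseException(Exception):
    pass

def parse_records(schemas, registry=None):
    """Register record schemas by name, rejecting conflicting redefinitions."""
    registry = {} if registry is None else registry
    for s in schemas:
        name = s["name"]
        if name in registry and registry[name] != s:
            raise SchemaParseException(f"Can't redefine: {name}")
        registry[name] = s
    return registry

# Two list fields converted to differently-shaped records sharing the name "list":
names_list = {"name": "list", "fields": [{"name": "element", "type": "string"}]}
effects_list = {"name": "list", "fields": [{"name": "element", "type": "int"}]}

parse_records([names_list])  # fine on its own
try:
    parse_records([names_list, effects_list])
except SchemaParseException as e:
    print(e)  # prints "Can't redefine: list"
```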
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645294#comment-16645294 ] Michael Heuer commented on SPARK-25588: --- [~cloud_fan] I'm curious as to how this issue should be triaged with regard to the 2.4.0 release(s). As far as I'm concerned, this is a regression, in that Spark did not adequately consider what might happen when it updated the Avro and Parquet dependency versions.
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642200#comment-16642200 ] Michael Heuer commented on SPARK-25588: --- I moved the version back to 2.3.2, see [https://github.com/bigdatagenomics/adam/pull/2055] and created this more succinct failing unit test [https://github.com/bigdatagenomics/adam/blob/2551654a284a4efba70aff3a2efa8f5e29bb8ea3/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/Issue2058Suite.scala]
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638226#comment-16638226 ] Michael Heuer commented on SPARK-25588: --- > Looking at the stack trace, it seems like we are using Spark RDD API to read > something via the parquet lib with avro read support. The unit test shows two code paths: one uses the Spark RDD API and works fine; the other uses the Spark SQL API, which worked fine with 2.3.1 and now breaks with 2.4.0. > Is it possible to provide some code that other people can also reproduce the > bug locally? Agreed, I'm still working on this. https://issues.apache.org/jira/browse/SPARK-25587 was an attempt at reproducing this issue that uncovered a different issue. > BTW is it possible that ADAM has some problem with avro 1.8.x? ADAM has had a dependency on Avro 1.8.x for a long time; rather, there was a 1.8 vs 1.7 internal conflict present in Spark at runtime that caused trouble. With Avro 1.8.1 and Parquet 1.8.x dependencies in ADAM, building against Spark 2.4.0 results in the runtime error {noformat} *** RUN ABORTED *** java.lang.NoSuchFieldError: BROTLI at org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31){noformat} With Avro 1.8.2 and Parquet 1.10.0 dependencies in ADAM, building against Spark 2.4.0, we run into this issue.
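For quick reference, the dependency combinations reported in this thread and their observed outcomes, encoded as a small lookup table (a summary of the comments above, not an exhaustive compatibility matrix):

```python
# Outcomes reported in this thread for ADAM built against Spark 2.4.0.
# Keys are (avro, parquet) dependency versions declared in ADAM.
observed = {
    ("1.8.1", "1.8.x"): "java.lang.NoSuchFieldError: BROTLI in CompressionCodecName",
    ("1.8.2", "1.10.0"): "SchemaParseException: Can't redefine: list",
}

def outcome(avro: str, parquet: str) -> str:
    """Look up the reported outcome for a version pair, if any."""
    return observed.get((avro, parquet), "not reported in this thread")
```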
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638151#comment-16638151 ] Wenchen Fan commented on SPARK-25588: --- The code snippet is a little hard to understand without context (e.g. what's `loadAlignments`? what's `transmute`?). Looking at the stack trace, it seems like we are using the Spark RDD API to read something via the parquet lib with avro read support. Is it possible to provide some code that other people can also use to reproduce the bug locally?
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638155#comment-16638155 ] Wenchen Fan commented on SPARK-25588: --- BTW is it possible that ADAM has some problem with avro 1.8.x? We upgrade avro to 1.8 in the 2.4 release: https://issues.apache.org/jira/browse/SPARK-24771
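As background for reading the schema dumps in this issue (not stated in the thread itself): the `repeated binary array` shape is Parquet's legacy 2-level list encoding, which parquet-avro writes by default, while the Parquet format specification defines a standard 3-level encoding. A minimal side-by-side sketch, as illustrative strings only, assuming the spec's standard form for the `names` field:

```python
# Legacy 2-level list encoding, as seen in the parquet-tools output above:
legacy_two_level = """
required group names (LIST) {
  repeated binary array (UTF8);
}
""".strip()

# Standard 3-level list encoding per the Parquet format spec:
standard_three_level = """
required group names (LIST) {
  repeated group list {
    optional binary element (UTF8);
  }
}
""".strip()
```

Readers that expect one shape and receive the other must fall back on legacy-list compatibility rules, which is where schema round-tripping between Avro and Parquet can go wrong.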