[jira] [Assigned] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25588:

    Assignee:     (was: Apache Spark)

> SchemaParseException: Can't redefine: list when reading from Parquet
>
>                 Key: SPARK-25588
>                 URL: https://issues.apache.org/jira/browse/SPARK-25588
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.0
>         Environment: Spark version 2.3.2
>            Reporter: Michael Heuer
>            Priority: Major
>
> In ADAM, a library downstream of Spark, we use Avro to define a schema,
> generate Java classes from the Avro schema using the avro-maven-plugin, and
> generate Scala Products from the Avro schema using our own code generation
> library.
> In the code path demonstrated by the following unit test, we write out to
> Parquet and read back in using an RDD of Avro-generated Java classes and then
> write out to Parquet and read back in using a Dataset of Avro-generated Scala
> Products.
> {code:scala}
>   sparkTest("transform reads to variant rdd") {
>     val reads = sc.loadAlignments(testFile("small.sam"))
>     def checkSave(variants: VariantRDD) {
>       val tempPath = tmpLocation(".adam")
>       variants.saveAsParquet(tempPath)
>       assert(sc.loadVariants(tempPath).rdd.count === 20)
>     }
>     val variants: VariantRDD = reads.transmute[Variant, VariantProduct, VariantRDD](
>       (rdd: RDD[AlignmentRecord]) => {
>         rdd.map(AlignmentRecordRDDSuite.varFn)
>       })
>     checkSave(variants)
>     val sqlContext = SQLContext.getOrCreate(sc)
>     import sqlContext.implicits._
>     val variantsDs: VariantRDD = reads.transmuteDataset[Variant, VariantProduct, VariantRDD](
>       (ds: Dataset[AlignmentRecordProduct]) => {
>         ds.map(r => {
>           VariantProduct.fromAvro(
>             AlignmentRecordRDDSuite.varFn(r.toAvro))
>         })
>       })
>     checkSave(variantsDs)
>   }
> {code}
> https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
> Note that the schemas in Parquet are different:
> RDD code path:
> {noformat}
> $ parquet-tools schema /var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
> message org.bdgenomics.formats.avro.Variant {
>   optional binary contigName (UTF8);
>   optional int64 start;
>   optional int64 end;
>   required group names (LIST) {
>     repeated binary array (UTF8);
>   }
>   optional boolean splitFromMultiAllelic;
>   optional binary referenceAllele (UTF8);
>   optional binary alternateAllele (UTF8);
>   optional double quality;
>   optional boolean filtersApplied;
>   optional boolean filtersPassed;
>   required group filtersFailed (LIST) {
>     repeated binary array (UTF8);
>   }
>   optional group annotation {
>     optional binary ancestralAllele (UTF8);
>     optional int32 alleleCount;
>     optional int32 readDepth;
>     optional int32 forwardReadDepth;
>     optional int32 reverseReadDepth;
>     optional int32 referenceReadDepth;
>     optional int32 referenceForwardReadDepth;
>     optional int32 referenceReverseReadDepth;
>     optional float alleleFrequency;
>     optional binary cigar (UTF8);
>     optional boolean dbSnp;
>     optional boolean hapMap2;
>     optional boolean hapMap3;
>     optional boolean validated;
>     optional boolean thousandGenomes;
>     optional boolean somatic;
>     required group transcriptEffects (LIST) {
>       repeated group array {
>         optional binary alternateAllele (UTF8);
>         required group effects (LIST) {
>           repeated binary array (UTF8);
>         }
>         optional binary geneName (UTF8);
>         optional binary geneId (UTF8);
>         optional binary featureType (UTF8);
>         optional binary featureId (UTF8);
>         optional binary biotype (UTF8);
>         optional int32 rank;
>         optional int32 total;
>         optional binary genomicHgvs (UTF8);
>         optional binary transcriptHgvs (UTF8);
>         optional binary proteinHgvs (UTF8);
>         optional int32 cdnaPosition;
>         optional int32 cdnaLength;
>         optional int32 cdsPosition;
>         optional int32 cdsLength;
>         optional int32 proteinPosition;
>         optional int32 proteinLength;
>         optional int32 distance;
>         required group messages (LIST) {
>           repeated binary array (ENUM);
>         }
>       }
>     }
>     required group attributes (MAP) {
>       repeated group map (MAP_KEY_VALUE) {
>         required binary key (UTF8);
>         required binary value (UTF8);
>       }
>     }
>   }
> }
> {noformat}
> Dataset code path:
> {noformat}
> $ parquet-tools schema
[jira] [Assigned] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25588:

    Assignee: Apache Spark

> SchemaParseException: Can't redefine: list when reading from Parquet
>
>                 Key: SPARK-25588
>                 URL: https://issues.apache.org/jira/browse/SPARK-25588
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.0
>         Environment: Spark version 2.3.2
>            Reporter: Michael Heuer
>            Assignee: Apache Spark
>            Priority: Major