[ https://issues.apache.org/jira/browse/SPARK-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust updated SPARK-2721:
------------------------------------
    Target Version/s: 1.2.0

> Fix MapType compatibility issues with reading Parquet datasets
> --------------------------------------------------------------
>
>                 Key: SPARK-2721
>                 URL: https://issues.apache.org/jira/browse/SPARK-2721
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.1
>            Reporter: Robbie Russo
>
> Parquet-thrift (and most likely other Parquet implementations) supports null
> values in a map. This makes any Thrift-generated Parquet file that contains a
> map unreadable by Spark SQL, due to the following code in parquet-thrift for
> generating the schema for maps:
> {code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid}
> @Override
> public void visit(ThriftType.MapType mapType) {
>   final ThriftField mapKeyField = mapType.getKey();
>   final ThriftField mapValueField = mapType.getValue();
>
>   // save env for map
>   String mapName = currentName;
>   Type.Repetition mapRepetition = currentRepetition;
>
>   // ========= handle key
>   currentFieldPath.push(mapKeyField);
>   currentName = "key";
>   currentRepetition = REQUIRED;
>   mapKeyField.getType().accept(this);
>   Type keyType = currentType; // currentType is the already converted type
>   currentFieldPath.pop();
>
>   // ========= handle value
>   currentFieldPath.push(mapValueField);
>   currentName = "value";
>   currentRepetition = OPTIONAL;
>   mapValueField.getType().accept(this);
>   Type valueType = currentType;
>   currentFieldPath.pop();
>
>   if (keyType == null && valueType == null) {
>     currentType = null;
>     return;
>   }
>
>   if (keyType == null && valueType != null)
>     throw new ThriftProjectionException(
>         "key of map is not specified in projection: " + currentFieldPath);
>
>   // restore env
>   currentName = mapName;
>   currentRepetition = mapRepetition;
>   currentType = ConversionPatterns.mapType(currentRepetition, currentName,
>       keyType, valueType);
> }
> {code}
> This causes an error on the Spark side when we reach
> this step in the toDataType function, which asserts that both the key and
> the value have repetition level REQUIRED:
> {code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid}
> case ParquetOriginalType.MAP => {
>   assert(
>     !groupType.getFields.apply(0).isPrimitive,
>     "Parquet Map type malformatted: expected nested group for map!")
>   val keyValueGroup = groupType.getFields.apply(0).asGroupType()
>   assert(
>     keyValueGroup.getFieldCount == 2,
>     "Parquet Map type malformatted: nested group should have 2 (key, value) fields!")
>   val keyType = toDataType(keyValueGroup.getFields.apply(0))
>   assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED)
>   val valueType = toDataType(keyValueGroup.getFields.apply(1))
>   assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED)
>   new MapType(keyType, valueType)
> }
> {code}
> Currently I have modified parquet-thrift to use repetition REQUIRED for map
> values, just to make Spark SQL able to work on the Parquet files, since we
> don't actually use null values in our maps. However, it would be preferable
> to use parquet-thrift and Spark SQL out of the box and have them work nicely
> together with our existing Thrift data types, without having to modify
> dependencies.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
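For context on the mismatch described above: parquet-thrift writes the map's `value` field as OPTIONAL (so values may be null), while Spark's `toDataType` asserts REQUIRED for both key and value. A minimal sketch in plain Java of the relaxed check a fix would imply — the `Repetition` enum and `valueContainsNull` helper here are illustrative stand-ins, not the real parquet or Spark classes: the key must stay REQUIRED (Parquet map keys are never nullable), while the value may be REQUIRED or OPTIONAL, with optionality surfaced as a value-nullable flag instead of a failed assertion.

```java
// Toy model of the map key/value repetition check; not the real
// parquet.schema or Spark SQL classes.
public class MapRepetitionCheck {

    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    /**
     * Validates the repetitions of a Parquet map's nested key/value group
     * and returns whether values may be null. Keys must be REQUIRED;
     * values may be REQUIRED (no nulls) or OPTIONAL (nulls allowed).
     */
    static boolean valueContainsNull(Repetition key, Repetition value) {
        if (key != Repetition.REQUIRED) {
            throw new IllegalArgumentException(
                "map key must be REQUIRED, got: " + key);
        }
        switch (value) {
            case REQUIRED:
                return false; // strict layout Spark currently insists on
            case OPTIONAL:
                return true;  // layout parquet-thrift actually writes
            default:
                throw new IllegalArgumentException(
                    "map value must be REQUIRED or OPTIONAL, got: " + value);
        }
    }

    public static void main(String[] args) {
        // parquet-thrift output: OPTIONAL value -> nullable map values
        System.out.println(
            valueContainsNull(Repetition.REQUIRED, Repetition.OPTIONAL));
        // strict layout: REQUIRED value -> non-nullable map values
        System.out.println(
            valueContainsNull(Repetition.REQUIRED, Repetition.REQUIRED));
    }
}
```

Under this relaxed check, both layouts convert successfully, and the reporter's files would be readable without patching parquet-thrift; only the nullability of the resulting map's values differs.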