[ https://issues.apache.org/jira/browse/SPARK-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2721:
------------------------------------

    Target Version/s: 1.2.0

> Fix MapType compatibility issues with reading Parquet datasets
> --------------------------------------------------------------
>
>                 Key: SPARK-2721
>                 URL: https://issues.apache.org/jira/browse/SPARK-2721
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.1
>            Reporter: Robbie Russo
>
> Parquet-thrift (and most likely other Parquet implementations as well) 
> supports null values in maps. This makes any thrift-generated Parquet 
> files that contain a map unreadable by Spark SQL, due to the following code in 
> parquet-thrift for generating the schema for maps:
> {code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid}
>   @Override
>   public void visit(ThriftType.MapType mapType) {
>     final ThriftField mapKeyField = mapType.getKey();
>     final ThriftField mapValueField = mapType.getValue();
>     //save env for map
>     String mapName = currentName;
>     Type.Repetition mapRepetition = currentRepetition;
>     //=========handle key
>     currentFieldPath.push(mapKeyField);
>     currentName = "key";
>     currentRepetition = REQUIRED;
>     mapKeyField.getType().accept(this);
>     Type keyType = currentType;//currentType is the already converted type
>     currentFieldPath.pop();
>     //=========handle value
>     currentFieldPath.push(mapValueField);
>     currentName = "value";
>     currentRepetition = OPTIONAL;
>     mapValueField.getType().accept(this);
>     Type valueType = currentType;
>     currentFieldPath.pop();
>     if (keyType == null && valueType == null) {
>       currentType = null;
>       return;
>     }
>     if (keyType == null && valueType != null)
>       throw new ThriftProjectionException("key of map is not specified in projection: " + currentFieldPath);
>     //restore Env
>     currentName = mapName;
>     currentRepetition = mapRepetition;
>     currentType = ConversionPatterns.mapType(currentRepetition, currentName,
>             keyType,
>             valueType);
>   }
> {code}
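To make the consequence of the converter concrete, here is a minimal, self-contained sketch (not the real parquet-thrift API; the `Repetition` enum below is a hypothetical stand-in for parquet's `Type.Repetition`) of the repetition levels the converter above assigns to a map's key and value:

```java
// A minimal sketch of the repetition levels chosen by the converter above.
public class MapSchemaSketch {
    // Hypothetical stand-in for parquet's Type.Repetition enum.
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    public static void main(String[] args) {
        // parquet-thrift always emits the map key as REQUIRED
        // (a map key can never be null)...
        Repetition keyRepetition = Repetition.REQUIRED;
        // ...and the map value as OPTIONAL, so that null values are allowed.
        Repetition valueRepetition = Repetition.OPTIONAL;
        System.out.println("key=" + keyRepetition + " value=" + valueRepetition);
    }
}
```

The OPTIONAL value repetition is fixed in the converter; it is emitted even for thrift maps that never actually contain null values.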
> The OPTIONAL value repetition then causes an error on the Spark side when we 
> reach the following step in the toDataType function, which asserts that both 
> the key and the value have repetition level REQUIRED:
> {code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid}
>         case ParquetOriginalType.MAP => {
>           assert(
>             !groupType.getFields.apply(0).isPrimitive,
>             "Parquet Map type malformatted: expected nested group for map!")
>           val keyValueGroup = groupType.getFields.apply(0).asGroupType()
>           assert(
>             keyValueGroup.getFieldCount == 2,
>             "Parquet Map type malformatted: nested group should have 2 (key, value) fields!")
>           val keyType = toDataType(keyValueGroup.getFields.apply(0))
>           assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED)
>           val valueType = toDataType(keyValueGroup.getFields.apply(1))
>           assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED)
>           new MapType(keyType, valueType)
>         }
> {code}
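The mismatch between the two code paths can be sketched in a few lines. The following self-contained example (hypothetical helper names; the `Repetition` enum stands in for parquet's `Type.Repetition`) models Spark's strict assertion next to a relaxed check that would accept what parquet-thrift emits, treating an OPTIONAL value as a nullable map value rather than a malformed schema:

```java
public class RepetitionCheckSketch {
    // Hypothetical stand-in for parquet's Type.Repetition enum.
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    // Mirrors Spark's current assertion: key AND value must be REQUIRED.
    static boolean strictCheck(Repetition key, Repetition value) {
        return key == Repetition.REQUIRED && value == Repetition.REQUIRED;
    }

    // A hypothetical relaxed check: the key must still be REQUIRED, but an
    // OPTIONAL value is accepted (it would make the MapType's values nullable).
    static boolean relaxedCheck(Repetition key, Repetition value) {
        return key == Repetition.REQUIRED
            && (value == Repetition.REQUIRED || value == Repetition.OPTIONAL);
    }

    public static void main(String[] args) {
        // What parquet-thrift emits for a map field:
        Repetition key = Repetition.REQUIRED;
        Repetition value = Repetition.OPTIONAL;
        System.out.println("strict=" + strictCheck(key, value));
        System.out.println("relaxed=" + relaxedCheck(key, value));
    }
}
```

With parquet-thrift's schema, the strict check fails while the relaxed check passes, which is exactly the incompatibility described above.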
> Currently I have modified parquet-thrift to use repetition REQUIRED for map 
> values, just to make Spark SQL able to read the Parquet files, since we don't 
> actually use null values in our maps. However, it would be preferable for 
> parquet-thrift and Spark SQL to work together out of the box with our 
> existing thrift data types, without having to modify dependencies.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
