[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425919#comment-15425919 ] Lefty Leverenz commented on ORC-54: --- At least the new ORC configuration parameter (*orc.tolerate.missing.schema*) needs to be documented. But what about the field name functionality? > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > Fix For: 1.2.0 > > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425915#comment-15425915 ] Lefty Leverenz commented on ORC-54: --- Should this be documented in the wiki? > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > Fix For: 1.2.0 > > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15424754#comment-15424754 ] ASF GitHub Bot commented on ORC-54: --- Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/55 > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15423575#comment-15423575 ] ASF GitHub Bot commented on ORC-54: --- Github user prasanthj commented on a diff in the pull request: https://github.com/apache/orc/pull/55#discussion_r75038675 --- Diff: java/core/src/java/org/apache/orc/impl/SchemaEvolution.java --- @@ -20,59 +20,132 @@ import java.util.ArrayList; import java.util.Arrays; +import java.util.HashMap; import java.util.List; +import java.util.Map; +import java.util.regex.Pattern; +import org.apache.orc.Reader; import org.apache.orc.TypeDescription; import org.slf4j.Logger; import org.slf4j.LoggerFactory; /** - * Take the file types and the (optional) configuration column names/types and - * see if there has been schema evolution. + * Infer and track the evolution between the schema as stored in the file and + * the schema that has been requested by the reader. */ public class SchemaEvolution { // indexed by reader column id private final TypeDescription[] readerFileTypes; // indexed by reader column id - private final boolean[] included; + private final boolean[] readerIncluded; + // indexed by file column id + private final boolean[] fileIncluded; private final TypeDescription fileSchema; private final TypeDescription readerSchema; private boolean hasConversion = false; + private final boolean isAcid; + // indexed by reader column id private final boolean[] ppdSafeConversion; - public SchemaEvolution(TypeDescription fileSchema, boolean[] includedCols) { -this(fileSchema, null, includedCols); + private static final Logger LOG = +LoggerFactory.getLogger(SchemaEvolution.class); + private static final Pattern missingMetadataPattern = +Pattern.compile("_col\\d+"); + + public static class IllegalEvolutionException extends RuntimeException { +public IllegalEvolutionException(String msg) { + super(msg); +} + } + + public SchemaEvolution(TypeDescription fileSchema, + Reader.Options options) { +this(fileSchema, null, options); } public SchemaEvolution(TypeDescription fileSchema, TypeDescription readerSchema, - boolean[] includedCols) { -this.included = includedCols == null ? null : + Reader.Options options) { +boolean allowMissingMetadata = options.getTolerateMissingSchema(); +boolean[] includedCols = options.getInclude(); +this.readerIncluded = includedCols == null ? null : Arrays.copyOf(includedCols, includedCols.length); +this.fileIncluded = new boolean[fileSchema.getMaximumId() + 1]; this.hasConversion = false; this.fileSchema = fileSchema; +isAcid = checkAcidSchema(fileSchema); if (readerSchema != null) { - if (checkAcidSchema(fileSchema)) { + if (isAcid) { this.readerSchema = createEventSchema(readerSchema); } else { this.readerSchema = readerSchema; } - this.readerFileTypes = new TypeDescription[this.readerSchema.getMaximumId() + 1]; - buildConversionFileTypesArray(fileSchema, this.readerSchema); + this.readerFileTypes = +new TypeDescription[this.readerSchema.getMaximumId() + 1]; + int positionalLevels = 0; + if (!hasColumnNames(isAcid? getBaseRow(fileSchema) : fileSchema)){ +if (!this.fileSchema.equals(this.readerSchema)) { + if (!allowMissingMetadata) { +throw new RuntimeException("Found that schema metadata is missing" ++ " from file. This is likely caused by" ++ " a writer earlier than HIVE-4243. Will" ++ " not try to reconcile schemas"); + } else { +LOG.warn("Column names are missing from this file. This is" ++ " caused by a writer earlier than HIVE-4243. The reader will" ++ " reconcile schemas based on index. File type: " + +this.fileSchema + ", reader type: " + this.readerSchema); +positionalLevels = isAcid ? 2 : 1; --- End diff -- What does positional level mean? Is it real row level? Does acid file schema look like this struct,struct[real_cols]>>? If so can you leave a comment about it? > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 >
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15423517#comment-15423517 ] ASF GitHub Bot commented on ORC-54: --- GitHub user omalley opened a pull request: https://github.com/apache/orc/pull/55 ORC-54: Evolve schemas based on field name rather than index This is an updated version of Mark's patch that fixes evolution of ACID files and rebases it to the current trunk. You can merge this pull request into a Git repository by running: $ git pull https://github.com/omalley/orc orc-54 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/55.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #55 commit e37bc797562f4c219bb733390f8e17a79d7ecec8 Author: Mark WagnerDate: 2016-08-16T22:30:51Z ORC-54: Evolve schemas based on field name rather than index > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364827#comment-15364827 ] ASF GitHub Bot commented on ORC-54: --- Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/40#discussion_r69785443 --- Diff: java/core/src/java/org/apache/orc/impl/SchemaEvolution.java --- @@ -85,55 +142,78 @@ void buildMapping(TypeDescription fileType, // check the easy case first if (fileType.getCategory() == readerType.getCategory()) { switch (readerType.getCategory()) { -case BOOLEAN: -case BYTE: -case SHORT: -case INT: -case LONG: -case DOUBLE: -case FLOAT: -case STRING: -case TIMESTAMP: -case BINARY: -case DATE: - // these are always a match - break; -case CHAR: -case VARCHAR: - // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL) - isOk = fileType.getMaxLength() == readerType.getMaxLength(); - break; -case DECIMAL: - // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL) - // TODO we don't enforce scale and precision checks, but probably should - break; -case UNION: -case MAP: -case LIST: { - // these must be an exact match - List fileChildren = fileType.getChildren(); - List readerChildren = readerType.getChildren(); - if (fileChildren.size() == readerChildren.size()) { -for(int i=0; i < fileChildren.size(); ++i) { - buildMapping(fileChildren.get(i), readerChildren.get(i)); -} - } else { -isOk = false; + case BOOLEAN: + case BYTE: + case SHORT: + case INT: + case LONG: + case DOUBLE: + case FLOAT: + case STRING: + case TIMESTAMP: + case BINARY: + case DATE: +// these are always a match +break; + case CHAR: + case VARCHAR: +// HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL) +isOk = fileType.getMaxLength() == readerType.getMaxLength(); +break; + case DECIMAL: +// HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL) +// TODO we don't enforce scale and precision checks, but probably should +break; + case UNION: + case MAP: + case LIST: { +// these must be an exact match +List fileChildren = fileType.getChildren(); +List readerChildren = readerType.getChildren(); +if (fileChildren.size() == readerChildren.size()) { + for (int i = 0; i < fileChildren.size(); ++i) { +buildMapping(fileChildren.get(i), readerChildren.get(i), useFieldNames); --- End diff -- when you are recursing, useFieldNames should always be true. Prior to HIVE-4243 only the top level column names were lost. In recursive types the field names were correct. > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364809#comment-15364809 ] ASF GitHub Bot commented on ORC-54: --- Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/40#discussion_r69784363 --- Diff: java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java --- @@ -201,6 +201,10 @@ public void nextVector(ColumnVector previous, public BitFieldReader getPresent() { return present; } + --- End diff -- What do you think about moving the building of writerIncluded into the SchemaEvolution class? I think it would be pretty natural there and building it would come naturally out of buildMapping. > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364751#comment-15364751 ] ASF GitHub Bot commented on ORC-54: --- Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/40#discussion_r69779322 --- Diff: java/core/src/java/org/apache/orc/OrcConf.java --- @@ -82,6 +82,12 @@ "If ORC reader encounters corrupt data, this value will be used to\n" + "determine whether to skip the corrupt data or throw exception.\n" + "The default behavior is to throw exception."), + TOLERATE_MISSING_SCHEMA("orc.tolerate.missing.schema", + "hive.exec.orc.tolerate.missing.schema", + true, + "Writers earlier than HIVE-4243 may have inaccurate schema metadata.\n" + + "This setting will enable best effort schema evolution rather\n" --- End diff -- I'd suggest that the comment include something about using position in the top level structure as the fallback and that it support adding new columns to the end and changing types, but not deleting or reordering the top level columns. > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364723#comment-15364723 ] ASF GitHub Bot commented on ORC-54: --- Github user omalley commented on a diff in the pull request: https://github.com/apache/orc/pull/40#discussion_r69777633 --- Diff: java/core/src/java/org/apache/orc/impl/ReaderImpl.java --- @@ -572,7 +572,14 @@ public RecordReader rows(Options options) throws IOException { boolean[] include = options.getInclude(); // if included columns is null, then include all columns if (include == null) { - include = new boolean[types.size()]; + int size; + TypeDescription readSchema = options.getSchema(); + if (readSchema != null){ +size = readSchema.getMaximumId() + 1; --- End diff -- I'd suggest doing ``` if (readSchema == null) { readSchema = schema; } ``` > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334675#comment-15334675 ] Mark Wagner commented on ORC-54: I'm reading through your patch on HIVE-13974. Wasn't aware of that work. Still need to read through it some more to grok, but I agree that there's overlap in that both of the patches are making the distinction around the included array explicit. I don't see a reason Hive shouldn't be able to read these datasets as long as the OrcSerde declares the reader schema in the Options. We use Hive for reading Orc, so that's definitely something I'm paying attention to. > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333415#comment-15333415 ] Matt McCline commented on ORC-54: - Thanks for the heads up. Big change -- seem like it is on a direct collision course with HIVE-13974 where the reader schema needs to be used for determining the included array. I need to understand how this relates to Hive. Will Hive be able to read these tables? [~hagleitn] [~omalley] > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333251#comment-15333251 ] Mark Wagner commented on ORC-54: [~mmccline], I've just posted a PR. Let me know your thoughts. Thanks! > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333242#comment-15333242 ] ASF GitHub Bot commented on ORC-54: --- GitHub user wagnermarkd opened a pull request: https://github.com/apache/orc/pull/40 ORC-54: Evolve schemas based on field name rather than index Change evolution to compare column names when building mapping. Misc notes: - I've added a setting orc.tolerate.missing.schema to allow working with data before HIVE-4243 was fixed. This is important for anyone who has older data, though it limits the evolutions that can be supported - A lot of changes in RecordReaderImpl to make the reader column vs writer column difference explicit - SARG tests are passing, but there's no test coverage for SARG + evolution. Will work on that. - Removed some unneeded checked exceptions. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wagnermarkd/orc ORC-54-public Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/40.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #40 commit c24200900e55437a242034eda6a5cd5122cc3acc Author: Mark WagnerDate: 2016-05-18T21:24:16Z ORC-54: Evolve schemas based on field name rather than index > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index
[ https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319680#comment-15319680 ] Matt McCline commented on ORC-54: - [~mwagner] [~owen.omalley] [~hagleitn] I need to be involved in this. > Evolve schemas based on field name rather than index > > > Key: ORC-54 > URL: https://issues.apache.org/jira/browse/ORC-54 > Project: Orc > Issue Type: Improvement >Reporter: Mark Wagner >Assignee: Mark Wagner > > Schema evolution as it stands today allows adding fields to the end of > schemas or removing them from the end. However, because it is based on the > index of the column, you can only ever add or remove -- not both. > ORC files have the full schema information of their contents, so there's > actually enough metadata to support changing columns anywhere in the schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)