[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-09-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179620#comment-16179620
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vvysotskyi commented on the issue:

https://github.com/apache/drill/pull/805
  
This my bad habit to regenerate metadata left from the times before fix 
DRILL-5660. These functional test failures are the result of the mistake that I 
have made when resolving merge conflicts with DRILL-4264. After I made rebase, 
older 3.2 version still used in `ParquetReaderUtility` instead of 3.3. 
I fixed this bug and functional tests are pass without regenerating parquet 
metadata cache. Could you please take a look at this change in the last commit?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142170#comment-16142170
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/805
  
Thanks for addressing my questions. While waiting for DRILL-5660, I wonder 
if we can ask Salim to take a look?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by using special converting)
> For the second case only annotation for the last metadata structure can be 
> updated.
> *Summary*
> 1. Drill will read and use metadata files if files are valid, all present and 
> supported. Under 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127577#comment-16127577
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/877


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by using special converting)
> For the second case only annotation for the last metadata structure can be 
> updated.
> *Summary*
> 1. Drill will read and use metadata files if files are valid, all present and 
> supported. Under supported we mean that files were created before and under 
> current Drill version.
> 2. Drill will ignore reading metadata 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127417#comment-16127417
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133228018
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatConfig.java
 ---
@@ -24,9 +24,16 @@
 import com.fasterxml.jackson.annotation.JsonTypeName;
 
 @JsonTypeName("parquet") @JsonInclude(JsonInclude.Include.NON_DEFAULT)
-public class ParquetFormatConfig implements FormatPluginConfig{
+public class ParquetFormatConfig implements FormatPluginConfig {
 
-  public boolean autoCorrectCorruptDates = true;
+  private boolean autoCorrectCorruptDates = true;
--- End diff --

4 unit test have failed in `TestCorruptParquetDateCorrection` class.
For Json generation let's leave `autoCorrectCorruptDates` as public and 
`areCorruptDatesAutoCorrected()` method mark as `@JsonIgnore`.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127285#comment-16127285
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133198372
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +42,8 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  private boolean isMetadataCacheCorrupted;
+
   public MetadataContext() {
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by using 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127286#comment-16127286
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133198419
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -543,78 +584,72 @@ private void readBlockMeta(String path,
 mapper.registerModule(serialModule);
 mapper.registerModule(module);
 mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, 
false);
-FSDataInputStream is = fs.open(p);
-
-boolean alreadyCheckedModification = false;
-boolean newMetadata = false;
-
-if (metaContext != null) {
-  alreadyCheckedModification = metaContext.getStatus(parentDirString);
-}
-
-if (dirsOnly) {
-  parquetTableMetadataDirs = mapper.readValue(is, 
ParquetTableMetadataDirs.class);
-  logger.info("Took {} ms to read directories from directory cache 
file", timer.elapsed(TimeUnit.MILLISECONDS));
-  timer.stop();
-  parquetTableMetadataDirs.updateRelativePaths(parentDirString);
-  if (!alreadyCheckedModification && 
tableModified(parquetTableMetadataDirs.getDirectories(), p, parentDir, 
metaContext)) {
-parquetTableMetadataDirs =
-
(createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(p.getParent()).toString())).getRight();
-newMetadata = true;
-  }
-} else {
-  parquetTableMetadata = mapper.readValue(is, 
ParquetTableMetadataBase.class);
-  logger.info("Took {} ms to read metadata from cache file", 
timer.elapsed(TimeUnit.MILLISECONDS));
-  timer.stop();
-  if (parquetTableMetadata instanceof ParquetTableMetadata_v3) {
-((ParquetTableMetadata_v3) 
parquetTableMetadata).updateRelativePaths(parentDirString);
-  }
-  if (!alreadyCheckedModification && 
tableModified(parquetTableMetadata.getDirectories(), p, parentDir, 
metaContext)) {
-parquetTableMetadata =
-
(createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(p.getParent()).toString())).getLeft();
-newMetadata = true;
-  }
+try (FSDataInputStream is = fs.open(path)) {
+  boolean alreadyCheckedModification = false;
+  boolean newMetadata = false;
+alreadyCheckedModification = 
metaContext.getStatus(metadataParentDirPath);
+
+  if (dirsOnly) {
+parquetTableMetadataDirs = mapper.readValue(is, 
ParquetTableMetadataDirs.class);
+logger.info("Took {} ms to read directories from directory cache 
file", timer.elapsed(TimeUnit.MILLISECONDS));
+timer.stop();
+
parquetTableMetadataDirs.updateRelativePaths(metadataParentDirPath);
+if (!alreadyCheckedModification && 
tableModified(parquetTableMetadataDirs.getDirectories(), path, 
metadataParentDir, metaContext)) {
+  parquetTableMetadataDirs =
+  
(createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(path.getParent()).toString())).getRight();
+  newMetadata = true;
+}
+  } else {
+parquetTableMetadata = mapper.readValue(is, 
ParquetTableMetadataBase.class);
+logger.info("Took {} ms to read metadata from cache file", 
timer.elapsed(TimeUnit.MILLISECONDS));
+timer.stop();
+if (new 
MetadataVersion(parquetTableMetadata.getMetadataVersion()).compareTo(new 
MetadataVersion(3, 0)) >= 0) {
+  ((ParquetTableMetadata_v3) 
parquetTableMetadata).updateRelativePaths(metadataParentDirPath);
+}
+if (!alreadyCheckedModification && 
tableModified(parquetTableMetadata.getDirectories(), path, metadataParentDir, 
metaContext)) {
+  parquetTableMetadata =
+  
(createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(path.getParent()).toString())).getLeft();
+  newMetadata = true;
+}
 
-  // DRILL-5009: Remove the RowGroup if it is empty
-  List files = 
parquetTableMetadata.getFiles();
-  for (ParquetFileMetadata file : files) {
-List rowGroups = file.getRowGroups();
-for (Iterator iter = 
rowGroups.iterator(); iter.hasNext(); ) {
-  RowGroupMetadata r = iter.next();
-  if (r.getRowCount() == 0) {
-iter.remove();
+// DRILL-5009: Remove the RowGroup if it is empty
+List files = 
parquetTableMetadata.getFiles();
+for (ParquetFileMetadata file : files) {
+  List rowGroups = file.getRowGroups();
+  for (Iterator iter = 
rowGroups.iterator(); 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127287#comment-16127287
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133198332
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +42,8 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  private boolean isMetadataCacheCorrupted;
--- End diff --

Looks better. Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by using special converting)
> For 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127284#comment-16127284
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/877
  
+1, LGTM.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by using special converting)
> For the second case only annotation for the last metadata structure can be 
> updated.
> *Summary*
> 1. Drill will read and use metadata files if files are valid, all present and 
> supported. Under supported we mean that files were created before and under 
> current Drill version.
> 2. Drill will ignore reading 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127166#comment-16127166
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133178381
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +42,8 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  private boolean isMetadataCacheCorrupted;
+
   public MetadataContext() {
--- End diff --

1. We can remove this constructor, it exists by default or init 
`pruneStatus` and `dirModifCheckMap` in it.
2. Please move enum `PruneStatus` in the end of the class


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127167#comment-16127167
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133177759
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -543,78 +584,72 @@ private void readBlockMeta(String path,
 mapper.registerModule(serialModule);
 mapper.registerModule(module);
 mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, 
false);
-FSDataInputStream is = fs.open(p);
-
-boolean alreadyCheckedModification = false;
-boolean newMetadata = false;
-
-if (metaContext != null) {
-  alreadyCheckedModification = metaContext.getStatus(parentDirString);
-}
-
-if (dirsOnly) {
-  parquetTableMetadataDirs = mapper.readValue(is, 
ParquetTableMetadataDirs.class);
-  logger.info("Took {} ms to read directories from directory cache 
file", timer.elapsed(TimeUnit.MILLISECONDS));
-  timer.stop();
-  parquetTableMetadataDirs.updateRelativePaths(parentDirString);
-  if (!alreadyCheckedModification && 
tableModified(parquetTableMetadataDirs.getDirectories(), p, parentDir, 
metaContext)) {
-parquetTableMetadataDirs =
-
(createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(p.getParent()).toString())).getRight();
-newMetadata = true;
-  }
-} else {
-  parquetTableMetadata = mapper.readValue(is, 
ParquetTableMetadataBase.class);
-  logger.info("Took {} ms to read metadata from cache file", 
timer.elapsed(TimeUnit.MILLISECONDS));
-  timer.stop();
-  if (parquetTableMetadata instanceof ParquetTableMetadata_v3) {
-((ParquetTableMetadata_v3) 
parquetTableMetadata).updateRelativePaths(parentDirString);
-  }
-  if (!alreadyCheckedModification && 
tableModified(parquetTableMetadata.getDirectories(), p, parentDir, 
metaContext)) {
-parquetTableMetadata =
-
(createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(p.getParent()).toString())).getLeft();
-newMetadata = true;
-  }
+try (FSDataInputStream is = fs.open(path)) {
+  boolean alreadyCheckedModification = false;
+  boolean newMetadata = false;
+alreadyCheckedModification = 
metaContext.getStatus(metadataParentDirPath);
+
+  if (dirsOnly) {
+parquetTableMetadataDirs = mapper.readValue(is, 
ParquetTableMetadataDirs.class);
+logger.info("Took {} ms to read directories from directory cache 
file", timer.elapsed(TimeUnit.MILLISECONDS));
+timer.stop();
+
parquetTableMetadataDirs.updateRelativePaths(metadataParentDirPath);
+if (!alreadyCheckedModification && 
tableModified(parquetTableMetadataDirs.getDirectories(), path, 
metadataParentDir, metaContext)) {
+  parquetTableMetadataDirs =
+  
(createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(path.getParent()).toString())).getRight();
+  newMetadata = true;
+}
+  } else {
+parquetTableMetadata = mapper.readValue(is, 
ParquetTableMetadataBase.class);
+logger.info("Took {} ms to read metadata from cache file", 
timer.elapsed(TimeUnit.MILLISECONDS));
+timer.stop();
+if (new 
MetadataVersion(parquetTableMetadata.getMetadataVersion()).compareTo(new 
MetadataVersion(3, 0)) >= 0) {
+  ((ParquetTableMetadata_v3) 
parquetTableMetadata).updateRelativePaths(metadataParentDirPath);
+}
+if (!alreadyCheckedModification && 
tableModified(parquetTableMetadata.getDirectories(), path, metadataParentDir, 
metaContext)) {
+  parquetTableMetadata =
+  
(createMetaFilesRecursively(Path.getPathWithoutSchemeAndAuthority(path.getParent()).toString())).getLeft();
+  newMetadata = true;
+}
 
-  // DRILL-5009: Remove the RowGroup if it is empty
-  List files = 
parquetTableMetadata.getFiles();
-  for (ParquetFileMetadata file : files) {
-List rowGroups = file.getRowGroups();
-for (Iterator iter = 
rowGroups.iterator(); iter.hasNext(); ) {
-  RowGroupMetadata r = iter.next();
-  if (r.getRowCount() == 0) {
-iter.remove();
+// DRILL-5009: Remove the RowGroup if it is empty
+List files = 
parquetTableMetadata.getFiles();
+for (ParquetFileMetadata file : files) {
+  List rowGroups = file.getRowGroups();
+  for (Iterator iter = 
rowGroups.iterator(); 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127165#comment-16127165
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133179047
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +42,8 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  private boolean isMetadataCacheCorrupted;
--- End diff --

`metadataCacheCorrupted`?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by using special 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127138#comment-16127138
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132533379
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -519,17 +564,16 @@ private void writeFile(ParquetTableMetadataDirs 
parquetTableMetadataDirs, Path p
   /**
* Read the parquet metadata from a file
*
-   * @param path
-   * @return
-   * @throws IOException
+   * @param path to metadata file
+   * @param dirsOnly true for {@link 
Metadata#METADATA_DIRECTORIES_FILENAME}
+   * or false for {@link Metadata#METADATA_FILENAME} files 
reading
+   * @param metaContext current metadata context
+   * @throws IOException if metadata file can't be read or updated
*/
-  private void readBlockMeta(String path,
-  boolean dirsOnly,
-  MetadataContext metaContext) throws IOException {
+  private void readBlockMeta(Path path, boolean dirsOnly, MetadataContext 
metaContext) {
 Stopwatch timer = Stopwatch.createStarted();
-Path p = new Path(path);
-Path parentDir = Path.getPathWithoutSchemeAndAuthority(p.getParent()); 
// parent directory of the metadata file
-String parentDirString = parentDir.toUri().toString(); // string 
representation for parent directory of the metadata file
+Path parentDir = 
Path.getPathWithoutSchemeAndAuthority(path.getParent()); // parent directory of 
the metadata file
+String parentDirString = parentDir.toUri().getPath(); // string 
representation for parent directory of the metadata file
--- End diff --

Possibly it can be confused with above `parentDir`. 
What about `Path metadataParentDirPath` and `String 
metadataParentDirStringPath`?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127141#comment-16127141
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132536020
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -268,11 +276,15 @@ private FileSelection expandIfNecessary(FileSelection 
selection) throws IOExcept
 // use the cacheFileRoot if provided (e.g after partition pruning)
 Path metaFilePath = new Path(cacheFileRoot != null ? cacheFileRoot : 
selectionRoot, Metadata.METADATA_FILENAME);
 if (!fs.exists(metaFilePath)) { // no metadata cache
+  MetadataContext metaContext = selection.getMetaContext();
+  if (metaContext != null) {
+// some metadata files are absent, but some are present (since 
metadata context was created)
+metaContext.isMetadataFilesMissingOrCorrupted = true;
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127149#comment-16127149
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133142647
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/BaseTestQuery.java 
---
@@ -639,14 +644,18 @@ protected static void 
copyDirectoryIntoTempSpace(String resourcesDir, String des
* @param srcFileOnClassPath the source path of metadata cache file, 
which should be replaced
* @param destFolderInTmp  the parent folder name of the metadata cache 
file
* @param metaFileName the name of metadata cache file depending on the 
type of the metadata
+   * @param customStringReplacement custom string to replace the 
"CUSTOM_REPLACED" target string in metadata file
* @throws IOException if a create or write errors occur
*/
-  protected static void 
copyMetaDataCacheToTempReplacingInternalPaths(String srcFileOnClassPath, String 
destFolderInTmp,
-  String metaFileName) throws IOException {
+  protected static void copyMetaDataCacheToTempWithReplacements(String 
srcFileOnClassPath,
+  String destFolderInTmp, String metaFileName, String 
customStringReplacement) throws IOException {
 String metadataFileContents = getFile(srcFileOnClassPath);
 Path rootMeta = new Path(dfsTestTmpSchemaLocation, destFolderInTmp);
 Path newMetaCache = new Path(rootMeta, metaFileName);
 FSDataOutputStream outSteam = fs.create(newMetaCache);
+if (customStringReplacement != null) {
+  metadataFileContents = 
metadataFileContents.replace("CUSTOM_STRING_REPLACEMENT", 
customStringReplacement);
+}
 outSteam.writeBytes(metadataFileContents.replace("REPLACED_IN_TEST", 
dfsTestTmpSchemaLocation));
 outSteam.close();
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127136#comment-16127136
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132534589
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
 ---
@@ -222,15 +222,17 @@ public DrillTable isReadable(DrillFileSystem fs, 
FileSelection selection,
   // create a metadata context that will be used for the duration 
of the query for this table
   MetadataContext metaContext = new MetadataContext();
 
-  ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, 
dirMetaPath.toString(), metaContext, formatConfig);
-  if (mDirs.getDirectories().size() > 0) {
-FileSelection dirSelection = 
FileSelection.createFromDirectories(mDirs.getDirectories(), selection,
-selection.getSelectionRoot() /* cacheFileRoot initially 
points to selectionRoot */);
-dirSelection.setExpandedPartial();
-dirSelection.setMetaContext(metaContext);
-
-return new DynamicDrillTable(fsPlugin, storageEngineName, 
userName,
-new FormatSelection(plugin.getConfig(), dirSelection));
+  ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, 
dirMetaPath, metaContext, formatConfig);
+  if (mDirs != null) {
+if (mDirs.getDirectories().size() > 0) {
--- End diff --

Missed that. Thanks.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127144#comment-16127144
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133139901
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -974,8 +1005,17 @@ private int updateRowGroupInfo(long maxRecords) {
   public ParquetGroupScan clone(FileSelection selection) throws 
IOException {
 ParquetGroupScan newScan = new ParquetGroupScan(this);
 newScan.modifyFileSelection(selection);
-newScan.setCacheFileRoot(selection.cacheFileRoot);
-newScan.init(selection.getMetaContext());
+MetadataContext metaContext = selection.getMetaContext();
+if (metaContext == null) {
+  metaContext = new MetadataContext();
+}
+if (newScan.usedMetadataCache) {
--- End diff --

`usedMetadataCache` isn't used anymore there.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127142#comment-16127142
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132652665
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -209,9 +211,10 @@ public ParquetGroupScan( //
 this.entries = Lists.newArrayList();
 if (fileSelection.getMetaContext() != null &&
 (fileSelection.getMetaContext().getPruneStatus() == 
PruneStatus.NOT_STARTED ||
-  fileSelection.getMetaContext().getPruneStatus() == 
PruneStatus.NOT_PRUNED)) {
-  // if pruning was not applicable or was attempted and nothing was 
pruned, initialize the
-  // entries with just the selection root instead of the fully 
expanded list to reduce overhead.
+  fileSelection.getMetaContext().getPruneStatus() == 
PruneStatus.NOT_PRUNED ||
+
fileSelection.getMetaContext().isMetadataFilesMissingOrCorrupted)) {
--- End diff --

Makes sense. I've factored out checks into the new 
'checkIfPruningWasApplicable()' method.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127147#comment-16127147
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133141382
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -1123,4 +1163,22 @@ public GroupScan applyFilter(LogicalExpression 
filterExpr, UdfUtilities udfUtili
 }
   }
 
+  /**
+   * Ignore expanding selection, if metadata is corrupted
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127148#comment-16127148
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133157793
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -633,13 +645,22 @@ public long getRowCount() {
* @throws UserException when the updated selection is empty, this 
happens if the user selects an empty folder.
*/
   private FileSelection
-  initFromMetadataCache(FileSelection selection, Path metaFilePath) throws 
IOException {
+  expandSelectionFromMetadataCache(FileSelection selection, Path 
metaFilePath) throws IOException {
 // get the metadata for the root directory by reading the metadata file
 // parquetTableMetadata contains the metadata for all files in the 
selection root folder, but we need to make sure
 // we only select the files that are part of selection (by setting 
fileSet appropriately)
 
+MetadataContext metaContext = selection.getMetaContext();
+// make sure that a metadata context is created since we are going to 
use metadata caching
+if (metaContext == null) {
+  metaContext = new MetadataContext();
+  selection.setMetaContext(metaContext);
+}
 // get (and set internal field) the metadata for the directory by 
reading the metadata file
-this.parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaFilePath.toString(), selection.getMetaContext(), formatConfig);
+parquetTableMetadata = Metadata.readBlockMeta(fs, metaFilePath, 
metaContext, formatConfig);
+if (ignoreExpandingSelection(parquetTableMetadata, selection)) {
+  return selection;
+}
 if (formatConfig.autoCorrectCorruptDates) {
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127145#comment-16127145
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132533745
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.ImmutableSortedSet;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.SortedSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+public class MetadataVersion implements Comparable {
+
+  private static final String FORMAT = "v?((?!0)\\d+)(\\.(\\d+))?";
+  private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+  private final int major;
+  private final int minor;
+
+  public MetadataVersion(int major, int minor) {
+this.major = major;
+this.minor = minor;
+  }
+
+  public MetadataVersion(String metadataVersion) {
+Matcher matcher = PATTERN.matcher(metadataVersion);
+if (!matcher.matches()) {
+  DrillRuntimeException.format("Could not parse metadata version '%s' 
using format '%s'", metadataVersion, FORMAT);
+}
+this.major = Integer.parseInt(matcher.group(1));
+this.minor = matcher.group(3) != null ? 
Integer.parseInt(matcher.group(3)) : 0;
+  }
+
+  public int getMajor() {
+return major;
+  }
+
+  public int getMinor() {
+return minor;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) {
+  return true;
+}
+if (!(o instanceof MetadataVersion)) {
+  return false;
+}
+MetadataVersion that = (MetadataVersion) o;
+return this.major == that.major
+&& this.minor == that.minor;
+  }
+
+  @Override
+  public int hashCode() {
+int result = major;
+result = 31 * result + minor;
+return result;
+  }
+
+  /**
+   * @return string representation of the metadata file version, for 
example: "v1", "v10", "v4.13"
+   * 
+   * String metadata version consists of the following characters:
+   * optional "v" letter,
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127143#comment-16127143
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132432541
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java 
---
@@ -188,6 +188,10 @@ public void setExpandedFully() {
 this.dirStatus = StatusType.EXPANDED_FULLY;
   }
 
+  public void revertDirStatuses() {
--- End diff --

The method is not used any more


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127140#comment-16127140
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132517116
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,25 +140,62 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (ignoreReadingMetadata(metaContext, path)) {
+  return null;
+}
 Metadata metadata = new Metadata(fs, formatConfig);
 metadata.readBlockMeta(path, false, metaContext);
 return metadata.parquetTableMetadata;
   }
 
-  public static ParquetTableMetadataDirs readMetadataDirs(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
+  /**
+   * Get the parquet metadata for all subdirectories by reading the 
metadata file
+   *
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127146#comment-16127146
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133143191
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -452,33 +458,202 @@ public void testMoveCache() throws Exception {
 
   @Test
   public void testMetadataCacheAbsolutePaths() throws Exception {
+final String absolutePathsMetadata = "absolute_paths_metadata";
 try {
   test("use dfs_test.tmp");
-  final String relative_path_metadata_t1 = RELATIVE_PATHS_METADATA + 
"/t1";
-  final String relative_path_metadata_t2 = RELATIVE_PATHS_METADATA + 
"/t2";
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t1);
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t2);
+  // creating two inner directories to leverage 
METADATA_DIRECTORIES_FILENAME metadata file as well
+  final String absolutePathsMetadataT1 = absolutePathsMetadata + "/t1";
+  final String absolutePathsMetadataT2 = absolutePathsMetadata + "/t2";
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT1);
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT2);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_directories_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_DIRECTORIES_FILENAME);
+  "metadata_directories_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_DIRECTORIES_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
relative_path_metadata_t1, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
absolutePathsMetadataT1, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
relative_path_metadata_t2, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
absolutePathsMetadataT2, Metadata.METADATA_FILENAME);
+  String query = String.format("select * from %s", 
absolutePathsMetadata);
+  int expectedRowCount = 50;
+  int expectedNumFiles = 1; // point to selectionRoot since no pruning 
is done in this query
+  int actualRowCount = testSql(query);
+  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files",
+  expectedRowCount, actualRowCount);
+  String numFilesPattern = "numFiles=" + expectedNumFiles;
+  String usedMetaPattern = "usedMetadataFile=true";
+  String cacheFileRootPattern = String.format("cacheFileRoot=%s/%s", 
getDfsTestTmpSchemaLocation(), absolutePathsMetadata);
+  PlanTestBase.testPlanMatchingPatterns(query, new 
String[]{numFilesPattern, usedMetaPattern, cacheFileRootPattern},
+  new String[] {"Filter"});
+} finally {
+  test("drop table if exists %s", absolutePathsMetadata);
+}
+  }
 
-  int rowCount = testSql(String.format("select * from %s", 
RELATIVE_PATHS_METADATA));
-  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files", 50, rowCount);
+  @Test
+  public void testSpacesInMetadataCachePath() throws Exception {
+final String pathWithSpaces = "path with spaces";
+try {
+  test("use dfs_test.tmp");
+  // creating multilevel table to store path with spaces in both 
metadata files (METADATA and METADATA_DIRECTORIES)
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
pathWithSpaces);
+  test("create table `%1$s/%1$s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("refresh table metadata `%s`", pathWithSpaces);
+  checkForMetadataFile(pathWithSpaces);
+  String query = 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127150#comment-16127150
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133143486
  
--- Diff: 
exec/java-exec/src/test/resources/parquet/unsupported_metadata/corrupted_metadata.requires_replace.txt
 ---
@@ -0,0 +1,41 @@
+{
+  "metadata_version" : "v3",
+  "columnTypeInfo"
+
+  CORRUPTED DATA THAT LEADS TO JsonParseException
+
+"path" : "0_0_0.parquet",
+"length" : 2424,
+"rowGroups" : [ {
+  "start" : 4,
+  "length" : 1802,
+  "rowCount" : 25,
+  "hostAffinity" : {
+"localhost" : 1.0
+  },
+  "columns" : [ {
+"name" : [ "n_nationkey" ],
+"minValue" : 0,
+"maxValue" : 24,
+"nulls" : 0
+  }, {
+"name" : [ "n_name" ],
+"minValue" : "ALGERIA",
+"maxValue" : "VIETNAM",
+"nulls" : 0
+  }, {
+"name" : [ "n_regionkey" ],
+"minValue" : 0,
+"maxValue" : 4,
+"nulls" : 0
+  }, {
+"name" : [ "n_comment" ],
+"minValue" : " haggle. carefully final deposits detect slyly agai",
+"maxValue" : "y final packages. slow foxes cajole quickly. quickly 
silent platelets breach ironic accounts. unusual pinto be",
+"nulls" : 0
+  } ]
+} ]
+  } ],
+  "directories" : [ ],
+  "drillVersion" : "1.11.0-SNAPSHOT"
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127151#comment-16127151
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r133159772
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -268,11 +276,15 @@ private FileSelection expandIfNecessary(FileSelection 
selection) throws IOExcept
 // use the cacheFileRoot if provided (e.g after partition pruning)
 Path metaFilePath = new Path(cacheFileRoot != null ? cacheFileRoot : 
selectionRoot, Metadata.METADATA_FILENAME);
 if (!fs.exists(metaFilePath)) { // no metadata cache
+  MetadataContext metaContext = selection.getMetaContext();
--- End diff --

Started using metadata context as a ParquetGroupScan field.
The complexity was with initializing entries with selection root. There was 
a check when metadata context is  null. The fact that it was not null only in 
case, when metadata directories file is present and using (in that case 
metadata context was created).
Added method with javadoc explaining in what cases entries are set as 
selection root.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127139#comment-16127139
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132515388
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +42,9 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  // Setting this value as true allows to avoid double reading of 
corrupted, unsupported or missing metadata files
+  public boolean isMetadataFilesMissingOrCorrupted;
--- End diff --

Done.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127137#comment-16127137
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r13314
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -974,8 +1005,17 @@ private int updateRowGroupInfo(long maxRecords) {
   public ParquetGroupScan clone(FileSelection selection) throws 
IOException {
 ParquetGroupScan newScan = new ParquetGroupScan(this);
 newScan.modifyFileSelection(selection);
-newScan.setCacheFileRoot(selection.cacheFileRoot);
-newScan.init(selection.getMetaContext());
+MetadataContext metaContext = selection.getMetaContext();
+if (metaContext == null) {
+  metaContext = new MetadataContext();
+}
+if (newScan.usedMetadataCache) {
+  newScan.setCacheFileRoot(selection.cacheFileRoot);
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121421#comment-16121421
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132405168
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.ImmutableSortedSet;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.SortedSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+public class MetadataVersion implements Comparable {
+
+  private static final String FORMAT = "v?((?!0)\\d+)(\\.(\\d+))?";
+  private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+  private final int major;
+  private final int minor;
+
+  public MetadataVersion(int major, int minor) {
+this.major = major;
+this.minor = minor;
+  }
+
+  public MetadataVersion(String metadataVersion) {
+Matcher matcher = PATTERN.matcher(metadataVersion);
+if (!matcher.matches()) {
+  DrillRuntimeException.format("Could not parse metadata version '%s' 
using format '%s'", metadataVersion, FORMAT);
+}
+this.major = Integer.parseInt(matcher.group(1));
+this.minor = matcher.group(3) != null ? 
Integer.parseInt(matcher.group(3)) : 0;
+  }
+
+  public int getMajor() {
+return major;
+  }
+
+  public int getMinor() {
+return minor;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) {
+  return true;
+}
+if (!(o instanceof MetadataVersion)) {
+  return false;
+}
+MetadataVersion that = (MetadataVersion) o;
+return this.major == that.major
+&& this.minor == that.minor;
+  }
+
+  @Override
+  public int hashCode() {
+int result = major;
+result = 31 * result + minor;
+return result;
+  }
+
+  /**
+   * @return string representation of the metadata file version, for 
example: "v1", "v10", "v4.13"
+   * 
+   * String metadata version consists of the following characters:
+   * optional "v" letter,
--- End diff --

In our case `toString()` method won't return result with "v", so I guess we 
can remove this sentence.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121408#comment-16121408
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132413275
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -268,11 +276,15 @@ private FileSelection expandIfNecessary(FileSelection 
selection) throws IOExcept
 // use the cacheFileRoot if provided (e.g after partition pruning)
 Path metaFilePath = new Path(cacheFileRoot != null ? cacheFileRoot : 
selectionRoot, Metadata.METADATA_FILENAME);
 if (!fs.exists(metaFilePath)) { // no metadata cache
+  MetadataContext metaContext = selection.getMetaContext();
--- End diff --

It seems you set metadata context, when there can be the cases when it is 
still null? Also can you please explain why there are so many checks in the 
code if meta context is null, can we make sure it always exists?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121416#comment-16121416
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132414700
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -519,17 +564,16 @@ private void writeFile(ParquetTableMetadataDirs 
parquetTableMetadataDirs, Path p
   /**
* Read the parquet metadata from a file
*
-   * @param path
-   * @return
-   * @throws IOException
+   * @param path to metadata file
+   * @param dirsOnly true for {@link 
Metadata#METADATA_DIRECTORIES_FILENAME}
+   * or false for {@link Metadata#METADATA_FILENAME} files 
reading
+   * @param metaContext current metadata context
+   * @throws IOException if metadata file can't be read or updated
*/
-  private void readBlockMeta(String path,
-  boolean dirsOnly,
-  MetadataContext metaContext) throws IOException {
+  private void readBlockMeta(Path path, boolean dirsOnly, MetadataContext 
metaContext) {
 Stopwatch timer = Stopwatch.createStarted();
-Path p = new Path(path);
-Path parentDir = Path.getPathWithoutSchemeAndAuthority(p.getParent()); 
// parent directory of the metadata file
-String parentDirString = parentDir.toUri().toString(); // string 
representation for parent directory of the metadata file
+Path parentDir = 
Path.getPathWithoutSchemeAndAuthority(path.getParent()); // parent directory of 
the metadata file
+String parentDirString = parentDir.toUri().getPath(); // string 
representation for parent directory of the metadata file
--- End diff --

`parentDirPath`


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121413#comment-16121413
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132410703
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -974,8 +1005,17 @@ private int updateRowGroupInfo(long maxRecords) {
   public ParquetGroupScan clone(FileSelection selection) throws 
IOException {
 ParquetGroupScan newScan = new ParquetGroupScan(this);
 newScan.modifyFileSelection(selection);
-newScan.setCacheFileRoot(selection.cacheFileRoot);
-newScan.init(selection.getMetaContext());
+MetadataContext metaContext = selection.getMetaContext();
+if (metaContext == null) {
+  metaContext = new MetadataContext();
+}
+if (newScan.usedMetadataCache) {
--- End diff --

Please add / use getter.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121412#comment-16121412
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132407801
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
 ---
@@ -222,15 +222,17 @@ public DrillTable isReadable(DrillFileSystem fs, 
FileSelection selection,
   // create a metadata context that will be used for the duration 
of the query for this table
   MetadataContext metaContext = new MetadataContext();
 
-  ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, 
dirMetaPath.toString(), metaContext, formatConfig);
-  if (mDirs.getDirectories().size() > 0) {
-FileSelection dirSelection = 
FileSelection.createFromDirectories(mDirs.getDirectories(), selection,
-selection.getSelectionRoot() /* cacheFileRoot initially 
points to selectionRoot */);
-dirSelection.setExpandedPartial();
-dirSelection.setMetaContext(metaContext);
-
-return new DynamicDrillTable(fsPlugin, storageEngineName, 
userName,
-new FormatSelection(plugin.getConfig(), dirSelection));
+  ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, 
dirMetaPath, metaContext, formatConfig);
+  if (mDirs != null) {
+if (mDirs.getDirectories().size() > 0) {
--- End diff --

Can we combine two `if` statements into one?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121420#comment-16121420
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132405399
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -268,11 +276,15 @@ private FileSelection expandIfNecessary(FileSelection 
selection) throws IOExcept
 // use the cacheFileRoot if provided (e.g after partition pruning)
 Path metaFilePath = new Path(cacheFileRoot != null ? cacheFileRoot : 
selectionRoot, Metadata.METADATA_FILENAME);
 if (!fs.exists(metaFilePath)) { // no metadata cache
+  MetadataContext metaContext = selection.getMetaContext();
+  if (metaContext != null) {
+// some metadata files are absent, but some are present (since 
metadata context was created)
+metaContext.isMetadataFilesMissingOrCorrupted = true;
--- End diff --

Please use setter & getter instead of direct assignment / access.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121411#comment-16121411
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132409360
  
--- Diff: 
exec/java-exec/src/test/resources/parquet/unsupported_metadata/corrupted_metadata.requires_replace.txt
 ---
@@ -0,0 +1,41 @@
+{
+  "metadata_version" : "v3",
+  "columnTypeInfo"
+
+  CORRUPTED DATA THAT LEADS TO JsonParseException
+
+"path" : "0_0_0.parquet",
+"length" : 2424,
+"rowGroups" : [ {
+  "start" : 4,
+  "length" : 1802,
+  "rowCount" : 25,
+  "hostAffinity" : {
+"localhost" : 1.0
+  },
+  "columns" : [ {
+"name" : [ "n_nationkey" ],
+"minValue" : 0,
+"maxValue" : 24,
+"nulls" : 0
+  }, {
+"name" : [ "n_name" ],
+"minValue" : "ALGERIA",
+"maxValue" : "VIETNAM",
+"nulls" : 0
+  }, {
+"name" : [ "n_regionkey" ],
+"minValue" : 0,
+"maxValue" : 4,
+"nulls" : 0
+  }, {
+"name" : [ "n_comment" ],
+"minValue" : " haggle. carefully final deposits detect slyly agai",
+"maxValue" : "y final packages. slow foxes cajole quickly. quickly 
silent platelets breach ironic accounts. unusual pinto be",
+"nulls" : 0
+  } ]
+} ]
+  } ],
+  "directories" : [ ],
+  "drillVersion" : "1.11.0-SNAPSHOT"
--- End diff --

May be to Drill version without snapshot here and below?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121417#comment-16121417
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132408041
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +42,9 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  // Setting this value as true allows to avoid double reading of 
corrupted, unsupported or missing metadata files
+  public boolean isMetadataFilesMissingOrCorrupted;
--- End diff --

Please make private and add setter and getter.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121415#comment-16121415
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132408506
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,25 +140,62 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (ignoreReadingMetadata(metaContext, path)) {
+  return null;
+}
 Metadata metadata = new Metadata(fs, formatConfig);
 metadata.readBlockMeta(path, false, metaContext);
 return metadata.parquetTableMetadata;
   }
 
-  public static ParquetTableMetadataDirs readMetadataDirs(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
+  /**
+   * Get the parquet metadata for all subdirectories by reading the 
metadata file
+   *
--- End diff --

Please add note explaining the case when we would return null value.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121410#comment-16121410
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132415968
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -1123,4 +1163,22 @@ public GroupScan applyFilter(LogicalExpression 
filterExpr, UdfUtilities udfUtili
 }
   }
 
+  /**
+   * Ignore expanding selection, if metadata is corrupted
--- End diff --

Please add comment that you not only check should we ignore expanding 
selection but also setting `parquetTableMetadata` and `fileSet` and 
`selection.revertDirStatuses()`.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121414#comment-16121414
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132408285
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java 
---
@@ -188,6 +188,10 @@ public void setExpandedFully() {
 this.dirStatus = StatusType.EXPANDED_FULLY;
   }
 
+  public void revertDirStatuses() {
--- End diff --

Can you please add javadoc explaining cases when we would revert directory 
statuses.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121419#comment-16121419
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132411564
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -633,13 +645,22 @@ public long getRowCount() {
* @throws UserException when the updated selection is empty, this 
happens if the user selects an empty folder.
*/
   private FileSelection
-  initFromMetadataCache(FileSelection selection, Path metaFilePath) throws 
IOException {
+  expandSelectionFromMetadataCache(FileSelection selection, Path 
metaFilePath) throws IOException {
 // get the metadata for the root directory by reading the metadata file
 // parquetTableMetadata contains the metadata for all files in the 
selection root folder, but we need to make sure
 // we only select the files that are part of selection (by setting 
fileSet appropriately)
 
+MetadataContext metaContext = selection.getMetaContext();
+// make sure that a metadata context is created since we are going to 
use metadata caching
+if (metaContext == null) {
+  metaContext = new MetadataContext();
+  selection.setMetaContext(metaContext);
+}
 // get (and set internal field) the metadata for the directory by 
reading the metadata file
-this.parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaFilePath.toString(), selection.getMetaContext(), formatConfig);
+parquetTableMetadata = Metadata.readBlockMeta(fs, metaFilePath, 
metaContext, formatConfig);
+if (ignoreExpandingSelection(parquetTableMetadata, selection)) {
+  return selection;
+}
 if (formatConfig.autoCorrectCorruptDates) {
--- End diff --

Please add / use getter.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121422#comment-16121422
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132410745
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -974,8 +1005,17 @@ private int updateRowGroupInfo(long maxRecords) {
   public ParquetGroupScan clone(FileSelection selection) throws 
IOException {
 ParquetGroupScan newScan = new ParquetGroupScan(this);
 newScan.modifyFileSelection(selection);
-newScan.setCacheFileRoot(selection.cacheFileRoot);
-newScan.init(selection.getMetaContext());
+MetadataContext metaContext = selection.getMetaContext();
+if (metaContext == null) {
+  metaContext = new MetadataContext();
+}
+if (newScan.usedMetadataCache) {
+  newScan.setCacheFileRoot(selection.cacheFileRoot);
--- End diff --

Please use getter.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121418#comment-16121418
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132406973
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/BaseTestQuery.java 
---
@@ -639,14 +644,18 @@ protected static void 
copyDirectoryIntoTempSpace(String resourcesDir, String des
* @param srcFileOnClassPath the source path of metadata cache file, 
which should be replaced
* @param destFolderInTmp  the parent folder name of the metadata cache 
file
* @param metaFileName the name of metadata cache file depending on the 
type of the metadata
+   * @param customStringReplacement custom string to replace the 
"CUSTOM_REPLACED" target string in metadata file
* @throws IOException if a create or write errors occur
*/
-  protected static void 
copyMetaDataCacheToTempReplacingInternalPaths(String srcFileOnClassPath, String 
destFolderInTmp,
-  String metaFileName) throws IOException {
+  protected static void copyMetaDataCacheToTempWithReplacements(String 
srcFileOnClassPath,
+  String destFolderInTmp, String metaFileName, String 
customStringReplacement) throws IOException {
 String metadataFileContents = getFile(srcFileOnClassPath);
 Path rootMeta = new Path(dfsTestTmpSchemaLocation, destFolderInTmp);
 Path newMetaCache = new Path(rootMeta, metaFileName);
 FSDataOutputStream outSteam = fs.create(newMetaCache);
+if (customStringReplacement != null) {
+  metadataFileContents = 
metadataFileContents.replace("CUSTOM_STRING_REPLACEMENT", 
customStringReplacement);
+}
 outSteam.writeBytes(metadataFileContents.replace("REPLACED_IN_TEST", 
dfsTestTmpSchemaLocation));
 outSteam.close();
--- End diff --

Please use try-with-resources to ensure that stream will be closed in any 
case.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121407#comment-16121407
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132406353
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -452,33 +458,202 @@ public void testMoveCache() throws Exception {
 
   @Test
   public void testMetadataCacheAbsolutePaths() throws Exception {
+final String absolutePathsMetadata = "absolute_paths_metadata";
 try {
   test("use dfs_test.tmp");
-  final String relative_path_metadata_t1 = RELATIVE_PATHS_METADATA + 
"/t1";
-  final String relative_path_metadata_t2 = RELATIVE_PATHS_METADATA + 
"/t2";
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t1);
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t2);
+  // creating two inner directories to leverage 
METADATA_DIRECTORIES_FILENAME metadata file as well
+  final String absolutePathsMetadataT1 = absolutePathsMetadata + "/t1";
+  final String absolutePathsMetadataT2 = absolutePathsMetadata + "/t2";
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT1);
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT2);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_directories_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_DIRECTORIES_FILENAME);
+  "metadata_directories_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_DIRECTORIES_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
relative_path_metadata_t1, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
absolutePathsMetadataT1, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
relative_path_metadata_t2, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
absolutePathsMetadataT2, Metadata.METADATA_FILENAME);
+  String query = String.format("select * from %s", 
absolutePathsMetadata);
+  int expectedRowCount = 50;
+  int expectedNumFiles = 1; // point to selectionRoot since no pruning 
is done in this query
+  int actualRowCount = testSql(query);
+  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files",
+  expectedRowCount, actualRowCount);
+  String numFilesPattern = "numFiles=" + expectedNumFiles;
+  String usedMetaPattern = "usedMetadataFile=true";
+  String cacheFileRootPattern = String.format("cacheFileRoot=%s/%s", 
getDfsTestTmpSchemaLocation(), absolutePathsMetadata);
+  PlanTestBase.testPlanMatchingPatterns(query, new 
String[]{numFilesPattern, usedMetaPattern, cacheFileRootPattern},
+  new String[] {"Filter"});
+} finally {
+  test("drop table if exists %s", absolutePathsMetadata);
+}
+  }
 
-  int rowCount = testSql(String.format("select * from %s", 
RELATIVE_PATHS_METADATA));
-  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files", 50, rowCount);
+  @Test
+  public void testSpacesInMetadataCachePath() throws Exception {
+final String pathWithSpaces = "path with spaces";
+try {
+  test("use dfs_test.tmp");
+  // creating multilevel table to store path with spaces in both 
metadata files (METADATA and METADATA_DIRECTORIES)
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
pathWithSpaces);
+  test("create table `%1$s/%1$s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("refresh table metadata `%s`", pathWithSpaces);
+  checkForMetadataFile(pathWithSpaces);
+  String query = 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121409#comment-16121409
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132414323
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -209,9 +211,10 @@ public ParquetGroupScan( //
 this.entries = Lists.newArrayList();
 if (fileSelection.getMetaContext() != null &&
 (fileSelection.getMetaContext().getPruneStatus() == 
PruneStatus.NOT_STARTED ||
-  fileSelection.getMetaContext().getPruneStatus() == 
PruneStatus.NOT_PRUNED)) {
-  // if pruning was not applicable or was attempted and nothing was 
pruned, initialize the
-  // entries with just the selection root instead of the fully 
expanded list to reduce overhead.
+  fileSelection.getMetaContext().getPruneStatus() == 
PruneStatus.NOT_PRUNED ||
+
fileSelection.getMetaContext().isMetadataFilesMissingOrCorrupted)) {
--- End diff --

There are 4 checks in `if` statement, may be we can factor out them in 
separate method?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118183#comment-16118183
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on the issue:

https://github.com/apache/drill/pull/877
  
@arina-ielchiieva Small fix was made to resolve some regression tests 
failings. The branch is rebased to the master version.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by using special converting)
> For the second case only annotation for the last metadata structure can be 
> updated.
> *Summary*
> 1. Drill will read and use metadata files if files are valid, all present and 
> supported. Under supported we 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116940#comment-16116940
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vvysotskyi commented on the issue:

https://github.com/apache/drill/pull/805
  
Currently, I am waiting when DRILL-5660 is merged. I will add a comment 
when it is ready for another review. 


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of deserializing metadata files of any version into 
> one strucure by using special converting)
> For the second case only annotation for the last metadata structure can be 
> updated.
> *Summary*
> 1. Drill will read and use metadata files if files are valid, all present and 
> supported. Under supported we mean that 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113341#comment-16113341
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r131164453
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.ImmutableSortedSet;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.NavigableSet;
+
+/**
+ * Supported metadata versions.
+ * 
+ * String metadata version consists of the following characters:
+ * optional "v" letter,
+ * major metadata version (any number of digits),
+ * optional "." delimiter (used if minor metadata version is 
specified),
+ * minor metadata version (one digit number)
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  /**
+   * Version 1: Introduces parquet file metadata caching.
+   * See DRILL-2743
+   */
+  public static final String V1 = "v1";
+  /**
+   * Version 2: Metadata cache file size is reduced.
+   * See DRILL-4053
+   */
+  public static final String V2 = "v2";
+  /**
+   * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+   * Filter pushdown for Parquet is implemented. 
+   * See DRILL-1950
+   */
+  public static final String V3 = "v3";
+  /**
+   * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones. Metadata version value
+   * doesn't contain `v` letter
+   * See DRILL-3867, DRILL-5660
+   */
+  public static final String V3_1 = "3.1";
+
+  /**
+   * Helper method to parse string metadata version into float.
+   *
+   * @param stringVersion text metadata version
+   * @return parsed Float metadata version
+   */
+  public static Float parseStringMetadataVersion(String stringVersion) {
+try {
+  if (stringVersion.contains(".") && 
stringVersion.split("\\.")[1].length() != 1) {
+throw new DrillRuntimeException("Minor metadata version shouldn't 
be greater than 9 or contain more than one digit");
+  }
+  return stringVersion.charAt(0) == 'v' ? 
Float.valueOf(stringVersion.substring(1)) : Float.valueOf(stringVersion);
+} catch (Exception e) {
+  throw new DrillRuntimeException(String.format("Could not parse 
metadata version '%s'", stringVersion), e);
+}
+  }
+
+  /**
+   * All historical versions of the Drill metadata cache files
+   */
+  public static final NavigableSet SUPPORTED_VERSIONS = 
ImmutableSortedSet.of(
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112542#comment-16112542
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/877
  
@vdiravka 
limiting minor version to 9 is breaking the whole concept of major and 
minor versions usage. Since when we reach minor max we'll need to bump up the 
major though structure changes did not occur.
If using float is not working for major and minor version (I believe you 
having trouble comparing 1.1 and 1.10) then let's leave our `MetadataVersion` 
class, just get rid of `v` in version name.



> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a new metadata structure (class) should be created 
> (possible an improvement of 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112540#comment-16112540
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r131101557
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.ImmutableSortedSet;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.NavigableSet;
+
+/**
+ * Supported metadata versions.
+ * 
+ * String metadata version consists of the following characters:
+ * optional "v" letter,
+ * major metadata version (any number of digits),
+ * optional "." delimiter (used if minor metadata version is 
specified),
+ * minor metadata version (one digit number)
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  /**
+   * Version 1: Introduces parquet file metadata caching.
+   * See DRILL-2743
+   */
+  public static final String V1 = "v1";
+  /**
+   * Version 2: Metadata cache file size is reduced.
+   * See DRILL-4053
+   */
+  public static final String V2 = "v2";
+  /**
+   * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+   * Filter pushdown for Parquet is implemented. 
+   * See DRILL-1950
+   */
+  public static final String V3 = "v3";
+  /**
+   * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones. Metadata version value
+   * doesn't contain `v` letter
+   * See DRILL-3867, DRILL-5660
+   */
+  public static final String V3_1 = "3.1";
+
+  /**
+   * Helper method to parse string metadata version into float.
+   *
+   * @param stringVersion text metadata version
+   * @return parsed Float metadata version
+   */
+  public static Float parseStringMetadataVersion(String stringVersion) {
+try {
+  if (stringVersion.contains(".") && 
stringVersion.split("\\.")[1].length() != 1) {
+throw new DrillRuntimeException("Minor metadata version shouldn't 
be greater than 9 or contain more than one digit");
+  }
+  return stringVersion.charAt(0) == 'v' ? 
Float.valueOf(stringVersion.substring(1)) : Float.valueOf(stringVersion);
+} catch (Exception e) {
+  throw new DrillRuntimeException(String.format("Could not parse 
metadata version '%s'", stringVersion), e);
+}
+  }
+
+  /**
+   * All historical versions of the Drill metadata cache files
+   */
+  public static final NavigableSet SUPPORTED_VERSIONS = 
ImmutableSortedSet.of(
--- End diff --

You can use `SortedSet`, more abstract implementation.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111712#comment-16111712
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r130992403
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -424,32 +432,121 @@ public void testMoveCache() throws Exception {
 
   @Test
   public void testMetadataCacheAbsolutePaths() throws Exception {
+final String absolutePathsMetadata = "absolute_paths_metadata";
 try {
   test("use dfs_test.tmp");
-  final String relative_path_metadata_t1 = RELATIVE_PATHS_METADATA + 
"/t1";
-  final String relative_path_metadata_t2 = RELATIVE_PATHS_METADATA + 
"/t2";
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t1);
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t2);
+  // creating two inner directories to leverage 
METADATA_DIRECTORIES_FILENAME metadata file as well
+  final String absolutePathsMetadataT1 = absolutePathsMetadata + "/t1";
+  final String absolutePathsMetadataT2 = absolutePathsMetadata + "/t2";
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT1);
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT2);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_directories_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_DIRECTORIES_FILENAME);
+  "metadata_directories_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_DIRECTORIES_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
relative_path_metadata_t1, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
absolutePathsMetadataT1, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
relative_path_metadata_t2, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
absolutePathsMetadataT2, Metadata.METADATA_FILENAME);
+  String query = String.format("select * from %s", 
absolutePathsMetadata);
+  int expectedRowCount = 50;
+  int expectedNumFiles = 1; // point to selectionRoot since no pruning 
is done in this query
+  int actualRowCount = testSql(query);
+  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files",
+  expectedRowCount, actualRowCount);
+  String numFilesPattern = "numFiles=" + expectedNumFiles;
+  String usedMetaPattern = "usedMetadataFile=true";
+  String cacheFileRootPattern = String.format("cacheFileRoot=%s/%s", 
getDfsTestTmpSchemaLocation(), absolutePathsMetadata);
+  PlanTestBase.testPlanMatchingPatterns(query, new 
String[]{numFilesPattern, usedMetaPattern, cacheFileRootPattern},
+  new String[] {"Filter"});
+} finally {
+  test("drop table if exists %s", absolutePathsMetadata);
+}
+  }
 
-  int rowCount = testSql(String.format("select * from %s", 
RELATIVE_PATHS_METADATA));
-  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files", 50, rowCount);
+  @Test
+  public void testSpacesInMetadataCachePath() throws Exception {
+final String pathWithSpaces = "path with spaces";
+try {
+  // creating multilevel table to store path with spaces in both 
metadata files (METADATA and METADATA_DIRECTORIES)
+  test("create table dfs_test.tmp.`%s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("create table dfs_test.tmp.`%1$s/%1$s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("refresh table metadata dfs_test.tmp.`%s`", pathWithSpaces);
+  checkForMetadataFile(pathWithSpaces);
+  String query = 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111713#comment-16111713
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r130990507
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -495,14 +495,14 @@ public void testFutureUnsupportedMetadataVersion() 
throws Exception {
 try {
   test("use dfs_test.tmp");
   test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
unsupportedMetadataVersion);
-  String lastVersion = 
Iterables.getLast(MetadataVersion.Constants.SUPPORTED_VERSIONS);
-  for (String supportedVersion : 
MetadataVersion.Constants.SUPPORTED_VERSIONS) {
-if (new MetadataVersion(lastVersion).compareTo(new 
MetadataVersion(supportedVersion)) < 0) {
+  MetadataVersion lastVersion = 
Iterables.getLast(MetadataVersion.Constants.SUPPORTED_VERSIONS);
--- End diff --

Added `ImmutableSortedSet` in the last commit. It allows do not worry about 
order of elements.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111711#comment-16111711
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r130973130
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.ImmutableList;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+public class MetadataVersion implements Comparable {
+
+  private static final String FORMAT = "v((?!0)\\d+)\\.?((?!0)\\d+)?";
--- End diff --

Sounds good.  Implemented. 
Just one restriction - the last minor version is 9, after that increase 
major version. 


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111710#comment-16111710
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r130991076
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -495,14 +495,14 @@ public void testFutureUnsupportedMetadataVersion() 
throws Exception {
 try {
   test("use dfs_test.tmp");
   test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
unsupportedMetadataVersion);
-  String lastVersion = 
Iterables.getLast(MetadataVersion.Constants.SUPPORTED_VERSIONS);
-  for (String supportedVersion : 
MetadataVersion.Constants.SUPPORTED_VERSIONS) {
-if (new MetadataVersion(lastVersion).compareTo(new 
MetadataVersion(supportedVersion)) < 0) {
+  MetadataVersion lastVersion = 
Iterables.getLast(MetadataVersion.Constants.SUPPORTED_VERSIONS);
+  for (MetadataVersion supportedVersion : 
MetadataVersion.Constants.SUPPORTED_VERSIONS) {
+if (lastVersion.compareTo(supportedVersion) < 0) {
   lastVersion = supportedVersion;
 }
   }
   // Get the future version, which is absent in 
MetadataVersion.SUPPORTED_VERSIONS list
-  String futureVersion = "v" + 
(Integer.parseInt(String.valueOf(lastVersion.charAt(1))) + 1);
+  String futureVersion = "v" + 
(Integer.parseInt(String.valueOf(lastVersion.toString().charAt(1))) + 1);
--- End diff --

This code was replaced since float metadata versions are introduced.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105624#comment-16105624
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r130179724
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.ImmutableList;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+public class MetadataVersion implements Comparable {
+
+  private static final String FORMAT = "v((?!0)\\d+)\\.?((?!0)\\d+)?";
--- End diff --

To avoid having to parse the "v", might as well just say that the version 
is a number. Define that we increment the major number (before the decimal 
point) if the JSON structure changes, we increment the minor number (after the 
decimal point) of the format is unchanged, but the semantics change (as in the 
absolute/relative path change.) In general, convert the version to a float and 
just do normal numeric comparison to determine if the file version is earlier, 
same as, or later than the code version.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104819#comment-16104819
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r130065442
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -495,14 +495,14 @@ public void testFutureUnsupportedMetadataVersion() 
throws Exception {
 try {
   test("use dfs_test.tmp");
   test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
unsupportedMetadataVersion);
-  String lastVersion = 
Iterables.getLast(MetadataVersion.Constants.SUPPORTED_VERSIONS);
-  for (String supportedVersion : 
MetadataVersion.Constants.SUPPORTED_VERSIONS) {
-if (new MetadataVersion(lastVersion).compareTo(new 
MetadataVersion(supportedVersion)) < 0) {
+  MetadataVersion lastVersion = 
Iterables.getLast(MetadataVersion.Constants.SUPPORTED_VERSIONS);
+  for (MetadataVersion supportedVersion : 
MetadataVersion.Constants.SUPPORTED_VERSIONS) {
+if (lastVersion.compareTo(supportedVersion) < 0) {
   lastVersion = supportedVersion;
 }
   }
   // Get the future version, which is absent in 
MetadataVersion.SUPPORTED_VERSIONS list
-  String futureVersion = "v" + 
(Integer.parseInt(String.valueOf(lastVersion.charAt(1))) + 1);
+  String futureVersion = "v" + 
(Integer.parseInt(String.valueOf(lastVersion.toString().charAt(1))) + 1);
--- End diff --

`String futureVersion = new MedataVersion(lastVersion.getMajor() + 1, 
0).toString();`
Just add getters for major and minor versions to `MedataVersion` class 
first.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104821#comment-16104821
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r130068040
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -424,32 +432,121 @@ public void testMoveCache() throws Exception {
 
   @Test
   public void testMetadataCacheAbsolutePaths() throws Exception {
+final String absolutePathsMetadata = "absolute_paths_metadata";
 try {
   test("use dfs_test.tmp");
-  final String relative_path_metadata_t1 = RELATIVE_PATHS_METADATA + 
"/t1";
-  final String relative_path_metadata_t2 = RELATIVE_PATHS_METADATA + 
"/t2";
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t1);
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t2);
+  // creating two inner directories to leverage 
METADATA_DIRECTORIES_FILENAME metadata file as well
+  final String absolutePathsMetadataT1 = absolutePathsMetadata + "/t1";
+  final String absolutePathsMetadataT2 = absolutePathsMetadata + "/t2";
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT1);
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT2);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_directories_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_DIRECTORIES_FILENAME);
+  "metadata_directories_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_DIRECTORIES_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
relative_path_metadata_t1, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
absolutePathsMetadataT1, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
relative_path_metadata_t2, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
absolutePathsMetadataT2, Metadata.METADATA_FILENAME);
+  String query = String.format("select * from %s", 
absolutePathsMetadata);
+  int expectedRowCount = 50;
+  int expectedNumFiles = 1; // point to selectionRoot since no pruning 
is done in this query
+  int actualRowCount = testSql(query);
+  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files",
+  expectedRowCount, actualRowCount);
+  String numFilesPattern = "numFiles=" + expectedNumFiles;
+  String usedMetaPattern = "usedMetadataFile=true";
+  String cacheFileRootPattern = String.format("cacheFileRoot=%s/%s", 
getDfsTestTmpSchemaLocation(), absolutePathsMetadata);
+  PlanTestBase.testPlanMatchingPatterns(query, new 
String[]{numFilesPattern, usedMetaPattern, cacheFileRootPattern},
+  new String[] {"Filter"});
+} finally {
+  test("drop table if exists %s", absolutePathsMetadata);
+}
+  }
 
-  int rowCount = testSql(String.format("select * from %s", 
RELATIVE_PATHS_METADATA));
-  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files", 50, rowCount);
+  @Test
+  public void testSpacesInMetadataCachePath() throws Exception {
+final String pathWithSpaces = "path with spaces";
+try {
+  // creating multilevel table to store path with spaces in both 
metadata files (METADATA and METADATA_DIRECTORIES)
+  test("create table dfs_test.tmp.`%s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("create table dfs_test.tmp.`%1$s/%1$s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("refresh table metadata dfs_test.tmp.`%s`", pathWithSpaces);
+  checkForMetadataFile(pathWithSpaces);
+  String query = 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104820#comment-16104820
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r130067083
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -495,14 +495,14 @@ public void testFutureUnsupportedMetadataVersion() 
throws Exception {
 try {
   test("use dfs_test.tmp");
   test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
unsupportedMetadataVersion);
-  String lastVersion = 
Iterables.getLast(MetadataVersion.Constants.SUPPORTED_VERSIONS);
-  for (String supportedVersion : 
MetadataVersion.Constants.SUPPORTED_VERSIONS) {
-if (new MetadataVersion(lastVersion).compareTo(new 
MetadataVersion(supportedVersion)) < 0) {
+  MetadataVersion lastVersion = 
Iterables.getLast(MetadataVersion.Constants.SUPPORTED_VERSIONS);
--- End diff --

No need to this manually:
```
List supportedVersions = new 
ArrayList(MetadataVersion.Constants.SUPPORTED_VERSIONS);
Collections.sort(supportedVersions);
MetadataVersion lastVersion = 
supportedVersions.get(supportedVersions.size() - 1);
```


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103595#comment-16103595
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129834750
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.Lists;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+class MetadataVersion implements Comparable {
+  /**
+   * String version starts from 'v' letter
+   * First group is major metadata version (any number of digits, except a 
single zero digit)
+   * Next character is optional '.' (if minor version is specified)
+   * Next group is optional, minor metadata version (any number of digits, 
except a single zero digit)
+   * Examples of correct metadata versions: v1, v10, v4.13
+   */
+  private static final String FORMAT = "v((?!0)\\d+)\\.?((?!0)\\d+)?";
+  private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+  private final int major;
+  private final int minor;
+
+  public MetadataVersion(int major, int minor) {
+this.major = major;
+this.minor = minor;
+  }
+
+  public MetadataVersion(String metadataVersion) {
+Matcher matcher = PATTERN.matcher(metadataVersion);
+if (!matcher.matches()) {
+  DrillRuntimeException.format("Could not parse metadata version '%s' 
using format '%s'", metadataVersion, FORMAT);
+}
+this.major = Integer.parseInt(matcher.group(1));
+this.minor = matcher.group(2) != null ? 
Integer.parseInt(matcher.group(2)) : 0;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) return true;
+if (!(o instanceof MetadataVersion)) {
+  return false;
+}
+MetadataVersion that = (MetadataVersion) o;
+return this.major == that.major
+&& this.minor == that.minor;
+  }
+
+  @Override
+  public int hashCode() {
+int result = major;
+result = 31 * result + minor;
+return result;
+  }
+
+  @Override
+  public String toString() {
+return minor == 0 ? String.format("v%s1", major) : 
String.format("v%s1.%s2", major, minor);
+  }
+
+  @Override
+  public int compareTo(MetadataVersion o) {
+Preconditions.checkNotNull(o);
+return ComparisonChain.start()
+.compare(this.major, o.major)
+.compare(this.minor, o.minor)
+.result();
+  }
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+  public static class Constants {
+/**
+ * Version 1: Introduces parquet file metadata caching.
+ * See DRILL-2743
+ */
+public static final String V1 = "v1";
--- End diff --

Makes sense. Also started using an `ImmutableList` here.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103598#comment-16103598
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129834322
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.Lists;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+class MetadataVersion implements Comparable {
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103597#comment-16103597
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129834482
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.Lists;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+class MetadataVersion implements Comparable {
+  /**
+   * String version starts from 'v' letter
+   * First group is major metadata version (any number of digits, except a 
single zero digit)
+   * Next character is optional '.' (if minor version is specified)
+   * Next group is optional, minor metadata version (any number of digits, 
except a single zero digit)
+   * Examples of correct metadata versions: v1, v10, v4.13
+   */
+  private static final String FORMAT = "v((?!0)\\d+)\\.?((?!0)\\d+)?";
+  private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+  private final int major;
+  private final int minor;
+
+  public MetadataVersion(int major, int minor) {
+this.major = major;
+this.minor = minor;
+  }
+
+  public MetadataVersion(String metadataVersion) {
+Matcher matcher = PATTERN.matcher(metadataVersion);
+if (!matcher.matches()) {
+  DrillRuntimeException.format("Could not parse metadata version '%s' 
using format '%s'", metadataVersion, FORMAT);
+}
+this.major = Integer.parseInt(matcher.group(1));
+this.minor = matcher.group(2) != null ? 
Integer.parseInt(matcher.group(2)) : 0;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) return true;
+if (!(o instanceof MetadataVersion)) {
+  return false;
+}
+MetadataVersion that = (MetadataVersion) o;
+return this.major == that.major
+&& this.minor == that.minor;
+  }
+
+  @Override
+  public int hashCode() {
+int result = major;
+result = 31 * result + minor;
+return result;
+  }
+
+  @Override
--- End diff --

Looks better. Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103596#comment-16103596
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129834889
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataVersion.java
 ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.junit.Test;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+
+public class TestParquetMetadataVersion {
+
+  @Test
+  public void testCorrectOnlyMajorVersion() throws Exception {
+MetadataVersion parsedMetadataVersion = new 
MetadataVersion(MetadataVersion.Constants.V1);
+MetadataVersion expectedMetadataVersion = new MetadataVersion(1, 0);
+assertEquals("Parquet metadata version is parsed incorrectly", 
expectedMetadataVersion, parsedMetadataVersion);
+  }
+
+  @Test
+  public void testCorrectMajorMinorVersions() throws Exception {
+MetadataVersion parsedMetadataVersion = new 
MetadataVersion(MetadataVersion.Constants.V3_1);
+MetadataVersion expectedMetadataVersion = new MetadataVersion(3, 1);
+assertEquals("Parquet metadata version is parsed incorrectly", 
expectedMetadataVersion, parsedMetadataVersion);
+  }
+
+  @Test
+  public void testTwoDigitNumberMajorVersion() throws Exception {
+String twoDigitNumberMajorVersion = "v10.2";
+MetadataVersion parsedMetadataVersion = new 
MetadataVersion(twoDigitNumberMajorVersion);
+MetadataVersion expectedMetadataVersion = new MetadataVersion(10, 2);
+assertEquals("Parquet metadata version is parsed incorrectly", 
expectedMetadataVersion, parsedMetadataVersion);
+  }
+
+  @Test
+  public void testCorrectTwoDigitNumberMinorVersion() throws Exception {
+String twoDigitNumberMinorVersion = "v3.10";
+MetadataVersion parsedMetadataVersion = new 
MetadataVersion(twoDigitNumberMinorVersion);
+MetadataVersion expectedMetadataVersion = new MetadataVersion(3, 10);
+assertEquals("Parquet metadata version is parsed incorrectly", 
expectedMetadataVersion, parsedMetadataVersion);
+  }
+
+  @Test(expected = DrillRuntimeException.class)
+  public void testVersionWithoutFirstLetter() throws Exception {
+String versionWithoutFirstLetter = "3.1";
+try {
+  new MetadataVersion(versionWithoutFirstLetter);
+} catch (DrillRuntimeException e) {
+  assertTrue("Not expected exception is obtained while parsing parquet 
metadata version",
+  e.getMessage().contains("Could not parse metadata"));
+  throw new DrillRuntimeException(e);
--- End diff --

Missed that. Thanks.
Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103009#comment-16103009
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129796074
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.Lists;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+class MetadataVersion implements Comparable {
+  /**
+   * String version starts from 'v' letter
+   * First group is major metadata version (any number of digits, except a 
single zero digit)
+   * Next character is optional '.' (if minor version is specified)
+   * Next group is optional, minor metadata version (any number of digits, 
except a single zero digit)
+   * Examples of correct metadata versions: v1, v10, v4.13
+   */
+  private static final String FORMAT = "v((?!0)\\d+)\\.?((?!0)\\d+)?";
+  private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+  private final int major;
+  private final int minor;
+
+  public MetadataVersion(int major, int minor) {
+this.major = major;
+this.minor = minor;
+  }
+
+  public MetadataVersion(String metadataVersion) {
+Matcher matcher = PATTERN.matcher(metadataVersion);
+if (!matcher.matches()) {
+  DrillRuntimeException.format("Could not parse metadata version '%s' 
using format '%s'", metadataVersion, FORMAT);
+}
+this.major = Integer.parseInt(matcher.group(1));
+this.minor = matcher.group(2) != null ? 
Integer.parseInt(matcher.group(2)) : 0;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) return true;
+if (!(o instanceof MetadataVersion)) {
+  return false;
+}
+MetadataVersion that = (MetadataVersion) o;
+return this.major == that.major
+&& this.minor == that.minor;
+  }
+
+  @Override
+  public int hashCode() {
+int result = major;
+result = 31 * result + minor;
+return result;
+  }
+
+  @Override
+  public String toString() {
+return minor == 0 ? String.format("v%s1", major) : 
String.format("v%s1.%s2", major, minor);
+  }
+
+  @Override
+  public int compareTo(MetadataVersion o) {
+Preconditions.checkNotNull(o);
+return ComparisonChain.start()
+.compare(this.major, o.major)
+.compare(this.minor, o.minor)
+.result();
+  }
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+  public static class Constants {
+/**
+ * Version 1: Introduces parquet file metadata caching.
+ * See DRILL-2743
+ */
+public static final String V1 = "v1";
--- End diff --

As turned out we these constants will be used as annotation attributes thus 
they MUST have constant expression.
In this case we need to make sure that developer has added correct version 
before Drill start up. Let's keep list of `MetadataVersion` instead of `String` 
thus we'll be able to validate version constant.
```
public static final List SUPPORTED_VERSIONS = new ArrayList();

static {
  SUPPORTED_VERSIONS.add(new MetadataVersion(V1));
  ...

}
```


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102949#comment-16102949
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129782338
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataVersion.java
 ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.junit.Test;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+
+public class TestParquetMetadataVersion {
+
+  @Test
+  public void testCorrectOnlyMajorVersion() throws Exception {
+MetadataVersion parsedMetadataVersion = new 
MetadataVersion(MetadataVersion.Constants.V1);
+MetadataVersion expectedMetadataVersion = new MetadataVersion(1, 0);
+assertEquals("Parquet metadata version is parsed incorrectly", 
expectedMetadataVersion, parsedMetadataVersion);
+  }
+
+  @Test
+  public void testCorrectMajorMinorVersions() throws Exception {
+MetadataVersion parsedMetadataVersion = new 
MetadataVersion(MetadataVersion.Constants.V3_1);
+MetadataVersion expectedMetadataVersion = new MetadataVersion(3, 1);
+assertEquals("Parquet metadata version is parsed incorrectly", 
expectedMetadataVersion, parsedMetadataVersion);
+  }
+
+  @Test
+  public void testTwoDigitNumberMajorVersion() throws Exception {
+String twoDigitNumberMajorVersion = "v10.2";
+MetadataVersion parsedMetadataVersion = new 
MetadataVersion(twoDigitNumberMajorVersion);
+MetadataVersion expectedMetadataVersion = new MetadataVersion(10, 2);
+assertEquals("Parquet metadata version is parsed incorrectly", 
expectedMetadataVersion, parsedMetadataVersion);
+  }
+
+  @Test
+  public void testCorrectTwoDigitNumberMinorVersion() throws Exception {
+String twoDigitNumberMinorVersion = "v3.10";
+MetadataVersion parsedMetadataVersion = new 
MetadataVersion(twoDigitNumberMinorVersion);
+MetadataVersion expectedMetadataVersion = new MetadataVersion(3, 10);
+assertEquals("Parquet metadata version is parsed incorrectly", 
expectedMetadataVersion, parsedMetadataVersion);
+  }
+
+  @Test(expected = DrillRuntimeException.class)
+  public void testVersionWithoutFirstLetter() throws Exception {
+String versionWithoutFirstLetter = "3.1";
+try {
+  new MetadataVersion(versionWithoutFirstLetter);
+} catch (DrillRuntimeException e) {
+  assertTrue("Not expected exception is obtained while parsing parquet 
metadata version",
+  e.getMessage().contains("Could not parse metadata"));
+  throw new DrillRuntimeException(e);
--- End diff --

1. `throw e`, no need to wrap it again in `DrillRuntimeException`
2. You can check that error message also contains corrupted version:
`contains(String.format("Could not parse metadata version %s", 
versionWithoutFirstLetter)));`

Please correct here and in tests below.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102947#comment-16102947
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129783958
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.Lists;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+class MetadataVersion implements Comparable {
+  /**
+   * String version starts from 'v' letter
+   * First group is major metadata version (any number of digits, except a 
single zero digit)
+   * Next character is optional '.' (if minor version is specified)
+   * Next group is optional, minor metadata version (any number of digits, 
except a single zero digit)
+   * Examples of correct metadata versions: v1, v10, v4.13
+   */
+  private static final String FORMAT = "v((?!0)\\d+)\\.?((?!0)\\d+)?";
+  private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+  private final int major;
+  private final int minor;
+
+  public MetadataVersion(int major, int minor) {
+this.major = major;
+this.minor = minor;
+  }
+
+  public MetadataVersion(String metadataVersion) {
+Matcher matcher = PATTERN.matcher(metadataVersion);
+if (!matcher.matches()) {
+  DrillRuntimeException.format("Could not parse metadata version '%s' 
using format '%s'", metadataVersion, FORMAT);
+}
+this.major = Integer.parseInt(matcher.group(1));
+this.minor = matcher.group(2) != null ? 
Integer.parseInt(matcher.group(2)) : 0;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) return true;
+if (!(o instanceof MetadataVersion)) {
+  return false;
+}
+MetadataVersion that = (MetadataVersion) o;
+return this.major == that.major
+&& this.minor == that.minor;
+  }
+
+  @Override
+  public int hashCode() {
+int result = major;
+result = 31 * result + minor;
+return result;
+  }
+
+  @Override
+  public String toString() {
+return minor == 0 ? String.format("v%s1", major) : 
String.format("v%s1.%s2", major, minor);
+  }
+
+  @Override
+  public int compareTo(MetadataVersion o) {
+Preconditions.checkNotNull(o);
+return ComparisonChain.start()
+.compare(this.major, o.major)
+.compare(this.minor, o.minor)
+.result();
+  }
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+  public static class Constants {
+/**
+ * Version 1: Introduces parquet file metadata caching.
+ * See DRILL-2743
+ */
+public static final String V1 = "v1";
--- End diff --

As I have pointed out in the previous round of code review, we want to 
remove the burden from the developer on how parquet metadata version string 
representation should look like when adding new version. For this 
`MetadataVersion` class was introduced and `toString` method was overridden:
`public static final String V1 = new MetadataVersion(1, 0).toString();`


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102948#comment-16102948
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129780880
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.Lists;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+class MetadataVersion implements Comparable {
--- End diff --

public


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102083#comment-16102083
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129659864
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -424,32 +432,121 @@ public void testMoveCache() throws Exception {
 
   @Test
   public void testMetadataCacheAbsolutePaths() throws Exception {
+final String absolutePathsMetadata = "absolute_paths_metadata";
 try {
   test("use dfs_test.tmp");
-  final String relative_path_metadata_t1 = RELATIVE_PATHS_METADATA + 
"/t1";
-  final String relative_path_metadata_t2 = RELATIVE_PATHS_METADATA + 
"/t2";
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t1);
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t2);
+  // creating two inner directories to leverage 
METADATA_DIRECTORIES_FILENAME metadata file as well
+  final String absolutePathsMetadataT1 = absolutePathsMetadata + "/t1";
+  final String absolutePathsMetadataT2 = absolutePathsMetadata + "/t2";
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT1);
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT2);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_directories_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_DIRECTORIES_FILENAME);
+  "metadata_directories_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_DIRECTORIES_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
relative_path_metadata_t1, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
absolutePathsMetadataT1, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
relative_path_metadata_t2, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
absolutePathsMetadataT2, Metadata.METADATA_FILENAME);
+  String query = String.format("select * from %s", 
absolutePathsMetadata);
+  int expectedRowCount = 50;
+  int expectedNumFiles = 1; // point to selectionRoot since no pruning 
is done in this query
+  int actualRowCount = testSql(query);
+  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files",
+  expectedRowCount, actualRowCount);
+  String numFilesPattern = "numFiles=" + expectedNumFiles;
+  String usedMetaPattern = "usedMetadataFile=true";
+  String cacheFileRootPattern = String.format("cacheFileRoot=%s/%s", 
getDfsTestTmpSchemaLocation(), absolutePathsMetadata);
+  PlanTestBase.testPlanMatchingPatterns(query, new 
String[]{numFilesPattern, usedMetaPattern, cacheFileRootPattern},
+  new String[] {"Filter"});
+} finally {
+  test("drop table if exists %s", absolutePathsMetadata);
+}
+  }
 
-  int rowCount = testSql(String.format("select * from %s", 
RELATIVE_PATHS_METADATA));
-  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files", 50, rowCount);
+  @Test
+  public void testSpacesInMetadataCachePath() throws Exception {
+final String pathWithSpaces = "path with spaces";
+try {
+  // creating multilevel table to store path with spaces in both 
metadata files (METADATA and METADATA_DIRECTORIES)
+  test("create table dfs_test.tmp.`%s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("create table dfs_test.tmp.`%1$s/%1$s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("refresh table metadata dfs_test.tmp.`%s`", pathWithSpaces);
+  checkForMetadataFile(pathWithSpaces);
+  String query = 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102073#comment-16102073
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129545551
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(Metadata.class);
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102079#comment-16102079
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129654257
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(Metadata.class);
+
+  /**
+   * Version 1: Introduces parquet file metadata caching.
+   * See DRILL-2743
+   */
+  public static final String V1 = "v1";
+  /**
+   * Version 2: Metadata cache file size is reduced.
+   * See DRILL-4053
+   */
+  public static final String V2 = "v2";
+  /**
+   * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+   * Filter pushdown for Parquet is implemented. 
+   * See DRILL-1950
+   */
+  public static final String V3 = "v3";
+  /**
+   * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+   * See DRILL-3867
+   */
+  public static final String V3_1 = "v3.1";
+
+  /**
+   * All historical versions of the Drill metadata cache files
+   */
+  public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
+
+  /**
+   * @param metadataVersion parquet metadata version
+   * @return true if metadata version is supported, false otherwise
+   */
+  public static boolean isVersionSupported(String metadataVersion) {
+return SUPPORTED_VERSIONS.contains(metadataVersion);
+  }
+
+  /**
+   * Helper compare method similar to {@link java.util.Comparator#compare}
+   *
+   * @param metadataVersion1 the first metadata version to be compared
+   * @param metadataVersion2 the second metadata version to be compared
+   * @return a negative integer, zero, or a positive integer as the
+   * first argument is less than, equal to, or greater than the
+   * second.
+   */
+  public static int compare(String metadataVersion1, String 
metadataVersion2) {
+if (isVersionSupported(metadataVersion1) && 
isVersionSupported(metadataVersion2)) {
+  return 
VersionParser.parse(metadataVersion1).compareTo(VersionParser.parse(metadataVersion2));
+} else {
+  // this is never reached
+  throw new DrillRuntimeException(String.format("Unsupported metadata 
version. '%s' version can't be compared with '%s'",
+  metadataVersion1, metadataVersion2));
+}
+  }
+
+  /**
+   * Parses a parquet metadata version string
+   */
+  public static class VersionParser {
--- End diff --

`VersionParser` is deleted. 
`MetadataVersion` is introduced with inner class `Constants`.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102077#comment-16102077
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129653217
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataVersion.java
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.junit.Test;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+
+public class TestParquetMetadataVersion {
+
+  private static String TEST_EXCEPTION_MESSAGE = "Failure is expected 
while parsing the wrong metadata version";
+  private static String TEST_ASSERT_EQUALS_MESSAGE = "Parquet metadata 
version is parsed incorrectly";
+  private static String TEST_ASSERT_TRUE_MESSAGE = "Not expected exception 
is obtained while parsing parquet metadata version";
--- End diff --

constants are removed. Thanks


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102074#comment-16102074
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129654624
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(Metadata.class);
+
+  /**
+   * Version 1: Introduces parquet file metadata caching.
+   * See DRILL-2743
+   */
+  public static final String V1 = "v1";
+  /**
+   * Version 2: Metadata cache file size is reduced.
+   * See DRILL-4053
+   */
+  public static final String V2 = "v2";
+  /**
+   * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+   * Filter pushdown for Parquet is implemented. 
+   * See DRILL-1950
+   */
+  public static final String V3 = "v3";
+  /**
+   * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+   * See DRILL-3867
+   */
+  public static final String V3_1 = "v3.1";
+
+  /**
+   * All historical versions of the Drill metadata cache files
+   */
+  public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
--- End diff --

`MetadataVersion` is created.  


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102075#comment-16102075
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129562100
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataVersion.java
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.junit.Test;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+
+public class TestParquetMetadataVersion {
+
+  private static String TEST_EXCEPTION_MESSAGE = "Failure is expected 
while parsing the wrong metadata version";
+  private static String TEST_ASSERT_EQUALS_MESSAGE = "Parquet metadata 
version is parsed incorrectly";
+  private static String TEST_ASSERT_TRUE_MESSAGE = "Not expected exception 
is obtained while parsing parquet metadata version";
+
+  @Test
+  public void testCorrectOnlyMajorVersion() throws Exception {
+String correctOnlyMajorVersion = MetadataVersions.V1;
+Version parsedMetadataVersion = 
MetadataVersions.VersionParser.parse(correctOnlyMajorVersion);
+Version expectedMetadataVersion = new Version(correctOnlyMajorVersion, 
1, 0, 0, 0, "");
+assertEquals(TEST_ASSERT_EQUALS_MESSAGE, expectedMetadataVersion, 
parsedMetadataVersion);
+  }
+
+  @Test
+  public void testCorrectMajorMinorVersion() throws Exception {
+String correctMajorMinorVersion = MetadataVersions.V3_1;
+Version parsedMetadataVersion = 
MetadataVersions.VersionParser.parse(correctMajorMinorVersion);
+Version expectedMetadataVersion = new 
Version(correctMajorMinorVersion, 3, 1, 0, 0, "");
+assertEquals(TEST_ASSERT_EQUALS_MESSAGE, expectedMetadataVersion, 
parsedMetadataVersion);
+  }
+
+  @Test
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102072#comment-16102072
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129653855
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(Metadata.class);
+
+  /**
+   * Version 1: Introduces parquet file metadata caching.
+   * See DRILL-2743
+   */
+  public static final String V1 = "v1";
+  /**
+   * Version 2: Metadata cache file size is reduced.
+   * See DRILL-4053
+   */
+  public static final String V2 = "v2";
+  /**
+   * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+   * Filter pushdown for Parquet is implemented. 
+   * See DRILL-1950
+   */
+  public static final String V3 = "v3";
+  /**
+   * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+   * See DRILL-3867
+   */
+  public static final String V3_1 = "v3.1";
+
+  /**
+   * All historical versions of the Drill metadata cache files
+   */
+  public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
+
+  /**
+   * @param metadataVersion parquet metadata version
+   * @return true if metadata version is supported, false otherwise
+   */
+  public static boolean isVersionSupported(String metadataVersion) {
+return SUPPORTED_VERSIONS.contains(metadataVersion);
+  }
+
+  /**
+   * Helper compare method similar to {@link java.util.Comparator#compare}
+   *
+   * @param metadataVersion1 the first metadata version to be compared
+   * @param metadataVersion2 the second metadata version to be compared
+   * @return a negative integer, zero, or a positive integer as the
+   * first argument is less than, equal to, or greater than the
+   * second.
+   */
+  public static int compare(String metadataVersion1, String 
metadataVersion2) {
+if (isVersionSupported(metadataVersion1) && 
isVersionSupported(metadataVersion2)) {
+  return 
VersionParser.parse(metadataVersion1).compareTo(VersionParser.parse(metadataVersion2));
+} else {
+  // this is never reached
+  throw new DrillRuntimeException(String.format("Unsupported metadata 
version. '%s' version can't be compared with '%s'",
+  metadataVersion1, metadataVersion2));
+}
+  }
+
+  /**
+   * Parses a parquet metadata version string
+   */
+  public static class VersionParser {
+// example: v3.1 or v2
+private static final String FORMAT = "v(\\d)\\.?(\\d)?";
+private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+/**
+ * @param metadataVersion text metadata version
+ * @return comparable metadata version object
+ */
+public static Version parse(String metadataVersion) {
+  Matcher matcher = PATTERN.matcher(metadataVersion);
+  if (!matcher.matches()) {
+DrillRuntimeException.format("Could not parse metadata version 
'%s' using format '%s'", metadataVersion, 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102078#comment-16102078
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129655324
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataVersion.java
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.junit.Test;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+
+public class TestParquetMetadataVersion {
+
+  private static String TEST_EXCEPTION_MESSAGE = "Failure is expected 
while parsing the wrong metadata version";
--- End diff --

Constant messages are deleted.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101532#comment-16101532
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129539105
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -424,32 +432,121 @@ public void testMoveCache() throws Exception {
 
   @Test
   public void testMetadataCacheAbsolutePaths() throws Exception {
+final String absolutePathsMetadata = "absolute_paths_metadata";
 try {
   test("use dfs_test.tmp");
-  final String relative_path_metadata_t1 = RELATIVE_PATHS_METADATA + 
"/t1";
-  final String relative_path_metadata_t2 = RELATIVE_PATHS_METADATA + 
"/t2";
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t1);
-  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
relative_path_metadata_t2);
+  // creating two inner directories to leverage 
METADATA_DIRECTORIES_FILENAME metadata file as well
+  final String absolutePathsMetadataT1 = absolutePathsMetadata + "/t1";
+  final String absolutePathsMetadataT2 = absolutePathsMetadata + "/t2";
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT1);
+  test("create table `%s` as select * from cp.`tpch/nation.parquet`", 
absolutePathsMetadataT2);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_directories_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_DIRECTORIES_FILENAME);
+  "metadata_directories_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_DIRECTORIES_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths.requires_replace.txt", 
RELATIVE_PATHS_METADATA, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths.requires_replace.txt", 
absolutePathsMetadata, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
relative_path_metadata_t1, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t1.requires_replace.txt", 
absolutePathsMetadataT1, Metadata.METADATA_FILENAME);
   
copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/"
 +
-  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
relative_path_metadata_t2, Metadata.METADATA_FILENAME);
+  "metadata_table_with_absolute_paths_t2.requires_replace.txt", 
absolutePathsMetadataT2, Metadata.METADATA_FILENAME);
+  String query = String.format("select * from %s", 
absolutePathsMetadata);
+  int expectedRowCount = 50;
+  int expectedNumFiles = 1; // point to selectionRoot since no pruning 
is done in this query
+  int actualRowCount = testSql(query);
+  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files",
+  expectedRowCount, actualRowCount);
+  String numFilesPattern = "numFiles=" + expectedNumFiles;
+  String usedMetaPattern = "usedMetadataFile=true";
+  String cacheFileRootPattern = String.format("cacheFileRoot=%s/%s", 
getDfsTestTmpSchemaLocation(), absolutePathsMetadata);
+  PlanTestBase.testPlanMatchingPatterns(query, new 
String[]{numFilesPattern, usedMetaPattern, cacheFileRootPattern},
+  new String[] {"Filter"});
+} finally {
+  test("drop table if exists %s", absolutePathsMetadata);
+}
+  }
 
-  int rowCount = testSql(String.format("select * from %s", 
RELATIVE_PATHS_METADATA));
-  assertEquals("An incorrect result was obtained while querying a 
table with metadata cache files", 50, rowCount);
+  @Test
+  public void testSpacesInMetadataCachePath() throws Exception {
+final String pathWithSpaces = "path with spaces";
+try {
+  // creating multilevel table to store path with spaces in both 
metadata files (METADATA and METADATA_DIRECTORIES)
+  test("create table dfs_test.tmp.`%s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("create table dfs_test.tmp.`%1$s/%1$s` as select * from 
cp.`tpch/nation.parquet`", pathWithSpaces);
+  test("refresh table metadata dfs_test.tmp.`%s`", pathWithSpaces);
+  checkForMetadataFile(pathWithSpaces);
+  String query = 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101512#comment-16101512
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129530761
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,25 +134,64 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (isMetadataFilesCorrupted(metaContext, path)) {
+  return null;
+}
 Metadata metadata = new Metadata(fs, formatConfig);
 metadata.readBlockMeta(path, false, metaContext);
 return metadata.parquetTableMetadata;
   }
 
-  public static ParquetTableMetadataDirs readMetadataDirs(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
+  /**
+   * Get the parquet metadata for all subdirectories by reading the 
metadata file
+   *
+   * @param fs current file system
+   * @param path The path to the metadata file, located in the directory 
that contains the parquet files
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet metadata for a directory
+   * @throws IOException if metadata file can't be read or updated
+   */
+  public static @Nullable ParquetTableMetadataDirs 
readMetadataDirs(FileSystem fs, Path path,
+  MetadataContext metaContext, ParquetFormatConfig formatConfig) {
+if (isMetadataFilesCorrupted(metaContext, path)) {
+  return null;
+}
 Metadata metadata = new Metadata(fs, formatConfig);
 metadata.readBlockMeta(path, true, metaContext);
 return metadata.parquetTableMetadataDirs;
   }
 
+  /**
+   * Checking whether metadata is corrupted
+   *
+   * @param metaContext metadata context
+   * @param path The path to the metadata file, located in the directory 
that contains the parquet files
+   * @return true if parquet metadata is corrupted, false otherwise
+   */
+  private static boolean isMetadataFilesCorrupted(MetadataContext 
metaContext, Path path) {
--- End diff --

This method does not only check if metadata files are corrupt, it also 
warns the user that corrupted metadata files won't be read. Taking into account 
warn message, it's better to rename method to `ignoreReadingMetadata`.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101517#comment-16101517
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129528313
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(Metadata.class);
+
+  /**
+   * Version 1: Introduces parquet file metadata caching.
+   * See DRILL-2743
+   */
+  public static final String V1 = "v1";
+  /**
+   * Version 2: Metadata cache file size is reduced.
+   * See DRILL-4053
+   */
+  public static final String V2 = "v2";
+  /**
+   * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+   * Filter pushdown for Parquet is implemented. 
+   * See DRILL-1950
+   */
+  public static final String V3 = "v3";
+  /**
+   * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+   * See DRILL-3867
+   */
+  public static final String V3_1 = "v3.1";
+
+  /**
+   * All historical versions of the Drill metadata cache files
+   */
+  public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
+
+  /**
+   * @param metadataVersion parquet metadata version
+   * @return true if metadata version is supported, false otherwise
+   */
+  public static boolean isVersionSupported(String metadataVersion) {
+return SUPPORTED_VERSIONS.contains(metadataVersion);
+  }
+
+  /**
+   * Helper compare method similar to {@link java.util.Comparator#compare}
+   *
+   * @param metadataVersion1 the first metadata version to be compared
+   * @param metadataVersion2 the second metadata version to be compared
+   * @return a negative integer, zero, or a positive integer as the
+   * first argument is less than, equal to, or greater than the
+   * second.
+   */
+  public static int compare(String metadataVersion1, String 
metadataVersion2) {
+if (isVersionSupported(metadataVersion1) && 
isVersionSupported(metadataVersion2)) {
+  return 
VersionParser.parse(metadataVersion1).compareTo(VersionParser.parse(metadataVersion2));
+} else {
+  // this is never reached
+  throw new DrillRuntimeException(String.format("Unsupported metadata 
version. '%s' version can't be compared with '%s'",
+  metadataVersion1, metadataVersion2));
+}
+  }
+
+  /**
+   * Parses a parquet metadata version string
+   */
+  public static class VersionParser {
+// example: v3.1 or v2
+private static final String FORMAT = "v(\\d)\\.?(\\d)?";
+private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+/**
+ * @param metadataVersion text metadata version
+ * @return comparable metadata version object
+ */
+public static Version parse(String metadataVersion) {
+  Matcher matcher = PATTERN.matcher(metadataVersion);
+  if (!matcher.matches()) {
+DrillRuntimeException.format("Could not parse metadata version 
'%s' using format '%s'", 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101518#comment-16101518
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129535500
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(Metadata.class);
--- End diff --

private


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101514#comment-16101514
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129525648
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataVersion.java
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.junit.Test;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+
+public class TestParquetMetadataVersion {
+
+  private static String TEST_EXCEPTION_MESSAGE = "Failure is expected 
while parsing the wrong metadata version";
--- End diff --

Please add final keyword.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101519#comment-16101519
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129526081
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataVersion.java
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.junit.Test;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+
+public class TestParquetMetadataVersion {
+
+  private static String TEST_EXCEPTION_MESSAGE = "Failure is expected 
while parsing the wrong metadata version";
+  private static String TEST_ASSERT_EQUALS_MESSAGE = "Parquet metadata 
version is parsed incorrectly";
+  private static String TEST_ASSERT_TRUE_MESSAGE = "Not expected exception 
is obtained while parsing parquet metadata version";
--- End diff --

`TEST_ASSERT_EQUALS_MESSAGE` -> `TEST_ASSERT_EQUALS_ERROR_MESSAGE`
`TEST_ASSERT_TRUE_MESSAGE` -> `TEST_ASSERT_TRUE_ERROR_MESSAGE`
Though in tests you may leave error messages hard-coded string for better 
visibility.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101513#comment-16101513
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129526936
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersions.java
 ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.common.Version;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Supported metadata versions.
+ * 
+ * Note: keep them synchronized with {@link 
Metadata.ParquetTableMetadataBase} versions
+ */
+public class MetadataVersions {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(Metadata.class);
+
+  /**
+   * Version 1: Introduces parquet file metadata caching.
+   * See DRILL-2743
+   */
+  public static final String V1 = "v1";
+  /**
+   * Version 2: Metadata cache file size is reduced.
+   * See DRILL-4053
+   */
+  public static final String V2 = "v2";
+  /**
+   * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+   * Filter pushdown for Parquet is implemented. 
+   * See DRILL-1950
+   */
+  public static final String V3 = "v3";
+  /**
+   * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+   * See DRILL-3867
+   */
+  public static final String V3_1 = "v3.1";
+
+  /**
+   * All historical versions of the Drill metadata cache files
+   */
+  public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
+
+  /**
+   * @param metadataVersion parquet metadata version
+   * @return true if metadata version is supported, false otherwise
+   */
+  public static boolean isVersionSupported(String metadataVersion) {
+return SUPPORTED_VERSIONS.contains(metadataVersion);
+  }
+
+  /**
+   * Helper compare method similar to {@link java.util.Comparator#compare}
+   *
+   * @param metadataVersion1 the first metadata version to be compared
+   * @param metadataVersion2 the second metadata version to be compared
+   * @return a negative integer, zero, or a positive integer as the
+   * first argument is less than, equal to, or greater than the
+   * second.
+   */
+  public static int compare(String metadataVersion1, String 
metadataVersion2) {
+if (isVersionSupported(metadataVersion1) && 
isVersionSupported(metadataVersion2)) {
+  return 
VersionParser.parse(metadataVersion1).compareTo(VersionParser.parse(metadataVersion2));
+} else {
+  // this is never reached
+  throw new DrillRuntimeException(String.format("Unsupported metadata 
version. '%s' version can't be compared with '%s'",
+  metadataVersion1, metadataVersion2));
+}
+  }
+
+  /**
+   * Parses a parquet metadata version string
+   */
+  public static class VersionParser {
--- End diff --

This an be extracted into separate class, there is no need to keep it as 
inner.
If we create separate entity for parquet metadata version, we might include 
parse method directly in that class.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>   

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101370#comment-16101370
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129104084
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,23 +135,57 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, false, metaContext);
-return metadata.parquetTableMetadata;
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, false, metaContext);
+return metadata.parquetTableMetadata;
+  } catch (IOException e) {
+logger.error(e.toString());
+metaContext.isMetaCacheFileCorrect = false;
+  }
+}
+logger.warn("Ignoring unsupported or corrupted metadata file version. 
Query performance may be slow. Make sure " +
+"the cache file is up-to-date by running the REFRESH TABLE 
METADATA command");
+return null;
   }
 
-  public static ParquetTableMetadataDirs readMetadataDirs(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, true, metaContext);
-return metadata.parquetTableMetadataDirs;
+  /**
+   * Get the parquet metadata for all subdirectories by reading the 
metadata file
+   *
+   * @param fs current file system
+   * @param path The path to the metadata file, located in the directory 
that contains the parquet files
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet metadata for a directory
+   * @throws IOException if metadata file can't be read or updated
+   */
+  public static @Nullable ParquetTableMetadataDirs 
readMetadataDirs(FileSystem fs, Path path,
+  MetadataContext metaContext, ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaDirsCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, true, metaContext);
+return metadata.parquetTableMetadataDirs;
+  } catch (IOException e) {
+logger.error(e.toString());
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101375#comment-16101375
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129485867
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1851,9 +1922,73 @@ private static String relativize(String baseDir, 
String childPath) {
   .relativize(fullPathWithoutSchemeAndAuthority.toUri()));
   if (relativeFilePath.isAbsolute()) {
 throw new IllegalStateException(String.format("Path %s is not a 
subpath of %s.",
-basePathWithoutSchemeAndAuthority.toUri().toString(), 
fullPathWithoutSchemeAndAuthority.toUri().toString()));
+basePathWithoutSchemeAndAuthority.toUri().getPath(), 
fullPathWithoutSchemeAndAuthority.toUri().getPath()));
+  }
+  return relativeFilePath.toUri().getPath();
+}
+  }
+
+  /**
+   * Supported metadata versions.
+   * 
+   * Note: keep them synchronized with {@link ParquetTableMetadataBase} 
versions
+   */
+  public static class MetadataVersion {
--- End diff --

Agree. Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101371#comment-16101371
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129485531
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -38,24 +40,31 @@
   private static final String TEST_RES_PATH = WORKING_PATH + 
"/src/test/resources";
   private static final String tableName1 = "parquetTable1";
   private static final String tableName2 = "parquetTable2";
-  private static final String RELATIVE_PATHS_METADATA = 
"relative_paths_metadata";
+  private static File dataDir1;
+  private static File dataDir2;
 
 
   @BeforeClass
   public static void copyData() throws Exception {
 // copy the data into the temporary location
 String tmpLocation = getDfsTestTmpSchemaLocation();
-File dataDir1 = new File(tmpLocation + Path.SEPARATOR + tableName1);
+dataDir1 = new File(tmpLocation + Path.SEPARATOR + tableName1);
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101378#comment-16101378
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129485543
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -38,24 +40,31 @@
   private static final String TEST_RES_PATH = WORKING_PATH + 
"/src/test/resources";
   private static final String tableName1 = "parquetTable1";
   private static final String tableName2 = "parquetTable2";
-  private static final String RELATIVE_PATHS_METADATA = 
"relative_paths_metadata";
+  private static File dataDir1;
+  private static File dataDir2;
 
 
   @BeforeClass
   public static void copyData() throws Exception {
 // copy the data into the temporary location
 String tmpLocation = getDfsTestTmpSchemaLocation();
-File dataDir1 = new File(tmpLocation + Path.SEPARATOR + tableName1);
+dataDir1 = new File(tmpLocation + Path.SEPARATOR + tableName1);
 dataDir1.mkdir();
 FileUtils.copyDirectory(new 
File(String.format(String.format("%s/multilevel/parquet", TEST_RES_PATH))),
 dataDir1);
 
-File dataDir2 = new File(tmpLocation + Path.SEPARATOR + tableName2);
+dataDir2 = new File(tmpLocation + Path.SEPARATOR + tableName2);
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101374#comment-16101374
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129486188
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1851,9 +1922,73 @@ private static String relativize(String baseDir, 
String childPath) {
   .relativize(fullPathWithoutSchemeAndAuthority.toUri()));
   if (relativeFilePath.isAbsolute()) {
 throw new IllegalStateException(String.format("Path %s is not a 
subpath of %s.",
-basePathWithoutSchemeAndAuthority.toUri().toString(), 
fullPathWithoutSchemeAndAuthority.toUri().toString()));
+basePathWithoutSchemeAndAuthority.toUri().getPath(), 
fullPathWithoutSchemeAndAuthority.toUri().getPath()));
+  }
+  return relativeFilePath.toUri().getPath();
+}
+  }
+
+  /**
+   * Supported metadata versions.
+   * 
+   * Note: keep them synchronized with {@link ParquetTableMetadataBase} 
versions
+   */
+  public static class MetadataVersion {
+
+/**
+ * Version 1: Introduces parquet file metadata caching.
+ * See DRILL-2743
+ */
+public static final String V1 = "v1";
+/**
+ * Version 2: Metadata cache file size is reduced.
+ * See DRILL-4053
+ */
+public static final String V2 = "v2";
+/**
+ * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+ * Filter pushdown for Parquet is implemented. 
+ * See DRILL-1950
+ */
+public static final String V3 = "v3";
+/**
+ * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+ * See DRILL-3867
+ */
+public static final String V3_1 = "v3.1";
+
+
+/**
+ * All historical versions of the Drill metadata cache files
+ */
+public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
+
+/**
+ * @param metadataVersion parquet metadata version
+ * @return true if metadata version is supported, false otherwise
+ */
+public static boolean isVersionSupported(String metadataVersion) {
+  return SUPPORTED_VERSIONS.contains(metadataVersion);
+}
+
+/**
+ * Helper compare method similar to {@link 
java.util.Comparator#compare}
+ *
+ * @param metadataVersion1 the first metadata version to be compared
+ * @param metadataVersion2 the second metadata version to be compared
+ * @return a negative integer, zero, or a positive integer as the
+ * first argument is less than, equal to, or greater than the
+ * second.
+ */
+public static int compare(String metadataVersion1, String 
metadataVersion2) {
+  if (isVersionSupported(metadataVersion1) && 
isVersionSupported(metadataVersion2)) {
+return 
Integer.compare(SUPPORTED_VERSIONS.indexOf(metadataVersion1), 
SUPPORTED_VERSIONS.indexOf(metadataVersion2));
+  } else {
+// this is never reached
+throw UserException.validationError()
--- End diff --

Replaced


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101372#comment-16101372
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129103929
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,23 +135,57 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, false, metaContext);
-return metadata.parquetTableMetadata;
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, false, metaContext);
+return metadata.parquetTableMetadata;
+  } catch (IOException e) {
+logger.error(e.toString());
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101373#comment-16101373
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129409026
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,23 +135,57 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, Path 
path, MetadataContext metaContext, ParquetFormatConfig formatConfig) throws 
IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, false, metaContext);
-return metadata.parquetTableMetadata;
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, false, metaContext);
+return metadata.parquetTableMetadata;
+  } catch (IOException e) {
+logger.error(e.toString());
+metaContext.isMetaCacheFileCorrect = false;
+  }
+}
+logger.warn("Ignoring unsupported or corrupted metadata file version. 
Query performance may be slow. Make sure " +
+"the cache file is up-to-date by running the REFRESH TABLE 
METADATA command");
+return null;
   }
 
-  public static ParquetTableMetadataDirs readMetadataDirs(FileSystem fs, 
Path path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, true, metaContext);
-return metadata.parquetTableMetadataDirs;
+  /**
+   * Get the parquet metadata for all subdirectories by reading the 
metadata file
+   *
+   * @param fs current file system
+   * @param path The path to the metadata file, located in the directory 
that contains the parquet files
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet metadata for a directory
+   * @throws IOException if metadata file can't be read or updated
+   */
+  public static @Nullable ParquetTableMetadataDirs 
readMetadataDirs(FileSystem fs, Path path,
+  MetadataContext metaContext, ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaDirsCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, true, metaContext);
+return metadata.parquetTableMetadataDirs;
+  } catch (IOException e) {
+logger.error(e.toString());
+metaContext.isMetaDirsCacheFileCorrect = false;
+  }
+}
+logger.warn("Ignoring corrupted metadata file. Query performance may 
be slow. Make sure the cache file" +
--- End diff --

Static helper method is added to avoid duplicated code. 
Added path to the file that is corrupt in the error message. And added file 
path that is ignored in the warning message.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101377#comment-16101377
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129485798
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/BaseTestQuery.java 
---
@@ -639,14 +644,18 @@ protected static void 
copyDirectoryIntoTempSpace(String resourcesDir, String des
* @param srcFileOnClassPath the source path of metadata cache file, 
which should be replaced
* @param destFolderInTmp  the parent folder name of the metadata cache 
file
* @param metaFileName the name of metadata cache file depending on the 
type of the metadata
+   * @param customStringReplacement custom string to replace the 
"CUSTOM_REPLACED" target string in metadata file
* @throws IOException if a create or write errors occur
*/
-  protected static void 
copyMetaDataCacheToTempReplacingInternalPaths(String srcFileOnClassPath, String 
destFolderInTmp,
-  String metaFileName) throws IOException {
+  protected static void copyMetaDataCacheToTempWithReplacements(String 
srcFileOnClassPath,
+  String destFolderInTmp, String metaFileName, String 
customStringReplacement) throws IOException {
 String metadataFileContents = getFile(srcFileOnClassPath);
 Path rootMeta = new Path(dfsTestTmpSchemaLocation, destFolderInTmp);
 Path newMetaCache = new Path(rootMeta, metaFileName);
 FSDataOutputStream outSteam = fs.create(newMetaCache);
+if (customStringReplacement!=null) {
--- End diff --

Done


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101376#comment-16101376
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129486097
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1851,9 +1922,73 @@ private static String relativize(String baseDir, 
String childPath) {
   .relativize(fullPathWithoutSchemeAndAuthority.toUri()));
   if (relativeFilePath.isAbsolute()) {
 throw new IllegalStateException(String.format("Path %s is not a 
subpath of %s.",
-basePathWithoutSchemeAndAuthority.toUri().toString(), 
fullPathWithoutSchemeAndAuthority.toUri().toString()));
+basePathWithoutSchemeAndAuthority.toUri().getPath(), 
fullPathWithoutSchemeAndAuthority.toUri().getPath()));
+  }
+  return relativeFilePath.toUri().getPath();
+}
+  }
+
+  /**
+   * Supported metadata versions.
+   * 
+   * Note: keep them synchronized with {@link ParquetTableMetadataBase} 
versions
+   */
+  public static class MetadataVersion {
+
+/**
+ * Version 1: Introduces parquet file metadata caching.
+ * See DRILL-2743
+ */
+public static final String V1 = "v1";
+/**
+ * Version 2: Metadata cache file size is reduced.
+ * See DRILL-4053
+ */
+public static final String V2 = "v2";
+/**
+ * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+ * Filter pushdown for Parquet is implemented. 
+ * See DRILL-1950
+ */
+public static final String V3 = "v3";
+/**
+ * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+ * See DRILL-3867
+ */
+public static final String V3_1 = "v3.1";
+
+
+/**
+ * All historical versions of the Drill metadata cache files
+ */
+public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
+
+/**
+ * @param metadataVersion parquet metadata version
+ * @return true if metadata version is supported, false otherwise
+ */
+public static boolean isVersionSupported(String metadataVersion) {
+  return SUPPORTED_VERSIONS.contains(metadataVersion);
+}
+
+/**
+ * Helper compare method similar to {@link 
java.util.Comparator#compare}
+ *
+ * @param metadataVersion1 the first metadata version to be compared
+ * @param metadataVersion2 the second metadata version to be compared
+ * @return a negative integer, zero, or a positive integer as the
+ * first argument is less than, equal to, or greater than the
+ * second.
+ */
+public static int compare(String metadataVersion1, String 
metadataVersion2) {
--- End diff --

Added new `VersionParser` class, which allow to parse parquet metadata 
version string. Added test cases for different correct and wrong parquet 
metadata versions.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101369#comment-16101369
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r129103339
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +41,10 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  // False values of these flags allow to avoid double reading of 
corrupted or unsupported metadata files
+  public boolean isMetaCacheFileCorrect = true;
+  public boolean isMetaDirsCacheFileCorrect = true;
--- End diff --

The state is created after failed opening of the metadata cache file. And 
will be stored per query.
There is no case for now to clear that state manually. 
Note: `MetadataContext.clear()` removes all metadata context info 
(`isMetadataFilesCorrupted` too).

I left only one state, since if one file is unsupported or corrupted it is 
more likely that other files are the same. Even if not, it can be dangerous to 
read some files, but ignore the others.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095968#comment-16095968
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128712562
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1851,9 +1922,73 @@ private static String relativize(String baseDir, 
String childPath) {
   .relativize(fullPathWithoutSchemeAndAuthority.toUri()));
   if (relativeFilePath.isAbsolute()) {
 throw new IllegalStateException(String.format("Path %s is not a 
subpath of %s.",
-basePathWithoutSchemeAndAuthority.toUri().toString(), 
fullPathWithoutSchemeAndAuthority.toUri().toString()));
+basePathWithoutSchemeAndAuthority.toUri().getPath(), 
fullPathWithoutSchemeAndAuthority.toUri().getPath()));
+  }
+  return relativeFilePath.toUri().getPath();
+}
+  }
+
+  /**
+   * Supported metadata versions.
+   * 
+   * Note: keep them synchronized with {@link ParquetTableMetadataBase} 
versions
+   */
+  public static class MetadataVersion {
+
+/**
+ * Version 1: Introduces parquet file metadata caching.
+ * See DRILL-2743
+ */
+public static final String V1 = "v1";
+/**
+ * Version 2: Metadata cache file size is reduced.
+ * See DRILL-4053
+ */
+public static final String V2 = "v2";
+/**
+ * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+ * Filter pushdown for Parquet is implemented. 
+ * See DRILL-1950
+ */
+public static final String V3 = "v3";
+/**
+ * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+ * See DRILL-3867
+ */
+public static final String V3_1 = "v3.1";
+
+
+/**
+ * All historical versions of the Drill metadata cache files
+ */
+public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
+
+/**
+ * @param metadataVersion parquet metadata version
+ * @return true if metadata version is supported, false otherwise
+ */
+public static boolean isVersionSupported(String metadataVersion) {
+  return SUPPORTED_VERSIONS.contains(metadataVersion);
+}
+
+/**
+ * Helper compare method similar to {@link 
java.util.Comparator#compare}
+ *
+ * @param metadataVersion1 the first metadata version to be compared
+ * @param metadataVersion2 the second metadata version to be compared
+ * @return a negative integer, zero, or a positive integer as the
+ * first argument is less than, equal to, or greater than the
+ * second.
+ */
+public static int compare(String metadataVersion1, String 
metadataVersion2) {
--- End diff --

Here you compare two versions based on their index in array list. It's 
fragile approach. First of all developer should always add version to the end 
of  `SUPPORTED_VERSIONS` lists but how do you ensure that? At least, note 
should have been added. Though I think it's better to parse version string 
representation and compare it.



> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread Arina Ielchiieva (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095900#comment-16095900
 ] 

Arina Ielchiieva commented on DRILL-5660:
-

[~vitalii] please summarize how Drill will behave with old / new / corrupted 
metadata files. You can add this information in `Docs Text` section.

> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095892#comment-16095892
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128697711
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,23 +135,57 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, false, metaContext);
-return metadata.parquetTableMetadata;
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, false, metaContext);
+return metadata.parquetTableMetadata;
+  } catch (IOException e) {
+logger.error(e.toString());
--- End diff --

`logger.error("Failed to read metadata", e)`


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095893#comment-16095893
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128697735
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,23 +135,57 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, false, metaContext);
-return metadata.parquetTableMetadata;
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, false, metaContext);
+return metadata.parquetTableMetadata;
+  } catch (IOException e) {
+logger.error(e.toString());
+metaContext.isMetaCacheFileCorrect = false;
+  }
+}
+logger.warn("Ignoring unsupported or corrupted metadata file version. 
Query performance may be slow. Make sure " +
+"the cache file is up-to-date by running the REFRESH TABLE 
METADATA command");
+return null;
   }
 
-  public static ParquetTableMetadataDirs readMetadataDirs(FileSystem fs, 
String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, true, metaContext);
-return metadata.parquetTableMetadataDirs;
+  /**
+   * Get the parquet metadata for all subdirectories by reading the 
metadata file
+   *
+   * @param fs current file system
+   * @param path The path to the metadata file, located in the directory 
that contains the parquet files
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet metadata for a directory
+   * @throws IOException if metadata file can't be read or updated
+   */
+  public static @Nullable ParquetTableMetadataDirs 
readMetadataDirs(FileSystem fs, Path path,
+  MetadataContext metaContext, ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaDirsCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, true, metaContext);
+return metadata.parquetTableMetadataDirs;
+  } catch (IOException e) {
+logger.error(e.toString());
--- End diff --

The same as above.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095897#comment-16095897
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128697396
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/BaseTestQuery.java 
---
@@ -639,14 +644,18 @@ protected static void 
copyDirectoryIntoTempSpace(String resourcesDir, String des
* @param srcFileOnClassPath the source path of metadata cache file, 
which should be replaced
* @param destFolderInTmp  the parent folder name of the metadata cache 
file
* @param metaFileName the name of metadata cache file depending on the 
type of the metadata
+   * @param customStringReplacement custom string to replace the 
"CUSTOM_REPLACED" target string in metadata file
* @throws IOException if a create or write errors occur
*/
-  protected static void 
copyMetaDataCacheToTempReplacingInternalPaths(String srcFileOnClassPath, String 
destFolderInTmp,
-  String metaFileName) throws IOException {
+  protected static void copyMetaDataCacheToTempWithReplacements(String 
srcFileOnClassPath,
+  String destFolderInTmp, String metaFileName, String 
customStringReplacement) throws IOException {
 String metadataFileContents = getFile(srcFileOnClassPath);
 Path rootMeta = new Path(dfsTestTmpSchemaLocation, destFolderInTmp);
 Path newMetaCache = new Path(rootMeta, metaFileName);
 FSDataOutputStream outSteam = fs.create(newMetaCache);
+if (customStringReplacement!=null) {
--- End diff --

`if (customStringReplacement!=null) {` -> `if (customStringReplacement != 
null) {`


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095891#comment-16095891
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128697252
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -38,24 +40,31 @@
   private static final String TEST_RES_PATH = WORKING_PATH + 
"/src/test/resources";
   private static final String tableName1 = "parquetTable1";
   private static final String tableName2 = "parquetTable2";
-  private static final String RELATIVE_PATHS_METADATA = 
"relative_paths_metadata";
+  private static File dataDir1;
+  private static File dataDir2;
 
 
   @BeforeClass
   public static void copyData() throws Exception {
 // copy the data into the temporary location
 String tmpLocation = getDfsTestTmpSchemaLocation();
-File dataDir1 = new File(tmpLocation + Path.SEPARATOR + tableName1);
+dataDir1 = new File(tmpLocation + Path.SEPARATOR + tableName1);
--- End diff --

Please replace with `new File (tmpLocation, tableName1)`


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095896#comment-16095896
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128698529
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1851,9 +1922,73 @@ private static String relativize(String baseDir, 
String childPath) {
   .relativize(fullPathWithoutSchemeAndAuthority.toUri()));
   if (relativeFilePath.isAbsolute()) {
 throw new IllegalStateException(String.format("Path %s is not a 
subpath of %s.",
-basePathWithoutSchemeAndAuthority.toUri().toString(), 
fullPathWithoutSchemeAndAuthority.toUri().toString()));
+basePathWithoutSchemeAndAuthority.toUri().getPath(), 
fullPathWithoutSchemeAndAuthority.toUri().getPath()));
+  }
+  return relativeFilePath.toUri().getPath();
+}
+  }
+
+  /**
+   * Supported metadata versions.
+   * 
+   * Note: keep them synchronized with {@link ParquetTableMetadataBase} 
versions
+   */
+  public static class MetadataVersion {
+
+/**
+ * Version 1: Introduces parquet file metadata caching.
+ * See DRILL-2743
+ */
+public static final String V1 = "v1";
+/**
+ * Version 2: Metadata cache file size is reduced.
+ * See DRILL-4053
+ */
+public static final String V2 = "v2";
+/**
+ * Version 3: Difference between v3 and v2 : min/max, type_length, 
precision, scale, repetitionLevel, definitionLevel.
+ * Filter pushdown for Parquet is implemented. 
+ * See DRILL-1950
+ */
+public static final String V3 = "v3";
+/**
+ * Version 3.1: Absolute paths of files and directories are replaced 
with relative ones.
+ * See DRILL-3867
+ */
+public static final String V3_1 = "v3.1";
+
+
+/**
+ * All historical versions of the Drill metadata cache files
+ */
+public static final List SUPPORTED_VERSIONS = 
Lists.newArrayList(V1, V2, V3, V3_1);
+
+/**
+ * @param metadataVersion parquet metadata version
+ * @return true if metadata version is supported, false otherwise
+ */
+public static boolean isVersionSupported(String metadataVersion) {
+  return SUPPORTED_VERSIONS.contains(metadataVersion);
+}
+
+/**
+ * Helper compare method similar to {@link 
java.util.Comparator#compare}
+ *
+ * @param metadataVersion1 the first metadata version to be compared
+ * @param metadataVersion2 the second metadata version to be compared
+ * @return a negative integer, zero, or a positive integer as the
+ * first argument is less than, equal to, or greater than the
+ * second.
+ */
+public static int compare(String metadataVersion1, String 
metadataVersion2) {
+  if (isVersionSupported(metadataVersion1) && 
isVersionSupported(metadataVersion2)) {
+return 
Integer.compare(SUPPORTED_VERSIONS.indexOf(metadataVersion1), 
SUPPORTED_VERSIONS.indexOf(metadataVersion2));
+  } else {
+// this is never reached
+throw UserException.validationError()
--- End diff --

Replace please with `DrillRuntimeException`, it's not user's fault if we 
try to compare with unsupported version.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095895#comment-16095895
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128698268
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1851,9 +1922,73 @@ private static String relativize(String baseDir, 
String childPath) {
   .relativize(fullPathWithoutSchemeAndAuthority.toUri()));
   if (relativeFilePath.isAbsolute()) {
 throw new IllegalStateException(String.format("Path %s is not a 
subpath of %s.",
-basePathWithoutSchemeAndAuthority.toUri().toString(), 
fullPathWithoutSchemeAndAuthority.toUri().toString()));
+basePathWithoutSchemeAndAuthority.toUri().getPath(), 
fullPathWithoutSchemeAndAuthority.toUri().getPath()));
+  }
+  return relativeFilePath.toUri().getPath();
+}
+  }
+
+  /**
+   * Supported metadata versions.
+   * 
+   * Note: keep them synchronized with {@link ParquetTableMetadataBase} 
versions
+   */
+  public static class MetadataVersion {
--- End diff --

1. Since this class is used in  other classes, I suggest we factor out as 
separate class. This class has too many inner classes.
2. It's not an enum, in this case name should be plural - 
`MetadataVersions`.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095894#comment-16095894
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128697267
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -38,24 +40,31 @@
   private static final String TEST_RES_PATH = WORKING_PATH + 
"/src/test/resources";
   private static final String tableName1 = "parquetTable1";
   private static final String tableName2 = "parquetTable2";
-  private static final String RELATIVE_PATHS_METADATA = 
"relative_paths_metadata";
+  private static File dataDir1;
+  private static File dataDir2;
 
 
   @BeforeClass
   public static void copyData() throws Exception {
 // copy the data into the temporary location
 String tmpLocation = getDfsTestTmpSchemaLocation();
-File dataDir1 = new File(tmpLocation + Path.SEPARATOR + tableName1);
+dataDir1 = new File(tmpLocation + Path.SEPARATOR + tableName1);
 dataDir1.mkdir();
 FileUtils.copyDirectory(new 
File(String.format(String.format("%s/multilevel/parquet", TEST_RES_PATH))),
 dataDir1);
 
-File dataDir2 = new File(tmpLocation + Path.SEPARATOR + tableName2);
+dataDir2 = new File(tmpLocation + Path.SEPARATOR + tableName2);
--- End diff --

The same as above.


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095779#comment-16095779
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128683724
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -132,23 +135,57 @@ public static ParquetTableMetadata_v3 
getParquetTableMetadata(FileSystem fs,
   }
 
   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
*
-   * @param fs
+   * @param fs current file system
* @param path The path to the metadata file, located in the directory 
that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
*/
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, Path 
path, MetadataContext metaContext, ParquetFormatConfig formatConfig) throws 
IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, false, metaContext);
-return metadata.parquetTableMetadata;
+  public static @Nullable ParquetTableMetadataBase 
readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+  ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, false, metaContext);
+return metadata.parquetTableMetadata;
+  } catch (IOException e) {
+logger.error(e.toString());
+metaContext.isMetaCacheFileCorrect = false;
+  }
+}
+logger.warn("Ignoring unsupported or corrupted metadata file version. 
Query performance may be slow. Make sure " +
+"the cache file is up-to-date by running the REFRESH TABLE 
METADATA command");
+return null;
   }
 
-  public static ParquetTableMetadataDirs readMetadataDirs(FileSystem fs, 
Path path, MetadataContext metaContext, ParquetFormatConfig formatConfig) 
throws IOException {
-Metadata metadata = new Metadata(fs, formatConfig);
-metadata.readBlockMeta(path, true, metaContext);
-return metadata.parquetTableMetadataDirs;
+  /**
+   * Get the parquet metadata for all subdirectories by reading the 
metadata file
+   *
+   * @param fs current file system
+   * @param path The path to the metadata file, located in the directory 
that contains the parquet files
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet metadata for a directory
+   * @throws IOException if metadata file can't be read or updated
+   */
+  public static @Nullable ParquetTableMetadataDirs 
readMetadataDirs(FileSystem fs, Path path,
+  MetadataContext metaContext, ParquetFormatConfig formatConfig) {
+if (metaContext.isMetaDirsCacheFileCorrect) {
+  Metadata metadata = new Metadata(fs, formatConfig);
+  try {
+metadata.readBlockMeta(path, true, metaContext);
+return metadata.parquetTableMetadataDirs;
+  } catch (IOException e) {
+logger.error(e.toString());
+metaContext.isMetaDirsCacheFileCorrect = false;
+  }
+}
+logger.warn("Ignoring corrupted metadata file. Query performance may 
be slow. Make sure the cache file" +
--- End diff --

Maybe move the message to constant so it does not have to be maintained in 
two places? And, maybe add the file or directory that is corrupt?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-07-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095778#comment-16095778
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r128683594
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +41,10 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  // False values of these flags allow to avoid double reading of 
corrupted or unsupported metadata files
+  public boolean isMetaCacheFileCorrect = true;
+  public boolean isMetaDirsCacheFileCorrect = true;
--- End diff --

Nice!

How are these states cleared? Do we keep track of the file timestamp, and 
try again if the file is replaced/updated? Or, is there some kind of manual 
reset? Is the state only per query?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


  1   2   >