[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/595 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82690304 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/FixedByteAlignedReader.java --- @@ -161,7 +160,7 @@ void addNext(int start, int index) { intValue = readIntLittleEndian(bytebuf, start); } - mutator.set(index, DateTimeUtils.fromJulianDay(intValue + ParquetReaderUtility.CORRECT_CORRUPT_DATE_SHIFT)); + mutator.set(index, (intValue - ParquetReaderUtility.CORRECT_CORRUPT_DATE_SHIFT) * DateTimeConstants.MILLIS_PER_DAY); --- End diff -- Reviewing this math can give anyone a headache :(. But it is correct. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82291973 --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java --- @@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, HiveDrillNativeParquetSubScan // in the first row group ParquetReaderUtility.DateCorruptionStatus containsCorruptDates = ParquetReaderUtility.detectCorruptDates(parquetMetadata, config.getColumns(), true); + logger.info(containsCorruptDates.toString()); --- End diff -- Sounds good --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82292165 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -184,17 +201,15 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer, if (parsedCreatedByVersion.hasSemanticVersion()) { SemanticVersion semVer = parsedCreatedByVersion.getSemanticVersion(); String pre = semVer.pre + ""; -if (semVer != null && semVer.major == 1 && semVer.minor == 8 && semVer.patch == 1 && pre.contains("drill")) { --- End diff -- Sorry I should have looked through the rest of the changes, thank you for fixing this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82292241 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java --- @@ -107,6 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS boolean autoCorrectCorruptDates = rowGroupScan.formatConfig.autoCorrectCorruptDates; ParquetReaderUtility.DateCorruptionStatus containsCorruptDates = ParquetReaderUtility.detectCorruptDates(footers.get(e.getPath()), rowGroupScan.getColumns(), autoCorrectCorruptDates); +logger.info(containsCorruptDates.toString()); --- End diff -- addressed, in other similar comment, this is good --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82287507 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -184,17 +201,15 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer, if (parsedCreatedByVersion.hasSemanticVersion()) { SemanticVersion semVer = parsedCreatedByVersion.getSemanticVersion(); String pre = semVer.pre + ""; -if (semVer != null && semVer.major == 1 && semVer.minor == 8 && semVer.patch == 1 && pre.contains("drill")) { --- End diff -- If `semVer `will be null we get NPE above, where `semVer.pre` is called. It happened with tests on our regression test framework. That's why I added `hasSemanticVersion()` above. So if this method return `true`, `semVer `cann't be `NULL` according `org.apache.parquet.SemanticVersion` and the following `semVer!=null` checking is redundant. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82290736 --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java --- @@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, HiveDrillNativeParquetSubScan // in the first row group ParquetReaderUtility.DateCorruptionStatus containsCorruptDates = ParquetReaderUtility.detectCorruptDates(parquetMetadata, config.getColumns(), true); + logger.info(containsCorruptDates.toString()); --- End diff -- But it is not a regular Boolean value, `DateCorruptionStatus` is an enum with overrided toString method. For example how log file looks after querying hive parquet file with correct date values: `2016-10-06 23:32:30,598 [280920f1-e362-136e-0fdd-24779fef2c4a:frag:0:0] INFO o.a.d.e.s.p.ParquetScanBatchCreator - It is determined from metadata that the date values are definitely CORRECT` Is it enought? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82274813 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -184,17 +201,15 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer, if (parsedCreatedByVersion.hasSemanticVersion()) { SemanticVersion semVer = parsedCreatedByVersion.getSemanticVersion(); String pre = semVer.pre + ""; -if (semVer != null && semVer.major == 1 && semVer.minor == 8 && semVer.patch == 1 && pre.contains("drill")) { --- End diff -- I don't know if you are guarding against a null value for semVer at a higher level, but I'm pretty sure the reason I added it is that older versions of parquet files can be lacking this field. I assume with all of the tests cases that were added we should be okay, but it might be better to keep it in defensively. If this field isn't set in the metadata, it shouldn't make it an invalid file, but if we get an NPE it will stop query processing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82273282 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -918,18 +916,22 @@ public void setMax(Object max) { @JsonProperty public ConcurrentHashMapcolumnTypeInfo; @JsonProperty List files; @JsonProperty List directories; -@JsonProperty String drillVersion; --- End diff -- This sounds reasonable, having both is okay. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r82272848 --- Diff: contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java --- @@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, HiveDrillNativeParquetSubScan // in the first row group ParquetReaderUtility.DateCorruptionStatus containsCorruptDates = ParquetReaderUtility.detectCorruptDates(parquetMetadata, config.getColumns(), true); + logger.info(containsCorruptDates.toString()); --- End diff -- not a very useful logging statement to check in, should at least contain a message about what this value is. This will just print a boolean value. If it was just for local debugging it can be removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81736781 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -918,18 +916,22 @@ public void setMax(Object max) { @JsonProperty public ConcurrentHashMapcolumnTypeInfo; @JsonProperty List files; @JsonProperty List directories; -@JsonProperty String drillVersion; --- End diff -- I agree regarding future similar issues. So I returned this field to `ParquetTableMetadataBase` in the new commit. But to determine that the parquet file is new and definitely with correct date values we have a new flag in parquet metadata "is.date.correct = true". Thus, it is more easy to determine corrupted date values and there is no need to wait the end of release to merge these changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81722838 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java --- @@ -104,6 +104,9 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS logger.trace("ParquetTrace,Read Footer,{},{},{},{},{},{},{}", "", e.getPath(), "", 0, 0, 0, timeToRead); footers.put(e.getPath(), footer ); } +boolean autoCorrectCorruptDates = rowGroupScan.formatConfig.autoCorrectCorruptDates; --- End diff -- We are going there only to detect corrupt dates according the option in the parquet format config. I add the log message below regarding DateCorruptionStatus and also in other places where we `detectCorruptDates`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81636854 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java --- @@ -207,6 +229,31 @@ public OperatorContext getOperatorContext() { return operatorContext; } + /** --- End diff -- It is just Jason's minor refactoring. Is it ok? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81625584 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -664,27 +667,37 @@ private void init(MetadataContext metaContext) throws IOException { } if (metaPath != null && fs.exists(metaPath)) { usedMetadataCache = true; -parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext); +if (parquetTableMetadata == null) { + parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext, formatConfig); +} } else { -parquetTableMetadata = Metadata.getParquetTableMetadata(fs, p.toString()); +parquetTableMetadata = Metadata.getParquetTableMetadata(fs, p.toString(), formatConfig); } } else { Path p = Path.getPathWithoutSchemeAndAuthority(new Path(selectionRoot)); Path metaPath = new Path(p, Metadata.METADATA_FILENAME); if (fs.isDirectory(new Path(selectionRoot)) && fs.exists(metaPath)) { usedMetadataCache = true; if (parquetTableMetadata == null) { - parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext); + parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext, formatConfig); } if (fileSet != null) { - parquetTableMetadata = removeUnneededRowGroups(parquetTableMetadata); + if (parquetTableMetadata == null) { --- End diff -- agree. It is redundant. Changed in the last commit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81625440 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -935,6 +972,11 @@ public ColumnTypeMetadata_v2 getColumnTypeInfo(String[] name) { @JsonIgnore @Override public ParquetTableMetadataBase clone() { return new ParquetTableMetadata_v2(files, directories, columnTypeInfo); } + +@JsonIgnore @Override public boolean isDateCorrect() { + return isDateCorrect; --- End diff -- If metadata cache file is existed Drill reads it instead of retrieving metadata from multiple Parquet files. In the case when it was generated with drill after this commit the value of isDateCorrect will be true. In the case when it was generated with drill before this commit the isDateCorrect field in metadata cache file will be absent and value of this will be false in ParquetTableMetadata_v2. And according to this value we just define DateCorruptionStatus (you can see more in ParquetReaderUtility.correctDatesInMetadataCache()). The leftover way of data checking in the cache was not changed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user jaltekruse commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81396800 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -918,18 +916,22 @@ public void setMax(Object max) { @JsonProperty public ConcurrentHashMapcolumnTypeInfo; @JsonProperty List files; @JsonProperty List directories; -@JsonProperty String drillVersion; --- End diff -- I had intentionally added the drill version here assuming that it would be good information to have around if a similar issue ever comes up the the future, as well as provide all of the info we need to have an explicit flag that the dates have become correct. For this to work completely, this commit should be the last commit right before a release (it could be a point release). Any particular reason that we would want to not write it into the file? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81240983 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java --- @@ -104,6 +104,9 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS logger.trace("ParquetTrace,Read Footer,{},{},{},{},{},{},{}", "", e.getPath(), "", 0, 0, 0, timeToRead); footers.put(e.getPath(), footer ); } +boolean autoCorrectCorruptDates = rowGroupScan.formatConfig.autoCorrectCorruptDates; --- End diff -- It would be nice to have a log message here if we are going to autocorrect dates. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81238209 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -664,27 +667,37 @@ private void init(MetadataContext metaContext) throws IOException { } if (metaPath != null && fs.exists(metaPath)) { usedMetadataCache = true; -parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext); +if (parquetTableMetadata == null) { + parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext, formatConfig); +} } else { -parquetTableMetadata = Metadata.getParquetTableMetadata(fs, p.toString()); +parquetTableMetadata = Metadata.getParquetTableMetadata(fs, p.toString(), formatConfig); } } else { Path p = Path.getPathWithoutSchemeAndAuthority(new Path(selectionRoot)); Path metaPath = new Path(p, Metadata.METADATA_FILENAME); if (fs.isDirectory(new Path(selectionRoot)) && fs.exists(metaPath)) { usedMetadataCache = true; if (parquetTableMetadata == null) { - parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext); + parquetTableMetadata = Metadata.readBlockMeta(fs, metaPath.toString(), metaContext, formatConfig); } if (fileSet != null) { - parquetTableMetadata = removeUnneededRowGroups(parquetTableMetadata); + if (parquetTableMetadata == null) { --- End diff -- You've already made sure to read ParquetTableMetadata if it was null, so you probably needn't read it again --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81244832 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java --- @@ -207,6 +229,31 @@ public OperatorContext getOperatorContext() { return operatorContext; } + /** --- End diff -- This seems completely unrelated to the bug --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/595#discussion_r81240013 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -18,16 +18,62 @@ package org.apache.drill.exec.store.parquet; import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.expression.PathSegment; +import org.apache.drill.common.expression.SchemaPath; import org.apache.drill.exec.planner.physical.PlannerSettings; import org.apache.drill.exec.server.options.OptionManager; +import org.apache.drill.exec.store.AbstractRecordReader; +import org.apache.drill.exec.store.ParquetOutputRecordWriter; import org.apache.drill.exec.work.ExecErrorConstants; +import org.apache.parquet.SemanticVersion; +import org.apache.parquet.VersionParser; +import org.apache.parquet.column.ColumnDescriptor; +import org.apache.parquet.column.statistics.Statistics; +import org.apache.parquet.format.ConvertedType; +import org.apache.parquet.format.FileMetaData; +import org.apache.parquet.format.SchemaElement; +import org.apache.parquet.format.converter.ParquetMetadataConverter; +import org.apache.parquet.hadoop.ParquetFileWriter; +import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData; +import org.apache.parquet.hadoop.metadata.ColumnPath; +import org.apache.parquet.hadoop.metadata.ParquetMetadata; +import org.apache.parquet.schema.OriginalType; +import org.joda.time.Chronology; +import org.joda.time.DateTimeConstants; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.List; +import java.util.Map; /* * Utility class where we can capture common logic between the two parquet readers */ public class ParquetReaderUtility { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ParquetReaderUtility.class); + // Note the negation symbol in the beginning + public static final double CORRECT_CORRUPT_DATE_SHIFT = -ParquetOutputRecordWriter.JULIAN_DAY_EPOC - 0.5; --- End diff -- Is the the number by which you must correct the incorrect values? If so, please add a comment. This had me very confused for a while. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly
GitHub user vdiravka opened a pull request: https://github.com/apache/drill/pull/595 DRILL-4203: Parquet File. Date is stored wrongly Drill was writing non-standard dates into parquet files for all releases before this commit. The values have been read correctly by Drill, but external tools like Spark reading the files will see corrupted values for all dates that have been written by Drill. This change corrects the behavior of the Drill parquet writer to correctly store dates in the format given in the parquet specification. To maintain compatibility with old files, the parquet reader code has been updated to check for the old format and automatically shift the corrupted values into corrected ones automatically. The test cases included here should ensure that all files produced by historical versions of Drill will continue to return the same values they had in previous releases. For compatibility with external tools, any old files with corrupted dates can be re-written using the CREATE TABLE AS command (as the writer will now only produce the specification-compliant values, even if after reading out of older corrupt files, one new extra field "is.date.correct = true" will be included into the parquet meta information of files and into drill metadata cache files). While the old behavior was a consistent shift into an unlikely range to be used in a modern database (over 10,000 years in the future), these are still valid date values. In the case where these may have been written into files intentionally, and we cannot be certain from the metadata if Drill produced the files, an option is included to turn off the auto-correction. Use of this option is assumed to be extremely unlikely, but it is included for completeness. One small fix in the ParquetGroupScan to accommodate changes in master that changed when metadata is read. Added new tests for bugs (revealed by the regression suite) with old and new parquet (binary) files for new tests, updated metadata cache files accordingly. Removed unnecessary double conversion of value with Julian day. Added ability to correct corrupted dates for parquet files with the second version old metadata cache file as well. Fix DrillVersionInfo to make it provide a valid version number even during the unit tests. This is now a build-time generated class, rather than one that looks on the classpath for META-INF files. (This pattern for file generation with parameters passed from the POM files was borrowed from parquet-mr) You can merge this pull request into a Git repository by running: $ git pull https://github.com/vdiravka/drill DRILL-4203 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/595.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #595 commit 6f816742d773a1696b5329472c2465a79e35140c Author: Vitalii DiravkaDate: 2016-09-22T13:44:37Z DRILL-4203: Parquet File. Date is stored wrongly Drill was writing non-standard dates into parquet files for all releases before this commit. The values have been read correctly by Drill, but external tools like Spark reading the files will see corrupted values for all dates that have been written by Drill. This change corrects the behavior of the Drill parquet writer to correctly store dates in the format given in the parquet specification. To maintain compatibility with old files, the parquet reader code has been updated to check for the old format and automatically shift the corrupted values into corrected ones automatically. The test cases included here should ensure that all files produced by historical versions of Drill will continue to return the same values they had in previous releases. For compatibility with external tools, any old files with corrupted dates can be re-written using the CREATE TABLE AS command (as the writer will now only produce the specification-compliant values, even if after reading out of older corrupt files, one new extra field "is.date.correct = true" will be included into the parquet meta information of files and into drill metadata cache files). While the old behavior was a consistent shift into an unlikely range to be used in a modern database (over 10,000 years in the future), these are still valid date values. In the case where these may have been written into files intentionally, and we cannot be certain from the metadata if Drill produced the files, an option is included to turn off the auto-correction. Use of this option is assumed to be extremely unlikely, but it is