subject:"\[GitHub\] drill pull request #595\: DRILL\-4203\: Parquet File. Date is stored wrongly"

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-18 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/595


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-10 Thread parthchandra

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82690304
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/FixedByteAlignedReader.java
 ---
@@ -161,7 +160,7 @@ void addNext(int start, int index) {
 intValue = readIntLittleEndian(bytebuf, start);
   }
 
-  mutator.set(index, DateTimeUtils.fromJulianDay(intValue + 
ParquetReaderUtility.CORRECT_CORRUPT_DATE_SHIFT));
+  mutator.set(index, (intValue - 
ParquetReaderUtility.CORRECT_CORRUPT_DATE_SHIFT) * 
DateTimeConstants.MILLIS_PER_DAY);
--- End diff --

Reviewing this math can give anyone a headache :(. But it is correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-06 Thread jaltekruse

Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82291973
  
--- Diff: 
contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java
 ---
@@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, 
HiveDrillNativeParquetSubScan
   // in the first row group
   ParquetReaderUtility.DateCorruptionStatus containsCorruptDates =
   ParquetReaderUtility.detectCorruptDates(parquetMetadata, 
config.getColumns(), true);
+  logger.info(containsCorruptDates.toString());
--- End diff --

Sounds good


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-06 Thread jaltekruse

Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82292165
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -184,17 +201,15 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
   if (parsedCreatedByVersion.hasSemanticVersion()) {
 SemanticVersion semVer = 
parsedCreatedByVersion.getSemanticVersion();
 String pre = semVer.pre + "";
-if (semVer != null && semVer.major == 1 && semVer.minor == 8 
&& semVer.patch == 1 && pre.contains("drill")) {
--- End diff --

Sorry I should have looked through the rest of the changes, thank you for 
fixing this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-06 Thread jaltekruse

Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82292241
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -107,6 +107,7 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
 boolean autoCorrectCorruptDates = 
rowGroupScan.formatConfig.autoCorrectCorruptDates;
 ParquetReaderUtility.DateCorruptionStatus containsCorruptDates = 
ParquetReaderUtility.detectCorruptDates(footers.get(e.getPath()), 
rowGroupScan.getColumns(),
 autoCorrectCorruptDates);
+logger.info(containsCorruptDates.toString());
--- End diff --

addressed, in other similar comment, this is good


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-06 Thread vdiravka

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82287507
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -184,17 +201,15 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
   if (parsedCreatedByVersion.hasSemanticVersion()) {
 SemanticVersion semVer = 
parsedCreatedByVersion.getSemanticVersion();
 String pre = semVer.pre + "";
-if (semVer != null && semVer.major == 1 && semVer.minor == 8 
&& semVer.patch == 1 && pre.contains("drill")) {
--- End diff --

If `semVer `will be null we get NPE above, where `semVer.pre` is called. 
It happened with tests on our regression test framework. That's why I added 
`hasSemanticVersion()` above. So if this method return `true`, `semVer `cann't 
be `NULL` according `org.apache.parquet.SemanticVersion` and the following 
`semVer!=null` checking is redundant.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-06 Thread vdiravka

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82290736
  
--- Diff: 
contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java
 ---
@@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, 
HiveDrillNativeParquetSubScan
   // in the first row group
   ParquetReaderUtility.DateCorruptionStatus containsCorruptDates =
   ParquetReaderUtility.detectCorruptDates(parquetMetadata, 
config.getColumns(), true);
+  logger.info(containsCorruptDates.toString());
--- End diff --

But it is not a regular Boolean value, `DateCorruptionStatus` is an enum 
with overrided toString method. For example how log file looks after querying 
hive parquet file with correct date values:

`2016-10-06 23:32:30,598 [280920f1-e362-136e-0fdd-24779fef2c4a:frag:0:0] 
INFO  o.a.d.e.s.p.ParquetScanBatchCreator - It is determined from metadata that 
the date values are definitely CORRECT`

Is it enought?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-06 Thread jaltekruse

Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82274813
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -184,17 +201,15 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
   if (parsedCreatedByVersion.hasSemanticVersion()) {
 SemanticVersion semVer = 
parsedCreatedByVersion.getSemanticVersion();
 String pre = semVer.pre + "";
-if (semVer != null && semVer.major == 1 && semVer.minor == 8 
&& semVer.patch == 1 && pre.contains("drill")) {
--- End diff --

I don't know if you are guarding against a null value for semVer at a 
higher level, but I'm pretty sure the reason I added it is that older versions 
of parquet files can be lacking this field. I assume with all of the tests 
cases that were added we should be okay, but it might be better to keep it in 
defensively. If this field isn't set in the metadata, it shouldn't make it an 
invalid file, but if we get an NPE it will stop query processing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-06 Thread jaltekruse

Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82273282
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -918,18 +916,22 @@ public void setMax(Object max) {
 @JsonProperty public ConcurrentHashMap columnTypeInfo;
 @JsonProperty List files;
 @JsonProperty List directories;
-@JsonProperty String drillVersion;
--- End diff --

This sounds reasonable, having both is okay.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-06 Thread jaltekruse

Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r82272848
  
--- Diff: 
contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveDrillNativeScanBatchCreator.java
 ---
@@ -123,6 +123,7 @@ public ScanBatch getBatch(FragmentContext context, 
HiveDrillNativeParquetSubScan
   // in the first row group
   ParquetReaderUtility.DateCorruptionStatus containsCorruptDates =
   ParquetReaderUtility.detectCorruptDates(parquetMetadata, 
config.getColumns(), true);
+  logger.info(containsCorruptDates.toString());
--- End diff --

not a very useful logging statement to check in, should at least contain a 
message about what this value is. This will just print a boolean value. If it 
was just for local debugging it can be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-04 Thread vdiravka

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81736781
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -918,18 +916,22 @@ public void setMax(Object max) {
 @JsonProperty public ConcurrentHashMap columnTypeInfo;
 @JsonProperty List files;
 @JsonProperty List directories;
-@JsonProperty String drillVersion;
--- End diff --

I agree regarding future similar issues. So I returned this field to 
`ParquetTableMetadataBase` in the new commit. 
But to determine that the parquet file is new and definitely with correct 
date values we have a new flag in parquet metadata "is.date.correct = true". 
Thus, it is more easy to determine corrupted date values and there is no need 
to wait the end of release to merge these changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-04 Thread vdiravka

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81722838
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -104,6 +104,9 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
   logger.trace("ParquetTrace,Read Footer,{},{},{},{},{},{},{}", 
"", e.getPath(), "", 0, 0, 0, timeToRead);
   footers.put(e.getPath(), footer );
 }
+boolean autoCorrectCorruptDates = 
rowGroupScan.formatConfig.autoCorrectCorruptDates;
--- End diff --

We are going there only to detect corrupt dates according the option in the 
parquet format config. 
I add the log message below regarding DateCorruptionStatus and also in 
other places where we `detectCorruptDates`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-03 Thread vdiravka

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81636854
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java
 ---
@@ -207,6 +229,31 @@ public OperatorContext getOperatorContext() {
 return operatorContext;
   }
 
+  /**
--- End diff --

It is just Jason's minor refactoring. Is it ok? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-03 Thread vdiravka

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81625584
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -664,27 +667,37 @@ private void init(MetadataContext metaContext) throws 
IOException {
   }
   if (metaPath != null && fs.exists(metaPath)) {
 usedMetadataCache = true;
-parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaPath.toString(), metaContext);
+if (parquetTableMetadata == null) {
+  parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaPath.toString(), metaContext, formatConfig);
+}
   } else {
-parquetTableMetadata = Metadata.getParquetTableMetadata(fs, 
p.toString());
+parquetTableMetadata = Metadata.getParquetTableMetadata(fs, 
p.toString(), formatConfig);
   }
 } else {
   Path p = Path.getPathWithoutSchemeAndAuthority(new 
Path(selectionRoot));
   Path metaPath = new Path(p, Metadata.METADATA_FILENAME);
   if (fs.isDirectory(new Path(selectionRoot)) && fs.exists(metaPath)) {
 usedMetadataCache = true;
 if (parquetTableMetadata == null) {
-  parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaPath.toString(), metaContext);
+  parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaPath.toString(), metaContext, formatConfig);
 }
 if (fileSet != null) {
-  parquetTableMetadata = 
removeUnneededRowGroups(parquetTableMetadata);
+  if (parquetTableMetadata == null) {
--- End diff --

agree. It is redundant. Changed in the last commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-10-03 Thread vdiravka

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81625440
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -935,6 +972,11 @@ public ColumnTypeMetadata_v2 
getColumnTypeInfo(String[] name) {
 @JsonIgnore @Override public ParquetTableMetadataBase clone() {
   return new ParquetTableMetadata_v2(files, directories, 
columnTypeInfo);
 }
+
+@JsonIgnore @Override public boolean isDateCorrect() {
+  return isDateCorrect;
--- End diff --

If metadata cache file is existed Drill reads it instead of retrieving 
metadata from multiple Parquet files. In the case when it was generated with 
drill after this commit the value of isDateCorrect will be true. In the case 
when it was generated with drill before this commit the isDateCorrect field in 
metadata cache file will be absent and value of this will be false in 
ParquetTableMetadata_v2.
And according to this value we just define DateCorruptionStatus (you can 
see more in ParquetReaderUtility.correctDatesInMetadataCache()). The leftover 
way of data checking in the cache was not changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-09-30 Thread jaltekruse

Github user jaltekruse commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81396800
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -918,18 +916,22 @@ public void setMax(Object max) {
 @JsonProperty public ConcurrentHashMap columnTypeInfo;
 @JsonProperty List files;
 @JsonProperty List directories;
-@JsonProperty String drillVersion;
--- End diff --

I had intentionally added the drill version here assuming that it would be 
good information to have around if a similar issue ever comes up the the 
future, as well as provide all of the info we need to have an explicit flag 
that the dates have become correct. For this to work completely, this commit 
should be the last commit right before a release (it could be a point release). 
Any particular reason that we would want to not write it into the file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-09-29 Thread parthchandra

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81240983
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -104,6 +104,9 @@ public ScanBatch getBatch(FragmentContext context, 
ParquetRowGroupScan rowGroupS
   logger.trace("ParquetTrace,Read Footer,{},{},{},{},{},{},{}", 
"", e.getPath(), "", 0, 0, 0, timeToRead);
   footers.put(e.getPath(), footer );
 }
+boolean autoCorrectCorruptDates = 
rowGroupScan.formatConfig.autoCorrectCorruptDates;
--- End diff --

It would be nice to have a log message here if we are going to autocorrect 
dates.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-09-29 Thread parthchandra

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81238209
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -664,27 +667,37 @@ private void init(MetadataContext metaContext) throws 
IOException {
   }
   if (metaPath != null && fs.exists(metaPath)) {
 usedMetadataCache = true;
-parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaPath.toString(), metaContext);
+if (parquetTableMetadata == null) {
+  parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaPath.toString(), metaContext, formatConfig);
+}
   } else {
-parquetTableMetadata = Metadata.getParquetTableMetadata(fs, 
p.toString());
+parquetTableMetadata = Metadata.getParquetTableMetadata(fs, 
p.toString(), formatConfig);
   }
 } else {
   Path p = Path.getPathWithoutSchemeAndAuthority(new 
Path(selectionRoot));
   Path metaPath = new Path(p, Metadata.METADATA_FILENAME);
   if (fs.isDirectory(new Path(selectionRoot)) && fs.exists(metaPath)) {
 usedMetadataCache = true;
 if (parquetTableMetadata == null) {
-  parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaPath.toString(), metaContext);
+  parquetTableMetadata = Metadata.readBlockMeta(fs, 
metaPath.toString(), metaContext, formatConfig);
 }
 if (fileSet != null) {
-  parquetTableMetadata = 
removeUnneededRowGroups(parquetTableMetadata);
+  if (parquetTableMetadata == null) {
--- End diff --

You've already made sure to read ParquetTableMetadata if it was null, so 
you probably needn't read it again


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-09-29 Thread parthchandra

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81244832
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java
 ---
@@ -207,6 +229,31 @@ public OperatorContext getOperatorContext() {
 return operatorContext;
   }
 
+  /**
--- End diff --

This seems completely unrelated to the bug


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-09-29 Thread parthchandra

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/595#discussion_r81240013
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -18,16 +18,62 @@
 package org.apache.drill.exec.store.parquet;
 
 import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
 import org.apache.drill.exec.planner.physical.PlannerSettings;
 import org.apache.drill.exec.server.options.OptionManager;
+import org.apache.drill.exec.store.AbstractRecordReader;
+import org.apache.drill.exec.store.ParquetOutputRecordWriter;
 import org.apache.drill.exec.work.ExecErrorConstants;
+import org.apache.parquet.SemanticVersion;
+import org.apache.parquet.VersionParser;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.format.ConvertedType;
+import org.apache.parquet.format.FileMetaData;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.OriginalType;
+import org.joda.time.Chronology;
+import org.joda.time.DateTimeConstants;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
 
 /*
  * Utility class where we can capture common logic between the two parquet 
readers
  */
 public class ParquetReaderUtility {
   private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ParquetReaderUtility.class);
 
+  // Note the negation symbol in the beginning
+  public static final double CORRECT_CORRUPT_DATE_SHIFT = 
-ParquetOutputRecordWriter.JULIAN_DAY_EPOC - 0.5;
--- End diff --

Is the the number by which you must correct the incorrect values? If so, 
please add a comment. This had me very confused for a while.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

2016-09-22 Thread vdiravka

GitHub user vdiravka opened a pull request:

https://github.com/apache/drill/pull/595

DRILL-4203: Parquet File. Date is stored wrongly

Drill was writing non-standard dates into parquet files for all releases
before this commit. The values have been read correctly by Drill, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.

This change corrects the behavior of the Drill parquet writer to correctly
store dates in the format given in the parquet specification.

To maintain compatibility with old files, the parquet reader code has been
updated to check for the old format and automatically shift the
corrupted values into corrected ones automatically.

The test cases included here should ensure that all files produced by
historical versions of Drill will continue to return the same values
they had in previous releases. For compatibility with external tools, any
old files with corrupted dates can be re-written using the CREATE TABLE AS
command (as the writer will now only produce the specification-compliant
values, even if after reading out of older corrupt files, one
new extra field "is.date.correct = true" will be included into the parquet 
meta
information of files and into drill metadata cache files).

While the old behavior was a consistent shift into an unlikely range to be
used in a modern database (over 10,000 years in the future), these are
still valid date values. In the case where these may have been written
into files intentionally, and we cannot be certain from the metadata if
Drill produced the files, an option is included to turn off the 
auto-correction.
Use of this option is assumed to be extremely unlikely, but it is included 
for
completeness.

One small fix in the ParquetGroupScan to accommodate changes in master that 
changed
when metadata is read.

Added new tests for bugs (revealed by the regression suite) with old and new
parquet (binary) files for new tests, updated metadata cache files 
accordingly.

Removed unnecessary double conversion of value with Julian day.

Added ability to correct corrupted dates for parquet files with the second
version old metadata cache file as well.

Fix DrillVersionInfo to make it provide a valid version number even during
the unit tests. This is now a build-time generated class, rather than one
that looks on the classpath for META-INF files. (This pattern
for file generation with parameters passed from the POM files
was borrowed from parquet-mr)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vdiravka/drill DRILL-4203

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #595


commit 6f816742d773a1696b5329472c2465a79e35140c
Author: Vitalii Diravka 
Date:   2016-09-22T13:44:37Z

DRILL-4203: Parquet File. Date is stored wrongly

Drill was writing non-standard dates into parquet files for all releases
before this commit. The values have been read correctly by Drill, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.

This change corrects the behavior of the Drill parquet writer to correctly
store dates in the format given in the parquet specification.

To maintain compatibility with old files, the parquet reader code has been
updated to check for the old format and automatically shift the
corrupted values into corrected ones automatically.

The test cases included here should ensure that all files produced by
historical versions of Drill will continue to return the same values
they had in previous releases. For compatibility with external tools, any
old files with corrupted dates can be re-written using the CREATE TABLE AS
command (as the writer will now only produce the specification-compliant
values, even if after reading out of older corrupt files, one
new extra field "is.date.correct = true" will be included into the parquet 
meta
information of files and into drill metadata cache files).

While the old behavior was a consistent shift into an unlikely range to be
used in a modern database (over 10,000 years in the future), these are
still valid date values. In the case where these may have been written
into files intentionally, and we cannot be certain from the metadata if
Drill produced the files, an option is included to turn off the 
auto-correction.
Use of this option is assumed to be extremely unlikely, but it is

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

[GitHub] drill pull request #595: DRILL-4203: Parquet File. Date is stored wrongly

21 matches

Site Navigation

Mail list logo

Footer information