Re: isDateCorrect field in ParquetTableMetadata

2016-11-03 Thread Vitalii Diravka
I have opened a PR for the DRILL-4980 . @Jinfeng I removed is.date.correct flag from the writer. And I left the checking of it in detectCorruptDates() method for the reason if somebody has already

Re: isDateCorrect field in ParquetTableMetadata

2016-10-30 Thread Paul Rogers
Choosing a good property name should solve the confusion issue. Perhaps drill.writer as the name. The writer version is not needed if we feel that we’ll never again change the writer or introduce bugs. Since that is hard to predict, adding a writer version is very low cost insurance. Indeed,

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jason Altekruse
The only worry I have about declaring a writer version is possible confusion with the Parquet format version itself. The format is already defined through version 2.1 or something like that, but we are currently only writing files based on the 1.x version of the format. My preferred solution to

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Vitalli, Just to confirm, you will "remove" isDateCorrect flag, and use parquet-writer version in stead, correct? On Fri, Oct 28, 2016 at 2:52 PM, Vitalii Diravka wrote: > Jinfeng, > > isDateCorrect will be false in the code when isDateCorrect property is > absent

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
Jinfeng, isDateCorrect will be false in the code when isDateCorrect property is absent in the parquet metadata. Anyway I am going to implement the mentioned approach with the parquet-writer.version instead of isDateCorrect property.

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Paul Rogers
I like the proposal. The Parquet Writer version should just be 2 (no .0.0 as we won’t have major or minor versions.) With things like writer versions (or RPC versions, etc.) the usual rule is to use increasing integers. I am surprised that the other tools don’t include more detail about the

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Thanks for the explanation, Jason. The three different values for DateCorruptionStatus make sense to me. The isDataCorrect flag = true, means that the values are known to be correct. The isDataCorrect flag = false, means that the values are know to be incorrect, or unclear? On Fri, Oct 28,

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
I explored metadata of parquet files generated from different tools: * Impala:* creator: impala version 2.2.0-cdh5.4.5 (build 4a81c1d04c39961ef14ff6131d543dd96ef60e6e) *Hive:* creator: parquet-mr version 1.6.0 *Pig:* creator: parquet-mr version 1.5.1-SNAPSHOT extra:

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jason Altekruse
The isDataCorrect flag means that the values are known to be correct, and there is no need to auto-detect corruption or correct anything. META_SHOWS_CORRUPTION can be set either when we have a known old version of Drill written in the metadata, or we have older files that might have been written

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Hi Vitalli, DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION, META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES. What value will this isDateCorrect flag have for each possiblity, especially for META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect same things,

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Paul Rogers
Thanks Vitalii. The Parquet Writer solution “just works”. As soon as someone upgrades the writer, files are labeled as having that new version. No fuzziness during a release as in 1.9. It is fine to also include the Drill version. But, format decisions should be keyed off of the writer

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
I agree that it would be good if the approach of parquet date correctness detection will be upgraded. So I created the jira for it DRILL-4980 . But now we have two ideas: 1. To add checking of the drill version additionally, so later we can delete

Re: isDateCorrect field in ParquetTableMetadata

2016-10-27 Thread Paul Rogers
FWIW: back on the magic flag issue… I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course grained for our needs. A typical solution is include the version of the Parquet writer in addition to that of Drill. Each time we change something in the writer, increment the version

Re: isDateCorrect field in ParquetTableMetadata

2016-10-27 Thread Jason Altekruse
Vitalli, Thank you for looking into this, sorry I missed it in the review. When you open up a request to fix this issue could you update the check for correctness in the metadata to check for the is.date.correct flag, or a version greater than or equal to 1.9.0 (no snapshot)? This will allow us

Re: isDateCorrect field in ParquetTableMetadata

2016-10-27 Thread Zelaine Fong
Vitalii -- are you still planning to open a ticket and pull request for the fix you've noted below? -- Zelaine On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka wrote: > @Paul Rogers > It may be the undefined case when the file is generated with drill.version > =

Re: isDateCorrect field in ParquetTableMetadata

2016-10-26 Thread Vitalii Diravka
@Paul Rogers It may be the undefined case when the file is generated with drill.version = 1.9-SNAPSHOT. It is more easy to determine corrupted date with this flag and there is no need to wait the end of release to merge these changes. @Jinfeng NI It looks like you are right. With consistent mode

Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Jinfeng Ni
I'm not sure if I fully understand your answers. The bottom line is quite simple: given a set of parquet files, the ParquetTableMeta instance constructed in Drill should have identical value for "isDateCorrect", whether it comes from parquet footer, or parquet metadata cache, or whether there is

Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Paul Rogers
Would it make sense to write the Drill version into the Parquet metadata, and decide on date format based on the Drill version? This works if Drill versions after, say, 1.9 has the “correct” format and anything with an earlier version has “incorrect” dates. This is the typical way that folks

Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Vitalii Diravka
Hi Jinfeng, 1.If the parquet files are generated with Drill after Drill-4203 these files have "isDateCorrect = true" property. Drill serializes this property from metadata now. When we set this property in the first constructor we will hide the value from metadata. IsDateCorrect will be false

Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Jinfeng Ni
Forgot to copy the link to the code. [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L950-L955 On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni wrote: > @Jason, @Vitalli, > > Any thoughts on this

Re: isDateCorrect field in ParquetTableMetadata

2016-10-25 Thread Jinfeng Ni
@Jason, @Vitalli, Any thoughts on this question, since both you worked on fix of DRILL-4203? Looking through the code, there is a third case [1], where this flag is set to false when Parquet metadata is cloned (after partition pruning, etc). That means, for the 2nd case where the flag is set to

isDateCorrect field in ParquetTableMetadata

2016-10-24 Thread Jinfeng Ni
Hello All, DRILL-4203 addressed the date field issue. In the fix, it introduced a new field in ParquetTableMetadata_v2 : isDateCorrect. I have some difficulty in understanding the meaning of this field. According to [1], this field is set to false, when Drill gets parquet metadata from parquet