[jira] [Commented] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)

2016-02-09 Thread Jason Altekruse (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140236#comment-15140236 ]

Jason Altekruse commented on DRILL-4349:
----------------------------------------

As I was rolling the rc3 release candidate for 1.5.0, I decided to apply this 
fix to the release branch, as it seemed useful to get into the release. The 
commit hash will be different, but the patch applied cleanly and the diff is 
identical.

> parquet reader returns wrong results when reading a nullable column that 
> starts with a large number of nulls (>30k)
> ------------------------------------------------------------------------
>
> Key: DRILL-4349
> URL: https://issues.apache.org/jira/browse/DRILL-4349
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.4.0
>Reporter: Deneche A. Hakim
>Assignee: Deneche A. Hakim
>Priority: Critical
> Fix For: 1.5.0
>
> Attachments: drill4349.tar.gz
>
>
> While reading a nullable column, if in a single pass we only read null 
> values, the parquet reader resets the value of pageReader.readPosInBytes, 
> which leads to wrong data being read from the file.
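> For illustration only, here is a minimal, hypothetical Python sketch of 
> this class of bug; it is not Drill's actual reader code, just a reader 
> whose byte offset is wrongly reset after a pass that decodes only nulls:
> {noformat}
> # Hypothetical sketch; not Drill's actual PageReader implementation.
> class SketchPageReader:
>     def __init__(self, values_bytes, width=8):
>         self.values_bytes = values_bytes  # packed non-null values only
>         self.width = width                # bytes per stored value
>         self.read_pos_in_bytes = 0        # offset of the next unread value
>
>     def read_pass(self, definition_levels):
>         """Decode one pass; definition_levels[i] is 1 for non-null slots."""
>         out, non_null = [], 0
>         for is_set in definition_levels:
>             if is_set:
>                 pos = self.read_pos_in_bytes
>                 out.append(self.values_bytes[pos:pos + self.width])
>                 self.read_pos_in_bytes += self.width
>                 non_null += 1
>             else:
>                 out.append(None)
>         if non_null == 0:
>             # BUG: a null-only pass must leave the offset untouched;
>             # resetting it makes later passes re-read stale bytes and
>             # return garbage values like those shown below.
>             self.read_pos_in_bytes = 0
>         return out
> {noformat}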
> To reproduce the issue, create a csv file (repro.csv) with 2 columns (id, 
> val) and 50100 rows, where id equals the row number and val is empty for 
> the first 50k rows and equal to id for the remaining rows (a sketch that 
> generates such a file follows).
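> A minimal sketch that generates such a file, under one consistent reading 
> of the description (ids starting at 0, so ids 0-49999 have an empty val 
> and ids 50000-50099 have val = id, matching the expected output below):
> {noformat}
> # Sketch: write repro.csv with 50100 rows of "id,val".
> with open("repro.csv", "w") as f:
>     for i in range(50100):                 # ids 0 .. 50099
>         val = "" if i < 50000 else str(i)  # first 50k rows: empty val
>         f.write(f"{i},{val}\n")
> {noformat}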
> Create a parquet table from the csv file:
> {noformat}
> CREATE TABLE `repro_parquet` AS SELECT CAST(columns[0] AS INT) AS id, 
> CAST(NULLIF(columns[1], '') AS DOUBLE) AS val from `repro.csv`;
> {noformat}
> Now if you query any of the non-null values you will get wrong results:
> {noformat}
> 0: jdbc:drill:zk=local> select * from `repro_parquet` where id>=50000 limit 10;
> +--------+---------------------------+
> |   id   |            val            |
> +--------+---------------------------+
> | 50000  | 9.11337776337441E-309     |
> | 50001  | 3.26044E-319              |
> | 50002  | 1.4916681476489723E-154   |
> | 50003  | 2.18890676                |
> | 50004  | 2.681561588521345E154     |
> | 50005  | -2.1016574E-317           |
> | 50006  | -1.4916681476489723E-154  |
> | 50007  | -2.18890676               |
> | 50008  | -2.681561588521345E154    |
> | 50009  | 2.1016574E-317            |
> +--------+---------------------------+
> 10 rows selected (0.238 seconds)
> {noformat}
> and here are the expected values:
> {noformat}
> 0: jdbc:drill:zk=local> select * from `repro.csv` where cast(columns[0] as int)>=50000 limit 10;
> +--------------------+
> |      columns       |
> +--------------------+
> | ["50000","50000"]  |
> | ["50001","50001"]  |
> | ["50002","50002"]  |
> | ["50003","50003"]  |
> | ["50004","50004"]  |
> | ["50005","50005"]  |
> | ["50006","50006"]  |
> | ["50007","50007"]  |
> | ["50008","50008"]  |
> | ["50009","50009"]  |
> +--------------------+
> {noformat}
> I confirmed that the file is written correctly and that the issue is in 
> the parquet reader (I already have a fix for it).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)

2016-02-07 Thread Khurram Faraaz (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136569#comment-15136569 ]

Khurram Faraaz commented on DRILL-4349:
---------------------------------------

Verified the fix. Test added here: 
Functional/parquet_storage/parquet_generic/drill4349.q




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)

2016-02-04 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133473#comment-15133473 ]

ASF GitHub Bot commented on DRILL-4349:
---------------------------------------

Github user adeneche closed the pull request at:

https://github.com/apache/drill/pull/356





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)

2016-02-03 Thread Deneche A. Hakim (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131132#comment-15131132 ]

Deneche A. Hakim commented on DRILL-4349:
-----------------------------------------

This is a regression introduced by DRILL-3871.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)

2016-02-03 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131443#comment-15131443 ]

ASF GitHub Bot commented on DRILL-4349:
---------------------------------------

GitHub user adeneche opened a pull request:

https://github.com/apache/drill/pull/356

DRILL-4349: parquet reader returns wrong results when reading a nulla…

…ble column that starts with a large number of nulls (>30k)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adeneche/incubator-drill DRILL-4349

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/356.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #356


commit a1bc0c7dc11a117d18b6ae74b91e6390138be20f
Author: adeneche 
Date:   2016-02-03T23:42:22Z

DRILL-4349: parquet reader returns wrong results when reading a nullable 
column that starts with a large number of nulls (>30k)







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)

2016-02-03 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131444#comment-15131444 ]

ASF GitHub Bot commented on DRILL-4349:
---------------------------------------

Github user adeneche commented on the pull request:

https://github.com/apache/drill/pull/356#issuecomment-179551062
  
@parthchandra can you please review? thanks





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)