benj created DRILL-7390: --------------------------- Summary: kvgen/flatten doesn't produce same result from .json or .parquet Key: DRILL-7390 URL: https://issues.apache.org/jira/browse/DRILL-7390 Project: Apache Drill Issue Type: Bug Components: Execution - Data Types, Functions - Drill, Storage - JSON, Storage - Parquet Affects Versions: 1.16.0 Reporter: benj Attachments: ANIMALS_json.tar.gz, ANIMALS_pqt.tar.gz
With a Parquet produce from JSON (_ANIMALS_json_ and _ANIMALS_pqt_ in attachment in tar.gz format) {code:sql} CREATE TABLE ....`ANIMALS_pqt` AS (SELECT * FROM ....`ANIMALS_json`); {code} Same request, using kvgen and flatten, applied on JSON and Parquet doesn't produce the same results {code:sql} SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x) => 8482290 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x) => 929430 {code} Or another example: {code:sql} SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x WHERE x.f.key='Cat') => 121368 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x WHERE x.f.key='Cat') => 13470 {code} The real result is the json one, as proved by: {code:bash} cat ANIMALS_json/*.json | grep -c "Cat" 121368 {code} Please note that, here, It's appear the particular file _ANIMALS_pqt/1_0_0.parquet_ is not well computed but the other are correct: {code:sql} SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_0_0.parquet`)) AS x WHERE x.f.key='Cat'); => 107898 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_1_0.parquet`)) AS x WHERE x.f.key='Cat'); => 2429 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_2_0.parquet`)) AS x WHERE x.f.key='Cat'); => 5419 SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_3_0.parquet`)) AS x WHERE x.f.key='Cat'); => 5622 {code} 2429+5419+5622=13470 (result of request on ANIMALS_pqt) 107898+2429+5419+5622=121368 (result of request on ANIMALS_json) -- This message was sent by Atlassian Jira (v8.3.4#803005)