Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Jinfeng Ni Thu, 23 Jul 2015 14:04:11 -0700

Hi Stefán,

Thanks a lot for bringing up this issue; reports like this really help us
improve Drill.

I tried to reproduce the problems you reported. I could reproduce the
missing-data issue with CTAS to Parquet, but I could not reproduce any
missing data when querying the JSON file directly.

Here is what I tried:

1. with dfs.tmp.`test.json` containing
   800k of
   {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
   100k of
   {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
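
For anyone who wants to recreate this input, here is a minimal Python sketch.
It assumes one JSON record per line and that the 100k records with the extra
field come after the 800k without it (which is what the output further down
suggests). The function name and parameters are only for illustration.

```python
import json


def write_test_file(path, n_without=800_000, n_with=100_000):
    """Write n_without short records, then n_with records carrying the
    extra "additional" field, one JSON object per line."""
    others = {"other": "true", "all": "false", "sometimes": "yes"}
    base = {"some": "yes", "others": others}
    # Same record, with "additional" appended as the last key of "others".
    extended = {"some": "yes",
                "others": dict(others, additional="last entries only")}

    # Compact separators reproduce the exact text shown in step 1.
    short_line = json.dumps(base, separators=(",", ":")) + "\n"
    long_line = json.dumps(extended, separators=(",", ":")) + "\n"

    with open(path, "w", encoding="utf-8") as f:
        for _ in range(n_without):
            f.write(short_line)
        for _ in range(n_with):
            f.write(long_line)
```

Calling write_test_file("/tmp/test.json") would produce the file, assuming the
dfs.tmp workspace points at /tmp on your installation.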

2.  SELECT * from dfs.tmp.`test.json`;
I redirected the output of the query to a file. Here is part of the result,
as shown in the vim editor:

824000 +------+------------------------------------------------------------------------------------+
824001 | some |                                       others                                       |
824002 +------+------------------------------------------------------------------------------------+
824003 | yes  | {"other":"true","all":"false","sometimes":"yes"}                                   |
824004 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  |
824005 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  |

The leftmost number is the line number in the vim editor.  The rows up to
line 824003 do not have the "additional" field, while every row beyond that
contains it.  The boundary falls on line 824003 (not 800000) because Drill's
SqlLine repeats the column names as a header every hundred rows or so (?).
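
The line numbers can be checked with a little arithmetic. The sketch below is
only a model of the output, under two assumptions that are not confirmed here:
the header block is three lines (border, column names, border), and it is
printed once at the top and again before every 100th row.

```python
HEADER_LINES = 3       # border, column names, border (assumed)
HEADER_INTERVAL = 100  # header repeated before every 100th row (assumed)


def output_line_of_row(n):
    """Return the 1-based output line on which the n-th data row lands."""
    # One header block at the top, plus one before each row whose
    # 1-based index is a multiple of HEADER_INTERVAL (including row n itself).
    headers_so_far = 1 + n // HEADER_INTERVAL
    return n + HEADER_LINES * headers_so_far


print(output_line_of_row(800_000))  # 824003, last row without "additional"
print(output_line_of_row(800_001))  # 824004, first row with "additional"
```

Under those assumptions the 800,000th row lands exactly on line 824003, with a
header block on lines 824000 to 824002, which matches the excerpt above.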

3.  SELECT t.`some`, t.`others` from dfs.tmp.`test.json` as t;

Same result as above.

4.  USE dfs.tmp;
     CREATE TABLE testparquet as select * from dfs.tmp.`test.json`;
     SELECT * from dfs.tmp.testparquet;

This one shows the missing data in the generated Parquet file.

 82400 +------+---------------------------------------------------+
 82401 | some |                      others                       |
 82402 +------+---------------------------------------------------+
 82403 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
 82404 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
 82405 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |

So it looks like there is a bug in the Parquet writer operator: it did not
write the "additional" field into the Parquet files, while the query against
the JSON file seems to return the correct result.
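
To compare against a baseline that does not depend on Drill at all, the
source file can be counted directly. This is a minimal sketch that assumes
one JSON record per line; the function name is hypothetical.

```python
import json
from collections import Counter


def count_additional(path):
    """Count records whose "others" map has / lacks the "additional" key."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            has_field = "additional" in record.get("others", {})
            counts["with" if has_field else "without"] += 1
    return counts
```

For the file described in step 1, this should report 800k records without the
field and 100k with it, which is what a correct Parquet copy would also have
to contain.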

I just want to confirm whether you see similar behavior on your side.

Thanks again!

On Thu, Jul 23, 2015 at 1:35 PM, Stefán Baxter <[email protected]>
wrote:

> Thank you.
>
>
>
> On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <[email protected]>
> wrote:
>
> > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <[email protected]>
> > wrote:
> >
> > > Someone must review the underlying optimization errors to prevent this
> > > from happening to others.
> > >
> >
> > Jinfeng and Parth are examining this issue to try to come to a deeper
> > understanding.  Not surprisingly, they are a little quiet as they do this.
> >
> >
> > > JSON data, which is unstructured/schema-free in its nature, cannot be
> > > treated as consistent, predictable or monolithic.
> > >
> >
> > Indeed.  And Drill vision is based on *exactly* this thought. Right now,
> > Drill is still new and does not fulfill all aspects of the vision, but we
> > are making progress rapidly.
> >
> > Your contributions and comments have been very helpful, btw.
> >
>
