Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Stefán Baxter Thu, 23 Jul 2015 14:20:43 -0700

hi,

I can provide you with a JSON file and the statements to reproduce it if you wish.


thank you for looking into this.

regards,
  -Stefan
On Jul 23, 2015 9:03 PM, "Jinfeng Ni" <[email protected]> wrote:

> Hi Stefán,
>
> Thanks a lot for bringing up this issue, which is really helpful to improve
> Drill.
>
> I tried to reproduce the issues. I could reproduce the missing data in
> the Parquet file created by CTAS, but I could not reproduce the missing
> data when querying the JSON file directly.
>
> Here is how I tried:
>
> 1. with dfs.tmp.`test.json`
>   800k of
>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
>   100k of
>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
>
> 2.  SELECT * from dfs.tmp.`test.json`;
> I put the output of the query into a file. Here is part of the result,
> shown in vim editor
>
> 824000 +------+------------------------------------------------------------------------------------+
> 824001 | some |                                       others                                       |
> 824002 +------+------------------------------------------------------------------------------------+
> 824003 | yes  | {"other":"true","all":"false","sometimes":"yes"}                                   |
> 824004 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  |
> 824005 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}  |
>
> The leftmost number is the line number from the vim editor.  The first
> 824003 lines hold rows without the "additional" field, while every row
> beyond that contains the "additional" field.  The line number is 824003
> (not 800000) because Drill's SqlLine repeats the column names as a header
> every hundred rows or so (?).
>
> 3.  SELECT t.`some`, t.`others` from dfs.tmp.`test.json` as t;
>
> Same result as above.
>
> 4.  USE dfs.tmp;
>      CREATE TABLE testparquet as select * from dfs.tmp.`test.json`;
>      SELECT * from dfs.tmp.testparquet;
>
> This one shows the missing data: the rows read back from the generated
> Parquet file lack the "additional" field.
>
>
>  82400 +------+---------------------------------------------------+
>  82401 | some |                      others                       |
>  82402 +------+---------------------------------------------------+
>  82403 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
>  82404 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
>  82405 | yes  | {"other":"true","all":"false","sometimes":"yes"}  |
>
>
> So it looks like there is a bug in the Parquet writer operator: it did
> not write the "additional" field into the Parquet files, while the query
> against the JSON file seems to return the correct result.
>
> I just want to confirm whether you see similar behavior on your side.
>
> Thanks again!
>
> On Thu, Jul 23, 2015 at 1:35 PM, Stefán Baxter <[email protected]>
> wrote:
>
> > Thank you.
> >
> >
> >
> > On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <[email protected]>
> > > wrote:
> > >
> > > > Someone must review the underlying optimization errors to prevent
> > > > this from happening to others.
> > > >
> > >
> > > Jinfeng and Parth are examining this issue to try to come to a deeper
> > > understanding.  Not surprisingly, they are a little quiet as they do
> > > this.
> > >
> > >
> > > > JSON data, which is unstructured/schema-free in its nature, cannot
> > > > be treated as consistent, predictable or monolithic.
> > > >
> > >
> > > Indeed.  And Drill's vision is based on *exactly* this thought. Right
> > > now, Drill is still new and does not fulfill all aspects of the
> > > vision, but we are making progress rapidly.
> > >
> > > Your contributions and comments have been very helpful, btw.
> > >
> >
>
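
For anyone who wants to try the reproduction, here is a minimal Python
sketch that generates a file shaped like the one described in step 1 of the
quoted message (800k records without "additional", followed by 100k with
it) and counts how many records carry the field. The function names
write_test_file and count_additional are hypothetical, and the sketch
assumes one JSON record per line; neither detail comes from the thread.

```python
import json

# Record shapes copied from step 1 of the quoted message.
BASE = {"some": "yes",
        "others": {"other": "true", "all": "false", "sometimes": "yes"}}
EXTENDED = {"some": "yes",
            "others": {"other": "true", "all": "false", "sometimes": "yes",
                       "additional": "last entries only"}}


def write_test_file(path, n_base=800_000, n_extended=100_000):
    """Write n_base records without "additional", then n_extended with it.

    Assumption: one JSON record per line.
    """
    base_line = json.dumps(BASE, separators=(",", ":"))
    extended_line = json.dumps(EXTENDED, separators=(",", ":"))
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(n_base):
            f.write(base_line + "\n")
        for _ in range(n_extended):
            f.write(extended_line + "\n")


def count_additional(path):
    """Return (total records, records whose "others" has "additional")."""
    total = 0
    with_additional = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            total += 1
            if "additional" in json.loads(line)["others"]:
                with_additional += 1
    return total, with_additional
```

Calling write_test_file("test.json") and copying the result to wherever the
dfs.tmp workspace points should give a file that can be used with the
statements in steps 2 to 4. count_additional gives the expected number of
rows with the field, which can then be compared against what the JSON query
and the Parquet table return.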
