Hi, I can provide you with the JSON file and the statements to reproduce it if you wish.
Thank you for looking into this.

Regards,
-Stefan

On Jul 23, 2015 9:03 PM, "Jinfeng Ni" <[email protected]> wrote:

> Hi Stefán,
>
> Thanks a lot for bringing up this issue, which is really helpful to improve
> Drill.
>
> I tried to reproduce the incorrect results. I could reproduce the missing
> data issue with CTAS to parquet, but I could not reproduce the missing data
> issue when I query the JSON file directly.
>
> Here is how I tried:
>
> 1. With dfs.tmp.`test.json` containing
>    800k of
>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
>    100k of
>    {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
>
> 2. SELECT * from dfs.tmp.`test.json`;
>    I put the output of the query into a file. Here is part of the result,
>    shown in the vim editor:
>
> 824000 +------+-----------------------------------------------------------------------------------+
> 824001 | some | others                                                                            |
> 824002 +------+-----------------------------------------------------------------------------------+
> 824003 | yes  | {"other":"true","all":"false","sometimes":"yes"}                                  |
> 824004 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"} |
> 824005 | yes  | {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"} |
>
>    The leftmost number is the line number from the vim editor. The first
>    824003 lines have rows without the "additional" field, while beyond that
>    each row contains the "additional" field. The line number 824003 (not
>    800000) comes from the fact that Drill's SqlLine adds the column names as
>    a header every hundred rows or so (?).
>
> 3. SELECT t.`some`, t.`others` from dfs.tmp.`test.json` as t;
>
>    Same result as above.
>
> 4. USE dfs.tmp;
>    CREATE TABLE testparquet as select * from dfs.tmp.`test.json`;
>    SELECT * from dfs.tmp.testparquet;
>
>    This one shows the missing data when reading from the generated parquet
>    file.
>
> 82400 +------+--------------------------------------------------+
> 82401 | some | others                                           |
> 82402 +------+--------------------------------------------------+
> 82403 | yes  | {"other":"true","all":"false","sometimes":"yes"} |
> 82404 | yes  | {"other":"true","all":"false","sometimes":"yes"} |
> 82405 | yes  | {"other":"true","all":"false","sometimes":"yes"} |
>
> So, it looks like there is a bug in the parquet writer operator: it did not
> output the additional field into the parquet files, while the query against
> the JSON seems to return the correct result.
>
> I just want to confirm whether you see similar behavior on your side.
>
> Thanks again!
>
> On Thu, Jul 23, 2015 at 1:35 PM, Stefán Baxter <[email protected]> wrote:
>
> > Thank you.
> >
> > On Thu, Jul 23, 2015 at 7:24 PM, Ted Dunning <[email protected]> wrote:
> >
> > > On Thu, Jul 23, 2015 at 3:55 AM, Stefán Baxter <[email protected]> wrote:
> > >
> > > > Someone must review the underlying optimization errors to prevent this
> > > > from happening to others.
> > >
> > > Jinfeng and Parth are examining this issue to try to come to a deeper
> > > understanding. Not surprisingly, they are a little quiet as they do this.
> > >
> > > > JSON data, which is unstructured/schema-free in its nature, cannot be
> > > > treated as consistent, predictable or monolithic.
> > >
> > > Indeed. And Drill's vision is based on *exactly* this thought. Right now,
> > > Drill is still new and does not fulfill all aspects of the vision, but we
> > > are making progress rapidly.
> > >
> > > Your contributions and comments have been very helpful, btw.
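
For anyone who wants to repeat the steps quoted above, here is a minimal sketch of how a test.json like the one in step 1 could be generated. It is not the script anyone in this thread actually used. The function name write_test_json is hypothetical, and reading "800k" and "100k" as 800,000 and 100,000 rows is an assumption. The two row shapes are copied from the quoted message.

```python
import json

# Row shapes copied from step 1 of the quoted message.
BASE_ROW = {
    "some": "yes",
    "others": {"other": "true", "all": "false", "sometimes": "yes"},
}
EXTENDED_ROW = {
    "some": "yes",
    "others": {
        "other": "true",
        "all": "false",
        "sometimes": "yes",
        "additional": "last entries only",
    },
}


def write_test_json(path, base_count=800_000, extended_count=100_000):
    """Write base_count rows without "additional", then extended_count rows with it.

    The default counts assume "800k" and "100k" mean 800,000 and 100,000.
    Returns the total number of rows written.
    """
    # Compact separators give one JSON object per line, matching the thread.
    base_line = json.dumps(BASE_ROW, separators=(",", ":")) + "\n"
    extended_line = json.dumps(EXTENDED_ROW, separators=(",", ":")) + "\n"
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(base_count):
            f.write(base_line)
        for _ in range(extended_count):
            f.write(extended_line)
    return base_count + extended_count
```

Calling write_test_json("test.json") and placing the file in whatever directory your dfs.tmp workspace points to should give the input for the statements in steps 2 to 4. The extra field only appears after a long run of rows without it, which is the layout under which the missing data was reported.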
