Sure thing. H+
On Tue, Jan 19, 2016 at 8:45 AM, John Omernik <[email protected]> wrote:

Hanifi:

I've created https://issues.apache.org/jira/browse/DRILL-4284

Can you post your comment to it? I want to ensure it's represented there as coming from you.

John


On Tue, Jan 19, 2016 at 10:37 AM, John Omernik <[email protected]> wrote:

Thank you Hanifi.

I will open a JIRA.

One additional piece of information: even when I am not using the "features" field at all (I am referencing a different field only), it gives the same error.

So let's say we had events: [{features: [], other_features: []}] (and both features and other_features had items in their arrays). Then

select flatten(e.other_features) from (select flatten(events) e from table) r

gives the error even though "features" should be completely ignored.

Writing the JIRA now.
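[Editor's note: for readers following along, a minimal sketch of the query shape John describes, against a hypothetical sample file. The file name, the dfs storage-plugin prefix, and the field values are illustrative only, not from the thread.

-- /tmp/events_sample.json (illustrative):
-- {"count": 2, "events": [
--   {"features": [{"name": "f1"}], "other_features": [{"name": "o1"}]},
--   {"features": {},               "other_features": [{"name": "o2"}]}
-- ]}

-- unnest the outer events array, then the inner other_features array;
-- "features" is never referenced, yet the reported error still appears
SELECT flatten(e.other_features)
FROM (
  SELECT flatten(t.events) AS e
  FROM dfs.`/tmp/events_sample.json` t
) r;
]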
On Tue, Jan 19, 2016 at 9:52 AM, Hanifi Gunes <[email protected]> wrote:

You are on the right track in attempting to use the union type. As far as I can see, writers attempt to put a value without allocating the underlying buffers at batch boundaries; then this error surfaces.

This is a bug. Please file a JIRA.

Also, as a future reference, you may want to check out [1].

1: https://issues.apache.org/jira/browse/DRILL-4283

-Hanifi


On Tue, Jan 19, 2016 at 6:59 AM, John Omernik <[email protected]> wrote:

After getting some pointers on the new experimental union type with JSON, I started getting a different error related to index out of bounds. I thought I'd post here to determine what it could be; if it's a bug, I can then open a JIRA.

So first, I did:

ALTER SESSION SET `exec.errors.verbose` = true;    -- so I could get full errors
ALTER SESSION SET `exec.enable_union_type` = true; -- so I could use the experimental UNION type

Now, my first query, select * from `/data/prod/src/`, gave me the errors below. The failing file changes, and ironically, if I select directly from any specific file (even the ones named in the error), the query often works fine. It's going through a directory of files that causes the error. Sometimes I can do multiple files, but often I come to one file and it seems to break. The file that breaks things doesn't look different from the others, but at the same time I can select directly from that file and it works... weird. Let me know if I can do anything to help troubleshoot further.

Data notes (see example below):
- The ... represents LOTS of other fields, some simple, some complex/nested. This data is NOT pretty.
- The files are goofy in that each file has one top-level field, "count", then a huge array of events.
- The field that is ALWAYS involved (as far as I've seen) is the "features" field.
- This field will sometimes be an array and sometimes an empty object, {}.
- The size of the array for the features field (when it is not an empty object) changes from event to event. (My hunch is the issue is there.)

Error:

Error: DATA_READ ERROR: index: 0, length: 4 (expected: range(0, 0))

File      /data/prod/src/file1.json
Record    1
Line      193
Column    34
Field     feature
Fragment  0:0

[Error Id: 25a2c963-86db-40e9-b5cc-2674887de2fe on node7:31010]

(java.lang.IndexOutOfBoundsException) index: 0, length: 4 (expected: range(0, 0))
  io.netty.buffer.DrillBuf.checkIndexD():175
  io.netty.buffer.DrillBuf.chk():197
  io.netty.buffer.DrillBuf.getInt():477
  org.apache.drill.exec.vector.UInt4Vector$Accessor.get():356
  org.apache.drill.exec.vector.complex.ListVector$Mutator.startNewValue():305
  org.apache.drill.exec.vector.complex.impl.UnionListWriter.startList():563
  org.apache.drill.exec.vector.complex.impl.AbstractPromotableFieldWriter.startList():126
  org.apache.drill.exec.vector.complex.impl.PromotableWriter.startList():42
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():461
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():470
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch():240
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector():178
  org.apache.drill.exec.vector.complex.fn.JsonReader.write():144
  org.apache.drill.exec.store.easy.json.JSONRecordReader.next():191
  org.apache.drill.exec.physical.impl.ScanBatch.next():191
  org.apache.drill.exec.record.AbstractRecordBatch.next():119
  org.apache.drill.exec.record.AbstractRecordBatch.next():109
  org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
  org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
  org.apache.drill.exec.record.AbstractRecordBatch.next():162
  org.apache.drill.exec.physical.impl.BaseRootExec.next():104
  org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
  org.apache.drill.exec.physical.impl.BaseRootExec.next():94
  org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
  org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
  java.security.AccessController.doPrivileged():-2
  javax.security.auth.Subject.doAs():422
  org.apache.hadoop.security.UserGroupInformation.doAs():1595
  org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
  org.apache.drill.common.SelfCleaningRunnable.run():38
  java.util.concurrent.ThreadPoolExecutor.runWorker():1142
  java.util.concurrent.ThreadPoolExecutor$Worker.run():617
  java.lang.Thread.run():745 (state=,code=0)

Example data:

{
  "count": 241,
  "events": [
    {
      ...
      "features": [
        { "count": 3,  "name": "feature1" },
        { "count": 30, "name": "feature2" },
        { "count": 2,  "name": "feature3" },
        { "count": 3,  "name": "feature4" }
      ],
      ...
    },
    {
      ...
      "features": {},
      ...
    },
    {
      ...
      "features": [
        { "count": 3,  "name": "feature1" },
        { "count": 30, "name": "feature2" },
        { "count": 2,  "name": "feature3" }
      ],
      ...
    }
  ]
}
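[Editor's note: a short sketch of the triage John describes, assuming the directory and file paths from the thread are reachable through a dfs storage plugin (the dfs prefix is an assumption, not from the thread):

-- verbose errors plus the experimental union type, as above
ALTER SESSION SET `exec.errors.verbose` = true;
ALTER SESSION SET `exec.enable_union_type` = true;

-- querying the whole directory reproduces the error...
SELECT * FROM dfs.`/data/prod/src/`;

-- ...while querying only the file named in the error message often succeeds
SELECT * FROM dfs.`/data/prod/src/file1.json`;
]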
On Mon, Jan 18, 2016 at 4:58 PM, Brent Payne <[email protected]> wrote:

We had a similar issue (or issues) and had to reprocess our data so that everything had a consistent schema, or it would break, sometimes in unexpected ways. We started on 1.2, so maybe some of those issues are gone now. Drill is awesome and can do a lot, but it cannot currently do on-the-fly type conversion/cleanup.


On Mon, Jan 18, 2016 at 2:11 PM, John Omernik <[email protected]> wrote:

I am working with a LARGE volume of data (I state that because even my first reaction was "I'll just write a simple sed command and fix this data up lickety split").

However: lots of files, lots of data, so let's avoid that as the initial answer if possible. (Ideally I am looking for an "on read" solution in Drill.)

Basically, when I try to read a file, I get this error:

Error: DATA_READ ERROR: You tried to start when you are using a ValueWriter of type SingleMapWriter.

The field in question has a silly setup: if it's empty they use {}, and if it's not empty it's an array of data.

So:

"field1": {}
or
"field1": [{"foo": "bar"}, {"bar": "foo"}]

I am pretty sure this is the cause. Point: I am not sure the error message quoted above helps me understand intuitively what is wrong; perhaps some TLC on the error messages could help less Drill-aware users know what's actually breaking (in fairness, the message in 1.4 showed me the line, column, and field, which helped me infer what could POSSIBLY be wrong).

So, is there a way to address this without reprocessing a lot of data? An option in Drill that would allow a dirty read of some sort?

Thanks in advance!!

John
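[Editor's note: for readers hitting the same SingleMapWriter error, a minimal sketch of the data shape described above together with the experimental setting the thread goes on to try. The file name is illustrative, and as the rest of the thread shows, the union type was still experimental at the time and ran into a separate bug (DRILL-4284).

-- /tmp/mixed_field.json (illustrative): the same field is an empty object in one
-- record and an array of objects in another
-- {"field1": {}}
-- {"field1": [{"foo": "bar"}, {"bar": "foo"}]}

-- the experimental union type lets a column hold more than one JSON type
ALTER SESSION SET `exec.enable_union_type` = true;
SELECT * FROM dfs.`/tmp/mixed_field.json`;
]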
