Re: Working with non-sane data - JSON Types

Hanifi Gunes Tue, 19 Jan 2016 07:52:54 -0800

You are on the right track in attempting to use the union type. As far as I
can see, the writers attempt to put a value without allocating the underlying
buffers at batch boundaries, and that is when this error surfaces.


This is a bug. Please file a JIRA.
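
As an aid to reading the quoted error: the message "index: 0, length: 4
(expected: range(0, 0))" reports a 4-byte read at offset 0 from a buffer whose
capacity is 0, which fits the unallocated-buffer explanation above. A toy
Python model of such a bounds check follows. ToyBuffer and read_error are
hypothetical names made up for illustration; this is not Drill's DrillBuf code.

```python
import struct


class ToyBuffer:
    """Toy fixed-capacity byte buffer (illustrative only, not Drill's DrillBuf)."""

    def __init__(self, capacity=0):
        # A capacity of 0 models a buffer that was never allocated.
        self.data = bytearray(capacity)

    def get_int(self, index):
        """Read a 4-byte integer at index, checking bounds first."""
        length = 4
        capacity = len(self.data)
        if index < 0 or index + length > capacity:
            raise IndexError(
                f"index: {index}, length: {length} (expected: range(0, {capacity}))"
            )
        return struct.unpack_from("<i", self.data, index)[0]


def read_error(buf, index):
    """Return the bounds-check message for a read, or None if the read succeeds."""
    try:
        buf.get_int(index)
        return None
    except IndexError as exc:
        return str(exc)
```

Reading offset 0 from ToyBuffer(0) fails the check, while the same read from
an allocated ToyBuffer(8) succeeds.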


Also, for future reference, you may want to check out [1].

1: https://issues.apache.org/jira/browse/DRILL-4283

-Hanifi

On Tue, Jan 19, 2016 at 6:59 AM, John Omernik <[email protected]> wrote:

> After getting some pointers on the new experimental Union type with JSON, I
> started getting a different error related to index out of bounds. I thought
> I'd post here to determine what it could be, and if it's a bug, I can then
> open a JIRA.
>
> So first, I did:
>
> ALTER SESSION SET `exec.errors.verbose` = true;  -- So I could get full
> errors
> ALTER SESSION SET `exec.enable_union_type` = true; -- So I could use the
> experimental UNION type
>
> Now, my first query, select * from `/data/prod/src/`, gave me the errors
> below.  The files change, and ironically, if I select directly from any
> specific file (even the ones in the error), the query often works
> fine.  It's going through a directory of files that causes the error.
> Sometimes I can do multiple files, but often I come to one file,
> and it seems to break it.  The file that breaks things doesn't look
> different from the others, but at the same time, I can select directly from
> the file, and it works... weird.  Let me know if I can do anything to help
> troubleshoot more.
>
> Data Notes (see example below):
> - The ... represents LOTS of other fields, some simple, some
> complex/nested. This data is NOT pretty.
> - The files are goofy in that each file has one top level field of "count"
> then a huge array of events
> - The field that is ALWAYS involved (as far as I've seen) is the "features" field
> - This field will sometimes be an array and sometimes be an empty object,
> {}.
> - The size of the array for the features field (when not an empty object)
> does change from event to event.  (My hunch is that the issue is there.)
>
> Error:
>
> Error: DATA_READ ERROR: index: 0, length: 4 (expected: range(0, 0))
>
> File  /data/prod/src/file1.json
> Record  1
> Line  193
> Column  34
> Field  feature
> Fragment 0:0
>
> [Error Id: 25a2c963-86db-40e9-b5cc-2674887de2fe on node7:31010]
>
>   (java.lang.IndexOutOfBoundsException) index: 0, length: 4 (expected: range(0, 0))
>     io.netty.buffer.DrillBuf.checkIndexD():175
>     io.netty.buffer.DrillBuf.chk():197
>     io.netty.buffer.DrillBuf.getInt():477
>     org.apache.drill.exec.vector.UInt4Vector$Accessor.get():356
>     org.apache.drill.exec.vector.complex.ListVector$Mutator.startNewValue():305
>     org.apache.drill.exec.vector.complex.impl.UnionListWriter.startList():563
>     org.apache.drill.exec.vector.complex.impl.AbstractPromotableFieldWriter.startList():126
>     org.apache.drill.exec.vector.complex.impl.PromotableWriter.startList():42
>     org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():461
>     org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
>     org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():470
>     org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
>     org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch():240
>     org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector():178
>     org.apache.drill.exec.vector.complex.fn.JsonReader.write():144
>     org.apache.drill.exec.store.easy.json.JSONRecordReader.next():191
>     org.apache.drill.exec.physical.impl.ScanBatch.next():191
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>     org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():422
>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>     java.lang.Thread.run():745 (state=,code=0)
>
> Example Data:
>
> {
>   "count": 241,
>   "events": [
>     {
>       ...
>       ...
>       ...
>       "features": [
>         {
>           "count": 3,
>           "name": "feature1"
>         },
>         {
>           "count": 30,
>           "name": "feature2"
>         },
>         {
>           "count": 2,
>           "name": "feature3"
>         },
>         {
>           "count": 3,
>           "name": "feature4"
>         }
>       ],
>       ...
>       ...
>     },
>     {
>       ...
>       ...
>       ...
>       "features": {},
>       ...
>     },
>     {
>       ...
>       ...
>       ...
>       "features": [
>         {
>           "count": 3,
>           "name": "feature1"
>         },
>         {
>           "count": 30,
>           "name": "feature2"
>         },
>         {
>           "count": 2,
>           "name": "feature3"
>         }
>       ],
>       ...
>       ...
>     }
>   ]
> }
>
> On Mon, Jan 18, 2016 at 4:58 PM, Brent Payne <[email protected]>
> wrote:
>
> > We had similar issues and had to reprocess our data so that everything
> > had a consistent schema, or it would break, sometimes with unexpected
> > issues.  We started on 1.2, so maybe some of the issues are not there
> > anymore.  Drill is awesome and can do a lot, but it cannot currently do
> > on-the-fly type conversion/cleanup.
> >
> > On Mon, Jan 18, 2016 at 2:11 PM, John Omernik <[email protected]> wrote:
> >
> > > I am working with a LARGE volume of data (I state that because even my
> > > first reaction was "I'll just write a simple sed command and fix this
> > > data up lickety-split").
> > >
> > > However, lots of files, lots of data, so let's avoid that as the initial
> > > answer if possible. (Ideally I am looking for an "on read" solution in
> > > Drill.)
> > >
> > > Basically, when I try to read a file, I get this error:
> > >
> > > Error: DATA_READ ERROR: You tried to start when you are using a ValueWriter of type SingleMapWriter.
> > >
> > > The field in question has a silly setup: if it's empty they use {}, and
> > > if it's not empty then it's an array of data.
> > >
> > > So:
> > >
> > > "field1":{}
> > > or
> > > "field1":[{"foo":"bar"}, {"bar":"foo"}]
> > >
> > > I am pretty sure this is the error. Point: the error message I got does
> > > not help me understand the problem intuitively; perhaps some TLC on the
> > > error messages could help less Drill-aware users know what's actually
> > > breaking (in fairness, the message in 1.4 showed me the line, column,
> > > and field, which helped me infer what could POSSIBLY be wrong).
> > >
> > > So, is there a way to address this without reprocessing a lot of data?
> > > An option in Drill that would allow a dirty read of some sort?
> > >
> > > Thanks in advance!!
> > >
> > > John
> > >
> >
>
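
The reprocessing route mentioned in the thread (giving every record a
consistent schema before Drill reads it) could look roughly like the Python
sketch below. It assumes the file layout from the example data: one top-level
object with an "events" array, where "features" is either an array or an empty
object {}. The function names are hypothetical, and only the top-level
"features" field of each event is handled, not nested occurrences.

```python
import json


def normalize_features(event, field="features"):
    """Return a copy of event in which an empty-object value for field becomes []."""
    fixed = dict(event)
    if fixed.get(field) == {}:
        fixed[field] = []
    return fixed


def normalize_document(doc, events_key="events", field="features"):
    """Return a copy of doc with every event's field normalized to an array."""
    fixed = dict(doc)
    fixed[events_key] = [
        normalize_features(event, field) for event in doc.get(events_key, [])
    ]
    return fixed


def normalize_file(src_path, dst_path):
    """Read one JSON file, normalize it, and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        doc = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(normalize_document(doc), dst)
```

This loads each file fully into memory, which is fine for a sketch but may need
a streaming parser for very large files.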
