After getting some pointers on the new experimental UNION type with JSON, I
started getting a different error, this time an index-out-of-bounds. I
thought I'd post here to figure out what it could be; if it turns out to be
a bug, I can then open a JIRA.

So first, I did:

ALTER SESSION SET `exec.errors.verbose` = true;    -- So I could get full errors
ALTER SESSION SET `exec.enable_union_type` = true; -- So I could use the experimental UNION type
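
(If it matters, both options can be double-checked for the session via
sys.options; something like the query below should list them, assuming the
sys schema is enabled, which I believe it is by default.)

SELECT * FROM sys.options
WHERE name IN ('exec.errors.verbose', 'exec.enable_union_type');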

Now, my first query, select * from `/data/prod/src/`, gave me the error
below.  The failing files change from run to run, and ironically, if I
select directly from any specific file (even the ones named in the error),
the query often works fine.  It's going through a directory of files that
causes the error.  Sometimes I can query several files, but eventually I
come to one file that seems to break it.  The file that breaks things
doesn't look different from the others, yet I can select directly from that
file and it works... weird.  Let me know if I can do anything to help
troubleshoot further.
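
For reference, the two kinds of queries I'm comparing look roughly like
this (file1.json is just the file named in the error below; nothing special
about it that I can see):

select * from `/data/prod/src/`;            -- whole directory: hits the error
select * from `/data/prod/src/file1.json`;  -- the same file directly: usually works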

Data Notes (see example below):
- The ... represents LOTS of other fields, some simple, some
complex/nested.  This data is NOT pretty.
- The files are goofy in that each file has one top-level field, "count",
followed by a huge array of events.
- The field that is ALWAYS involved (as far as I've seen) is the "features"
field.
- This field will sometimes be an array and sometimes an empty object, {}.
- The size of the features array (when it's not an empty object) changes
from event to event.  (My hunch is the issue is there; a stripped-down
sketch of just this shape follows this list.)
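
If it helps with reproducing this, here is my best guess at a minimal file
that captures just the shape described above (all the other fields stripped
out; I have NOT confirmed that this stripped-down version triggers the
error on its own):

{
  "count": 2,
  "events": [
    { "features": [ { "count": 3, "name": "feature1" }, { "count": 30, "name": "feature2" } ] },
    { "features": {} }
  ]
}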

Error:

Error: DATA_READ ERROR: index: 0, length: 4 (expected: range(0, 0))

File      /data/prod/src/file1.json
Record    1
Line      193
Column    34
Field     feature
Fragment  0:0

[Error Id: 25a2c963-86db-40e9-b5cc-2674887de2fe on node7:31010]

  (java.lang.IndexOutOfBoundsException) index: 0, length: 4 (expected: range(0, 0))
    io.netty.buffer.DrillBuf.checkIndexD():175
    io.netty.buffer.DrillBuf.chk():197
    io.netty.buffer.DrillBuf.getInt():477
    org.apache.drill.exec.vector.UInt4Vector$Accessor.get():356
    org.apache.drill.exec.vector.complex.ListVector$Mutator.startNewValue():305
    org.apache.drill.exec.vector.complex.impl.UnionListWriter.startList():563
    org.apache.drill.exec.vector.complex.impl.AbstractPromotableFieldWriter.startList():126
    org.apache.drill.exec.vector.complex.impl.PromotableWriter.startList():42
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():461
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():470
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch():240
    org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector():178
    org.apache.drill.exec.vector.complex.fn.JsonReader.write():144
    org.apache.drill.exec.store.easy.json.JSONRecordReader.next():191
    org.apache.drill.exec.physical.impl.ScanBatch.next():191
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1595
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)



Example Data:

{
  "count": 241,
  "events": [
    {
      ...
      "features": [
        { "count": 3, "name": "feature1" },
        { "count": 30, "name": "feature2" },
        { "count": 2, "name": "feature3" },
        { "count": 3, "name": "feature4" }
      ],
      ...
    },
    {
      ...
      "features": {},
      ...
    },
    {
      ...
      "features": [
        { "count": 3, "name": "feature1" },
        { "count": 30, "name": "feature2" },
        { "count": 2, "name": "feature3" }
      ],
      ...
    }
  ]
}
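
One more data point I can try to gather: on the files that do read cleanly,
checking what type Drill assigns the field per event, e.g. with typeof()
(assuming that function is available in this build and works on nested
fields):

select typeof(t.ev.`features`) AS features_type
from (select flatten(events) AS ev from `/data/prod/src/file1.json`) t
limit 20;

If the reported type flips between a list and a map within one file, that
would line up with the hunch above.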

On Mon, Jan 18, 2016 at 4:58 PM, Brent Payne <[email protected]> wrote:

> We had a similar issue(s) and had to reprocess our data so that everything
> had a consistent schema or it would break, sometimes with unexpected
> issues.  We started on 1.2, so maybe some of the issues are not there
> anymore.  Drill is awesome and can do a lot, but it cannot currently do on
> the fly type conversion/cleanup.
>
> On Mon, Jan 18, 2016 at 2:11 PM, John Omernik <[email protected]> wrote:
>
> > I am working with a LARGE volume of data (I state that because even my
> > first reaction was "I'll just write a simple sed command and fix this
> > data up lickety split")
> >
> > However, lots of files, lots of data, so let's avoid that as the initial
> > answer if possible. (Ideally I am looking for an "on read" solution in
> > Drill)
> >
> > Basically, when I try to read a file, I get this error:
> >
> > Error: DATA_READ ERROR: You tried to start when you are using a
> > ValueWriter of type SingleMapWriter.
> >
> > The field in question had a silly setup, if it's empty they use {} if
> it's
> > not empty then it's an array of data.
> >
> > So:
> >
> > "field1":{}
> > or
> > "field1":[{"foo":bar"}, {"bar":"foo"}]
> >
> > I am pretty sure this is the error. Point: I am not sure the error
> > message I provided helps me understand intuitively; perhaps some TLC on
> > the error messages could help less-Drill-aware users know what's
> > actually breaking (in fairness, the message in 1.4 showed me the line,
> > column, and field, which helped me infer what could POSSIBLY be wrong).
> >
> > So, is there a way to address this without reprocessing a lot of data?  An
> > option in Drill that would allow a dirty read of some sort?
> >
> > Thanks in advance!!
> >
> > John
> >
>
