Sure thing. H+
On Tue, Jan 19, 2016 at 8:45 AM, John Omernik <[email protected]> wrote:

Hanifi:

I've created https://issues.apache.org/jira/browse/DRILL-4284

Can you post your comment to it? I want to ensure it's represented there as coming from you.

John


On Tue, Jan 19, 2016 at 10:37 AM, John Omernik <[email protected]> wrote:

Thank you Hanifi.

I will open a JIRA.

One additional piece of information: even when I am not using the "features" field at all (I am referencing a different field only), it gives the same error.

So let's say we had events: [{features: [], other_features: []}] (and both features and other_features had items in their arrays). Then

select flatten(e.other_features) from (select flatten(events) e from table) r

gives the error even though "features" should be completely ignored.

Writing the JIRA now.
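[Editor's note: for readers following along, a minimal sketch of the query shape John describes, against a hypothetical sample file. The file name, the dfs storage-plugin prefix, and the field values are illustrative only, not from the thread.

-- /tmp/events_sample.json (illustrative):
-- {"count": 2, "events": [
--   {"features": [{"name": "f1"}], "other_features": [{"name": "o1"}]},
--   {"features": {},               "other_features": [{"name": "o2"}]}
-- ]}

-- unnest the outer events array, then the inner other_features array;
-- "features" is never referenced, yet the reported error still appears
SELECT flatten(e.other_features)
FROM (
  SELECT flatten(t.events) AS e
  FROM dfs.`/tmp/events_sample.json` t
) r;
]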
On Tue, Jan 19, 2016 at 9:52 AM, Hanifi Gunes <[email protected]> wrote:

You are on the right track in attempting to use the union type. As far as I can see, writers attempt to put a value without allocating the underlying buffers at batch boundaries; then this error surfaces.

This is a bug. Please file a JIRA.

Also, as a future reference, you may want to check out [1].

1: https://issues.apache.org/jira/browse/DRILL-4283

-Hanifi


On Tue, Jan 19, 2016 at 6:59 AM, John Omernik <[email protected]> wrote:

After getting some pointers on the new experimental union type with JSON, I started getting a different error related to index out of bounds. I thought I'd post here to determine what it could be; if it's a bug, I can then open a JIRA.

So first, I did:

ALTER SESSION SET `exec.errors.verbose` = true;    -- so I could get full errors
ALTER SESSION SET `exec.enable_union_type` = true; -- so I could use the experimental UNION type

Now, my first query, select * from `/data/prod/src/`, gave me the errors below. The failing file changes, and ironically, if I select directly from any specific file (even the ones named in the error), the query often works fine. It's going through a directory of files that causes the error. Sometimes I can do multiple files, but often I come to one file and it seems to break. The file that breaks things doesn't look different from the others, but at the same time I can select directly from that file and it works... weird. Let me know if I can do anything to help troubleshoot further.

Data notes (see example below):
- The ... represents LOTS of other fields, some simple, some complex/nested. This data is NOT pretty.
- The files are goofy in that each file has one top-level field, "count", then a huge array of events.
- The field that is ALWAYS involved (as far as I've seen) is the "features" field.
- This field will sometimes be an array and sometimes an empty object, {}.
- The size of the array for the features field (when it is not an empty object) changes from event to event. (My hunch is the issue is there.)

Error:

Error: DATA_READ ERROR: index: 0, length: 4 (expected: range(0, 0))

File      /data/prod/src/file1.json
Record    1
Line      193
Column    34
Field     feature
Fragment  0:0

[Error Id: 25a2c963-86db-40e9-b5cc-2674887de2fe on node7:31010]

(java.lang.IndexOutOfBoundsException) index: 0, length: 4 (expected: range(0, 0))
  io.netty.buffer.DrillBuf.checkIndexD():175
  io.netty.buffer.DrillBuf.chk():197
  io.netty.buffer.DrillBuf.getInt():477
  org.apache.drill.exec.vector.UInt4Vector$Accessor.get():356
  org.apache.drill.exec.vector.complex.ListVector$Mutator.startNewValue():305
  org.apache.drill.exec.vector.complex.impl.UnionListWriter.startList():563
  org.apache.drill.exec.vector.complex.impl.AbstractPromotableFieldWriter.startList():126
  org.apache.drill.exec.vector.complex.impl.PromotableWriter.startList():42
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():461
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():470
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeData():305
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch():240
  org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector():178
  org.apache.drill.exec.vector.complex.fn.JsonReader.write():144
  org.apache.drill.exec.store.easy.json.JSONRecordReader.next():191
  org.apache.drill.exec.physical.impl.ScanBatch.next():191
  org.apache.drill.exec.record.AbstractRecordBatch.next():119
  org.apache.drill.exec.record.AbstractRecordBatch.next():109
  org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
  org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
  org.apache.drill.exec.record.AbstractRecordBatch.next():162
  org.apache.drill.exec.physical.impl.BaseRootExec.next():104
  org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
  org.apache.drill.exec.physical.impl.BaseRootExec.next():94
  org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
  org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
  java.security.AccessController.doPrivileged():-2
  javax.security.auth.Subject.doAs():422
  org.apache.hadoop.security.UserGroupInformation.doAs():1595
  org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
  org.apache.drill.common.SelfCleaningRunnable.run():38
  java.util.concurrent.ThreadPoolExecutor.runWorker():1142
  java.util.concurrent.ThreadPoolExecutor$Worker.run():617
  java.lang.Thread.run():745 (state=,code=0)

Example data:

{
  "count": 241,
  "events": [
    {
      ...
      "features": [
        { "count": 3,  "name": "feature1" },
        { "count": 30, "name": "feature2" },
        { "count": 2,  "name": "feature3" },
        { "count": 3,  "name": "feature4" }
      ],
      ...
    },
    {
      ...
      "features": {},
      ...
    },
    {
      ...
      "features": [
        { "count": 3,  "name": "feature1" },
        { "count": 30, "name": "feature2" },
        { "count": 2,  "name": "feature3" }
      ],
      ...
    }
  ]
}
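[Editor's note: a short sketch of the triage John describes, assuming the directory and file paths from the thread are reachable through a dfs storage plugin (the dfs prefix is an assumption, not from the thread):

-- verbose errors plus the experimental union type, as above
ALTER SESSION SET `exec.errors.verbose` = true;
ALTER SESSION SET `exec.enable_union_type` = true;

-- querying the whole directory reproduces the error...
SELECT * FROM dfs.`/data/prod/src/`;

-- ...while querying only the file named in the error message often succeeds
SELECT * FROM dfs.`/data/prod/src/file1.json`;
]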
On Mon, Jan 18, 2016 at 4:58 PM, Brent Payne <[email protected]> wrote:

We had a similar issue (or issues) and had to reprocess our data so that everything had a consistent schema, or it would break, sometimes in unexpected ways. We started on 1.2, so maybe some of those issues are gone now. Drill is awesome and can do a lot, but it cannot currently do on-the-fly type conversion/cleanup.


On Mon, Jan 18, 2016 at 2:11 PM, John Omernik <[email protected]> wrote:

I am working with a LARGE volume of data (I state that because even my first reaction was "I'll just write a simple sed command and fix this data up lickety split").

However: lots of files, lots of data, so let's avoid that as the initial answer if possible. (Ideally I am looking for an "on read" solution in Drill.)

Basically, when I try to read a file, I get this error:

Error: DATA_READ ERROR: You tried to start when you are using a ValueWriter of type SingleMapWriter.

The field in question has a silly setup: if it's empty they use {}, and if it's not empty it's an array of data.

So:

"field1": {}
or
"field1": [{"foo": "bar"}, {"bar": "foo"}]

I am pretty sure this is the cause. Point: I am not sure the error message quoted above helps me understand intuitively what is wrong; perhaps some TLC on the error messages could help less Drill-aware users know what's actually breaking (in fairness, the message in 1.4 showed me the line, column, and field, which helped me infer what could POSSIBLY be wrong).

So, is there a way to address this without reprocessing a lot of data? An option in Drill that would allow a dirty read of some sort?

Thanks in advance!!

John
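[Editor's note: for readers hitting the same SingleMapWriter error, a minimal sketch of the data shape described above together with the experimental setting the thread goes on to try. The file name is illustrative, and as the rest of the thread shows, the union type was still experimental at the time and ran into a separate bug (DRILL-4284).

-- /tmp/mixed_field.json (illustrative): the same field is an empty object in one
-- record and an array of objects in another
-- {"field1": {}}
-- {"field1": [{"foo": "bar"}, {"bar": "foo"}]}

-- the experimental union type lets a column hold more than one JSON type
ALTER SESSION SET `exec.enable_union_type` = true;
SELECT * FROM dfs.`/tmp/mixed_field.json`;
]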
