I didn't see leading or trailing [] in the data; I think the issue was one
big file. Even after I split it, though, Drill became extremely slow
processing the 25 gzipped JSON files, perhaps due to the 263 dynamic key
names. The files ranged from 14M to 85M compressed. I'm not sure why it
didn't handle that situation better, unless it was simply unhappy with all
the keys.
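For reference, the bracket stripping Ted suggests below plus the chunking
workaround can be sketched as one pipeline. This is a hypothetical sketch
assuming GNU split and one JSON object per line; `mongodump.json.gz` and the
tiny sample data stand in for the real dump:

```shell
# Sketch, assuming GNU coreutils (split --filter) and one object per line.
# "mongodump.json.gz" stands in for the real dump; the sample simulates a
# mongoexport-style array dump with a leading "[" and trailing "]".
printf '[\n{"_id": "1.1.1.1"},\n{"_id": "2.2.2.2"},\n{"_id": "127.0.0.1"}\n]\n' \
  | gzip > mongodump.json.gz

# 1. Drop the "[" and "]" lines and the trailing commas, leaving plain
#    newline-delimited JSON that Drill can stream record by record.
# 2. Re-split into gzipped chunks (-l 2 here for the demo; raise the line
#    count until chunks land in the 64-128MB compressed range).
zcat mongodump.json.gz \
  | sed -e '/^\[$/d' -e '/^\]$/d' -e 's/},$/}/' \
  | split -l 2 --filter='gzip > $FILE.gz' - chunk_

zcat chunk_*.gz | wc -l   # should equal the original record count
```

The `wc -l` at the end is the same sanity check used later in this thread:
the line count across chunks should match the record count of the source.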

John

On Mon, Sep 21, 2015 at 7:12 PM, Ted Dunning <[email protected]> wrote:

> Consider just deleting the leading [ and trailing ]. If your objects are
> each on a single line, you are good to go at that point.
>
>
>
> On Mon, Sep 21, 2015 at 12:45 PM, John Omernik <[email protected]> wrote:
>
> > I think I found my issue (see below). I'd recommend that Drill include a
> > warning when querying such data, rather than the opaque failure I hit.
> > Now I'm trying to figure out another issue (a query that takes forever on
> > 25 smaller, 50-100 MB gzipped files)... I'll keep posting here...
> >
> >
> >
> >
> >
> > Lengthy JSON objects
> >
> > Currently, Drill cannot manage lengthy JSON objects, such as a gigabyte
> > JSON file. Finding the beginning and end of records can be
> > time-consuming and can require scanning the whole file.
> >
> > Workaround: Use a tool to split the JSON file into smaller chunks of
> > 64-128MB or 64-256MB initially, until you know the total data size and
> > node configuration. Keep the JSON objects intact in each file. A
> > distributed file system, such as HDFS, is recommended over trying to
> > manage file partitions.
> >
> > On Mon, Sep 21, 2015 at 1:15 PM, John Omernik <[email protected]> wrote:
> >
> > > I am reading a MongoDB dump file in Drill. On the surface it seems to
> > > be working well; however, I need to troubleshoot a few things and was
> > > curious about the best way to approach it. Here are some details:
> > >
> > >
> > > 1. It's a large file, 1.2 GB compressed, named mondodump.json.gz, and
> > > Drill seems (on the surface) to be handling it correctly.
> > > 2. It's Drill 1.1 (MapR package).
> > > 3. select * from `/pathto/*` limit 10 seems to work; in this case the
> > > _id field is IP addresses (long story).
> > > 4. If I run select * from `/pathto/*` where `_id` = '123.123.123.123'
> > > (an _id that was returned by the limit 10 query in #3), it finds the
> > > record; all is well.
> > > 5. If I run select * from `/pathto/*` where `_id` = '127.0.0.1', which
> > > I know to be in the data (validated with zgrep), it does NOT find the
> > > record. Based on the zgrep results, it should. I'm not sure if
> > > something is odd in how the data is read, but it's not throwing errors.
> > > 6. select count(*) from `/pathto/*` returns the same number as zcat
> > > source.json.gz | wc -l. That's interesting because it suggests the
> > > records line up, so why isn't that IP showing?
> > >
> > > So my question is this: is there anything in Drill that would cause it
> > > to miss that record? Weird characters? I know it's hard to say, but
> > > with a 1.2 GB compressed file, how would one troubleshoot this?
> > >
> >
>
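On the missing-record question quoted above, one quick check is to make the
raw bytes around the "missing" value visible. A sketch under stated
assumptions: `mongodump.json.gz` and the polluted sample record are
stand-ins, and `cat -A` only flags non-printing bytes (a trailing carriage
return from a Windows-edited file is one plausible culprit):

```shell
# Sketch: expose hidden bytes around the value zgrep finds but the query
# does not. "mongodump.json.gz" stands in for the real dump; the sample
# record simulates an _id polluted with a trailing carriage return.
printf '{"_id": "127.0.0.1\r"}\n' | gzip > mongodump.json.gz

# grep -n adds line numbers; cat -A marks CRs (^M), tabs (^I), and line
# ends ($), so stray control characters show up in the output.
zcat mongodump.json.gz | grep -n '127\.0\.0\.1' | cat -A
```

If the output shows `^M`, `^I`, or other markers inside the _id string, an
exact-match predicate like `_id` = '127.0.0.1' will never hit even though
zgrep's substring match does.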
