Re: Troubleshooting JSON File read

Ted Dunning Mon, 21 Sep 2015 17:14:01 -0700

Consider just deleting the leading [ and trailing ]. If your objects are on
a single line, you are good to go at that point.
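
For anyone who would rather script that than edit a 1.2 GB file by hand, here is a minimal sketch in Python. It assumes the dump is one JSON object per line, with the opening [ at the start of the first line and the closing ] at the end of the last line. The names strip_array_brackets and rewrite_gzip are made up for illustration; nothing here is part of Drill. The sketch only removes the two brackets (and drops empty lines); anything else, such as commas between objects, is left untouched.

```python
import gzip


def strip_array_brackets(lines):
    """Yield lines with the outer JSON array brackets removed.

    Assumption: one object per line, '[' opens the first line and
    ']' closes the last line. Empty lines are dropped.
    """
    prev = None
    first = True
    for line in lines:
        text = line.rstrip("\n")
        if first:
            # Only the very first line may carry the opening bracket.
            stripped = text.lstrip()
            if stripped.startswith("["):
                text = stripped[1:]
            first = False
        # Hold each line back by one step so the last line can be
        # treated specially once the input is exhausted.
        if prev is not None and prev.strip():
            yield prev + "\n"
        prev = text
    if prev is not None:
        tail = prev.rstrip()
        if tail.endswith("]"):
            tail = tail[:-1]
        if tail.strip():
            yield tail + "\n"


def rewrite_gzip(src_path, dst_path):
    """Stream a gzipped dump through strip_array_brackets into a new gzip file."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(strip_array_brackets(src))
```

Something like rewrite_gzip("in.json.gz", "out.json.gz") (both paths are placeholders) streams the file line by line, so the whole dump never has to fit in memory.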




On Mon, Sep 21, 2015 at 12:45 PM, John Omernik <[email protected]> wrote:

> I think I found my issue (see below). I'd recommend that Drill include a
> warning when querying such data, given the kind of silent failure I had.
> Now I'm trying to figure out another issue (a query takes forever on 25
> smaller gzipped files of 50-100 MB each). I'll keep posting here...
>
>
>
>
>
> Lengthy JSON objects
>
> Currently, Drill cannot manage lengthy JSON objects, such as a gigabit JSON
> file. Finding the beginning and end of records can be time consuming and
> require scanning the whole file.
>
> Workaround: Use a tool to split the JSON file into smaller chunks of
> 64-128MB or 64-256MB initially until you know the total data size and node
> configuration. Keep the JSON objects intact in each file. A distributed
> file system, such as HDFS, is recommended over trying to manage file
> partitions.
>
> On Mon, Sep 21, 2015 at 1:15 PM, John Omernik <[email protected]> wrote:
>
> > I am reading a MongoDB dump file in Drill. On the surface it seems to be
> > working well; however, I need to troubleshoot, and I was curious about
> > the best way to approach it. Here are some "things":
> >
> >
> > 1. It's a large file, 1.2 GB compressed. It's named mondodump.json.gz and
> > Drill seems to be (on the surface) handling that correctly.
> > 2. It's Drill 1.1 (MapR package).
> > 3. select * from `/pathto/*` limit 10 seems to work. In this case the
> > _id field holds IP addresses (long story).
> > 4. If I run select * from `/pathto/*` where `_id` = '123.123.123.123'
> > (an _id that was returned by the select * limit 10 query from #3), it
> > finds the record; all is well.
> > 5. If I run select * from `/pathto/*` where `_id` = '127.0.0.1', which I
> > know to be in the data (validated with zgrep), it does NOT find the data.
> > Based on the results from zgrep, it should find it. I am not sure if there
> > is something weird in reading the data, but it's not throwing errors.
> > 6. select count(*) from `/pathto/*` returns the same number as zcat
> > source.json.gz|wc -l. This is interesting because it apparently means
> > things are lined up, but why isn't that IP showing?
> >
> > So my question is this: Is there anything in Drill that would cause it to
> > miss that? Weird chars, etc.? I know it's hard, but with a 1.2 GB
> > compressed file, how would one troubleshoot this?
> >
>
