Are you using 0.8, which was just released? I have found it to be much better at 
handling large JSON data sets.

Also, it is handy to use a predicate to filter out JSON docs when you want to 
use a map or array that is not present in all of the docs. Typically a null 
value gets assigned to missing objects or arrays.

A simple WHERE a.b.c IS NOT NULL will filter out docs that don't have the 
specific nested map.
Or WHERE a.b.c[0] IS NOT NULL for arrays.
Or WHERE a.b.c[0].d IS NOT NULL for a field inside the first array element.

This avoids the functions having to deal with NULL values when doing 
calculations, as the empty sets get filtered out.
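For example, since you are flattening nested arrays, you can apply the predicate before FLATTEN so the function never sees a missing array. This is just a sketch — the path a.b.c and the file location are hypothetical stand-ins for your data:

```sql
-- Filter out docs that lack the nested array before flattening,
-- so FLATTEN only operates on rows where the array exists.
SELECT FLATTEN(t.a.b.c) AS item
FROM dfs.`/path/to/data.json` t
WHERE t.a.b.c[0] IS NOT NULL;
```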

While Drill is extremely powerful, it is always a good idea to apply some logic 
to avoid NULL values creeping in with complex data like JSON. Sometimes a 
simple CAST to an explicit data type can also go a long way toward preventing 
Drill from having to guess the data type on data that may be inconsistent.
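As a sketch of that last point (field names here are made up for illustration), casting a field that is sometimes numeric and sometimes a string to VARCHAR keeps the inferred type stable across the whole scan:

```sql
-- Explicit casts so Drill does not have to infer types
-- from values that vary from document to document.
SELECT CAST(t.a.b.id AS VARCHAR(64)) AS id,
       FLATTEN(t.a.b.c) AS item
FROM dfs.`/path/to/data.json` t
WHERE t.a.b.c[0] IS NOT NULL;
```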

—Andries



On Apr 1, 2015, at 3:29 AM, Alexander Reshetov <[email protected]> 
wrote:

> Hello all,
> 
> I have 80GB dataset of JSONs which have many nested arrays.
> I'm trying to flatten it and make some calculations, but I got
> exceptions after reading about 2/3 of file.
> 
> I could (and want to) post an issue in Jira, but I cannot attach my dataset
> because it has sensitive data and is also too large.
> 
> Is there any way to help investigate issues without posting my dataset?
> 
> To give a hint about the issue, I've attached a file with the exception text.
