Are you using 0.8, which was just released? I have found it to be much better at handling large JSON data sets.
Also, it is handy to use a predicate to filter out JSON docs when you want to use a map or array that is not present on all of the docs. Typically a null value will get assigned to missing objects or arrays. A simple WHERE a.b.c IS NOT NULL will filter out docs that don't have the specific nested map; WHERE a.b.c[0] IS NOT NULL does the same for arrays, and WHERE a.b.c[0].d IS NOT NULL for a field inside an array element. This avoids the functions having to deal with NULL values when doing calculations, as the empty sets get filtered out.

While Drill is extremely powerful, it is always a good idea to apply some logic to avoid NULL values creeping in with complex data like JSON. Sometimes a simple CAST to a data type can also go a long way toward preventing Drill from having to infer the data type on data that may be inconsistent.

—Andries

On Apr 1, 2015, at 3:29 AM, Alexander Reshetov <[email protected]> wrote:

> Hello all,
>
> I have an 80GB dataset of JSONs which have many nested arrays.
> I'm trying to flatten it and make some calculations, but I get
> exceptions after reading about 2/3 of the file.
>
> I could (and want to) post an issue in Jira, but I cannot attach my dataset
> because it has sensitive data and is also too large.
>
> Is there any way to help investigate the issue without posting my dataset?
>
> To give a hint about the issue, I've attached a file with the exception text.
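Putting the advice above together, a query combining the NULL-filtering predicates with an explicit cast might look like the sketch below. The file path and the nested names (a, b, c, d) are hypothetical placeholders, not taken from the original thread:

```sql
-- Hypothetical Drill query; the path and nested names are placeholders.
SELECT CAST(t.a.b.c[0].d AS DOUBLE) AS d_val  -- explicit cast so Drill
                                              -- need not infer the type
FROM dfs.`/data/docs.json` t
WHERE t.a.b.c IS NOT NULL           -- doc has the nested map
  AND t.a.b.c[0] IS NOT NULL        -- the array has at least one element
  AND t.a.b.c[0].d IS NOT NULL;     -- the element has field d
```

Note the table alias t: Drill needs it to tell a nested column reference apart from a schema-qualified table name.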
