What happens if you split your large file into 5 smaller files?
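Splitting only helps if the JSON is newline-delimited (one complete record per line), so that each chunk is valid on its own. A minimal sketch of the idea (the function and file names are hypothetical, not anything from Drill itself):

```python
import os

def split_ndjson(path, n_parts, out_dir="."):
    """Split a newline-delimited JSON file into n_parts smaller files,
    distributing records round-robin so the parts stay roughly equal."""
    base = os.path.splitext(os.path.basename(path))[0]
    outs = [open(os.path.join(out_dir, f"{base}_{i}.json"), "w")
            for i in range(n_parts)]
    try:
        with open(path) as src:
            for i, line in enumerate(src):
                # Each line is assumed to be one self-contained JSON record.
                outs[i % n_parts].write(line)
    finally:
        for f in outs:
            f.close()
```

Pointing Drill at the directory of smaller files then lets it assign each file to its own reader, instead of one fragment chewing through the whole 390MB document.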
On Thu, Nov 2, 2017 at 12:52 PM, Yun Liu <[email protected]> wrote:
> Yes - I increased planner.memory.max_query_memory_per_node to 10GB,
> heap to 12G, direct memory to 16G, and perm to 1024M.
>
> It didn't have any schema changes. A file with the same format but less
> data works perfectly. I am unable to tell if there's corruption.
>
> Yun
>
> -----Original Message-----
> From: Andries Engelbrecht [mailto:[email protected]]
> Sent: Thursday, November 2, 2017 3:35 PM
> To: [email protected]
> Subject: Re: Drill Capacity
>
> Which memory setting did you increase? Have you tried 6 or 8GB?
>
> How much memory is allocated to Drill heap and direct memory for the
> embedded Drillbit?
>
> Also, did you check that the larger document doesn't have any schema
> changes or corruption?
>
> --Andries
>
>
> On 11/2/17, 12:31 PM, "Yun Liu" <[email protected]> wrote:
>
> Hi Kunal and Andries,
>
> Thanks for your reply. We need JSON in this case because Drill only
> supports up to 65536 columns in a CSV file. I also tried increasing the
> memory size to 4GB but I am still experiencing the same issues. Drill is
> installed in embedded mode.
>
> Thanks,
> Yun
>
> -----Original Message-----
> From: Kunal Khatua [mailto:[email protected]]
> Sent: Thursday, November 2, 2017 2:01 PM
> To: [email protected]
> Subject: RE: Drill Capacity
>
> Hi Yun,
>
> Andries' solution should address your problem. However, do understand
> that, unlike CSV files, a JSON file cannot be processed in parallel,
> because there is no clear record delimiter (CSV data usually has a
> newline character to indicate the end of a record). So, the larger a
> file gets, the more work a single minor fragment has to do in processing
> it, including maintaining internal data structures to represent the
> complex JSON document.
>
> The preferable way would be to create more JSON files so that the
> files can be processed in parallel.
>
> Hope that helps.
> ~ Kunal
>
> -----Original Message-----
> From: Andries Engelbrecht [mailto:[email protected]]
> Sent: Thursday, November 02, 2017 10:26 AM
> To: [email protected]
> Subject: Re: Drill Capacity
>
> How much memory is allocated to the Drill environment?
> Embedded or in a cluster?
>
> I don't think there is a particular limit, but a single JSON file will
> be read by a single minor fragment; in general it is better to match the
> number and size of files to the Drill environment.
>
> In the short term, try bumping up
> planner.memory.max_query_memory_per_node in the options and see if that
> works for you.
>
> --Andries
>
>
> On 11/2/17, 7:46 AM, "Yun Liu" <[email protected]> wrote:
>
> Hi,
>
> I've been using Apache Drill actively and am wondering what the
> capacity of Drill is. I have a JSON file which is 390MB and it keeps
> throwing a DATA_READ ERROR. I have another JSON file with the exact same
> format but only 150MB, and it processes fine. When I did a select on the
> large JSON, it returned successfully for some of the fields. None of the
> documented errors really apply to me. So I am trying to understand what
> size of JSON file Drill supports, or whether there's something else I
> missed.
>
> Thanks,
>
> Yun Liu
> Solutions Delivery Consultant
> 321 West 44th St | Suite 501 | New York, NY 10036
> +1 212.871.8355 office | +1 646.752.4933 mobile
>
> CAST, Leader in Software Analysis and Measurement
> Achieve Insight. Deliver Excellence.
> Join the discussion http://blog.castsoftware.com/
> LinkedIn <http://www.linkedin.com/companies/162909> | Twitter
> <http://twitter.com/onquality> | Facebook
> <http://www.facebook.com/pages/CAST/105668942817177>
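For reference, option changes like the one Andries suggests are normally made with an ALTER SYSTEM (or ALTER SESSION) statement in sqlline or the web console; memory options such as this one take a value in bytes. A sketch, using 8GB as an illustrative value rather than a recommendation:

```sql
-- Raise the per-node query memory limit to 8GB (8 * 1024^3 bytes).
ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 8589934592;
```

ALTER SESSION applies the change only to the current connection, which is a safer way to experiment before committing to a system-wide setting.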