What happens if you split your large file into 5 smaller files?
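If the records happen to be newline-delimited (one JSON object per line), a rough sketch of the split could look like the following; the input path, output directory, and part count are only placeholders, not anything from this thread:

    # Sketch: split a newline-delimited JSON file into a few smaller files.
    # Assumes one JSON record per line; paths and part count are placeholders.
    import os

    SRC = "large.json"        # hypothetical 390MB input file
    OUT_DIR = "split_json"    # hypothetical output directory
    PARTS = 5

    os.makedirs(OUT_DIR, exist_ok=True)

    with open(SRC, "r", encoding="utf-8") as src:
        records = src.readlines()

    per_part = (len(records) + PARTS - 1) // PARTS  # records per file, rounded up
    for i in range(PARTS):
        part = records[i * per_part:(i + 1) * per_part]
        if not part:
            break
        with open(os.path.join(OUT_DIR, f"part_{i}.json"), "w", encoding="utf-8") as out:
            out.writelines(part)

Drill can then query the directory as a single table (e.g. select * from dfs.`/path/to/split_json`), and the separate files give it a chance to read them in parallel.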


On Thu, Nov 2, 2017 at 12:52 PM, Yun Liu <[email protected]> wrote:

> Yes- I increased planner.memory.max_query_memory_per_node to 10GB
> HEAP to 12G
> Direct memory to 16G
> And Perm to 1024M
>
> It didn't have any schema changes. A file with the same format but less
> data works perfectly fine, so I am unable to tell if there's corruption.
>
> Yun
>
> -----Original Message-----
> From: Andries Engelbrecht [mailto:[email protected]]
> Sent: Thursday, November 2, 2017 3:35 PM
> To: [email protected]
> Subject: Re: Drill Capacity
>
> What memory setting did you increase? Have you tried 6 or 8GB?
>
> How much memory is allocated to Drill Heap and Direct memory for the
> embedded Drillbit?
>
> Also, did you check that the larger document doesn’t have any schema changes
> or corruption?
>
> --Andries
>
>
>
> On 11/2/17, 12:31 PM, "Yun Liu" <[email protected]> wrote:
>
>     Hi Kunal and Andries,
>
>     Thanks for your reply. We need JSON in this case because Drill only
> supports up to 65536 columns in a CSV file. I also tried increasing the
> memory size to 4GB but I am still experiencing the same issues. Drill is
> installed in Embedded Mode.
>
>     Thanks,
>     Yun
>
>     -----Original Message-----
>     From: Kunal Khatua [mailto:[email protected]]
>     Sent: Thursday, November 2, 2017 2:01 PM
>     To: [email protected]
>     Subject: RE: Drill Capacity
>
>     Hi Yun
>
>     Andries' solution should address your problem. However, do understand
> that, unlike CSV files, a JSON file cannot be processed in parallel,
> because there is no clear record delimiter (CSV data usually has a new-line
> character to indicate the end of a record). So the larger a file gets, the
> more work a single minor fragment has to do to process it, including
> maintaining the internal data structures that represent the complex JSON
> document.
>
>     The preferable way would be to create more JSON files so that the
> files can be processed in parallel.
>
>     Hope that helps.
>
>     ~ Kunal
>
>     -----Original Message-----
>     From: Andries Engelbrecht [mailto:[email protected]]
>     Sent: Thursday, November 02, 2017 10:26 AM
>     To: [email protected]
>     Subject: Re: Drill Capacity
>
>     How much memory is allocated to the Drill environment?
>     Embedded or in a cluster?
>
>     I don’t think there is a particular limit, but a single JSON file will
> be read by a single minor fragment; in general it is better to match the
> number and size of files to the Drill environment.
>
>     In the short term, try bumping up planner.memory.max_query_memory_per_node
> in the options and see if that works for you.
>
>     --Andries
>
>
>
>     On 11/2/17, 7:46 AM, "Yun Liu" <[email protected]> wrote:
>
>         Hi,
>
>         I've been using Apache Drill actively and am wondering what the
> capacity of Drill is. I have a JSON file which is 390MB and it keeps
> throwing a DATA_READ ERROR. I have another JSON file with the exact same
> format but only 150MB, and it processes fine. When I do a *select* on
> the large JSON file, it returns successfully for some of the fields. None of
> these errors really apply to me. So I am trying to understand how large a
> JSON file Drill can support, or whether there's something else I missed.
>
>         Thanks,
>
>         Yun Liu
>         Solutions Delivery Consultant
>         321 West 44th St | Suite 501 | New York, NY 10036
>         +1 212.871.8355 office | +1 646.752.4933 mobile
>
>         CAST, Leader in Software Analysis and Measurement
>         Achieve Insight. Deliver Excellence.
>         Join the discussion http://blog.castsoftware.com/
>         LinkedIn<http://www.linkedin.com/companies/162909> |
>         Twitter<http://twitter.com/onquality> |
>         Facebook<http://www.facebook.com/pages/CAST/105668942817177>
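
For reference, here is a minimal sketch of one way to apply the suggestion above of bumping planner.memory.max_query_memory_per_node, submitted through the Drillbit's REST interface. It assumes the embedded Drillbit's web server is reachable on the default localhost:8047; the 10GB value simply mirrors the number mentioned earlier in the thread.

    # Sketch: raise Drill's per-node query memory limit via the Drillbit REST API.
    # Assumes an embedded Drillbit with its web server on localhost:8047 (the default).
    import requests

    DRILL_URL = "http://localhost:8047/query.json"

    # 10 GB in bytes, matching the value mentioned earlier in the thread.
    alter_stmt = (
        "ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 10737418240"
    )

    resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": alter_stmt})
    resp.raise_for_status()
    print(resp.json())

The same ALTER statement can also be typed directly into sqlline; using ALTER SESSION instead of ALTER SYSTEM scopes the change to the current session only.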
