Hi David,

I don't have experience with such a huge number of columns. Hopefully someone
from the Impala team can help you.

Regards,
Dejan

Dejan Prokić | Data Engineer | Nordeus


On Wed, 21 Nov 2018 at 21:05, David Lauzon <davidonlap...@gmail.com> wrote:

> Hi Dejan :)
>
> It's like a giant fact table of data coming in daily from 700 external
> sources of similar data with slight variations. You are right that there is
> definitely some schema design optimization that can be done, but the data
> is really dirty. Every week my team discovers inconsistencies in the data
> that even our client was unaware of...
>
> For now our goal is to get the data into a queryable store and build tools to
> help our client extract the data faster than their current technology allows.
> I know that 2200 columns is not the ultimate solution, but it would
> definitely lower the implementation cost of the first deliverable. I am
> trying to understand which aspects of Impala could be impacted by going in
> this direction.
>
> Using the catalog memory usage formula
> <https://www.slideshare.net/cloudera/the-impala-cookbook-42530186/17>,
> this gives roughly 19 GB of RAM for the catalogd (a quick sanity check of the
> arithmetic is sketched below):
>
>    - 1 table * 700 sources * 30 days * 2200 columns * 400 bytes = 17 GB
>    - and about 1.5 GB for the files / blocks / partitions / table
>    metadata.
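>
> Back-of-the-envelope, that works out like this (a rough sketch in Python; the
> 400 bytes per column per partition is the assumed per-column metadata cost
> from the slide, not something I've measured):
>
>     # Rough catalogd heap estimate for the wide fact table
>     tables = 1
>     partitions = 700 * 30            # 700 sources x 30 daily partitions
>     columns = 2200
>     bytes_per_column = 400           # assumed per-column metadata cost per partition
>
>     column_metadata = tables * partitions * columns * bytes_per_column
>     other_metadata = 1.5 * 1024**3   # files / blocks / partitions / table metadata
>
>     total_gib = (column_metadata + other_metadata) / 1024**3
>     print(f"~{total_gib:.1f} GiB of catalogd heap")   # -> ~18.7 GiB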
>
> So how is Impala's behavior affected by having a large number of columns?
>
>    - Is the metadata loaded on startup by the catalogd?
>    - Is it only the catalogd that will use more RAM, or the impalad
>    workers as well?
>    - Anything else to consider?
>
> Let me know if this mailing list is the right place to ask this
> question,
>
> -D
>
> On Tue, Nov 20, 2018 at 12:37 AM Dejan Prokić <dej...@nordeus.com> wrote:
>
>> I suppose that you want to create something like a partitioned table with
>> measures. I would suggest splitting that table into smaller logical
>> units and creating a view which would be a join of those smaller tables. That
>> would also speed up recalculation in the future, since you would have to
>> rewrite far fewer columns if you only want to recalculate a few of them.
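>>
>> For example, something along these lines (a rough sketch using the impyla
>> client; the table, column, and key names here are made up for illustration):
>>
>>     # Sketch: two narrower tables sharing a key, plus a view exposing the wide schema
>>     from impala.dbapi import connect
>>
>>     conn = connect(host='impala-coordinator', port=21050)   # adjust to your cluster
>>     cur = conn.cursor()
>>
>>     cur.execute("""
>>         CREATE TABLE fact_core (record_id BIGINT, source_id INT, metric_a DOUBLE)
>>         PARTITIONED BY (day STRING) STORED AS PARQUET""")
>>     cur.execute("""
>>         CREATE TABLE fact_extra (record_id BIGINT, metric_b DOUBLE, metric_c DOUBLE)
>>         PARTITIONED BY (day STRING) STORED AS PARQUET""")
>>
>>     # Queries still see one wide schema; recalculating metric_b/metric_c
>>     # only requires rewriting fact_extra.
>>     cur.execute("""
>>         CREATE VIEW fact_wide AS
>>         SELECT c.record_id, c.source_id, c.metric_a, e.metric_b, e.metric_c, c.day
>>         FROM fact_core c JOIN fact_extra e
>>           ON c.record_id = e.record_id AND c.day = e.day""")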
>>
>> In case you are making a dimension table, I believe such a big number of
>> columns is not a problem, since it would not have many rows.
>>
>> In case you use it for machine learning features, I think you can still
>> split it into smaller chunks and merge them from code.
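>>
>> For instance (a small sketch with pandas; the chunk files and the shared
>> record_id key are hypothetical):
>>
>>     # Sketch: read feature chunks stored separately and merge them on a key
>>     import pandas as pd
>>
>>     chunk_a = pd.read_parquet('features_chunk_a.parquet')   # e.g. first ~1000 columns
>>     chunk_b = pd.read_parquet('features_chunk_b.parquet')   # e.g. remaining columns
>>
>>     features = chunk_a.merge(chunk_b, on='record_id', how='inner')
>>     print(features.shape)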
>>
>> Hope you find this useful.
>>
>> Cheers,
>>
>> Dejan Prokić | Data Engineer | Nordeus
>>
>> On Tue, 20 Nov 2018 at 01:11, David Lauzon <davidonlap...@gmail.com>
>> wrote:
>>
>>> Hi folks!
>>>
>>> I'm evaluating the possibility of building an Impala table with 2200
>>> columns. I came across the recommendation of a 2000-column maximum
>>> <https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_scalability.html#big_tables>
>>> in the documentation and would like to understand the impact.
>>>
>>> So far, I've found that it could impact the memory usage of the catalog
>>> service. Is the catalog memory usage formula
>>> <https://www.slideshare.net/cloudera/the-impala-cookbook-42530186/17>
>>> still relevant? What other performance aspects should I consider?
>>>
>>> Thanks,
>>>
>>> -David
>>>
>>
