Hi David, I don't have experience with such a huge number of columns. Hopefully someone from the Impala team can help you.
Regards,
Dejan

*Dejan Prokić* | Data Engineer | Nordeus

On Wed, 21 Nov 2018 at 21:05, David Lauzon <davidonlap...@gmail.com> wrote:

> Hi Dejan :)
>
> It's like a giant fact table of data coming in daily from 700 external
> sources of similar data with slight variations. You are right that some
> schema design optimization could definitely be done, but the data is
> really dirty. Every week my team discovers inconsistencies in the data
> that even our client was unaware of...
>
> For now, our goal is to get the data into a queryable store and to build
> tools that help our client extract the data faster than with their
> current technology. I know that 2200 columns is not the ultimate
> solution, but it would definitely lower the implementation cost of the
> first deliverable. I am trying to understand which aspects of Impala
> could be affected by this direction.
>
> Using the catalog memory usage formula
> <https://www.slideshare.net/cloudera/the-impala-cookbook-42530186/17>,
> this gives about 19 GB of RAM for the catalogd (a runnable sanity check
> of this arithmetic follows the quoted thread):
>
> - 1 table * 700 sources * 30 days * 2200 columns * 400 bytes = 17 GB
> - plus about 1.5 GB for the file / block / partition / table metadata.
>
> So how is Impala's behavior affected by having a large number of
> columns?
>
> - Is the metadata loaded on startup by the catalogd?
> - Is it only the catalogd that will use more RAM, or the impalad
>   workers as well?
> - Anything else to consider?
>
> Let me know if this mailing list is the right place for asking this
> question.
>
> -D
>
> On Tue, Nov 20, 2018 at 12:37 AM Dejan Prokić <dej...@nordeus.com> wrote:
>
>> I suppose you want to create something like a partitioned table with
>> measures. I would suggest splitting that table into smaller logical
>> units and creating a view that joins those smaller tables (a sketch of
>> this layout follows the quoted thread). That would also speed up
>> recalculation in the future, since you would have to rewrite far fewer
>> columns if you only want to recalculate a few.
>>
>> If you are making a dimension table, I believe such a large number of
>> columns is not a problem, since it would not have many rows.
>>
>> If you are using it for machine learning features, I think you can
>> still split it into smaller chunks and merge them in code.
>>
>> Hope you find this useful.
>>
>> Cheers,
>>
>> Dejan Prokić | Data Engineer | Nordeus
>>
>> On Tue, 20 Nov 2018 at 01:11, David Lauzon <davidonlap...@gmail.com>
>> wrote:
>>
>>> Hi folks!
>>>
>>> I'm evaluating the possibility of building an Impala table with 2200
>>> columns. I came across the recommendation of a 2000-column maximum
>>> <https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_scalability.html#big_tables>
>>> in the documentation and would like to understand the impact.
>>>
>>> So far, I've found that it could affect the memory usage of the
>>> catalog service. Is the catalog memory usage formula
>>> <https://www.slideshare.net/cloudera/the-impala-cookbook-42530186/17>
>>> still relevant? What other performance aspects should I consider?
>>>
>>> Thanks,
>>>
>>> -David
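
As a sanity check on the arithmetic above, here is the cookbook slide's
rule of thumb written out as a single Impala query. The ~400 bytes of
catalog metadata per column is the figure from the linked slide, and the
rest is the geometry David describes (1 table, 700 sources x 30 daily
partitions, 2200 columns); this is a back-of-the-envelope estimate, not
an official sizing formula.

-- Rough catalogd memory estimate, per the cookbook slide:
-- tables x partitions x columns x ~400 bytes of metadata per column.
SELECT CAST(1 AS BIGINT)   -- tables
       * 700 * 30          -- partitions: 700 sources x 30 days
       * 2200               -- columns
       * 400                -- approx. bytes of catalog metadata per column
       / POW(1024, 3) AS catalog_column_metadata_gib;
-- Returns ~17.2; add roughly 1.5 GiB of file/block/partition/table
-- metadata to reach the ~19 GB total quoted in the thread.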
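
And here is a minimal sketch of the split-and-view layout Dejan
suggests, with hypothetical table, column, and key names; the point is
only that recalculating a few measures then rewrites one narrow table
rather than all 2200 columns.

-- Hypothetical layout: two narrow fact tables sharing a key and a
-- partition column, reassembled into the wide shape through a view.
CREATE TABLE fact_measures_a (
  record_id BIGINT,
  source_id INT,
  m1 DOUBLE,
  m2 DOUBLE          -- ... first logical group of measure columns
)
PARTITIONED BY (day STRING)
STORED AS PARQUET;

CREATE TABLE fact_measures_b (
  record_id BIGINT,
  m3 DOUBLE,
  m4 DOUBLE          -- ... second logical group of measure columns
)
PARTITIONED BY (day STRING)
STORED AS PARQUET;

-- Queries see one wide "table"; recalculating m3/m4 rewrites only
-- fact_measures_b.
CREATE VIEW fact_wide AS
SELECT a.day, a.record_id, a.source_id, a.m1, a.m2, b.m3, b.m4
FROM fact_measures_a a
JOIN fact_measures_b b
  ON a.record_id = b.record_id AND a.day = b.day;

Note that the view only hides the split; whether the join cost at query
time is acceptable depends on the workload.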