The three ideas you outline are probably your best options, and which one
you choose will depend on your particular needs. It sounds like you're on
the right track, though. Comments on each are inline below:
On Sat, Mar 10, 2018 at 2:12 PM Oleksandr Baliev <aleksanderba...@gmail.com>
wrote:
> I have a big table with raw numbers and ids in Impala. When I generate
> some reports, I have a bunch of ids for which I need to show the
> corresponding names (from 100 to 100,000, depending on the query), which
> are stored in another table in some RDBMS. So I have raw ids with key
> figures for them in Impala and user-friendly names for these ids in the
> RDBMS, and I need to show all of this info somehow.
> So I need to sync the two data sources at query level and show
> user-friendly info with all the aggregated key figures. Could you please
> give some advice on the preferable way to deal with such tasks in Impala
> (it's probably also relevant to the big-data area in general)?
> A few ways I see:
> 1. One way is to somehow sync the RDBMS with Impala (the question is
> also how, but I guess that can be resolved using some key-based storage
> like Kudu or HBase), do the joins on the Impala side (since subqueries
> are not supported in SELECT), and in the end show id + corresponding
> name. But the joins could slow everything down; I'm not sure yet and of
> course want to run tests with all the data, but maybe some of you have
> already tried that.
This is a reasonable approach, and Impala has good support for fast joins.
The main concerns are figuring out how to perform the sync, which will
depend on your use case, and the possibility of data being duplicated
between Impala and the other RDBMS, which could require more storage.
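To make option 1 concrete, here is a rough sketch. The table and column
names (raw_metrics, entity_names, value) are placeholders, and this assumes
the id/name mapping is kept in a Kudu table so that it can be updated as
the RDBMS changes:

```sql
-- Small dimension table holding the id -> name mapping, backed by Kudu
-- so rows can be upserted as part of the sync process.
CREATE TABLE entity_names (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- Join at query time; since entity_names is small relative to the fact
-- table, Impala will typically broadcast it to the scanning nodes.
SELECT n.name,
       SUM(m.value) AS total
FROM raw_metrics m
JOIN entity_names n ON m.id = n.id
GROUP BY n.name;
```

A join against a small dimension table like this is usually cheap compared
to the fact-table scan itself, but testing with your real data volumes, as
you plan to, is the right way to confirm that.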
> 2. Another way is to get the report from Impala and resolve the names
> from the RDBMS at the application level -> too slow for me, as I need
> different reports and a flexible way to create them. But maybe it's
> possible to export to some third-party stack, like BI tools; however, I
> need more real-time queries, and from my limited experience, BI tools
> are good when they compile all that data and can visualize it well, but
> that's it. Maybe some of you are already using something reliable and
> flexible; I would be interested to hear about your experience.
As you say, this approach is likely to be the slowest and least flexible,
since you would have to write any join logic you need yourself. It would
be necessary, though, if some requirement of your use case prevents you
from importing the data from the other RDBMS into Impala, as the other two
options require.
> 3. Store the id with the name in the big table from the beginning; use
> the id for aggregation and the name for viewing. The question here is:
> do you think it's a good approach? Won't performance be slowed down for
> some reason, or are there other pitfalls?
This option will probably give you the best performance in terms of query
running time, depending on your setup.
- One issue here is that by adding more columns to your table, you may
increase the amount of data that has to be scanned for some queries, for
example a query that accesses only the ids but not the names. This can be
avoided by using a columnar format such as Parquet, so that Impala will
only scan the columns that are needed.
- Another possible problem is if this leads to duplicated data within
Impala, e.g. if the names are now stored in two different tables when
previously they were only in one. Whether the extra storage needed is
manageable will depend on your workload.
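To illustrate the Parquet point, here is a sketch with a hypothetical
denormalized table (the names raw_metrics_denorm and value are
placeholders):

```sql
-- Denormalized table stored as Parquet: the name column travels with
-- the id, but queries that don't reference it won't read it from disk.
CREATE TABLE raw_metrics_denorm (
  id BIGINT,
  name STRING,
  value DOUBLE
)
STORED AS PARQUET;

-- Reads only the id and value column chunks; name is skipped entirely.
SELECT id, SUM(value) FROM raw_metrics_denorm GROUP BY id;

-- Reads the name column as well, only when the report actually needs it.
SELECT name, SUM(value) FROM raw_metrics_denorm GROUP BY name;
```

With a row-oriented format, by contrast, every query would pay the cost of
the extra column, which is why the choice of file format matters for this
option.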
> I would really appreciate if you could share your experience with such
> kind of tasks.
> with kind regards,
> Oleksandr Baliev