Hi community,



Rebuilding a datamap currently has some problems in carbondata. I'll explain 
the problems and possible solutions here so that we can fix them.

Note: Users can refer to datamap-management.md in the repo for the concepts of 
'deferred-rebuild' and 'rebuild'.


POINTS:

`REBUILD DATAMAP datamap_name ON TABLE table_name` is used to refresh a 
specific datamap.

PROBLEMS:

1. This operation can even be fired on a non-deferred-rebuild datamap, where 
it is not needed.

2. `REBUILD` in the current implementation rebuilds the whole datamap and 
discards the old datamap storage. In most scenarios this is not needed. 
Besides, while generating the new datamap data, we do not clear up the old 
data first, which causes the rebuild to fail.

SOLUTIONS:

It seems that, of all the types of datamap in carbondata, only `MV` currently 
needs to be rebuilt explicitly.

Index datamaps (including lucene, bloomfilter) and preaggregate datamaps 
(including timeseries) organize the datamap data by segment, and each datamap 
segment maps to a segment in the main table. So we can manage the datamap data 
at a fine granularity:

11. For a deferred-rebuild datamap, if we fire the `REBUILD DATAMAP` command 
on it, carbondata will generate datamap data only for the segments which do 
not have datamap data yet.

12. If all the segments already have datamap data, this command will return 
immediately.

13. If this datamap is non-deferred-rebuild, this command will return with 
an error message.

14. In case of concurrent rebuilding, we will block concurrent data rebuilding 
for one datamap. A lock will be used to achieve this. 
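
To make points 11 to 14 concrete, here is a minimal sketch of the proposed 
behaviour in Python. It is not carbondata code: `DataMap`, `RebuildError`, 
`rebuild_datamap` and the `build_segment` callback are hypothetical names used 
only for illustration. It also assumes that "block" means the second 
concurrent rebuild is rejected with an error; waiting for the lock instead 
would be an equally valid reading.

```python
import threading
from dataclasses import dataclass, field


class RebuildError(Exception):
    """Hypothetical error returned when REBUILD DATAMAP cannot proceed."""


@dataclass
class DataMap:
    # Hypothetical in-memory stand-in for a segment-based datamap.
    name: str
    deferred_rebuild: bool
    # IDs of main-table segments that already have datamap data.
    built_segments: set = field(default_factory=set)
    # One lock per datamap, so different datamaps can rebuild in parallel.
    lock: threading.Lock = field(default_factory=threading.Lock)


def rebuild_datamap(datamap, table_segments, build_segment):
    """Build datamap data only for segments that lack it.

    Returns the list of segment IDs that were built (empty if up to date).
    """
    # Point 13: REBUILD on a non-deferred-rebuild datamap is an error.
    if not datamap.deferred_rebuild:
        raise RebuildError(
            f"datamap {datamap.name} is not deferred-rebuild")

    # Point 14: only one rebuild per datamap at a time (assumption: reject,
    # do not wait).
    if not datamap.lock.acquire(blocking=False):
        raise RebuildError(
            f"datamap {datamap.name} is already being rebuilt")
    try:
        # Point 11: find segments without datamap data.
        missing = sorted(set(table_segments) - datamap.built_segments)
        # Point 12: if nothing is missing, the loop is skipped and we
        # return immediately.
        for segment_id in missing:
            build_segment(datamap, segment_id)
            datamap.built_segments.add(segment_id)
        return missing
    finally:
        datamap.lock.release()
```

With this shape, old datamap data is never discarded: a rebuild only adds 
data for the segments that are missing it.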


For the MV datamap, it seems that it is deferred-rebuild by default, and the 
structure of its datamap data is different from that of other datamaps. We 
will leave it as it is, which means the user will explicitly rebuild the 
datamap. We only have to ensure:

21. Since the MV datamap is deferred-rebuild by default, either `WITH DEFERRED 
REBUILD` is not needed for an MV datamap, or we should explicitly specify 
`WITH DEFERRED REBUILD` while creating an MV datamap. I'd prefer the former.

22. Block concurrent rebuilding for one MV datamap.

The last one:
31. Since deferred-rebuild is also a datamap property, how about letting the 
user specify it explicitly in DMPROPERTIES?
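
As a rough illustration of how points 21 and 31 could fit together, the 
sketch below resolves the deferred-rebuild flag from the datamap properties. 
The property key `deferred_rebuild` and the function name are assumptions made 
up for this example; they are not existing carbondata names. It also assumes 
that an explicit value always wins over the per-type default.

```python
# Hypothetical property key; the real name would be decided in the design.
DEFERRED_REBUILD_KEY = "deferred_rebuild"


def is_deferred_rebuild(provider, dm_properties):
    """Resolve whether a datamap is deferred-rebuild.

    provider: datamap type, e.g. "mv", "lucene", "bloomfilter".
    dm_properties: dict of DMPROPERTIES given at creation time.
    """
    # Treat property keys as case-insensitive.
    props = {key.lower(): str(value) for key, value in dm_properties.items()}

    # Point 21 (former option): MV is deferred-rebuild unless stated
    # otherwise; all other datamap types are not.
    default = "true" if provider.lower() == "mv" else "false"

    # Point 31: an explicit property value overrides the default.
    value = props.get(DEFERRED_REBUILD_KEY, default)
    return value.strip().lower() == "true"
```

For example, `is_deferred_rebuild("mv", {})` yields `True` without any extra 
clause, while a bloomfilter datamap is deferred-rebuild only if the property 
is set to `true`.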

