I had a similar sort of issue (granted, at a smaller data scale), and I went
with option 2.
If you put the rowkey of your "data" table plus the tag itself into the
rowkey for your other table/index, you should be able to grow without
running into HBase scalability limits (though pulling 10GB of tags for one
lookup would be crazy slow :P). It's a fast rowkey prefix scan to pull
all the tags for the "data record".
Just don't forget that HBase won't split a single row across multiple
Regions. That's the important part in designing this table.
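
To make the rowkey layout concrete, here is a minimal sketch in plain Python. It does not talk to HBase; a sorted list stands in for the index table, since HBase keeps rows ordered by rowkey bytes. The names (SortedIndex, index_rowkey, tags_for_record) and the zero-byte separator are assumptions made up for illustration, and the separator only works if it never appears in a record ID. A real HBase client would do the equivalent with a Scan restricted to the rowkey prefix.

```python
from bisect import bisect_left

# Hypothetical separator between the data-table rowkey and the tag.
# Assumes this byte never occurs inside a record ID.
SEP = b"\x00"


def index_rowkey(record_id: str, tag: str) -> bytes:
    """Build an index rowkey: <data rowkey> + separator + <tag>."""
    return record_id.encode("utf-8") + SEP + tag.encode("utf-8")


class SortedIndex:
    """Stand-in for the index table: rowkeys kept in byte order."""

    def __init__(self):
        self._keys = []

    def put(self, rowkey: bytes) -> None:
        # Insert in sorted position; writing the same rowkey twice is a no-op.
        i = bisect_left(self._keys, rowkey)
        if i == len(self._keys) or self._keys[i] != rowkey:
            self._keys.insert(i, rowkey)

    def prefix_scan(self, prefix: bytes) -> list:
        # Seek to the first key >= prefix, then read while the prefix matches.
        i = bisect_left(self._keys, prefix)
        matches = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            matches.append(self._keys[i])
            i += 1
        return matches


def tags_for_record(index: SortedIndex, record_id: str) -> list:
    """Return all tags for one data record via a rowkey prefix scan."""
    prefix = record_id.encode("utf-8") + SEP
    return [key.split(SEP, 1)[1].decode("utf-8")
            for key in index.prefix_scan(prefix)]


if __name__ == "__main__":
    idx = SortedIndex()
    idx.put(index_rowkey("rec1", "hbase"))
    idx.put(index_rowkey("rec10", "other"))
    idx.put(index_rowkey("rec1", "bigdata"))
    print(tags_for_record(idx, "rec1"))
```

Because each (record, tag) pair is its own small row, no single row grows with the number of tags, which is what keeps the "one row stays in one Region" rule from becoming a problem. Including the separator in the scan prefix stops "rec1" from also matching "rec10".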
On 2/21/21 11:51 PM, Simon Mottram wrote:
The requirement is to be able to search from a list of tags. Each record
can have a possibly large number of tags. There would be more than one
tag field.
An example might be 3 different hashtag fields. They do have to be
different; we can't have just one tag cloud.
The data size is large, so we need to be able to search the tag clouds
over large numbers of records. Millions but not billions (for now).
e.g:
I was wondering what the best method would be
1) a column per tag value.
ID, name, some_attributes..., type1_tag_1, type1_tag_2
While HBase is happy with many columns, I can't see how to index this.
2) A tag join table. Maybe just a single row key ID + single tag.
Then it becomes a straight join of ID + tag. Thus it would be indexed.
3) Is there a crafty way of using column families? Could that be
indexed efficiently?
Any tips/tricks gratefully received
Simon