I had a similar sort of issue (granted, at a smaller data scale), and I went with option 2.

If you put the rowkey of your "data" table plus the tag itself into the rowkey of your other table/index, you should be able to grow without running into HBase scalability limits (though pulling 10GB of tags for one lookup would be crazy slow :P). Pulling all the tags for a "data record" is then a fast rowkey prefix scan.
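To make the rowkey layout concrete, here is a minimal sketch in plain Python. It does not use the HBase client at all: a sorted list of byte strings stands in for the table, since HBase keeps rows sorted by rowkey. The names tag_rowkey, prefix_scan and tags_for_record, and the 0x00 separator, are my own assumptions for illustration, not anything HBase provides.

```python
from bisect import bisect_left

# Assumed separator between record ID and tag. It must never occur inside
# a record ID, otherwise "rec1" would also match rows belonging to "rec10".
SEP = b"\x00"


def tag_rowkey(record_id: bytes, tag: bytes) -> bytes:
    """Compose the index rowkey: <data rowkey> <separator> <tag>."""
    return record_id + SEP + tag


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix, walking from the first match.

    sorted_keys must be in byte order, which is how HBase stores rows.
    """
    i = bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


def tags_for_record(sorted_keys, record_id):
    """Pull all tags for one data record by scanning its rowkey prefix."""
    prefix = record_id + SEP
    return [key[len(prefix):] for key in prefix_scan(sorted_keys, prefix)]


# Stand-in for the index table: one row per (record, tag) pair.
index = sorted([
    tag_rowkey(b"rec1", b"hbase"),
    tag_rowkey(b"rec1", b"schema"),
    tag_rowkey(b"rec10", b"other"),
    tag_rowkey(b"rec2", b"hbase"),
])

print(tags_for_record(index, b"rec1"))  # [b'hbase', b'schema']
```

The separator is what keeps the scan for "rec1" from also picking up the rows for "rec10"; fixed-width record IDs would do the same job.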

Just don't forget that HBase won't split a single row across multiple Regions. That's the important part in designing this table.
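A rough way to see why that matters, again as a plain Python sketch rather than anything HBase-specific: compare one wide row holding every tag as a column against one row per tag. The row_sizes helper and the byte count it uses (rowkey + qualifier + value per cell) are assumptions for illustration only, not how HBase measures anything on disk.

```python
from collections import defaultdict


def row_sizes(cells):
    """Crude per-row byte estimate from (rowkey, qualifier, value) cells."""
    sizes = defaultdict(int)
    for rowkey, qualifier, value in cells:
        sizes[rowkey] += len(rowkey) + len(qualifier) + len(value)
    return dict(sizes)


tags = [f"tag{i}".encode() for i in range(1000)]

# Wide layout: one row per record, one column per tag.
wide = [(b"rec1", tag, b"") for tag in tags]

# Tall layout: one row per (record, tag), with a single dummy column "t".
tall = [(b"rec1\x00" + tag, b"t", b"") for tag in tags]

wide_sizes = row_sizes(wide)
tall_sizes = row_sizes(tall)

print("wide:", len(wide_sizes), "row(s), largest", max(wide_sizes.values()), "bytes")
print("tall:", len(tall_sizes), "row(s), largest", max(tall_sizes.values()), "bytes")
```

In the wide layout the single row grows with every tag added and can never be split, whereas in the tall layout each row stays tiny and a Region boundary can fall between any two of them.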

On 2/21/21 11:51 PM, Simon Mottram wrote:
The requirement is to be able to search from a list of tags; each record can have a possibly large number of tags. There would be more than one tag field.

An example might be 3 different hashtag fields. They do have to be different; we can't have just one tag cloud.

The data size is large so we need to be able to search the tag clouds over large numbers.  Millions but not billions (for now)

e.g:

I was wondering what the best method would be

1) a column per tag value.
ID, name, some_attributes..., type1_tag_1,  type1_tag_2

While HBase is happy with many columns, I can't see how to index this.

2) A tag join table. Maybe just a single row key: ID + single tag. Then it becomes a straight join of ID + tag, and thus it would be indexed.

3) Is there a crafty way of using column families?  Could that be indexed efficiently?

Any tips/tricks gratefully received

Simon
