Short description: I'd like to have 4000 tables in my HBase cluster. Will this be a problem? In general, what problems do you run into when you try to host thousands of tables in a cluster?
Long description: I'd like the performance advantage of pre-split tables, and I'd also like to do filtered range scans.

Imagine a keyspace where the key consists of [POSITION]_[WORD], where POSITION is a number from 1 to 4000 and WORD is a string of 96 characters. The value in the cell would be a single integer. My app will examine a 'document', where each 'line' consists of 4000 WORDs. For each WORD, it'll do a filtered regex lookup.

The only problem? Say I have 200 mappers and they all start at POSITION 1; my region servers would get hotspotted like crazy. So my idea is to break it into 4000 tables (one for each POSITION) and then pre-split the tables so that each region gets an equal share of the traffic. In this scenario, the key would just be WORD.

Dunno if this is a bad idea; I'd be open to suggestions.

Thanks! --J
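To make the idea concrete, here's a rough sketch of one per-POSITION table using the HBase Java client. The table name pos_0001, the column family f, the split points, and the regex are just placeholders I made up for illustration, not part of any real schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.RegexStringComparator;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PerPositionTableSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // One table per POSITION, e.g. "pos_0001" .. "pos_4000"; the row key is just WORD.
            String tableName = "pos_0001";
            HTableDescriptor desc = new HTableDescriptor(tableName);
            desc.addFamily(new HColumnDescriptor("f"));

            // Pre-split on the leading characters of WORD so each region takes
            // an equal slice of the traffic (these split points are placeholders).
            byte[][] splits = new byte[][] {
                Bytes.toBytes("f"), Bytes.toBytes("m"), Bytes.toBytes("s")
            };
            admin.createTable(desc, splits);

            // The filtered regex lookup a mapper would run against one table.
            HTable table = new HTable(conf, tableName);
            Scan scan = new Scan();
            scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                    new RegexStringComparator("^foo.*bar$")));
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                int value = Bytes.toInt(r.value());  // the single-integer cell value
                // ... do something with value
            }
            scanner.close();
            table.close();
            admin.close();
        }
    }

In practice I'd derive the split points from the actual distribution of WORDs rather than hard-coding them like this.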
