I was scanning through various questions people have asked on this mailing list about choosing the right schema so that MapReduce jobs run well and hot regions from sequential access are avoided. Somewhere I got the impression that it is okay for a row to have millions of columns and/or for a region to hold a large volume of data, but then my MapReduce job to copy rows failed because a row was too large (121 MB). So now I am confused about what the recommended approach is. Does this mean the default region size and other configuration parameters need to be tweaked?
In my use case, the system receives lots of metrics for different users and I need to maintain daily counters for each of them; it is at day granularity, not a typical TSD-style time series. My row key has the user id and metric name as the prefix and the day timestamp as the suffix, and I keep incrementing the values. The scale issue comes from also storing information about the source of each metric, e.g. the id of the person who mentioned my user in a tweet. I store all of that in different columns of the same row, so the pattern is highly variable: a million people might tweet about one user on a given day and just two people about another. Is it a bad idea to use columns here? I did it this way because it makes it easy for a separate process to run later and aggregate the information, such as listing all the people who mentioned my user during a given date range. Thanks
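
Roughly, the write path looks something like the sketch below (table, family, and qualifier names here are just simplified placeholders, and it assumes the standard HBase 1.x client API rather than my exact code):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DailyMetricWriter {

    // Hypothetical column family and counter qualifier.
    private static final byte[] CF = Bytes.toBytes("d");
    private static final byte[] COUNT_QUAL = Bytes.toBytes("count");

    // Row key layout as described above: <userId>|<metricName>|<dayTimestamp>
    static byte[] rowKey(String userId, String metric, long dayTs) {
        return Bytes.toBytes(userId + "|" + metric + "|" + dayTs);
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("daily_metrics"))) {

            byte[] row = rowKey("user42", "mentions", 20240101L);

            // Bump the daily counter for this user/metric/day.
            Increment inc = new Increment(row);
            inc.addColumn(CF, COUNT_QUAL, 1L);
            table.increment(inc);

            // Record the source of the metric as its own column qualifier,
            // e.g. the id of the person who tweeted the mention. A very
            // popular user ends up with millions of such columns in one row,
            // which is where the wide-row concern comes from.
            Put put = new Put(row);
            put.addColumn(CF, Bytes.toBytes("src:" + "tweeter123"), new byte[0]);
            table.put(put);
        }
    }
}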
