I was scanning through various questions people have asked on this
mailing list about choosing the right schema so that MapReduce jobs
run properly and hot regions from sequential access are avoided.
Somewhere I got the impression that it is fine for a row to have millions of
columns and/or for a region to hold a large volume of data. But then my MapReduce
job that copies rows failed because one row was too large (121 MB), so now I
am confused about the recommended approach. Does it mean that the default
region size and other configuration parameters need to be tweaked?
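If so, this is the sort of per-table tweak I was imagining (just a sketch on my part,
with a made-up table name and value; I have not verified this is the right knob):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class BumpRegionSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // fetch the current descriptor for my metrics table
        // ("user_metrics" is just a placeholder name)
        HTableDescriptor desc =
            admin.getTableDescriptor(Bytes.toBytes("user_metrics"));

        // raise the per-table max file size before a region splits
        // (overrides hbase.hregion.max.filesize for this table only)
        desc.setMaxFileSize(4L * 1024 * 1024 * 1024); // 4 GB, example value only

        admin.disableTable("user_metrics");
        admin.modifyTable(Bytes.toBytes("user_metrics"), desc);
        admin.enableTable("user_metrics");
    }
}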

In my use case, the system receives lots of metrics for different users,
and I need to maintain daily counters for each of them. It is at day
granularity, not a typical TSD series. My row key has the user id and metric
name as the prefix and the day timestamp as the suffix, and I keep incrementing
the values. The scale issue arises because I also store information about the
source of each metric, e.g. the id of the person who mentioned
my user in a tweet. I store all of that information in different columns
of the same row, so the pattern here is variable: a million
people may tweet about one user while only two tweet about another on a
given day. Is it a bad idea to use columns here? I did it this way because
it makes it easy for a separate process to run later and aggregate the
information, such as listing all the people who mentioned my user during a given
date range.
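
To make it concrete, here is roughly what my writes look like (simplified;
the table, column family, and qualifier names are made up for this example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DailyMetricWriter {
    private static final byte[] CF = Bytes.toBytes("d"); // single column family

    // row key: <userId>:<metricName>:<yyyyMMdd> (user+metric prefix, day suffix)
    static byte[] rowKey(String userId, String metricName, String day) {
        return Bytes.toBytes(userId + ":" + metricName + ":" + day);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_metrics"); // placeholder table name

        byte[] row = rowKey("user42", "mentions", "20130115");

        // 1. bump the daily counter for this user/metric/day
        Increment inc = new Increment(row);
        inc.addColumn(CF, Bytes.toBytes("count"), 1L);
        table.increment(inc);

        // 2. record who generated the mention, one column per source id;
        //    a popular user can end up with millions of these in one row
        Put put = new Put(row);
        put.add(CF, Bytes.toBytes("src:tweeter123"), Bytes.toBytes(1L));
        table.put(put);

        table.close();
    }
}

The later aggregation job just scans the row keys for a user/metric over a date
range and walks the src:* columns of each row.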

Thanks
