Hi,

How frequently do you need to query older versions of a message?
Regards,
Ricardo Boaretto

On Mar 3, 2014 4:31 AM, "Jigar Shah" <[email protected]> wrote:

> I work in the news processing industry. Our current system processes more
> than a million articles per week and provides this data to users in real
> time; it also provides search capabilities via Lucene.
>
> We convert all news to the standard IPTC NewsML-G2 format
> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/> before
> providing it to users (in real time or via search).
>
> We have a requirement for a component that provides analytical queries on
> news data. I plan to load all of this data into HBase and then run
> Map-Reduce jobs to compute the analytical queries. Moreover, the current
> system is built on PostgreSQL and stores only 3 months of data; anything
> more than this is big data, as it doesn't fit on one server.
>
> But I am a bit confused about developing a schema for it.
>
> Every news article has:
>
> *"messageID" as guid*, a unique ID for the news message.
> *"version" as int*, incremented if a newer version of the same news
> message is published.
> There are other fields such as location, channels, title, content,
> source, etc.
>
> The current database primary key is a composite of (messageID & version).
>
> I thought that I should use "messageID" as the "rowKey" in HBase and
> "version" as the "columnFamily", and all columns would be fields of the
> news (like location, channels, title, body, sentTimstamp, ...).
>
> Is keeping "version" as the "columnFamily" a good idea?
>
> In reality, a single message may have thousands of versions.
>
> Or is there another solution when we have a composite primary key in the
> database?
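
For reference, one common way to carry a composite (messageID, version) key into HBase is to fold both parts into the row key instead of using one column family per version. The sketch below is only an illustration of that byte layout, not a recommendation for this particular system. It assumes the GUID is a standard 16-byte UUID and that "version" fits in a signed 32-bit int; the helper names make_row_key and parse_row_key are hypothetical. The version is stored inverted so that, under HBase's lexicographic row ordering, the newest version of a message sorts first.

```python
import struct
import uuid
from typing import Tuple

# Assumption: "version" is a non-negative value that fits in a signed 32-bit int.
MAX_VERSION = 2**31 - 1


def make_row_key(message_id: str, version: int) -> bytes:
    """Encode (messageID, version) as 16 GUID bytes + 4 inverted-version bytes."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError("version out of range")
    guid_bytes = uuid.UUID(message_id).bytes
    # Inverting the version makes higher versions sort earlier byte-wise.
    inverted = MAX_VERSION - version
    return guid_bytes + struct.pack(">I", inverted)


def parse_row_key(row_key: bytes) -> Tuple[str, int]:
    """Decode a key produced by make_row_key back into (messageID, version)."""
    guid_bytes = row_key[:16]
    (inverted,) = struct.unpack(">I", row_key[16:20])
    return str(uuid.UUID(bytes=guid_bytes)), MAX_VERSION - inverted
```

With this layout all versions of one message are adjacent in key order, so a prefix scan on the 16 GUID bytes would return them newest first. Whether that ordering is useful depends on how often older versions are actually read.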
