Re: HBase Schema for IPTC News ML G2

Ted Yu Mon, 03 Mar 2014 02:19:34 -0800

When version is in its own column family, you can utilize essential column 
family support.


See https://issues.apache.org/jira/browse/HBASE-5416

Cheers

On Mar 2, 2014, at 11:31 PM, Jigar Shah <[email protected]> wrote:

> I work in the news processing industry. Our current system processes more
> than a million articles per week and provides this data to users in real
> time; it also provides search capabilities via Lucene.
> 
> We convert all news to a standard IPTC NewsML G2 format
> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/>
> before providing it to users (in real time or via search).
> 
> We need a component that answers analytical queries on news data. I plan
> to load all of this data into HBase and then run MapReduce jobs to compute
> the analytical queries. Moreover, the current system is built on PostgreSQL
> and stores only 3 months of data; anything more than this is big data, as
> it doesn't fit on one server.
> 
> But I am a bit confused about how to design the schema for it.
> 
> Every news article has
> 
> *"messageID" as guid*, a unique ID for the news message.
> *"version" as int*, incremented when a newer version of the same news
> message is published.
> There are other fields like location, channels, title, content, source, etc.
> 
> Current database primary key is a composite of (messageID & version).
> 
> I thought that I should use "messageID" as the "rowKey" in HBase and
> "version" as the "columnFamily", and all columns would be fields of the news
> (like location, channels, title, body, sentTimestamp, ...).
> 
> Is keeping "version" as the "columnFamily" a good idea?
> 
> In reality, a single message may have thousands of versions.
> 
> Or is there another solution for when we have a composite primary key in
> the database?
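
For the composite primary key question above, one option sometimes used in
place of a column family per version is to fold both parts of the key into the
HBase row key. This approach is not proposed anywhere in this thread. The code
below is only a minimal sketch of that idea using plain Java, with no HBase
dependency. The class NewsRowKey and its methods are hypothetical names, and
the layout (a 16-byte UUID followed by a 4-byte big-endian version) is an
assumption, not something HBase requires.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical helper (not part of HBase): packs (messageID, version) into one row key. */
final class NewsRowKey {
    /** Assumed layout: 16-byte UUID followed by a 4-byte version. */
    static final int LENGTH = 16 + 4;

    private NewsRowKey() {}

    /** Builds the 20-byte key. Assumes version is non-negative. */
    static byte[] of(UUID messageId, int version) {
        if (version < 0) {
            throw new IllegalArgumentException("version must be >= 0");
        }
        // ByteBuffer writes big-endian by default, so for non-negative
        // versions the unsigned byte order matches the numeric order.
        return ByteBuffer.allocate(LENGTH)
                .putLong(messageId.getMostSignificantBits())
                .putLong(messageId.getLeastSignificantBits())
                .putInt(version)
                .array();
    }

    /** Reads the messageID back from the first 16 bytes of the key. */
    static UUID messageIdOf(byte[] rowKey) {
        ByteBuffer buf = ByteBuffer.wrap(rowKey);
        return new UUID(buf.getLong(), buf.getLong());
    }

    /** Reads the version back from the last 4 bytes of the key. */
    static int versionOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey).getInt(16);
    }
}
```

With this layout, all versions of one message share the same 16-byte prefix,
so they sort next to each other in ascending version order, and the number of
versions does not affect the number of column families.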
