Re: HBase Schema for IPTC News ML G2

Boaretto, Ricardo Mon, 03 Mar 2014 00:26:48 -0800

Hi,

How frequently do you need to query older versions of a message?


Regards,
Ricardo Boaretto.
On Mar 3, 2014 4:31 AM, "Jigar Shah" <[email protected]> wrote:

> I work in the news processing industry. Our current system processes more
> than a million articles per week and provides this data to users in real
> time. It also provides search capabilities via Lucene.
>
> We convert all news to the standard IPTC NewsML-G2 format
> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/> before
> providing it to users (in real time or via search).
>
> We have a requirement for a component that runs analytical queries on
> news data. I plan to load all of this data into HBase and then use
> MapReduce jobs to compute the analytical queries. Moreover, the current
> system is built on PostgreSQL and stores only 3 months of data; anything
> more than that is big data, as it doesn't fit on one server.
>
> But I am a bit confused about how to design the schema for it.
>
> Every news article has:
>
> *"messageID" as guid*, a unique ID for the news message.
> *"version" as int*, incremented when a newer version of the same news
> message is published.
> There are other fields such as location, channels, title, content, source,
> etc.
>
> The current database primary key is a composite of (messageID, version).
>
> I thought I should use "messageID" as the "rowKey" in HBase and
> "version" as the "columnFamily", with the fields of the news article
> (location, channels, title, body, sentTimstamp, ...) as columns.
>
> Is keeping "version" as the "columnFamily" a good idea?
>
> In reality, a single message may have thousands of versions.
>
> Or is there another solution for cases where the database has a composite
> primary key?
>
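
One commonly used alternative to the column-family idea in the quoted message is to fold both parts of the composite primary key into the row key itself. HBase keeps rows sorted by the raw bytes of the row key, so all versions of one message end up adjacent. The sketch below only illustrates the key encoding and does not talk to HBase. The helper names (make_row_key, parse_row_key), the 16-byte GUID plus 4-byte version layout, and the inverted version number are assumptions made for illustration, not something proposed in this thread.

```python
import struct
import uuid

# Assumption: "version" fits in a signed 32-bit int, as in the SQL schema.
MAX_VERSION = 2**31 - 1


def make_row_key(message_id, version):
    """Hypothetical composite row key: 16-byte GUID + 4-byte inverted version."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError("version out of range")
    guid_bytes = uuid.UUID(message_id).bytes
    # Storing (MAX_VERSION - version) big-endian makes newer versions
    # sort before older ones when keys are compared byte by byte.
    inverted = MAX_VERSION - version
    return guid_bytes + struct.pack(">I", inverted)


def parse_row_key(row_key):
    """Recover (messageID, version) from a key built by make_row_key."""
    guid = uuid.UUID(bytes=row_key[:16])
    (inverted,) = struct.unpack(">I", row_key[16:])
    return str(guid), MAX_VERSION - inverted


if __name__ == "__main__":
    mid = "12345678-1234-5678-1234-567812345678"  # made-up example GUID
    # Byte-wise sorting stands in for HBase's row ordering here.
    for key in sorted(make_row_key(mid, v) for v in (1, 3, 2)):
        print(parse_row_key(key))
```

With a layout like this, a scan that starts at the 16-byte messageID prefix would meet the newest version first, which matters if older versions are rarely read. If old versions are queried as often as new ones, the inversion can be dropped and the plain version number stored instead.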
