RE: HBase Schema for IPTC News ML G2

Vladimir Rodionov Tue, 04 Mar 2014 11:13:16 -0800

HBase supports versioning natively, for free: it is the cell's timestamp.

A cell address in an HBase table has the following form:

rowkey:column-family:column-qualifier:timestamp
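
To make the addressing concrete, here is a minimal in-memory sketch in Python. It is not the HBase client API; the class name `ToyTable`, the family name "d", and the sample values are assumptions made up for illustration. It only models the idea that each (rowkey, family, qualifier) coordinate holds several values keyed by timestamp, and that reads return the newest first.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (illustration only)."""

    def __init__(self):
        # (rowkey, family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, rowkey, family, qualifier, timestamp, value):
        # The timestamp is the last part of the cell address.
        self._cells[(rowkey, family, qualifier)][timestamp] = value

    def get(self, rowkey, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self._cells.get((rowkey, family, qualifier), {})
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


table = ToyTable()
msg = "0bb4b5bd-c06e-400a-8b08-e2b6960dda25"
table.put(msg, "d", "title", 1, "someTitle")
table.put(msg, "d", "title", 2, "changedTitle")

latest = table.get(msg, "d", "title")                    # newest version only
history = table.get(msg, "d", "title", max_versions=10)  # all stored versions
```

Here the message version (1, 2) is used directly as the timestamp, which is the idea behind the first approach further down.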

First, on the table, column family, and column qualifier concepts:

A table is similar to an RDBMS table, but does not have a rigid schema. When you 
create a table in HBase, you need to specify at least one column family.
A column family groups columns (which are defined by column qualifiers) and 
stores them together physically, so columns that are frequently used together 
should be placed into the same column family for performance reasons.

A column qualifier is similar to an RDBMS column, but HBase does not require all 
qualifiers to be defined in advance; therefore, rows in an HBase table may have 
different sets of qualifiers.
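
The difference between families (declared with the table) and qualifiers (free per row) can be sketched as follows. Again, this is a toy Python model and not HBase code; `ToySchemaTable`, the family "d", and the row keys are hypothetical names chosen for the example.

```python
class ToySchemaTable:
    """Toy model: families are declared up front, qualifiers are not."""

    def __init__(self, families):
        if not families:
            raise ValueError("at least one column family is required")
        self.families = frozenset(families)
        self.rows = {}  # rowkey -> {family: {qualifier: value}}

    def put(self, rowkey, family, qualifier, value):
        # Writing to an undeclared family is an error ...
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # ... but any qualifier is accepted without prior declaration.
        row = self.rows.setdefault(rowkey, {})
        row.setdefault(family, {})[qualifier] = value

    def qualifiers(self, rowkey, family):
        return sorted(self.rows.get(rowkey, {}).get(family, {}))


news = ToySchemaTable(["d"])
news.put("msg-1", "d", "title", "someTitle")
news.put("msg-1", "d", "location", "someLocation")
news.put("msg-2", "d", "title", "otherTitle")
news.put("msg-2", "d", "channels", "someChannel")

# Two rows in the same family, each with its own set of qualifiers.
quals_1 = news.qualifiers("msg-1", "d")
quals_2 = news.qualifiers("msg-2", "d")
```

This is why a column family per version is awkward, while a qualifier, timestamp, or row key per version is not.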

For your use case, there are two possible approaches:

1. rowkey = messageID, with the version stored in the timestamp (you can put any 
value in place of a time, or keep the default timestamp)
2. rowkey = a combination of messageID and version

Both approaches give you the ability to query the N latest versions of a 
message, where N can be any value >= 1.
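
A rough Python sketch of how "N latest versions" works out under each approach, using plain dictionaries in place of HBase. The function names, the ":" separator, and the zero-padding width are assumptions for illustration only. Two caveats apply to a real table: for approach 1 the column family must be configured to retain enough versions, and for approach 2 row keys sort lexicographically by bytes, so the version needs a fixed-width encoding.

```python
def latest_by_timestamp(cells, message_id, n):
    """Approach 1: cells maps message_id -> {version: payload};
    the version plays the role of the cell timestamp."""
    if n < 1:
        raise ValueError("n must be >= 1")
    versions = cells.get(message_id, {})
    return sorted(versions.items(), reverse=True)[:n]


def make_rowkey(message_id, version, width=10):
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{message_id}:{version:0{width}d}"


def latest_by_rowkey(rows, message_id, n):
    """Approach 2: rows maps composite rowkey -> payload;
    sorting keys with a common prefix emulates a prefix scan."""
    if n < 1:
        raise ValueError("n must be >= 1")
    prefix = f"{message_id}:"
    matching = sorted(k for k in rows if k.startswith(prefix))
    return [(int(k[len(prefix):]), rows[k]) for k in reversed(matching[-n:])]


msg = "0bb4b5bd-c06e-400a-8b08-e2b6960dda25"
payloads = {1: "v1", 2: "v2", 3: "v3", 10: "v10"}

cells = {msg: dict(payloads)}
rows = {make_rowkey(msg, v): p for v, p in payloads.items()}

top2_a = latest_by_timestamp(cells, msg, 2)
top2_b = latest_by_rowkey(rows, msg, 2)
```

Both calls return the two newest versions in the same order. Without the padding in `make_rowkey`, version 10 would sort before version 2.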

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [email protected]

________________________________________
From: Jigar Shah [[email protected]]
Sent: Monday, March 03, 2014 9:24 PM
To: [email protected]
Subject: Re: HBase Schema for IPTC News ML G2

Hello Ted,

I can think of an implementation based on the solution you provided.

Current use-case is like this:

Consider an application that receives news messages (XML), each of which has
a messageID, a version, and other fields. For the same news message I can get
different versions, usually incremental.

e.g:
*messageId:0bb4b5bd-c06e-400a-8b08-e2b6960dda25* with *version:1*.
*messageId:0bb4b5bd-c06e-400a-8b08-e2b6960dda25* with *version:2*.

I am currently using Postgres, with a composite primary key of
(messageID & version) and the other columns stored in a normalized way in
the database.

If I design the storage in HBase, what should my rowKey and column
family be, and how should I maintain multiple versions of the same messageID?

I plan on keeping *messageId* as the *rowKey* and *version* as the *column
family*, so that at any point in time I can pick one column family (by
version) and get all columns for that version
of a particular message.

But if column families have to be pre-defined in HBase, then I think this
solution is not feasible.

Thanks,
Jigar Shah.

On 03/03/2014 06:32 PM, Ted Yu wrote:
> There seems to be some misunderstanding.
>
> The column families need to be defined at the time of table creation.
> My understanding was that there would be one column family called version.
> Each row in this table would have version number (1, 2, or 3, etc) in version 
> column family, along with details in the other column family.
> At query time, you specify a filter to get latest version from version column 
> family and load the other column family accordingly.
>
> Cheers
>
> On Mar 3, 2014, at 3:22 AM, Jigar Shah<[email protected]>  wrote:
>
>> Hi Ted,
>>
>> Thanks for reply.
>>
>> I am more concerned about structure, what should be rowKey and column 
>> families (having each version of news as a column family will be a good idea 
>> ?).
>>
>> Will there be any problem if i orient my data in this way.
>>
>> |rowKey|                 | column-families|
>> <guid>                    <1> <2>                                  <version>
>> newsMessageId       someTitle                someTitle
>>                                 someDescription changedSomeDescription
>> location
>>
>>
>> newsMessageID as RowKey, versions of same news (News XML) as column family, 
>> fields in XML as columns in respective version column family.
>>
>> If i have lot of versions for same message, I will have lot of column 
>> families.
>>
>> Does HBase have some limitations if i have undefined/large number of column 
>> families.
>>
>> Do you think i should orient data in different way ?
>>
>> System mostly queries latest version of news. But still we need to keep 
>> track of all versions for particular news message.
>>
>> Good to know that column families can be lazily loaded, based on column 
>> filter.
>>
>> Thanks
>> Jigar Shah.
>>
>>
>> On 03/03/2014 03:48 PM, Ted Yu wrote:
>>> When version is in its own column family, you can utilize essential column 
>>> family support.
>>>
>>> See https://issues.apache.org/jira/browse/HBASE-5416
>>>
>>> Cheers
>>>
>>> On Mar 2, 2014, at 11:31 PM, Jigar Shah<[email protected]>  wrote:
>>>
>>>> I am working in the news processing industry. Our current system processes more
>>>> than a million articles per week and provides this data in real time to
>>>> users; additionally, it provides search capabilities via Lucene.
>>>>
>>>> We convert all news to a standard IPTC NewsML
>>>> G2 <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/> format,
>>>> before providing it to users (in real-time or via search)
>>>>
>>>> We have a requirement of component which provides analytical queries on
>>>> news data. I plan to load this all data in HBase and then have Map-Reduce
>>>> Jobs to compute analytical queries. Moreover, the current system is built
>>>> on PostgreSQL and stores only 3 months of data; anything more than this is
>>>> big data, as it doesn't fit on one server.
>>>>
>>>> But i am bit confused in developing schema for it.
>>>>
>>>> Every news article has
>>>>
>>>> *"messageID" as guid*, unique id for news message.
>>>> *"version" as int,* incremented if newer version of same news message is 
>>>> published.
>>>> there are other fields like location, channels, title, content, source 
>>>> etc..
>>>>
>>>> Current database primary key is a composite of (messageID & version).
>>>>
>>>> I thought that, i should use "messageID" as "rowKey" in HBase. and
>>>> "version" as "columnFamily" and all columns will be fields of news (like 
>>>> location, channels ,title, body, sentTimstamp, ...)
>>>>
>>>> Keeping "version" as "columnFamily" is a good idea ?
>>>>
>>>> In reality "single message may have thousands of version".
>>>>
>>>> Or if any other solution when we have composite primary key in database.

