Re: Design a datastore maintaining historical view of users.

Alok Singh Tue, 13 Jan 2015 12:16:04 -0800

Have you looked at the phoenix project? http://phoenix.apache.org/
It is an SQL layer on top of hbase, an alternative to map-reduce jobs
when ad hoc/realtime queries are needed.
Your use case seems like it would work fairly well with phoenix.


Alok

On Mon, Jan 12, 2015 at 7:06 PM, Wilm Schumacher
<[email protected]> wrote:
> Hi,
>
> I'm doing something comparable right now, but not with such a HUGE
> database O_o. 10 Mio results for such a query? This would mean that you
> have 100 Mio -> 1 Billion customers ?!?!
>
> However: in my opinion with such a huge database HBase is a good fit.
> However, your data model should be changed for that purpose.
>
> You could use something like
> <userid>-<item_id> (or the other way around)
> for a row key and
> data:clicktime => <timestamp>
> or whatever for data ....
>
> for the "item table". In this example the "e-mail" table does not have
> to be touched (I assume they are separated)
>
> 1.) Map reduce
>
> Your problem make a good point for map reduce. Just map over the e-mail
> events, when you find an e-mail that fits your question, make a get on
> the user which is connected to the e-mail and the item_id (row key
> design above). If so => emit and use an hbase table as sink. By that you
> would make sure that the user is only processed once (as the e-mail
> could be opened twice or so) and you would have a nice list in a
> separate table which you could use.
>
> For such a huge database the question of 30 min or less then is "just" a
> question of money for hbase/yarn nodes.
>
> 2.) Simple index
>
> I thought of simple indexes, too. However, this only would work properly
> if you would make a table where the item_id would be the primary key for
> the index and the users browse events would make up the column family.
>
> with <ts> the specific timestamp of the action something like
> item_1    user:alice <ts>    user:bob <ts>
> item_2    user:alice <ts>                              user:charles <ts>
>
> (hope the e-mail is rendered okay to explain what I mean)
>
> Then you could simply fetch the <item_id> row and use that to check if
> the e-mail action exist.
>
> However, this index would be a table of <number of items> rows and
> <number of users> qualifiers in one column family. Of course it would be
> sparse, but would make up by above estimate 100 Mio -> 1 Billion
> qualifiers in the main column family. I never ran a table THAT wide.
> Does anyone on this list did something like that and can comment?
>
> Furthermore you would have to iterate over the result ... which has to
> be estimated >> 10 Mio items, which sounds like a pretty bad idea.
>
> 3.) Fancy index
>
> You make up an index for exactly your question from above.
>
> Row keys would be
> <e-mail-action>-<e-mail_id>-<item_id>-<userid>
> where <e-mail-action> is "click", "erase", ....
>
> with data
> email:1 => <rowkey of the email event>
> email:2 => <rowkey of the email event>
> ...
> (as some events could occure more than once)
>
> and
> item:1 => <rowkey of the item browse event>
> item:2 => <rowkey of the item browse event>
> ...
>
> Then you would have your list right away e.g. by setting up a scan for
> the specific range (which is the answer to your specific question in
> this case).
>
> But you would have to make on every e-mail event a scan over the list of
> items the user has browsed (by design above a simple scan in the "item
> table" for the specific user) and fill the index.
>
> Which actually sounds like an even worse idea.
>
> ====
>
> But as I read your e-mail, this example is just one of the query of many
> different queries, right? And for every query you need one such index.
>
> Normally I'm in favour of indexes, but on this case I think you really
> need a clever row key design of your separate tables and the full power
> of map reduce.
>
> However, I saw your analogue postings on the other "nosql" mailing
> lists. As cassandra have a "join" feature in the cql this would be a
> good fit ... at first sight. Furthermore the querying for ElasticSearch
> can easily be used in a way which satisfy your need. However, by design
> (as I understand what c* is doing on joins) cassandra would do A LOT of
> communication for that case (i.e. inefficient), and elastic search will
> probably break down if the master would have to give you 10 Mio results
> back. Thus you would have to iterate A LOT of pages to get your full
> result, which means a lot of JSON parsing and http stuff.
>
> Even if hbase is not the best choice on the first sight, I think it
> could be the best choice nevertheless.
>
> Best wishes,
>
> Wilm
>
> ps: man ... that's a lot of data. Is that you, amazon? ;)
>
> Am 13.01.2015 um 01:42 schrieb Chen Wang:
>> Hey Guys,
>> I am seeking advice on design a system that maintains a historical view of
>> a user's activities in past one year. Each user can have different
>> activities: email_open, email_click, item_view, add_to_cart, purchase etc.
>> The query I would like to do is, for example,
>>
>> Find all customers who browse item A in the past 6 month, and also clicked
>> an email.
>> and I would like the query to be done in reasonable time frame. (for
>> example, within 30 minutes to retrieve 10million such users)
>>
>> Since we already have HBase cluster in place, HBase becomes my first
>> choice. So I can have customer_id as the row key, column family be
>> 'Activity', then have certain attributes associated with the column
>> family,something like:
>>
>> custer_id, browse:{item_id:12334, timestamp:epoc}
>>
>> However, It seems that HBase would not be a good choice for supporting the
>> queries above. Even its possible with scan, it will be super inefficient
>> due to the size of the data set.
>>
>> Is my understanding correct and I should resort to other data store.(ES in
>> my opinion). or has anyone done similar thing with HBase?
>>
>> Thanks in advance.
>> Chen
>>
>

Re: Design a datastore maintaining historical view of users.

Reply via email to