Hi Nick,

The key question to ask about this use case is the access pattern. Do you need 
real-time access to new information as it is created? (I.e. if someone reads an 
article, do your queries need to immediately reflect that?) If not, and a batch 
approach is fine (say, nightly processing) then Hadoop is a good first step; it 
can map/reduce over the logs and aggregate / index the data, either creating 
static reports like you say below for all the users & pages, or inserting the 
data into a database.

If, on the other hand, you need real-time ingest & reporting, that's where 
HBase would be a better fit. You could write the code so that every "log" you 
write is actually passed to an API that inserts the data into one or more HBase 
tables in real-time.

The trick is that since HBase doesn't have built-in secondary indexing, you'd 
have to write the data in ways that could provide low latency responses to the 
queries below. That likely means denormalizing the data; in your case, probably 
one table that's keyed by page & time ("all users that have been to the webpage 
X in the last N days"), another that's keyed by user & time ("all the pages 
seen by a given user"), etc. So in other words, every action would result in 
multiple writes into HBase. (This is the same thing that happens in a 
relational database--an "index" means you're writing the data in two 
places--but there, it's hidden from you and maintained transparently). But, 
once you get that working, you can have very fast access to real time data for 
any of those queries.

Note: that's a lot of work. If you don't really need real-time ingest and 
query, Hadoop queries over the logs are much simpler (especially with tools 
like Hive, where you can write real SQL statements and it automatically 
translates them into Hadoop map/reduce jobs).

Also, 10TB isn't outside of the range that a traditional database could handle 
(given the right HW, schema design & indexing). You may find it is simpler to 
model your problem that way, either using Hadoop as a bridge between the raw 
log data and the database (if offline is OK) or inserting directly. The key 
benefit of going the Hadoop/HBase route is horizontal scalability, meaning that 
even if you don't know your eventual size target, you can be confident that you 
can scale linearly by adding hardware. That's critical if you're Google or 
Facebook, but not as frequently required for smaller businesses. Don't 
over-engineer ... :)

Ian

On Nov 8, 2012, at 3:00 AM, Nick maillard wrote:

Hi everyone

I'm currently testing Hbase/Hadoop in terms of performance but also in terms off
applicability. After some tries, and reads I'm wondering If Hbase is well fitted
for the current need I'm testing.

Say I had logs on websites listing users going to webpage, reading an article,
liking a piece of data, commenting or even bookmarking.
I would store these logs on a long period and for a lot of different websites
and I would like to use the data with these questions:
- All users that have been to the webpage X in the last Ndays
- All users that have liked and then bookmarked a page in a range of Y days.
- All the pages that are commented X times in the last N days.
- All users that have commented a page W and liked a page P.
- All pages seen,liked or commented by a given user.

As you see this might a very SQL way of thinking. The way I understand the
questions being different in nature I would have different tables to answer 
them.
Am I correct? How could this be represented and would sql be a better fit?
The data would be large around a 10 Tbytes.

regards


Reply via email to