Some HBase developers had a meetup/Hackathon today at Facebook's Palo Alto
office (as announced last week on the list). Following are some notes I took
- please feel free to reply with any corrections, etc. Apologies if I've
misspelled anyone's name!
Attendees/intros:
- Jonathan Gray @ FB
- Stack @ SU
- Matthieu Lieber @ Datameer (building a connector)
- Lars George - at Cloudera - writing HBase book
- Gary Helmling @ Trend Micro - been working on coprocessors
- Mingjie Lai @ Trend Micro
- Todd Lipcon @ Cloudera
- Joshua Ho @ Trend Micro
- Eugene Koontz @ Trend Micro - working on ZK authentication
- Andrew Purtell @ Trend Micro - working on CPs, also stability testing of
0.90 on a web crawler synthetic use case
- Nicolas Spiegelberg @ FB - working on miscellaneous stuff in HBase for
scaling the FB messaging system
- Karthik Ranganathan @ FB - not sure what he's going to work on
- Kannan Muthukuruppan - same as above
- Amit @ FB - same
- Ryan Rawson @ StumbleUpon - not sure what he's working on
- JD Cryans @ StumbleUpon - testing replication in trunk, interested in CPs
- Himanshu - student at Uni of Alberta - using CPs for his thesis topic
- Vaidhav @ GumGum - using HBase in production, still running 0.20.6, wants
to upgrade to 0.90
- Ken Weiner @ GumGum - same as above.
Mingjie Lai presenting from his blog post on coprocessors
Discussion:
- Jon brings up that the pre/post hooks for get don't allow the coprocessor
to *not* call the original get.
- eg completely sub out "get" implementation for something else
- the current API is more like pre/post filters, but you can't skip the
normal implementation
- we should probably amend API to allow this
- some discussion of taking some kind of Context argument - we'll break
this out into a separate discussion later in the day
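The bypass idea being discussed might look something like this minimal plain-Java sketch. All names here (HookContext, preGet, bypass()) are invented for illustration - this is not the actual coprocessor API, just the shape of "let the pre-hook skip the normal implementation":

```java
import java.util.HashMap;
import java.util.Map;

public class BypassSketch {
    /** Context the hook can use to short-circuit the default implementation. */
    static class HookContext {
        boolean bypassed = false;
        void bypass() { bypassed = true; }
    }

    /** Hypothetical observer: may fill in the result itself and call ctx.bypass(). */
    interface GetObserver {
        String preGet(HookContext ctx, String row);
        String postGet(HookContext ctx, String row, String result);
    }

    static final Map<String, String> CORE_STORE = new HashMap<>();

    static String get(GetObserver observer, String row) {
        HookContext ctx = new HookContext();
        String result = observer.preGet(ctx, row);
        if (!ctx.bypassed) {
            // Only run the normal get path if the pre-hook did not bypass it.
            result = CORE_STORE.get(row);
        }
        return observer.postGet(ctx, row, result);
    }

    public static void main(String[] args) {
        CORE_STORE.put("r1", "stored-value");
        // Observer that completely subs out "get" for the row "cached".
        GetObserver obs = new GetObserver() {
            public String preGet(HookContext ctx, String row) {
                if (row.equals("cached")) { ctx.bypass(); return "from-cache"; }
                return null;
            }
            public String postGet(HookContext ctx, String row, String result) {
                return result;
            }
        };
        System.out.println(get(obs, "r1"));     // prints stored-value (normal path)
        System.out.println(get(obs, "cached")); // prints from-cache (bypassed)
    }
}
```

With only pre/post filters and no bypass(), the "cached" case would still hit the core store - which is exactly the limitation Jon raised.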
- Some discussion about doing aggregation using coprocessors:
- RPCs for CommandTarget are per region, not per region server
- "Toy" mapreduce (i.e. no spills or failure handling) is easy to do for
simple sum/average aggregates
- HBASE-1512 has basic aggregate functions
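As a rough illustration of the per-region model (this is not the HBASE-1512 API - all names are made up): each region computes a partial aggregate over only its own rows, and the client merges the partials, much like a toy map/reduce:

```java
public class RegionAggregateSketch {
    /** Partial aggregate computed server-side, one per region. */
    static class Partial {
        long sum; long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    /** Runs on one region: aggregates only the rows that region holds. */
    static Partial regionPartial(long[] values) {
        long sum = 0;
        for (long v : values) sum += v;
        return new Partial(sum, values.length);
    }

    /** Client-side merge of the per-region partials into a global average. */
    static double mergeAverage(Partial[] partials) {
        long sum = 0, count = 0;
        for (Partial p : partials) { sum += p.sum; count += p.count; }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        Partial[] partials = {
            regionPartial(new long[]{1, 2, 3}),   // region 1's rows
            regionPartial(new long[]{4, 5}),      // region 2's rows
        };
        System.out.println(mergeAverage(partials)); // prints 3.0
    }
}
```

The "per region, not per region server" point shows up here: the client has to fan out one call per region and do the final merge itself.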
- Some discussion about how this actually is used:
- You have to set the CP on the table descriptor
- There is no class unloading currently implemented, so rolling restart
is required to update
- There are some hacks around this, but nothing suitable for production
- For development, unit tests are a good way - there are some unit test
examples that come with CPs
- Lars has experience from some years ago with various frameworks that
do things like this
- People are generally afraid of class unloading, since it's very hairy
- In general, if you have to patch a CP, use the same upgrade procedure
as you would for a bug in HBase itself
- Apache Felix might do stuff like this (OSGI stack)
- This might necessitate refactoring other parts of HBase, pulling in
more dependencies, etc - lots of churn, overkill at this point
- Consensus seems to be at this point that CPs are "wizard level" -
people deploying them should treat them like part of HBase itself, not be
redeploying willy-nilly
- Long term maybe we have a coprocessor with support for some language
like JRuby for simple user-level things like triggers, etc
-- Lunch --
After lunch, discussion:
- Use cases we're thinking about:
- Eventually consistent secondary indexing
- In schema, define which columns are indexed
- Observer on put would notice puts and write index writes to WAL
- On WAL replay, make sure they've been replayed to index tables as well
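A toy sketch of the indexing idea, with plain maps standing in for the data and index tables (the key layout, separator, and names are made up; the real design would also record index writes in the WAL, as noted above):

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexObserverSketch {
    static final Map<String, String> DATA_TABLE = new TreeMap<>();
    static final Map<String, String> INDEX_TABLE = new TreeMap<>();

    /** Index row key: indexed value first, so a scan by value finds the data rows. */
    static String indexKey(String value, String dataRow) {
        return value + "|" + dataRow;
    }

    /** The put path with the observer hook: write the data, then the index entry. */
    static void put(String row, String value) {
        DATA_TABLE.put(row, value);
        // Observer side effect: eventually consistent index write. On WAL
        // replay, these entries would be re-applied to the index table too.
        INDEX_TABLE.put(indexKey(value, row), row);
    }

    public static void main(String[] args) {
        put("user42", "alice@example.com");
        put("user7", "bob@example.com");
        // Scan the index (sorted by value) to find rows by indexed column.
        INDEX_TABLE.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```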
- Server-side search query join
- Row-level increment "group commit"
- Some rows are really hot in a lot of ICV use cases
- If an increment comes in and the row is locked, queue the increment
- When the row is released, the queued increments get combined and
committed together
- Don't respond to clients until the commit has been synced
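The queue-and-combine idea can be sketched like this (class and method names are invented; real row locks, WAL sync, and client responses are elided):

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementCoalescerSketch {
    private long counter = 0;
    private boolean rowLocked = false;
    private final List<Long> pending = new ArrayList<>();

    /** Either applies the increment immediately or queues it behind the lock. */
    synchronized void increment(long delta) {
        if (rowLocked) {
            pending.add(delta);   // hot row: queue instead of blocking
        } else {
            counter += delta;
        }
    }

    /** On lock release, combine all queued increments into a single commit. */
    synchronized void releaseRow() {
        long combined = 0;
        for (long d : pending) combined += d;
        pending.clear();
        counter += combined;      // one commit/sync covers the whole batch;
        rowLocked = false;        // only then would clients be answered
    }

    synchronized void lockRow() { rowLocked = true; }
    synchronized long value() { return counter; }

    public static void main(String[] args) {
        IncrementCoalescerSketch row = new IncrementCoalescerSketch();
        row.increment(1);      // applied directly
        row.lockRow();
        row.increment(2);      // queued while the row is locked
        row.increment(3);      // queued
        row.releaseRow();      // 2 and 3 combined, committed together
        System.out.println(row.value()); // prints 6
    }
}
```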
- Himanshu: discussing adding CP support for materializing an
intermediate result set and then scanning it
- use cases: eg top K
- some discussion about validity of use cases - for top K most people
would prefer to maintain it incrementally rather than doing a batch scan
followed by scanning the result set
- can we make a more general API for doing scanner-like actions
from CPs?
- Gary:
- Multi-row transactions like Percolator
- Todd:
- Offline blob storage - for large puts, write them to a side file on HDFS
and just store a reference.
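A rough sketch of the reference scheme, with a map standing in for the HDFS side file (the size threshold, "REF:" marker, and naming are all illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class BlobRefSketch {
    static final int INLINE_LIMIT = 16;                        // illustrative threshold
    static final Map<String, byte[]> TABLE = new HashMap<>();
    static final Map<String, byte[]> SIDE_STORE = new HashMap<>(); // stands in for HDFS

    /** Large values go to the side store; the table keeps only a reference. */
    static void put(String row, byte[] value) {
        if (value.length > INLINE_LIMIT) {
            String ref = "blob://" + row;                      // reference to side file
            SIDE_STORE.put(ref, value);
            TABLE.put(row, ("REF:" + ref).getBytes());
        } else {
            TABLE.put(row, value);
        }
    }

    /** The read path follows the reference transparently. (A real scheme
     *  would need an unambiguous marker, not a string prefix.) */
    static byte[] get(String row) {
        byte[] stored = TABLE.get(row);
        String s = new String(stored);
        return s.startsWith("REF:") ? SIDE_STORE.get(s.substring(4)) : stored;
    }

    public static void main(String[] args) {
        put("small", "hi".getBytes());
        put("big", new byte[1024]);                            // goes to the side store
        System.out.println(get("small").length);               // prints 2
        System.out.println(get("big").length);                 // prints 1024
    }
}
```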
- Ryan:
- Basic datastructures (list append, etc)
- Potentially operate on JSON structures
- Some general discussion about embedding a scripting language - Jython,
JRuby, and JavaScript were all brought up. Some people will continue to look
into this on JIRA - JavaScript seems to be winning for now since Rhino is
included in the JDK.
After discussion, different people broke off to work on different things -
some coprocessors, other general work on 0.90 stability, etc. Most people
stuck around for dinner at 6, and then filtered out between 7 and 8:30pm
with no particular wrap-up.
--
Todd Lipcon
Software Engineer, Cloudera