On 10/6/11 7:43 AM, Thilo Götz wrote:
> > We use hadoop with UIMA. Here's the "fit", in one case:
> >
> > 1) UIMA runs as the map step; we put the uima pipeline into the mapper. Hadoop
> > has a configure (?) method where you can stick the creation and set up of the
> > uima pipeline, similar to UIMA's initialize.
> >
> > 2) Write a hadoop record reader that reads input from hadoop's "splits", and
> > creates things that would go into individual CASes. These are the input to the
> > Map step.
> >
> > 3) The map takes the input (a string, say), and puts it into a CAS, and then
> > calls the process() method on the engine it set up and initialized in step 1.
> >
> > 4) When the process method returns, the CAS has all the results - iterate thru
> > it and extract whatever you want, and stick those values into your hadoop output
> > object, and output it.
> >
> > 5) The reduce step can take all of these output objects (which can be sorted as
> > you wish) and do whatever you want with them.
> That basically sums it up. We (and that's a different we than Marshall's we)
> use hadoop only for batch processing, but since that's the only processing
> we're currently doing, that works out well. We use hdfs as the underlying
> storage normally.
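
As a rough illustration of the map-step setup described in steps 1) to 4)
above, a mapper along these lines might look roughly as follows. This is only
a minimal sketch using the older org.apache.hadoop.mapred API; the descriptor
path "/descriptors/Pipeline.xml" and the choice of emitting annotation type
names plus covered text are made-up placeholders, and the custom record reader
from step 2) is omitted.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private AnalysisEngine engine;
    private CAS cas;

    // 1) Build the UIMA pipeline once per mapper, in Hadoop's configure() hook.
    @Override
    public void configure(JobConf job) {
        try {
            // Hypothetical descriptor path.
            XMLInputSource in = new XMLInputSource("/descriptors/Pipeline.xml");
            ResourceSpecifier spec =
                    UIMAFramework.getXMLParser().parseResourceSpecifier(in);
            engine = UIMAFramework.produceAnalysisEngine(spec);
            cas = engine.newCAS();
        } catch (Exception e) {
            throw new RuntimeException("UIMA pipeline setup failed", e);
        }
    }

    // 3) + 4) Put the record's text into the CAS, run the pipeline, then
    // emit whatever results are of interest as Hadoop output objects.
    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        try {
            cas.reset();
            cas.setDocumentText(value.toString());
            engine.process(cas);

            FSIterator<AnnotationFS> it = cas.getAnnotationIndex().iterator();
            while (it.hasNext()) {
                AnnotationFS ann = it.next();
                // Emitting type name + covered text is just an example.
                output.collect(new Text(ann.getType().getName()),
                        new Text(ann.getCoveredText()));
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        if (engine != null) {
            engine.destroy();
        }
    }
}
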
For low latency analysis I am using HBase and UIMA-AS: a receiver writes
text items to HBase, then the row key is sent to UIMA-AS, which retrieves
the document from HBase; after it is analyzed, the results are written
back to HBase.
Such a setup is well suited when you have a huge stream of documents which
must be analyzed in near real time. With Map Reduce you would first have to
wait and collect documents before a job could be started.
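
To make that flow concrete, here is a minimal, hypothetical sketch of the
receiver side of such a setup: it stores the raw text in HBase and hands only
the row key to a UIMA-AS client. The table and column names, the broker URL,
the queue endpoint and the key scheme are all made-up placeholders, and
callbacks and error handling are left out.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

public class Receiver {

    public static void main(String[] args) throws Exception {
        // Store the incoming document in HBase under its row key.
        HTable table = new HTable(HBaseConfiguration.create(), "documents");
        String rowKey = "doc-0001"; // hypothetical key scheme
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("text"),
                Bytes.toBytes("Some document text ..."));
        table.put(put);

        // Set up the UIMA-AS client; broker URL and endpoint are placeholders.
        Map<String, Object> ctx = new HashMap<String, Object>();
        ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616");
        ctx.put(UimaAsynchronousEngine.ENDPOINT, "analysisQueue");
        ctx.put(UimaAsynchronousEngine.CasPoolSize, 2);
        UimaAsynchronousEngine uimaAsEngine = new BaseUIMAAsynchronousEngine_impl();
        uimaAsEngine.initialize(ctx);

        // Send only the row key; the service fetches the text from HBase,
        // analyzes it and writes the results back to the same row, so no
        // callback listener is needed on this side.
        CAS cas = uimaAsEngine.getCAS();
        cas.setDocumentText(rowKey);
        uimaAsEngine.sendCAS(cas);

        uimaAsEngine.collectionProcessingComplete();
        uimaAsEngine.stop();
        table.close();
    }
}

The analysis service on the other end would do the reverse: read the text for
the received row key from HBase, run the pipeline over it, and write the
results back to the same row.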
Jörn