On 10/6/11 7:43 AM, Thilo Götz wrote:
> > We use hadoop with UIMA.  Here's the "fit", in one case:
> >
> > 1) UIMA runs as the map step; we put the uima pipeline into the mapper. Hadoop
> >    has a configure (?) method where you can stick the creation and set up of the
> >    uima pipeline, similar to UIMA's initialize.
> >
> > 2) Write a hadoop record reader that reads input from hadoop's "splits", and
> >    creates things that would go into individual CASes.  These are the input to the
> >    Map step.
> >
> > 3) The map takes the input (a string, say), and puts it into a CAS, and then
> >    calls the process() method on the engine it set up and initialized in step 1.
> >
> > 4) When the process method returns, the CAS has all the results - iterate thru
> >    it and extract whatever you want, and stick those values into your hadoop output
> >    object, and output it.
> >
> > 5) The reduce step can take all of these output objects (which can be sorted as
> >    you wish) and do whatever you want with them.
>
> That basically sums it up.  We (and that's a different we than Marshall's we)
> use hadoop only for batch processing, but since that's the only processing
> we're currently doing, that works out well.  We use hdfs as the underlying
> storage normally.
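
For reference, a mapper along those lines could look roughly like the sketch below. This is only a minimal sketch: it uses the newer org.apache.hadoop.mapreduce API (setup() instead of the old configure()), and the descriptor path and the emitted key/value are placeholders, not anything from an actual project.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.util.XMLInputSource;

public class UimaMapper extends Mapper<LongWritable, Text, Text, Text> {

  private AnalysisEngine engine;
  private CAS cas;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      // Step 1: build the UIMA pipeline once per mapper, much like an
      // annotator's initialize(). The descriptor path is a placeholder.
      XMLInputSource in = new XMLInputSource("/descriptors/Pipeline.xml");
      engine = UIMAFramework.produceAnalysisEngine(
          UIMAFramework.getXMLParser().parseAnalysisEngineDescription(in));
      cas = engine.newCAS();
    } catch (Exception e) {
      throw new IOException("Could not create UIMA pipeline", e);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      // Steps 3 and 4: put the record text into the CAS, run the pipeline,
      // then pull out whatever results you need and emit them.
      cas.reset();
      cas.setDocumentText(value.toString());
      engine.process(cas);
      // Placeholder output: emit the document length instead of real
      // annotation values extracted from the CAS indexes.
      context.write(new Text(key.toString()),
          new Text(Integer.toString(cas.getDocumentText().length())));
    } catch (Exception e) {
      throw new IOException("UIMA processing failed", e);
    }
  }

  @Override
  protected void cleanup(Context context) {
    if (engine != null) {
      engine.destroy();
    }
  }
}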

For low latency analysis I am using HBase and UIMA-AS: a receiver writes incoming text items to HBase and then sends the row key to UIMA-AS, which retrieves the document from HBase, analyzes it, and writes the results back to HBase.

Such a setup is well suited when you have a huge stream of documents that must be analyzed in near real time. If you used Map Reduce instead, you would first have to wait and collect documents
before the job could be started.
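
The ingest side of that setup could look roughly like the sketch below. This is only an illustration under assumptions: it uses the HBase 1.x client API and the UIMA-AS client interface UimaAsynchronousEngine; the table name, column family, broker URL, and queue name are made up.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

public class DocumentReceiver {

  public static void main(String[] args) throws Exception {
    // Store the raw document in HBase; the row key is what gets passed around.
    Connection connection =
        ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table docs = connection.getTable(TableName.valueOf("documents"));

    String rowKey = "doc-0001";
    Put put = new Put(Bytes.toBytes(rowKey));
    put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("text"),
        Bytes.toBytes("Some incoming document text."));
    docs.put(put);

    // Hand only the row key to UIMA-AS; the service fetches the text from
    // HBase itself and writes its analysis results back to the same row.
    UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();
    Map<String, Object> appCtx = new HashMap<String, Object>();
    appCtx.put(UimaAsynchronousEngine.ServerUri, "tcp://broker-host:61616");
    appCtx.put(UimaAsynchronousEngine.ENDPOINT, "analysisQueue");
    engine.initialize(appCtx);

    // A real client would also register a status callback listener to be
    // notified when each CAS has been processed; omitted here for brevity.
    CAS cas = engine.getCAS();
    cas.setDocumentText(rowKey);   // only the key travels over the queue
    engine.sendCAS(cas);           // asynchronous send

    engine.collectionProcessingComplete();
    engine.stop();
    docs.close();
    connection.close();
  }
}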

Jörn
