On 10/6/11 7:43 AM, Thilo Götz wrote:
> > We use hadoop with UIMA. Here's the "fit", in one case:
> >
> > 1) UIMA runs as the map step; we put the uima pipeline into the mapper. Hadoop
> > has a configure (?) method where you can stick the creation and set up of the
> > uima pipeline, similar to UIMA's initialize.
> >
> > 2) Write a hadoop record reader that reads input from hadoop's "splits", and
> > creates things that would go into individual CASes. These are the input to the
> > Map step.
> >
> > 3) The map takes the input (a string, say), and puts it into a CAS, and then
> > calls the process() method on the engine it set up and initialized in step 1.
> >
> > 4) When the process method returns, the CAS has all the results - iterate thru
> > it and extract whatever you want, and stick those values into your hadoop output
> > object, and output it.
> >
> > 5) The reduce step can take all of these output objects (which can be sorted as
> > you wish) and do whatever you want with them.
> That basically sums it up. We (and that's a different we than Marshall's we)
> use hadoop only for batch processing, but since that's the only processing
> we're currently doing, that works out well. We use hdfs as the underlying
> storage normally.
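
As a rough illustration of the map-step setup described in steps 1) to 4)
above, a mapper along these lines might look roughly as follows. This is only
a minimal sketch using the older org.apache.hadoop.mapred API; the descriptor
path "/descriptors/Pipeline.xml" and the choice of emitting annotation type
names plus covered text are made-up placeholders, and the custom record reader
from step 2) is omitted.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private AnalysisEngine engine;
    private CAS cas;

    // 1) Build the UIMA pipeline once per mapper, in Hadoop's configure() hook.
    @Override
    public void configure(JobConf job) {
        try {
            // Hypothetical descriptor path.
            XMLInputSource in = new XMLInputSource("/descriptors/Pipeline.xml");
            ResourceSpecifier spec =
                    UIMAFramework.getXMLParser().parseResourceSpecifier(in);
            engine = UIMAFramework.produceAnalysisEngine(spec);
            cas = engine.newCAS();
        } catch (Exception e) {
            throw new RuntimeException("UIMA pipeline setup failed", e);
        }
    }

    // 3) + 4) Put the record's text into the CAS, run the pipeline, then
    // emit whatever results are of interest as Hadoop output objects.
    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        try {
            cas.reset();
            cas.setDocumentText(value.toString());
            engine.process(cas);

            FSIterator<AnnotationFS> it = cas.getAnnotationIndex().iterator();
            while (it.hasNext()) {
                AnnotationFS ann = it.next();
                // Emitting type name + covered text is just an example.
                output.collect(new Text(ann.getType().getName()),
                        new Text(ann.getCoveredText()));
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        if (engine != null) {
            engine.destroy();
        }
    }
}
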
For low latency analysis I am using HBase and UIMA-AS: a receiver writes
text items to HBase, then the row key is sent to UIMA-AS, which retrieves
the document from HBase; after it is analyzed, the results are written
back to HBase.
Such a setup is well suited when you have a huge stream of documents which
must be analyzed in near real time. With Map Reduce you would first have to
wait and collect documents before a job could be started.
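
To make that flow concrete, here is a minimal, hypothetical sketch of the
receiver side of such a setup: it stores the raw text in HBase and hands only
the row key to a UIMA-AS client. The table and column names, the broker URL,
the queue endpoint and the key scheme are all made-up placeholders, and
callbacks and error handling are left out.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

public class Receiver {

    public static void main(String[] args) throws Exception {
        // Store the incoming document in HBase under its row key.
        HTable table = new HTable(HBaseConfiguration.create(), "documents");
        String rowKey = "doc-0001"; // hypothetical key scheme
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("text"),
                Bytes.toBytes("Some document text ..."));
        table.put(put);

        // Set up the UIMA-AS client; broker URL and endpoint are placeholders.
        Map<String, Object> ctx = new HashMap<String, Object>();
        ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616");
        ctx.put(UimaAsynchronousEngine.ENDPOINT, "analysisQueue");
        ctx.put(UimaAsynchronousEngine.CasPoolSize, 2);
        UimaAsynchronousEngine uimaAsEngine = new BaseUIMAAsynchronousEngine_impl();
        uimaAsEngine.initialize(ctx);

        // Send only the row key; the service fetches the text from HBase,
        // analyzes it and writes the results back to the same row, so no
        // callback listener is needed on this side.
        CAS cas = uimaAsEngine.getCAS();
        cas.setDocumentText(rowKey);
        uimaAsEngine.sendCAS(cas);

        uimaAsEngine.collectionProcessingComplete();
        uimaAsEngine.stop();
        table.close();
    }
}

The analysis service on the other end would do the reverse: read the text for
the received row key from HBase, run the pipeline over it, and write the
results back to the same row.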
Jörn