Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "DocumentProcessing" page has been changed by JanHoydahl. The comment on this change is: Updated.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=8&rev2=9

--------------------------------------------------

Solr would benefit from a flexible document processing framework meeting the requirements of enterprise grade content integration. Most search projects have some need for processing the incoming content prior to indexing, for example:

  * Language identification
  * Text extraction (Tika)
- * Entity extraction and classification (e.g. UIMA)
+ * Entity extraction and classification
  * Data normalization and cleansing
+ * Routing
  * 3rd party systems integration (e.g. enrich document from external source)
  * etc

- The built-in UpdateRequestProcessorChain is a very good starting point. However, the chain is very simple, single-threaded and only built for local execution on the indexer node. This means that any performance heavy processing chains will slow down the indexers without any way to scale out processing independently. We have seen FAST systems with far more servers doing document processing than indexing.
+ The built-in UpdateRequestProcessorChain is capable of doing simple processing jobs, but it is only built for local execution on the indexer node, in the same thread. This means that any performance-heavy processing chain will slow down the indexers without any way to scale out processing independently. We have seen FAST systems with far more servers doing document processing than indexing.

- There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]], [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]] and others. Indeed, some of these are already being used with Solr as a pre-processing server. This means weak coupling but also weak re-use of code. Each new project will have to choose which of the pipelines to invest in.
+ There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, [[http://www.openpipeline.org/|OpenPipeline]], [[http://openpipe.berlios.de/|OpenPipe]] (now on [[https://github.com/kolstae/openpipe|GitHub]]), [[http://www.pypes.org/|Pypes]], [[http://uima.apache.org/|UIMA]], [[http://www.eclipse.org/smila/|Eclipse SMILA]], [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]], [[http://found.no/products/piped/|Piped]] and others. Indeed, some of these are already being used with Solr as a pre-processing server. This means weak coupling but also weak re-use of code. Each new project will have to choose which of the pipelines to invest in.

- The community would benefit from an official processing framework and more importantly an official repository of processing stages which are shared and reused. The sharing part is crucial. If a company develops, say a Geo``Names stage to translate address into lat/lon, the whole community can benefit from that by fetching the stage from the shared repository. This will not happen as long as there is not one single preferred integration point.
+ The community would benefit from an official processing framework -- and more importantly an official repository of processing stages which are shared and reused. The sharing part is crucial. If a company develops, say, a Geo``Names stage to translate an address into lat/lon, the whole community can benefit from it by fetching the stage from the shared repository. This will not happen as long as there is not one single preferred integration point.
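To make the sharing idea concrete, here is a minimal sketch of such a Geo``Names stage written against today's integration point, the UpdateRequestProcessor API. The class name, the address/latlon field names and the stubbed lookupLatLon() are assumptions made for the example, not anything that exists:

{{{#!java
import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class GeoNamesProcessor extends UpdateRequestProcessor {

  public GeoNamesProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    String address = (String) doc.getFieldValue("address");
    if (address != null) {
      doc.setField("latlon", lookupLatLon(address));
    }
    super.processAdd(cmd);  // hand the document on to the next processor in the chain
  }

  // Stub for a real gazetteer lookup, e.g. against the GeoNames web service
  private String lookupLatLon(String address) {
    return "59.91,10.75";
  }
}
}}}

To activate such a stage today you would also write a matching UpdateRequestProcessorFactory and register it as a <processor> in an updateRequestProcessorChain in solrconfig.xml -- which illustrates the point: the integration works, but there is no standard way to package and share the stage.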
- There have recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 as well as [[http://findabilityblog.se/solr-processing-pipeline|this blog post]] for thoughts from Find``Wise.
+ There has recently been interest in the Solr community for such a framework. See [[http://lucene-eurocon.org/slides/A-Pipeline-for-Solr_Charas-Jansson.pdf|this presentation]] from Lucene Eurocon 2010 and [[http://findabilityblog.se/solr-processing-pipeline|this blog post]] for thoughts from Find``Wise, as well as the recent solr-user thread [[http://search-lucene.com/m/pFegS7BQ7k2|Pipeline for Solr]].

- = Solution =
+ = Solution proposal =

Develop a simple, scalable, easily scriptable and configurable document processing framework for Solr, which builds on existing best practices. The framework should be simple and lightweight enough for use with a single-node Solr, and powerful enough to scale out to a separate document processing cluster simply by changing configuration.

+
+ NOTE: It is not a given that the code needs to be part of the Solr/Lucene codebase itself. It could start its life somewhere else, and perhaps later become an Apache project of its own.

== Key requirements ==

=== Must ===
@@ -28, +31 @@
  * Java based
  * Lightweight (not over-engineered)
  * Support for multiple named pipelines, addressable at document ingestion
+ * Easy drop-in integration with existing Solr installs, i.e. called from an UpdateProcessor
- * Must work with existing Request``Handlers (XML, CSV, DIH, Binary etc) as entry point
- * Allow as drop-in feature to existing installs (after upgrading to needed Solr version)
  * Support for metadata on document and field level (e.g. tokenized=true, language=en)
  * Allow scaling out processing to multiple dedicated servers for heavy tasks
  * Well defined API for the processing stages (see the interface sketch after these lists)
- * Easy configuration of pipelines through separate XML (not in solrconfig.xml)
+ * Easy configuration of pipelines through separate config and GUI

=== Should ===
+ * Function as a standalone data integration framework outside the context of Solr
  * Support for writing stages in JVM scripting languages such as Jython
  * Robust - if a batch fails, it should be re-scheduled to another processor
  * Optimize for performance through e.g. batch support
  * Support status callbacks to the client
  * SDK for stage developers - to encourage stage development
- * Separate stage repository (outside of ASF svn) to encourage sharing
+ * Separate stages repository (outside of ASF svn) to encourage sharing
  * Integration points for UIMA, [[http://alias-i.com/lingpipe/|LingPipe]], [[http://opennlp.sourceforge.net/|OpenNLP]] etc
  * Integrate with Analysis so that if you tokenize in the Pipeline, analysis does not do it over again
  * Allow re-use of TokenFilters from Analysis inside of the Pipeline - avoid reinventing the wheel
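As an illustration of the "well defined API for the processing stages" requirement above, a minimal sketch of what the stage contract in such an SDK could look like. Every name here (PipelineDocument, PipelineStage, StageException) is hypothetical -- this is a shape, not a design:

{{{#!java
import java.util.Map;

/** Hypothetical framework-owned document class, independent of Solr APIs,
    carrying fields plus document- and field-level metadata. */
class PipelineDocument { /* ... */ }

/** Hypothetical checked exception signalling a stage failure. */
class StageException extends Exception {
  public StageException(String msg) { super(msg); }
}

/** A stage is configured once, then sees every document flowing through the pipeline. */
public interface PipelineStage {

  /** Called once with the stage's configuration, before any documents flow. */
  void init(Map<String, String> config);

  /** Process one document; return it (possibly modified), or null to drop it. */
  PipelineDocument process(PipelineDocument doc) throws StageException;
}
}}}

Keeping the contract this small is what would make stages easy to share and easy to write in JVM scripting languages.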
@@ -49, +52 @@

=== Could ===
  * GUI for configuring pipelines
  * Hot-pluggable pipelines
- * Function as a standalone data integration framework outside the context of Solr
  * Wrappers for custom FAST ESP stages to work with minor modification
+ * Wrappers for custom UpdateProcessor stages to work with minor modification

= Anti-patterns =
- * Do not require new APIs, but allow feeding through existing Update``Request``Handlers
+ * Do not require new APIs, but allow integration inside Solr and feeding through existing Update``Request``Handlers
+ * Do not over-architect like Eclipse SMILA and others have done with ESB etc

= Proposed architecture =
- Hook into the context of the existing UpdateRequestProcessorChain (integrate in Content``Stream``Handler``Base) by providing a dispatcher class, Solr``Pipeline``Dispatcher. The dispatcher would be enabled and configured through update parameters pipeline.name and pipeline.mode, either from the update request or in solrconfig.xml.
+ The core pipeline and Processor SDK should be self-contained and not depend on Solr APIs. A good starting point for the core pipeline could be the Apache-licensed [[http://openpipe.berlios.de/|OpenPipe]], which already works stand-alone. We could add GUI config and scalability to this code base.

- Solr``Pipeline``Dispatcher would have two possible modes: "local" and "distributed". In case of local mode, the pipeline executes locally and results in the ProcessorChain being completed with RunUpdateProcessorFactory submitting the content to local index. This would work well for single-node as well as low load scenarios. Local mode is easiest to implement and could be phase one.
+ Glue code to hook the pipeline into Solr could be an UpdateRequestProcessor, e.g. Pipeline``Dispatcher``Processor (or deeper, through Content``Stream``Handler``Base?). The dispatcher would be enabled and configured through update parameters, e.g. pipeline.name and pipeline.mode, set either on the update request or in solrconfig.xml.

- We need a robust architecture for configuring and executing pipelines; preferably multi threaded. We could start from scratch or base it on another mature framework such as [[http://commons.apache.org/sandbox/pipeline/|Apache Commons Pipeline]], Open``Pipe or some other project with a compatible license who are willing to donate to ASF. Apache Commons Pipeline is not directly what we need, it has a funny, somewhat rigid, stage architecture with each stage having its own queue and thread(s) instead of running a whole pipeline in the same thread.
+ Pipeline``Dispatcher``Processor would have two possible modes: "local" and "distributed". In local mode, the pipeline executes locally in-thread and results in the ProcessorChain being completed, with RunUpdateProcessorFactory submitting the content to the local index. This would work well for single-node as well as low-load scenarios. Local mode is easiest to implement and could be phase one.
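A minimal sketch of what this glue code could look like. The Pipeline``Dispatcher``Processor is the proposed (not yet existing) class; only the UpdateRequestProcessor/UpdateRequestProcessorFactory plumbing and the request-parameter lookup are existing Solr APIs:

{{{#!java
import java.io.IOException;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class PipelineDispatcherProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    // Read per-request parameters; defaults could come from solrconfig.xml
    // through the update handler's normal defaults/invariants mechanism
    String name = req.getParams().get("pipeline.name", "default");
    String mode = req.getParams().get("pipeline.mode", "local");
    return new PipelineDispatcherProcessor(name, mode, next);
  }
}

class PipelineDispatcherProcessor extends UpdateRequestProcessor {

  private final String pipelineName;
  private final String mode;

  PipelineDispatcherProcessor(String pipelineName, String mode, UpdateRequestProcessor next) {
    super(next);
    this.pipelineName = pipelineName;
    this.mode = mode;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    if ("local".equals(mode)) {
      // Local mode: run the named pipeline in this thread, then let the rest
      // of the chain (ending in RunUpdateProcessorFactory) index the result.
      // runPipeline(pipelineName, cmd.getSolrInputDocument());  // hypothetical
      super.processAdd(cmd);
    } else {
      // Distributed mode: stream the document to a remote worker node instead
      // of continuing the local chain (see "Distributed mode" below).
    }
  }
}
}}}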
== Distributed mode ==
- The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of remote worker nodes which execute the actual pipeline. This means that indexing will not happen locally. Thus a Solr node can take the role as RequestHandler + Pipeline``Dispatcher only, or as a Document Processor only. The dispatcher streams output to a Request``Handler on the processing node. When the pipeline has finished executing, the resulting documents enter the Solr``Pipeline``Dispatcher again and get routed to the correct shard for indexing. As we can tell, there are some major devlopment effort to support distributed pipelines!
+ The "distributed" mode would enable more advanced dispatching (streaming) to a cluster of remote worker nodes which execute the actual pipeline. This means that indexing will not happen locally. Thus a Solr node can take the role of RequestHandler + Pipeline``Dispatcher only, or of Document Processor only. The dispatcher streams output to a Request``Handler on the processing node. When the pipeline has finished executing, the resulting documents enter the Pipeline``Dispatcher again and get routed to the correct shard for indexing (also see [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]). As we can tell, supporting distributed pipelines will take major development effort!

= Risks =
- * Automated distributed indexing is a larger problem. Split the camel!
+ * Automated distributed indexing ([[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]) needs to work with this
  * Introducing multiple worker nodes introduces sequencing issues and potential deadlocks
  * Need sophisticated dispatching and scheduling code for a robust and fault-tolerant model
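To illustrate the routing step of distributed mode, a deliberately naive sketch using plain SolrJ. The ShardRouter name, the hash-mod scheme and the id field are assumptions; a real implementation would need the coordination with [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]] plus the sequencing and failover handling listed under Risks:

{{{#!java
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ShardRouter {

  private final List<SolrServer> shards;  // one SolrJ client per target shard

  public ShardRouter(List<SolrServer> shards) {
    this.shards = shards;
  }

  /** Hash the unique key to pick a shard, then index the document there.
      No retry, re-scheduling or ordering guarantees -- exactly the gaps
      the Risks section warns about. */
  public void route(SolrInputDocument doc) throws Exception {
    String id = (String) doc.getFieldValue("id");
    int shard = Math.abs(id.hashCode()) % shards.size();
    shards.get(shard).add(doc);
  }
}
}}}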