Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "DocumentProcessing" page has been changed by JanHoydahl.
The comment on this change is: Clarification.
http://wiki.apache.org/solr/DocumentProcessing?action=diff&rev1=12&rev2=13

--------------------------------------------------

  = Anti-patterns =
   * Do not over-architect, as Eclipse SMILA and others have done by going crazy with ESB etc.
+  * Do not try to be a connector framework as well. Let ManifoldCF do that job. Focus on the pipeline!
+  * Do not keep the source private (although Apache licensed) as DieselPoint did with OpenPipeline - create a community!

  = Proposed architecture =
  [[https://docs.google.com/drawings/edit?id=1rVsy-p7sexSw3wrald2_fHtkLk6opYp5qzllvOHOB8c&hl=en|Architecture diagram]]

@@ -66, +68 @@

  Glue code to hook the pipeline into Solr could be an UpdateRequestProcessor which can work either in "local" mode, executing the pipeline locally in-thread, or in "distributed" mode, which would dispatch the batch to an available node in a document processing cluster. I envision that the whole pipeline could (in addition to running standalone) be wrapped in a Solr RequestHandler, i.e. a document-processing-only node would be an instance of Solr with a new BinaryDocumentRequestHandler and no local index. When processing is finished, the documents are routed to their final destination for indexing (perhaps using [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]]).
+
+ The architecture diagram above shows the local and the fully distributed cases. Another option would be to feed the set of pipeline nodes directly in round-robin fashion (not needing a BinaryDocumentRequestHandler), and let them do the distributed indexing as the last UpdateProcessor.

  = Risks =
   * Automated distributed indexing [[https://issues.apache.org/jira/browse/SOLR-2358|SOLR-2358]] needs to work with this
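
The "local" and "distributed" modes described under Proposed architecture could be sketched roughly as follows. This is only an illustration under stated assumptions, not the page's design: every type here (Pipeline, ProcessingNode, Dispatcher, Mode) is hypothetical and none of it is Solr API. Documents are modelled as plain field maps, a "node" runs in-process instead of behind a network call, and round-robin stands in for whatever "an available node" would really mean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these types are part of Solr.
final class PipelineDispatchSketch {

    enum Mode { LOCAL, DISTRIBUTED }

    /** An ordered list of stages, each transforming one document (a field map). */
    static final class Pipeline {
        private final List<UnaryOperator<Map<String, Object>>> stages;

        Pipeline(List<UnaryOperator<Map<String, Object>>> stages) {
            this.stages = stages;
        }

        List<Map<String, Object>> run(List<Map<String, Object>> batch) {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> doc : batch) {
                // Work on a copy so the caller's documents are left untouched.
                Map<String, Object> current = new LinkedHashMap<>(doc);
                for (UnaryOperator<Map<String, Object>> stage : stages) {
                    current = stage.apply(current);
                }
                out.add(current);
            }
            return out;
        }
    }

    /** Stand-in for a remote processing node (assumption: runs in-process here). */
    static final class ProcessingNode {
        final String name;
        private final Pipeline pipeline;
        int batchesHandled = 0;

        ProcessingNode(String name, Pipeline pipeline) {
            this.name = name;
            this.pipeline = pipeline;
        }

        List<Map<String, Object>> process(List<Map<String, Object>> batch) {
            batchesHandled++;
            return pipeline.run(batch);
        }
    }

    /** Plays the role of the glue code: run in-thread, or hand the batch to a node. */
    static final class Dispatcher {
        private final Mode mode;
        private final Pipeline localPipeline;
        private final List<ProcessingNode> nodes;
        private final AtomicInteger next = new AtomicInteger();

        Dispatcher(Mode mode, Pipeline localPipeline, List<ProcessingNode> nodes) {
            this.mode = mode;
            this.localPipeline = localPipeline;
            this.nodes = nodes;
        }

        List<Map<String, Object>> dispatch(List<Map<String, Object>> batch) {
            if (mode == Mode.LOCAL) {
                return localPipeline.run(batch);
            }
            // Round-robin is a placeholder for choosing "an available node".
            int index = Math.floorMod(next.getAndIncrement(), nodes.size());
            return nodes.get(index).process(batch);
        }
    }

    static Map<String, Object> doc(String id) {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("id", id);
        return d;
    }

    public static void main(String[] args) {
        UnaryOperator<Map<String, Object>> tagLanguage = d -> {
            d.put("language", "en"); // hypothetical enrichment stage
            return d;
        };
        Pipeline pipeline = new Pipeline(Arrays.asList(tagLanguage));
        List<ProcessingNode> nodes = Arrays.asList(
                new ProcessingNode("node-a", pipeline),
                new ProcessingNode("node-b", pipeline));

        Dispatcher dispatcher = new Dispatcher(Mode.DISTRIBUTED, pipeline, nodes);
        for (int i = 0; i < 3; i++) {
            System.out.println(dispatcher.dispatch(Arrays.asList(doc("doc-" + i))));
        }
        for (ProcessingNode node : nodes) {
            System.out.println(node.name + " handled " + node.batchesHandled + " batch(es)");
        }
    }
}
```

In a real integration the Dispatcher role would presumably be played by the UpdateRequestProcessor itself, and the final routing to the index (for example via SOLR-2358) would happen after dispatch returns; neither is modelled here.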