Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "ExtractingRequestHandler" page has been changed by HossMan:
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=73&rev2=74

Comment:
fill in osme TODOs and clean up some formatting

  1. if {{{uprefix}}} is specified, any unknown field names are prefixed with that value, else if {{{defaultField}}} is specified, unknown fields are copied to that.
  = Configuration =
- // TODO: this is out of date as of Solr 1.4 - dist/apache-solr-cell-1.4.jar and all of contrib/extraction/lib are needed
- If you are not working from the supplied example/solr directory you must copy all libraries from example/solr/libs into a libs directory within your own solr directory. The !ExtractingRequestHandler is not incorporated into the solr war file, you have to install it separately.
+ The !ExtractingRequestHandler is not incorporated into the solr war file, it is provided as a SolrPlugin, and you have to load it (and it's dependencies) explicitly.
- Example config:
+ Example configuration for loading plugin and dependencies:
+
+ {{{
+ <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
+ <lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
+ }}}
+
+
+ Example configuration for the Handler:
  {{{
  <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
@@ -101, +108 @@
  </lst>
  </requestHandler>
  }}}
+ In the defaults section, we are mapping Tika's Last-Modified Metadata attribute to a field named last_modified. We are also telling it to ignore undeclared fields. These are all overridden parameters. The tika.config entry points to a file containing a Tika configuration. You would only need this if you have customized your own Tika configuration. The Tika config contains info about parsers, mime types, etc.
@@ -184, +192 @@
  See TikaExtractOnlyExampleOutput.
  = Sending documents to Solr =
- // TODO: describe the different ways to send the documents to solr (POST body, form encoded, remoteStreaming)
+ The ExtractingRequestHandler can process any document sent as a ContentStream ...
+ * Raw POST
+ * Multi-part file upload (each file is processed as a distinct document)
+ * "stream.body", "stream.url" and "stream.file" request params.
+
+ Example...
+
+ {{{
- * curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" --data-binary @tutorial.html -H 'Content-type:text/html'
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" --data-binary @tutorial.html -H 'Content-type:text/html'
+ }}}
+
- . <!> NOTE, this literally streams the file as the body of the POST, which does not, then, provide info to Solr about the name of the file.
+ <!> NOTE, this literally streams the file as the body of the POST, which does not, then, provide info to Solr about the name of the file.
  == SolrJ ==
  Use the !ContentStreamUpdateRequest (see ContentStreamUpdateRequestExample for a full example):
@@ -225, +242 @@
  * Commit
  = Additional Resources =
- * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]] * [[http://tika.apache.org/0.10/formats.html|Supported document formats via Tika (0.10)]]
+ * [[http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source|Lucid Imagination article]]
+ * [[http://tika.apache.org/0.10/formats.html|Supported document formats via Tika (0.10)]]
  = What's in a Name =
  Grant was writing the javadocs for the code and needed an entry for the <title> tag and wrote out "Solr Content Extraction Library", since the contrib directory is named "extraction". This then lead to an "acronym": Solr CEL which then gets mashed to: Solr Cell. Hence, the project name is "Solr Cell".
  It's also appropriate because a Solar Cell's job is to convert the raw energy of the Sun to electricity, and this contrib's module is responsible for converting the "raw" content of a document to something usable by Solr. http://en.wikipedia.org/wiki/Solar_cell
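
As a companion to the "Sending documents to Solr" changes above, here is a minimal sketch of how the request URL differs between a raw POST and the "stream.*" request params. It only builds URLs and sends nothing. The base URL and the literal.id / defaultField parameters are taken from the curl example in the diff; the helper name build_extract_url and the file path /tmp/tutorial.html are hypothetical, chosen for illustration. Note that stream.url and stream.file typically only work when remote streaming is enabled in solrconfig.xml, which this sketch does not check.

```python
from urllib.parse import urlencode

# Base URL assumed from the curl example on the wiki page.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# The three ContentStream request params named in the diff.
STREAM_PARAMS = ("stream.body", "stream.url", "stream.file")


def build_extract_url(doc_id, stream_param=None, stream_value=None,
                      default_field="text"):
    """Build an /update/extract URL (hypothetical helper, not part of Solr).

    With no stream_param the URL matches the raw POST example, where the
    document travels in the request body. With a stream_param, Solr is
    instead told where to find the content.
    """
    params = [("literal.id", doc_id), ("defaultField", default_field)]
    if stream_param is not None:
        if stream_param not in STREAM_PARAMS:
            raise ValueError("unknown stream parameter: %r" % (stream_param,))
        params.append((stream_param, stream_value))
    # urlencode percent-encodes values, so paths and URLs are safe to pass.
    return EXTRACT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Raw POST: the file itself would be sent as the request body.
    print(build_extract_url("doc5"))
    # stream.file: Solr reads a path on the server (hypothetical path).
    print(build_extract_url("doc6", "stream.file", "/tmp/tutorial.html"))
```

The first URL is the one used by the curl command in the diff; the second could be requested without a body, since the server would read the file itself.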