FYI, I'm only really done down to the "// TODO: move this somewhere else[...]"
I've removed a number of things that were complicated or misleading and tried to improve the first example - a good OOTB experience with this handler is esp important I think.

Let me know if you think I've removed something I shouldn't have, or if anything will be confusing to someone looking at it the first time. I'll continue making changes today and tomorrow.

-Yonik
http://www.lucidimagination.com

On Tue, Jul 14, 2009 at 4:39 PM, Apache Wiki<wikidi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
>
> The following page has been changed by YonikSeeley:
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> The comment on the change is:
> snapshot - updating to reflect committed code, simplifying
>
> ------------------------------------------------------------------------------
>
>  [[TableOfContents]]
>
> - Please see [https://issues.apache.org/jira/browse/SOLR-284 SOLR-284] for more information on the incorporation of this feature into Solr 1.4.
> -
>  = Introduction =
>
> - A common need of users is the ability to ingest binary and/or structured documents such as Office, PDF and other proprietary formats. The [http://incubator.apache.org/tika/ Apache Tika] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
> + <!> ["Solr1.4"]
>
> + A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats. The [http://incubator.apache.org/tika/ Apache Tika] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
> +
> - Solr's !ExtractingRequestHandler provides a wrapper around Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.
> + Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.
>
>  = Concepts =
>
> @@ -18, +18 @@
>
>   * Tika will automatically attempt to determine the input document type (word, pdf, etc.) and extract the content appropriately. If you want, you can explicitly specify a MIME type for Tika with the stream.type parameter
>   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
> -  * Solr then implements a !SolrContentHandler which reacts to Tika's SAX events and creates a !SolrInputDocument. You can override the !SolrContentHandler. See the section below on Customization.
> -  * Tika produces Metadata information according to things like !DublinCore and other specifications. See the Tika javadocs on the Metadata class for what gets produced. <!> TODO: Link to Tika Javadocs <!> See also http://lucene.apache.org/tika/formats.html
> +  * Solr then reacts to Tika's SAX events and creates the fields to index.
> +  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore. See http://lucene.apache.org/tika/formats.html for the file types supported.
> +  * All of the extracted text is added to the "content" field
>   * We can map Tika's metadata fields to Solr fields. We can boost these fields
> -  * We can also pass in literals.
> +  * We can also pass in literals for field values.
> +  * We can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
> -  * We can apply an XPath expression to the Tika XHTML by passing in the ext.xpath parameter (described below). This restricts down the events that are given to the !SolrContentHandler. It is still up to the !SolrContentHandler to process those events.
> -  * Field boosts are applied after name mapping
> -  * It is useful to keep in mind what a given operation is using for input when specifying parameters. For instance, captured fields are specified to the !SolrContentHandler for capturing content in the Tika XHTML. Thus, the names of the fields are those of the XHTML, not the mapped names.
> -  * A default field name is required for indexing, but not for extraction only.
> -  * The default field name and any literal values are not mapped. They can be boosted. See the examples.
> -
> -
> - == When To Use ==
> -
> - The !ExtractingRequestHandler can be used any time you have the need to index both the metadata and text of binary documents like Word, PDF, etc. It doesn't, however, make sense to use it if you are only interested in indexing the metadata about documents, since it will be much faster to determine the metadata on the client side and then send that as a normal Solr document. In fact, it might make sense for someone to write a piece for SolrJ that uses Tika on the client-side to construct Solr documents.
>
>  = Getting Started with the Solr Example =
> +  * Check out Solr trunk or get a 1.4 release or later.
> +  * If using a check out, running "ant example" will build the necessary jars.
> + Now start the solr example server:
> + {{{
> + cd example
> + java -jar start.jar
> + }}}
>
> -  * Check out Solr trunk or get a 1.4 release or later if it exists.
> -  * If using a check out, running "ant example" will build the necessary jars.
> -  * cd example
> -  * The example directory comes with all required libs, but the configuration files are not setup for the !ExtractingRequestHandler. Add the Configuration as defined below to the example's solrconfig.xml.
> -   * ''recent versions of the solr code from svn, do contain a configuration section within example/solr/conf/solrconfig.xml but it needs uncommented.''
> -  * java -jar start.jar
> -  * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml in order for Solr to find the jars in example/solr/lib
> + In a separate window go to the {{{site/}}} directory (which contains some nice example docs) and send Solr a file via HTTP POST:
> + {{{
> + cd site
> + curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfi...@tutorial.html"
> + }}}
> +  * Note, the /site directory in the solr download contains some nice example docs to try
> +  * hint: myfi...@tutorial.html needs a valid path (absolute or relative), e.g. "myfi...@../../site/tutorial.html" if you are still in exampledocs dir.
> +  * the {{{literal.id=doc1}}} param provides the necessary unique id for the document being indexed
> +  * the {{{commit=true}}} param causes Solr to do a commit after indexing the document, making it immediately searchable. For good performance when loading many documents, don't call commit until you are done.
> +  * using "curl" or other command line tools to post documents to Solr is nice for testing, but not the recommended update method for best performance.
>
> + Now, you should be able to execute a query and find that document (open the following link in your browser):
> + http://localhost:8983/solr/select?q=tutorial
>
> - In a separate window, post a file:
> + You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text" (which is indexed but not stored) via the default map rule in {{{solrconfig.xml}}} that can be changed or overridden.
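The advice above about deferring commits when loading many documents can be sketched as a short shell sequence. This is a sketch, not from the wiki page itself: the ids, file names, and the form-field name "myfile" are illustrative, and it assumes the example server is running on localhost:8983.

```shell
# Base URL of the example Solr server (assumed; adjust for your setup).
SOLR='http://localhost:8983/solr'

# Index several files WITHOUT commit=true on each request
# (ids and file names here are hypothetical).
for f in a.html b.html c.html; do
  curl "$SOLR/update/extract?literal.id=$f" -F "myfile=@$f"
done

# One explicit commit at the end makes all documents searchable at once,
# instead of paying the commit cost per document.
curl "$SOLR/update" -H 'Content-Type: text/xml' --data-binary '<commit/>'
```

The single trailing commit is the point: each commit reopens the searcher, so batching them is what the "don't call commit until you are done" note is getting at.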
For example, to store and see all metadata and content, execute the following:
> + {{{
> + curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&map.content=attr_content&commit=true' -F "myfi...@tutorial.html"
> + }}}
> + And then query via http://localhost:8983/solr/select?q=attr_content:tutorial
>
> + // TODO: move this somewhere else to a more in-depth discussion of different ways to send the data to Solr (prob with remoteStreaming discussion)
> -  * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "myfi...@tutorial.html" //Note, the trunk/site contains some nice example docs
> -  * hint: myfi...@tutorial.html needs a valid path (absolute or relative), e.g. "myfi...@../../site/tutorial.html" if you are still in exampledocs dir.
> -  * with recent svn, you may need to add a unique '''id''' param to curl (see [http://www.nabble.com/Missing-required-field:-id-Using-ExtractingRequestHandler-td22611039.html nabble msg]):
> -   * e.g. curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.literal.id=123 -F "myfi...@../../site/tutorial.html"
> -
> - or
> -
>   * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text --data-binary @tutorial.html -H 'Content-type:text/html'
>    <!> NOTE, this literally streams the file, which does not, then, provide info to Solr about the name of the file.
>
> - or whatever other way you know how to do it. Don't forget to COMMIT!
> -  * e.g. curl "http://localhost:8983/solr/update/" -H "Content-Type: text/xml" --data-binary '<commit waitFlush="false"/>' -- see [http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source LucidImagination note]
>
>  If you are not working from the supplied example/solr directory you must copy all libraries from example/solr/libs into a libs directory within your own solr directory.
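The streaming NOTE above points out that a raw --data-binary upload gives Solr no file name. The ext.resource.name parameter from the Input Parameters section can supply that name as a mime-type-detection hint for Tika. A sketch under those assumptions - the id value is illustrative, and it assumes the example server is running and a local tutorial.html exists:

```shell
# Stream the raw bytes, but still pass the original file name so Tika
# can use it as a detection hint (ext.resource.name, see Input Parameters).
# ext.literal.id supplies the required unique id; "doc2" is illustrative.
curl 'http://localhost:8983/solr/update/extract?ext.literal.id=doc2&ext.resource.name=tutorial.html' \
  --data-binary @tutorial.html -H 'Content-type:text/html'
```

Note the URL is single-quoted so the shell does not interpret the & separators; the earlier examples escape them as \& instead.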
The !ExtractingRequestHandler is not incorporated into the solr war file; you have to install it separately.
> +
> + = Input Parameters =
> +
> +  * ext.boost.<NAME> = Float - Boost the field with the specified name. The NAME value is the name of the Solr field (not the Tika metadata name).
> +  * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding to the Solr document. This can be useful for grabbing chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall string buffer.
> +  * ext.def.fl = <NAME> - The name of the field to add the default content to. See also ext.capture above. This NAME is not mapped, but it can be boosted.
> +  * ext.extract.only = true|false - Default is false. If true, return the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a <str> in the response. See TikaExtractOnlyExampleOutput.
> +  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named after the attribute. For example, when extracting from HTML, Tika can return the href values of <a> tags as attributes of a tag name. See the examples below.
> +  * ext.ignore.und.fl = true|false - Default is false. If true, ignore fields that are extracted but are not in the Solr Schema. Otherwise, an exception will be thrown for fields that are not mapped.
> +  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name NAME and literal value VALUE, e.g. ext.literal.foo=bar. May be multivalued if the Field is multivalued. Otherwise, the ERH will throw an exception.
> +  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name. If no mapping is specified, the metadata attribute will be used as the field name.
If the field name doesn't exist, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) attribute described above.
> +  * ext.metadata.prefix=<VALUE> - Prepend a String value to all Metadata, such that it is easy to map new metadata fields to dynamic fields
> +  * ext.resource.name=<File Name> - Optional. The name of the file. Tika can use it as a hint for detecting mime type.
> +  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://lucene.apache.org/tika/documentation.html for details on the format of Tika XHTML. See also TikaExtractOnlyExampleOutput.
>
>  = Configuration =
>
> @@ -104, +118 @@
>
>  EEE MMM d HH:mm:ss yyyy
>  }}}
>
> - = Input Parameters =
> + == MultiCore config ==
> +  * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml in order for Solr to find the jars in example/solr/lib
>
> -  * ext.boost.<NAME> = Float - Boost the field with the specified name. The NAME value is the name of the Solr field (not the Tika metadata name).
> -  * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding to the Solr document. This can be useful for grabbing chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall string buffer.
> -  * ext.def.fl = <NAME> - The name of the field to add the default content to. See also ext.capture below. This NAME is not mapped, but it can be boosted.
> -  * ext.extract.only = true|false - Default is false. If true, return the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a <str> in the response. See TikaExtractOnlyExampleOutput.
> -  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named after the attribute.
For example, when extracting from HTML, Tika can return the href values of <a> tags as attributes of a tag name. See the examples below.
> -  * ext.ignore.und.fl = true|false - Default is false. If true, ignore fields that are extracted but are not in the Solr Schema. Otherwise, an exception will be thrown for fields that are not mapped.
> -  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name NAME and literal value VALUE, e.g. ext.literal.foo=bar. May be multivalued if the Field is multivalued. Otherwise, the ERH will throw an exception.
> -  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name. If no mapping is specified, the metadata attribute will be used as the field name. If the field name doesn't exist, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) attribute described below
> -  * ext.metadata.prefix=<VALUE> - Prepend a String value to all Metadata, such that it is easy to map new metadata fields to dynamic fields
> -  * ext.resource.name=<File Name> - Optional. The name of the file. Tika can use it as a hint for detecting mime type.
> -  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://lucene.apache.org/tika/documentation.html for details on the format of Tika XHTML. See also TikaExtractOnlyExampleOutput.
>
>  = Metadata =
>
> @@ -171, +175 @@
>
>  See TikaExtractOnlyExampleOutput.
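The ext.extract.only parameter described in the parameter list can be tried without touching the index at all. A minimal sketch, assuming the example server is running and a local tutorial.html is available:

```shell
# Return Tika's extracted XHTML as a <str> in the response instead of
# indexing it. No unique id is needed since nothing is added to the index.
curl 'http://localhost:8983/solr/update/extract?ext.extract.only=true' \
  --data-binary @tutorial.html -H 'Content-type:text/html'
```

This is a convenient way to inspect the XHTML structure before deciding on ext.xpath or ext.capture expressions; see TikaExtractOnlyExampleOutput for what the response looks like.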
>
>
> + == Additional Resources ==
> +  * [http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source Lucid Imagination article]
> +  * [http://lucene.apache.org/tika/formats.html Supported document formats via Tika]
> - = Customizing =
> -
> - While the current !ExtractingRequestHandler only allows for the use of the !SolrContentHandler in creating new documents, it is relatively easy to implement your own extension that processes the Tika extracted content differently and produces a different !SolrInputDocument.
> -
> - To do this, implement your own instance of the !SolrContentHandlerFactory and override the createFactory() method on the !ExtractingRequestHandler.
>
>  = What's in a Name =
>