FYI, I'm only really done down to the "// TODO: move this somewhere else[...]"
I've removed a number of things that were complicated or misleading and tried to improve the first example - a good OOTB experience with this handler is esp important I think.

Let me know if you think I've removed something I shouldn't have, or if anything will be confusing to someone looking at it the first time. I'll continue making changes today and tomorrow.

-Yonik
http://www.lucidimagination.com

On Tue, Jul 14, 2009 at 4:39 PM, Apache Wiki<wikidi...@apache.org> wrote:
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
>
> The following page has been changed by YonikSeeley:
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
> The comment on the change is:
> snapshot - updating to reflect committed code, simplifying
>
> ------------------------------------------------------------------------------
>
>  [[TableOfContents]]
>
> - Please see [https://issues.apache.org/jira/browse/SOLR-284 SOLR-284] for more information on the incorporation of this feature into Solr 1.4.
> -
>  = Introduction =
>
> - A common need of users is the ability to ingest binary and/or structured documents such as Office, PDF and other proprietary formats. The [http://incubator.apache.org/tika/ Apache Tika] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
> + <!> ["Solr1.4"]
>
> + A common need of users is the ability to ingest binary and/or structured documents such as Office, Word, PDF and other proprietary formats. The [http://incubator.apache.org/tika/ Apache Tika] project provides a framework for wrapping many different file format parsers, such as PDFBox, POI and others.
> +
> - Solr's !ExtractingRequestHandler provides a wrapper around Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.
> + Solr's !ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.
>
>  = Concepts =
>
> @@ -18, +18 @@
>
>   * Tika will automatically attempt to determine the input document type (word, pdf, etc.) and extract the content appropriately. If you want, you can explicitly specify a MIME type for Tika with the stream.type parameter
>   * Tika does everything by producing an XHTML stream that it feeds to a SAX !ContentHandler.
> -  * Solr then implements a !SolrContentHandler which reacts to Tika's SAX events and creates a !SolrInputDocument. You can override the !SolrContentHandler. See the section below on Customization.
> -  * Tika produces Metadata information according to things like !DublinCore and other specifications. See the Tika javadocs on the Metadata class for what gets produced. <!> TODO: Link to Tika Javadocs <!> See also http://lucene.apache.org/tika/formats.html
> +  * Solr then reacts to Tika's SAX events and creates the fields to index.
> +  * Tika produces Metadata information such as Title, Subject, and Author, according to specifications like !DublinCore. See http://lucene.apache.org/tika/formats.html for the file types supported.
> +  * All of the extracted text is added to the "content" field
>   * We can map Tika's metadata fields to Solr fields. We can boost these fields
> -  * We can also pass in literals.
> +  * We can also pass in literals for field values.
> +  * We can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
> -  * We can apply an XPath expression to the Tika XHTML by passing in the ext.xpath parameter (described below). This restricts down the events that are given to the !SolrContentHandler. It is still up to the !SolrContentHandler to process those events.
> -  * Field boosts are applied after name mapping
> -  * It is useful to keep in mind what a given operation is using for input when specifying parameters. For instance, captured fields are specified to the !SolrContentHandler for capturing content in the Tika XHTML. Thus, the names of the fields are those of the XHTML, not the mapped names.
> -  * A default field name is required for indexing, but not for extraction only.
> -  * The default field name and any literal values are not mapped. They can be boosted. See the examples.
> -
> -
> - == When To Use ==
> -
> - The !ExtractingRequestHandler can be used any time you have the need to index both the metadata and text of binary documents like Word, PDF, etc. It doesn't, however, make sense to use it if you are only interested in indexing the metadata about documents, since it will be much faster to determine the metadata on the client side and then send that as a normal Solr document. In fact, it might make sense for someone to write a piece for SolrJ that uses Tika on the client-side to construct Solr documents.
>
>  = Getting Started with the Solr Example =
> +  * Check out Solr trunk or get a 1.4 release or later.
> +  * If using a check out, running "ant example" will build the necessary jars.
> + Now start the solr example server:
> + {{{
> + cd example
> + java -jar start.jar
> + }}}
>
> -  * Check out Solr trunk or get a 1.4 release or later if it exists.
> -  * If using a check out, running "ant example" will build the necessary jars.
> -  * cd example
> -  * The example directory comes with all required libs, but the configuration files are not setup for the !ExtractingRequestHandler. Add the Configuration as defined below to the example's solrconfig.xml.
> -   * ''recent versions of the solr code from svn, do contain a configuration section within example/solr/conf/solrconfig.xml but it needs uncommented.''
> -  * java -jar start.jar
> -  * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml in order for Solr to find the jars in example/solr/lib
> + In a separate window go to the {{{site/}}} directory (which contains some nice example docs) and send Solr a file via HTTP POST:
> + {{{
> + cd site
> + curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfi...@tutorial.html"
> + }}}
> +  * Note, the /site directory in the solr download contains some nice example docs to try
> +  * hint: myfi...@tutorial.html needs a valid path (absolute or relative), e.g. "myfi...@../../site/tutorial.html" if you are still in exampledocs dir.
> +  * the {{{literal.id=doc1}}} param provides the necessary unique id for the document being indexed
> +  * the {{{commit=true}}} param causes Solr to do a commit after indexing the document, making it immediately searchable. For good performance when loading many documents, don't call commit until you are done.
> +  * using "curl" or other command line tools to post documents to Solr is nice for testing, but not the recommended update method for best performance.
>
> + Now, you should be able to execute a query and find that document (open the following link in your browser):
> + http://localhost:8983/solr/select?q=tutorial
>
> - In a separate window, post a file:
> + You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text" (which is indexed but not stored) via the default map rule in {{{solrconfig.xml}}} that can be changed or overridden.
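The advice above about deferring commits when loading many documents can be sketched as a short shell sequence. This is a sketch, not from the wiki page itself: the ids, file names, and the form-field name "myfile" are illustrative, and it assumes the example server is running on localhost:8983.

```shell
# Base URL of the example Solr server (assumed; adjust for your setup).
SOLR='http://localhost:8983/solr'

# Index several files WITHOUT commit=true on each request
# (ids and file names here are hypothetical).
for f in a.html b.html c.html; do
  curl "$SOLR/update/extract?literal.id=$f" -F "myfile=@$f"
done

# One explicit commit at the end makes all documents searchable at once,
# instead of paying the commit cost per document.
curl "$SOLR/update" -H 'Content-Type: text/xml' --data-binary '<commit/>'
```

The single trailing commit is the point: each commit reopens the searcher, so batching them is what the "don't call commit until you are done" note is getting at.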
For example, to store and see all metadata and content, execute the following:
> + {{{
> + curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&map.content=attr_content&commit=true' -F "myfi...@tutorial.html"
> + }}}
> + And then query via http://localhost:8983/solr/select?q=attr_content:tutorial
>
> + // TODO: move this somewhere else to a more in-depth discussion of different ways to send the data to Solr (prob with remoteStreaming discussion)
> -  * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "myfi...@tutorial.html" //Note, the trunk/site contains some nice example docs
> -  * hint: myfi...@tutorial.html needs a valid path (absolute or relative), e.g. "myfi...@../../site/tutorial.html" if you are still in exampledocs dir.
> -  * with recent svn, you may need to add a unique '''id''' param to curl (see [http://www.nabble.com/Missing-required-field:-id-Using-ExtractingRequestHandler-td22611039.html nabble msg]):
> -   * e.g. curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.literal.id=123 -F "myfi...@../../site/tutorial.html"
> -
> - or
> -
>   * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text --data-binary @tutorial.html -H 'Content-type:text/html'
>    <!> NOTE, this literally streams the file, which does not, then, provide info to Solr about the name of the file.
>
> - or whatever other way you know how to do it. Don't forget to COMMIT!
> -  * e.g. curl "http://localhost:8983/solr/update/" -H "Content-Type: text/xml" --data-binary '<commit waitFlush="false"/>' -- see [http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source LucidImagination note]
>
>  If you are not working from the supplied example/solr directory you must copy all libraries from example/solr/libs into a libs directory within your own solr directory.
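The streaming NOTE above points out that a raw --data-binary upload gives Solr no file name. The ext.resource.name parameter from the Input Parameters section can supply that name as a mime-type-detection hint for Tika. A sketch under those assumptions - the id value is illustrative, and it assumes the example server is running and a local tutorial.html exists:

```shell
# Stream the raw bytes, but still pass the original file name so Tika
# can use it as a detection hint (ext.resource.name, see Input Parameters).
# ext.literal.id supplies the required unique id; "doc2" is illustrative.
curl 'http://localhost:8983/solr/update/extract?ext.literal.id=doc2&ext.resource.name=tutorial.html' \
  --data-binary @tutorial.html -H 'Content-type:text/html'
```

Note the URL is single-quoted so the shell does not interpret the & separators; the earlier examples escape them as \& instead.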
The !ExtractingRequestHandler is not incorporated into the solr war file; you have to install it separately.
> +
> + = Input Parameters =
> +
> +  * ext.boost.<NAME> = Float - Boost the field with the specified name. The NAME value is the name of the Solr field (not the Tika metadata name).
> +  * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding to the Solr document. This can be useful for grabbing chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall string buffer.
> +  * ext.def.fl = <NAME> - The name of the field to add the default content to. See also ext.capture above. This NAME is not mapped, but it can be boosted.
> +  * ext.extract.only = true|false - Default is false. If true, return the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a <str> in the response. See TikaExtractOnlyExampleOutput.
> +  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named after the attribute. For example, when extracting from HTML, Tika can return the href values of <a> tags as attributes of a tag name. See the examples below.
> +  * ext.ignore.und.fl = true|false - Default is false. If true, ignore fields that are extracted but are not in the Solr Schema. Otherwise, an exception will be thrown for fields that are not mapped.
> +  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name NAME and literal value VALUE, e.g. ext.literal.foo=bar. May be multivalued if the Field is multivalued. Otherwise, the ERH will throw an exception.
> +  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name. If no mapping is specified, the metadata attribute will be used as the field name.
If the field name doesn't exist, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) attribute described above.
> +  * ext.metadata.prefix=<VALUE> - Prepend a String value to all Metadata, such that it is easy to map new metadata fields to dynamic fields
> +  * ext.resource.name=<File Name> - Optional. The name of the file. Tika can use it as a hint for detecting mime type.
> +  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://lucene.apache.org/tika/documentation.html for details on the format of Tika XHTML. See also TikaExtractOnlyExampleOutput.
>
>  = Configuration =
>
> @@ -104, +118 @@
>
>  EEE MMM d HH:mm:ss yyyy
>  }}}
>
> - = Input Parameters =
> + == MultiCore config ==
> +  * For multi-core, specify {{{ sharedLib='lib' }}} in {{{ <solr /> }}} in example/solr/solr.xml in order for Solr to find the jars in example/solr/lib
>
> -  * ext.boost.<NAME> = Float - Boost the field with the specified name. The NAME value is the name of the Solr field (not the Tika metadata name).
> -  * ext.capture = <Tika XHTML NAME> - Capture fields with the name separately for adding to the Solr document. This can be useful for grabbing chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall string buffer.
> -  * ext.def.fl = <NAME> - The name of the field to add the default content to. See also ext.capture below. This NAME is not mapped, but it can be boosted.
> -  * ext.extract.only = true|false - Default is false. If true, return the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a <str> in the response. See TikaExtractOnlyExampleOutput.
> -  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named after the attribute.
For example, when extracting from HTML, Tika can return the href values of <a> tags as attributes of a tag name. See the examples below.
> -  * ext.ignore.und.fl = true|false - Default is false. If true, ignore fields that are extracted but are not in the Solr Schema. Otherwise, an exception will be thrown for fields that are not mapped.
> -  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name NAME and literal value VALUE, e.g. ext.literal.foo=bar. May be multivalued if the Field is multivalued. Otherwise, the ERH will throw an exception.
> -  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name. If no mapping is specified, the metadata attribute will be used as the field name. If the field name doesn't exist, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) attribute described below
> -  * ext.metadata.prefix=<VALUE> - Prepend a String value to all Metadata, such that it is easy to map new metadata fields to dynamic fields
> -  * ext.resource.name=<File Name> - Optional. The name of the file. Tika can use it as a hint for detecting mime type.
> -  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://lucene.apache.org/tika/documentation.html for details on the format of Tika XHTML. See also TikaExtractOnlyExampleOutput.
>
>  = Metadata =
>
> @@ -171, +175 @@
>
>  See TikaExtractOnlyExampleOutput.
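The ext.extract.only parameter described in the parameter list can be tried without touching the index at all. A minimal sketch, assuming the example server is running and a local tutorial.html is available:

```shell
# Return Tika's extracted XHTML as a <str> in the response instead of
# indexing it. No unique id is needed since nothing is added to the index.
curl 'http://localhost:8983/solr/update/extract?ext.extract.only=true' \
  --data-binary @tutorial.html -H 'Content-type:text/html'
```

This is a convenient way to inspect the XHTML structure before deciding on ext.xpath or ext.capture expressions; see TikaExtractOnlyExampleOutput for what the response looks like.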
>
>
> + == Additional Resources ==
> +  * [http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#example.source Lucid Imagination article]
> +  * [http://lucene.apache.org/tika/formats.html Supported document formats via Tika]
> - = Customizing =
> -
> - While the current !ExtractingRequestHandler only allows for the use of the !SolrContentHandler in creating new documents, it is relatively easy to implement your own extension that processes the Tika extracted content differently and produces a different !SolrInputDocument.
> -
> - To do this, implement your own instance of the !SolrContentHandlerFactory and override the createFactory() method on the !ExtractingRequestHandler.
>
>  = What's in a Name =
>