Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "ExtractingRequestHandler" page has been changed by YonikSeeley:
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=48&rev2=49

  java -jar start.jar
  }}}
- In a separate window go to the {{{docs/}}} directory (which contains some nice example docs) and send Solr a file via HTTP POST:
+ In a separate window go to the {{{docs/}}} directory (which contains some nice example docs), or the {{{site}}} directory if you built Solr from source, and send Solr a file via HTTP POST:
  {{{
  cd docs
  curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "[email protected]"
@@ -142, +142 @@
  = Examples =
- <!> NOTE: All the examples are run using curl on the command line, so there are extra escapes ("\") in the URL.
-
  == Mapping and Capture ==
  Capture <div> tags separately, and then map that field to a dynamic field named foo_t.
  {{{
- curl http://localhost:8983/solr/update/extract?literal.id=doc2\&captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div -F "[email protected]"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F "[email protected]"
  }}}
  == Mapping, Capture and Boost ==
  Capture <div> tags separately, and then map that field to a dynamic field named foo_t. Boost foo_t by 3.
  {{{
- curl http://localhost:8983/solr/update/extract?literal.id=doc3\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3 -F "[email protected]"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3" -F "[email protected]"
  }}}
  == Literals ==
  To add in your own metadata, pass in the literal parameter along with the file:
  {{{
- curl http://localhost:8983/solr/update/extract?literal.id=doc4\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3\&literal.blah_s=Bah -F "[email protected]"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" -F "[email protected]"
  }}}
  == XPath ==
@@ -170, +168 @@
  Restrict the XHTML returned by Tika by passing in an XPath expression:
  {{{
- curl http://localhost:8983/solr/update/extract?literal.id=doc5\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3\&literal.id=id\&\&xpath=\/xhtml:html\/xhtml:body\/xhtml:div\/descendant:node\(\) -F "[email protected]"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()" -F "[email protected]"
  }}}
  == Extract Only ==
  {{{
- curl http://localhost:8983/solr/update/extract?\&extractOnly=true --data-binary @tutorial.html -H 'Content-type:text/html'
+ curl "http://localhost:8983/solr/update/extract?&extractOnly=true" --data-binary @tutorial.html -H 'Content-type:text/html'
  }}}
  The output includes XML generated by Tika (and is hence further escaped by Solr's XML), so using a different output format enhances readability:
  {{{
- curl http://localhost:8983/solr/update/extract?\&extractOnly=true\&wt=ruby\&indent=true --data-binary @tutorial.html -H 'Content-type:text/html'
+ curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true" --data-binary @tutorial.html -H 'Content-type:text/html'
  }}}
  See TikaExtractOnlyExampleOutput.
@@ -188, +186 @@
  = Sending documents to Solr =
  // TODO: describe the different ways to send the documents to solr (POST body, form encoded, remoteStreaming)
- * curl http://localhost:8983/solr/update/extract?literal.id=doc5\&defaultField=text --data-binary @tutorial.html -H 'Content-type:text/html'
+ * curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" --data-binary @tutorial.html -H 'Content-type:text/html'
- <!> NOTE, this literally streams the file, which does not, then, provide info to Solr about the name of the file.
+ <!> NOTE, this literally streams the file as the body of the POST, which does not, then, provide info to Solr about the name of the file.
  * SolrJ: Use the ContentStreamUpdateRequest (see SolrExampleTests.java for full example):{{{
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.addFile(new File("mailing_lists.pdf"));
