[Solr Wiki] Update of "ExtractingRequestHandler" by YonikSeeley

Apache Wiki Fri, 16 Oct 2009 07:04:47 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The "ExtractingRequestHandler" page has been changed by YonikSeeley:
http://wiki.apache.org/solr/ExtractingRequestHandler?action=diff&rev1=48&rev2=49

  java -jar start.jar
  }}}
  
- In a separate window go to the {{{docs/}}} directory (which contains some 
nice example docs) and send Solr a file via HTTP POST:
+ In a separate window go to the {{{docs/}}} directory (which contains some 
nice example docs), or the {{{site}}} directory if you built Solr from source, 
and send Solr a file via HTTP POST:
  {{{
  cd docs
  curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"
@@ -142, +142 @@

  
  = Examples =
  
- <!> NOTE: All the examples are run using curl on the command line, so there 
are extra escapes ("\") in the URL.
- 
  == Mapping and Capture ==
  
  Capture <div> tags separately, and then map that field to a dynamic field named 
foo_t.
  
  {{{
-  curl http://localhost:8983/solr/update/extract?literal.id=doc2\&captureAttr=true\&defaultField=text\&fmap.div=foo_t\&capture=div -F "myfile=@tutorial.html"
+  curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F "myfile=@tutorial.html"
  }}}
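
The change above swaps backslash-escaped ampersands for a double-quoted URL. As a minimal sketch, assuming a POSIX-style shell, the two spellings produce the same string, which is why the quoted form can replace the escaped one. The variable names `escaped` and `quoted` are illustrative only, and the URL is shortened for readability:

```shell
# Old style: each & is backslash-escaped so the shell does not
# interpret it as the "run in background" operator.
escaped=http://localhost:8983/solr/update/extract?literal.id=doc2\&captureAttr=true

# New style: the whole URL is double-quoted, so & needs no escape.
quoted="http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true"

# Compare the two strings the shell actually ends up with.
if [ "$escaped" = "$quoted" ]; then
  echo "same URL"
else
  echo "different URL"
fi
```

Note that the closing quote has to be followed directly by the remaining curl options on the same command line; a `;` or a line break after the URL would end the command before the upload option is read.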
  
  == Mapping, Capture and Boost ==
  Capture <div> tags separately, and then map that field to a dynamic field named 
foo_t.  Boost foo_t by 3.
  {{{
- curl http://localhost:8983/solr/update/extract?literal.id=doc3\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3 -F "myfile=@tutorial.html"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3" -F "myfile=@tutorial.html"
  }}}
  
  == Literals ==
  
  To add in your own metadata, pass in the literal parameter along with the 
file:
  {{{
- curl http://localhost:8983/solr/update/extract?literal.id=doc4\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3\&literal.blah_s=Bah -F "myfile=@tutorial.html"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" -F "myfile=@tutorial.html"
  }}}
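
When several literal fields are needed, typing the query string by hand gets error-prone. The following is a rough sketch of a hypothetical shell helper, `extract_url` (not part of Solr or curl), that appends `literal.<field>=<value>` pairs to the extract URL used above. It assumes the values contain no characters that need URL encoding, and it leaves out the capture and boost parameters:

```shell
# Hypothetical helper: print an extract URL carrying literal.* fields.
# Usage: extract_url <doc id> [field=value ...]
# Assumption: ids and values need no URL encoding (no spaces, &, =, etc.).
extract_url() {
  doc_id=$1
  shift
  url="http://localhost:8983/solr/update/extract?literal.id=$doc_id"
  # Each remaining argument becomes one literal.<field>=<value> parameter.
  for kv in "$@"; do
    url="$url&literal.$kv"
  done
  printf '%s\n' "$url"
}

extract_url doc4 blah_s=Bah
# prints http://localhost:8983/solr/update/extract?literal.id=doc4&literal.blah_s=Bah
```

The printed URL can then be handed to curl inside double quotes, for example as `"$(extract_url doc4 blah_s=Bah)"`, followed by the same file upload option as in the example above.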
  
  == XPath ==
@@ -170, +168 @@

  Restrict the XHTML returned by Tika by passing in an XPath expression:
  
  {{{
- curl http://localhost:8983/solr/update/extract?literal.id=doc5\&captureAttr=true\&defaultField=text\&capture=div\&fmap.div=foo_t\&boost.foo_t=3\&literal.id=id\&\&xpath=\/xhtml:html\/xhtml:body\/xhtml:div\/descendant:node\(\) -F "myfile=@tutorial.html"
+ curl "http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()" -F "myfile=@tutorial.html"
  }}}
  
  == Extract Only ==
  {{{
- curl http://localhost:8983/solr/update/extract?\&extractOnly=true  --data-binary @tutorial.html  -H 'Content-type:text/html'
+ curl "http://localhost:8983/solr/update/extract?&extractOnly=true"  --data-binary @tutorial.html  -H 'Content-type:text/html'
  }}}
  
  As the output includes XML generated by Tika (and is hence further escaped by 
Solr's XML), using a different output format enhances readability:
  {{{
- curl http://localhost:8983/solr/update/extract?\&extractOnly=true\&wt=ruby\&indent=true --data-binary @tutorial.html  -H 'Content-type:text/html'
+ curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true" --data-binary @tutorial.html  -H 'Content-type:text/html'
  }}}
  
  See TikaExtractOnlyExampleOutput.
@@ -188, +186 @@

  = Sending documents to Solr =
  
  // TODO: describe the different ways to send the documents to solr (POST 
body, form encoded, remoteStreaming)
-  * curl http://localhost:8983/solr/update/extract?literal.id=doc5\&defaultField=text  --data-binary @tutorial.html  -H 'Content-type:text/html'  
+  * curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text"  --data-binary @tutorial.html  -H 'Content-type:text/html'  
-        <!> NOTE, this literally streams the file, which does not, then, 
provide info to Solr about the name of the file.
+        <!> NOTE, this literally streams the file as the body of the POST, 
which does not, then, provide info to Solr about the name of the file.
   * SolrJ:  Use the ContentStreamUpdateRequest (see SolrExampleTests.java for 
full example):{{{
      ContentStreamUpdateRequest up = new 
ContentStreamUpdateRequest("/update/extract");
      up.addFile(new File("mailing_lists.pdf"));
