[Solr Wiki] Update of "UpdateRichDocuments" by EricPugh

Apache Wiki Tue, 04 Sep 2007 13:19:15 -0700

The following page has been changed by EricPugh:
http://wiki.apache.org/solr/UpdateRichDocuments

New page:
= Updating a Solr Index with Rich Documents such as PDF and MS Office =

Solr can index rich documents such as PDF and MS Office files through the 
solr.RichDocumentRequestHandler.  The handler is provided by the patch attached 
to [https://issues.apache.org/jira/browse/SOLR-284 SOLR-284] and uses PDFBox 
and POI to parse the documents.

[[TableOfContents]]

== Requirements ==
Solr 1.2 is the first version with CSV support for updates.
The CSV request handler needs to be configured in solrconfig.xml.
This should already be present in the example solrconfig.xml:
{{{
  <!-- CSV update handler, loaded on demand -->
  <requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy">
  </requestHandler>
}}}

== How to Install ==

1) You need a couple of patch files and zips of source and test code that are 
attached to the JIRA issue at https://issues.apache.org/jira/browse/SOLR-284.

2) Download the libs.zip, rich.patch, test-files.zip, source.zip, and test.zip 
files.

3) Unzip the libs.zip into SOLR_HOME/lib.  These are the jars required for 
parsing the rich documents, using PDFBox and POI.

4) Unzip the test-files.zip into SOLR_HOME/src/test/test-files/.  These are 
various test files for running the included unit tests.

5) Apply the rich.patch to your source.  Rich.patch has tweaks that add the 
solr.RichDocumentRequestHandler to your solrconfig.xml.

6) Copy the contents of source.zip into 
SOLR_HOME/src/java/org/apache/solr/handler

7) Copy the contents of test.zip into SOLR_HOME/src/test/org/apache/solr/handler

8) Run {{{ant test}}} to verify everything is working!
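
Steps 3 through 8 above can be sketched as a small shell function. This is only an illustration under several assumptions that the page does not confirm: the function name {{{install_rich_handler}}} and the {{{DRY_RUN}}} switch are made up for this example, the zips are assumed to unpack flat into the target directory, the patch is assumed to apply with {{{-p0}}} from the top of the checkout, and the test files are placed under {{{src/test/test-files}}}, which is where the example later on this page reads {{{simple.pdf}}} from.

```shell
#!/bin/sh
# Sketch of install steps 3-8. Not an official script.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1: SOLR_HOME, the top of the Solr source checkout
# $2: absolute path of the directory holding the files downloaded from SOLR-284
install_rich_handler() {
  solr_home="$1"
  downloads="$2"
  # Step 3: jars for PDFBox and POI
  run unzip -o "$downloads/libs.zip" -d "$solr_home/lib" &&
  # Step 4: test documents (target directory is an assumption, see above)
  run unzip -o "$downloads/test-files.zip" -d "$solr_home/src/test/test-files" &&
  # Step 5: apply the patch from the top of the checkout (-p0 is an assumption)
  run patch -d "$solr_home" -p0 -i "$downloads/rich.patch" &&
  # Steps 6 and 7: handler source and its unit tests
  run unzip -o "$downloads/source.zip" -d "$solr_home/src/java/org/apache/solr/handler" &&
  run unzip -o "$downloads/test.zip" -d "$solr_home/src/test/org/apache/solr/handler" &&
  # Step 8: run the unit tests
  run ant -f "$solr_home/build.xml" test
}
```

Running it first with {{{DRY_RUN=1}}} prints the six commands so the paths can be checked before anything is modified.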

== Methods of uploading Binary records ==
Binary records may be uploaded to Solr by sending the data to the 
/solr/update/rich URL.
All of the normal methods for [SolrContentStreams uploading content] are 
supported.

=== Example ===
There is a sample PDF file at {{{src/test/test-files/simple.pdf}}} that may be 
used to add a PDF to the Solr example server.

Example of using HTTP POST to send the PDF file over the network to the Solr 
server:
{{{
cd src/test/test-files
curl http://localhost:8983/solr/update/rich --data-binary @simple.pdf \
  -H 'Content-type:text/plain; charset=utf-8'
}}}

Having Solr read a binary file directly from its local file system can be more 
efficient than sending it over the network via HTTP.
Remote streaming must be enabled for this method to work.  Find the following 
line in {{{solrconfig.xml}}}, change it to {{{enableRemoteStreaming="true"}}}, 
and restart Solr.
{{{
  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
}}}

The following request will cause Solr to directly read the input file:
{{{
# NOTE: The full path, or a path relative to the CWD of the running Solr server,
# must be used.
curl 'http://localhost:8983/solr/update/rich?stream.file=src/test/test-files/simple.pdf'
}}}
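
Because a relative {{{stream.file}}} path is resolved against the working directory of the Solr server, passing an absolute path is usually less surprising. The following is a minimal sketch of that idea; {{{SOLR_URL}}} and {{{rich_stream_url}}} are illustrative names, not part of Solr, and resolving against the current directory only makes sense when the command runs on the same machine as Solr.

```shell
#!/bin/sh
# Sketch: build an /update/rich URL whose stream.file parameter is absolute.
# SOLR_URL and rich_stream_url are hypothetical names used for illustration.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

rich_stream_url() {
  # $1: path of the file as seen from the machine running Solr
  case "$1" in
    /*) abs="$1" ;;          # already absolute, use as given
    *)  abs="$(pwd)/$1" ;;   # resolve against the current directory
  esac
  printf '%s/update/rich?stream.file=%s\n' "$SOLR_URL" "$abs"
}

# Usage (needs a running Solr with remote streaming enabled):
#   curl "$(rich_stream_url src/test/test-files/simple.pdf)"
```

The sketch does not URL-encode the path, so paths containing spaces or characters such as {{{&}}} would need to be encoded first.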

== Parameters ==
Some parameters may be specified on a per-field basis via 
{{{f.<fieldname>.param=value}}}.


=== fieldnames ===
Specifies a comma-separated list of field names to use when adding documents to 
the Solr index.  If the CSV input already has a header, the names specified by 
this parameter will override them.

Example: {{{fieldnames=id,name,category}}}



=== overwrite ===
If {{{true}}} (the default), overwrite documents based on the uniqueKey field 
declared in the Solr schema.

=== commit ===
Commit changes after all records in this request have been indexed.  The 
default is {{{commit=false}}} to avoid the potential performance impact of 
frequent commits.
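
As a rough illustration of how the parameters above combine, the sketch below builds an update URL that carries {{{fieldnames}}}, {{{overwrite}}}, and {{{commit}}}. It assumes the parameters are passed as ordinary URL query parameters; {{{SOLR_URL}}} and {{{rich_update_url}}} are illustrative names, not part of Solr, and the defaults mirror the ones described in this section.

```shell
#!/bin/sh
# Sketch: combine the documented parameters into one /update/rich URL.
# SOLR_URL and rich_update_url are hypothetical names used for illustration.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

rich_update_url() {
  # $1: comma-separated field names, e.g. id,name,category
  # $2: overwrite flag, defaults to true
  # $3: commit flag, defaults to false
  printf '%s/update/rich?fieldnames=%s&overwrite=%s&commit=%s\n' \
    "$SOLR_URL" "$1" "${2:-true}" "${3:-false}"
}

# Usage (needs a running Solr with the rich document handler installed):
#   curl "$(rich_update_url id,name,category true true)" \
#     --data-binary @simple.pdf -H 'Content-type:text/plain; charset=utf-8'
```

Passing {{{commit=true}}} only on the last request of a batch keeps the cost of committing low while still making the new documents searchable at the end.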

== Disadvantages ==
There is no way to provide document or field index-time boosts with the CSV 
format; however, many indexes do not use that feature.
