The idea of having Solr handle various document types is a good one,
for sure. I'm not sure what specifics would need to be implemented,
but I at least wanted to reply and say it's a good idea!
Care has to be taken when passing a URL to Solr for it to go fetch,
though. There are a lot of complexities in fetching resources via
HTTP, especially when handing something off to Solr which should be
behind a firewall and may not be able to see the web as you would
with your browser.
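
One way to limit the exposure is for the indexer to refuse any URL it was
not explicitly configured to fetch. A minimal sketch, assuming a
hypothetical allowlist of hosts (the host name and the is_fetchable helper
are invented for illustration, not part of Solr):

```python
from urllib.parse import urlsplit

# Assumption: only plain web URLs on known hosts are ever fetched.
ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_HOSTS = {"docs.example.com"}  # hypothetical host


def is_fetchable(url):
    """Return True only for URLs the indexer is configured to fetch."""
    parts = urlsplit(url)
    # urlsplit lowercases the scheme, and .hostname is lowercased too;
    # .hostname is None for URLs without a network location (file:, etc.).
    return parts.scheme in ALLOWED_SCHEMES and parts.hostname in ALLOWED_HOSTS
```

This only decides whether a fetch should be attempted. Timeouts, redirects,
authentication and proxies would still need handling in the fetch itself.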
Erik
On Jan 4, 2007, at 4:53 PM, Alan Burlison wrote:
Original problem statement:
----------
I'm considering using Solr to replace an existing bare-metal Lucene
deployment - the current Lucene setup is embedded inside an
existing monolithic webapp, and I want to factor out the search
functionality into a separate webapp so it can be reused more easily.
At present the content of the Lucene index comes from many
different sources (web pages, documents, blog posts etc) and can be
different formats (plaintext, HTML, PDF etc). All the various
content types are rendered to plaintext before being inserted into
the Lucene index.
The net result is that the data in one field in the index (say
"content") may have come from one of a number of source document
types. I'm having difficulty understanding how I might map this
functionality onto Solr. I understand how (for example) I could
use HTMLStripStandardTokenizer to insert the contents of an HTML
document into a field called "content", but (assuming I'd written a
PDF analyser) how would I insert the content of a PDF document into
the same "content" field?
I know I could do this by preprocessing the various document types
to plaintext in the various Solr clients before inserting the data
into the index, but that means that each client would need to know
how to do the document transformation. As well as centralising the
index, I also want to centralise the handling of the different
document types.
----------
My initial suggestion, to get the discussion started, is to extend
the <doc> and <field> elements with the following attributes:
mime-type
MIME type of the document, e.g. application/pdf, text/html and so on.
encoding
Encoding of the document, with base64 being the standard
implementation.
href
The URL of any documents that can be accessed over HTTP, instead of
embedding them in the indexing request. The indexer would fetch
the document using the specified URL.
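
As a rough illustration of what a client might send under this proposal,
the sketch below builds an add request with an embedded, base64-encoded
document. The mime-type and encoding attributes are the ones suggested
above and are not something Solr accepts today; the build_add_request
helper and the "id" field are made up for the example.

```python
import base64
import xml.etree.ElementTree as ET


def build_add_request(doc_id, payload, mime_type):
    """Build an <add> request whose "content" field embeds a binary document.

    The mime-type and encoding attributes are the ones proposed in this
    thread; they are hypothetical and not part of Solr's update format.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    id_field = ET.SubElement(doc, "field", {"name": "id"})
    id_field.text = doc_id

    content = ET.SubElement(
        doc,
        "field",
        {"name": "content", "mime-type": mime_type, "encoding": "base64"},
    )
    # Base64 keeps arbitrary bytes safe inside the XML text node.
    content.text = base64.b64encode(payload).decode("ascii")

    return ET.tostring(add, encoding="unicode")


request_xml = build_add_request("doc-1", b"%PDF-1.4 example bytes", "application/pdf")
```

The client no longer needs to know how to turn a PDF into plaintext; it
only has to label the bytes it sends.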
There would then be entries in the configuration file that map each
MIME type to a handler that is capable of dealing with that
document type.
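
The mapping could be as simple as a lookup table from MIME type to a
function that returns plaintext. The sketch below shows the idea in Python
rather than in Solr's own configuration; the handler names and the
extract_text function are assumptions for illustration, and the HTML
handler is deliberately naive.

```python
import html.parser


class _TextExtractor(html.parser.HTMLParser):
    """Collects the character data of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(raw):
    # Naive: assumes UTF-8 and keeps script/style text as well.
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8"))
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(" ".join(parser.chunks).split())


def plain_to_text(raw):
    return raw.decode("utf-8")


# Hypothetical equivalent of the configuration entries: MIME type -> handler.
HANDLERS = {
    "text/html": html_to_text,
    "text/plain": plain_to_text,
}


def extract_text(mime_type, raw):
    """Dispatch to the handler registered for mime_type."""
    try:
        handler = HANDLERS[mime_type]
    except KeyError:
        raise ValueError("no handler configured for MIME type %r" % mime_type)
    return handler(raw)
```

A PDF handler would be one more entry in the table, and every handler
would feed the same "content" field.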
Thoughts?
--
Alan Burlison