Re: Handling disparate data sources in Solr

Erik Hatcher Sun, 07 Jan 2007 07:24:56 -0800

The idea of having Solr handle various document types is a good one, for sure. I'm not sure what specifics would need to be implemented, but I at least wanted to reply and say it's a good idea!

Care has to be taken when passing a URL to Solr for it to go fetch, though. There are a lot of complexities in fetching resources via HTTP, especially when handing something off to Solr, which should be behind a firewall and may not be able to see the web as you would with your browser.


        Erik


On Jan 4, 2007, at 4:53 PM, Alan Burlison wrote:

Original problem statement:

----------
I'm considering using Solr to replace an existing bare-metal Lucene deployment - the current Lucene setup is embedded inside an existing monolithic webapp, and I want to factor out the search functionality into a separate webapp so it can be reused more easily.

At present the content of the Lucene index comes from many different sources (web pages, documents, blog posts etc) and can be different formats (plaintext, HTML, PDF etc). All the various content types are rendered to plaintext before being inserted into the Lucene index.

The net result is that the data in one field in the index (say "content") may have come from one of a number of source document types. I'm having difficulty understanding how I might map this functionality onto Solr. I understand how (for example) I could use HTMLStripStandardTokenizer to insert the contents of a HTML document into a field called "content", but (assuming I'd written a PDF analyser) how would I insert the content of a PDF document into the same "content" field?

I know I could do this by preprocessing the various document types to plaintext in the various Solr clients before inserting the data into the index, but that means that each client would need to know how to do the document transformation. As well as centralising the index, I also want to centralise the handling of the different document types.
----------
My initial suggestion, to get the discussion started, is to extend the <doc> and <field> elements with the following attributes:

mime-type
MIME type of the document, e.g. application/pdf, text/html and so on.

encoding
Encoding of the document, with base64 being the standard implementation.

href
The URL of any documents that can be accessed over HTTP, instead of embedding them in the indexing request. The indexer would fetch the document using the specified URL.

There would then be entries in the configuration file that map each MIME type to a handler that is capable of dealing with that document type.
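
To make the proposal concrete, here is a minimal Python sketch of how a <field> carrying the suggested mime-type and encoding attributes might be built on the client and dispatched to a handler on the indexing side. Nothing here is existing Solr behaviour: the attribute names come from the suggestion above, while HANDLERS, build_field, extract_text and the two handler functions are hypothetical names invented for illustration, and the dictionary merely stands in for the proposed configuration entries. The href case is left out.

```python
import base64
import re
import xml.etree.ElementTree as ET


# Hypothetical handlers: each turns raw document bytes into plain text.
def handle_plain(data: bytes) -> str:
    return data.decode("utf-8")


def handle_html(data: bytes) -> str:
    # Crude tag stripping, for illustration only (not a real HTML parser).
    return re.sub(r"<[^>]+>", "", data.decode("utf-8"))


# Stand-in for the proposed configuration entries (MIME type -> handler).
HANDLERS = {
    "text/plain": handle_plain,
    "text/html": handle_html,
}


def build_field(name: str, data: bytes, mime_type: str) -> ET.Element:
    """Client side: embed the document, base64-encoded, in a <field>."""
    field = ET.Element(
        "field",
        {"name": name, "mime-type": mime_type, "encoding": "base64"},
    )
    field.text = base64.b64encode(data).decode("ascii")
    return field


def extract_text(field: ET.Element) -> str:
    """Indexer side: decode the payload and hand it to the mapped handler."""
    mime_type = field.get("mime-type", "text/plain")
    handler = HANDLERS.get(mime_type)
    if handler is None:
        raise ValueError(f"no handler configured for {mime_type}")
    payload = field.text or ""
    if field.get("encoding") == "base64":
        data = base64.b64decode(payload)
    else:
        data = payload.encode("utf-8")
    return handler(data)


if __name__ == "__main__":
    field = build_field("content", b"<p>Hello <b>Solr</b></p>", "text/html")
    print(ET.tostring(field, encoding="unicode"))
    print(extract_text(field))
```

With this shape, supporting a PDF would only mean registering one more handler under application/pdf; the clients would not need to know how the conversion is done.
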
Thoughts?

--
Alan Burlison
--
