Re: Handling disparate data sources in Solr

Chris Hostetter Tue, 09 Jan 2007 14:03:55 -0800

: There's two cases I can think of:
:
: 1. The document is already decomposed into fields before the
: insert/update, but one or more of the fields requires special handling.


: 2. The document contains both metadata and content.  PDF is a good
: example of such a document type.

there's a third big example: multiple documents are composed into a single
stream of raw data, and you want Solr to extract the individual documents.
the simplest example of this case is pointing Solr at a CSV file where
each record is a document.
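
A minimal sketch of the CSV case, in Java. Everything here is an
assumption made for illustration: the class name CsvDocSplitter, the use
of the header row as field names, and the restriction to unquoted fields
with no embedded commas or newlines. It is not an existing Solr API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Solr): splits one raw CSV stream into
// one field map per record, so each record can be indexed as a document.
class CsvDocSplitter {

    // Assumes the first line is a header naming the fields, and that no
    // field is quoted or contains a comma or a newline.
    static List<Map<String, String>> split(Reader raw) {
        List<Map<String, String>> docs = new ArrayList<>();
        try {
            BufferedReader in = new BufferedReader(raw);
            String header = in.readLine();
            if (header == null) {
                return docs; // empty stream, no documents
            }
            String[] names = header.split(",", -1);
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines between records
                }
                String[] values = line.split(",", -1);
                Map<String, String> doc = new LinkedHashMap<>();
                // Pair each value with its header name; extra values on a
                // line, or missing trailing values, are silently ignored.
                for (int i = 0; i < names.length && i < values.length; i++) {
                    doc.put(names[i].trim(), values[i].trim());
                }
                docs.add(doc);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }
}
```

Feeding it a stream with the header "id,title" followed by two records
yields two maps, each keyed by "id" and "title".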

: And for both of these you'd need to be able to specify the mapping
: between the data/metadata in the source document and the corresponding
: Solr schema fields.  I'm not sure if you'd want this in the
: solrconfig.xml file or in the indexing request itself.  Doing it in
: solrconfig.xml means you could change the disposition of the indexed
: data without changing the clients submitting the content.

right ... i think that's something that could be controlled on a per
"parser" basis, much the way RequestHandlers can currently take in a lot
of options at request time, but can also have default values (or
invariant values) specified for those options in the solrconfig when they
are registered.
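
A rough sketch of that precedence, in Java, assuming the same ordering
RequestHandlers use: configured defaults are the fallback, request-time
options override them, and configured invariants override everything. The
class name ParserOptions and the option names used in the example are
hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Solr): resolves the effective options
// for one update request handled by a configured "parser".
class ParserOptions {

    // Later putAll calls win, so the precedence is:
    // defaults < request-time options < invariants.
    static Map<String, String> resolve(Map<String, String> defaults,
                                       Map<String, String> request,
                                       Map<String, String> invariants) {
        Map<String, String> effective = new HashMap<>(defaults);
        effective.putAll(request);
        effective.putAll(invariants);
        return effective;
    }
}
```

With a default of separator="," and an invariant of idField="id", a
request that sends separator=";" and idField="sku" ends up with
separator=";" and idField="id": the client can tune the separator but
cannot change which field is the id.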

: That was the reasoning behind my initial suggestion:
:
: | Extend the <doc> and <field> element with the following attributes:

Right, i was suggesting we take it to the next level, and allow for
plugins to handle updates that don't need any XML encapsulation
at all -- the options and the raw data stream could be expressed entirely
in the HttpServletRequest for the update ... which would still allow us to
add the type of syntax you are describing to some new "XmlUpdateSource"
containing the refactored code which currently parses updates in SolrCore.
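
One possible shape for such plugins, sketched in Java. None of this
exists in Solr: the UpdateSource interface, the UpdateSourceRegistry
class, and the idea of picking a plugin by a registered name are all
assumptions for illustration. The options map stands in for whatever is
read from the request parameters, and the Reader for the raw request body.

```java
import java.io.Reader;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plugin contract: turn a raw stream plus options into
// documents, each represented here as a simple field map.
interface UpdateSource {
    List<Map<String, String>> parse(Reader raw, Map<String, String> options);
}

// Hypothetical registry: maps a name (as might be configured in
// solrconfig.xml) to the plugin that handles that kind of update.
class UpdateSourceRegistry {
    private final Map<String, UpdateSource> byName = new HashMap<>();

    void register(String name, UpdateSource source) {
        byName.put(name, source);
    }

    // Looks up the named plugin and hands it the raw stream and options.
    List<Map<String, String>> handle(String name, Reader raw,
                                     Map<String, String> options) {
        UpdateSource source = byName.get(name);
        if (source == null) {
            throw new IllegalArgumentException(
                "no update source registered as: " + name);
        }
        return source.parse(raw, options);
    }
}
```

Under this sketch, an XML-based source would just be one more
registered implementation alongside a CSV or PDF one.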


-Hoss
