Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Yonik Seeley Fri, 12 Jan 2007 12:41:35 -0800

On 1/10/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

The one hitch i think to the notion that updates and queries map
cleanly with something like this...


  SolrRequestHandler => SolrUpdateHandler
  SolrQueryRequest => SolrUpdateRequest
  SolrQueryResponse => SolrUpdateResponse (possibly the same class)
  QueryResponseWriter => UpdateResponseWriter (possibly the same class)

...is that with queries, the "input" tends to be fairly simple.  very
generic code can be run by the query Servlet to get all of the input
params and build the SolrQueryRequest ... but with updates this isn't
quite as simple.  there's the two issues i spoke of in my earlier mail
which should be independently configurable:
  1) where does the "stream" of update data come from?  is it in the raw
     POST body? is it in a POSTed multi-part MIME part? is it a remote
     resource referenced by URL?
  2) how should the raw binary stream of update data be parsed?  is it
     XML? (in the current update format)  is it a CSV file?  is it a PDF?

...#2 can be what the SolrUpdateHandler interface is all about -- when
hitting the update url you specify a "ut" (update type) that determines
that logic ... but it should be independent of #1


Right, you're getting at issues of why I haven't committed my CSV handler yet.
It currently handles reading a local file (this is more like an SQL
update handler... only a reference to the data is passed).  But I also
wanted to be able to handle a POST of the data, or even a file
upload from a browser.  Then I realized that this should be generic...
the same should also apply to XML updates, and potential future update
formats like JSON.
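
To make the "generic" idea concrete, here is a minimal sketch in Python (not Solr code) that keeps the two questions apart: where the update data comes from, and how it is parsed. Every name in it (open_stream, PARSERS, load_docs) is hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical parsers: each takes a text stream and returns a list of documents.
def parse_json(stream):
    data = json.load(stream)
    return data if isinstance(data, list) else [data]

def parse_csv(stream, separator=","):
    return list(csv.DictReader(stream, delimiter=separator))

# Hypothetical registry keyed by format name.
PARSERS = {"json": parse_json, "csv": parse_csv}

def open_stream(post_body=None, local_path=None):
    """Resolve the data source without knowing anything about its format."""
    if post_body is not None:
        return io.StringIO(post_body)
    if local_path is not None:
        return open(local_path, encoding="utf-8", newline="")
    raise ValueError("no update data supplied")

def load_docs(fmt, **source):
    """Combine any source with any parser."""
    with open_stream(**source) as stream:
        return PARSERS[fmt](stream)
```

With this split, a CSV document can arrive as a POST body or as a local file reference and go through the same parser, and adding a new format would not touch the source handling.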

The most important issue is to nail down the external HTTP interface.
If the URL structure changes, it's also an opportunity to change
whatever we don't like about the current XML format.  The old update
URL can still implement the original syntax.
It's also an opportunity to make the interface a little more REST-like
if we so choose.

Brainstorming:
- for errors, use HTTP error codes instead of putting it in the XML as now.

- perhaps get rid of the enclosing <add>... that could be a verb in
the URL, or for multiple documents, change it to <docs>.

- add information about the data in the URL:

POST /solr/add?format=json&overwrite=true
[
 {"field1":"value1", "field2":[false,true,false,true,true]}
]

POST /solr/add?format=csv&separator=,&...
field1,field2
val1,val2

This is more flexible as it allows one to add more metadata about the
data w/o having to change the data format.  For example, if one wanted
to be able to specify which index the add should go to, or other info
about the handling of the data, it's simple to add an additional param
in the URL.
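
As a rough illustration of how URL parameters could carry that metadata, the Python sketch below pulls the verb and the handling options out of a request target. The function name, the KNOWN set, and the defaults (format "xml", overwrite "true", separator ",") are assumptions made for the example, not anything decided in this thread.

```python
from urllib.parse import urlsplit, parse_qs

# Parameters this sketch understands; anything else is passed through as "extra".
KNOWN = {"format", "overwrite", "separator"}

def parse_update_request(request_target):
    parts = urlsplit(request_target)
    params = parse_qs(parts.query, keep_blank_values=True)

    def first(name, default):
        # parse_qs returns a list per parameter; take the first value.
        return params.get(name, [default])[0]

    return {
        "verb": parts.path.rstrip("/").rsplit("/", 1)[-1],
        "format": first("format", "xml"),                   # assumed default
        "overwrite": first("overwrite", "true") == "true",  # assumed default
        "separator": first("separator", ","),               # assumed default
        "extra": {k: v for k, v in params.items() if k not in KNOWN},
    }
```

A new option such as a target index would simply show up under "extra" without any change to the body format.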

- For browser friendliness, we could support a standard mechanism for
putting the body in the URL (not for general use since the URL can be
size limited, but good for testing).

POST /solr/add?format=json&overwrite=true&body=[{"field1":"value1"}]

- more REST like?
PUT /solr/document/1003?title=howdy&author=snafoo&cat=misc&cat=book
#not sure I like that format, and we would still want the multi-doc
format anyway

- more REST like?
DELETE /solr/document/1003
 OR
DELETE /solr/document?id=1003
 OR
POST /solr/document/delete?id=1003

#how to do delete-by-query, optimize, etc?
DELETE/POST /solr/document/delete?q=id:[10 TO 20]
 OR
POST /solr/command/delete?id=1002&id=1003&q=id:[1000 TO 1010]
 OR
POST /solr/command/deletebyquery?q=id:[10 TO 20]

POST /solr/command/optimize?wait=true

- administrative commands, setting certain limits

POST /solr/command/set?mergeFactor=100&maxBufferedDocs=1000
POST /solr/command/set?logLevel=3
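
One possible shape for the command URLs, combined with the idea of reporting errors through HTTP status codes, is sketched below in Python. It assumes the parameters follow a "?" in the URL and that "set" takes integer values; the handler names and the specific status codes chosen are illustrative assumptions only.

```python
from urllib.parse import urlsplit, parse_qs

def optimize(params):
    wait = params.get("wait", ["false"])[0] == "true"
    return f"optimize requested (wait={wait})"

def set_options(params):
    # int() raises ValueError for non-integer values, which becomes a 400 below.
    settings = {name: int(values[0]) for name, values in params.items()}
    return f"set {settings}"

# Hypothetical command table.
COMMANDS = {"optimize": optimize, "set": set_options}

def handle_command(request_target):
    """Return (http_status, message) instead of embedding errors in the body."""
    parts = urlsplit(request_target)
    prefix = "/solr/command/"
    if not parts.path.startswith(prefix):
        return 404, "not a command URL"
    command = COMMANDS.get(parts.path[len(prefix):])
    if command is None:
        return 404, "unknown command"
    try:
        return 200, command(parse_qs(parts.query))
    except ValueError as exc:
        return 400, f"bad parameter: {exc}"
```

A client then only needs to check the status code to know whether a command succeeded, regardless of the response format.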

You get the idea of some of the options available.
Ideas?  Thoughts?

-Yonik
