Re: Update Plugins (was Re: Handling disparate data sources in Solr)

J.J. Larrea Tue, 16 Jan 2007 10:58:09 -0800

I'm in frantic deadline mode so I'm just going to throw in some (hopefully) 
short comments...

At 11:02 PM -0800 1/15/07, Ryan McKinley wrote:
>>the one thing that still seems missing is those "micro-plugins" i was
>> [SNIP]
>>
>>  interface SolrRequestParser {
>>     SolrRequest process( HttpServletRequest req );
>>  }
>>
>
>
>I left out "micro-plugins" because i don't quite have a good answer
>yet :)  This may a place where a custom dispatcher servlet/filter
>defined in web.xml is the most appropriate solution.

If the issue is munging HTTPServletRequest information, then a proper 
separation of concerns suggests responsibility should lie with a Servlet 
Filter, as Ryan suggests.

For example, while the Servlet 2.4 spec doesn't have specifications for how the 
servlet container can/should "burst" a multipart-MIME payload into separate 
files or streams, there are a number of 3rd party Filters which do this.

The Iterator<ContentStream> is a great idea because if each stream is read to 
completion before the next is opened it doesn't impose any limitation on 
individual stream length and doesn't require disk buffering.

(Of course some handlers may require access to more than one stream at a time; 
each time next() is called on the iterator before the current stream is closed, 
the remainder of that stream will have to be buffered in memory or on disk, 
depending on the part length.  Nonetheless that detail can be entirely hidden 
from the handler, as it should be.  I am not sure if any available 
ServletFilter implementations work this way, but it's certainly doable.)

But that detail is irrelevant for now; as I suggest below, using this API lets 
one immediately implement it with only next() value of the entire POST stream; 
that would answer the needs of the existing update request handling code, but 
establish an API to handle multi-part.  Whenever someone wants to write a 
multi-stream handler, they can write or find a better Iterator<ContentStream> 
implementation, which would best be cast as a ServletFilter.

>I like the SolrRequestParser suggestion.

Me too.  It answers a hole in my vision for how this can all fit together.

>Consider:
>qt='RequestHandler'
>wt='ResponseWriter'
>rp='RequestParser ' (rb='SolrBuilder'?)
>
>To avoid possible POST read-ahead stream mungling: qt,wt, and rp
>should be defined by the URL, not parameters.  (We can add special
>logic to allow /query?qt=xxx)
>
>For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people
>define arbitrary path mapping for qt.
>
>We could append 'wt', 'rb', and arbitrary arbitrary text to the
>registered path, something like
> /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...
>
>(any other syntax ideas?)

No need for new syntax, I think.  The pathInfo or qt or other source resolves 
to a requestHandler CONFIG name.  The handler config is read to determine the 
handler class name.  It also can be consulted (with URL or form-POST params 
overriding if allowed by the  config) to decide which RequestParser to invoke 
BEFORE IT IS CALLED and which ResponseWriter to invoke AFTER.  Once those 
objects are set up, the request body gets executed.

Handler config inheritance (as I proposed in SOLR-104 point #2) would greatly 
simplify, for example, creating a dozen query handlers which used a particular 
invariant combination of qt, wt, and rp

>The 'standard' RequestParser would:
>GET:
> fill up SolrParams directly with req.getParameterMap()
>if there is a 'post' parameter (post=XXX)
>  return a stream with XXX as its content
>else
>  empty iterator.
>Perhaps add a standard way to reference a remote URI stream.
>
>POST:
> if( multipart ) {
>  read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says 
is suppose to be automatically by the servlet container if the payload is 
application/x-www-form-urlencoded; in that case the input stream should be null.

>  return an iterator over the collection of files

Collection of streams, per Hoss.

>}
>else {
>  no parameters? parse parameters from the URL? /name:value/
>  return the body stream

As above, this introduces unneeded complexity and should be avoided.

>}
>DEL:
> throw unsupported exception?
>
>
>Maybe each RequestHandler could have a default RequestParser.  If we
>limited the 'arbitrary path' to one level, this could be used to
>generate more RESTful URLs. Consider:
>
>/myadder/1111/2222/3333/
>
>/myadder maps to MyCustomHandler and that gives you
>MyCustomRequestBuilder that maps /1111/2222/3333 to SolrParams

I think these are best left for an extra-SOLR layer, especially since SOLR URLs 
are meant for interprogram communication and not direct use by non-developer 
end users.  For example, for my org's website I have hundreds of Apache 
mod_rewrite rules which do URL munging such as
        /journals/abc/7/3/192a.pdf
into
        /journalroot/index.cfm?journal=abc&volume=7&issue=3
                &page=192&seq=a&format=pdf

Or someone could custom-code a subclass of SolrServlet which handles 
application-specific URL requirments.  But the base implementation should be as 
simple as possible - or perhaps more accurately, complex only where complexity 
is really called for: query caching, faceting, and the like.

>>one last thought: while the interfaces you outlined would make a lot
>>of sense if we were starting from scratch, there are probably several
>>cases where not having those exact names/APIs doesn't really hurt, and
>>would allow backwards compatibility with more of the current code (and
>>current SolrRequestHandler plugin people have written) ... just something
>>we should keep in mind: we don't want to go hog wild renaming a lot of
>>stuff and alienating our existing "plugin" user base. (nor do we want to
>>make a bunch of unneccessary config file format changes)
>>
>
>I totally understand and agree.
>
>Perhaps the best approach is to offer a SolrRequestProcessor framework
>that can sit next to the existing SolrRequestHandler without affecting
>it much (if at all).  For what i have suggested, i *think* it could
>all be done with simple additions to solrschema.xml that would still
>work on an unedited 1.1.0 solrconfig.xml

Yah

>
>If we use a servletFilter for the dispatcher, this can sit next to the
>current /query?xxx servlet without problem.  When the
>SolrRequestProcessor framework is rock solid, we would @Deprecate
>SolrRequestHandler and change the default solrconfig.xml to map /query
>to the new framework.

I think there are some small to moderate steps which should be done within the 
SOLR 1.x where x < 5 framework, where one wants to be non-API breaking as much 
as possible, and some bigger improvements which should be contemplated for an 
API-incompatible release.

For the short term, I think the following make sense:

- Refactor the request handlers (based in part on what Ryan has already done in 
SOLR-102/20) to unify query and update requests.

- Move scattered functionality to UpdateCommand and xxxUpdateCommmand 
implementations per my comments #3 SOLR

- It would make sense to establish the Iterator<ContentStream> API in but 
initially only provide a single stream, the raw POST when not form-urlencoded. 

- Revise the XML-based update code (broken out of SolrCore into a 
RequestHandler) to use all the above.

-  To keep existing configs unmodified, we would want to continue to use qt as 
the arg even for update.

- Add handler config inheritance and structure the sample solrconfig to suggest 
structuring handlers as a hierarchy

- Change the servlet (or add a filter) to use pathInfo to select the handler. 
For the moment, the servlet could simply take any pathINFO and store it in qt.  
So
        /<base>?qt=standard
  and   /<base>/standard
are identical, and
        /<base>/query/products/instock would be the same as
        /base?qt=query/products/instock
and match a response handler name="query/products/instock"

- Provide config examples which exercise the new mechanisms.

This would have immediate benefits for plugin writers (both query and update), 
config file writers (shorter files, request namespace structuring), command 
issuers (having the request part of the pathInfo rather than qt seems to make 
sense to everyone), and so forth.

With that done, as a second step, having Request Parsing plugins makes great 
sense.  I'll save further commentary on that for another email.

- J.J.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Reply via email to