Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-22 Thread Chris Hostetter

:3) there's a comment in RequestHandlerBase.init about indexOf that
:   comes from the existing impl in DismaxRequestHandler -- but doesn't match
:   the new code ... i also wasn't certain that the change you made matches

:  I just copied the code from DismaxRequestHandler and made sure it
:  passes the tests.  I don't totally understand what that case is doing.
:
: The first iteration of dismax (before we did generic defaults,
: invariants, etc for request handlers) took defaults directly from the
: init params, and that is what that case is checking for and

bingo .. the reason it jumped out at me in your patch, is that the comment
still referred to indexOf, but the code didn't ... it might be functionally
equivalent, i just wasn't sure when i did my quick read.

there's mention in the comment that indexOf is used so that a
<null name="defaults"/> can indicate that you don't want all the init params as
defaults, but you don't actually want defaults either -- but there
doesn't seem to be a test for that case.

you can see support for the legacy defaults syntax in
src/test/test-files/solr/conf/solrconfig.xml if you grep for
dismaxOldStyleDefaults



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Chris Hostetter

:  ...i was trying to keep the parser name out of the query string,
:  so we don't have to do any hacky parsing of
:  HttpServletRequest.getQueryString() to get it.
:
: We need code to do that anyway since getParameterMap() doesn't support
: getting params from the URL if it's a POST (I believe I tried this in
: the past and it didn't work).

Uh ... i'm pretty sure you are mistaken ... yep, i've just checked and you
are *definitely* mistaken.

getParameterMap will in fact pull out params from both the URL and the
body if it's a POST -- but only if you have not already accessed either
getReader or getInputStream -- this was at the heart of my cumbersome
preProcess/process API that we all agree now was way too complicated.

At the bottom of this email is a quick and dirty servlet i just tried to
prove to myself that posting with params in the URL and the body worked
fine ... i do remember reading up on this a few years back and verifying
that it's documented somewhere in the servlet spec, a quick google search
points to this article implying it was solidified in 2.2...

   http://java.sun.com/developer/technicalArticles/Servlets/servletapi/
   (grep for Nit-picky on Parameters)


: Pluggable request parsers seem needlessly complex, and it gets harder
: to explain it all to someone new.
: Can't we start simple and defer anything like that until there is a real need?

Alas ... i appear to be getting worse at explaining myself in my old age.

What i was trying to say is that this idea i had for expressing
requestParsers as an optional prefix in front of the requestHandler would
allow us to worry about the things i'm worried about *later* -- if/when
they become a problem (or when i have time to stop whining, and actually
write the code)

The nut shell being: i'm totally on board with Ryan's simple URL scheme,
having a single RequestParser/SolrRequestBuilder, going with an entirely
inspection based approach for deciding where the streams come from, and
leaving all mention of parsers or stream.type out of the URL.

(because i have a good idea of how to support it in a backwards compatible
way *later*)



import java.io.IOException;
import java.util.Map;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class TestServlet extends HttpServlet {
  public void doPost(HttpServletRequest request, HttpServletResponse response)
      throws IOException {

    response.setContentType("text/plain");
    // getParameterMap() sees params from the query string *and* the
    // form-encoded POST body
    Map params = request.getParameterMap();
    for (Object k : params.keySet()) {
      Object v = params.get(k);
      if (v instanceof Object[]) {
        for (Object vv : (Object[]) v) {
          response.getWriter().println(k.toString() + ": " + vv);
        }
      } else {
        response.getWriter().println(k.toString() + ": " + v);
      }
    }
  }
}


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread J.J. Larrea
At 1:20 AM -0800 1/21/07, Chris Hostetter wrote:
: We need code to do that anyway since getParameterMap() doesn't support
: getting params from the URL if it's a POST (I believe I tried this in
: the past and it didn't work).

Uh ... i'm pretty sure you are mistaken ... yep, i've just checked and you
are *definitely* mistaken.

getParameterMap will in fact pull out params from both the URL and the
body if it's a POST -- but only if you have not already accessed either
getReader or getInputStream -- this was at the heart of my cumbersome
preProcess/process API that we all agree now was way too complicated.

The rules are very explicitly laid out in the Servlet 2.4 specification:

-
SRV.4.1.1 When Parameters Are Available
The following are the conditions that must be met before post form data will
be populated to the parameter set:
1. The request is an HTTP or HTTPS request.
2. The HTTP method is POST.
3. The content type is application/x-www-form-urlencoded.
4. The servlet has made an initial call of any of the getParameter family of 
methods on the request object.
If the conditions are not met and the post form data is not included in the
parameter set, the post data must still be available to the servlet via the 
request object's input stream. If the conditions are met, post form data will 
no longer be available for reading directly from the request object's input 
stream.
-

As Hoss notes, a POST request can still have GET-style parameters in the URL
query string, and getParameterMap will return both sets intermixed for a POST 
meeting the above conditions.  And calling getParameterMap won't impede the 
ability to subsequently read the input stream if the conditions are not met: 
the post data must still be available to the servlet.  So it's theoretically 
valid to simply call getParameterMap and then blindly call getInputStream 
(possibly catching an Exception), or else use the results of getParameterMap to 
decide whether and how to process the input stream.
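To make those rules concrete, here is a rough sketch (hypothetical helper, not
Solr code -- the class and method names are invented) of the decision a servlet
or filter could make:

import java.io.IOException;
import java.io.InputStream;
import javax.servlet.http.HttpServletRequest;

public class PostBodyHelper {

  /**
   * Per SRV.4.1.1: a form-encoded POST gets its body folded into the
   * parameter map (merged with any query-string params), after which the
   * input stream is no longer useful.  Anything else leaves the body
   * readable as a raw stream.
   */
  public static InputStream bodyStreamOrNull(HttpServletRequest req) throws IOException {
    String ct = req.getContentType();
    boolean formEncoded = "POST".equals(req.getMethod())
        && ct != null
        && ct.startsWith("application/x-www-form-urlencoded");

    if (formEncoded) {
      // body already consumed as parameters; use req.getParameterMap() instead
      return null;
    }
    // getParameterMap() is still safe to call for the query-string params,
    // and the body can still be read directly
    return req.getInputStream();
  }
}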

The bugaboo is if the POST data is NOT in fact 
application/x-www-form-urlencoded but the user agent says it is -- as both of 
you have indicated can be the case when using curl.  Could that be why Yonik 
thought POST params was broken?

- J.J.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Ryan McKinley


The nut shell being: i'm totally on board with Ryan's simple URL scheme,
having a single RequestParser/SolrRequestBuilder, going with an entirely
inspection based approach for deciding where the streams come from, and
leaving all mention of parsers or stream.type out of the URL.

(because i have a good idea of how to support it in a backwards compatible
way *later*)



Great!  I just posted an update to SOLR-104 that I hope will make you happy.

It moved the various request parsing methods into distinct classes
that could easily be pluggable if that is necessary.  As written, it
supports stream.type=raw|multipart|simple|standard.  We can comment
that out and use 'standard' for everything as a first pass.

I added configuration to solrconfig.xml:
  <requestParsers enableRemoteStreaming="true"
                  multipartUploadLimitInKB="2048" />

I removed LegacySelectServlet and added an explicit check in the
DispatchFilter for paths starting with /select.  This seems like a
better idea as the logic and expected results are identical.
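(Purely for illustration -- this is not the actual SOLR-104 code, and the
helper below is made up -- the explicit check amounts to something like this:)

import javax.servlet.http.HttpServletRequest;

// Hypothetical sketch of the explicit legacy check: /select keeps its old
// qt-based dispatch, anything else is looked up by path.
public class LegacySelectCheck {

  public static String handlerNameFor(HttpServletRequest req) {
    String path = req.getServletPath();
    if (path.startsWith("/select")) {
      // legacy style: the qt parameter names the handler, as it always has
      String qt = req.getParameter("qt");
      return (qt == null || qt.length() == 0) ? "standard" : qt;
    }
    // new style: the path itself is the handler name registered with a leading '/'
    return path;
  }
}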

If i'm following our discussion correctly, I *think* this takes care
of all the major issues we have.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Yonik Seeley

On 1/21/07, J.J. Larrea [EMAIL PROTECTED] wrote:

The bugaboo is if the POST data is NOT in fact 
application/x-www-form-urlencoded but the user agent says it is -- as both of 
you have indicated can be the case when using curl.  Could that be why Yonik 
thought POST params was broken?


Correct.  That's the format that post.sh in the example sends
(application/x-www-form-urlencoded) and we ignore it in the update
handler and always treat the body as binary.

Now if you wanted to add some query args to what we already have, you
can't use getParameterMap().

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Yonik Seeley

On 1/21/07, Chris Hostetter [EMAIL PROTECTED] wrote:

At the bottom of this email is a quick and dirty servlet i just tried to
prove to myself that posting with params in the URL and the body worked
fine ...


I tried that by simply posting to the Solr standard request handler
(it echoes params in the example config), and yes, it worked fine. The
problem is if the body should be the stream, and the content-type is
wrong (and we currently send it wrong with curl).


The nut shell being: i'm totally on board with Ryan's simple URL scheme,
having a single RequestParser/SolrRequestBuilder, going with an entirely
inspection based approach for deciding where the streams come from, and
leaving all mention of parsers or stream.type out of the URL.

(because i have a good idea of how to support it in a backwards compatible
way *later*)


A.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Chris Hostetter

: Great!  I just posted an update to SOLR-104 that I hope will make you happy.

Dude ... i can *not* keep up with you.

: If i'm following our discussion correctly, I *think* this takes care
: of all the major issues we have.

I don't think i'll have time to look at your new patch today.  Design-wise
i think you are right, but there was still stuff that needed to be
refactored out of core.update and into the UpdateHandler, wasn't there?

a couple of minor comments i had when i read the last patch (but didn't
mention since i was focusing on design issues) ...

 1) why rename the servlets Legacy* instead of just marking them deprecated?
 2) getSourceId and getSource need to be left in the concrete Handlers so
they get filled in with the correct file version info on checkout.
 3) there's a comment in RequestHandlerBase.init about indexOf that
comes from the existing impl in DismaxRequestHandler -- but doesn't match
the new code ... i also wasn't certain that the change you made matches
the old semantics for dismax (i don't think we have a unit test for that
case)
 4) ContentStream.getFieldName() would probably be more general as
ContentStream.getSourceInfo() ... it could stay as it is for files/urls,
but raw posts and multipart posts could have a useful debugging
description as well.




-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Chris Hostetter

:  The bugaboo is if the POST data is NOT in fact
:  application/x-www-form-urlencoded but the user agent says it is -- as
:  both of you have indicated can be the case when using curl.  Could that
:  be why Yonik thought POST params was broken?
:
: Correct.  That's the format that post.sh in the example sends
: (application/x-www-form-urlencoded) and we ignore it in the update
: handler and always treat the body as binary.
:
: Now if you wanted to add some query args to what we already have, you
: can't use getParameterMap().

I think i mentioned this before, but I think what we should do is make the
stream guessing code in the Dispatcher/RequestBuilder very strict, and
make its decision about how to treat the post body entirely based on the
Content-Type ... meanwhile the existing (eventually known as old) way of
doing updates via /update to the UpdateServlet can be more lax, and
assume everything is a raw POST of XML.

we can change post.sh to specify XML as the Content-Type by default,
modify the example schema to have other update handlers registered with
names like /update/csv and eventually add an /update/xml encouraging
people to use it if they want to send updates as xml documents, regardless
of whether they want to POST them raw, upload them, or identify them by
filename -- as long as they are explicit about their content type.
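(a rough sketch of what that strict guessing could look like -- hypothetical
names only, none of this is actual Solr code:)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpServletRequest;

// Illustrative only: decide where the streams come from based strictly on the
// Content-Type, and refuse to guess when it is missing.
public class StrictStreamGuesser {

  public static List guessStreams(HttpServletRequest req) throws IOException {
    List streams = new ArrayList();            // of InputStream
    if (!"POST".equals(req.getMethod())) {
      return streams;                          // plain GET: params only, no streams
    }
    String ct = req.getContentType();
    if (ct == null) {
      throw new IOException("POST requires an explicit Content-Type");
    }
    if (ct.startsWith("application/x-www-form-urlencoded")) {
      return streams;                          // body is parameters, not a stream
    }
    if (ct.startsWith("multipart/form-data")) {
      // one stream per uploaded part, e.g. handed to commons-fileupload
      // (left out of this sketch)
      return streams;
    }
    // anything else (text/xml, text/csv, ...) is one raw body stream
    streams.add(req.getInputStream());
    return streams;
  }
}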



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Yonik Seeley

On 1/21/07, Chris Hostetter [EMAIL PROTECTED] wrote:

:  The bugaboo is if the POST data is NOT in fact
:  application/x-www-form-urlencoded but the user agent says it is -- as
:  both of you have indicated can be the case when using curl.  Could that
:  be why Yonik thought POST params was broken?
:
: Correct.  That's the format that post.sh in the example sends
: (application/x-www-form-urlencoded) and we ignore it in the update
: handler and always treat the body as binary.
:
: Now if you wanted to add some query args to what we already have, you
: can't use getParameterMap().

I think i mentioned this before, but I think what we should do is make the
stream guessing code in the Dispatcher/RequestBuilder very strict, and
make its decision about how to treat the post body entirely based on the
Content-Type ... meanwhile the existing (eventually known as old) way of
doing updates via /update to the UpdateServlet can be more lax, and
assume everything is a raw POST of XML.

we can change post.sh to specify XML as the Content-Type by default,
modify the example schema to have other update handlers registered with
names like /update/csv and eventually add an /update/xml encouraging
people to use it if they want to send updates as xml documents, regardless
of whether they want to POST them raw, upload them, or identify them by
filename -- as long as they are explicit about their content type.


I think I agree with all that.

A long time ago in this thread, I remember saying that new URLs are an
opportunity to change request/response formats w/o worrying about
backward compatibility.

So is everyone happy with the way that errors are currently reported?
If not, now (or right after this is committed), is the time to change
that.  /solr/select?qt=myhandler should be backward compatible, but
/solr/myhandler doesn't need to be.  Same for the update stuff.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Ryan McKinley


So is everyone happy with the way that errors are currently reported?
If not, now (or right after this is committed), is the time to change
that.  /solr/select?qt=myhandler should be backward compatible, but
/solr/myhandler doesn't need to be.  Same for the update stuff.



In SOLR-104, all exceptions are passed to the client as HTTP status
codes with the message.  If you write:

  throw new SolrException( 400, "missing parameter: " + p );

This will return 400 with the message "missing parameter: " + p.

Exceptions or SolrExceptions with code=500 || code<100 are sent to the
client with status code 500 and a full stack trace.
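(For illustration only -- a hypothetical sketch of that mapping, not the actual
SOLR-104 code; the helper and its names are made up:)

import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import javax.servlet.http.HttpServletResponse;

// Illustrative only: turn a thrown exception into an HTTP status as described
// above.  "code" stands for whatever numeric code the SolrException carries.
public class ErrorResponder {

  public static void report(HttpServletResponse response, int code, Throwable t)
      throws IOException {
    if (code == 500 || code < 100) {
      // server-side (or nonsensical) codes: report 500 and include the stack trace
      StringWriter trace = new StringWriter();
      t.printStackTrace(new PrintWriter(trace));
      response.sendError(500, t.getMessage() + "\n\n" + trace);
    } else {
      // client errors such as 400 "missing parameter: ...": pass code + message through
      response.sendError(code, t.getMessage());
    }
  }
}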


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Yonik Seeley

On 1/21/07, Ryan McKinley [EMAIL PROTECTED] wrote:


 So is everyone happy with the way that errors are currently reported?
 If not, now (or right after this is committed), is the time to change
 that.  /solr/select?qt=myhandler should be backward compatible, but
 /solr/myhandler doesn't need to be.  Same for the update stuff.


In SOLR-104, all exceptions are passed to the client as HTTP status
codes with the message.  If you write:

  throw new SolrException( 400, "missing parameter: " + p );

This will return 400 with the message "missing parameter: " + p.

Exceptions or SolrExceptions with code=500 || code<100 are sent to the
client with status code 500 and a full stack trace.


That all seems ideal to me, but there had been talk in the past about
formatted responses on errors.  Given that even update handlers can
return full responses, I don't see the point of formatted (XML,etc)
response bodies when an exception is thrown.
Just making sure there's a consensus.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Ryan McKinley


I don't think i'll have time to look at your new patch today.  Design-wise
i think you are right, but there was still stuff that needed to be
refactored out of core.update and into the UpdateHandler, wasn't there?



Yes, I avoided doing that in an effort to minimize refactoring and
focus just on adding ContentStreams to RequestHandlers.

I just posted (yet another) update to SOLR-104.  This one moves the
core.update logic into UpdateRequestHandler, and adds some glue to make
old requests behave as they used to.

I also deprecated the exception in SolrQueryResponse.  Handlers should
throw the exception, not put it in the response.  (If you want error
messages, put that in the response, not the exception)

It still needs some cleanup and some idea what data/messages should be
returned in the SolrResponse.

The bottom of http://localhost:8983/solr/test.html has a form calling
/update2 with posted XML so you can see the output



a couple of minor comments i had when i read the last patch (but didn't
mention since i was focusing on design issues) ...

 1) why rename the servlets Legacy* instead of just marking them deprecated?


In the new version, I got rid of both Servlets and am handling the
'legacy' cases explicitly in the dispatch filter.  This minimizes the
duplicated code and keeps things consistent.



 2) getSourceId and getSource need to be left in the concrete Handlers so
they get filled in with the correct file version info on checkout.


done.


 3) there's a comment in RequestHandlerBase.init about indexOf that
: comes from the existing impl in DismaxRequestHandler -- but doesn't match
the new code ... i also wasn't certain that the change you made matches
the old semantics for dismax (i don't think we have a unit test for that
case)


When you get a chance to look at the patch, can you investigate this?
I just copied the code from DismaxRequestHandler and made sure it
passes the tests.  I don't totally understand what that case is doing.



 4) ContentStream.getFieldName() would probably be more general as
ContentStream.getSourceInfo() ...


done.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-21 Thread Erik Hatcher


On Jan 21, 2007, at 2:39 PM, Yonik Seeley wrote:


On 1/21/07, Ryan McKinley [EMAIL PROTECTED] wrote:


 So is everyone happy with the way that errors are currently reported?
 If not, now (or right after this is committed), is the time to change
 that.  /solr/select?qt=myhandler should be backward compatible, but
 /solr/myhandler doesn't need to be.  Same for the update stuff.


In SOLR-104, all exceptions are passed to the client as HTTP status
codes with the message.  If you write:

  throw new SolrException( 400, "missing parameter: " + p );

This will return 400 with the message "missing parameter: " + p.

Exceptions or SolrExceptions with code=500 || code<100 are sent to the
client with status code 500 and a full stack trace.


That all seems ideal to me, but there had been talk in the past about
formatted responses on errors.  Given that even update handlers can
return full responses, I don't see the point of formatted (XML,etc)
response bodies when an exception is thrown.
Just making sure there's a consensus.


Being able to check the HTTP status code to determine if there is an  
error, rather than having to parse XML and get a Solr-specific status  
code seems best for the Ruby work we're doing.  I'll confer with the  
others working on it and report back if they have any suggestions for  
improvement also.


Erik



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Alan Burlison

Chris Hostetter wrote:


: 1) I think it should be a ServletFilter applied to all requests that
: will only process requests with a registered handler.

I'm not sure what it is in the above sentence ... i believe from the
context of the rest of the message you are referring to
using a ServletFilter instead of a Servlet -- i honestly have no opinion
about that either way.


I thought a filter required you to open up the WAR file and change 
web.xml, or am I misunderstanding?


--
Alan Burlison
--


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Ryan McKinley

I just posted a new patch on SOLR-104.  I think it addresses most of
the issues we have discussed.  (It's a little difficult to know as it
has been somewhat circular.)   I was going to reply to your points one
by one, but i think that would just make the discussion more confusing
than it already is!



 (i don't trust HTTP Client code -- but for the sake
 of argument let's assume all clients are perfect) what happens when a
 person wants to send a mime multi-part message *AS* the raw post body -- so
 the RequestHandler gets it as a single ContentStream (ie: single input
 stream, mime type of multipart/mixed) ?

Multi-part posts will have the content-type set correctly, or it won't work.
The big use-case I see is browser file upload, and they will set it correctly.



I don't see it as a big problem because we don't have to deal with
legacy streams yet.  No one is expecting their existing stream code to
work.  The only header value the SOLR-104 code relies on is
'multipart'.  I think that is a reasonable constraint since it has to
be implemented properly for commons-file-upload to work.

ryan


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Chris Hostetter

:  A user should be confident that they can pick any name they possibly want
:  for their plugin, and it won't collide with any future addition we might
:  add to Solr.
:
: But that doesn't seem possible unless we make user plugins
: second-class citizens by scoping them differently.  In the event there
: is a collision in the future, the user could rename one of the
: plugins.

when it comes to URLs, our plugins currently are second class citizens --
plugin names appear in the qt or wt params -- users can pick any names
they want and they are totally legal, they don't have to worry about any
possibility that a name they pick will collide with a path we have mapped
to a servlet.

Users shouldn't have to change the names of requestHandlers just because
Solr adds a new feature with the same name -- changing a requestHandler
name could be a heavy burden for a Solr user to make depending on how many
clients *they* have using that requestHandler with that name.  i wouldn't
make a big deal out of this if it was unavoidable -- but it is such an
easy thing to deal with just by scoping the URLs .. put something,
ANYTHING, in front of these urls, that isn't select or update and
then put the requestHandler name and we've now protected ourselves and our
users.

consider the case where a user today has this in his solrconfig...

  <requestHandler name="select" class="solr.StandardRequestHandler" />

..with the URL structure you guys are talking about, with the
DispatchFilter matching on /* and interpreting the first part of the path
as a possible requestHandler name, that user can't upgrade Solr
because he's relying on the old /select?qt=select style URLs to
work ... he has to change the name of his requestHandler and all of his
clients, then upgrade, then change all of his clients again to take
advantage of the new URL structure (and the new features it provides for
updates)



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Chris Hostetter

:  that scares me ... not only does it rely on the client code sending the
:  correct content-type
:
: Not really... that would perhaps be the default, but the parser (or a
: handler) can make intelligent decisions about that.
:
: If you put the parser in the URL, then there's *that* to be messed up
: by the client.

but the HTTP Client libraries in various languages don't always make it
easy to set Content-Type -- and even if they do that doesn't mean the
person using that library knows how to use it properly -- but everyone
understands how to put something in a URL.  if nothing else, think of
putting the parser type in the URL as a checksum that the RequestParser
can use to validate its assumptions -- if it's not there, then it can do
all of the intelligent things you think it should do, but if it is there
that dictates what it should do.

(aren't you the one that convinced me a few years back that it was better
to trust a URL than to trust HTTP Headers? ... because people understand
URLs and put things in them, but they don't always know what headers to
send .. curl being the great example, it always sends a Content-Type even
if the user doesn't ask it to, right?)

: Multi-part posts will have the content-type set correctly, or it won't work.
: The big use-case I see is browser file upload, and they will set it correctly.

right, but my point is what if i want the multi-part POST body left alone
so my RequestHandler can deal with it as a single stream -- if i set
every header correctly, the smart parsing code will parse it -- which is
why something in the URL telling it *not* to parse it is important.

: We should not preclude wacky handlers from doing things for
: themselves, calling our stuff as utility methods.

how? ... if there is one and only one RequestParser which makes the
SolrRequest before the RequestHandler ever sees it, and parses the post
body because the content-type is multipart/mixed, how can a wacky
handler ever get access to the raw post body?



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Ryan McKinley


easy thing to deal with just by scoping the URLs .. put something,
ANYTHING, in front of these urls, that isn't select or update and


I'll let you and Yonik decide this one.  I'm fine either way, but I
really don't see a problem letting people easily override URLs.  I
actually think it is a good thing.




consider the case where a user today has this in his solrconfig...

  <requestHandler name="select" class="solr.StandardRequestHandler" />



To be clear, (with the current implementation in SOLR-104) you would
have to put this in your solrconfig.xml

<requestHandler name="/select" class="solr.StandardRequestHandler" />

Notice the preceding '/'.  I think this is a strong indication that
someone *wants* /select to behave distinctly.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Chris Hostetter

: I just posted a new patch on SOLR-104.  I think it addresses most of
: the issues we have discussed.  (It's a little difficult to know as it
: has been somewhat circular.)   I was going to reply to your points one
: by one, but i think that would just make the discussion more confusing
: than it already is!

Ryan: this patch truly does kick ass ... we can probably simplify a lot
of the Legacy stuff by leveraging your new StandardRequestBuilder -- but
that can be done later.

i'm still really not liking the way there is a single SolrRequestBuilder
with a big complicated build method that guesses what streams the user
wants.   i really feel strongly that even if all the parsing logic is in
the core, even if it's all in one class: a piece of the path should be
used to determine where the streams come from.

consider the example you've got on your test.html page: POST - with query
string ... that doesn't obey the typical semantics of a POST with a query
string ... if you used the methods on HttpServletRequest to get the params
it would give you all the params it found both in the query strings *and*
in the post body.

This is a great example of what i was talking about: if i have no
intention of sending a stream, it should be possible for me to send params
in both the URL and in the POST body -- but in other cases i should be
able to POST some raw XML and still have params in the URL.

arguably: we could look at the Content-Type of the request and make the
assumption based on that -- but as i mentioned before, people don't
always set the Content-Type perfectly.  if we used a URL fragment to
determine where the streams should come from we could be a lot more
confident that we know where the stream should come from -- and let the
RequestHandler decide if it wants to trust the ContentType

the multipart/mixed example i gave previously is another example -- your
code here assumes that should be given to the RequestHandler as multiple
streams -- which is a great assumption to make for file uploads, but which
gives me no way to POST multipart/mixed mime data that i want given to the
RequestHandler as a single ContentStream (so it can have access to all of
the mime headers for each part)



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Chris Hostetter

: To be clear, (with the current implementation in SOLR-104) you would
: have to put this in your solrconfig.xml
:
: <requestHandler name="/select" class="solr.StandardRequestHandler" />
:
: Notice the preceding '/'.  I think this is a strong indication that
: someone *wants* /select to behave distinctly.

crap ... i totally misread that ... so if people have a requestHandler
registered with a name that doesn't start with a slash, they can't use the
new URL structure and they have to use the old one.

DAMN! ... that is slick dude ... okay, i agree with you, the odds of that
causing problems are pretty fucking low.

I'm still hung up on this parse logic thing ... i really think it needs
to be in the path .. or at the very least, there needs to be a way to
specify it in the path to force one behavior or another, and if it's not
in the path then we can guess based on the Content-Type.

Putting it in a query arg would make getting it without contaminating the
POST body kludgy, putting it at the start of the path doesn't work well
for supporting a default if it isn't there, and putting it at the end of
the PATH messes up the nice work you've done letting RequestHandlers have
extra path info for encoding info they want.

H...

What if we did something like this...

   /exec/handler/name:extra/path?param1=val1
   /raw/handler/name:extra/path?param1=val1
   /url/handler/name:extra/path?param1=val1&url=...&url=...
   /file/handler/name:extra/path?param1=val1&file=...&file=...

where exec means guess based on the Content-Type, raw means use the
POST body as a single stream regardless of Content-Type, etc...

thoughts?


-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Yonik Seeley

On 1/20/07, Chris Hostetter [EMAIL PROTECTED] wrote:

Ryan: this patch truly does kick ass ... we can probably simplify a lot
of the Legacy stuff by leveraging your new StandardRequestBuilder -- but
that can be done later.


Much is already done by the looks of it.


i'm still really not liking the way there is a single SolrRequestBuilder
with a big complicated build method that guesses what streams the user
wants.


But I don't need a separate URL to do GET vs POST in HTTP.
It seems like having a different URL for where you put the args would
be hard to explain to people.


  i really feel strongly that even if all the parsing logic is in
the core, even if it's all in one class: a piece of the path should be
used to determine where the streams come from.


If there's a ? in the URL, then it's args, so that could always
safely be parsed.  Perhaps a special arg, if present, could override
the default method of getting input streams?


consider the example you've got on your test.html page: POST - with query
string ... that doesn't obey the typical semantics of a POST with a query
string ... if you used the methods on HttpServletRequest to get the params
it would give you all the params it found both in the query strings *and*
in the post body.


Blech.  I was wondering about that.  Sounds like bad form, but perhaps could be
supported via something like
/solr/foo?postbody=args

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Yonik Seeley

On 1/20/07, Chris Hostetter [EMAIL PROTECTED] wrote:

but the HTTP Client libraries in various languages don't always make it
easy to set Content-Type -- and even if they do that doesn't mean the
person using that library knows how to use it properly -


I think we have to go with common usages.  We neither rely on, nor
discard content-type in all cases.
- When it has a charset, believe it.
- When it says form-encoded, only believe it if there aren't args on
the URL (because many clients like curl default to
application/x-www-form-urlencoded for a post).
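(A tiny sketch of that heuristic -- purely hypothetical, not committed code:)

import javax.servlet.http.HttpServletRequest;

// Sketch of the rules above: trust the charset when one is given, but only
// believe "form-encoded" when there are no args already on the URL.
public class ContentTypeHeuristic {

  public static boolean treatBodyAsFormParams(HttpServletRequest req) {
    String ct = req.getContentType();
    boolean claimsFormEncoded =
        ct != null && ct.startsWith("application/x-www-form-urlencoded");
    boolean hasUrlArgs =
        req.getQueryString() != null && req.getQueryString().length() > 0;
    // curl and friends send form-urlencoded by default, so if the caller also
    // put args on the URL, assume the body is really a raw stream
    return claimsFormEncoded && !hasUrlArgs;
  }

  public static String charsetOrNull(HttpServletRequest req) {
    // the charset portion of the Content-Type is trusted when present
    return req.getCharacterEncoding();
  }
}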


- but everyone
understands how to put something in a URL.  if nothing else, think of
putting the parser type in the URL as a checksum that the RequestParser
can use to validate its assumptions -- if it's not there, then it can do
all of the intelligent things you think it should do, but if it is there
that dictates what it should do.


If it's optional in the args, I could be on board with that.


(aren't you the one that convinced me a few years back that it was better
to trust a URL than to trust HTTP Headers? ... because people understand
URLs and put things in them, but they don't always know what headers to
send .. curl being the great example, it always sends a Content-Type even
if the user doesn't ask it to, right?)


Well, for the update server, we do ignore the form-data stuff, but we
don't ignore the charset.


: Multi-part posts will have the content-type set correctly, or it won't work.
: The big use-case I see is browser file upload, and they will set it correctly.

right, but my point is what if i want the multi-part POST body left alone
so my RequestHandler can deal with it as a single stream -- if i set
every header correctly, the smart parsing code will parse it -- which is
why something in the URL telling it *not* to parse it is important.


That sounds like a pretty rare corner case.


: We should not preclude wacky handlers from doing things for
: themselves, calling our stuff as utility methods.

how? ... if there is one and only one RequestParser which makes the
SolrRequest before the RequestHandler ever sees it, and parses the post
body because the content-type is multipart/mixed, how can a wacky
handler ever get access to the raw post body?


I wasn't thinking *that* whacky :-)
There are always other options, such as using your own servlet though.
I don't think we should try to solve every case (the whole 80/20
thing).

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Yonik Seeley

On 1/20/07, Yonik Seeley [EMAIL PROTECTED] wrote:

 It would be:
 http://${context}/${path}?stream.type=post

Yes!
Feels like a much more natural place to me than as part of the path of the URL.
Just need to hash out meaningful param names/values?


Oh, and I'm more interested in the semantics of those param/values,
and not what request parser it happens to get mapped to.  I'd vote for
different request parsers being an implementation detail, and keeping
those details (plugability) out of solrconfig.xml for now.

We could always add it later, but it's a lot tougher to remove things.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Chris Hostetter

(the three of us are online way too much ... for crying out loud it's a
saturday night folks!)

: In my opinion, I don't think we need to worry about it for the
: *default* handler.  That is not a very difficult constraint, and there
: is no one out there expecting to be able to post parameters in the URL
: and the body.  I'm not sure it is worth complicating anything if this
: is the only thing we are trying to avoid.

you'd be surprised at the number of people i've run into who expect that to
work.

: I think the *default* should handle all the cases mentioned without
: the client worrying about different URLs  for the various methods.
:
: The next question is which (if any) of the explicit parsers you think
: are worth including in web.xml?

holy crap, i think i have a solution that will make all of us really
happy...

remember that idea we all really detested of a public plugin interface,
configured in the solrconfig.xml that looked like this...

 public interface RequestParser {
   SolrRequest parse(HttpServletRequest req);
 }

...what if we bring that idea back, and let people configure it in the
solrconfig.xml, using path like names...

  <requestParser name="/raw" class="solr.RawPostRequestParser" />
  <requestParser name="/multi" class="solr.MultiPartRequestParser" />
  <requestParser name="/nostream" class="solr.SimpleRequestParser" />
  <requestParser name="/guess" class="solr.UseContentTypeRequestParser" />

...but don't make it a *public* interface ... make it package protected,
or maybe even a private static interface of the Dispatch Filter .. either
way, don't instantiate instances of it using the plugin-lib ClassLoader,
make sure it comes from the WAR so it only uses the ones provided out of the
box.

then make the dispatcher check each URL first by seeing if it starts with
the name of any registered requestParser ... if it doesn't then use the
default UseContentTypeRequestParser .. *then* it does what the rest of
Ryan's current Dispatcher does, taking the rest of the path to pick a
request handler.

the beauty of this approach is that if no <requestParser/> tags appear in
the solrconfig.xml, then the URLs look exactly like you guys want, and the
request parsing / stream building semantics are exactly the same as they
are today ... if/when we (or maybe just i) write those other
RequestParsers people can choose to turn them on (and change their URLs)
if they want, but if they don't they can keep having the really simple
URLs ... OR they could register something like this...

  <requestParser name="" class="solr.RawPostRequestParser" />

...and have really simple URLs, but be guaranteed that they always got
their streams from raw POST bodies.

This would also solve Ryan's concern about allowing people to turn off
fetching streams from remote URLs (or from local files, a small concern i
had but hadn't mentioned yet since we had bigger fish to fry)



Thoughts?


-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Chris Hostetter
On Sat, 20 Jan 2007, Ryan McKinley wrote:

: Date: Sat, 20 Jan 2007 19:17:16 -0800
: From: Ryan McKinley [EMAIL PROTECTED]
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: Re: Update Plugins (was Re: Handling disparate data sources in
: Solr)
:
: 
:  ...what if we bring that idea back, and let people configure it in the
:  solrconfig.xml, using path like names...
: 
:    <requestParser name="/raw" class="solr.RawPostRequestParser" />
:    <requestParser name="/multi" class="solr.MultiPartRequestParser" />
:    <requestParser name="/nostream" class="solr.SimpleRequestParser" />
:    <requestParser name="/guess" class="solr.UseContentTypeRequestParser" />
: 
:  ...but don't make it a *public* interface ... make it package protected,
:  or maybe even a private static interface of the Dispatch Filter .. either
:  way, don't instantiate instances of it using the plugin-lib ClassLoader,
:  make sure it comes from the WAR so it only uses the ones provided out of the
:  box.


: I'm on board as long as the URL structure is:
:   ${path/from/solr/config}?stream.type=raw

actually the URL i was suggesting was...

${parser/path/from/solr/config}${handler/path/from/solr/config}?param=val

...i was trying to keep the parser name out of the query string,
so we don't have to do any hacky parsing of
HttpServletRequest.getQueryString() to get it.

basically if you have this...

  <requestParser name="/raw" class="solr.RawPostRequestParser" />
  <requestParser name="/multi" class="solr.MultiPartRequestParser" />
  <requestParser name="/nostream" class="solr.SimpleRequestParser" />

  <requestHandler name="/update/commit" class="solr.CommitRequestHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/xml" class="solr.XmlQueryRequestHandler" />

...then these urls are all valid...

   http://localhost:/solr/raw/update?param=val
  ..uses raw post body for update
   http://localhost:/solr/multi/update?param=val
  ..uses multipart mime for update
   http://localhost:/solr/update?param=val
  ..no requestParser matched path prefix, so default is chosen and
Content-Type is used to decide where streams come from.

but if instead my config looks like this...

  <requestParser name="" class="solr.MultiPartRequestParser" />
  <requestParser name="/raw" class="solr.RawPostRequestParser" />

  <requestHandler name="/update/commit" class="solr.CommitRequestHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/xml" class="solr.XmlQueryRequestHandler" />

...then these URLs would fail...

   http://localhost:/solr/raw/update?param=val
   http://localhost:/solr/multi/update?param=val

...because the empty string would match as a parser, but /raw/update
and /multi/update wouldn't match as requestHandlers (the registration of
/raw as a parser would be useless)

this URL would work however...

   http://localhost:/solr/update?param=val
  ..treat all requests as if they have multi-part mime streams

...i use this only as an example of what i'm describing ... not as an
example of something we should recommend.

The key to all of this being that we'd check parser names against the URL
prefix in order from shortest to longest, then check the rest of the path
as a requestHandler ... if either of those fail, then the filter would
skip the request.
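(To make that matching rule concrete, a rough sketch -- the class and the two
registries are hypothetical stand-ins, not actual Solr code:)

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative only: registered parser names are tried shortest-first against
// the request path, and whatever is left over must name a registered
// requestHandler or the filter skips the request entirely.
public class PrefixDispatchSketch {

  public static Object[] resolve(String path, Map parsers, Map handlers) {
    // order the registered parser names shortest-first
    List names = new ArrayList(parsers.keySet());
    Collections.sort(names, new Comparator() {
      public int compare(Object a, Object b) {
        return ((String) a).length() - ((String) b).length();
      }
    });

    String parserName = null;
    for (int i = 0; i < names.size(); i++) {
      String name = (String) names.get(i);
      if (path.startsWith(name)) {
        parserName = name;             // shortest matching prefix wins
        break;
      }
    }

    // whatever follows the parser prefix must be a registered requestHandler
    String rest = (parserName == null) ? path : path.substring(parserName.length());
    Object handler = handlers.get(rest);
    if (handler == null) {
      return null;                     // no handler: let the rest of the webapp handle it
    }
    Object parser = (parserName == null)
        ? parsers.get("/guess")        // hypothetical default when no prefix matched
        : parsers.get(parserName);
    return new Object[] { parser, handler };
  }
}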

What we would probably recommend is that people map the guess request
parser to / so that they could put in all of the options they want on
buffer sizes and such, then map their requestHandlers without a /
prefix, and use content types correctly.

if they really had a reason to want to force one type of parsing, they
could register it with a different prefix.

  * default URLs stay clean
  * no need for an extra stream.type param
  * urls only get ugly if people want them to get ugly because they don't
want to make their clients set the mime type correctly.




-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

On 1/19/07, Yonik Seeley [EMAIL PROTECTED] wrote:

On 1/19/07, Chris Hostetter [EMAIL PROTECTED] wrote:
 whoa ... hold on a minute, even if we use a ServletFilter to do all of the
 dispatching instead of a Servlet we still need a base path right?

I thought that's what the filter gave you... the ability to filter any
URL to the /solr webapp, and Ryan was doing a lookup on the next
element for a request handler.



yes, this is the beauty of a Filter.  It *can* process the request
and/or it can pass it along.  There is no problem at all with mapping
a filter to all requests and a servlet to some paths.  The filter will
only handle paths declared in solrconfig.xml; everything else will be
handled however it is defined in web.xml

(As a sidenote, wicket 2.0 replaces their dispatch servlet with a
filter - it makes it MUCH easier to have their app co-exist with other
things in a shared URL structure.)


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

then all is fine and dandy ... but what happens if someone tries to
configure a plugin with the name admin ... now all of the existing admin
pages break.



that is exactly what you would expect to happen if you map a handler
to /admin.  The person configuring solrconfig.xml is saying "Hey, use
this instead of the default /admin.  I want mine to make sure you are
logged in using my custom authentication method."  In addition, it may
be reasonable (sometime in the future) to implement /admin as a
RequestHandler.  This could be a clean way to address SOLR-58 (xml
with stylesheets, or JSON, etc...)



also: what happens a year from now when we add some completely new
Servlet/ServletFilter to Solr, and want to give it a unique URL...

  http://host:/solr/bar/



obviously, I think the default solr settings should be prudent about
selecting URLs.  The standard configuration should probably map most
things to /select/xxx or /update/xxx.


...we could put it earlier in the processing chain before the existing
ServletFilter, but then we break any users that have registered a plugin
with the name bar.


Even if we move this to have a prefix path, we run into the exact same
issue when sometime down the line solr has a default handler mapped to
'bar'

/solr/dispatcher/bar

But, if it ever becomes a problem, we can add an excludes pattern to
the filter-config that would  skip processing even if it maps to a
known handler.



more short term: if there is no prefix that the ServletFilter requires,
then supporting the legacy "http://host:/solr/update" and
"http://host:/solr/select" URLs becomes harder,


I don't think /update or /select need to be legacy URLs.  They can
(and should) continue to work as they currently do using a new framework.

The reason I was suggesting that the Handler interface adds support to
ask for the default RequestParser and/or ResponseWriter is to support
this exact issue.  (However in the case of path=/select the filter
would need to get the handler from ?qt=xxx)

- - - - -

All that said, this could just as cleanly map everything to:
 /solr/dispatch/update/xml
 /solr/cmd/update/xml
 /solr/handle/update/xml
 /solr/do/update/xml

thoughts?


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Yonik Seeley

On 1/19/07, Ryan McKinley [EMAIL PROTECTED] wrote:

All that said, this could just as cleanly map everything to:
  /solr/dispatch/update/xml
  /solr/cmd/update/xml
  /solr/handle/update/xml
  /solr/do/update/xml

thoughts?


That was my original assumption (because I was thinking of using
servlets, not a filter),
but I see little advantage to scoping under additional path elements.
I also agree with the other points you make.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

(Note: this is different than what i have suggested before.  Treat it
as brainstorming on how to take what i have suggested and mesh it with
your concerns)

What if:

The RequestParser would not be part of the core API - it would be a
helper function for Servlets and Filters that call the core API.  It
could be configured in web.xml rather than solrconfig.xml.  A
RequestDispatcher (Servlet or Filter) would be configured with a
single RequestParser.

The RequestParser would be in charge of taking HttpRequest and determining:
 1) The RequestHandler
 2) The SolrRequest (Params & Streams)
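(In rough interface terms -- illustrative only, the names are made up and
ParsedRequest is just a hypothetical holder for the two results:)

import javax.servlet.http.HttpServletRequest;

// Illustrative only: one object that turns the raw servlet request into both
// the handler to run and the SolrRequest (params + content streams) to hand it.
interface RequestParser {
  ParsedRequest parse(HttpServletRequest req) throws Exception;
}

// hypothetical holder for the two things the parser determines
class ParsedRequest {
  Object handler;      // the resolved request handler
  Object solrRequest;  // the SolrRequest carrying params and streams
}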

It would not be the most 'pluggable' of plugins, but I am still having
trouble imagining anything beyond a single default RequestParser.
Assuming anything doing *really* complex ways of extracting
ContentStreams will do it in the Handler, not the request parser.  For
reference see my argument for a separate DocumentParser interface in:
http://www.nabble.com/Re%3A-Update-Plugins-%28was-Re%3A-Handling-disparate-data-sources-in-Solr%29-p8386161.html

In my view, the default one could be mapped to /* and a custom one
could be mapped to /mycustomparser/*

This would drop the ':' from my proposed URL and change the scheme to look like:
/parser/path/the/parser/knows/how/to/extract/?params

This would give people a relatively easy way to implement 'restful'
URLs if they need to.  (but they would have to edit web.xml)



: Would that be configured in solrconfig.xml as handler name=xml?
: name=update/xml?  If it is update/xml would it only really work if
: the 'update' servlet were configured properly?

it would only make sense to map that as xml ... the SolrCore (and the
solrconfig.xml) shouldn't have any knowledge of the Servlet/ServletFilter
base paths because it should be possible to use the SolrCore independent
of any ServletContainer (if for no other reason in unit tests)



Correct, SolrCore should not care what the request path is.  That is
why I want to deprecate the execute( ) function that assumes the
handler is defined by 'qt'

Unit tests should be handled by execute( handler, req, res )

If I had my druthers, it would be:
 res = handler.execute( req )
but that is too big of a leap for now :)



...

A third use case of doing queries with POST might be that you want to use
standard CGI form encoding/multi-part file upload semantics of HTTP to
send an XML file (or files) to the above mentioned XmlQPRequestHandler ...
so then we have MultiPartMimeRequestParser ...


I agree with all your use cases.  It just seems like a LOT of complex
overhead to extract the general aspects of translating a
URL+Params+Streams = Handler+Request(Params+Streams)

Again, since the number of 'RequestParsers' is small, it seems overly
complex to have a separate plugin to extract URL, another to extract
the Handler, and another to extract the streams.  Particulary since
the decsiions on how you parse the URL can totally affect the other
aspects.




...i really, really, REALLY don't like the idea that the RequestParser
Impls -- classes users should be free to write on their own and plug in to
Solr using the solrconfig.xml -- are responsible for the URL parsing and
parameter extraction.  Maybe calling them RequestParser in my suggested
design is misleading, maybe a better name like StreamExtractor would be
better ... but they shouldn't be in charge of doing anything with the URL.



What if it were configured in web.xml, would you feel more comfortable
letting it determine how the URL is parsed and streams are extracted?


Imagine if 3 years ago, when Yonik and I were first hammering out the API
for SolrRequestHandlers, we had picked this...

   public interface SolrRequestHandlers extends SolrInfoMBean {
 public void init(NamedList args);
 public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
   }


Thank goodness you didn't!  I'm confident you won't let me (or anyone)
talk you into something like that!  You guys made a lot of good
choices and solr is an amazing platform for it.

That said, the task at issue is: How do we convert an arbitrary
HttpServletRequest into a SolrRequest.

I am proposing we have a single interface to do this:
 SolrRequest r = RequestParser.parse( HttpServletRequest  )

You are proposing this is broken down further.  Something like:
 Handler h = (the filter) getHandler( req.getPath() )
 SolrParams = (the filter) do stuff to extract the params (using
parser.preProcess())
 ContentStreams = parser.parse( request )

While it is not great to have plugins manipulate the HttpRequest -
someone needs to do it.  In my opinion, the RequestParser's job is to
isolate *everything* *else* from the HttpServletRequest.

Again, since the number of RequestParsers is small, it seems ok (to me)



keeping HttpServletRequest out of the API for RequestParsers helps us
future-proof against breaking plugins down the road.



I agree.  This is why i suggest the RequestParsers is not a core part
of the API, just a helper class for Servlets and Filters.

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Yonik Seeley

First Ryan, thank you for your patience on this *very* long hash
session.  Most wouldn't last that long unless it were a flame war ;-)
And thanks to Hoss, who seems to have the highest read+response
bandwidth of anyone I've ever seen (I'll admit I've only been
selectively reading this thread, with good intentions of coming back
to it).

On 1/19/07, Ryan McKinley [EMAIL PROTECTED] wrote:

It would not be the most 'pluggable' of plugins, but I am still having
trouble imagining anything beyond a single default RequestParser.
Assuming anything doing *really* complex ways of extracting
ContentStreams will do it in the Handler, not the request parser.


Agreed... a custom handler opening various streams not covered by the
default will most easily be handled by the handler opening the streams
themselves.


This would give people a relatively easy way to implement 'restful'
URLs if they need to.  (but they would have to edit web.xml)


A handler could alternately get the rest of the path (absent params), right?


Correct, SolrCore should not care what the request path is.  That is
why I want to deprecate the execute( ) function that assumes the
handler is defined by 'qt'

Unit tests should be handled by execute( handler, req, res )


How does the unit test get the handler?


If I had my druthers, it would be:
  res = handler.execute( req )
but that is too big of a leap for now :)


Yep... esp since the response writers now need the request for
parameters, for the searcher (streaming docs, etc).


You guys made a lot of good
choices and solr is an amazing platform for it.


I just wish I had known Lucene when I *started* Sol(a)r ;-)


I am proposing we have a single interface to do this:
  SolrRequest r = RequestParser.parse( HttpServletRequest  )


That's currently what new SolrServletRequest(HttpServletRequest) does.
We just need to figure out how to get InputStreams, Readers, etc.


I agree.  This is why i suggest the RequestParsers is not a core part
of the API, just a helper class for Servlets and Filters.


Sounds good as a practical starting point to me.  If we need more
in the future, we can add it then.

USECASE: The XML update plugin using the woodstox XML parser:
Woodstox docs say to give the parser an InputStream (with char
encoding, if available) for best performance.  This is also preferable
since if the charset isn't specified, the parser can try to snoop it from
the stream.

So, the handler needs to be able to get an InputStream, and HTTP headers.
Other plugins (CSV) will ask for a Reader and expect the details to be
ironed out for it.

Method1: come up with ways to expose all this info through an
interface... a headers object could be added to the SolrRequest
context (see getContext())
Method2: consider it a more special case, have an XML update servlet
that puts that info into the SolrRequest (perhaps via the context
again)
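(For concreteness, a hypothetical fragment of the handler side of Method1 --
the StAX calls are standard, but nothing here is the actual patch, and the
charset is assumed to come from whatever header access the SolrRequest ends up
exposing:)

import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: hand Woodstox (via the StAX API) the raw InputStream plus
// the charset from the Content-Type header when one was supplied, and let the
// parser snoop the encoding from the stream otherwise.
public class XmlStreamExample {

  public static XMLStreamReader open(InputStream body, String charset) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    if (charset != null) {
      // charset came from the HTTP headers, e.g. Content-Type: text/xml; charset=UTF-8
      return factory.createXMLStreamReader(body, charset);
    }
    // no charset given: let the parser sniff it from the stream itself
    return factory.createXMLStreamReader(body);
  }
}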

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Chris Hostetter

: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as i was leaving work this afternoon, it
occurred to me I really hope Ryan realizes i like all of his ideas, i'm
just wondering if they can be better -- most people I work with don't
have the stamina to deal with my design reviews :)

What occurred to me as i was *getting* home was that since I seem to be the
only one that's (overly) worried about the RequestParser/HTTP abstraction
-- and since i haven't managed to convince Ryan after all of my badgering
-- it's probably just me being paranoid.

I think in general, the approach you've outlined should work great -- i'll
reply to some of your more recent comments directly.



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

On 1/19/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as i was leaving work this afternoon, it
occurred to me I really hope Ryan realizes i like all of his ideas, i'm
just wondering if they can be better -- most people I work with don't
have the stamina to deal with my design reviews :)



Thank you both!  This is the first time I've taken the time and effort
to contribute to an open source project.  I'm learning the
pace/etiquette etc as I go along :)   Honestly your critique is
refreshing - I'm used to working alone or directing others.

I *think* we are close to something we will all be happy with.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley


what!? .. really? ... you don't think the ones i mentioned before are
things we should support out of the box?

  - no stream parser (needed for simple GETs)
  - single stream from raw post body (needed for current updates)
  - multiple streams from multipart mime in post body (needed for SOLR-85)
  - multiple streams from files specified in params (needed for SOLR-66)
  - multiple streams from remote URL specified in params



I have imagined the single default parser handles *all* the cases you
just mentioned.

GET: read params from paramMap().  Check those params for special
params that send you to one or many remote streams.

POST: depending on headers/content type etc you parse the body as a
single stream, multi-part files or read the params.

It will take some careful design, but I think all the standard cases
can be handled by a single parser.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Yonik Seeley

On 1/20/07, Ryan McKinley [EMAIL PROTECTED] wrote:


 what!? .. really? ... you don't think the ones i mentioned before are
 things we should support out of the box?

   - no stream parser (needed for simple GETs)
   - single stream from raw post body (needed for current updates)
   - multiple streams from multipart mime in post body (needed for SOLR-85)
   - multiple streams from files specified in params (needed for SOLR-66)
   - multiple streams from remote URL specified in params


I have imagined the single default parser handles *all* the cases you
just mentioned.


Yes, this is what I had envisioned.
And if we come up with another cool standard one, we can add it and
all the current/older handlers get that additional behavior for free.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

:
: This would drop the ':' from my proposed URL and change the scheme to look 
like:
: /parser/path/the/parser/knows/how/to/extract/?params

i was totally okay with the ":" syntax (although we should double check if
":" is actually a legal unescaped URL character) .. but i'm confused by
this new suggestion ... is "parser" the name of the parser in that
example, and "path/the/parser/knows/how/to/extract" data that the parser
may use to build the SolrRequest with? (ie: perhaps the RequestHandler)

would parser names be required to not have slashes in them in that case?



(working with the assumption that most cases can be defined by a
single request parser)

I am/was suggesting that a dispatch servlet/filter has a single
request parser.  The default request parser will choose the handler
based on names defined in solrconfig.xml.  If someone needs a custom
RequestParser, it would be linked to a new servlet/filter (possibly)
mapped to a distinct prefix.

If it is not possible to handle most standard stream cases with a
single request parser, i will go back to the /path:parser format.

I suggest it is configured in web.xml because that is a configurable
place that is not solrconfig.xml.  I don't think it is or should be a
highly configurable component.



:
: Thank goodness you didn't!  I'm confident you won't let me (or anyone)
: talk you into something like that!  You guys made a lot of good

the point i was trying to make is that if we make a RequestParser
interface with a parseRequest(HttpServletRequest req) method, it amounts
to just as much badness -- the key is we can only make that interface work
as long as all the implementations are in the Solr code base where we can
keep an eye on them, and people have to go way, WAY, *WAY* into solr to
start changing them.




Yes, implementing a RequestParser is more like writing a custom
Servlet than adding a Tokenizer.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Yonik Seeley

On 1/20/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: I have imagined the single default parser handles *all* the cases you
: just mentioned.

Ah ... a lot of confusing things make more sense now ... but
some things are more confusing: If there is only one parser, and it
decides what to do based entirely on param names and HTTP headers, then
what's the point of having the parser name be part of the path in your
URL design?


I didn't think it would be part of the URL anymore.


: POST: depending on headers/content type etc you parse the body as a
: single stream, multi-part files or read the params.
:
: It will take some careful design, but I think all the standard cases
: can be handled by a single parser.

that scares me ... not only does it rely on the client code sending the
correct content-type


Not really... that would perhaps be the default, but the parser (or a
handler) can make intelligent decisions about that.

If you put the parser in the URL, then there's *that* to be messed up
by the client.


(i don't trust HTTP Client code -- but for the sake
of argument let's assume all clients are perfect) what happens when a
person wants to send a mime multi-part message *AS* the raw post body -- so
the RequestHandler gets it as a single ContentStream (ie: single input
stream, mime type of multipart/mixed) ?


Multi-part posts will have the content-type set correctly, or it won't work.
The big use-case I see is browser file upload, and they will set it correctly.


This may sound like a completely ridiculous idea, but consider the
situation where someone is indexing email ... they've written a
RequestHandler that knows how to parse multipart mime emails and
convert them to documents, they want to POST them directly to Solr and let
their RequestHandler deal with them as a single entity.


We should not preclude wacky handlers from doing things for
themselves, calling our stuff as utility methods.


..i think life would be a lot simpler if we kept the RequestParser name as
part of the URL, completely determined by the client (since the client
knows what it's trying to send) ... even if there are only 2 or 3 types of
RequestParsing being done.


Having to do different types of posts to different URLs doesn't seem
optimal, esp if we can do it in one.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Chris Hostetter

: I think the confusion is that (in my view) the RequestParser is the
: *only* object able to touch the stream.  I don't think anything should
: happen between preProcess() and process();  A RequestParser converts a
: HttpServletRequest to a SolrRequest.  Nothing else will touch the
: servlet request.

that makes it the RequestParser's responsibility to dictate the URL format
(if it's the only one that can touch the HttpServletRequest) i was
proposing a method by which the Servlet could determine the URL format --
there could in fact be multiple servlets supporting different URL formats
if we had some need for it -- and the RequestParser could generate streams
based on the raw POST data and/or any streams it wants to find based on
the SolrParams generated from the URL (ie: local files, remote resources,
etc)

I'm confused by your sentence "A RequestParser converts a
HttpServletRequest to a SolrRequest" ... i thought you were advocating
that the servlet parse the URL to pick a RequestHandler, and then the
RequestHandler dictates the RequestParser?

: /path/registered/in/solr/config:requestparser?params
:
: If no ':' is in the URL, use 'standard' parser
:
: 1. The URL path determines the RequestHandler
: 2. The URL path determines the RequestParser
: 3. SolrRequest = RequestParser.parse( HttpServletRequest )
: 4. handler.handleRequest( req, res );
: 5. write the response

do you mean the path before the colon determines the RequestHandler and the
path after the colon determines the RequestParser? ... that would work
fine too ... i was specifically trying to avoid making any design
decisions that required a particular URL structure, in what you propose
we are dictating more than just the /handler/path:parser piece of the
URL, we are also dictating that the Parser decides how the rest of the path
and all URL query string data will be interpreted -- which means if we
have a PostBodyRequestParser and a LocalFileRequestParser and a
RemoteUrlRequestParser which all use the query string params to get
the SolrParams for the request (and in the case of the last two: to know
what file/url to parse) and then we decide that we want to support a URL
structure that is more REST-like and uses the path for including
information, now we have to write a new version of all of those
RequestParsers (a subclass of each probably) that knows what our new URL
structure looks like ... even if that never comes up, every RequestParser
(even custom ones written by users to use some crazy proprietary binary
protocols we've never heard of to fetch streams of data) has to worry about
extracting the SolrParams out of the URL.

what i'm proposing is that the Servlet decide how to get the SolrParams
out of an HttpServletRequest, using whatever URL that servlet wants; the
RequestParser decides how to get the ContentStreams needed for that
request -- in a way that can work regardless of whether the stream is
actually part of the HttpServletRequest, or just referenced by a param in
the request; the RequestHandler decides what to do with those params
and streams; and the ResponseWriter decides how to format the results
produced by the RequestHandler back to the client.

:  : If anyone needs to customize this chain of events, they could easily
:  : write their own Servlet/Filter

: I don't *think* this would happen often, and people would only do
: it if they are unhappy with the default URL structure -> behavior
: mapping.  I am not suggesting this would be the normal way to
: configure solr.

I think i'm getting confused ... i thought you were advocating that
RequestParsers be implemented as ServletFilters (or Servlets) ... but if
that were the case it wouldn't just be about changing the URL structure, it
would be about picking new ways to get streams .. but that doesn't seem to
be what you are suggesting, so i'm not sure what i was misunderstanding.



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Ryan McKinley


I'm confused by your sentence "A RequestParser converts a
HttpServletRequest to a SolrRequest" ... i thought you were advocating
that the servlet parse the URL to pick a RequestHandler, and then the
RequestHandler dictates the RequestParser?



I was...  then you talked me out of it!  You are correct, the client
should determine the RequestParser independent of the RequestHandler.



: /path/registered/in/solr/config:requestparser?params
:
: If no ':' is in the URL, use 'standard' parser
:
: 1. The URL path determines the RequestHandler
: 2. The URL path determines the RequestParser
: 3. SolrRequest = RequestParser.parse( HttpServletRequest )
: 4. handler.handleRequest( req, res );
: 5. write the response

do you mean the path before the colon determines the RequestHandler and the
path after the colon determines the RequestParser?


yes, that is my proposal.


fine too ... i was specifically trying to avoid making any design
decisions that required a particular URL structure, in what you propose
we are dictating more than just the /handler/path:parser piece of the
URL, we are also dictating that the Parser decides how the rest of the path
and all URL query string data will be interpreted ...


Yes, this proposal would fix the URL structure to be
/path/defined/in/solrconfig:parser?params
/${handler}:${parser}

I *think* this handles most cases cleanly and simply.  The
only exception is where you want to extract variables from the URL
path.  There are plenty of ways to rewrite RESTful urls into a
path+params structure.  If someone absolutely needs RESTful urls, it
can easily be implemented with a new Filter/Servlet that picks the
'handler' and directly creates a SolrRequest from the URL path.  In my
opinion, for this level of customization it is reasonable that people
edit web.xml and put in their own servlets and filters.



what i'm proposing is that the Servlet decide how to get the SolrParams
out of an HttpServletRequest, using whatever URL that servlet wants;


I guess I'm not understanding this yet:

Are you suggesting there would be multiple servlets, each with a
different method to get the SolrParams from the URL?  How does the
servlet know if it can touch req.getParameter()?

How would the default servlet fill up SolrParams?




I think i'm getting confused ... i thought you were advocating that
RequestParsers be implemented as ServletFilters (or Servlets) ...


Originally I was... but again, you talked me out of it.  (this time
not totally)  I think the /path:parser format is clear and allows for
most everything off the shelf.  If you want to do something different,
that can easily be a custom filter (or servlet)

Essentially, i think it is reasonable for people to skip
'RequestParsers' in a custom servlet and be able to build the
SolrRequest directly.  This level of customization is reasonable to
handle directly with web.xml


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Yonik Seeley

On 1/18/07, Ryan McKinley [EMAIL PROTECTED] wrote:

Yes, this proposal would fix the URL structure to be
/path/defined/in/solrconfig:parser?params
/${handler}:${parser}

I *think* this handles most cases cleanly and simply.  The
only exception is where you want to extract variables from the URL
path.


But that's not a hypothetical case, extracting variables from the URL
path is something I need now (to add metadata about the data in the
raw post body, like the CSV separator).

POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz
with a body of 10,20,30


There are plenty of ways to rewrite RESTful urls into a
path+params structure.  If someone absolutely needs RESTful urls, it
can easily be implemented with a new Filter/Servlet that picks the
'handler' and directly creates a SolrRequest from the URL path.


While being able to customize something is good, having really good
defaults is better IMO :-)  We should also be focused on exactly what
we want our standard update URLs to look like in parallel with the
design of how to support them.

As a side note, with a change of URLs, we get a free chance to
change whatever we want about the parameters or response format...
backward compatibility only applies to the original URLs IMO.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Ryan McKinley

On 1/18/07, Yonik Seeley [EMAIL PROTECTED] wrote:

On 1/18/07, Ryan McKinley [EMAIL PROTECTED] wrote:
 Yes, this proposal would fix the URL structure to be
 /path/defined/in/solrconfig:parser?params
 /${handler}:${parser}

 I *think* this handles most cases cleanly and simply.  The
 only exception is where you want to extract variables from the URL
 path.

But that's not a hypothetical case, extracting variables from the URL
path is something I need now (to add metadata about the data in the
raw post body, like the CSV separator).

POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz
with a body of 10,20,30



Sorry, by "in the URL" I mean "in the URL path". The RequestParser can
extract whatever it likes from getQueryString()

The url you list above could absolutely be handled with the proposed
format.  The thing that could not be handled is:
http://localhost:8983/solr/csv/foo/bar/baz/
with body 10,20,30



 There are plenty of ways to rewrite RESTful urls into a
 path+params structure.  If someone absolutely needs RESTful urls, it
 can easily be implemented with a new Filter/Servlet that picks the
 'handler' and directly creates a SolrRequest from the URL path.

While being able to customize something is good, having really good
defaults is better IMO :-)  We should also be focused on exactly what
we want our standard update URLs to look like in parallel with the
design of how to support them.



again, i totally agree.  My point is that I don't think we need to
make the dispatch filter handle *all* possible ways someone may want
to structure their request.  It should offer the best defaults
possible.  If that is not sufficient, someone can extend it.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Yonik Seeley

On 1/18/07, Ryan McKinley [EMAIL PROTECTED] wrote:

On 1/18/07, Yonik Seeley [EMAIL PROTECTED] wrote:
 On 1/18/07, Ryan McKinley [EMAIL PROTECTED] wrote:
  Yes, this proposal would fix the URL structure to be
  /path/defined/in/solrconfig:parser?params
  /${handler}:${parser}
 
  I *think* this handles most cases cleanly and simply.  The
  only exception is where you want to extract variables from the URL
  path.

 But that's not a hypothetical case, extracting variables from the URL
 path is something I need now (to add metadata about the data in the
 raw post body, like the CSV separator).

 POST to http://localhost:8983/solr/csv?separator=,&fields=foo,bar,baz
 with a body of 10,20,30


Sorry, by "in the URL" I mean "in the URL path". The RequestParser can
extract whatever it likes from getQueryString()

The url you list above could absolutely be handled with the proposed
format.


Cool.  I think i need more examples... concrete is good :-)

I don't quite grok your format below... is it one line or two?
/path/defined/in/solrconfig:parser?params
/${handler}:${parser}

Is that simply

/${handler}:${parser}?params

Or is it all one line where you actually have params twice?

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Chris Hostetter

: However, I'm not yet convinced the benefits are worth the costs.  If
: the number of RequestParsers remain small, and within the scope of
: being included in the core, that functionality could just be included
: in a single non-pluggable RequestParser.
:
: I'm not convinced is a bad idea either, but I'd like to hear about
: usecases for new RequestParsers (new ways of generically getting an
: input stream)?

I don't really see it being a very high cost ... and even if we can't
imagine any other potential user written RequestParser, we already know of
at least 4 use cases we want to support out of the box for getting
streams:

 1) raw post body (as a single stream)
 2) multi-part post body (file upload, potentially several streams)
 3) local file(s) specified by path (1 or more streams)
 4) remote resource(s) specified by URL(s) (1 or more streams)

...we could put all that logic in a single class that looks at a
SolrParam to pick what method to use, or we could extract each one into
its own class using a common interface ... either way we can hardcode the
list of viable options if we want to avoid the issue of letting the client
configure them .. but i still think it's worth the effort to talk about
what that common interface might be.

I think my idea of having both a preProcess and a process method in
RequestParser so it can do things before and after the Servlet has
extracted SolrParams from the URL would work in all of the cases we've
thought of.



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Ryan McKinley


Cool.  I think i need more examples... concrete is good :-)

I don't quite grok your format below... is it one line or two?
/path/defined/in/solrconfig:parser?params
/${handler}:${parser}

Is that simply

/${handler}:${parser}?params



yes.  the ${} is just to show what is extracted from the request URI,
not a specific example

Imagine you have a CsvUpdateHandler defined in solrconfig.xml with a
name="my/update/csv".

The standard RequestParser could extract the parameters and
Iterable<ContentStream> for each of the following requests:

POST: /my/update/csv/?separator=,&fields=foo,bar,baz
(body) 10,20,30

POST: /my/update/csv/
multipart post with 5 files and 6 form fields defining params
(unlike the previous example, the handler would get 5 input streams
rather than 1)

GET: /my/update/csv/?post.remoteURL=http://..&separator=,&fields=foo,bar,baz&...
fill the stream with the content from a remote URL (see the sketch after
these examples)

GET: /my/update/csv/?post.body=bodycontent&fields=foo,bar,baz...
use 'bodycontent' as the input stream.  (note, this does not make much
sense for csv, but is a useful example)

POST: /my/update/csv:remoteurls/?separator=,&fields=foo,bar,baz
(body) http://url1,http://url2,http://url3...
In this case we would use a custom RequestParser (remoteurls) that
would read the post body and convert it to a stream of content urls.
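
To make the remote-URL example above concrete, the standard parser might do
something like this (GET case, so calling getParameter() is safe;
UrlContentStream is an invented helper, not existing code):

  String remote = req.getParameter("post.remoteURL");
  if (remote != null) {
    // wrap the remote resource; the stream is only opened when the
    // handler actually asks for it
    streams.add(new UrlContentStream(new URL(remote)));
  }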

- - - - - - -

The URL path (everything before the ':') would be entirely defined and
configured by solrconfig.xml.  A filter would see if the request path
matches a registered handler - if not, it will pass it up the filter
chain.  This would allow custom filters and servlets to co-exist in
the top level URL path.  Consider:

solrconfig.xml:
  <handler name="delete" class="DeleteHandler" />

web.xml:
  <servlet-mapping>
    <servlet-name>MyRestfulDelete</servlet-name>
    <url-pattern>/mydelete/*</url-pattern>
  </servlet-mapping>

POST: /delete?id=AAA   would be sent to DeleteHandler
POST: /mydelete/AAA/ would be sent to MyRestfulDelete

Alternatively, you could have:


solrconfig.xml:
  <handler name="standard/delete" class="DeleteHandler" />

web.xml:
  <servlet-mapping>
    <servlet-name>MyRestfulDelete</servlet-name>
    <url-pattern>/delete/*</url-pattern>
  </servlet-mapping>

POST: /standard/delete?id=AAA   would be sent to DeleteHandler
POST: /delete/AAA/ would be sent to MyRestfulDelete

I am suggesting we do not try to have the default request servlet/filter
support extracting parameters from the URL path.  I think this is a
reasonable tradeoff to be able to have the request path easily user
configurable using the *existing* plugin configuration.
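
A rough sketch of that filter ('core' stands in for however the filter
reaches the handler registry; names are invented and error handling is
omitted):

  public class SolrDispatchFilter implements Filter {
    public void init(FilterConfig config) throws ServletException { }
    public void destroy() { }

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
      HttpServletRequest req = (HttpServletRequest) request;
      String path = req.getServletPath();            // e.g. /my/update/csv
      if (core.getRequestHandler(path) != null) {    // registered in solrconfig.xml?
        handleRequest(req, (HttpServletResponse) response, path);
      } else {
        chain.doFilter(request, response);           // not ours -- pass it up the chain
      }
    }
  }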

- - - - - - - -

In a previous email, you mentioned changing the URL structure.  With
this proposal, we would continue to support:
/select?wt=XXX

for the Csv example, you would also be able to call:
GET: /select?qt=/my/update/csv/&post.remoteURL=http://..&sepa...

ryan


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Ryan McKinley


: I was...  then you talked me out of it!  You are correct, the client
: should determine the RequestParser independent of the RequestHandler.

Ah ... this is the one problem with high volume on an involved thread ...
i'm sending replies to messages you write after you've already read other
replies to other messages you sent and changed your mind :)



Should we start a new thread?




Here's a more fleshed out version of the pseudo-java i posted earlier,
with all of my addendums inlined and a few simple method calls changed to
try and make the purpose more clear...



Ok, now (I think) I see the difference between our ideas.


From your code, it looks like you want the RequestParser to extract
'qt' that defines the RequestHandler.  In my proposal, the
RequestHandler is selected independent of the RequestParser.

What do you imagine happens in:


String p = pickRequestParser(req);



This looks like you would have to have a standard way (per servlet) of
getting the RequestParser.  How do you envision that?  What would be
the standard way to choose your request parser?


If the RequestHandler is defined by the RequestParser,  I would
suggest something like:

interface SolrRequest
{
 RequestHandler getHandler();
 Iterable<ContentStream> getContentStreams();
 SolrParams getParams();
}

interface RequestParser
{
 SolrRequest getRequest( HttpServletRequest req );

 // perhaps remove getHandler() from SolrRequest and add:
 RequestHandler getHandler();
}

And then configure a servlet or filter with the RequestParser

<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>...</filter-class>
  <init-param>
    <param-name>RequestParser</param-name>
    <param-value>org.apache.solr.parser.StandardRequestParser</param-value>
  </init-param>
</filter>

Given that the number of RequestParsers is realistically small (as
Yonik mentioned), I think this could be a good solution.
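
For what it's worth, the filter's init() could then be as simple as this
sketch (just reading the init-param above and instantiating the class):

  private RequestParser parser;

  public void init(FilterConfig config) throws ServletException {
    String className = config.getInitParameter("RequestParser");
    try {
      parser = (RequestParser) Class.forName(className).newInstance();
    } catch (Exception ex) {
      throw new ServletException("could not load RequestParser: " + className, ex);
    }
  }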

To update my current proposal:
1. Servlet/Filter defines the RequestParser
2. requestParser parses handler & request from HttpServletRequest
3. handled essentially as before

To update the example URLs, defined by the StandardRequestParser
 /path/to/handler/?param
where /path/to/handler is the name defined in solrconfig.xml

To use a different RequestParser, it would need to be configured in web.xml
 /customparser/whatever/path/i/like


- - - - - - - - - - - - - -

I still don't see why:



// let the parser preprocess the streams if it wants...
Iterable<ContentStream> s = solrParser.preProcess(
    getStreamInfo(req), new Pointer<InputStream>() {
      public InputStream get() {
        return req.getInputStream();
      }
    });

SolrParams params = makeSolrRequest(req);

// ServletSolrRequest is a basic impl of SolrRequest
SolrRequest solrReq = new ServletSolrRequest(params, s);

// let the parser decide what to do with the existing streams,
// or provide new ones
solrParser.process(solrReq, s);



can not be contained entirely in:

 SolrRequest solrReq = parser.parse( req );

assuming the SolrRequest interface includes

 Iterable<ContentStream> getContentStreams();

the parser can use req.getInputStream() however it likes - either
to make params and/or to build ContentStreams

- - - - - - - -

good good
ryan


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-18 Thread Chris Hostetter

:  Ah ... this is the one problem with high volume on an involved thread ...
:  i'm sending replies to messages you write after you've already read other
:  replies to other messages you sent and changed your mind :)

: Should we start a new thread?

I don't think it would make a difference ... we just need to slow down :)

: Ok, now (I think) I see the difference between our ideas.
:
: From your code, it looks like you want the RequestParser to extract
: 'qt' that defines the RequestHandler.  In my proposal, the
: RequestHandler is selected independent of the RequestParser.

no, no, no ... i'm sorry if i gave that impression ... the RequestParser
*only* worries about getting streams, it shouldn't have any way of even
*guessing* what RequestHandler is going to be used.

for reference: http://www.nabble.com/Re%3A-p8438292.html

note that i never mention qt .. instead i refer to
core.execute(solrReq, solrRsp); doing exactly what it does today ...
core.execute will call getRequestHandler(solrReq.getQueryType()) to pick
the RequestHandler to use.

the Servlet is what creates the SolrRequest object, and puts whatever
SolrParams it wants (including qt) in that SolrRequest before asking the
SolrCore to take care of it.

: What do you imagine happens in:
: 
:  String p = pickRequestParser(req);

let's use the URL syntax you've been talking about that people seem to
have agreed looks good (assuming i understand correctly) ...

   /servlet/${requesthandler}:${requestparser}?param1=val1&param2=val2

what i was suggesting was that then the servlet which uses that URL
structure might have a utility method called pickRequestParser that would look 
like...

  private String pickRequestParser(HttpServletRequest req) {
    String[] pathParts = req.getPathInfo().split("\\:");
    if (pathParts.length < 2 || "".equals(pathParts[1]))
      return "default"; // or "standard", or null -- whatever
    return pathParts[1];
  }


: If the RequestHandler is defined by the RequestParser,  I would
: suggest something like:

again, i can't emphasize enough that that's not what i was proposing ... i
am in no way shape or form trying to talk you out of the idea that it
should be possible to specify the RequestParser, the RequestHandler, and
the OutputWriter all as part of the URL, and completely independent of
each other.

the RequestHandler and the OutputWriter could be specified as regular
SolrParams that come from any part of the HTTP request, but the
RequestParser needs to come from some part of the URL that can be
inspected without any risk of affecting the raw post stream (ie: no
HttpServletRequest.getParameter() calls)

: I still don't see why:
:
: 
:  // let the parser preprocess the streams if it wants...
:  Iterable<ContentStream> s = solrParser.preProcess(
:      getStreamInfo(req), new Pointer<InputStream>() {
:        public InputStream get() {
:          return req.getInputStream();
:        }
:      });
: 
:  SolrParams params = makeSolrRequest(req);
: 
:  // ServletSolrRequest is a basic impl of SolrRequest
:  SolrRequest solrReq = new ServletSolrRequest(params, s);
: 
:  // let the parser decide what to do with the existing streams,
:  // or provide new ones
:  solrParser.process(solrReq, s);
: 
:
: can not be contained entirely in:
:
:   SolrRequest solrReq = parser.parse( req );

because then the RequestParser would be defining how the URL is getting
parsed -- the makeSolrRequest utility placeholder i described had the
wrong name, i should have called it makeSolrParams ... it would look
something like this in the URL syntax i described above...

  private SolrParams makeSolrParams(HttpServletRequest req) {
    // this class already in our code base, used as is
    SolrParams p = new ServletSolrParams(req);
    String[] pathParts = req.getPathInfo().split("\\:");
    if ("".equals(pathParts[0]))
      return p;
    Map<String,String> tmp = new HashMap<String,String>();
    tmp.put("qt", pathParts[0]);
    return new DefaultSolrParams(new MapSolrParams(tmp), p);
  }



the nutshell version of everything i'm trying to say is...

 SolrRequest
   - models all info about a request to solr to do something:
     - the key=val params associated with that request
     - any streams of data associated with that request
 RequestParser(s)
   - different instances for different sources of streams
   - is given two chances to generate ContentStreams:
     - once using the raw stream from the HTTP request
     - once using the params for the SolrRequest
 SolrServlet
   - the only thing with direct access to the HttpServletRequest, shields
     the other interface APIs from the mechanics of HTTP
   - dictates the URL structure
     - determines the name of the RequestParser to use
     - lets parser have the raw input stream
     - determines where SolrParams for request come from
     - lets parser have params to make more streams if it wants to.
 SolrCore
   - does all of the name lookups for processing a SolrRequest:
     - 

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Ryan McKinley

data and wrote it out in the current update response format .. so the
current SolrUpdateServlet could be completely replaced with a simple url
mapping...

   /update --> /select?qt=xmlupdate&wt=legacyxmlupdate



Using the filter method above, it could (and i think should) be mapped to:
/update


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Chris Hostetter

talking about the URL structure made me realize that the Servlet should
dictate the URL structure and the param parsing, but it should do it after
giving the RequestParser a crack at any streams it wants (actually i think
that may be a direct quote from JJ ... can't remember now) ... *BUT* the
RequestParser may not want to provide a list of streams, until the params
have been parsed (if for example, one of the params is the name of a file)

so what if the interface for RequestParser looked like this...

  interface RequestParser {
    public void init(NamedList nl); // the usual
    /** will be passed the raw input stream from the
     * HttpServletRequest ... may need other HttpServletRequest info as
     * SolrParams (ie: method, content-type/content-length, ...) but we use
     * a SolrParams instance instead of the HttpServletRequest to
     * maintain an abstraction.
     */
    public Iterable<ContentStream> preProcess(SolrParams headers,
                                              InputStream s);
    /** guaranteed that the second arg will be the result from
     * a previous call to preProcess, and that that Iterable from
     * preProcess will not have been inspected or touched in any way, nor
     * will any references to it be maintained after this call.
     * this method is responsible for calling
     * request.setContentStreams(Iterable<ContentStream>) as it sees fit
     */
    public void process(SolrRequest request, Iterable<ContentStream> i);

  }

...the idea being that many RequestParsers will choose to implement one or
both of those methods as a NOOP that just returns null, but if they want
to implement both, they have the choice of obliterating the Iterable
returned by preProcess and completely replacing it once they see the
SolrParams in the request
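
for example, a local-file parser under this interface would probably be a
NOOP in preProcess and do all its work in process ... something like this
(sketch only; the "file" param name and the FileContentStream helper are
made up):

  class LocalFileRequestParser implements RequestParser {
    public void init(NamedList nl) { }

    // nothing useful in the raw post body for this parser
    public Iterable<ContentStream> preProcess(SolrParams headers, InputStream s) {
      return null;
    }

    // open a stream for every local file named in the params
    public void process(SolrRequest request, Iterable<ContentStream> ignored) {
      List<ContentStream> streams = new ArrayList<ContentStream>();
      String[] files = request.getParams().getParams("file");
      if (files != null) {
        for (String path : files) {
          streams.add(new FileContentStream(new File(path)));
        }
      }
      request.setContentStreams(streams);
    }
  }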

: specifically what i had in mind was something like this...
:
:   class SolrUberServlet extends HttpServlet {
: public void service(HttpServletRequest req, HttpServletResponse response) {
:   SolrCore core = getCore();
:   Solr(Query)Response solrRsp = new Solr(Query)Response();
:
:   // servlet specific method which does minimal inspection of
:   // req to determine the parser name
:   String p = pickRequestParser(req);
:
:   // looks up a registered instance (from solrconfig.xml)
:   // matching that name
:   RequestParser solrParser = coreGetParserByName(p);
:

// let the parser preprocess the streams if it wants...
Iterable<ContentStream> s = solrParser.preProcess(req.getInputStream());

// build the request using servlet specific URL rules
Solr(Query)Request solrReq = makeSolrRequest(req);

// let the parser decide what to do with the existing streams,
// or provide new ones
solrParser.process(solrReq, s);

:   // does exactly what it does now: picks the RequestHandler to
:   // use based on the params, calls it's handleRequest method
:   core.execute(solrReq, solrRsp)
:
:   // the rest of this is cut/paste from the current SolrServlet.
:   // use SolrParams to pick OutputWriter name, ask core for instance,
:   // have that writer write the results.
:   QueryResponseWriter responseWriter = 
core.getQueryResponseWriter(solrReq);
:   response.setContentType(responseWriter.getContentType(solrReq, 
solrRsp));
:   PrintWriter out = response.getWriter();
:   responseWriter.write(out, solrReq, solrRsp);
:
: }
:   }
:
:
: -Hoss
:



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Alan Burlison

Chris Hostetter wrote:


i'm totally on board now ... the RequestParser decides where the streams
come from if any (post body, file upload, local file, remote url, etc...);
the RequestHandler decides what it wants to do with those streams, and has
a library of DocumentProcessors it can pick from to help it parse them if
it wants to, then it takes whatever actions it wants, and puts the
response information in the existing Solr(Query)Response class, which the
core hands off to any of the various OutputWriters to format according to
the user's wishes.


+1

--
Alan Burlison
--


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Alan Burlison

Ryan McKinley wrote:


In addition, consider the case where you want to index a SVN
repository.  Yes, this could be done in a SolrRequestParser that logs in
and returns the files as a stream iterator.  But this seems like more
'work' than the RequestParser is supposed to do.  Not to mention you
would need to augment the Document with svn specific attributes.


This is indeed one of the things I'd like to do - use Solr as a back-end
for OpenGrok (http://www.opensolaris.org/os/project/opengrok/)

--
Alan Burlison
--



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread J.J. Larrea
At 11:48 PM -0800 1/16/07, Chris Hostetter wrote:
yeah ... once we have a RequestHandler doing that work, and populating a
SolrQueryResponse with its result info, it
would probably be pretty trivial to make an extremely bare-bones
LegacyUpdateOutputWRiter that only expected that simple amount of response
data and wrote it out in the current update response format .. so the
current SolrUpdateServlet could be completely replaced with a simple url
mapping...

   /update --> /select?qt=xmlupdate&wt=legacyxmlupdate

Yah!  But in my vision it would be

/update -> qt=update

because pathInfo is "update".  There's no need to remap anything in the URL,
the existing SolrServlet is ready for dispatch once it:
  - Prepares request params into SolrParams
  - Sets params(qt) to pathInfo
  - Somehow (perhaps with StreamIterator) prepares streams for RequestParser use

I'm still trying to conceptually maintain a separation of concerns between 
handling the details of HTTP (servlet-layer) and handling different payload 
encodings (a different layer, one I believe can be invoked after config is 
read).

The following is vision more than proposal or suggestion...

<requestHandler name="update" class="lets.write.this.UpdateRequestHandler">
  <lst name="invariants">
    <str name="wt">legacyxml</str>
  </lst>
  <lst name="defaults">
    <!-- rp matches queryRequestParser -->
    <str name="rp">xml</str>
  </lst>
</requestHandler>

<!-- only if standard responseWriter is not up to the task -->
<queryResponseWriter name="legacyxml"
    class="do.we.really.need.LegacyUpdateOutputWRiter"/>

<queryRequestParser name="xml" class="solr.XMLStreamRequestParser"/>

<queryRequestParser name="json" class="solr.JSONStreamRequestParser"/>

So when an incoming URL comes in:

/update?rp=json

the pipeline which is established is:

SolrServlet ->
solr.JSONStreamRequestParser
    |
    |-> request data carrier e.g. SolrQueryRequest
    |
lets.write.this.UpdateRequestHandler
    |
    |-> response data carrier e.g. SolrQueryResponse
    |
do.we.really.need.LegacyUpdateOutputWRiter

I expect this is all fairly straightforward, except for one sticky question:

Is there a universal format which can efficiently (e.g. lazily, for stream 
input) convey all kinds of different request body encodings, such that the 
RequestHandler has no idea how it was dispatched?

Something to think about...

- J.J.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Ryan McKinley

I'm not sure i understand preProcess( ) and what it gets us.

I like the model that

1. The URL path selects the RequestHandler
2. RequestParser = RequestHandler.getRequestParser()  (typically from
its default params)
3. SolrRequest = RequestParser.parse( HttpServletRequest )
4. handler.handleRequest( req, res );
5. write the response

If anyone needs to customize this chain of events, they could easily
write their own Servlet/Filter


On 1/17/07, Chris Hostetter [EMAIL PROTECTED] wrote:


Actually, i have to amend that ... it occurred to me in my sleep last night
that calling HttpServletRequest.getInputStream() wasn't safe unless we
*know* the RequestParser wants it, and will close it if it's non-null, so
the API for preProcess would need to look more like this...

 interface Pointer<T> {
   T get();
 }
 interface RequestParser {
   ...
   /** will be passed a Pointer to the raw input stream from the
    * HttpServletRequest ... if this method accesses the InputStream
    * from the pointer, it is required to close it if it is non-null.
    */
   public Iterable<ContentStream> preProcess(SolrParams headers,
                                             Pointer<InputStream> s);
   ...
 }



-Hoss




Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Ryan McKinley

On 1/17/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: I'm not sure i understand preProcess( ) and what it gets us.

it gets us the ability for a RequestParser to be able to pull out the raw
InputStream from the HTTP POST body, and make it available to the
RequestHandler as a ContentStream and/or it can wait until the servlet
has parsed the URL to get the params and *then* it can generate
ContentStreams based on those param values.

 - preProcess is necessary to write a RequestParser that can handle the
   current POST raw XML model,
 - process is necessary to write RequestParsers that can get file names
   or URLs out of escaped query params and fetch them as streams



I think the confusion is that (in my view) the RequestParser is the
*only* object able to touch the stream.  I don't think anything should
happen between preProcess() and process();  A RequestParser converts a
HttpServletRequest to a SolrRequest.  Nothing else will touch the
servlet request.



: 1. The URL path selects the RequestHandler
: 2. RequestParser = RequestHandler.getRequestParser()  (typically from
: its default params)
: 3. SolrRequest = RequestParser.parse( HttpServletRequest )
: 4. handler.handleRequest( req, res );
: 5. write the response

the problem i see with that, is that the RequestHandler shouldn't have any
say in what RequestParser is used -- ...



got it.  Then i vote we use a syntax like:

/path/registered/in/solr/config:requestparser?params

If no ':' is in the URL, use 'standard' parser

1. The URL path determines the RequestHandler
2. The URL path determines the RequestParser
3. SolrRequest = RequestParser.parse( HttpServletRequest )
4. handler.handleRequest( req, res );
5. write the response



: If anyone needs to customize this chain of events, they could easily
: write their own Servlet/Filter

this is why i was confused about your Filter comment earlier: if the only
way a user can customize behavior is by writing a Servlet, they can't
specify that servlet in a solr config file -- they'd have to unpack the
war and manually edit the web.xml ... which makes upgrading a pain.



I don't *think* this would happen often, and people would only do
it if they are unhappy with the default URL structure -> behavior
mapping.  I am not suggesting this would be the normal way to
configure solr.

The main case where I imagine someone would need to write their own
servlet/filter is if they insist the parameters need to be in the URL.
For example:

 /delete/<id>/

The URL structure I am proposing could not support this (unless you
had a handler mapped to each id :)

ryan


RE: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Cook, Jeryl
Sorry for the flame , but I've used spring on 2 large projects and it
worked out great.. you should check out some of the GUIs to help manage
the XML configuration files, if that is reason your team thought it was
a nightmare because of the configuration(we broke ours up to help).. 

Jeryl Cook

-Original Message-
From: Alan Burlison [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, January 16, 2007 10:52 AM
To: solr-dev@lucene.apache.org
Subject: Re: Update Plugins (was Re: Handling disparate data sources in
Solr)

Bertrand Delacretaz wrote:

 With all this talk about plugins, registries etc., /me can't help
 thinking that this would be a good time to introduce the Spring IoC
 container to manage this stuff.
 
 More info at http://www.springframework.org/docs/reference/beans.html
 for people who are not familiar with it. It's very easy to use for
 simple cases like the ones we're talking about.

Please, no.  I work on a big webapp that uses spring - it's a complete 
nightmare to figure out what's going on.

-- 
Alan Burlison
--


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Bertrand Delacretaz

On 1/16/07, Ryan McKinley [EMAIL PROTECTED] wrote:


...I think a DocumentParser registry is a good way to isolate this top level 
task...


With all this talk about plugins, registries etc., /me can't help
thinking that this would be a good time to introduce the Spring IoC
container to manage this stuff.

More info at http://www.springframework.org/docs/reference/beans.html
for people who are not familiar with it. It's very easy to use for
simple cases like the ones we're talking about.

-Bertrand


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Alan Burlison

Bertrand Delacretaz wrote:


With all this talk about plugins, registries etc., /me can't help
thinking that this would be a good time to introduce the Spring IoC
container to manage this stuff.

More info at http://www.springframework.org/docs/reference/beans.html
for people who are not familiar with it. It's very easy to use for
simple cases like the ones we're talking about.


Please, no.  I work on a big webapp that uses spring - it's a complete 
nightmare to figure out what's going on.


--
Alan Burlison
--


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread J.J. Larrea
I'm in frantic deadline mode so I'm just going to throw in some (hopefully) 
short comments...

At 11:02 PM -0800 1/15/07, Ryan McKinley wrote:
the one thing that still seems missing is those micro-plugins i was
 [SNIP]

  interface SolrRequestParser {
 SolrRequest process( HttpServletRequest req );
  }



I left out micro-plugins because i don't quite have a good answer
yet :)  This may be a place where a custom dispatcher servlet/filter
defined in web.xml is the most appropriate solution.

If the issue is munging HTTPServletRequest information, then a proper 
separation of concerns suggests responsibility should lie with a Servlet 
Filter, as Ryan suggests.

For example, while the Servlet 2.4 spec doesn't have specifications for how the 
servlet container can/should burst a multipart-MIME payload into separate 
files or streams, there are a number of 3rd party Filters which do this.

The Iterator<ContentStream> is a great idea because if each stream is read to 
completion before the next is opened it doesn't impose any limitation on 
individual stream length and doesn't require disk buffering.

(Of course some handlers may require access to more than one stream at a time; 
each time next() is called on the iterator before the current stream is closed, 
the remainder of that stream will have to be buffered in memory or on disk, 
depending on the part length.  Nonetheless that detail can be entirely hidden 
from the handler, as it should be.  I am not sure if any available 
ServletFilter implementations work this way, but it's certainly doable.)

But that detail is irrelevant for now; as I suggest below, using this API lets 
one immediately implement it with a single next() value holding the entire POST 
stream; that would answer the needs of the existing update request handling code, but 
establish an API to handle multi-part.  Whenever someone wants to write a 
multi-stream handler, they can write or find a better Iterator<ContentStream> 
implementation, which would best be cast as a ServletFilter.

I like the SolrRequestParser suggestion.

Me too.  It fills a hole in my vision for how this can all fit together.

Consider:
qt='RequestHandler'
wt='ResponseWriter'
rp='RequestParser ' (rb='SolrBuilder'?)

To avoid possible POST read-ahead stream munging: qt, wt, and rp
should be defined by the URL, not parameters.  (We can add special
logic to allow /query?qt=xxx)

For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people
define arbitrary path mapping for qt.

We could append 'wt', 'rb', and arbitrary text to the
registered path, something like
 /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...

(any other syntax ideas?)

No need for new syntax, I think.  The pathInfo or qt or other source resolves 
to a requestHandler CONFIG name.  The handler config is read to determine the 
handler class name.  It also can be consulted (with URL or form-POST params 
overriding if allowed by the  config) to decide which RequestParser to invoke 
BEFORE IT IS CALLED and which ResponseWriter to invoke AFTER.  Once those 
objects are set up, the request body gets executed.

Handler config inheritance (as I proposed in SOLR-104 point #2) would greatly 
simplify, for example, creating a dozen query handlers which used a particular 
invariant combination of qt, wt, and rp

The 'standard' RequestParser would:
GET:
 fill up SolrParams directly with req.getParameterMap()
if there is a 'post' parameter (post=XXX)
  return a stream with XXX as its content
else
  empty iterator.
Perhaps add a standard way to reference a remote URI stream.

POST:
 if( multipart ) {
  read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says 
is supposed to be handled automatically by the servlet container if the payload is 
application/x-www-form-urlencoded; in that case the input stream should be null.

  return an iterator over the collection of files

Collection of streams, per Hoss.

}
else {
  no parameters? parse parameters from the URL? /name:value/
  return the body stream

As above, this introduces unneeded complexity and should be avoided.

}
DEL:
 throw unsupported exception?


Maybe each RequestHandler could have a default RequestParser.  If we
limited the 'arbitrary path' to one level, this could be used to
generate more RESTful URLs. Consider:

/myadder/<a>/<b>/<c>/

/myadder maps to MyCustomHandler and that gives you
MyCustomRequestBuilder that maps /<a>/<b>/<c>/ to SolrParams


I think these are best left for an extra-SOLR layer, especially since SOLR URLs 
are meant for interprogram communication and not direct use by non-developer 
end users.  For example, for my org's website I have hundreds of Apache 
mod_rewrite rules which do URL munging such as
/journals/abc/7/3/192a.pdf
into
/journalroot/index.cfm?journal=abc&volume=7&issue=3&page=192&seq=a&format=pdf

Or someone could custom-code a subclass of SolrServlet which 

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Yonik Seeley

On 1/15/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: The most important issue is to nail down the external HTTP interface.

I'm not sure if i agree with that statement .. i would think that figuring
out the model for how updates should be handled in a generic way, what
all of the Plugin types are, and what their APIs should be is the most
important issue -- once we have those issues settled we could always
write a new SolrServlet2 that made the URL structure work any way we
want.


The number of people writing update plugins will be small compared to
the number of users using the external HTTP API (the URL + query
parameters, and the relationship URL-wise between different update
formats).  My main concern is making *that* as nice and utilitarian as
possible, and any plugin stuff is implementation and a secondary
concern IMO.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Yonik Seeley

On 1/16/07, J.J. Larrea [EMAIL PROTECTED] wrote:

- Revise the XML-based update code (broken out of SolrCore into a 
RequestHandler) to use all the above.


+++1, that's been needed forever.
If one has the time, I'd also advocate moving to StAX (via woodstox
for Java5, but it's built into Java6).

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Chris Hostetter

: The number of people writing update plugins will be small compared to
: the number of users using the external HTTP API (the URL + query
: parameters, and the relationship URL-wise between different update
: formats).  My main concern is making *that* as nice and utilitarian as
: possible, and any plugin stuff is implementation and a secondary
: concern IMO.

Agreed, but my point was that we should try to design the internal APIs
independently from the URL structure ... if we have a set of APIs,
it's easy to come up with a URL structure that will map well (we could
theoretically have several URL structures using different servlets) but if
we worry too much about what the URL should look like, we may hamstring
the model design.


-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Ryan McKinley

kind of like a binary stream equivalent to the way analyzers
can be customized -- is that kind of what you had in mind?



exactly.



  interface SolrDocumentParser {
    public void init(NamedList args);
Document parse(SolrParams p, ContentStream content);
  }




yes


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Thorsten Scherler
On Fri, 2007-01-12 at 15:41 -0500, Yonik Seeley wrote:
 On 1/10/07, Chris Hostetter [EMAIL PROTECTED] wrote:
  The one hitch i think to the notion that updates and queries map
  cleanly with something like this...
 
SolrRequestHandler = SolrUpdateHandler
SolrQueryRequest = SolrUpdateRequest
SolrQueryResponse = SolrUpdateResponse (possibly the same class)
QueryResponseWriter = UpdateResponseWriter (possible the same class)
 
  ...is that with queries, the input tends to be fairly simple.  very
  generic code can be run by the query Servlet to get all of the input
  params and build the SolrQueryRequest ... but with updates this isn't
  quite as simple.  there's the two issues i spoke of in my earlier mail
  which should be independently configurable:
1) where does the stream of update data come from?  is it in the raw
   POST body? is it in a POSTed multi-part MIME part? is it a remote
    resource referenced by URL?
2) how should the raw binary stream of update data be parsed?  is it
   XML? (in the current update format)  is it a CSV file?  is it a PDF?
 
  ...#2 can be what the SolrUpdateHandler interface is all about -- when
  hitting the update url you specify a ut (update type) that determines
  that logic ... but it should be independent of #1
 
 Right, you're getting at issues of why I haven't committed my CSV handler yet.
 It currently handles reading a local file (this is more like an SQL
 update handler... only a reference to the data is passed).  But I also
 wanted to be able to handle a POST of the data  , or even a file
 upload from a browser.  Then I realized that this should be generic...
 the same should also apply to XML updates, and potential future update
 formats like JSON.

I do not see the problem here. One just needs to add a couple of lines in
the upload servlet and change the csv plugin to read from an input stream (not
a local file).

See
https://issues.apache.org/jira/secure/attachment/12347425/solar-85.with.file.upload.diff
...
+boolean isMultipart = ServletFileUpload
+.isMultipartContent(new ServletRequestContext(request));
...
+if (isMultipart) {
+// Create a new file upload handler
...
+commandReader = new BufferedReader(new 
InputStreamReader(stream));

Now instead of 
+core.update(commandReader, responseWriter);
one would use the updateHandler for the format defined in the request
(format=json)

UpdateHandler handler = core.lookupUpdateHandler(format);
handler.update(commandReader, responseWriter);

Or do I miss something?

salu2



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Chris Hostetter

: The most important issue is to nail down the external HTTP interface.

I'm not sure if i agree with that statement .. i would think that figuring
out the model for how updates should be handled in a generic way, what
all of the Plugin types are, and what their APIs should be is the most
important issue -- once we have those issues settled we could always
write a new SolrServlet2 that made the URL structure work any way we
want.



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Chris Hostetter

:SolrRequestHandler = SolrUpdateHandler
:SolrQueryRequest = SolrUpdateRequest
:SolrQueryResponse = SolrUpdateResponse (possibly the same class)
:QueryResponseWriter = UpdateResponseWriter (possible the same class)
: 
:
: Is there any reason the plugin system needs a different RequestObject
: for Query vs Update?

as i said: only to the extent that Updates tend to have streams of data
that queries don't need (as far as i can imagine)

: SolrRequest would be the current SolrQueryRequest augmented with the
: HTTP method type and a way to get the raw post stream.

the raw POST stream may not be where the data is though -- consider the
file upload case, or the reading from a local file case, or the reading
from a list of remote URLs specified in params.

: I'm not sure the nitty gritty, but it should be as close to
: HttpServletRequest as possible.  If possible, I think handlers should
: choose how to handle the stream.
:
: It it is a remote resource, I think its the handlers job to open the stream.

i disagree ... it should be possible to create micro-plugins (I think i
called them UpdateSource instances in my original suggestion) that know
about getting streams in various ways, but don't care what format of data
is found on those streams -- that would be left for the
(Update)RequestHandler (which wouldn't need to know where the data came
from)

a JDBC/SQL updater would probably be a very special case -- where the
format and the stream are inherently related -- in which case a No-Op
UpdateSource could be used that didn't provide any stream, and the
JdbcUpdateRequestHandler would manage its JDBC streams directly.

: Likewise I don't see anything in QueryResponseWriter that should tie
: it to 'Query.'  Could it just be ResponseWriter?

probably -- as i said, both it and SolrQueryResponse could probably be
reused, the only hitch is that their names might be confusing (we could
always refactor all of their guts into super classes, and deprecate the
existing classes)

: While we are at it... is there any reason (for or against) exposing
: other parts of the HttpServletRequest to SolrRequestHandlers?

the biggest one is Unit testing -- giving plugins very simple APIs that
don't require a lot of knowledge about external APIs makes it much easier
to test them.  it also helps make it possible for us to future-proof
plugins.  other messages in this thread have discussed the possibility of
changing the URL structure, supporting more restful URLs and things like
that ... if we currently exposed lots of info from the HttpServletRequest
in the SolrQueryRequest, then making changes like that in a backwards
compatible way would be nearly impossible.  As it stands, we can write a
new Servlet that deals with input *completely* differently from the
current URL structure, and be 99% certain that existing plugins will
continue to work.

: While it is not the focus of Solr, someone (including me!) may want to
: implement some more complex authentication scheme -- perhaps setting a
: field on each document saying who added it and from what IP.
:
: stuff to consider: cookies, headers, remoteUser, remoteHost...

all of that could conceivably be done by changing the servlet to add
that info into the SolrParams.
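
(something along these lines, for example -- a purely hypothetical
fragment with made-up param names, just to show where that glue would
live:)

import java.util.HashMap;
import java.util.Map;
import javax.servlet.http.HttpServletRequest;

// hypothetical: the servlet folds request metadata into plain params
// before building the SolrQueryRequest, so plugins never have to touch
// the HttpServletRequest themselves
class RequestMetadataParams {
  static Map<String,String> extract(HttpServletRequest req) {
    Map<String,String> extra = new HashMap<String,String>();
    extra.put("request.remoteHost", req.getRemoteHost());
    extra.put("request.remoteUser", String.valueOf(req.getRemoteUser()));
    extra.put("request.userAgent",  String.valueOf(req.getHeader("User-Agent")));
    return extra; // would be merged into the SolrParams handed to the handler
  }
}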



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Ryan McKinley

On 1/15/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: The most important issue is to nail down the external HTTP interface.

I'm not sure I agree with that statement ... I would think that figuring
out the model of how updates should be handled in a generic way, what
all of the Plugin types are, and what their APIs should be is the most
important issue -- once we have those issues settled we could always
write a new SolrServlet2 that makes the URL structure work any way we
want.



-Hoss



I hate to inundate you with more code, but it seems like the best way
to describe a possible interface.

//---

import java.io.InputStream;
import java.io.Writer;

// a named, typed stream of raw content -- independent of where it came from
interface ContentStream
{
 String getName();
 String getContentType();
 InputStream getStream();
}

// read-only access to the request parameters
interface SolrParams
{
 String getParam( String name );
 String[] getParams( String name );
}

//-

interface SolrRequest
{
 SolrParams getParams();
 ContentStream[] getContentStreams(); // Iterator?
 long getStartTime();
}

interface SolrResponse
{
 int getStatus(); // ???
 NamedList getProps(); // ??? (NamedList being Solr's existing ordered name/value list)
}

//-

// handles any request (query, update, ...) and can suggest a default writer
interface SolrRequestProcessor
{
 SolrResponse process( SolrRequest req );
 SolrResponseWriter getWriter( SolrRequest req ); // default
}

interface SolrResponseWriter
{
 void write(Writer writer, SolrRequest request, SolrResponse response);
 String getContentType(SolrRequest request, SolrResponse response);
}

//-

Then a servlet (or filter) could be in charge of parsing the URL/params
into a request.  It would pick a Processor and send the output to a
writer.  If someone wanted a custom URL scheme, they would override the
servlet/filter.
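
As a sanity check that the interfaces hang together, here is a
stripped-down sketch of what that servlet might do (lookup and error
handling are hand-waved, and the helper names are placeholders, not a
real implementation):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SolrDispatchServlet extends HttpServlet {
  protected void service(HttpServletRequest req, HttpServletResponse res)
      throws ServletException, IOException {
    // 1) parse the URL and params into a SolrRequest (details hand-waved)
    SolrRequest solrReq = buildRequest(req);
    // 2) pick a processor from the path, e.g. /query/dismax vs /add/csv
    SolrRequestProcessor processor = lookupProcessor(req.getPathInfo());
    SolrResponse solrRsp = processor.process(solrReq);
    // 3) let the processor's default writer render the result
    SolrResponseWriter writer = processor.getWriter(solrReq);
    res.setContentType(writer.getContentType(solrReq, solrRsp));
    writer.write(res.getWriter(), solrReq, solrRsp);
  }

  // placeholders -- the real versions are where the interesting decisions live
  private SolrRequest buildRequest(HttpServletRequest req) { throw new UnsupportedOperationException(); }
  private SolrRequestProcessor lookupProcessor(String path) { throw new UnsupportedOperationException(); }
}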

Perhaps SolrRequest should have an object for solrCore.  It would be
better if it does not need to go to the static
SolrCore.getUpdateHandler().

I am proposing ContentStream[] getContentStreams() because it would be
simpler than an iterator.  In the case of multipart upload, if you
offered an API closer to:
http://jakarta.apache.org/commons/fileupload/streaming.html
you would not have any parameters until after you had read each item and
converted the form fields to parameters.
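
For reference, the streaming API on that page looks roughly like this
(sketched from memory, so treat the exact class names as approximate) --
the point being that form fields only show up as you iterate over the
parts, so you can't hand a handler its params and its streams separately
up front:

import java.io.InputStream;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;

class MultipartSketch {
  static void readParts(HttpServletRequest request) throws Exception {
    ServletFileUpload upload = new ServletFileUpload();
    FileItemIterator iter = upload.getItemIterator(request);
    while (iter.hasNext()) {
      FileItemStream item = iter.next();
      InputStream stream = item.openStream();
      if (item.isFormField()) {
        // a regular form field: only now can it be turned into a param
        System.out.println("param " + item.getFieldName() + "=" + Streams.asString(stream));
      } else {
        // an uploaded file: this is the content stream a handler would parse
        System.out.println("file field " + item.getFieldName() + " (" + item.getName() + ")");
      }
    }
  }
}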

Thoughts?


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Bertrand Delacretaz

On 1/16/07, Chris Hostetter [EMAIL PROTECTED] wrote:


  interface SolrRequestParser {
 SolrRequest process( HttpServletRequest req );
  }

(the trick being that the servlet would need to parse the st info out
of the URL (either from the path or from the QueryString) directly without
using any of the HttpServletRequest.getParameter*() methods...


I haven't followed all of the discussion, but wouldn't it be easier to
use the request path, instead of parameters, to select these
RequestParsers?

i.e. solr/update/pdf-parser, solr/update/hssf-parser,
solr/update/my-custom-parser, etc.
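
(for what it's worth, that fits nicely with the "don't call
getParameter*()" constraint quoted above -- a tiny hypothetical fragment,
assuming the servlet is mapped under /update/*:)

import javax.servlet.http.HttpServletRequest;

class ParserNameFromPath {
  // with the servlet mapped to /update/*, getPathInfo() returns e.g. "/pdf-parser"
  static String parserName(HttpServletRequest request) {
    String path = request.getPathInfo();
    if (path == null || path.length() <= 1) {
      return "default"; // made-up fallback name
    }
    return path.substring(1).split("/")[0]; // "pdf-parser"
  }
}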

-Bertrand


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-12 Thread Ryan McKinley


Is there any reason anyone would want/need to run multiple Indexers?
This would be the equivalent of running DirectUpdateHandler and
DirectUpdateHandler2 at the same time.  If not, the Indexer could be
stored in SolrCore and each plugin could talk to it directly.



I just realized that the updateHandler is already in SolrCore.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-12 Thread Yonik Seeley

On 1/10/07, Chris Hostetter [EMAIL PROTECTED] wrote:

The one hitch, I think, to the notion that updates and queries map
cleanly with something like this...

  SolrRequestHandler = SolrUpdateHandler
  SolrQueryRequest = SolrUpdateRequest
  SolrQueryResponse = SolrUpdateResponse (possibly the same class)
  QueryResponseWriter = UpdateResponseWriter (possibly the same class)

...is that with queries, the input tends to be fairly simple.  Very
generic code can be run by the query Servlet to get all of the input
params and build the SolrQueryRequest ... but with updates this isn't
quite as simple.  There are the two issues I spoke of in my earlier mail,
which should be independently configurable:
  1) where does the stream of update data come from?  is it in the raw
     POST body? is it in a POSTed multi-part MIME part? is it a remote
     resource referenced by URL?
  2) how should the raw binary stream of update data be parsed?  is it
     XML? (in the current update format)  is it a CSV file?  is it a PDF?

...#2 can be what the SolrUpdateHandler interface is all about -- when
hitting the update URL you specify a ut (update type) that determines
that logic ... but it should be independent of #1


Right, you're getting at issues of why I haven't committed my CSV handler yet.
It currently handles reading a local file (this is more like an SQL
update handler... only a reference to the data is passed).  But I also
wanted to be able to handle a POST of the data, or even a file
upload from a browser.  Then I realized that this should be generic...
the same should also apply to XML updates, and potential future update
formats like JSON.

The most important issue is to nail down the external HTTP interface.
If the URL structure changes, it's also an opportunity to change
whatever we don't like about the current XML format.  The old update
URL can still implement the original syntax.
It's also an opportunity to make the interface a little more REST-like
if we so choose.

Brainstorming:
- for errors, use HTTP error codes instead of putting them in the XML as we do now.

- perhaps get rid of the enclosing <add>... that could be a verb in
the URL, or for multiple documents, change it to <docs>.

- add information about the data in the URL:

POST /solr/add?format=json&overwrite=true
[
 {"field1":"value1", "field2":[false,true,false,true,true]}
]

POST /solr/add?format=csv&separator=,&...
field1,field2
val1,val2

This is more flexible as it allows one to add more metadata about the
data w/o having to change the data format.  For example, if one wanted
to be able to specify which index the add should go to, or other info
about the handling of the data, it's simple to add an additional param
in the URL.

- For browser friendliness, we could support a standard mechanism for
putting the body in the URL (not for general use since the URL can be
size limited, but good for testing).  A rough sketch of how a servlet
might honor such a body param appears after this list.

POST /solr/add?format=json&overwrite=true&body=[{"field1":"value1"}]

- more REST like?
PUT /solr/document/1003?title=howdy&author=snafoo&cat=misc&cat=book
#not sure I like that format, and we would still want the multi-doc
format anyway

- more REST like?
DEL /solr/document/1003
 OR
DEL /solr/document?id=1003
 OR
POST /solr/document/delete?id=1003

#how to do delete-by-query, optimize, etc?
DEL/POST /solr/document/delete?q=id:[10 TO 20]
 OR
POST /solr/command/delete?id=1002&id=1003&q=id:[1000 TO 1010]
 OR
POST /solr/command/deletebyquery?q=id:[10 TO 20]

POST /solr/command/optimize?wait=true

- administrative commands, setting certain limits

POST /solr/command/set?mergeFactor=100&maxBufferedDocs=1000
POST /solr/command/set?logLevel=3
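
A rough sketch of how a servlet might honor that body param (entirely
hypothetical -- the param name and fallback behavior are just
illustrations):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.servlet.http.HttpServletRequest;

class UpdateBodyResolver {
  // use the "body" param as the update data if present, otherwise the raw
  // POST body; note that for a form-encoded POST, getParameter() could
  // consume the body first, but the URL-embedded case is really only
  // meant for browser/GET-style testing anyway
  static InputStream resolve(HttpServletRequest req) throws IOException {
    String body = req.getParameter("body");
    if (body != null) {
      return new ByteArrayInputStream(body.getBytes("UTF-8"));
    }
    return req.getInputStream();
  }
}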

You get the idea of some of the options available.
Ideas?  Thoughts?

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-12 Thread Ryan McKinley


- for errors, use HTTP error codes instead of putting them in the XML as we do now.



yes!



- more REST like?



I would like the URL form to look like this:
 /solr/${verb}/?param=value&param=value...
or:
 /solr/${verb}/${handler}/?param=value&param=value...

Commands should work with or without the trailing '/'

/solr/query should continue to support qt=dismax, but I don't think we
should add params for 'at' (add type), 'ct' (commit type), etc...

Examples:

/solr/query/?q=...  (use standard)
/solr/query/dismax/?q=...
/solr/add/
POST: <docs>...</docs>
/solr/add/sql/?q=SELECT *
/solr/add/csv/?param=xxx...
 POST (fileupload): file.csv
/solr/commit/?waitSearcher=true
/solr/delete/?id=AAA&id=BBB&q=id:[* TO CCC]
/solr/optimize/?waitSearcher=true


RequestHandlers would be registered with a verb (command?) in solrconfig.xml:
  <requestHandler command="query" name="dismax" ... />
  <requestHandler command="add" name="xml" class=".." />
  <requestHandler command="add" name="sql" class=".." />
  <requestHandler command="commit" .. />

RequestHandlers would register verb/name, not just name.  It would also
be nice to specify the default handler in solrconfig.xml (rather than
hardcoded to 'standard'):
  <requestHandler command="query" name="dismax" isDefault="true" />
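
(a minimal sketch of what that registry might look like -- the key scheme
and defaulting here are only one guess, and the handler type is left
generic so it doesn't assume any particular interface:)

import java.util.HashMap;
import java.util.Map;

// handlers are registered by command ("verb") plus name, with an optional
// per-command default that is used when no name appears in the URL
class HandlerRegistry<H> {
  private final Map<String,H> handlers = new HashMap<String,H>();

  void register(String command, String name, H handler, boolean isDefault) {
    handlers.put(command + "/" + name, handler);
    if (isDefault) {
      handlers.put(command, handler); // e.g. bare /solr/query -> dismax
    }
  }

  H lookup(String command, String name) {
    H h = (name == null) ? null : handlers.get(command + "/" + name);
    return (h != null) ? h : handlers.get(command); // fall back to the default
  }
}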




DEL /solr/document?id=1003
  OR
POST /solr/document/delete?id=1003



I think the request method should be added to the base SolrRequest and
let the handler decide if it will do something different for GET vs.
POST vs. DEL

Conceptually /commit should be a POST, but it may be nice to use your
browser (GET) to run commands like:
 /solr/commit?waitSearcher=true

If the method is passed to the RequestHandler, this will let anyone
who is unhappy with the standard behavior change it easily.  Someone
may want to require that you send DEL to delete and POST to change anything
-- or the opposite if you care.
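
(for example -- hypothetical names, using the SolrRequest/SolrResponse
sketched elsewhere in this thread, just to show what having the method on
the request would allow a handler to do:)

// hypothetical fragment: if the method were exposed on the request,
// a handler could enforce (or relax) its own rules about which verbs do what
class StrictDeleteHandler {
  SolrResponse handle(String method, SolrRequest req) {
    if ("DELETE".equals(method) || "POST".equals(method)) {
      return doDelete(req);
    }
    throw new IllegalArgumentException("deletes must be sent as DELETE or POST");
  }

  private SolrResponse doDelete(SolrRequest req) {
    return null; // the actual delete logic would go here
  }
}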



- administrative commands, setting certain limits

POST /solr/command/set?mergeFactor=100&maxBufferedDocs=1000
POST /solr/command/set?logLevel=3



In my proposal, this could be something like:
/solr/setconfig?logLevel=3

If someone wrote a handler that set the variables, then saved them in
solrconfig.xml, it could be:
/solr/setconfig/save/?mergeFactor=100

good good
ryan


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-12 Thread Ryan McKinley


  SolrRequestHandler = SolrUpdateHandler
  SolrQueryRequest = SolrUpdateRequest
  SolrQueryResponse = SolrUpdateResponse (possibly the same class)
  QueryResponseWriter = UpdateResponseWriter (possibly the same class)



Is there any reason the plugin system needs a different RequestObject
for Query vs Update?

I think the most flexible system would have the plugin manager take
any HttpServletRequest, convert it to a 'SolrRequest' and pass it to a
RequestHandler.

SolrRequest would be the current SolrQueryRequest augmented with the
HTTP method type and a way to get the raw post stream.

Likewise I don't see anything in QueryResponseWriter that should tie
it to 'Query.'  Could it just be ResponseWriter?

This way the plugin system would not need to care if it's a
query/update or some other command someone wants to add.



  1) where does the stream of update data come from?  is it in the raw
     POST body? is it in a POSTed multi-part MIME part? is it a remote
     resource referenced by URL?


I'm not sure about the nitty-gritty, but it should be as close to
HttpServletRequest as possible.  If possible, I think handlers should
choose how to handle the stream.

If it is a remote resource, I think it's the handler's job to open the stream.


  2) how should the raw binary stream of update data be parsed?  is it
 XML? (in the current update format)  is it a CSV file?  is it a PDF?

...#2 can be what the SolrUpdateHandler interface is all about -- when
hitting the update URL you specify a ut (update type) that determines
that logic ... but it should be independent of #1

maybe the full list of stream sources for #1 is finite and the code for
all of them can live in the UpdateServlet ... but it still needs to be an
option configured as a param, and it seems like it might as well be a
plugin so it's easy for people to write new ones in the future.

-Hoss



While we are at it... is there any reason (for or against) exposing
other parts of the HttpServletRequest to SolrRequestHandlers?

While it is not the focus of Solr, someone (including me!) may want to
implement some more complex authentication scheme -- perhaps setting a
field on each document saying who added it and from what IP.

stuff to consider: cookies, headers, remoteUser, remoteHost...


ryan


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-10 Thread Chris Hostetter

: indexing app I wrote into SOLR.  It occurred to me that it would almost
: be simpler to use the plugin-friendly QueryRequest mechanism rather than
: the UpdateRequest mechanism; coupled with what you wrote below, Hoss, it
: makes me think that a little refactoring of request handling might go a
: long way:

I think you are definitely right ... refactoring some of the
SolrRequestHandler/SolrQueryRequest/SolrQueryResponse interfaces/abstract
base classes to have some bases extendable by some other
SolrUpdateHandler/SolrUpdateRequest/SolrUpdateResponse interfaces/abstract
base classes would go a long way.

Your post also made me realize that I'd totally discounted the issue of
returning information about the *results* of the update back to the client
... currently it's done with XML, which is OK because in order to send an
update the client has to understand XML -- but if we start supporting
arbitrary formats for updates, we need to be able to respond in kind.
Your comment about reusing SolrQueryResponse and QueryResponseWriters for
this sounds perfect.

The one hitch, I think, to the notion that updates and queries map
cleanly with something like this...

  SolrRequestHandler = SolrUpdateHandler
  SolrQueryRequest = SolrUpdateRequest
  SolrQueryResponse = SolrUpdateResponse (possibly the same class)
  QueryResponseWriter = UpdateResponseWriter (possibly the same class)

...is that with queries, the input tends to be fairly simple.  Very
generic code can be run by the query Servlet to get all of the input
params and build the SolrQueryRequest ... but with updates this isn't
quite as simple.  There are the two issues I spoke of in my earlier mail,
which should be independently configurable:
  1) where does the stream of update data come from?  is it in the raw
     POST body? is it in a POSTed multi-part MIME part? is it a remote
     resource referenced by URL?
  2) how should the raw binary stream of update data be parsed?  is it
     XML? (in the current update format)  is it a CSV file?  is it a PDF?

...#2 can be what the SolrUpdateHandler interface is all about -- when
hitting the update URL you specify a ut (update type) that determines
that logic ... but it should be independent of #1

maybe the full list of stream sources for #1 is finite and the code for
all of them can live in the UpdateServlet ... but it still needs to be an
option configured as a param, and it seems like it might as well be a
plugin so it's easy for people to write new ones in the future.

-Hoss