Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Thorsten Scherler
On Fri, 2007-01-12 at 15:41 -0500, Yonik Seeley wrote:
 On 1/10/07, Chris Hostetter [EMAIL PROTECTED] wrote:
  The one hitch i think to the notion that updates and queries map
  cleanly with something like this...
 
SolrRequestHandler = SolrUpdateHandler
SolrQueryRequest = SolrUpdateRequest
SolrQueryResponse = SolrUpdateResponse (possibly the same class)
QueryResponseWriter = UpdateResponseWriter (possibly the same class)
 
  ...is that with queries, the input tends to be fairly simple.  very
  generic code can be run by the query Servlet to get all of the input
  params and build the SolrQueryRequest ... but with updates this isn't
  quite as simple.  there's the two issues i spoke of in my earlier mail
  which should be independently configurable:
1) where does the stream of update data come from?  is it in the raw
   POST body? is it in a POSTed multi-part MIME part? is it a remote
   resource referenced by URL?
2) how should the raw binary stream of update data be parsed?  is it
   XML? (in the current update format)  is it a CSV file?  is it a PDF?
 
  ...#2 can be what the SolrUpdateHandler interface is all about -- when
  hitting the update url you specify a ut (update type) that determines
  that logic ... but it should be independent of #1
 
 Right, you're getting at issues of why I haven't committed my CSV handler yet.
 It currently handles reading a local file (this is more like an SQL
 update handler... only a reference to the data is passed).  But I also
 wanted to be able to handle a POST of the data  , or even a file
 upload from a browser.  Then I realized that this should be generic...
 the same should also apply to XML updates, and potential future update
 formats like JSON.

I do not see the problem here. One just needs to add a couple of lines in
the upload servlet and change the CSV plugin to read from an input stream
(not a local file).

See
https://issues.apache.org/jira/secure/attachment/12347425/solar-85.with.file.upload.diff
...
+boolean isMultipart = ServletFileUpload
+.isMultipartContent(new ServletRequestContext(request));
...
+if (isMultipart) {
+// Create a new file upload handler
...
+commandReader = new BufferedReader(new InputStreamReader(stream));

Now instead of 
+core.update(commandReader, responseWriter);
one would use the update handler for the format defined in the request
(format=json):

UpdateHandler handler = core.lookupUpdateHandler(format);
handler.update(commandReader, responseWriter);

Or am I missing something?

salu2
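
To make the lookup-by-format idea above concrete, here is a minimal sketch. It is not the Solr API: `UpdateHandler`, `UpdateHandlerRegistry`, and the toy line-counting handler are hypothetical names invented for illustration, and the registry only stands in for the `core.lookupUpdateHandler(format)` call proposed in the message.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface; the real plugin API is still under discussion in this thread.
interface UpdateHandler {
    void update(Reader commands, Writer response) throws IOException;
}

// Hypothetical registry standing in for core.lookupUpdateHandler(format).
class UpdateHandlerRegistry {
    private final Map<String, UpdateHandler> handlers = new HashMap<>();

    void register(String format, UpdateHandler handler) {
        handlers.put(format, handler);
    }

    UpdateHandler lookupUpdateHandler(String format) {
        UpdateHandler handler = handlers.get(format);
        if (handler == null) {
            throw new IllegalArgumentException("no update handler for format: " + format);
        }
        return handler;
    }
}

class UpdateDispatchDemo {
    // Toy handler: counts the input lines and writes "<label>:<count>".
    static UpdateHandler lineCounter(String label) {
        return (commands, response) -> {
            BufferedReader reader = new BufferedReader(commands);
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            response.write(label + ":" + lines);
        };
    }

    // Mirrors the two lines in the message: look up by format, then update.
    static String dispatch(UpdateHandlerRegistry registry, String format, String body) {
        StringWriter out = new StringWriter();
        try {
            registry.lookupUpdateHandler(format).update(new StringReader(body), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The servlet side stays the same regardless of format; only the registered handler changes, which is the point the message is making.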



[jira] Commented: (SOLR-86) [PATCH] standalone updater cli based on httpClient

2007-01-15 Thread Bertrand Delacretaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464709
 ] 

Bertrand Delacretaz commented on SOLR-86:
-

I like the idea of a very simple update only client.

It's probably simple enough to create two versions, one using HttpClient and 
one with no dependencies apart from the JDK? I agree with Hoss that the post.sh 
replacement should use the latter.

IMHO it's good to show the use of HttpClient for people who're going to base 
more complex clients on it, and a no-dependencies client is useful for simple 
cases.

Maybe (thinking out loud here) both clients could implement a common 
SolrUpdateClientInterface, and update+search clients would implement a 
SolrSearchInterface as well.

 [PATCH]  standalone updater cli based on httpClient
 ---

 Key: SOLR-86
 URL: https://issues.apache.org/jira/browse/SOLR-86
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Thorsten Scherler
 Attachments: simple-post-using-urlconnection-approach.patch, 
 solr-86.diff, solr-86.diff


 We need a cross-platform replacement for post.sh.
 The attached code is a direct replacement for post.sh since it does exactly
 the same thing.
 In the future one can extend the CLI with other features like auto commit,
 etc.
 Right now the code assumes that SOLR-85 is applied, since we are using the
 servlet from that issue to actually do the update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-86) [PATCH] standalone updater cli based on httpClient

2007-01-15 Thread Thorsten Scherler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464737
 ] 

Thorsten Scherler commented on SOLR-86:
---

Hi Hoss, I had a look at your version and it is good as gold.
I personally prefer the httpClient since the method is smaller, but Bertrand and 
yourself are right: the dependency jar price for a simple replacement is ATM too 
high.

The only thing that I would add is directory support:
...
+  if (srcFile.exists()) {
+    if (srcFile.isDirectory()) {
+      File[] fileSet = srcFile.listFiles();
+      for (int i = 0; i < fileSet.length; i++) {
+        File file = fileSet[i];
+        tool.postFile(file, out);
+      }
+    } else {
+      tool.postFile(srcFile, out);
+    }
+    System.out.println();
+  } else {
+    System.err.println(srcFile + " does not exist");
+  }

I agree to your patch as the official replacement of post.sh. I further agree 
with Bertrand that we may include the HttpClient patch as a base demonstration 
for more complex client apps.
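
As a sketch of the directory handling suggested above, the helper below expands one command-line argument into the list of files to post. `PostTargets` and `expand` are hypothetical names, not part of either patch. It also departs from the quoted snippet in two ways, both assumptions on my part: it tolerates `listFiles()` returning `null`, and it skips subdirectories rather than passing them to the poster.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PostTargets {
    // Returns the files that would be posted for one argument: the file itself,
    // or the regular files directly inside a directory. A missing path yields
    // an empty list so the caller can report "does not exist".
    static List<File> expand(File src) {
        List<File> targets = new ArrayList<>();
        if (!src.exists()) {
            return targets;
        }
        if (src.isDirectory()) {
            File[] children = src.listFiles(); // null if the directory cannot be read
            if (children != null) {
                Arrays.sort(children); // listFiles() makes no ordering guarantee
                for (File child : children) {
                    if (child.isFile()) {
                        targets.add(child);
                    }
                }
            }
        } else {
            targets.add(src);
        }
        return targets;
    }
}
```

A caller would loop over `expand(srcFile)` and hand each entry to whatever does the actual POST.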





[jira] Created: (SOLR-109) variable substitution in lucene query params

2007-01-15 Thread Thorsten Scherler (JIRA)
variable substitution in lucene query params


 Key: SOLR-109
 URL: https://issues.apache.org/jira/browse/SOLR-109
 Project: Solr
  Issue Type: New Feature
Reporter: Thorsten Scherler


Allowing variable substitution in the lucene query params seems pretty slick 
... a more general solution might be to modify the SolrQueryParser
directly to have a new void setParamVariables(SolrParams p) method.
http://marc.theaimsgroup.com/?t=11671237641r=1w=2





Re: switch to native locks by default?

2007-01-15 Thread Michael McCandless

Chris Hostetter wrote:

: Ah, I hadn't realized that they might not be supported everywhere... I

I'm just trusting the javadoc for NativeFSLockFactory ... i have no idea
if it's accurate or not.


Hi!  I had added the caveat about native locks based on my dicey
experience getting them working over NFS (NFS locking was not turned
on by default in my setup; and, frustratingly, it would take ~35
seconds for a timeout to tell me this).

I don't have any specific evidence that other OS/filesystems are
problematic but then again I haven't done much research to understand
overall portability of Java's native lock interface.  It would not
surprise me if other OS/filesystems had issues.

I was hoping by getting the NativeFSLockFactory out there that it
would get some healthy testing first and then we could use that
feedback to decide whether benefits outweigh the risks of making it
the default.  It's not clear how many people have actually tested it
at this point, though!


: The current locking can also guard against mistakes though (multiple
: instances of Solr trying to write to the same dir, someone opening a
: Luke index on it, etc).

right ... but it's only useful if all of the potential clients are using
the same locking mechanism ... right now it's only safe to do any of those
things if all the apps use SimpleFSLockFactory.  all the more reason to
make the factory and the lockDir configurable in Solr i guess.


Mike
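
One way to find out whether native locking works on a given filesystem is to probe it directly. The sketch below uses the JDK's `java.nio.channels.FileLock`; `NativeLockProbe`, `canLock`, and the `probe.lock` file name are hypothetical, and this is not how Lucene's `NativeFSLockFactory` is implemented, only a small check one could run against a candidate lock directory.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

class NativeLockProbe {
    // Tries to take and release an OS-level lock on a scratch file in lockDir.
    // Returns true only if the lock was actually acquired.
    static boolean canLock(File lockDir) {
        File lockFile = new File(lockDir, "probe.lock");
        try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
             FileChannel channel = raf.getChannel()) {
            FileLock lock = channel.tryLock(); // null if another process holds the lock
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        } catch (OverlappingFileLockException e) {
            return false; // the lock is already held inside this JVM
        } catch (IOException e) {
            return false; // e.g. missing directory, or locking not supported here
        } finally {
            lockFile.delete();
        }
    }
}
```

Note that on a filesystem like the NFS setup described above, the call may block for a long time before failing, so a probe like this tells you whether locking works but not how quickly it fails.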


Re: svn commit: r496274 - /incubator/solr/trunk/src/java/org/apache/solr/core/Config.java

2007-01-15 Thread Chris Hostetter

: Log: SolrConfig says 'system property solr.solr.home not set' in the
: log, when using default Solr home

this seems like an odd thing to call out in the log ... it implies the
system property should be set, but using JNDI to set the solr.home is just
as valid of a way to specify where things live.

: -log.info("Solr home defaulted to '" + instanceDir + "'");
: +log.info("Solr home defaulted to '" + instanceDir + "' (system property " + prop + " not set)");




-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Chris Hostetter

: The most important issue is to nail down the external HTTP interface.

I'm not sure if i agree with that statement .. i would think that figuring
out the model or how updates should be handled in a generic way, what
all of the Plugin types are, and what their APIs should be is the most
important issue -- once we have those issues settled we could always
write a new SolrServlet2 that made the URL structure work any way we
want.



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Chris Hostetter

:SolrRequestHandler = SolrUpdateHandler
:SolrQueryRequest = SolrUpdateRequest
:SolrQueryResponse = SolrUpdateResponse (possibly the same class)
:QueryResponseWriter = UpdateResponseWriter (possibly the same class)
: 
:
: Is there any reason the plugin system needs a different RequestObject
: for Query vs Update?

as i said: only to the extent that Updates tend to have streams of data
that queries don't need (as far as i can imagine)

: SolrRequest would be the current SolrQueryRequest augmented with the
: HTTP method type and a way to get the raw post stream.

the raw POST stream may not be where the data is though -- consider the
file upload case, or the reading from a local file case, or the reading
from a list of remote URLs specified in params.

: I'm not sure the nitty gritty, but it should be as close to
: HttpServletRequest as possible.  If possible, I think handlers should
: choose how to handle the stream.
:
: If it is a remote resource, I think it's the handler's job to open the stream.

i disagree ... it should be possible to create micro-plugins (I think i
called them UpdateSource instances in my original suggestion) that know
about getting streams in various ways, but don't care what format of data
is found on those streams -- that would be left for the
(Update)RequestHandler (which wouldn't need to know where the data came
from)

a JDBC/SQL updater would probably be a very special case -- where the
format and the stream are inherently related -- in which case a No-Op
UpdateSource could be used that didn't provide any stream, and the
JdbcUpdateRequestHandler would manage its JDBC streams directly.
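
The split described above, where one plugin knows where the bytes come from and another knows what they mean, could be sketched roughly as follows. Everything here is an assumption for illustration: `UpdateSource` takes its name from the message, but its signature, `UpdateFormatHandler`, `UpdatePipeline`, and the `body` parameter are invented, and the "CSV" handler just counts non-empty lines.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical: knows where the bytes come from, not what format they are in.
interface UpdateSource {
    InputStream open(Map<String, String> params) throws IOException;
}

// Hypothetical: knows the format, not where the bytes came from.
interface UpdateFormatHandler {
    int handle(InputStream data) throws IOException; // returns documents seen
}

class UpdatePipeline {
    // Source that takes the data from a parameter (stand-in for a raw POST body).
    static final UpdateSource INLINE = params ->
        new ByteArrayInputStream(params.getOrDefault("body", "").getBytes(StandardCharsets.UTF_8));

    // No-op source for handlers (such as a JDBC one) that manage their own input.
    static final UpdateSource NONE = params -> null;

    // Toy "CSV" handler: one document per non-empty line.
    static final UpdateFormatHandler CSV = data -> {
        String text = new String(data.readAllBytes(), StandardCharsets.UTF_8);
        int docs = 0;
        for (String line : text.split("\n")) {
            if (!line.trim().isEmpty()) {
                docs++;
            }
        }
        return docs;
    };

    // Any source can be paired with any handler; neither knows about the other.
    static int run(UpdateSource source, UpdateFormatHandler handler, Map<String, String> params) {
        try (InputStream in = source.open(params)) { // a null stream is simply not closed
            return handler.handle(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A file-upload or remote-URL source would be another `UpdateSource` implementation; the CSV handler would not change.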

: Likewise I don't see anything in QueryResponseWriter that should tie
: it to 'Query.'  Could it just be ResponseWriter?

probably -- as i said, both it and SolrQueryResponse could probably be
reused, the only hitch is that their names might be confusing (we could
always refactor all of their guts into super classes, and deprecate the
existing classes)

: While we are at it... is there any reason (for or against) exposing
: other parts of the HttpServletRequest to SolrRequestHandlers?

the biggest one is Unit testing -- giving plugins very simple APIs that
don't require a lot of knowledge about external APIs make it much easier
to test them.  it also helps make it possible for us to future-proof
plugins.  other messages in this thread have discussed the possibility of
changing the URL structure, supporting more restful URLs and things like
that ... if we currently exposed lots of info from the HttpServletRequest
in the SolrQueryRequest, then making changes like that in a backwards
compatible way would be nearly impossible.  As it stands, we can write a
new Servlet that deals with input *completely* differently from the
current URL structure, and be 99% certain that existing plugins will
continue to work.

: While it is not the focus of solr, someone (including me!) may want to
: implement some more complex authentication scheme - Perhaps setting a
: field on each document saying who added it and from what IP.
:
: stuff to consider: cookies, headers, remoteUser, remoteHost...

all of that could conceivably be done by changing the servlet to add
that info into the SolrParams.



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Ryan McKinley

On 1/15/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: The most important issue is to nail down the external HTTP interface.

I'm not sure if i agree with that statement .. i would think that figuring
out the model or how updates should be handled in a generic way, what
all of the Plugin types are, and what their APIs should be is the most
important issue -- once we have those issues settled we could always
write a new SolrServlet2 that made the URL structure work any way we
want.



-Hoss



I hate to inundate you with more code, but it seems like the best way
to describe a possible interface.

//---

interface ContentStream
{
 String getName();
 String getContentType();
 InputStream getStream();
}

interface SolrParams
{
 String getParam( String name );
 String[] getParams( String name );
}

//-

interface SolrRequest
{
 SolrParams getParams();
 ContentStream[] getContentStreams(); // Iterator?
 long getStartTime();
}

interface SolrResponse
{
 int getStatus(); // ???
 NamedList getProps(); // ???
}

//-

interface SolrRequestProcessor
{
 SolrResponse process( SolrRequest req );
 SolrResponseWriter getWriter( SolrRequest req ); // default
}

interface SolrResponseWriter
{
 void write(Writer writer, SolrRequest request, SolrResponse response);
 String getContentType(SolrRequest request, SolrResponse response);
}

//-

Then a servlet (or filter) could be in charge of parsing URL/params
into a request.  It would pick a Processor and send the output to a
writer.  If someone wanted a custom URL scheme, they would override the
servlet/filter.

Perhaps SolrRequest should have an object for solrCore.  It would be
better if it does not need to go to the static
SolrCore.getUpdateHandler().

I am proposing ContentStream[] getContentStreams() because it would be
simpler than an iterator.  In the case of multipart upload, if you
offered an API closer to:
http://jakarta.apache.org/commons/fileupload/streaming.html
You would not have any parameters until after you read each Item and
convert the form fields to parameters.

Thoughts?


[jira] Commented: (SOLR-105) Duck typing for Document/Field plus to/from solr conversions for field names.

2007-01-15 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465016
 ] 

Erik Hatcher commented on SOLR-105:
---

Let's discuss this further.  I'm not quite on board with document round tripping 
just yet, as I think we need more of a Lucene Hits-like concept on the Ruby 
side to navigate the results in an iterator/Enumerable type fashion.  Keep in 
mind that what comes back from Solr may or may not be the full document that 
was added originally, due to the fields not being requested or the schema not 
configured to store them.  Having partial documents on the Ruby side seems 
awkward from a user perspective and does not adhere to the principle of least 
surprise.

 Duck typing for Document/Field plus to/from solr conversions for field names. 
 --

 Key: SOLR-105
 URL: https://issues.apache.org/jira/browse/SOLR-105
 Project: Solr
  Issue Type: Improvement
  Components: clients - ruby - flare
 Environment: Darwin rocket 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 
 25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
Reporter: William Groppe
 Attachments: doc_and_field_roundtrip.patch


 Hey Erik,
 Take a close look at this patch, I've extended Ed's code quite a bit.  You 
 may want to hold off applying this until we all discuss it.  But on the plus 
 side, it has 100% test coverage, and allows round trips of full documents to 
 Solr.
 Will





[jira] Commented: (SOLR-107) Iterable NamedList with java5 generics

2007-01-15 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465047
 ] 

Hoss Man commented on SOLR-107:
---

i only briefly skimmed the patch, but a couple quick questions came to mind...

1) instead of creating a new NameValuePair<T> interface, couldn't NamedList 
just implement Iterable<Map.Entry<String,T>> ?

2) for this bit of code...

@@ -183,7 +185,7 @@
 Iterator iter = eset.iterator();
 while (iter.hasNext()) {
   Map.Entry entry = (Map.Entry)iter.next();
-  add(entry.getKey().toString(), entry.getValue());
+  add(entry.getKey().toString(), (T)entry.getValue());
 }
 return args.size()>0;
   }

...that's in addAll(Map) right? ... if we're genericizing NamedList with 
respect to T, then shouldn't the method sig change to addAll(Map<?,T>) ... 
which would eliminate the need for the cast right?

3) there's an addAll(NamedList) too isn't there? .. shouldn't that method 
change to addAll(NamedList<T>) as well?


(I think all of those would still work in the current code base using the 
generics default of Object for unspecified templates)
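
The three suggestions above can be tried out on a simplified stand-in. `SimpleNamedList` below is a hypothetical class, not Solr's `NamedList`: it only keeps ordered name/value pairs. It iterates as `Map.Entry<String,T>` and types both `addAll` overloads so that no cast is needed. It uses `? extends T` where the comment writes `T`, which is slightly more permissive and is my choice rather than something taken from the patch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified stand-in for NamedList: ordered name/value pairs, duplicate names allowed.
class SimpleNamedList<T> implements Iterable<Map.Entry<String, T>> {
    private final List<Map.Entry<String, T>> entries = new ArrayList<>();

    void add(String name, T value) {
        entries.add(new AbstractMap.SimpleImmutableEntry<>(name, value));
    }

    // Because the map's value type is bounded by T, no (T) cast is required.
    boolean addAll(Map<?, ? extends T> args) {
        for (Map.Entry<?, ? extends T> entry : args.entrySet()) {
            add(entry.getKey().toString(), entry.getValue());
        }
        return args.size() > 0;
    }

    // Same idea for copying from another list of a compatible value type.
    boolean addAll(SimpleNamedList<? extends T> other) {
        for (Map.Entry<String, ? extends T> entry : other) {
            add(entry.getKey(), entry.getValue());
        }
        return other.size() > 0;
    }

    int size() {
        return entries.size();
    }

    @Override
    public Iterator<Map.Entry<String, T>> iterator() {
        return entries.iterator();
    }
}
```

Code written against the raw type still compiles (with unchecked warnings), which matches the observation that existing callers would keep working.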


 Iterable NamedList with java5 generics
 --

 Key: SOLR-107
 URL: https://issues.apache.org/jira/browse/SOLR-107
 Project: Solr
  Issue Type: Improvement
Reporter: Ryan McKinley
Priority: Trivial
 Attachments: IterableNamedList.patch


 Iterators and generics are nice!
 this patch adds both to NamedList.java





Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-15 Thread Bertrand Delacretaz

On 1/16/07, Chris Hostetter [EMAIL PROTECTED] wrote:


  interface SolrRequestParser {
 SolrRequest process( HttpServletRequest req );
  }

(the trick being that the servlet would need to parse the st info out
of the URL (either from the path or from the QueryString) directly without
using any of the HttpServletRequest.getParameter*() methods...


I haven't followed all of the discussion, but wouldn't it be easier to
use the request path, instead of parameters, to select these
RequestParsers?

i.e. solr/update/pdf-parser, solr/update/hssf-parser,
solr/update/my-custom-parser, etc.

-Bertrand
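
As a rough illustration of selecting a parser from the request path, the sketch below maps the path segment after a fixed prefix to a registered parser. `RequestParser` and `ParserRouter` are hypothetical names (the thread's `SolrRequestParser` takes an `HttpServletRequest`, which is left out here to keep the example free of servlet dependencies), and the `/solr/update/` prefix is only assumed from the example paths above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the SolrRequestParser discussed in the thread.
interface RequestParser {
    String name();
}

class ParserRouter {
    private final Map<String, RequestParser> bySegment = new LinkedHashMap<>();
    private final String prefix;

    ParserRouter(String prefix) {
        this.prefix = prefix;
    }

    void register(String segment, RequestParser parser) {
        bySegment.put(segment, parser);
    }

    // Returns the parser registered for the first path segment after the prefix,
    // or null when the path does not start with the prefix or nothing matches.
    RequestParser select(String requestPath) {
        if (requestPath == null || !requestPath.startsWith(prefix)) {
            return null;
        }
        String rest = requestPath.substring(prefix.length());
        int slash = rest.indexOf('/');
        String segment = slash < 0 ? rest : rest.substring(0, slash);
        return bySegment.get(segment);
    }
}
```

Because the choice is made from the path alone, the servlet never has to touch the request parameters before a parser is picked, which is the constraint raised in the quoted message.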