Re: Solr SRW Service

2006-11-21 Thread Ian Ibbotson
Thanks for the responses, couple of follow-ups

 Why do you need Axis for this?

Well, you certainly don't for the SRU implementation, but for SRW I'd
just say that (in all the SRW implementations I've done so far) it's a
case of the right tool for the right job. Of course we can hand-craft
the codecs and parse/produce the XML by hand. However, the SRU/SRW
community comes from a background of interoperability as a sacrosanct
requirement. Given that background, having something parse the WSDL and
produce your codecs for you gives people (me) a warm fuzzy feeling when
it comes to WS-I compliance. It also makes the release process much
easier when it comes to upgrading the protocol version: just pop a new
WSDL in the build tree and compile. Of course there are other reasons
too, but that's a starter for 10 :)

 Solr has some pluggable capability, detailed here:

Ah ok thanks for that. I've taken a quick look and I'm trying to figure
out how we might be able to expose extra features, like the ability to
request results be returned in different schemas. I'll keep at it tho
and check back if I have any questions.

Cheers,
Ian.


On Mon, 2006-11-20 at 16:35 -0500, Erik Hatcher wrote:
 On Nov 20, 2006, at 2:15 PM, Ian Ibbotson wrote:
  Hiya all...
 
  I'm interested in the possibility of contributing an SRW/SRU web services
  interface/module to Solr (see http://www.loc.gov/standards/sru/).
  SRW/SRU is the web service definition which is often used alongside or
  instead of the more traditional Z39.50 protocol for cross/meta
  searching. A Solr SRW/SRU interface would enable meta-search engines to
  transparently include Solr repository search results by only configuring
  the base URL of the service. I've already got much of the code needed
  (i.e., CQL to Lucene query rewriters and code to generate
  the right stubs using Axis etc). Actually, I might be up for creating a
  Z39.50 module too if anyone is interested?
 
 
 Why do you need Axis for this?
 
  So my first question really would be... Is anyone out there already
  working on such a beast? If so, do you need any help? Seems pointless to
  create a second add-on. I've searched the lists (not in any great depth
  tho) and can't see any references to SRW/Z39.50. Assuming nobody is, I've
  got some follow-up questions about the best way to package up what might
  be add-on modules... is this list the right place to ask?
 
 Solr has some pluggable capability, detailed here:
 
   http://wiki.apache.org/solr/SolrPlugins
 
 You can simply create your code, which I presume would entail a  
 SolrRequestHandler and a QueryResponseWriter, and distribute it as a  
 JAR that others could just drop in and run with it.
 
   Erik
 



Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Walter Underwood
On 11/20/06 7:22 PM, Fuad Efendi [EMAIL PROTECTED] wrote:
 This is just a sample...
 
 1. What is an Error?
 2. What is a Mistake?
 3. What is an application bug?
 4. What is a 'system crash'?

These are not HTTP concepts. The request on a URI can succeed or fail
or result in other codes. Mistakes and crashes are outside of the HTTP
protocol.

 Of course, an XML-over-HTTP engine is not the same as HTML-over-HTTP...
 However... Walter noticed 'crawling'... I can't imagine a company which will
 put SOLR as a front-end accessible to crawlers... (To crawl an indexing
 service instead of source documents!?)

XML-over-HTTP is exactly the same as HTML-over-HTTP. In HTML, we
could return detailed error information in a meta tag. No difference.

If something is on HTTP, a good crawler can find it. All it takes is
one link, probably to the admin URL. Once found, that crawler will
happily pound on errors returned by 200.

XSLT support means you could build the search UI natively on Solr,
so that might happen.

Even without a crawler, we must work with caches and load balancers.
I will be using Solr with a load balancer in production. If Solr is
a broken HTTP server, we will have to build something else.

 I am sure that mixing an XML-based interface with HTTP status codes is not an
 attractive 'architecture'; we should separate concerns and leave HTTP code
 handling to a servlet container as much as possible...

We don't need to use HTTP response codes deep in Solr, but we do need
to separate bad parameters, retryable errors, non-retryable errors, and
so on. We can call them whatever we want internally, but we need to
report them properly over HTTP.

wunder
-- 
Walter Underwood
Search Guru, Netflix

 



Re: Phonetic Token Filter

2006-11-21 Thread Chris Hostetter

:  2. Should we have a Jira issue first?

this wiki should have all the info you need...

http://wiki.apache.org/solr/HowToContribute



-Hoss



Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Yonik Seeley

On 11/20/06, Walter Underwood [EMAIL PROTECTED] wrote:

Even without a crawler, we must work with caches and load balancers.
I will be using Solr with a load balancer in production. If Solr is
a broken HTTP server, we will have to build something else.


Agree.  Every instance of Solr in CNET that serves websites is behind
a load balancer.
I don't know the config details of the loadbalancers though, except
that part of it is the LB checking for the existence of a
server-enabled file.  That allows administrators to remove the file
and still bring up a Solr instance w/o live traffic hitting it.

Solr does nothing with this file except display "enabled" or "disabled".

From solrconfig.xml:

   <!-- configure a healthcheck file for servers behind a loadbalancer -->
   <healthcheck type="file">server-enabled</healthcheck>
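
The file-based check described above can be sketched in a few lines of Java. This is only an illustration, not Solr's actual implementation: the class and method names are hypothetical, and it assumes that the existence of the marker file is the only signal the load balancer looks at.

```java
import java.io.File;

/** Hypothetical sketch of a file-based healthcheck; not Solr's own code. */
final class HealthcheckSketch {
    private HealthcheckSketch() {}

    /**
     * Reports "enabled" when the marker file exists and "disabled" otherwise.
     * An administrator removes the file to take the server out of rotation
     * while the instance itself keeps running.
     */
    static String status(File markerFile) {
        return markerFile.exists() ? "enabled" : "disabled";
    }

    public static void main(String[] args) {
        // "server-enabled" is the file name used in the solrconfig.xml excerpt.
        System.out.println(status(new File("server-enabled")));
    }
}
```

A load balancer probe would then treat anything other than "enabled" as a reason to stop routing live traffic to that instance.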

-Yonik


Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Yonik Seeley

On 11/20/06, Chris Hostetter [EMAIL PROTECTED] wrote:

: Wow, I had completely forgotten that SolrException contained an HTTP
: status code.

Hmmm... actually, the javadocs for SolrException are a little vague on
the meaning of "code", and there are at least a few places where it's set
to a value that is not a legal HTTP status code...


None of these cases actually bubble back to an HTTP response code.
Schema parsing is done at startup, and the update servlet always
returns 200 (with error in the XML response).

Perhaps the update servlet should use HTTP error codes as well.

-Yonik


./src/java/org/apache/solr/schema/IndexSchema.java:  throw new SolrException(1,"Schema Parsing Failed",e,false);
./src/java/org/apache/solr/schema/IndexSchema.java:  throw new SolrException(1,"analyzer without class or tokenizer & filter list");
./src/java/org/apache/solr/schema/IndexSchema.java:  throw new SolrException(1,"TokenizerFactory must be specified first in analyzer");
./src/java/org/apache/solr/schema/IndexSchema.java:  throw new SolrException(1,"undefined field "+fieldName);
./src/java/org/apache/solr/update/DirectUpdateHandler.java:  if (idField == null) throw new SolrException(2,"Operation requires schema to have a unique key field");
./src/java/org/apache/solr/update/DirectUpdateHandler.java:  if (idField == null) throw new SolrException(2,"Operation requires schema to have a unique key field");
./src/java/org/apache/solr/update/UpdateHandler.java:  throw new SolrException(1,"error parsing event listevers", e, false);
./src/java/org/apache/solr/update/UpdateHandler.java:  throw new SolrException(1,"error parsing event listeners", e, false);


Re: Phonetic Token Filter

2006-11-21 Thread Bertrand Delacretaz

On 11/21/06, Walter Underwood [EMAIL PROTECTED] wrote:

...It is worth a try. Most implementations of Double Metaphone are
well-commented, so you could change it for other languages...


Ok, I'll see if I find some time to test that, thanks for the clarifications!
-Bertrand


Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Walter Underwood
On 11/20/06 5:51 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 Now that I think about it though, one nice change would be to get rid
 of the long stack trace for 400 exceptions... it's not needed, right?

That is correct. A client error (400) should not be reported with a
server stack trace. --wunder
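
A minimal Java sketch of that rule follows. It is an assumption-laden illustration rather than Solr code: `ErrorBodySketch` and `errorBody` are hypothetical names, and the only policy it encodes is "4xx gets the message, everything else gets the full trace".

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical helper: chooses the error body based on the status class. */
final class ErrorBodySketch {
    private ErrorBodySketch() {}

    static String errorBody(int httpStatus, Throwable error) {
        // Client errors (4xx): the caller made the mistake, so the message
        // alone is enough and a server stack trace would only be noise.
        if (httpStatus >= 400 && httpStatus < 500) {
            return String.valueOf(error.getMessage());
        }
        // Anything else (typically 5xx): keep the trace for debugging.
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        error.printStackTrace(out);
        out.flush();
        return buffer.toString();
    }
}
```

For example, `errorBody(400, new RuntimeException("Missing queryString"))` yields just the message text, while the same exception reported with 500 would include the trace.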



Phonetic Token Filter

2006-11-21 Thread Walter Underwood
I've written a simple phonetic token filter (and factory) based
on the Double Metaphone implementation in Jakarta Codecs to
contribute. Three questions:

1. Does this sound like a generally useful addition?

2. Should we have a Jira issue first?

3. This adds a dependency on the codecs jar. How do we add that
to the distro?

The code is very simple, but I need to learn the contribution
process and build some tests, so this won't happen in one day.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Chris Hostetter

: /solr/select?q= is a tricky case. Three options:

...there's kind of a chicken/egg problem with this discussion ... the egg
being "what should the HTTP response look like in an 'error' situation?",
the chicken being "what is the internal API to allow a RequestHandler to
denote an 'error' situation?" ... talking about specific cases only gets us
so far, since those cases may not be errors in all RequestHandlers.

the problem gets even more complicated when you try to answer the
question: what should Solr do if an OutputWriter encounters an error? ...
we can't generate a valid JSON response denoting an error if the
JSONOutputWriter is failing :)

It might be wise to discuss the API/pseudo code for dealing with errors in
RequestHandlers and OutputWriters and then think about what kinds of
responses those would generate, rather than worrying too much about the
exact HTTP status codes first ... a big question to start off with would
be: should the RequestHandler know about HTTP status codes and be allowed
to set them explicitly, or should that level of detail be abstracted away?


-Hoss



Re: Phonetic Token Filter

2006-11-21 Thread Yonik Seeley

On 11/21/06, Walter Underwood [EMAIL PROTECTED] wrote:

I've written a simple phonetic token filter (and factory) based
on the Double Metaphone implementation in Jakarta Codecs to
contribute. Three questions:

1. Does this sound like a generally useful addition?


Definitely useful.
If it's generally applicable enough and lightweight enough then it
should go in the core.  Otherwise it could go in contrib (which we
don't really have yet, but we will when the need arises).

This sounds like it should probably go in the core.


2. Should we have a Jira issue first?


Yes, please.


3. This adds a dependency on the codecs jar. How do we add that
to the distro?


It would go in the lib directory if it ends up in Solr proper.

-Yonik


Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Walter Underwood
One way to think about this is to assume caches, proxies, and load balancing
in the HTTP path, then think about their behavior. A 500 response may make
the load balancer drop this server from the pool, for example. A 200 OK
can be cached, so temporary errors shouldn't be sent with that code.

On 11/20/06 10:51 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
 
 ...there's kind of a chicken/egg problem with this discussion ... the egg
 being what should the HTTP response look like in an 'error' situation
 the chicken being what is the internal API to allow a RequestHandler to
 denote an 'error' situation ... talking about specific cases only gets us
 so far since those cases may not be errors in all RequestHandlers.

We can get most of the benefit with a few kinds of errors: 400, 403, 404,
500, and 503. Roughly:

400 - error in the request, fix it and try again
403 - forbidden, don't try again
404 - not found, don't try again unless you think it is there now
500 - server error, don't try again
503 - server error, try again

These can be mapped from internal error types.
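
One way such a mapping could look in Java is sketched below. The enum and its constant names are hypothetical (nothing like this is claimed to exist in Solr); the status codes and retry hints are taken directly from the list above.

```java
/** Hypothetical internal error categories; these names are not part of Solr. */
enum ErrorKind {
    BAD_PARAMETER(400, false),  // error in the request: fix it, then try again
    FORBIDDEN(403, false),      // forbidden: don't try again
    NOT_FOUND(404, false),      // don't try again unless it may be there now
    NON_RETRYABLE(500, false),  // server error: don't try again
    RETRYABLE(503, true);       // server error: the same request may be retried

    /** Status code to report over HTTP for this kind of error. */
    final int httpStatus;
    /** True only when resending the unchanged request is reasonable. */
    final boolean retrySameRequest;

    ErrorKind(int httpStatus, boolean retrySameRequest) {
        this.httpStatus = httpStatus;
        this.retrySameRequest = retrySameRequest;
    }
}
```

With something like this, core code would only ever name an `ErrorKind`, and the servlet layer would be the single place that turns it into a status code.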

 the problem gets even more complicated when you try to answer the
 question: what should Solr do if an OutputWriter encounters an error? ...
 we can't generate a valid JSON response denoting an error if the
 JSONOutputWriter is failing :)

Write the response to a string before sending the headers. This can be
slower than writing the response out as it is computed, but the response
codes can be accurate. Also, it allows optimal buffering, so it might
scale better.

If you really want to handle failure in an error response, write that
to a string and if that fails, send a hard-coded string.
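
The buffering idea can be sketched as follows. All names here (`BufferedResponseSketch`, `BodyWriter`, `Result`, `LAST_RESORT`) are hypothetical, and using 500 for every writer failure is a simplifying assumption; the point is only that the status is chosen after the body has been fully rendered into memory.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical sketch: render to a string first, pick the status afterwards. */
final class BufferedResponseSketch {
    /** Stand-in for an output writer that may fail part-way through. */
    interface BodyWriter {
        void write(Writer out) throws IOException;
    }

    /** Status and body, both known before any header would be sent. */
    static final class Result {
        final int status;
        final String body;

        Result(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /** Hard-coded fallback used when even the error writer fails. */
    static final String LAST_RESORT = "Internal error";

    static Result render(BodyWriter normal, BodyWriter error) {
        try {
            return new Result(200, buffer(normal));
        } catch (IOException | RuntimeException e) {
            try {
                // The normal writer failed: try to render an error body instead.
                return new Result(500, buffer(error));
            } catch (IOException | RuntimeException e2) {
                return new Result(500, LAST_RESORT);
            }
        }
    }

    /** Runs the writer against an in-memory buffer; partial output is discarded on failure. */
    private static String buffer(BodyWriter writer) throws IOException {
        StringWriter out = new StringWriter();
        writer.write(out);
        return out.toString();
    }
}
```

The trade-off is the one noted above: the whole response sits in memory before the first byte goes out, in exchange for a status code that always matches the body.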

wunder
-- 
Walter Underwood
Search Guru, Netflix




RE: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Fuad Efendi

On the update side of things, I think it would be nice if one could
check the HTTP status code and if it's OK (200), don't bother XML
parsing the body.

Do you mean 304 'Not Modified'? Agreed, we should handle it in SOLR (it is
not SOAP, after all!); we should handle 'last modified', 'expiration', etc.

The HTTP specs, as Hoss pointed out, allow 4xx responses to carry
user-defined entities.

There is some HTTP stuff which we need to use anyway, but we should not use
HTTP codes in the core Java parts of the application. Some code is currently
tightly coupled to such stuff as
SC_BAD_REQUEST
SC_OK
SC_NOT_FOUND

This is part of JEE, and the existing design looks slightly outdated: we need to
decouple such 'nice' stuff:
} catch (SolrException e) {
  sendErr(e.code(), SolrException.toStr(e), request, response);
} 

We even _catch_ an Exception, and _rethrow_ it as 400/404 (this is also
'Exception', but in a different language)


 1. What is an Error?
 2. What is a Mistake?
 3. What is an application bug?
 4. What is a 'system crash'?

These are not HTTP concepts. The request on a URI can succeed or fail
or result in other codes. Mistakes and crashes are outside of the HTTP
protocol.

Yes, I tried to mention very generic concepts and to think about
'Exceptions' in Java SE, EE, SOLR, JSON, XML, HTTP. We are always extending
java.lang.Exception without any thinking, just following patterns from
thousands of guides. 

Please, have a look at 
http://www.mindview.net/Etc/Discussions/CheckedExceptions
And following discussion:
http://www.bruceeckel.com/Etc/Discussions/UnCheckedExceptionComments


Some authors suggest using unchecked exceptions. The try-catch-finally code
found in so many books is suitable only for very small applications
(usually the small samples from those books)...

Thanks



Re: Solr SRW Service

2006-11-21 Thread Erik Hatcher
Right, I was questioning the use of Axis for SRU, not for SRW - sorry  
I didn't make that clear.


Erik


On Nov 21, 2006, at 2:27 AM, Ian Ibbotson wrote:


Thanks for the responses, couple of follow-ups


Why do you need Axis for this?


Well, you certainly don't for the SRU implementation, but for SRW I'd
just say that (in all the SRW implementations I've done so far) it's a
case of the right tool for the right job. Of course we can hand-craft
the codecs and parse/produce the XML by hand. However, the SRU/SRW
community comes from a background of interoperability as a sacrosanct
requirement. Given that background, having something parse the WSDL and
produce your codecs for you gives people (me) a warm fuzzy feeling when
it comes to WS-I compliance. It also makes the release process much
easier when it comes to upgrading the protocol version: just pop a new
WSDL in the build tree and compile. Of course there are other reasons
too, but that's a starter for 10 :)


Solr has some pluggable capability, detailed here:


Ah ok thanks for that. I've taken a quick look and I'm trying to figure
out how we might be able to expose extra features, like the ability to
request results be returned in different schemas. I'll keep at it tho
and check back if I have any questions.

Cheers,
Ian.


On Mon, 2006-11-20 at 16:35 -0500, Erik Hatcher wrote:

On Nov 20, 2006, at 2:15 PM, Ian Ibbotson wrote:

Hiya all...

I'm interested in the possibility of contributing an SRW/SRU web services
interface/module to Solr (see http://www.loc.gov/standards/sru/).
SRW/SRU is the web service definition which is often used alongside or
instead of the more traditional Z39.50 protocol for cross/meta
searching. A Solr SRW/SRU interface would enable meta-search engines to
transparently include Solr repository search results by only configuring
the base URL of the service. I've already got much of the code needed
(i.e., CQL to Lucene query rewriters and code to generate
the right stubs using Axis etc). Actually, I might be up for creating a
Z39.50 module too if anyone is interested?



Why do you need Axis for this?


So my first question really would be... Is anyone out there already
working on such a beast? If so, do you need any help? Seems pointless to
create a second add-on. I've searched the lists (not in any great depth
tho) and can't see any references to SRW/Z39.50. Assuming nobody is, I've
got some follow-up questions about the best way to package up what might
be add-on modules... is this list the right place to ask?


Solr has some pluggable capability, detailed here:

http://wiki.apache.org/solr/SolrPlugins

You can simply create your code, which I presume would entail a
SolrRequestHandler and a QueryResponseWriter, and distribute it as a
JAR that others could just drop in and run with it.

Erik





Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Yonik Seeley

On 11/20/06, Fuad Efendi [EMAIL PROTECTED] wrote:

Here, we are passing 'Empty Query' error message with a full stack trace as
an entity body of HTTP 404 response.


It's actually returning 400:

$ curl -i http://localhost:8983/solr/select/
HTTP/1.1 400 Bad Request
Date: Tue, 21 Nov 2006 03:56:34 GMT
Server: Jetty/5.1.11RC0 (Windows XP/5.1 x86 java/1.5.0_09
Content-Type: text/plain; charset=UTF-8
Content-Length: 1377

org.apache.solr.core.SolrException: Missing queryString
   at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:105)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:587)
   at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)


Imagine that instead of 'Incorrect ZIP Code' we will see Java stack trace in
some web-sites...


As an aside, as I pointed out in an earlier message, it's debatable if
we should include a stack trace for user errors (as opposed to server
errors).  I guess it depends if it ever helps with debugging or not.

Anyway, the Solr interface isn't meant as a user GUI.  It's a back-end
system like a database.


I am sure that mixing an XML-based interface with HTTP status codes is not an
attractive 'architecture'; we should separate concerns and leave HTTP code
handling to a servlet container as much as possible...


That gets further away from REST. Not that Solr is purely REST, but
it's not web-services either... it's about being practical.

On the update side of things, I think it would be nice if one could
check the HTTP status code and if it's OK (200), don't bother XML
parsing the body.

-Yonik


Re: Solr SRW Service

2006-11-21 Thread Edward Summers

On Nov 20, 2006, at 2:15 PM, Ian Ibbotson wrote:

So my first question really would be... Is anyone out there already
working on such a beast? If so, do you need any help? Seems pointless to
create a second add-on. I've searched the lists (not in any great depth
tho) and can't see any references to SRW/Z39.50. Assuming nobody is, I've
got some follow-up questions about the best way to package up what might
be add-on modules... is this list the right place to ask?


I'm not working on it, but I know that a lot of people in the library  
technology community would find this to be very useful indeed.


The Extensible Text Framework [1] from the California Digital Library  
is similar to solr in that it provides a wrapper around lucene, and  
it has some experimental srw/sru support apparently [2]. It might be  
worthwhile chatting with them.


//Ed

[1] http://www.cdlib.org/inside/projects/xtf/
[2] http://xtf.sourceforge.net/WebDocs/HTML/XTF_Experimental_Features/XTFExperimental.html


Re: Solr SRW Service

2006-11-21 Thread Erik Hatcher


On Nov 20, 2006, at 2:15 PM, Ian Ibbotson wrote:

Hiya all...

I'm interested in the possibility of contributing an SRW/SRU web services
interface/module to Solr (see http://www.loc.gov/standards/sru/).
SRW/SRU is the web service definition which is often used alongside or
instead of the more traditional Z39.50 protocol for cross/meta
searching. A Solr SRW/SRU interface would enable meta-search engines to
transparently include Solr repository search results by only configuring
the base URL of the service. I've already got much of the code needed
(i.e., CQL to Lucene query rewriters and code to generate
the right stubs using Axis etc). Actually, I might be up for creating a
Z39.50 module too if anyone is interested?



Why do you need Axis for this?


So my first question really would be... Is anyone out there already
working on such a beast? If so, do you need any help? Seems pointless to
create a second add-on. I've searched the lists (not in any great depth
tho) and can't see any references to SRW/Z39.50. Assuming nobody is, I've
got some follow-up questions about the best way to package up what might
be add-on modules... is this list the right place to ask?


Solr has some pluggable capability, detailed here:

http://wiki.apache.org/solr/SolrPlugins

You can simply create your code, which I presume would entail a  
SolrRequestHandler and a QueryResponseWriter, and distribute it as a  
JAR that others could just drop in and run with it.


Erik



XML vs. JSON, Python, Ruby

2006-11-21 Thread Fuad Efendi
SOLR is a Web-Application with well-defined XML-based API:
- indexing service
- asynchronous; no need for 'real time' (content has well-defined TTL); can
use HTTP Caching for increased performance
- provides native support for XSL

The question: do we really need to maintain JSON/Ruby as a ServletOutput? We
can focus on the 'Public XML API' only, and provide samples of XSL-to-JSON,
XML-to-WML, etc.
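
To make the XSL-to-JSON suggestion concrete, here is a rough Java sketch using only the JDK's built-in XSLT support. The stylesheet, the class name, and the input shape (a `response/result` element with `numFound` and `doc/str[@name='id']` children) are assumptions made for illustration, and the stylesheet does no JSON string escaping, so it is a starting point rather than a complete converter.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sample: turns a small XML result into JSON text via XSLT. */
final class XslToJsonSketch {
    private XslToJsonSketch() {}

    // Illustrative stylesheet: emits the hit count and each document's "id".
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "{\"numFound\":<xsl:value-of select='/response/result/@numFound'/>,\"ids\":["
      + "<xsl:for-each select='/response/result/doc'>"
      + "<xsl:if test='position() &gt; 1'>,</xsl:if>"
      + "\"<xsl:value-of select=\"str[@name='id']\"/>\""
      + "</xsl:for-each>]}"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML and returns the text output. */
    static String toJson(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so that callers need not declare a checked exception.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Whether this is preferable to a dedicated JSON output writer is exactly the question raised above; the sketch only shows that the XML-plus-stylesheet route is feasible for simple cases.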