[jira] Commented: (SOLR-344) New Java API

Hoss Man (JIRA) Wed, 05 Sep 2007 14:54:07 -0700

    [ https://issues.apache.org/jira/browse/SOLR-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525231 ]


Hoss Man commented on SOLR-344:
-------------------------------

I've only had a chance to skim the attached PDF ... I've printed it out in the 
hopes that I'll find some time to read in depth your specific ideas about what 
the ideal Solr API should be; but there are a few things that jumped out at me 
that I wanted to address while they were on my mind...

-- Motivation --

- Direct Java is "better" -

A key assumption in this proposal seems to be that "if you are writing a Java 
app, and you want to use Solr, you should not use the HTTP interface".  I would 
argue strongly against this assumption.  There are *lots* of reasons why it 
makes sense to treat Solr as a webservice and interact with it over HTTP 
instead of having a tight coupling with your Java application: redundancy, load 
balancing, ...  Even if someone had a situation where they only had one machine 
in their entire operation, and all of their applications ran on that machine, I 
would still suggest installing a servlet container and using Solr that way 
because it's likely they will have more than one application that will want to 
deal with their index.  Solr can make a lot of good optimizations and 
assumptions that go right out the window if you try to embed Solr in 2 
different apps reading and writing to the same physical index directory.

Even if compelling stats can be presented that the HTTP+XML/JSON overhead is in 
fact a bottleneck, I would still think that pursuing something like an RMI 
based client/server API in addition to the HTTP API would make more sense than 
encouraging people to use Solr directly in the JVM of their other applications.  
Even the Plugin model (for embedding your custom Java code into Solr) is 
something I only recommend in situations where it makes a lot of sense for that 
logic to be tied closely to the Solr or Lucene internals (ie: as part of the 
TokenStream, or dealing with the DocSets before they are cached, etc...).

The #1 "Value Add" that Solr has over Lucene is the Client/Server abstraction 
... there are certainly other value adds -- some small (like added 
TokenFilters) and some big (like the IndexSchema concept) -- and many of these 
could probably be refactored into the Lucene core (or a Lucene contrib) so they 
could be reused by other Lucene applications in addition to Solr ... but Solr 
*is* an application.

Arguing that you shouldn't bother using a client/server relationship to deal 
with Solr if your application is written in Java is like arguing that you 
shouldn't bother using a client/server relationship to deal with MySQL if your 
application is written in C.

- Demand for direct access -

The statement "a significant proportion of questions on the mailing lists are 
clearly from people who are attempting such integrations right now." does not 
serve as a clear call to action ... even if a significant number of recent 
questions have related to embedded Solr (and I'm not convinced the number is 
that significant) that one data point alone does not clearly indicate that it 
is important/urgent to make this easier to do.  It just indicates that the 
people who are attempting to do this have questions about how to do it ... 
which isn't that surprising considering it's a relatively new concept that 
hasn't really been documented.   Some of these people may just be assuming that 
they *need* to embed Solr in their existing Java applications because they 
don't realize it's intended to be used as a server.

The [EMAIL PROTECTED] list gets lots of questions from people who misunderstand 
the demo code in the Lucene distribution and think Lucene is an application 
that they can run on the command line to index files and search them -- that 
doesn't mean that the Lucene-Java project should revamp itself to focus on 
producing an application instead of a Library, it means the Lucene-Java 
community has to help educate users about: A) how they can use the Lucene 
library to build their own apps; and B) what apps are built on top of the 
Lucene library that might be useful to them.

I think it would probably be more beneficial for the community as a whole if 
people spent more time/energy documenting the benefits/mechanisms of using Solr 
as a server, or improving the client APIs to make communicating with a Solr 
server faster/easier than it would to dedicate a lot of resources solely 
towards making Solr more of a library and less of an application.


-- Strategy for making changes --

All that said -- I agree with you that a lot of improvements can and should be 
made to the internal APIs.  Not because I think we need to make it easier to 
embed Solr, but to make it easier for new developers to work on the Solr 
internals (or to write plugins).  If embedding Solr gets easier as a result -- 
great, but I don't see that as a compelling reason for change.

Somewhere in your doc, you advocated the importance of a top down complete API 
overhaul instead of approaching things piecemeal (forgive me for not 
remembering exactly how you put it, I'm not trying to put words in your mouth, I 
just remember there being a sentiment like this) ... while I think it would 
definitely make sense to have some discussions on solr-dev about what the big 
problems are with the internal APIs and come up with a high level picture of 
what the ideal API might be so we can aim for it, the best way to get there is 
with small patches that each focus on a single area.

I say this from experience as someone who has submitted patches to projects, 
and as a committer who has to review patches:  Big patches that change a lot of 
things take a lot more work/discussion/thought to review and generally spend a 
lot longer sitting in Jira than shorter, more focused patches (some day I'll sit 
down and do the math and write out "Hoss's Patch Size Theorem" but for now 
take my word for it that there's an exponential factor in there somewhere).  
The best way to proceed is probably to start by tackling individual pieces of 
functionality, adding the API you think there should be, and refactoring the 
current code to implement/use that API (leaving the old one around as 
deprecated).
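
To make that concrete, here is a rough sketch of what one such small patch 
might look like.  Every name in it (HighlightOptions, HighlightUtil, describe) 
is made up for illustration and is not an existing Solr class or method; the 
"hl.fl" and "hl.snippets" keys are only used as example param names.  The new 
typed entry point is added, and the old loosely typed one stays around as a 
deprecated wrapper that delegates to it:

```java
import java.util.Map;

// Hypothetical typed options object; not a real Solr class.
final class HighlightOptions {
    final String field;
    final int snippets;

    HighlightOptions(String field, int snippets) {
        this.field = field;
        this.snippets = snippets;
    }
}

// Hypothetical utility used only to illustrate the migration pattern.
final class HighlightUtil {

    // Step 1: add the API you think there should be.
    static String describe(HighlightOptions opts) {
        return "highlight " + opts.field + " x" + opts.snippets;
    }

    // Step 2: keep the old entry point, mark it deprecated, and have it
    // delegate to the new one so existing callers keep working.
    /** @deprecated use {@link #describe(HighlightOptions)} instead. */
    @Deprecated
    static String describe(Map<String, String> rawParams) {
        String field = rawParams.getOrDefault("hl.fl", "text");
        int snippets = Integer.parseInt(rawParams.getOrDefault("hl.snippets", "1"));
        return describe(new HighlightOptions(field, snippets));
    }
}
```

A patch of this shape touches one utility, can be reviewed on its own, and 
lets callers migrate whenever they get around to it.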


-- Loose APIs vs tight APIs --

While I agree there are a lot of places where things like NamedList are 
overused, don't discount the value add that this kind of "pass through" API 
allows ... the decision to use things like the SolrParams class in some utility 
classes was made consciously in a lot of cases, in order to make it easier for 
these utilities to grow and evolve without their callers needing to be aware of 
these new changes ... SimpleFacets for example takes in a generic SolrParams 
and returns a NamedList so that as new functionality is added and new params 
are added to control that functionality existing request handlers don't have to 
be specifically aware of all those param names in order to get that 
functionality.  They can be if they want: they can construct a SolrParams 
instance just for driving SimpleFacets behavior instead of passing through the 
main request params, it's their choice ... but a very specific API, where every 
query param was mapped to a constructor arg or a setter method or a command 
pattern object or something else that had a tighter coupling would require 
changes in RequestHandlers anytime something like Date faceting was added (or 
even facet.mincount).
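
A minimal sketch of the pass-through idea, using plain Maps as stand-ins for 
SolrParams and NamedList (LooseFacets and PassThroughHandler are hypothetical 
names, not the real Solr classes, and the counting logic is deliberately 
trivial).  The handler hands the request params straight through, so it never 
has to name facet.mincount itself:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for a facet utility with a loose, pass-through API.
final class LooseFacets {

    // Reads only the params it understands and ignores the rest.
    // Support for a new param can be added here without touching callers.
    static Map<String, Integer> facetCounts(Map<String, String> params,
                                            Map<String, Integer> rawCounts) {
        int minCount = Integer.parseInt(params.getOrDefault("facet.mincount", "0"));
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : rawCounts.entrySet()) {
            if (e.getValue() >= minCount) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}

// Stand-in for a request handler that mentions no facet param by name.
final class PassThroughHandler {

    static Map<String, Object> handle(Map<String, String> requestParams,
                                      Map<String, Integer> rawCounts) {
        Map<String, Object> response = new LinkedHashMap<>();
        // The request params are passed through untouched.
        response.put("facet_counts", LooseFacets.facetCounts(requestParams, rawCounts));
        return response;
    }
}
```

With a tighter API (say, a setMinCount(int) setter on the utility), 
PassThroughHandler would have needed a code change the day facet.mincount was 
introduced.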

If I remember correctly, you pointed out on the mailing list that things like 
SimpleFacets or the Highlighting utils shouldn't return NamedLists -- they 
should return more specific FacetResults/HighlightResults objects ... I would 
definitely be on board with patches like that.   Refactoring the code to use a 
well typed response object certainly would make the code easier to understand, 
and new getters can always be added for pulling out new types of information as 
they are added -- the important thing is that Result objects like this would 
need to be 
able to translate themselves back into simple objects that can be understood by 
ResponseWriters so that the various RequestHandlers/ResponseWriters don't 
*need* to be aware of their details.
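
As a sketch of what I mean, assuming a hypothetical FacetResults class (the 
method names and the "facet_fields" key are illustrative, not an existing or 
proposed Solr API): Java callers get typed getters, while a generic writer only 
ever sees plain maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical typed result object; not a real Solr class.
final class FacetResults {

    private final Map<String, Map<String, Integer>> countsByField = new LinkedHashMap<>();

    void add(String field, String value, int count) {
        countsByField.computeIfAbsent(field, f -> new LinkedHashMap<>()).put(value, count);
    }

    // Typed getter for Java callers; returns 0 for unknown fields or values.
    int count(String field, String value) {
        Map<String, Integer> counts = countsByField.get(field);
        return counts == null ? 0 : counts.getOrDefault(value, 0);
    }

    // Translates the result back into plain maps so that a generic response
    // writer can serialize it without knowing anything about this class.
    Map<String, Object> toSimpleObject() {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : countsByField.entrySet()) {
            fields.put(e.getKey(), new LinkedHashMap<String, Integer>(e.getValue()));
        }
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("facet_fields", fields);
        return out;
    }
}
```

New getters can be added to a class like this over time, and as long as 
toSimpleObject() keeps emitting simple structures the writers are unaffected.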


> New Java API
> ------------
>
>                 Key: SOLR-344
>                 URL: https://issues.apache.org/jira/browse/SOLR-344
>             Project: Solr
>          Issue Type: Improvement
>          Components: clients - java, search, update
>    Affects Versions: 1.3
>            Reporter: Jonathan Woods
>         Attachments: New Java API for Solr.pdf
>
>
> The core Solr codebase urgently needs to expose a new Java API designed for 
> use by Java running in Solr's JVM and ultimately by core Solr code itself.  
> This API must be (i) object-oriented ('typesafe'), (ii) self-documenting, 
> (iii) at the right level of granularity, (iv) designed specifically to expose 
> the value which Solr adds over and above Lucene.
> This is an urgent issue for two reasons:
> - Java-Solr integrations represent a use-case which is nearly as important as 
> the core Solr use-case in which non-Java clients interact with Solr over HTTP
> - a significant proportion of questions on the mailing lists are clearly from 
> people who are attempting such integrations right now.
> This point in Solr development - some way out from the 1.3 release - might be 
> the right time to do the development and refactoring necessary to produce 
> this API.  We can do this without breaking any backward compatibility from 
> the point of view of XML/HTTP and JSON-like clients, and without altering the 
> core Solr algorithms which make it so efficient.  If we do this work now, we 
> can significantly speed up the spread of Solr.
> Eventually, this API should be part of core Solr code, not hived off into 
> some separate project nor in a non-first-class package space.  It should be 
> capable of forming the foundation of any new Solr development which doesn't 
> need to delve into low level constructs like DocSet and so on - and any new 
> development which does need to do just that should be a candidate for 
> incorporation into the API at some level.  Whether or not it will ever be 
> worth re-writing existing code is a matter of opinion; but the Java API 
> should be such that if it had existed before core plug-ins were written, it 
> would have been natural to use it when writing them.
> I've attached a PDF which makes the case for this API.  Apologies for 
> delivering it as an attachment, but I wanted to embed pics and a bit of 
> formatting.
> I'll update this issue in the next few days to give a prototype of this API 
> to suggest what it might look like at present.  This will build on the work 
> already done in Solrj and SearchComponents 
> (https://issues.apache.org/jira/browse/SOLR-281), and will be a patch on an 
> up-to-date revision of Solr trunk.
> [PS:
> 1.  Having written most of this, I then properly looked at 
> SearchComponents/SOLR-281 and read 
> http://www.nabble.com/forum/ViewPost.jtp?post=11050274&framed=y, which says 
> much the same thing albeit more quickly!  And weeks ago, too.  But this 
> proposal is angled slightly differently:
> - it focusses on the value of creating an API not only for internal Solr 
> consumption, but for local Java clients
> - it focusses on designing a Java API without constantly being hobbled by 
> HTTP-Java
> - it's suggesting that the SearchComponents work should result in a Java API 
> which can be used as much by third party Java as by ResponseBuilder.
> 2.  I've made some attempt to address Hoss's point 
> (http://www.nabble.com/search-components-%28plugins%29-tf3898040.html#6551097579454875774)
>  - that an API like this would need to maintain enough state e.g. to allow an 
> initial search to later be faceted, highlighted etc without going back to the 
> start each time - but clearly the proof of the pudding will be in the 
> prototype.
> 3.  Again, I've just discovered SOLR-212 (DirectSolrConnection).  I think all 
> my comments about Solrj apply to this, useful though it clearly is.]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
