Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Bertrand Delacretaz

On 1/16/07, Ryan McKinley [EMAIL PROTECTED] wrote:


...I think a DocumentParser registry is a good way to isolate this top level 
task...


With all this talk about plugins, registries etc., /me can't help
thinking that this would be a good time to introduce the Spring IoC
container to manage this stuff.

More info at http://www.springframework.org/docs/reference/beans.html
for people who are not familiar with it. It's very easy to use for
simple cases like the ones we're talking about.
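For illustration, here is a minimal sketch of what assembling a parser
registry through Spring's IoC container might look like. The
DocumentParser interface, bean ids and parsers.xml file are all
hypothetical, borrowed from the registry idea above, not anything that
exists in Solr:

  import java.util.Map;
  import org.springframework.context.support.ClassPathXmlApplicationContext;

  public class RegistryBootstrap {

      /** Hypothetical plugin interface from the thread. */
      public interface DocumentParser {}

      public static void main(String[] args) {
          // A parsers.xml on the classpath would declare the plugins, e.g.:
          //   <bean id="pdfParser" class="example.PdfParser"/>
          //   <bean id="xmlParser" class="example.XmlParser"/>
          // Spring instantiates and wires them; Solr only looks them up.
          ClassPathXmlApplicationContext ctx =
              new ClassPathXmlApplicationContext("parsers.xml");
          // Collect every DocumentParser bean into a registry keyed by bean id.
          Map<String, DocumentParser> registry =
              ctx.getBeansOfType(DocumentParser.class);
          System.out.println("registered parsers: " + registry.keySet());
          ctx.close();
      }
  }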

-Bertrand


[jira] Commented: (SOLR-69) PATCH:MoreLikeThis support

2007-01-16 Thread Bertrand Delacretaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465092
 ] 

Bertrand Delacretaz commented on SOLR-69:
-

SOLR-69.patch updated

 PATCH:MoreLikeThis support
 --

 Key: SOLR-69
 URL: https://issues.apache.org/jira/browse/SOLR-69
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Bertrand Delacretaz
Priority: Minor
 Attachments: lucene-queries-2.0.0.jar, SOLR-69.patch, SOLR-69.patch


 Here's a patch that implements simple support of Lucene's MoreLikeThis class.
 The MoreLikeThisHelper code is heavily based on (hmm... "lifted from" might be
 more appropriate ;-) Erik Hatcher's example mentioned in
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg00878.html
 To use it, add at least the following parameters to a standard or dismax 
 query:
   mlt=true
   mlt.fl=list,of,fields,which,define,similarity
 See the MoreLikeThisHelper source code for more parameters.
 Here are two URLs that work with the example config, after loading all 
 documents found in exampledocs in the index (just to show that it seems to 
 work - of course you need a larger corpus to make it interesting):
 http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mindf=1&fl=id,score
 http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mindf=1&fl=id,score
 Results are added to the output like this:
 <response>
   ...
   <lst name="moreLikeThis">
     <result name="UTF8TEST" numFound="1" start="0" maxScore="1.5293242">
       <doc>
         <float name="score">1.5293242</float>
         <str name="id">SOLR1000</str>
       </doc>
     </result>
     <result name="SOLR1000" numFound="1" start="0" maxScore="1.5293242">
       <doc>
         <float name="score">1.5293242</float>
         <str name="id">UTF8TEST</str>
       </doc>
     </result>
   </lst>
 </response>
 I haven't tested this extensively yet, will do in the next few days. But 
 comments are welcome of course.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (SOLR-110) Factor out common code in our SolrRequestHandler classes

2007-01-16 Thread Bertrand Delacretaz (JIRA)
Factor out common code in our SolrRequestHandler classes


 Key: SOLR-110
 URL: https://issues.apache.org/jira/browse/SOLR-110
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Bertrand Delacretaz


DisMaxRequestHandler and StandardRequestHandler are similar enough to warrant a 
common base class, or helper classes to factor out common code.

I don't have the time (or courage ;-) to do that right now, but it should be 
done to save time when implementing features that impact both classes.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-69) PATCH:MoreLikeThis support

2007-01-16 Thread Bertrand Delacretaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465114
 ] 

Bertrand Delacretaz commented on SOLR-69:
-

The method used to compute includeScore in MoreLikeThisHelper was inconsistent 
with what the XmlWriter does. 

I have changed it to take this info from SolrQueryResponse.getReturnFields().

The md5 sum of the current SOLR-69 patch is b6178d11d33f19b296b741a67df00d45

With this change, all the following requests should work (standard and dismax 
handlers, with no fl param, id only and id + score as return fields):

http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1

http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id

http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score

http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1

http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id

http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score

 PATCH:MoreLikeThis support
 --

 Key: SOLR-69
 URL: https://issues.apache.org/jira/browse/SOLR-69
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Bertrand Delacretaz
Priority: Minor
 Attachments: lucene-queries-2.0.0.jar, SOLR-69.patch, SOLR-69.patch, 
 SOLR-69.patch


 Here's a patch that implements simple support of Lucene's MoreLikeThis class.
 The MoreLikeThisHelper code is heavily based on (hmm... "lifted from" might be
 more appropriate ;-) Erik Hatcher's example mentioned in
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg00878.html
 To use it, add at least the following parameters to a standard or dismax 
 query:
   mlt=true
   mlt.fl=list,of,fields,which,define,similarity
 See the MoreLikeThisHelper source code for more parameters.
 Here are two URLs that work with the example config, after loading all 
 documents found in exampledocs in the index (just to show that it seems to 
 work - of course you need a larger corpus to make it interesting):
 http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=standard&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mindf=1&fl=id,score
 http://localhost:8983/solr/select/?stylesheet=&q=apache&qt=dismax&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mindf=1&fl=id,score
 Results are added to the output like this:
 <response>
   ...
   <lst name="moreLikeThis">
     <result name="UTF8TEST" numFound="1" start="0" maxScore="1.5293242">
       <doc>
         <float name="score">1.5293242</float>
         <str name="id">SOLR1000</str>
       </doc>
     </result>
     <result name="SOLR1000" numFound="1" start="0" maxScore="1.5293242">
       <doc>
         <float name="score">1.5293242</float>
         <str name="id">UTF8TEST</str>
       </doc>
     </result>
   </lst>
 </response>
 I haven't tested this extensively yet, will do in the next few days. But 
 comments are welcome of course.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




query to pull all document ids in an index

2007-01-16 Thread Edward Summers
I'm new to solr (working on solrb with Erik). We have some functional  
tests that run against a live solr instance, and I'd like the tests  
to periodically remove all the documents from the index. This way  
tests will have a predictable outcome that is independent of the
state of the index before the test.


I was thinking I could do a query that pulls back all the document  
ids in the index, and then delete each one...but I'm not quite sure  
how I could perform such a select. Does anyone have any ideas?


//Ed


Re: query to pull all document ids in an index

2007-01-16 Thread Bertrand Delacretaz

On 1/16/07, Edward Summers [EMAIL PROTECTED] wrote:


I was thinking I could do a query that pulls back all the document
ids in the index, and then delete each one...


The delete by query feature will do this without requiring an
iteration on the client side, see
http://incubator.apache.org/solr/tutorial.html#Deleting+Data
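
For example, a minimal Java sketch of wiping the index this way (it
assumes the stock example port, and that the query parser accepts the
match-all query *:*; a range query like id:[* TO *] would do as well):

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class WipeIndex {
      public static void main(String[] args) throws Exception {
          // Delete every document, then commit so searchers see the change.
          post("<delete><query>*:*</query></delete>");
          post("<commit/>");
      }

      static void post(String body) throws Exception {
          URL url = new URL("http://localhost:8983/solr/update"); // assumed endpoint
          HttpURLConnection con = (HttpURLConnection) url.openConnection();
          con.setDoOutput(true);
          con.setRequestMethod("POST");
          con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
          OutputStream out = con.getOutputStream();
          out.write(body.getBytes("UTF-8"));
          out.close();
          System.out.println(body + " -> HTTP " + con.getResponseCode());
      }
  }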

-Bertrand


SOLR-67 query interface

2007-01-16 Thread rlawson

I'm new to SOLR and would like to contribute. I think my skills would best
lend themselves to helping with a nice query interface. I'm a java web dev
by profession (a couple of the sites/companies I have worked with are below):

www.ptplace.com
www.colinx.com
www.getlocalbiz.com
www.kemperinvestors.com (don't blame me, client wanted it that way)

Is someone else working on this already? How can I help?

Thanks,
Rick
-- 
View this message in context: 
http://www.nabble.com/SOLR-67-query-interface-tf3020838.html#a8389856
Sent from the Solr - Dev mailing list archive at Nabble.com.



Re: SOLR-67 query interface

2007-01-16 Thread Bertrand Delacretaz

On 1/16/07, rlawson [EMAIL PROTECTED] wrote:


...I'm new to SOLR and would like to contribute. I think my skills would best
lend themselves to helping with a nice query interface. I'm a java web dev
by profession...


If you mean graphic design of the admin webpages, there are two issues
about this currently:

http://issues.apache.org/jira/browse/SOLR-84
http://issues.apache.org/jira/browse/SOLR-76

Your opinions and contributions are of course welcome!


...www.kemperinvestors.com (don't blame me, client wanted it that way)...


ouch ;-)

-Bertrand


Can this be achieved? (Was: document support for file system crawling)

2007-01-16 Thread Eivind Hasle Amundsen
First: Please pardon the cross-post to solr-user for reference. I hope 
to continue this thread in solr-dev. Please answer to solr-dev.



1) more documentation (and possibly some locking configuration options) on
how you can use Solr to access an index generated by the nutch crawler (i
think Thorsten has already done this) or by Compass, or any other system
that builds a Lucene index.


Thorsten Scherler? Is this code available anywhere? Sounds very 
interesting to me. Maybe someone could elaborate on the differences
between the indexes created by Nutch/Solr/Compass/etc., or point me in 
the direction of an answer?



2) contrib code that runs as its own process to crawl documents and
send them to a Solr server. (maybe it parses them, or maybe it relies on
the next item...)


Do you know FAST? It uses a step-by-step approach (a pipeline) in which
all of these tasks are done. Much of it is tuned in an easy web tool.


The point I'm trying to make is that contrib code is nice, but a 
complete package with these possibilities could broaden Solr's appeal 
somewhat.



3) Stock update plugins that can each read a raw input stream of some
widely used file format (PDF, RDF, HTML, XML of any schema) and have
configuration options telling them what fields in the schema each
part of their document type should go in.


Exactly, this sounds more like it. But if similar input streams can be
handled by Nutch, what's the point in using Solr at all? The HTTP APIs?
 In other words, both Nutch and Solr seem to have functionality that 
enterprises would want. But neither gives you the total solution.


Don't get me wrong, I don't want to bloat the products, even though it
would be nice to have a crossover solution which is easy to set up.


The architecture could look something like this:

Connector -> Parser -> DocProc -> (via schema) -> Index

Possible connectors: JDBC, filesystem, crawler, manual feed
Possible parsers: PDF, whatever

Both connectors, parsers AND the document processors would be plugins.
The DocProcs would typically be adjusted for each enterprise's needs, so
that they fit with their schema.xml.
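
As a thought experiment, the plugin seams could look something like the
following Java sketch. None of these types exist in Solr or Nutch; field
maps stand in for real documents:

  import java.io.InputStream;
  import java.util.Iterator;
  import java.util.Map;

  // JDBC, filesystem, crawler or manual feed would each implement this.
  interface Connector { Iterator<InputStream> fetch(); }

  // PDF, HTML, ... each parser turns raw bytes into named fields.
  interface Parser { Map<String, String> parse(InputStream raw); }

  // Per-deployment mapping of parsed fields onto the schema.xml fields.
  interface DocProc { Map<String, String> process(Map<String, String> fields); }

  // Receives the final schema-conformant document.
  interface Indexer { void add(Map<String, String> doc); }

  class Pipeline {
      private final Connector connector;
      private final Parser parser;
      private final DocProc docProc;

      Pipeline(Connector c, Parser p, DocProc d) {
          connector = c; parser = p; docProc = d;
      }

      // Drive every fetched document through parse and process, then index it.
      void run(Indexer indexer) {
          for (Iterator<InputStream> it = connector.fetch(); it.hasNext(); ) {
              indexer.add(docProc.process(parser.parse(it.next())));
          }
      }
  }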


Problem is, I haven't worked enough with Solr, Nutch, Lucene etc. to
really know all the possibilities and limitations. But I do believe that the
outlined architecture would be flexible and answer many needs. So the 
question is:


What is Solr missing? Could parts of Nutch be used in Solr to achieve 
this? How? Have I misunderstood completely? :)


Eivind


solr.solr.home - what do I have to do?

2007-01-16 Thread Yury A. Buharkin
Hello! I'm a novice in Lucene technologies, and right now I'm only trying to install
Solr.
The main problem appeared because I have to use the Sun Java (bla-bla) web
server as

the servlet container. So: who can explain to me what this phrase in solr's docs means -

Solr now looks in ./solr/conf for config, ./solr/data for data
configurable via solr.solr.home system property...
??

Is the system property really a <system-property ... /> tag in the web.xml file?
Or do I have to define some environment variable named solr.solr.home?
Or something else?
Sincerely yours,
Buharkin Y.A.
Moscow


Merging Results from Multiple Solr Instances

2007-01-16 Thread Sangraal Aiken
I have three instances of Solr on a single machine that I would like  
to query as if they were a single instance.


I was wondering if there's a facility, or if anyone has any  
recommendations, for searching across multiple instances with a  
single query, or merging the results of multiple instances into one  
result set.


-STA




Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Alan Burlison

Bertrand Delacretaz wrote:


With all this talk about plugins, registries etc., /me can't help
thinking that this would be a good time to introduce the Spring IoC
container to manage this stuff.

More info at http://www.springframework.org/docs/reference/beans.html
for people who are not familiar with it. It's very easy to use for
simple cases like the ones we're talking about.


Please, no.  I work on a big webapp that uses spring - it's a complete 
nightmare to figure out what's going on.


--
Alan Burlison
--


To Spring or not to Spring? (was: Update Plugins)

2007-01-16 Thread Bertrand Delacretaz

On 1/16/07, Alan Burlison [EMAIL PROTECTED] wrote:

Bertrand Delacretaz wrote:
.../me can't help
 thinking that this would be a good time to introduce the Spring IoC
 container to manage this stuff...



Please, no.  I work on a big webapp that uses spring - it's a complete
nightmare to figure out what's going on.


Using just the IoC container? I'm not talking about full-blown Spring
magic, *just* IoC to assemble plugins.

Spring's IoC is not complicated, and logging statements and debuggers
are there to find out exactly what's happening if needed.

I don't think it'd be more complicated than using our homegrown plugin
system. Only better tested, documented and more widely known.

-Bertrand


Re: To Spring or not to Spring? (was: Update Plugins)

2007-01-16 Thread Alan Burlison

Bertrand Delacretaz wrote:


Using just the IoC container? I'm not talking about full-blown Spring
magic, *just* IoC to assemble plugins.

Spring's IoC is not complicated, and logging statements and debuggers
are there to find out exactly what's happening if needed.

I don't think it'd be more complicated than using our homegrown plugin
system. Only better tested, documented and more widely known.


It just seems like a big hammer to crack a small nut.  I've had *bad* 
experiences with apps where people pulled in just about every framework, 
component and widget you can think of - to understand what the hell is 
going on you end up having to be an expert in all of them.


Yes, I'm probably just paranoid ;-)

--
Alan Burlison
--


Re: To Spring or not to Spring? (was: Update Plugins)

2007-01-16 Thread Bertrand Delacretaz

On 1/16/07, Alan Burlison [EMAIL PROTECTED] wrote:


...I've had *bad*
experiences with apps where people pulled in just about every framework,
component and widget you can think of...


That's what your previous message seemed to imply ;-)

I agree that, if we start using Spring (or another) IoC container, we
must be careful to use what actually helps us, and not let it become
our Code Dictator...

-Bertrand


Re: Can this be achieved? (Was: document support for file system crawling)

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 16:28 +0100, Eivind Hasle Amundsen wrote:
 First: Please pardon the cross-post to solr-user for reference. I hope 
 to continue this thread in solr-dev. Please answer to solr-dev.
 
  1) more documentation (and possibly some locking configuration options) on
  how you can use Solr to access an index generated by the nutch crawler (i
  think Thorsten has already done this) or by Compass, or any other system
  that builds a Lucene index.
 
 Thorsten Scherler? 

Hmm, I did the exact opposite. Let me explain my use case. I am
working on a part of a portal, http://andaluciajunta.es. The new version
of http://andaluciajunta.es/BOJA is this part. The current version is
based on a proprietary CMS in a dynamic environment.

The new development is using Apache Forrest to generate static html. Now,
coming to solr/nutch: at
http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html you can find
the current search engine, specifically for the BOJA. This will be changed to a
solr-powered solution.

Like I said, I am only doing one part of the portal, and the main portal has
a search engine as well: http://andaluciajunta.es/aj-sea-.html This
search engine will be based on nutch in the next version. The special
characteristic is that this main portal search engine has to search against
the solr-based BOJA index. Meaning Nutch will have to search the solr
index and not vice versa.

What I did before we decided to go with solr is a simple test. I copied
my solr index into a nutch instance and dispatched a couple of queries.
The only thing that you need is to keep your solr schema as close as
possible to the one nutch uses. For example, nutch uses "content",
"url" and "title" as default fields when returning the search result. If
you do not have these fields in your solr schema then nutch will return
null. 

 Is this code available anywhere? 

Like stated above, it is a couple of lines in the solr schema:
<field name="title" type="string" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="url" type="string" stored="true"/>

Then you just need to point your nutch instance to this index for
searching. 

The same is true (I guess) for solr searching a nutch index. You could
use nutch to update the index, point solr at the index, and it should
work (if you have defined all the fields in the schema).

 Sounds very 
 interesting to me. Maybe someone could ellaborate on the differences 
 between the indexes created by Nutch/Solr/Compass/etc., or point me in 
 the direction of an answer?
 

I am far from being an expert, but actually the only real difference I
see is the usage of field names. All indexes could be searched with a
raw lucene component (if they are based on the same lucene version).
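
For what it's worth, searching an arbitrary Lucene index directly might
look like this against the Lucene 2.0-era API; the index path and field
names are assumptions (point it at solr/data/index for a Solr index):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  public class RawSearch {
      public static void main(String[] args) throws Exception {
          IndexSearcher searcher = new IndexSearcher("/path/to/index");
          // Same default field that nutch expects: "content".
          Query q = new QueryParser("content", new StandardAnalyzer()).parse("apache");
          Hits hits = searcher.search(q);
          for (int i = 0; i < hits.length(); i++) {
              Document doc = hits.doc(i);
              System.out.println(doc.get("url") + " : " + doc.get("title"));
          }
          searcher.close();
      }
  }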

  2) contrib code that runs as its own process to crawl documents and
  send them to a Solr server. (maybe it parses them, or maybe it relies on
  the next item...)
 
 Do you know FAST? It uses a step-by-step approach (a pipeline) in which
 all of these tasks are done. Much of it is tuned in an easy web tool.
 
 The point I'm trying to make is that contrib code is nice, but a 
 complete package with these possibilities could broaden Solr's appeal 
 somewhat.

Hmm, I think like Hoss on this: why do we want to do the same work as
nutch? If you need a crawler, why not use the one from nutch and change
some lines? I actually use Forrest as a crawler when I generate the new
sites, which will then push the content to the solr server via a plugin
I developed: 
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/

 
  3) Stock update plugins that can each read a raw input stream of some
  widely used file format (PDF, RDF, HTML, XML of any schema) and have
  configuration options telling them what fields in the schema each
  part of their document type should go in.
 
 Exactly, this sounds more like it. But if similar input streams can be
 handled by Nutch, what's the point in using Solr at all? The HTTP APIs?
   In other words, both Nutch and Solr seem to have functionality that 
 enterprises would want. But neither gives you the total solution.
 

Not sure. I am using solr because I did not have to develop three
different nutch plugins to make it work. Further, I have punctual updates
where I push a certain set of documents to the server, so there is no need for a
crawler.

 Don't get me wrong, I don't want to bloat the products, even though it
 would be nice to have a crossover solution which is easy to set up.
 
 The architecture could look something like this:
 
 Connector -> Parser -> DocProc -> (via schema) -> Index
 
 Possible connectors: JDBC, filesystem, crawler, manual feed
 Possible parsers: PDF, whatever

 Both connectors, parsers AND the document processors would be plugins.
 The DocProcs would typically be adjusted for each enterprise's needs, so
 that they fit with their schema.xml.
 
 Problem is, I haven't worked enough with Solr, Nutch, Lucene etc. to
 really know all 

Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread J.J. Larrea
I'm in frantic deadline mode so I'm just going to throw in some (hopefully) 
short comments...

At 11:02 PM -0800 1/15/07, Ryan McKinley wrote:
the one thing that still seems missing is those micro-plugins i was
 [SNIP]

  interface SolrRequestParser {
    SolrRequest process( HttpServletRequest req );
  }



I left out micro-plugins because i don't quite have a good answer
yet :)  This may be a place where a custom dispatcher servlet/filter
defined in web.xml is the most appropriate solution.

If the issue is munging HTTPServletRequest information, then a proper 
separation of concerns suggests responsibility should lie with a Servlet 
Filter, as Ryan suggests.

For example, while the Servlet 2.4 spec doesn't have specifications for how the 
servlet container can/should burst a multipart-MIME payload into separate 
files or streams, there are a number of 3rd party Filters which do this.

The IteratorContentStream is a great idea because if each stream is read to 
completion before the next is opened it doesn't impose any limitation on 
individual stream length and doesn't require disk buffering.

(Of course some handlers may require access to more than one stream at a time; 
each time next() is called on the iterator before the current stream is closed, 
the remainder of that stream will have to be buffered in memory or on disk, 
depending on the part length.  Nonetheless that detail can be entirely hidden 
from the handler, as it should be.  I am not sure if any available 
ServletFilter implementations work this way, but it's certainly doable.)

But that detail is irrelevant for now; as I suggest below, using this API lets 
one immediately implement it with a single next() value of the entire POST stream;
that would answer the needs of the existing update request handling code, but 
establish an API to handle multi-part.  Whenever someone wants to write a 
multi-stream handler, they can write or find a better IteratorContentStream 
implementation, which would best be cast as a ServletFilter.

I like the SolrRequestParser suggestion.

Me too.  It fills a hole in my vision for how this can all fit together.

Consider:
qt='RequestHandler'
wt='ResponseWriter'
rp='RequestParser' (rb='SolrBuilder'?)

To avoid possible POST read-ahead stream munging: qt, wt, and rp
should be defined by the URL, not parameters.  (We can add special
logic to allow /query?qt=xxx)

For qt, I like J.J. Larrea's suggestion on SOLR-104 to let people
define arbitrary path mapping for qt.

We could append 'wt', 'rb', and arbitrary text to the
registered path, something like
 /registered/path/wt:json/rb:standard/more/stuff/in/the/path?params...

(any other syntax ideas?)

No need for new syntax, I think.  The pathInfo or qt or other source resolves 
to a requestHandler CONFIG name.  The handler config is read to determine the 
handler class name.  It also can be consulted (with URL or form-POST params 
overriding if allowed by the config) to decide which RequestParser to invoke
BEFORE IT IS CALLED and which ResponseWriter to invoke AFTER.  Once those 
objects are set up, the request body gets executed.

Handler config inheritance (as I proposed in SOLR-104 point #2) would greatly 
simplify, for example, creating a dozen query handlers which used a particular 
invariant combination of qt, wt, and rp.

The 'standard' RequestParser would:
GET:
  fill up SolrParams directly with req.getParameterMap()
  if there is a 'post' parameter (post=XXX)
    return a stream with XXX as its content
  else
    empty iterator.
Perhaps add a standard way to reference a remote URI stream.

POST:
 if( multipart ) {
  read all form fields into parameter map.

This should use the same req.getParameterMap as for GET, which Servlet 2.4 says
is supposed to be done automatically by the servlet container if the payload is
application/x-www-form-urlencoded; in that case the input stream should be null.

  return an iterator over the collection of files

Collection of streams, per Hoss.

}
else {
  no parameters? parse parameters from the URL? /name:value/
  return the body stream

As above, this introduces unneeded complexity and should be avoided.

}
DEL:
 throw unsupported exception?
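
Pulling the GET/POST halves above together, a minimal Java sketch of such
a 'standard' parser could look like this. ContentStream and
SolrRequestParser are the thread's proposed types, not committed Solr
code, and multipart handling is elided:

  import java.io.ByteArrayInputStream;
  import java.io.InputStream;
  import java.util.Collections;
  import java.util.Iterator;
  import javax.servlet.http.HttpServletRequest;

  interface ContentStream { InputStream getStream() throws Exception; }

  interface SolrRequestParser {
      Iterator<ContentStream> parse(HttpServletRequest req) throws Exception;
  }

  class StandardRequestParser implements SolrRequestParser {
      public Iterator<ContentStream> parse(final HttpServletRequest req) throws Exception {
          if ("GET".equalsIgnoreCase(req.getMethod())) {
              // Parameters come from req.getParameterMap(); the only possible
              // body is the content of a 'post' parameter.
              final String post = req.getParameter("post");
              if (post == null) {
                  return Collections.<ContentStream>emptyList().iterator();
              }
              ContentStream cs = new ContentStream() {
                  public InputStream getStream() throws Exception {
                      return new ByteArrayInputStream(post.getBytes("UTF-8"));
                  }
              };
              return Collections.singletonList(cs).iterator();
          }
          // POST: a single stream over the entire body; a multipart-aware
          // implementation would instead return one stream per part.
          ContentStream body = new ContentStream() {
              public InputStream getStream() throws Exception {
                  return req.getInputStream();
              }
          };
          return Collections.singletonList(body).iterator();
      }
  }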


Maybe each RequestHandler could have a default RequestParser.  If we
limited the 'arbitrary path' to one level, this could be used to
generate more RESTful URLs. Consider:

/myadder////

/myadder maps to MyCustomHandler and that gives you
MyCustomRequestBuilder that maps /// to SolrParams


I think these are best left for an extra-SOLR layer, especially since SOLR URLs 
are meant for interprogram communication and not direct use by non-developer 
end users.  For example, for my org's website I have hundreds of Apache 
mod_rewrite rules which do URL munging such as
/journals/abc/7/3/192a.pdf
into
/journalroot/index.cfm?journal=abc&volume=7&issue=3&page=192&seq=a&format=pdf

Or someone could custom-code a subclass of SolrServlet which 

Re: solr.solr.home- what I have to do?

2007-01-16 Thread Chris Hostetter

: Solr now looks in ./solr/conf for config, ./solr/data for data
:  configurable via solr.solr.home system property...
: ??
:
: Is system property is really system-property ... / tag in web.xml file?
: Or I have to define sime environment var with name solr.solr.home?

it's a system property that can be defined using whatever means your
servlet container lets you define system properties before loading web
applications ... i don't know much about the Sun Servlet Container, but
assuming it's pure java, and you have a shell script somewhere that starts
it like this...

java ... com.sun.SomeMainClass

you can pass system properties on the command line like this...

java -Dsolr.solr.home=/your/path ... com.sun.SomeMainClass


-Hoss



[jira] Commented: (SOLR-106) new facet params: facet.sort, facet.mincount, facet.offset

2007-01-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465254
 ] 

Yonik Seeley commented on SOLR-106:
---

Thanks for the info JJ... didn't see your update until after I committed this
(I'm running a bit behind on all the solr traffic :-)

"Case for Facet Count Caching: Paging through the hitlist"

Hmmm, yes that would be good for a more stateless client.  Even more efficient 
would be to recognize in the client that since you are only changing a page in 
the hitlist, the facets won't change (and hence don't re-query).

"It occurs to me that facet.limit should NOT do double-duty for paging:"

Or, it should *only* be used for paging, specifying the number to be returned.  
The BoundedTreeSet size and caching are an implementation detail and shouldn't 
be in the API unless really necessary.  If it matters in the future, we could 
add a hint specifying how much extra should be computed.
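
For illustration, a paged facet request under that reading of facet.limit
might look like this (the field name is hypothetical):

http://localhost:8983/solr/select?q=apache&facet=true&facet.field=cat&facet.sort=true&facet.mincount=1&facet.limit=10&facet.offset=20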

"Case for pulling response generation out of getFieldCacheCounts and
getFacetTermEnumCounts"

Sure, makes sense.  Don't view the current facet code as done... I have a 
*lot* of little ideas on how to make it better, esp for cases like faceting on 
author.

"TermFreqVectors"
Regarding this, do you have any performance data on it... my assumption was 
that it would be too slow for a large number of hits.  Perhaps still a good 
option to have if the number of hits is small and the fieldcache isn't an
option though.

"Just had an idea: It would be even nicer if the counting logic could be
passed some object,"
Yup, separating those things was on my todo list.




 new facet params: facet.sort, facet.mincount, facet.offset
 --

 Key: SOLR-106
 URL: https://issues.apache.org/jira/browse/SOLR-106
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Yonik Seeley
 Attachments: facet_params.patch


 a couple of new facet params:
 facet lists become pageable with facet.offset, facet.limit  (idea from Erik)
 facet.sort explicitly specifies sort order (true for count descending, false 
 for natural index order)
 facet.mincount: minimum count for facets included in response (idea from JJ, 
 deprecate zeros)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Yonik Seeley

On 1/15/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: The most important issue is to nail down the external HTTP interface.

I'm not sure if i agree with that statement .. i would think that figuring
out the model or how updates should be handled in a generic way, what
all of the Plugin types are, and what their APIs should be is the most
important issue -- once we have those issues settled we could always
write a new SolrServlet2 that made the URL structure work any way we
want.


The number of people writing update plugins will be small compared to
the number of users using the external HTTP API (the URL + query
parameters, and the relationship URL-wise between different update
formats).  My main concern is making *that* as nice and utilitarian as
possible, and any plugin stuff is implementation and a secondary
concern IMO.

-Yonik


[jira] Commented: (SOLR-86) [PATCH] standalone updater cli based on httpClient

2007-01-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465276
 ] 

Hoss Man commented on SOLR-86:
--

regarding Bertrand's comment, i'm not sure if there is any benefit in having 
this code and SOLR-20 share a common SolrUpdateClientInterface since this 
code will be dealing with pure streaming of UTF8 data, while SOLR-20 is focused 
on a better object abstraction for SolrDocuments ... i'm not sure what kinds of
methods such an interface might have.

regarding Thorsten's comment: yeah, i removed the directory support from your
patch while i was refactoring just because it was confusing me and i was trying
to keep things simple (i kept trying to run java -jar post.jar exampledocs/ and
it would fail because of the .svn directory)

that's no reason not to include it though since it's so simple.

 [PATCH]  standalone updater cli based on httpClient
 ---

 Key: SOLR-86
 URL: https://issues.apache.org/jira/browse/SOLR-86
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Thorsten Scherler
 Attachments: simple-post-using-urlconnection-approach.patch, 
 solr-86.diff, solr-86.diff


 We need a cross-platform replacement for post.sh.
 The attached code is a direct replacement for post.sh since it is actually
 doing the exact same thing.
 In the future one can extend the CLI with other features like auto commit,
 etc.
 Right now the code assumes that SOLR-85 is applied since we are using the servlet
 from that issue to actually do the update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (SOLR-111) new response classes and connection enhancements

2007-01-16 Thread Ed Summers (JIRA)
new response classes and connection enhancements


 Key: SOLR-111
 URL: https://issues.apache.org/jira/browse/SOLR-111
 Project: Solr
  Issue Type: Improvement
  Components: clients - ruby - flare
 Environment: Darwin frizz 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 
25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
Reporter: Ed Summers
 Attachments: response_connection_changes.diff

Similar to Solr::Request::*, a Solr::Response::* hierarchy was created, as well
as a Solr::Response::Base which has a factory method for creating the 
appropriate response based on the request type and the raw response.

Also added delete(), delete_by_query(), add(), update() and query() methods to 
Solr::Connection. This gets a bit closer to a DSL type of syntax which doesn't 
require the user to know the inner workings of solrb. I adjusted README 
accordingly.

Solr::Connection also operates with autocommit turned *on*, so commit() messages
are not required when doing add(), update(), delete() calls. It can be turned
off if the user doesn't want the extra HTTP traffic.

Added the ability to iterate over search results, although we still need to add the
ability to iterate over complete results, fetching data behind the scenes as
necessary.

Unit tests have been added and functional tests improved.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (SOLR-111) new response classes and connection enhancements

2007-01-16 Thread Ed Summers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Summers updated SOLR-111:


Attachment: response_connection_changes.diff

 new response classes and connection enhancements
 

 Key: SOLR-111
 URL: https://issues.apache.org/jira/browse/SOLR-111
 Project: Solr
  Issue Type: Improvement
  Components: clients - ruby - flare
 Environment: Darwin frizz 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 
 25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
Reporter: Ed Summers
 Attachments: response_connection_changes.diff


 Similar to Solr::Request::*, a Solr::Response::* hierarchy was created, as well
 as a Solr::Response::Base which has a factory method for creating the 
 appropriate response based on the request type and the raw response.
 Also added delete(), delete_by_query(), add(), update() and query() methods 
 to Solr::Connection. This gets a bit closer to a DSL type of syntax which 
 doesn't require the user to know the inner workings of solrb. I adjusted 
 README accordingly.
 Solr::Connection also operates with autocommit turned *on*, so commit()
 messages are not required when doing add(), update(), delete() calls. It can
 be turned off if the user doesn't want the extra HTTP traffic.
 Added the ability to iterate over search results, although we still need to add the
 ability to iterate over complete results, fetching data behind the scenes as
 necessary.
 Unit tests have been added and functional tests improved.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (SOLR-111) new response classes and connection enhancements

2007-01-16 Thread Ed Summers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Summers updated SOLR-111:


Attachment: response_connection_changes.diff

 new response classes and connection enhancements
 

 Key: SOLR-111
 URL: https://issues.apache.org/jira/browse/SOLR-111
 Project: Solr
  Issue Type: Improvement
  Components: clients - ruby - flare
 Environment: Darwin frizz 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 
 25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
Reporter: Ed Summers
 Attachments: response_connection_changes.diff, 
 response_connection_changes.diff


 Similar to Solr::Request::*, a Solr::Response::* hierarchy was created, as well
 as a Solr::Response::Base which has a factory method for creating the 
 appropriate response based on the request type and the raw response.
 Also added delete(), delete_by_query(), add(), update() and query() methods 
 to Solr::Connection. This gets a bit closer to a DSL type of syntax which 
 doesn't require the user to know the inner workings of solrb. I adjusted 
 README accordingly.
 Solr::Connection also operates with autocommit turned *on*, so commit()
 messages are not required when doing add(), update(), delete() calls. It can
 be turned off if the user doesn't want the extra HTTP traffic.
 Added the ability to iterate over search results, although we still need to add the
 ability to iterate over complete results, fetching data behind the scenes as
 necessary.
 Unit tests have been added and functional tests improved.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: svn patches and directories...

2007-01-16 Thread Ryan McKinley

On 1/16/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: Date: Sat, 13 Jan 2007 19:12:27 -0800 (PST)
: Subject: [jira] Commented: (SOLR-104) SQL Upload Plugin

: 2) download HandlerRefactoring.DRAFT.zip and extract the contents to:
: \solr\src\java\org\apache\solr\handler
:
: (svn patches don't let you add new directories!)

that shouldn't be true ... using the Linux SVN client you can
definitely svn add a directory and then generate a diff from it (even
using anonymous svn) ... you may want to double check the docs for your
SVN client on how to do the same thing on your platform



aaah.  I'm running TortoiseSVN on XP.

I ran 'svn add' on everything

When I create a patch using TortoiseSVN I get a message that says:
"You've selected added folders.  The patch won't contain added files
within such added folders.  Do you want to proceed anyway?"

But when you run from the command line, "svn diff > XXX.patch" seems to work ok.



(pure patches are always easier to deal with than patches+zips)



got it


Re: Can this be achieved? (Was: document support for file system crawling)

2007-01-16 Thread Eivind Hasle Amundsen

(...) http://andaluciajunta.es/aj-sea-.html This
search engine will be based on nutch in the next version. The special
characteristic is that this main portal search engine has to search against
the solr-based BOJA index. Meaning Nutch will have to search the solr
index and not vice versa.


Looks interesting, too bad I don't understand the language :) But I do 
get the idea.



<field name="title" type="string" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="url" type="string" stored="true"/>


This is valuable info to a newbie like me. Thanks a lot! It also makes
me think: why didn't they make Nutch more general? But I guess they
wanted consistency (and it's probably configurable in Nutch, hidden
somewhere, anyway).



Hmm, I think like Hoss on this: why do we want to do the same work as
nutch? If you need a crawler, why not use the one from nutch and change
some lines? I actually use Forrest as a crawler when I generate the new
sites, which will then push the content to the solr server via a plugin
I developed: 
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/


Nice one. I didn't know about Forrest, so thanks for the advice. My 
needs are actually not related to a certain site or application at 
all. I am here out of pure interest in Lucene/Solr/Nutch/etc., and the
search field in general (enterprise search in particular). Think of my needs as
more of R&D, if you'd like.


Ultimately I hope to be able to contribute, but don't know where to 
start (and how much time/resources I have).



Not sure. I am using solr because I did not have to develop three
different nutch plugins to make it work. Further, I have punctual updates
where I push a certain set of documents to the server, so there is no need for a
crawler.


My suggestion is independent of how often docs are indexed. Everything 
should be possible - manual feed, crawler, filesystem surveillance, 
database transaction reports - as long as this is kept separate, the limit
lies in one's imagination.


Problem is, I haven't worked enough with Solr, Nutch, Lucene etc. to
really know all the possibilities and limitations. But I do believe that the
outlined architecture would be flexible and answer many needs. 


Not sure.


Well I am thinking about a way to meet the same market as some 
commercial vendors. They should not and may not be copied, so don't get 
me wrong. But I do know something about this market, or at least I like 
to think so.



(...) I must say that interproject
collaboration is very hard to achieve.


I take your word for it :) I guess one way is to just code/create the 
damn thing, not talk about it like I do now. *dreaming*



Anyway, if you need a crawler but want to use solr, then see the crawling
code of nutch and write a standalone crawler that will update the solr
index.


Will do! Thanks for a full and wise reply.

Eivind


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Yonik Seeley

On 1/16/07, J.J. Larrea [EMAIL PROTECTED] wrote:

- Revise the XML-based update code (broken out of SolrCore into a 
RequestHandler) to use all the above.


+++1, that's been needed forever.
If one has the time, I'd also advocate moving to StAX (via woodstox
for Java5, but it's built into Java6).
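
For what it's worth, a minimal sketch of reading Solr's
<add><doc><field name=...>value</field></doc></add> update format with
StAX (javax.xml.stream) could look like this; the surrounding class is
illustrative only:

  import java.io.StringReader;
  import javax.xml.stream.XMLInputFactory;
  import javax.xml.stream.XMLStreamConstants;
  import javax.xml.stream.XMLStreamReader;

  public class StaxUpdateSketch {
      public static void main(String[] args) throws Exception {
          String xml = "<add><doc><field name=\"id\">SOLR1000</field></doc></add>";
          XMLStreamReader r = XMLInputFactory.newInstance()
                  .createXMLStreamReader(new StringReader(xml));
          String fieldName = null;
          while (r.hasNext()) {
              int event = r.next();
              if (event == XMLStreamConstants.START_ELEMENT
                      && "field".equals(r.getLocalName())) {
                  // Remember which field we are in; its text follows.
                  fieldName = r.getAttributeValue(null, "name");
              } else if (event == XMLStreamConstants.CHARACTERS && fieldName != null) {
                  System.out.println(fieldName + " = " + r.getText());
                  fieldName = null;
              }
          }
          r.close();
      }
  }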

-Yonik


[jira] Resolved: (SOLR-111) new response classes and connection enhancements

2007-01-16 Thread Erik Hatcher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Hatcher resolved SOLR-111.
---

Resolution: Fixed
  Assignee: Erik Hatcher

Applied, except tweaked autocommit to off by default.  Good stuff, Ed!

 new response classes and connection enhancements
 

 Key: SOLR-111
 URL: https://issues.apache.org/jira/browse/SOLR-111
 Project: Solr
  Issue Type: Improvement
  Components: clients - ruby - flare
 Environment: Darwin frizz 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 
 25 19:42:00 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_I386 i386 i386
Reporter: Ed Summers
 Assigned To: Erik Hatcher
 Attachments: response_connection_changes.diff, 
 response_connection_changes.diff


 Similar to Solr::Request::*, a Solr::Response::* hierarchy was created, as well
 as a Solr::Response::Base which has a factory method for creating the 
 appropriate response based on the request type and the raw response.
 Also added delete(), delete_by_query(), add(), update() and query() methods 
 to Solr::Connection. This gets a bit closer to a DSL type of syntax which 
 doesn't require the user to know the inner workings of solrb. I adjusted 
 README accordingly.
 Solr::Connection also operates with autocommit turned *on*, so commit()
 messages are not required when doing add(), update(), delete() calls. It can
 be turned off if the user doesn't want the extra HTTP traffic.
 Added the ability to iterate over search results, although we still need to add the
 ability to iterate over complete results, fetching data behind the scenes as
 necessary.
 Unit tests have been added and functional tests improved.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Java version for solr development (was Re: Update Plugins)

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 15:49 -0500, Yonik Seeley wrote:
 On 1/16/07, J.J. Larrea [EMAIL PROTECTED] wrote:
  - Revise the XML-based update code (broken out of SolrCore into a 
  RequestHandler) to use all the above.
 
 +++1, that's been needed forever.
 If one has the time, I'd also advocate moving to StAX (via woodstox
 for Java5, but it's built into Java6).

I was about to have a look at this. Seeing this comment makes me think.

I am on 1.5 ATM and using 
|-- stax-1.2.0-dev.jar
`-- stax-utils.jar

Two more dependencies. Setting the min version
  <!-- Java Version we are compatible with -->
  <property name="java.compat.version" value="1.6" />
would get rid of this.

Should I use 1.6 for a patch, or the above-mentioned libs?

wdyt?

salu2
-- 
thorsten

Together we stand, divided we fall! 
Hey you (Pink Floyd)





[jira] Commented: (SOLR-86) [PATCH] standalone updater cli based on httpClient

2007-01-16 Thread Thorsten Scherler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465327
 ] 

Thorsten Scherler commented on SOLR-86:
---

Yeah, I know what you mean (had a similar problem today). 

if (!file.isDirectory()) {
    tool.postFile(file, out);
}

should fix that. 

TIA



 [PATCH]  standalone updater cli based on httpClient
 ---

 Key: SOLR-86
 URL: https://issues.apache.org/jira/browse/SOLR-86
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Thorsten Scherler
 Attachments: simple-post-using-urlconnection-approach.patch, 
 solr-86.diff, solr-86.diff


 We need a cross-platform replacement for post.sh.
 The attached code is a direct replacement for post.sh since it is actually
 doing the exact same thing.
 In the future one can extend the CLI with other features like auto commit,
 etc.
 Right now the code assumes that SOLR-85 is applied since we are using the servlet
 from that issue to actually do the update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Java version for solr development (was Re: Update Plugins)

2007-01-16 Thread Yonik Seeley

On 1/16/07, Thorsten Scherler [EMAIL PROTECTED] wrote:

I am on 1.5 ATM and using
|-- stax-1.2.0-dev.jar
`-- stax-utils.jar


I don't know where those jars are from, but I guess one would need the
stax API jar, and the implementation (woodstox I would think) jar.
That's two jars instead of one, but they could go away with a move to Java6.
The API is likely to have a much longer lifetime too.


Two more dependencies. Setting the min version
 <!-- Java Version we are compatible with -->
  <property name="java.compat.version" value="1.6" />
would get rid of this.

Should I use 1.6 for a patch, or the above-mentioned libs?


I think it's a bit soon to move to 1.6 - I don't know how many
platforms it's available for yet.

-Yonik


[jira] Updated: (SOLR-107) Iterable NamedList with java5 generics

2007-01-16 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated SOLR-107:
---

Attachment: IterableNamedList.patch

 Iterable NamedList with java5 generics
 --

 Key: SOLR-107
 URL: https://issues.apache.org/jira/browse/SOLR-107
 Project: Solr
  Issue Type: Improvement
Reporter: Ryan McKinley
Priority: Trivial
 Attachments: IterableNamedList.patch, IterableNamedList.patch


 Iterators and generics are nice!
 this patch adds both to NamedList.java
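
For reference, a minimal sketch of what an iterable, generic NamedList
might look like; the structure below is a guess for illustration, not the
actual patch contents:

  import java.util.AbstractMap.SimpleEntry;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import java.util.Map;

  public class NamedList<T> implements Iterable<Map.Entry<String, T>> {
      // Parallel lists preserve insertion order and allow repeated names.
      private final List<String> names = new ArrayList<String>();
      private final List<T> values = new ArrayList<T>();

      public void add(String name, T value) {
          names.add(name);
          values.add(value);
      }

      public Iterator<Map.Entry<String, T>> iterator() {
          final Iterator<String> n = names.iterator();
          final Iterator<T> v = values.iterator();
          return new Iterator<Map.Entry<String, T>>() {
              public boolean hasNext() { return n.hasNext(); }
              public Map.Entry<String, T> next() {
                  return new SimpleEntry<String, T>(n.next(), v.next());
              }
              public void remove() { throw new UnsupportedOperationException(); }
          };
      }
  }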

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-107) Iterable NamedList with java5 generics

2007-01-16 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465357
 ] 

Ryan McKinley commented on SOLR-107:


updated patch for 1, 2, and 3


 Iterable NamedList with java5 generics
 --

 Key: SOLR-107
 URL: https://issues.apache.org/jira/browse/SOLR-107
 Project: Solr
  Issue Type: Improvement
Reporter: Ryan McKinley
Priority: Trivial
 Attachments: IterableNamedList.patch, IterableNamedList.patch


 Iterators and generics are nice!
 this patch adds both to NamedList.java

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Java version for solr development (was Re: Update Plugins)

2007-01-16 Thread Walter Underwood
On 1/16/07 8:03 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 I think it's a bit soon to move to 1.6 - I don't know how many
 platforms it's available for yet.

It is still in early release from IBM for their PowerPC
servers, so requiring 1.6 would be a serious problem for us.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Java version for solr development (was Re: Update Plugins)

2007-01-16 Thread Bertrand Delacretaz

On 1/17/07, Thorsten Scherler [EMAIL PROTECTED] wrote:


...Should I use 1.6 for a patch or above mentioned libs?...


IMHO moving to 1.6 is way too soon, and if it's only to save two jars
it's not worth it.

-Bertrand


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Chris Hostetter

: The number of people writing update plugins will be small compared to
: the number of users using the external HTTP API (the URL + query
: parameters, and the relationship URL-wise between different update
: formats).  My main concern is making *that* as nice and utilitarian as
: possible, and any plugin stuff is implementation and a secondary
: concern IMO.

Agreed, but my point was that we should try to design the internal APIs
independently of the URL structure ... if we have a set of APIs,
it's easy to come up with a URL structure that will map well (we could
theoretically have several URL structures using different servlets) but if
we worry too much about what the URL should look like, we may hamstring
the model design.


-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Ryan McKinley

kind of like a binary stream equivalent to the way analyzers
can be customized -- is that kind of what you had in mind?



exactly.



  interface SolrDocumentParser {
    public void init(NamedList args);
    Document parse(SolrParams p, ContentStream content);
  }




yes