[jira] Created: (SOLR-113) Some example + post.sh in docs in client/solrb

2007-01-19 Thread Antonio Eggberg (JIRA)
Some example + post.sh in docs in client/solrb
--

 Key: SOLR-113
 URL: https://issues.apache.org/jira/browse/SOLR-113
 Project: Solr
  Issue Type: Wish
  Components: clients - ruby - flare
 Environment: OSX 10.4
Reporter: Antonio Eggberg
Priority: Trivial


I tried Flare today -- really nice :-)

It would be nice to add some example docs for the Ruby/Flare client, like the
current Solr distro has.  If I understand correctly, the example docs in Solr
(i.e. /example/exampledocs) are not compatible with solrb.  Maybe I am doing
something wrong? If so, please clarify and delete the issue.  The issue is not
so important, but it would be good for the folks that are impatient.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Can this be achieved? (Was: document support for file system crawling)

2007-01-19 Thread Chris Hostetter
:  What I am really talking about, is this: There is a growing market for
:  simple search solutions that can work out of the box, and that can still
:  be customized. Something that:
:  - organizations can use on their network, out of the box

:  I am not looking to change Solr in that direction. But take a look at
:  Solr. Or Nutch. They are already built on Lucene and many other
:  projects. Why/not build something on top of this? Something more/else?
:
: I don't think that anyone is arguing that this product shouldn't exist
: in the open-source world, just that it shouldn't be part of Solr's
: mandate.  It sounds like a cool project (though the closer you get to

Exactly.

Eivind: earlier in this thread, you were talking about having more
crawling features and document parsing features built in to Solr, and
I got the impression that you didn't like the idea that they could be
loosely coupled external applications ... but if your interest is in
having an enterprise search solution that people can deploy on a box
and have it start working for them, then there is no reason for all of that
code to run in a single JVM using a single code base -- I'm going to go
out on a limb and guess that the Google Appliances run more than a
single process :)

Given a collection of loosely coupled pieces, including Solr,
including Nutch, including whatever future document parsing contribs might
be written for either Solr or Nutch ... you could bundle them all together
into an enterprise search system that, when installed, deployed them all and
coupled them together and had a GUI for configuring them ... but that
would be a separate project from Solr -- just as Solr and Nutch are
separate projects from Java-Lucene ... it's all about layers built on top
of layers that allow for reuse.


-Hoss



Re: Can this be achieved? (Was: document support for file system crawling)

2007-01-19 Thread Walter Underwood
On 1/19/07 10:33 AM, Chris Hostetter [EMAIL PROTECTED] wrote:

 [...] but if your interest is in
 having an enterprise search solution that people can deploy on a box
 and have it start working for them, then there is no reason for all of that
 code to run in a single JVM using a single code base -- I'm going to go
 out on a limb and guess that the Google Appliances run more than a
 single process :)

Ultraseek does exactly that and is a single multi-threaded process.
A single process is much easier for the admin. A multi-process solution
is more complicated to start up, monitor, shut down, and upgrade.

There is decent demand for a spidering enterprise search engine.
Look at the Google Appliance, Ultraseek, and IBM OmniFind. The
free IBM OmniFind Yahoo! Edition uses Lucene.

I'd love to see the Ultraseek spider connected to Solr, but that
depends on Autonomy.

wunder
-- 
Walter Underwood
Search Guru, Netflix




RE: separate log files

2007-01-19 Thread Chris Hostetter

: I'm running multiple instances of Solr, which are all loading from the same
: war file.  To log to separate files I implemented the following
: kludge.

Ben: I'm glad you managed to get your situation working, but did you try
the instructions on the Tomcat documentation page about configuring
separate loggers per context?  If it didn't work, did you try mailing the
Tomcat user list?

What you have here is definitely a kludge, as you say ... and not
something I would recommend in general.  For starters, it assumes there
will always be a logging.properties file; besides the possibility that it
won't be there, this also doesn't play nicely with the
possibility of someone using the
java.util.logging.config.file or java.util.logging.config.class properties
... not to mention the fact that servlet containers are totally within
their right to control logging programmatically using the public LogManager
APIs, based on configuration options from their own config files, well
before any applications are initialized ... and this approach would undo
any of that configuration -- which could break the servlet container's own
logs, not just the logging info from the individual webapps.
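
(For reference, the per-context setup the Tomcat 5.5 logging docs describe
looks roughly like the following -- just a sketch assuming Tomcat's JULI
FileHandler, with an illustrative prefix; one such logging.properties would
go in each webapp's WEB-INF/classes:

    handlers = org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

    org.apache.juli.FileHandler.level = INFO
    org.apache.juli.FileHandler.directory = ${catalina.base}/logs
    org.apache.juli.FileHandler.prefix = app1-solr.

    java.util.logging.ConsoleHandler.level = INFO

That way each instance gets its own app1-solr.<date>.log style file without
touching the LogManager at runtime.)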


: <!-- start SolrServlet.java.diff -->
: 23d22
: < import org.apache.solr.request.SolrQueryResponse;
: 24a24
: > import org.apache.solr.request.SolrQueryResponse;
: 33a34,36
: >
: > import java.io.ByteArrayInputStream;
: > import java.io.ByteArrayOutputStream;
: 34a38,39
: > import java.io.InputStream;
: > import java.io.OutputStream;
: 35a41,42
: > import java.util.Properties;
: > import java.util.logging.LogManager;
: 47a55,80
: >   /*
: >    * switch java.util.logging.Logger appenders
: >    *
: >    * Add the following to the web context file:
: >    * <Environment name="solr/log-prefix" type="java.lang.String"
: >    *              value="log-prefix." override="false" />
: >    */
: >   private void switchAppenders(String prefix) {
: >       String logParam = "org.apache.juli.FileHandler.prefix";
: >       log.info("switching appender to " + logParam + "=" + prefix);
: >       Properties props = new Properties();
: >       try {
: >           InputStream configStream =
: >               getClass().getResourceAsStream("/logging.properties");
: >           props.load(configStream);
: >           configStream.close();
: >           props.setProperty(logParam, prefix);
: >           ByteArrayOutputStream os = new ByteArrayOutputStream();
: >           props.store((OutputStream) os, "LOGGING PROPERTIES");
: >           LogManager.getLogManager().readConfiguration(
: >               new ByteArrayInputStream(os.toByteArray()));
: >           log.info("props: " + props.toString());
: >       }
: >       catch (Exception e) {
: >           String errMsg = "Error: Cannot load configuration file; Cause: "
: >               + e.getMessage();
: >           log.info(errMsg);
: >       }
: >   }
: >
: 48a82
: >
: 52c86,91
: <
: ---
: >
: >    // change the logging properties
: >    String prefix = (String) c.lookup("java:comp/env/solr/log-prefix");
: >    if (prefix != null)
: >        switchAppenders(prefix);
: >
: 64a104
: >
: <!-- end SolrServlet.java.diff -->
:
:
:  -Original Message-
:  From: Chris Hostetter [mailto:[EMAIL PROTECTED]
:  Sent: Wednesday, 17 January 2007 6:04 AM
:  To: solr-user@lucene.apache.org
:  Subject: Re: separate log files
: 
: 
:  : I wonder of jetty or tomcat can be configured to put logging output
:  : for different webapps in different log files...
: 
:  I've never tried it, but the Tomcat docs do talk about Tomcat
:  providing a custom implementation of java.util.logging
:  specifically for this purpose.
: 
:  Ben: please take a look at this doc...
: 
:  http://tomcat.apache.org/tomcat-5.5-doc/logging.html
: 
:  ..specifically the section on java.util.logging (since that's
:  what Solr
:  uses) ... I believe you'll want something like the Example
:  logging.properties file to be placed in common/classes so
:  that you can control the logging.
: 
:  Please let us all know if this works for you ... it would
:  make a great addition to the SolrTomcat wiki page.
: 
: 
:  : On 1/15/07, Ben Incani [EMAIL PROTECTED] wrote:
:  :  Hi Solr users,
:  : 
:  :  I'm running multiple instances of Solr, which are all loading
:  :  from the same war file.
:  : 
:  :  Below is an example of the servlet context file used for each
:  :  application.
:  : 
:  :  <Context path="/app1-solr" docBase="/var/usr/solr/solr-1.0.war"
:  :      debug="0" crossContext="true">
:  :    <Environment name="solr/home" type="java.lang.String"
:  :        value="/var/local/app1" override="true" />
:  :  </Context>
:  : 
:  :  Hence each application is using the same
:  :  WEB-INF/classes/logging.properties file to configure logging.
:  : 
:  :  I would like each instance to log to separate log
:  :  files, such as:
:  :  app1-solr.yyyy-mm-dd.log
:  :  app2-solr.yyyy-mm-dd.log
:  :  ...
:  : 
:  :  Is there an easy way to append 

Re: Can this be achieved? (Was: document support for file system crawling)

2007-01-19 Thread Mike Klaas

On 1/19/07, Walter Underwood [EMAIL PROTECTED] wrote:


Ultraseek does exactly that and is a single multi-threaded process.
A single process is much easier for the admin. A multi-process solution
is more complicated to start up, monitor, shut down, and upgrade.

There is decent demand for a spidering enterprise search engine.
Look at the Google Appliance, Ultraseek, and IBM OmniFind. The
free IBM OmniFind Yahoo! Edition uses Lucene.

I'd love to see the Ultraseek spider connected to Solr, but that
depends on Autonomy.


You could accomplish this by throwing them together as various webapps
in a single container instance.

-Mike


[jira] Created: (SOLR-114) HashDocSet new hash(), andNot(), union()

2007-01-19 Thread Yonik Seeley (JIRA)
HashDocSet new hash(), andNot(), union()


 Key: SOLR-114
 URL: https://issues.apache.org/jira/browse/SOLR-114
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Yonik Seeley


Looking at the negative filters stuff, I realized that andNot() had no 
optimized implementation for HashDocSet, so I implemented that and union().

While I was in there, I did a re-analysis of hash collision rates and came up 
with a cool new hash method that goes directly into a linear scan and is hence 
simpler, faster, and has fewer collisions.
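
As a rough illustration only (not the attached patch -- all names below are
invented), an open-addressing doc set that hashes straight into the table and
then linear-scans, with andNot()/union() built on exists(), might look like:

    // Illustrative sketch, not hashdocset.patch: an open-addressing set of
    // doc ids probed with a plain linear scan.
    public class SimpleHashDocSet {
      private static final int EMPTY = -1;
      private final int[] table;   // doc ids, EMPTY where unused
      private final int mask;      // table length is a power of two

      public SimpleHashDocSet(int[] docs, int offset, int len) {
        int size = Integer.highestOneBit(Math.max(4, len * 2)) << 1;  // load factor <= 0.5
        table = new int[size];
        java.util.Arrays.fill(table, EMPTY);
        mask = size - 1;
        for (int i = 0; i < len; i++) add(docs[offset + i]);
      }

      private void add(int doc) {
        int slot = doc & mask;
        while (table[slot] != EMPTY && table[slot] != doc) slot = (slot + 1) & mask;
        table[slot] = doc;
      }

      public boolean exists(int doc) {
        for (int slot = doc & mask; ; slot = (slot + 1) & mask) {
          if (table[slot] == doc) return true;
          if (table[slot] == EMPTY) return false;
        }
      }

      /** Docs in this set but not in 'other'. */
      public SimpleHashDocSet andNot(SimpleHashDocSet other) {
        int[] keep = new int[table.length];
        int n = 0;
        for (int v : table) if (v != EMPTY && !other.exists(v)) keep[n++] = v;
        return new SimpleHashDocSet(keep, 0, n);
      }

      /** Docs in either set. */
      public SimpleHashDocSet union(SimpleHashDocSet other) {
        int[] all = new int[table.length + other.table.length];
        int n = 0;
        for (int v : table) if (v != EMPTY) all[n++] = v;
        for (int v : other.table) if (v != EMPTY && !exists(v)) all[n++] = v;
        return new SimpleHashDocSet(all, 0, n);
      }
    }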





[jira] Updated: (SOLR-114) HashDocSet new hash(), andNot(), union()

2007-01-19 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-114:
--

Attachment: hashdocset.patch

 HashDocSet new hash(), andNot(), union()
 

 Key: SOLR-114
 URL: https://issues.apache.org/jira/browse/SOLR-114
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Yonik Seeley
 Attachments: hashdocset.patch


 Looking at the negative filters stuff, I realized that andNot() had no 
 optimized implementation for HashDocSet, so I implemented that and union().
 While I was in there, I did a re-analysis of hash collision rates and came up 
 with a cool new hash method that goes directly into a linear scan and is 
 hence simpler, faster, and has fewer collisions.





[jira] Commented: (SOLR-114) HashDocSet new hash(), andNot(), union()

2007-01-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466154
 ] 

Yonik Seeley commented on SOLR-114:
---

Performance results:
  - HashDocSet.exists() is 13% faster
  - HashDocSet.intersectionSize() is thus 9% faster
  - HashDocSet.union() is 20 times faster
  - HashDocSet.andNot() is 27 times faster

Tested with Sun JDK6 -server on a P4

 HashDocSet new hash(), andNot(), union()
 

 Key: SOLR-114
 URL: https://issues.apache.org/jira/browse/SOLR-114
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Yonik Seeley
 Attachments: hashdocset.patch


 Looking at the negative filters stuff, I realized that andNot() had no 
 optimized implementation for HashDocSet, so I implemented that and union().
 While I was in there, I did a re-analysis of hash collision rates and came up 
 with a cool new hash method that goes directly into a linear scan and is 
 hence simpler, faster, and has fewer collisions.





[jira] Commented: (SOLR-114) HashDocSet new hash(), andNot(), union()

2007-01-19 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466160
 ] 

Hoss Man commented on SOLR-114:
---

quick questions...

1) What test did you run to get those numbers? ... even if we don't commit it, 
we should attach it to this Jira issue.
2) We should probably test at least the Sun 1.5 JVM as well, right?




 HashDocSet new hash(), andNot(), union()
 

 Key: SOLR-114
 URL: https://issues.apache.org/jira/browse/SOLR-114
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Yonik Seeley
 Attachments: hashdocset.patch


 Looking at the negative filters stuff, I realized that andNot() had no 
 optimized implementation for HashDocSet, so I implemented that and union().
 While I was in there, I did a re-analysis of hash collision rates and came up 
 with a cool new hash method that goes directly into a linear scan and is 
 hence simpler, faster, and has fewer collisions.





[jira] Commented: (SOLR-114) HashDocSet new hash(), andNot(), union()

2007-01-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466166
 ] 

Yonik Seeley commented on SOLR-114:
---

The performance tests are commented out in the TestDocSet test... I  had other 
changes in my tree related to negative queries and only selected the two source 
files for diffs.

I had quickly tested Java5 to make sure it was still faster in all instances, 
and it was.  Numbers were about the same, some speedups larger and some smaller 
than Java6.

 HashDocSet new hash(), andNot(), union()
 

 Key: SOLR-114
 URL: https://issues.apache.org/jira/browse/SOLR-114
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Yonik Seeley
 Attachments: hashdocset.patch


 Looking at the negative filters stuff, I realized that andNot() had no 
 optimized implementation for HashDocSet, so I implemented that and union().
 While I was in there, I did a re-analysis of hash collision rates and came up 
 with a cool new hash method that goes directly into a linear scan and is 
 hence simpler, faster, and has fewer collisions.





[jira] Updated: (SOLR-114) HashDocSet new hash(), andNot(), union()

2007-01-19 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-114:
--

Attachment: test.patch

 HashDocSet new hash(), andNot(), union()
 

 Key: SOLR-114
 URL: https://issues.apache.org/jira/browse/SOLR-114
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Yonik Seeley
 Attachments: hashdocset.patch, test.patch


 Looking at the negative filters stuff, I realized that andNot() had no 
 optimized implementation for HashDocSet, so I implemented that and union().
 While I was in there, I did a re-analysis of hash collision rates and came up 
 with a cool new hash method that goes directly into a linear scan and is 
 hence simpler, faster, and has fewer collisions.





[jira] Commented: (SOLR-114) HashDocSet new hash(), andNot(), union()

2007-01-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466176
 ] 

Yonik Seeley commented on SOLR-114:
---

Tested on an AMD Opteron, 64-bit mode, Java5 -server -Xbatch: exists() was 
8.5% faster, intersectionSize() was 7% faster.
I didn't bother testing union() and andNot(), as they are obviously going to be 
much faster.


 HashDocSet new hash(), andNot(), union()
 

 Key: SOLR-114
 URL: https://issues.apache.org/jira/browse/SOLR-114
 Project: Solr
  Issue Type: Improvement
  Components: search
Reporter: Yonik Seeley
 Attachments: hashdocset.patch, test.patch


 Looking at the negative filters stuff, I realized that andNot() had no 
 optimized implementation for HashDocSet, so I implemented that and union().
 While I was in there, I did a re-analysis of hash collision rates and came up 
 with a cool new hash method that goes directly into a linear scan and is 
 hence simpler, faster, and has fewer collisions.





Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

On 1/19/07, Yonik Seeley [EMAIL PROTECTED] wrote:

On 1/19/07, Chris Hostetter [EMAIL PROTECTED] wrote:
 whoa ... hold on a minute, even if we use a ServletFilter to do all of the
 dispatching instead of a Servlet we still need a base path right?

I thought that's what the filter gave you... the ability to filter any
URL to the /solr webapp, and Ryan was doing a lookup on the next
element for a request handler.



yes, this is the beauty of a Filter.  It *can* process the request
and/or it can pass it along.  There is no problem at all with mapping
a filter to all requests and a servlet to some paths.  The filter will
only handle paths declared in solrconfig.xml; everything else will be
handled however it is defined in web.xml.

(As a sidenote, Wicket 2.0 replaces their dispatch servlet with a
filter -- it makes it MUCH easier to have their app co-exist with other
things in a shared URL structure.)
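
A bare-bones sketch of that pass-through behavior (class and field names are
invented; a real dispatcher would also parse the params/streams and write the
response):

    import java.io.IOException;
    import java.util.Map;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import org.apache.solr.request.SolrRequestHandler;

    public class SolrDispatchFilterSketch implements Filter {
      // assumed to be populated from the handlers registered in solrconfig.xml
      private Map<String, SolrRequestHandler> handlers;

      public void init(FilterConfig config) {}

      public void doFilter(ServletRequest req, ServletResponse rsp, FilterChain chain)
          throws IOException, ServletException {
        String path = ((HttpServletRequest) req).getServletPath();
        SolrRequestHandler handler = (handlers == null) ? null : handlers.get(path);
        if (handler != null) {
          // the path is ours: parse the request, execute the handler, write the response
        } else {
          chain.doFilter(req, rsp);  // not ours: let the web.xml mappings handle it as usual
        }
      }

      public void destroy() {}
    }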


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

then all is fine and dandy ... but what happens if someone tries to
configure a plugin with the name "admin" ... now all of the existing admin
pages break.



that is exactly what you would expect to happen if you map a handler
to /admin.  The person configuring solrconfig.xml is saying "Hey, use
this instead of the default /admin.  I want mine to make sure you are
logged in using my custom authentication method."  In addition, it may
be reasonable (sometime in the future) to implement /admin as a
RequestHandler.  This could be a clean way to address SOLR-58 (XML
with stylesheets, or JSON, etc...)



also: what happens a year from now when we add some completely new
Servlet/ServletFilter to Solr, and want to give it a unique URL...

  http://host:<port>/solr/bar/



obviously, I think the default Solr settings should be prudent about
selecting URLs.  The standard configuration should probably map most
things to /select/xxx or /update/xxx.


...we could put it earlier in the processing chain before the existing
ServletFilter, but then we break any users that have registered a plugin
with the name "bar".


Even if we move this to have a prefix path, we run into the exact same
issue when sometime down the line Solr has a default handler mapped to
'bar':

/solr/dispatcher/bar

But, if it ever becomes a problem, we can add an excludes pattern to
the filter-config that would  skip processing even if it maps to a
known handler.



more short term: if there is no prefix that the ServletFilter requires,
then supporting the legacy http://host:<port>/solr/update and
http://host:<port>/solr/select URLs becomes harder,


I don't think /update or /select need to be legacy URLs.  They can
(and should) continue to work as they currently do using a new framework.

The reason I was suggesting that the Handler interface adds support to
ask for the default RequestParser and/or ResponseWriter is to support
this exact issue.  (However in the case of path=/select the filter
would need to get the handler from ?qt=xxx)

- - - - -

All that said, this could just as cleanly map everything to:
 /solr/dispatch/update/xml
 /solr/cmd/update/xml
 /solr/handle/update/xml
 /solr/do/update/xml

thoughts?


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Yonik Seeley

On 1/19/07, Ryan McKinley [EMAIL PROTECTED] wrote:

All that said, this could just as cleanly map everything to:
  /solr/dispatch/update/xml
  /solr/cmd/update/xml
  /solr/handle/update/xml
  /solr/do/update/xml

thoughts?


That was my original assumption (because I was thinking of using
servlets, not a filter),
but I see little advantage to scoping under additional path elements.
I also agree with the other points you make.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

(Note: this is different than what I have suggested before.  Treat it
as brainstorming on how to take what I have suggested and mesh it with
your concerns.)

What if:

The RequestParser would not be part of the core API -- it would be a
helper function for Servlets and Filters that call the core API.  It
could be configured in web.xml rather than solrconfig.xml.  A
RequestDispatcher (Servlet or Filter) would be configured with a
single RequestParser.

The RequestParser would be in charge of taking the HttpRequest and determining:
 1) The RequestHandler
 2) The SolrRequest (Params & Streams)

It would not be the most 'pluggable' of plugins, but I am still having
trouble imagining anything beyond a single default RequestParser.
Assuming anything doing *really* complex ways of extracting
ContentStreams will do it in the Handler, not the request parser.  For
reference, see my argument for a separate DocumentParser interface in:
http://www.nabble.com/Re%3A-Update-Plugins-%28was-Re%3A-Handling-disparate-data-sources-in-Solr%29-p8386161.html

In my view, the default one could be mapped to /* and a custom one
could be mapped to /mycustomparser/*

This would drop the ':' from my proposed URL and change the scheme to look like:
/parser/path/the/parser/knows/how/to/extract/?params

This would give people a relatively easy way to implement 'restful'
URLs if they need to.  (But they would have to edit web.xml.)



: Would that be configured in solrconfig.xml as <handler name="xml">?
: name="update/xml"?  If it is "update/xml" would it only really work if
: the 'update' servlet were configured properly?

it would only make sense to map that as "xml" ... the SolrCore (and the
solrconfig.xml) shouldn't have any knowledge of the Servlet/ServletFilter
base paths because it should be possible to use the SolrCore independent
of any servlet container (if for no other reason than in unit tests)



Correct, SolrCore should not care what the request path is.  That is
why I want to deprecate the execute( ) function that assumes the
handler is defined by 'qt'.

Unit tests should be handled by execute( handler, req, res ).

If I had my druthers, it would be:
 res = handler.execute( req )
but that is too big of a leap for now :)



...

A third use case of doing queries with POST might be that you want to use
standard CGI form encoding/multi-part file upload semantics of HTTP to
send an XML file (or files) to the above mentioned XmlQPRequestHandler ...
so then we have MultiPartMimeRequestParser ...


I agree with all your use cases.  It just seems like a LOT of complex
overhead to extract the general aspects of translating a
URL+Params+Streams = Handler+Request(Params+Streams)

Again, since the number of 'RequestParsers' is small, it seems overly
complex to have a separate plugin to extract the URL, another to extract
the Handler, and another to extract the streams.  Particularly since
the decisions on how you parse the URL can totally affect the other
aspects.




...I really, really, REALLY don't like the idea that the RequestParser
Impls -- classes users should be free to write on their own and plug in to
Solr using the solrconfig.xml -- are responsible for the URL parsing and
parameter extraction.  Maybe calling them RequestParser in my suggested
design is misleading, maybe a better name like StreamExtractor would be
better ... but they shouldn't be in charge of doing anything with the URL.



What if it were configured in web.xml -- would you feel more comfortable
letting it determine how the URL is parsed and streams are extracted?


Imagine if 3 years ago, when Yonik and I were first hammering out the API
for SolrRequestHandlers, we had picked this...

   public interface SolrRequestHandler extends SolrInfoMBean {
     public void init(NamedList args);
     public void handleRequest(HttpServletRequest req, SolrQueryResponse rsp);
   }


Thank goodness you didn't!  I'm confident you won't let me (or anyone)
talk you into something like that!  You guys made a lot of good
choices and Solr is an amazing platform for it.

That said, the task at issue is: how do we convert an arbitrary
HttpServletRequest into a SolrRequest?

I am proposing we have a single interface to do this:
 SolrRequest r = RequestParser.parse( HttpServletRequest  )

You are proposing this is broken down further.  Something like:
 Handler h = (the filter) getHandler( req.getPath() )
 SolrParams = (the filter) do stuff to extract the params (using
parser.preProcess())
 ContentStreams = parser.parse( request )

While it is not great to have plugins manipulate the HttpRequest,
someone needs to do it.  In my opinion, the RequestParser's job is to
isolate *everything* *else* from the HttpServletRequest.

Again, since the number of RequestParsers is small, it seems OK (to me).
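
In other words, something along these lines (all names illustrative; the
nested type just stands in for whatever handler+params+streams bundle the
core ends up using):

    import javax.servlet.http.HttpServletRequest;

    public interface RequestParserSketch {

      /** Stand-in for the parsed result: the chosen handler plus params and streams. */
      interface ParsedSolrRequest {}

      // The parser is the only piece that ever touches the raw HttpServletRequest;
      // everything downstream works against the parsed result.
      ParsedSolrRequest parse(HttpServletRequest httpRequest) throws Exception;
    }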



keeping HttpServletRequest out of the API for RequestParsers helps us
future-proof against breaking plugins down the road.



I agree.  This is why I suggest the RequestParser is not a core part
of the API, just a helper class for Servlets and Filters.

Re: [jira] Commented: (SOLR-114) HashDocSet new hash(), andNot(), union()

2007-01-19 Thread Mike Klaas

On 1/19/07, Yonik Seeley (JIRA) [EMAIL PROTECTED] wrote:


[ 
https://issues.apache.org/jira/browse/SOLR-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466176
 ]

Yonik Seeley commented on SOLR-114:
---

Tested on an AMD Opteron, 64-bit mode, Java5 -server -Xbatch: exists() was 
8.5% faster, intersectionSize() was 7% faster.
I didn't bother testing union() and andNot(), as they are obviously going to be 
much faster.


Nice job!

-Mike


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Yonik Seeley

First Ryan, thank you for your patience on this *very* long hash
session.  Most wouldn't last that long unless it were a flame war ;-)
And thanks to Hoss, who seems to have the highest read+response
bandwidth of anyone I've ever seen (I'll admit I've only been
selectively reading this thread, with good intentions of coming back
to it).

On 1/19/07, Ryan McKinley [EMAIL PROTECTED] wrote:

It would not be the most 'pluggable' of plugins, but I am still having
trouble imagining anything beyond a single default RequestParser.
Assuming anything doing *really* complex ways of extracting
ContentStreams will do it in the Handler not the request parser.


Agreed... a custom handler opening various streams not covered by the
default will most easily be handled by the handler opening the streams
themselves.


This would give people a relatively easy way to implement 'restful'
URLs if they need to.  (But they would have to edit web.xml.)


A handler could alternatively get the rest of the path (absent params), right?


Correct, SolrCore should not care what the request path is.  That is
why I want to deprecate the execute( ) function that assumes the
handler is defined by 'qt'.

Unit tests should be handled by execute( handler, req, res )


How does the unit test get the handler?


If I had my druthers, it would be:
  res = handler.execute( req )
but that is too big of a leap for now :)


Yep... especially since the response writers now need the request for
parameters, for the searcher (streaming docs, etc).


You guys made a lot of good
choices and solr is an amazing platform for it.


I just wish I had known Lucene when I *started* Sol(a)r ;-)


I am proposing we have a single interface to do this:
  SolrRequest r = RequestParser.parse( HttpServletRequest  )


That's currently what new SolrServletRequest(HttpServletRequest) does.
We just need to figure out how to get InputStreams, Readers, etc.


I agree.  This is why I suggest the RequestParser is not a core part
of the API, just a helper class for Servlets and Filters.


Sounds good as a practical starting point to me.  If we need more
in the future, we can add it then.

USE CASE: The XML update plugin using the Woodstox XML parser:
Woodstox docs say to give the parser an InputStream (with char
encoding, if available) for best performance.  This is also preferable
since, if the charset isn't specified, the parser can try to snoop it from
the stream.

So, the handler needs to be able to get an InputStream, and HTTP headers.
Other plugins (CSV) will ask for a Reader and expect the details to be
ironed out for it.

Method 1: come up with ways to expose all this info through an
interface... a headers object could be added to the SolrRequest
context (see getContext()).
Method 2: consider it a more special case; have an XML update servlet
that puts that info into the SolrRequest (perhaps via the context
again).
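
To make the Woodstox point concrete, a rough sketch (not Solr code) of handing
the raw InputStream to a StAX factory, passing the charset only when we
actually have one:

    import java.io.InputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;

    public class XmlStreamSketch {
      public static XMLStreamReader open(InputStream body, String charsetOrNull)
          throws XMLStreamException {
        // picks up Woodstox if it is on the classpath
        XMLInputFactory factory = XMLInputFactory.newInstance();
        return (charsetOrNull != null)
            ? factory.createXMLStreamReader(body, charsetOrNull)
            : factory.createXMLStreamReader(body);  // parser snoops the encoding itself
      }
    }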

-Yonik


[jira] Created: (SOLR-115) replace BooleanQuery.getClauses() with clauses()

2007-01-19 Thread Yonik Seeley (JIRA)
replace BooleanQuery.getClauses() with clauses()


 Key: SOLR-115
 URL: https://issues.apache.org/jira/browse/SOLR-115
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
Priority: Minor


Basically, take advantage of
http://issues.apache.org/jira/browse/LUCENE-745
after we update Lucene versions.
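
A small illustration of the intended change (assuming the post-LUCENE-745 API;
the raw-List cast reflects the pre-generics return type):

    import java.util.List;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;

    public class ClausesSketch {
      @SuppressWarnings("unchecked")
      static void dump(BooleanQuery bq) {
        // before: BooleanClause[] clauses = bq.getClauses();
        for (BooleanClause clause : (List<BooleanClause>) bq.clauses()) {
          System.out.println(clause.getOccur() + " " + clause.getQuery());
        }
      }
    }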





Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Chris Hostetter

: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as I was leaving work this afternoon, it
occurred to me that I really hope Ryan realizes I like all of his ideas, I'm
just wondering if they can be better -- most people I work with don't
have the stamina to deal with my design reviews :)

What occurred to me as I was *getting* home was that since I seem to be the
only one that's (overly) worried about the RequestParser/HTTP abstraction
-- and since I haven't managed to convince Ryan after all of my badgering
-- it's probably just me being paranoid.

I think in general, the approach you've outlined should work great -- I'll
reply to some of your more recent comments directly.



-Hoss



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

On 1/19/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: First Ryan, thank you for your patience on this *very* long hash

I could not agree more ... as I was leaving work this afternoon, it
occurred to me that I really hope Ryan realizes I like all of his ideas, I'm
just wondering if they can be better -- most people I work with don't
have the stamina to deal with my design reviews :)



Thank you both!  This is the first time I've taken the time and effort
to contribute to an open source project.  I'm learning the
pace/etiquette etc. as I go along :)   Honestly, your critique is
refreshing -- I'm used to working alone or directing others.

I *think* we are close to something we will all be happy with.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley


what!? .. really? ... you don't think the ones I mentioned before are
things we should support out of the box?

  - no stream parser (needed for simple GETs)
  - single stream from raw post body (needed for current updates)
  - multiple streams from multipart MIME in post body (needed for SOLR-85)
  - multiple streams from files specified in params (needed for SOLR-66)
  - multiple streams from remote URLs specified in params



I have imagined the single default parser handles *all* the cases you
just mentioned.

GET: read params from paramMap().  Check those params for special
params that send you to one or many remote streams.

POST: depending on headers/content type etc., you parse the body as a
single stream or multi-part files, or read the params.

It will take some careful design, but I think all the standard cases
can be handled by a single parser.
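
A rough sketch of what that single default parser might do (names invented
throughout; "stream.url" is only an example of the kind of special param
meant above):

    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import javax.servlet.http.HttpServletRequest;

    public class DefaultStreamExtractionSketch {

      public static List<InputStream> extractStreams(HttpServletRequest req) throws Exception {
        List<InputStream> streams = new ArrayList<InputStream>();

        if ("GET".equals(req.getMethod())) {
          // no body: params may point at one or many remote streams
          String[] urls = req.getParameterValues("stream.url");
          if (urls != null) {
            for (String u : urls) streams.add(new URL(u).openStream());
          }
        } else {
          String contentType = req.getContentType();
          if (contentType != null && contentType.startsWith("multipart/")) {
            // multi-part upload: each part would become its own stream
            // (omitted: delegate to a multipart parsing library)
          } else {
            // otherwise the raw post body is a single stream
            streams.add(req.getInputStream());
          }
        }
        return streams;
      }
    }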


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Yonik Seeley

On 1/20/07, Ryan McKinley [EMAIL PROTECTED] wrote:


 what!? .. really? ... you don't think the ones I mentioned before are
 things we should support out of the box?

   - no stream parser (needed for simple GETs)
   - single stream from raw post body (needed for current updates)
   - multiple streams from multipart MIME in post body (needed for SOLR-85)
   - multiple streams from files specified in params (needed for SOLR-66)
   - multiple streams from remote URLs specified in params


I have imagined the single default parser handles *all* the cases you
just mentioned.


Yes, this is what I had envisioned.
And if we come up with another cool standard one, we can add it and
all the current/older handlers get that additional behavior for free.

-Yonik


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Ryan McKinley

:
: This would drop the ':' from my proposed URL and change the scheme to look 
like:
: /parser/path/the/parser/knows/how/to/extract/?params

I was totally okay with the ':' syntax (although we should double check if
':' is actually a legal unescaped URL character) .. but I'm confused by
this new suggestion ... is "parser" the name of the parser in that
example, and "path/the/parser/knows/how/to/extract" data that the parser
may use to build the SolrRequest with? (ie: perhaps the RequestHandler)

would parser names be required to not have slashes in them in that case?



(working with the assumption that most cases can be defined by a
single request parser)

I am/was suggesting that a dispatch servlet/filter has a single
request parser.  The default request parser will choose the handler
based on names defined in solrconfig.xml.  If someone needs a custom
RequestParser, it would be linked to a new servlet/filter (possibly)
mapped to a distinct prefix.

If it is not possible to handle most standard stream cases with a
single request parser, I will go back to the /path:parser format.

I suggest it is configured in web.xml because that is a configurable
place that is not solrconfig.xml.  I don't think it is or should be a
highly configurable component.



:
: Thank goodness you didn't!  I'm confident you won't let me (or anyone)
: talk you into something like that!  You guys made a lot of good

the point I was trying to make is that if we make a RequestParser
interface with a parseRequest(HttpServletRequest req) method, it amounts
to just as much badness -- the key is we can make that interface as long
as all the implementations are in the Solr code base where we can keep an
eye on them, and people have to go way, WAY, *WAY* into Solr to start
changing them.




Yes, implementing a RequestParser is more like writing a custom
Servlet than adding a Tokenizer.


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-19 Thread Yonik Seeley

On 1/20/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: I have imagined the single default parser handles *all* the cases you
: just mentioned.

Ah ... a lot of confusing things make more sense now ... but
some things are more confusing: if there is only one parser, and it
decides what to do based entirely on param names and HTTP headers, then
what's the point of having the parser name be part of the path in your
URL design?


I didn't think it would be part of the URL anymore.


: POST: depending on headers/content type etc you parse the body as a
: single stream, multi-part files or read the params.
:
: It will take some careful design, but I think all the standard cases
: can be handled by a single parser.

that scares me ... not only does it rely on the client code sending the
correct content-type


Not really... that would perhaps be the default, but the parser (or a
handler) can make intelligent decisions about that.

If you put the parser in the URL, then there's *that* to be messed up
by the client.


(I don't trust HTTP client code -- but for the sake
of argument let's assume all clients are perfect) what happens when a
person wants to send a MIME multi-part message *AS* the raw post body -- so
the RequestHandler gets it as a single ContentStream (ie: a single input
stream, MIME type of multipart/mixed)?


Multi-part posts will have the content-type set correctly, or it won't work.
The big use-case I see is browser file upload, and they will set it correctly.


This may sound like a completely ridiculous idea, but consider the
situation where someone is indexing email ... they've written a
RequestHandler that knows how to parse multipart MIME emails and
convert them to documents; they want to POST them directly to Solr and let
their RequestHandler deal with them as a single entity.


We should not preclude wacky handlers from doing things for
themselves, calling our stuff as utility methods.


...I think life would be a lot simpler if we kept the RequestParser name as
part of the URL, completely determined by the client (since the client
knows what it's trying to send) ... even if there are only 2 or 3 types of
RequestParsing being done.


Having to do different types of posts to different URLs doesn't seem
optimal, especially if we can do it in one.

-Yonik