Fwd: Reviving Nutch 0.7

2007-01-22 Thread Zaheed Haque

-- Forwarded message --
From: Zaheed Haque [EMAIL PROTECTED]
Date: Jan 22, 2007 10:13 AM
Subject: Re: Reviving Nutch 0.7
To: nutch-dev@lucene.apache.org


On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hi,

I've been meaning to write this message for a while, and Andrzej's 
StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, 
it will be even more valuable than it is today.  However, I think there is 
still a need for something much simpler, something like what Nutch 0.7 used to 
be.  Fairly regular nutch-user inquiries confirm this.  Nutch has too few 
developers to maintain and further develop both of these concepts, and the main 
Nutch developers need the more powerful version - 0.8 and beyond.  So, what is 
going to happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth at 
least considering and discussing the possibility of somehow branching that 
version into a parallel project that's not just in a maintenance mode, but has 
its own group of developers (not me, no time :( ) that pushes it forward.

Thoughts?


I agree with you that there is a need for 0.7-style Nutch. I wouldn't
say reviving, but rather dissecting and re-directing :-). Here you go
--- my focus here is 0.7-style, i.e. mid-size, enterprise needs.

Solr could use a good crawler because it has everything else (AFAIK).
This probably isn't technically plug and pray :-), and I'm not sure the
Solr community wants a crawler, but it could benefit from such a Solr
add-on/snap-on crawler. Furthermore, I am sure some of the 0.7 plugins
could be re-factored to fit into Solr.

I will forward the mail to the Solr community to see if there is any interest.

Cheers


[jira] Created: (SOLR-118) Some admin pages stop working with error 404 as the only symptom

2007-01-22 Thread Bertrand Delacretaz (JIRA)
Some admin pages stop working with error 404 as the only symptom
--

 Key: SOLR-118
 URL: https://issues.apache.org/jira/browse/SOLR-118
 Project: Solr
  Issue Type: Bug
  Components: web gui
 Environment: Fedora Core 4 (Linux version 2.6.11-1.1369_FC4smp)  Sun's 
JVM 1.5.0_07-b03
Reporter: Bertrand Delacretaz
Priority: Minor


This was reported to the mailing list a while ago, see 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200610.mbox/[EMAIL 
PROTECTED]

Today I'm seeing the same thing on a Solr instance that has been running since 
January 9th (about 13 days) with the plain start.jar setup. Index contains 
150'000 docs, 88322 search requests to date.

$ curl http://localhost:8983/solr/admin/analysis.jsp
<html>
<head>
<title>Error 404 /admin/analysis.jsp</title>
</head>
<body>
<h2>HTTP ERROR: 404</h2><pre>/admin/analysis.jsp</pre>
<p>RequestURI=/solr/admin/analysis.jsp</p>
...

$ curl http://localhost:8983/solr/admin/index.jsp
<html>
<head>
<title>Error 404 /admin/index.jsp</title>
</head>
<body>
<h2>HTTP ERROR: 404</h2><pre>/admin/index.jsp</pre>
<p>RequestURI=/solr/admin/index.jsp</p>
...

Other admin pages work correctly, for example 
http://localhost:8983/solr/admin/stats.jsp

I don't see any messages in the logs, which are capturing stdout and stderr 
from the JVM.

I guess I'll have to restart this instance; I'm out of ways to find out 
exactly what's happening.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: facet response

2007-01-22 Thread Erik Hatcher


On Jan 21, 2007, at 11:54 PM, Yonik Seeley wrote:

On 1/21/07, Erik Hatcher [EMAIL PROTECTED] wrote:

In the built-in simple faceting, I get a Ruby response like this:

  'facet_counts'=>{
   'facet_queries'=>{},
   'facet_fields'=>{
    'subject_genre_facet'=>{
     'Biography.'=>2605,
     'Congresses.'=>1837,
     'Bibliography.'=>672,
     'Exhibitions.'=>642,
     'Periodicals.'=>615},
...

This is using facet.limit=5 with no sort specified, so the items are
being written in the proper order; however, they are written in Ruby
Hash syntax, which does not iterate in a predictable order (like
Java's Map).  This really should be an Array in order for the client
to assume the response is in a specified order.  I think the response
is best formatted as:

  'facet_counts'=>{
   'facet_queries'=>{},
   'facet_fields'=>{
    'subject_genre_facet'=>[
     {'Biography.'=>2605},
     {'Congresses.'=>1837},
     {'Bibliography.'=>672},
     {'Exhibitions.'=>642},
     {'Periodicals.'=>615}],
...

This makes the navigation of the results a bit clunkier because each
item in a fields array is a single element Hash(Map), but the facets
of a field really need to be in an array to maintain order.
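As a sketch (assuming the array-of-single-pair-hashes shape proposed above), a Ruby client could walk the facets in order like this:

```ruby
# Facet counts in the proposed shape: an Array of single-entry Hashes,
# which preserves order regardless of Hash iteration semantics.
facets = [
  {'Biography.'    => 2605},
  {'Congresses.'   => 1837},
  {'Bibliography.' => 672}
]

# Each element is a one-pair Hash; #first yields its [key, value] pair.
ordered = facets.map(&:first)
ordered.each { |term, count| puts "#{term}: #{count}" }
```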

I presume this same dilemma exists in the Python/JSON format too?  In
XML, the <lst> element has the right semantics, and a parser would easily be
able to deal with it in order:

<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="subject_genre_facet">
   <int name="Biography.">2605</int>
   <int name="Congresses.">1837</int>
   <int name="Bibliography.">672</int>
   <int name="Exhibitions.">642</int>
   <int name="Periodicals.">615</int>
  </lst>
...

Thoughts?



But isn't this considered a bug in the data structure used to write  
out the facets?



Options:
 - Re-sort yourself (for this specific case); it's probably going to
be faster than having eval() create a new hash object for each anyway.


For sure it'd be no problem to re-sort on the client.  I was more  
concerned about the purity of the response and it losing its  
semantics as an ordered list.



 - Is there any place to hook into Ruby's parser and create a map
that preserves its order?  I would assume not.


Oh, I'm sure there is a way to accomplish that sort of trickery.
Ruby is hyper-dynamic, so I'm quite confident that with a little
voodoo this could be done.  But it's the wrong way to approach this.



 - Does ruby have a JSON parser that preserves the order?


Sure, even in one line of code :)

http://rubyforge.org/snippet/detail.php?type=snippet&id=29

But even more robustly, this looks like the one to use:
http://json.rubyforge.org/



 - A way to specify a different structure for only certain elements...


Again, shouldn't an array be used in this context anyway, rather than  
a hash, regardless of which response writer is being used?



If that's your *only* sorting problem, find a ruby implementation of a
hash or a map that preserves order, then re-sort the hash, and replace
it with the order-preserving map, and then worry about a more general
solution later?


Maybe what we need is a YAML response writer that uses an ordered
map: http://yaml.org/type/omap.html
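For reference, YAML's !!omap type (from the linked page) encodes an ordered mapping as a sequence of single-pair maps; a facet response might then look like this (a sketch only, not an existing Solr format):

```yaml
facet_counts:
  facet_fields:
    subject_genre_facet: !!omap
      - Biography.: 2605
      - Congresses.: 1837
      - Bibliography.: 672
```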


Erik



[jira] Commented: (SOLR-118) Some admin pages stop working with error 404 as the only symptom

2007-01-22 Thread Bertrand Delacretaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466468
 ] 

Bertrand Delacretaz commented on SOLR-118:
--

Yes, this is the Jetty that is bundled with Solr, Jetty/5.1.11RC0 according to 
the Server HTTP header.

I haven't investigated on the Jetty side yet; it might be a known bug there.





[jira] Commented: (SOLR-118) Some admin pages stop working with error 404 as the only symptom

2007-01-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466464
 ] 

Yonik Seeley commented on SOLR-118:
---

What version of jetty was it?  The one included with Solr?

I don't personally have experience with Solr + Jetty and long uptimes.  We use 
Resin in-house, and don't have any uptime issues.





Re: facet response

2007-01-22 Thread Yonik Seeley

On 1/22/07, Erik Hatcher [EMAIL PROTECTED] wrote:

But isn't this considered a bug in the data structure used to write
out the facets?


Not for JSON I think... that's a wire format, and the order is what
you see on the wire.
It can be a problem preserving order, depending on the client.

The problem is, for a large number of cases, the order doesn't matter,
even where we use a named list.  If you translate all named lists into
arrays of arrays or arrays of maps, it unnecessarily bloats the XML,
and makes the JSON much harder to read.

For example, the top level NamedList is ordered (responseHeader comes
first), but in Ruby/Python we normally don't care since we can access
by element name.


 Options:
  - Re-sort yourself (for this specific case); it's probably going to
 be faster than having eval() create a new hash object for each anyway.

For sure it'd be no problem to re-sort on the client.  I was more
concerned about the purity of the response and it losing its
semantics as an ordered list.

  - Is there any place to hook into Ruby's parser and create a map
 that preserves its order?  I would assume not.

Oh, I'm sure there is a way to accomplish that sort of trickery.
Ruby is hyper-dynamic, so I'm quite confident that with a little
voodoo this could be done.  But it's the wrong way to approach this.

  - Does ruby have a JSON parser that preserves the order?

Sure, even in one line of code :)

http://rubyforge.org/snippet/detail.php?type=snippet&id=29

But even more robustly, this looks like the one to use:
http://json.rubyforge.org/

  - A way to specify a different structure for only certain elements...

Again, shouldn't an array be used in this context anyway, rather than
a hash, regardless of which response writer is being used?


A NamedList is used... that is ordered for XML.
It seems like what would be ideal from a user perspective would be to
have an ordered map so you get random access lookup.

An efficient representation using arrays would be two separate arrays:
one for terms, one for counts.
terms=["foo","bar","baz"]
counts=[30,20,10]

But you lose easy random-access lookup, and for a sufficiently large
list, you lose the ability of a human to look at the raw response and
correlate a count with its term.
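The two-array idea can be sketched in Ruby (using the example data above; this just shows what a client has to do to get the pairing back):

```ruby
# Two parallel arrays, as in the representation above.
terms  = ["foo", "bar", "baz"]
counts = [30, 20, 10]

# zip re-pairs each term with its count, keeping the original order...
pairs = terms.zip(counts)   # [["foo", 30], ["bar", 20], ["baz", 10]]

# ...and a Hash built from the pairs restores random-access lookup.
lookup = Hash[pairs]
puts lookup["bar"]
```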

So what about something that could output something like
omap(term1,100,term2, 45)

The other alternative (besides changing *every* named list), is to
have a facet.format and override the default structure.

-Yonik


 If that's your *only* sorting problem, find a ruby implementation of a
 hash or a map that preserves order, then re-sort the hash, and replace
 it with the order-preserving map, and then worry about a more general
 solution later?

 Maybe what we need is a YAML response writer that uses an ordered
 map: http://yaml.org/type/omap.html

Erik


[jira] Commented: (SOLR-118) Some admin pages stop working with error 404 as the only symptom

2007-01-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466489
 ] 

Yonik Seeley commented on SOLR-118:
---

Maybe it's time to upgrade to the latest Jetty, or at least start evaluating it?
That would also remove the requirement for a JDK over a JRE, and speed up JSP 
page compilation too.






[jira] Commented: (SOLR-118) Some admin pages stop working with error 404 as the only symptom

2007-01-22 Thread Bertrand Delacretaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466496
 ] 

Bertrand Delacretaz commented on SOLR-118:
--

Upgrading is probably a good idea, at least to a released 5.x version, as 
apparently we're using a release candidate.





Re: Fwd: Reviving Nutch 0.7

2007-01-22 Thread Thorsten Scherler
On Mon, 2007-01-22 at 10:13 +0100, Zaheed Haque wrote:
 -- Forwarded message --
 From: Zaheed Haque [EMAIL PROTECTED]
 Date: Jan 22, 2007 10:13 AM
 Subject: Re: Reviving Nutch 0.7
 To: nutch-dev@lucene.apache.org
 
 
 On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:
  Hi,
 
  I've been meaning to write this message for a while, and Andrzej's 
  StrategicGoals made me compose it, finally.
 
  Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop 
  stabilizes, it will be even more valuable than it is today.  However, I 
  think there is still a need for something much simpler, something like what 
  Nutch 0.7 used to be.  Fairly regular nutch-user inquiries confirm this.  
  Nutch has too few developers to maintain and further develop both of these 
  concepts, and the main Nutch developers need the more powerful version - 
  0.8 and beyond.  So, what is going to happen to 0.7?  Maintenance mode?
 
  I feel that there is enough need for 0.7-style Nutch that it might be worth 
  at least considering and discussing the possibility of somehow branching 
  that version into a parallel project that's not just in a maintenance mode, 
  but has its own group of developers (not me, no time :( ) that pushes it 
  forward.
 
  Thoughts?
 

I do not really want to comment on the 0.7 part of this discussion.

 I agree with you that there is a need for 0.7-style Nutch. I wouldn't
 say reviving, but rather dissecting and re-directing :-). Here you go
 --- my focus here is 0.7-style, i.e. mid-size, enterprise needs.
 
 Solr could use a good crawler because it has everything else (AFAIK).
 This probably isn't technically plug and pray :-), and I'm not sure the
 Solr community wants a crawler, but it could benefit from such a Solr
 add-on/snap-on crawler. 

I used the Forrest/Cocoon CLI as a crawler in a Forrest plugin I wrote. I
will need to look into the Nutch crawler code to see whether we could
reuse it. I'm not sure how closely it is married to the db, but I guess
pretty closely. 

 Furthermore I am sure some of the 0.7 plugins
 could be re-factored to fit into Solr.

The thing is, by introducing all these plugins into Solr we may pretty
soon end up in the situation the original thread is describing. We may
blow up the one simple, well-defined thing that we want to solve into
something with too many plugins and components. 

I'd like to have Solr tools that perform well-defined processes, like
updating the Solr server with crawled content, but as I said, they are
IMO tools, not really part of the Solr core.

In the end, if you want an enhanced search experience via Solr with all
the filter goodies, then you need to add more fields than the ones from,
e.g., the standard Nutch XHTML parser. 

Certain documents allow fine-grained filtering based on additional
information these documents may provide (year, type, organization,
author, etc.). It is easy to write a single component to update a
certain doc type or set of information against Solr, but IMO that
should not be the focus of main Solr development.

I think that should go into a tools/ dir. 

 
 I will forward the mail to the Solr community to see if there is any interest.

Thanks Zaheed. This fits well into the Update Plugins thread.

salu2

 
 Cheers
-- 
thorsten

Together we stand, divided we fall! 
Hey you (Pink Floyd)




Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-22 Thread Chris Hostetter

:3) there's a comment in RequestHandlerBase.init about indexOf that
:   comes from the existing impl in DismaxRequestHandler -- but doesn't match
:   the new code ... i also wasn't certain that the change you made matches

:  I just copied the code from DismaxRequestHandler and made sure it
:  passes the tests.  I don't totally understand what that case is doing.
:
: The first iteration of dismax (before we did generic defaults,
: invariants, etc for request handlers) took defaults directly from the
: init params, and that is what that case is checking for and

bingo .. the reason it jumped out at me in your patch is that the comment
still referred to indexOf, but the code didn't ... it might be functionally
equivalent, i just wasn't sure when i did my quick read.

there's mention in the comment that indexOf is used so that an empty
<null name="defaults"/> can indicate that you don't want all the init params as
defaults, but you don't actually want defaults either -- but there
doesn't seem to be a test for that case.

you can see support for the legacy defaults syntax in
src/test/test-files/solr/conf/solrconfig.xml if you grep for
dismaxOldStyleDefaults



-Hoss



Re: continuous integration for solrb

2007-01-22 Thread Chris Hostetter

: While we are on the subject of continuous integration, does anyone think that
: we should also do so for Lucene and Solr?  Doing so we give us a heads-up if
: changes in Lucene breaks Solr.

the current nightly Solr builds may not be continuous but they are
regular, and they will fail and complain if there are any build/test
failures in Solr.  rigging up a separate recurring build that always uses
the latest Lucene nightly build is an interesting idea ... but if Lucene
starts doing more frequent releases and treats the trunk as more
unstable (which is the direction it seems to be heading) this may not be
that useful.

Of course: if/when that happens, Solr will probably want to stop using
nightly builds of Lucene anyway, and only rev on their official point
releases.

:
: Bill
:
: On 1/22/07, Bertrand Delacretaz [EMAIL PROTECTED] wrote:
:  On 1/22/07, Erik Hatcher [EMAIL PROTECTED] wrote:
: 
:   ...I don't know much about our Solaris zone, so could someone fill me in
:   on it a bit?...
: 
:  I haven't seen Solr's zone yet, but basically zones are Solaris
:  (virtual) machines where some of us can get root access, so we can
:  install anything there as long as it plays nice with other zones in
:  terms of CPU and memory usage. Currently all of the ASF's zones are
:  sharing a - fairly powerful - physical machine.
: 
:  For example, the Cocoon zone at http://cocoon.zones.apache.org/ runs
:  live demos of Cocoon pulled automatically out of SVN every few hours
:  by crontab scripts, the Continuum continuous integration server, and
:  the Daisy CMS for editing docs.
: 
:  There's more info at http://www.apache.org/dev/solaris-zones.html
: 
:  HTH,
:  -Bertrand
: 
:



-Hoss



Re: facet response

2007-01-22 Thread Yonik Seeley

On 1/22/07, Yonik Seeley [EMAIL PROTECTED] wrote:

An efficient representation using arrays would be two separate arrays:
one for terms, one for counts.
terms=["foo","bar","baz"]
counts=[30,20,10]

But you lose easy random-access lookup, and for a sufficiently large
list, you lose the ability of a human to look at the raw response and
correlate a count with its term.


Or if you want to retain the human readability, a single array:
["foo",30,"bar",20,"baz",10]
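A client can recover ordered pairs from such an interleaved array; a Ruby sketch:

```ruby
# Keys and values interleaved in a single flat array, as above.
flat = ["foo", 30, "bar", 20, "baz", 10]

# each_slice(2) walks the array two elements at a time,
# yielding [term, count] pairs in their original order.
pairs = flat.each_slice(2).to_a
pairs.each { |term, count| puts "#{term}: #{count}" }
```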

We could introduce some new types to tell the output handlers just how
important it is to maintain order, so all named lists don't get
treated the same.

Example:
class OrderedNamedList extends NamedList {...}

Using OrderedNamedList means that it's really important that order be
maintained, and we could use a different strategy, such as interleaving
keys and values in a single array (or another strategy set by
json.orderednl?).

Thoughts?

-Yonik


So what about something that could output something like
omap(term1,100,term2, 45)

The other alternative (besides changing *every* named list), is to
have a facet.format and override the default structure.

-Yonik

  If that's your *only* sorting problem, find a ruby implementation of a
  hash or a map that preserves order, then re-sort the hash, and replace
  it with the order-preserving map, and then worry about a more general
  solution later?

 Maybe what we need is a YAML response writer, and used an ordered
 map: http://yaml.org/type/omap.html

 Erik


Re: continuous integration for solrb

2007-01-22 Thread Yonik Seeley

On 1/22/07, Chris Hostetter [EMAIL PROTECTED] wrote:

Of course: if/when that happens, Solr will probably want to stop using
nightly builds of lucene anyway, and only rev when their official point
releases.


I'd rather play that one by ear... I haven't rev'd the lucene version
recently because of all the file format changes going on.  At this
point, I would feel a little more comfortable with waiting until the
next release, or near to it.

-Yonik


[jira] Commented: (SOLR-118) Some admin pages stop working with error 404 as the only symptom

2007-01-22 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466535
 ] 

Hoss Man commented on SOLR-118:
---

FYI: there was more to that original thread than the Apache archives show 
(because they are split up by month); here's the full discussion...

http://www.nabble.com/Admin-page-went-down-tf2548760.html#a7103716

...at the time i wasn't able to reproduce the problem, but i wasn't hammering 
the port very hard.  I strongly suspect that since the problem was only with the 
admin pages, and all of the update/query functionality still worked fine, 
it was a JSP issue with Jetty.






[jira] Commented: (SOLR-118) Some admin pages stop working with error 404 as the only symptom

2007-01-22 Thread Bertrand Delacretaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466539
 ] 

Bertrand Delacretaz commented on SOLR-118:
--

 I suspect that it was a JSP issue with Jetty. 

Yes, certainly. Nothing seems to indicate a problem in Solr's code.





Re: continuous integration for solrb

2007-01-22 Thread Erik Hatcher


On Jan 22, 2007, at 1:44 AM, Bertrand Delacretaz wrote:

There's more info at http://www.apache.org/dev/solaris-zones.html


Anyone here capable of giving me an account on the Lucene zone to let  
me tinker a bit?


Or is Doug the right person to set me up?

Thanks,
Erik



Re: facet response

2007-01-22 Thread Chris Hostetter

For the record, I predicted this problem would come up...

http://www.nabble.com/JSON-output-support-tf1915406.html#a5247622

: We could introduce some new types to tell the output handlers just how
: important it is to maintain order, so all named lists don't get
: treated the same.
:
: Example:
: class OrderedNamedList extends NamedList {...}

Ugh ... please no ... the List in NamedList is what indicates that
order is a factor (it's what distinguishes a NamedList from a hypothetical
MultiValueMap)

: Using OrderedNamedList, means that it's really important that order be
: maintained, and we could use a different strategy such as interleaving
: keys and values in a single array (or another strategy set by
: json.orderednl?)

i would much rather see us change any places in the Solr request handlers
where order does *not* matter to just use a Map ...then the ResponseWriter
could know that order in Maps doesn't matter, but order in NamedLists do.




-Hoss
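A minimal sketch of the distinction being argued here (hypothetical classes, not Solr's actual NamedList API): an ordered name/value list preserves insertion order and duplicate keys, where a plain HashMap guarantees neither.

```java
import java.util.*;

// Hypothetical stand-in for Solr's NamedList: parallel lists keep
// insertion order and allow the same name to appear more than once.
class SimpleNamedList {
    private final List<String> names = new ArrayList<>();
    private final List<Object> values = new ArrayList<>();

    void add(String name, Object value) {
        names.add(name);
        values.add(value);
    }

    String getName(int i) { return names.get(i); }
    Object getVal(int i)  { return values.get(i); }
    int size()            { return names.size(); }
}

public class NamedListDemo {
    public static void main(String[] args) {
        SimpleNamedList nl = new SimpleNamedList();
        nl.add("b", 2);
        nl.add("a", 1);
        nl.add("a", 3);            // duplicate names are legal

        // Order and duplicates survive in the NamedList:
        System.out.println(nl.getName(0) + "=" + nl.getVal(0)); // b=2
        System.out.println(nl.size());                          // 3

        // A HashMap loses both guarantees: the second "a" overwrites
        // the first, and iteration order is unspecified.
        Map<String, Integer> m = new HashMap<>();
        m.put("b", 2);
        m.put("a", 1);
        m.put("a", 3);
        System.out.println(m.size()); // 2
    }
}
```

This is why a response writer can treat a Map as "order doesn't matter" but must preserve the sequence of a NamedList.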



Re: facet response

2007-01-22 Thread Yonik Seeley

On 1/22/07, Chris Hostetter [EMAIL PROTECTED] wrote:


For the record, I predicted this problem would come up...

http://www.nabble.com/JSON-output-support-tf1915406.html#a5247622

: We could introduce some new types to tell the output handlers just how
: important it is to maintain order, so all named lists don't get
: treated the same.
:
: Example:
: class OrderedNamedList extends NamedList {...}

Ugh ... please no ... the List in NamedList is what indicates that
order is a factor (it's what distinguishes a NamedList from a hypothetical
MultiValueMap)


But we also use it when order doesn't totally matter, but it's still nice.


: Using OrderedNamedList, means that it's really important that order be
: maintained, and we could a different strategy such as interleaving
: keys and values in a single array (or another strategy set by
: json.orderednl?)

i would much rather see us change any places in the Solr request handlers
where order does *not* matter to just use a Map ...then the ResponseWriter
could know that order in Maps doesn't matter, but order in NamedLists do.


The problem is that it's not clear cut.
Should the top level (containing responseHeader, result, facets, etc)
be ordered or unordered?  We currently order pretty much everything,
even when it's slightly redundant (we highlight docs in order, but we
also include a unique id).

Even when order doesn't strictly matter, it's still nice to see things
ordered in the response (the responseHeader first for instance).

So what's your proposal for what a facet list should look like in JSON?

-Yonik


[jira] Resolved: (SOLR-80) negative filter queries

2007-01-22 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-80?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-80.
--

Resolution: Fixed

committed.
Thanks for the review Mike!

 negative filter queries
 ---

 Key: SOLR-80
 URL: https://issues.apache.org/jira/browse/SOLR-80
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Attachments: negative_filters.patch, negative_filters.patch


 There is a need for negative filter queries to avoid long filter generation 
 times and large caching requirements.
 Currently, if someone wants to filter out a small number of documents, they 
 must specify the complete set of documents to express those negative 
 conditions against.  
 q=foo&fq=id:[* TO *] -id:101
 In this example, to filter out a single document, the complete set of 
 documents (minus one) is generated, and a large bitset is cached.  You could 
 also add the restriction to the main query, but that doesn't work with the 
 dismax handler which doesn't have a facility for this.
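The saving described above can be sketched with plain java.util.BitSet standing in for Solr's DocSet (hypothetical doc ids, not the real API): the negative filter only needs its small positive form materialized, and the answer is the match-all base set andNot that form.

```java
import java.util.BitSet;

public class NegativeFilterSketch {
    public static void main(String[] args) {
        int maxDoc = 1000;

        // The small positive form of the negative filter: just doc 101.
        BitSet excluded = new BitSet(maxDoc);
        excluded.set(101);

        // Base set of all documents (what MatchAllDocsQuery produces).
        BitSet all = new BitSet(maxDoc);
        all.set(0, maxDoc);

        // fq=-id:101  ==  all docs andNot the positive set.
        BitSet filter = (BitSet) all.clone();
        filter.andNot(excluded);

        System.out.println(filter.cardinality()); // 999
        System.out.println(filter.get(101));      // false
    }
}
```

Only the one-bit `excluded` set needs caching per distinct negative filter, instead of a near-full bitset per filter as with `id:[* TO *] -id:101`.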

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: facet response

2007-01-22 Thread Yonik Seeley

Chris Hostetter [EMAIL PROTECTED] wrote:

as i said, i'd rather invert the use case set to find where ordering
isn't important and change those to Maps


That might be a *lot* of changes...
What's currently broken, just faceting or anything else?

-Yonik


Re: graduation todo list

2007-01-22 Thread Ryan McKinley

Here is a *trivial* one:

the 'Documentation' link on src/webapp/resources/admin/index.jsp still
points to:
http://incubator.apache.org/solr/


Re: [jira] Resolved: (SOLR-80) negative filter queries

2007-01-22 Thread Erik Hatcher


On Jan 22, 2007, at 4:43 PM, Yonik Seeley (JIRA) wrote:

Yonik Seeley resolved SOLR-80.
--

Resolution: Fixed

committed.
Thanks for the review Mike!


You guys are quick!   I had on my TODO list to review this patch  
tonight.  :)


Re: graduation todo list

2007-01-22 Thread Erik Hatcher

Committed, thanks!

Erik

On Jan 22, 2007, at 7:11 PM, Ryan McKinley wrote:


Here is a *trivial* one:

the 'Documentation' link on src/webapp/resources/admin/index.jsp still
points to:
http://incubator.apache.org/solr/




[jira] Commented: (SOLR-80) negative filter queries

2007-01-22 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466611
 ] 

Yonik Seeley commented on SOLR-80:
--

You are seeing a MatchAllDocsQuery filter.

If getDocSet(List<Query>) is called with a single negative query, or 
getDocSet(Query, Filter) is called with a null filter and a negative query, 
we call getDocSet(MatchAllDocsQuery)
to use as a base to andNot the passed query.

If you continue your example with fq=+memory and fq=-memory, you will see what 
you expect (only one new filter).


 negative filter queries
 ---

 Key: SOLR-80
 URL: https://issues.apache.org/jira/browse/SOLR-80
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Attachments: negative_filters.patch, negative_filters.patch


 There is a need for negative filter queries to avoid long filter generation 
 times and large caching requirements.
 Currently, if someone wants to filter out a small number of documents, they 
 must specify the complete set of documents to express those negative 
 conditions against.  
 q=foo&fq=id:[* TO *] -id:101
 In this example, to filter out a single document, the complete set of 
 documents (minus one) is generated, and a large bitset is cached.  You could 
 also add the restriction to the main query, but that doesn't work with the 
 dismax handler which doesn't have a facility for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-80) negative filter queries

2007-01-22 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466612
 ] 

Mike Klaas commented on SOLR-80:


I think this is due to the last line of this fragment of the patch:

   protected DocSet getDocSet(List<Query> queries) throws IOException {
+if (queries==null) return null;
+if (queries.size()==1) return getDocSet(queries.get(0));
 DocSet answer=null;
-if (queries==null) return null;
-for (Query q : queries) {
-  if (answer==null) {
-answer = getDocSet(q);
+
+boolean[] neg = new boolean[queries.size()];
+DocSet[] sets = new DocSet[queries.size()];
+
+int smallestIndex = -1;
+int smallestCount = Integer.MAX_VALUE;
+for (int i=0; i<sets.length; i++) {
+  Query q = queries.get(i);
+  Query posQuery = QueryUtils.getAbs(q);
+  sets[i] = getPositiveDocSet(posQuery);

getPositiveDocSet() caches all docsets returned, so both the query part and the 
filter part would be cached in the filterCache.
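The strategy in the patch fragment above can be sketched end to end (BitSet is a hypothetical stand-in for DocSet): resolve every filter to the doc set of its positive form, start from the smallest positive set, then AND the remaining positives and AND-NOT the negatives.

```java
import java.util.*;

public class FilterIntersectSketch {
    // Start from the smallest positive set (or the match-all base set
    // if every filter is negative), then intersect the rest.
    static BitSet intersect(List<BitSet> sets, boolean[] neg, BitSet allDocs) {
        int smallest = -1, smallestCount = Integer.MAX_VALUE;
        for (int i = 0; i < sets.size(); i++) {
            if (!neg[i] && sets.get(i).cardinality() < smallestCount) {
                smallestCount = sets.get(i).cardinality();
                smallest = i;
            }
        }
        BitSet base = (smallest == -1) ? allDocs : sets.get(smallest);
        BitSet answer = (BitSet) base.clone();
        for (int i = 0; i < sets.size(); i++) {
            if (i == smallest) continue;
            if (neg[i]) answer.andNot(sets.get(i)); // negative filter
            else        answer.and(sets.get(i));    // positive filter
        }
        return answer;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        BitSet all = new BitSet(maxDoc); all.set(0, maxDoc);
        BitSet inStock = new BitSet(maxDoc); inStock.set(0, 5); // docs 0-4
        BitSet memory  = new BitSet(maxDoc); memory.set(3, 7);  // docs 3-6

        // fq=inStock:true plus fq=-memory  ->  docs 0,1,2
        BitSet r = intersect(Arrays.asList(inStock, memory),
                             new boolean[]{false, true}, all);
        System.out.println(r.cardinality()); // 3
    }
}
```

Starting from the smallest positive set keeps the working set small; the all-negative case is exactly where the MatchAllDocsQuery base set enters the cache.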

 negative filter queries
 ---

 Key: SOLR-80
 URL: https://issues.apache.org/jira/browse/SOLR-80
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Attachments: negative_filters.patch, negative_filters.patch


 There is a need for negative filter queries to avoid long filter generation 
 times and large caching requirements.
 Currently, if someone wants to filter out a small number of documents, they 
 must specify the complete set of documents to express those negative 
 conditions against.  
 q=foo&fq=id:[* TO *] -id:101
 In this example, to filter out a single document, the complete set of 
 documents (minus one) is generated, and a large bitset is cached.  You could 
 also add the restriction to the main query, but that doesn't work with the 
 dismax handler which doesn't have a facility for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-80) negative filter queries

2007-01-22 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466616
 ] 

Hoss Man commented on SOLR-80:
--

I was starting to think the same thing as Mike: but doing more testing i see 
what yonik's referring to (note to self: test more than one query when doing 
cache testing) ... only the first use of a negative query results in the double 
insert .. after that everything is golden.

Mike: i think the key is that unless faceting is turned on, the 
StandardRequestHandler only calls getDocList, not getDocListAndSet ... so by 
the time the call makes it to getDocListC the flags never contain GET_DOCSET, 
so the main query isn't included in the list passed to getDocSet.

 negative filter queries
 ---

 Key: SOLR-80
 URL: https://issues.apache.org/jira/browse/SOLR-80
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Attachments: negative_filters.patch, negative_filters.patch


 There is a need for negative filter queries to avoid long filter generation 
 times and large caching requirements.
 Currently, if someone wants to filter out a small number of documents, they 
 must specify the complete set of documents to express those negative 
 conditions against.  
 q=foo&fq=id:[* TO *] -id:101
 In this example, to filter out a single document, the complete set of 
 documents (minus one) is generated, and a large bitset is cached.  You could 
 also add the restriction to the main query, but that doesn't work with the 
 dismax handler which doesn't have a facility for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-80) negative filter queries

2007-01-22 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466617
 ] 

Mike Klaas commented on SOLR-80:


Surely Hoss' example doesn't use matchAllDocs--he has a positive query in both 
cases.

 negative filter queries
 ---

 Key: SOLR-80
 URL: https://issues.apache.org/jira/browse/SOLR-80
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Attachments: negative_filters.patch, negative_filters.patch


 There is a need for negative filter queries to avoid long filter generation 
 times and large caching requirements.
 Currently, if someone wants to filter out a small number of documents, they 
 must specify the complete set of documents to express those negative 
 conditions against.  
 q=foo&fq=id:[* TO *] -id:101
 In this example, to filter out a single document, the complete set of 
 documents (minus one) is generated, and a large bitset is cached.  You could 
 also add the restriction to the main query, but that doesn't work with the 
 dismax handler which doesn't have a facility for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (SOLR-80) negative filter queries

2007-01-22 Thread Chris Hostetter

: Surely Hoss' example doesn't use matchAllDocs--he has a positive query in 
both cases.

no actually i was testing out a positive filter and then the negative of
that filter and thought i was seeing cache inserts for both.

what i was really seeing was a cache insert of the positive and a cache
insert of the matchalldocs.



-Hoss



[jira] Commented: (SOLR-80) negative filter queries

2007-01-22 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-80?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466621
 ] 

Mike Klaas commented on SOLR-80:


Hoss: thanks for the explanation.  

I might throw this in our production code this week and see how it fares.


 negative filter queries
 ---

 Key: SOLR-80
 URL: https://issues.apache.org/jira/browse/SOLR-80
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Attachments: negative_filters.patch, negative_filters.patch


 There is a need for negative filter queries to avoid long filter generation 
 times and large caching requirements.
 Currently, if someone wants to filter out a small number of documents, they 
 must specify the complete set of documents to express those negative 
 conditions against.  
 q=foo&fq=id:[* TO *] -id:101
 In this example, to filter out a single document, the complete set of 
 documents (minus one) is generated, and a large bitset is cached.  You could 
 also add the restriction to the main query, but that doesn't work with the 
 dismax handler which doesn't have a facility for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: facet.missing?

2007-01-22 Thread Chris Hostetter
: So now that we have negative queries, we don't really need any
: additional/extra code for facet.missing.  It could simply be
: facet.query=-myfield:*, and that way it could be obtained without
: getting facet.field results if desired.

facet.missing can be used on a per field basis .. but i suspect a more
natural usage of it is to just use facet.missing=true when i always want
to show the user a count for results that don't match any value for each
of my facets

this...
q=ipod&facet=true&facet.missing=true&facet.field=inStock&facet.field=cat&facet.field=foo
is nicer than...
q=ipod&facet=true&facet.field=inStock&facet.field=cat&facet.field=foo&facet.query=-inStock:*&facet.query=-cat:*&facet.query=-foo:*

...particularly when you want to put <str name="facet.missing">true</str>
as a default in your solrconfig.

: Of course we would need to enable zero-length prefix queries in the
: SolrQueryParser for that, but I think we should do that anyway.

hmmm... is that really better than saying foo:[* TO *] ? ... i guess
syntactically it's nicer, but on the other hand making people spell out the
range query forces them to consciously choose to do it ... much the same
way the *:* syntax for MatchAllDocs works.

actually, that makes me realize: if you support zero width prefix
queries, then * is going to be parsed as a zero width prefix on whatever
the defaultSearchField is and return all results which have a value in
that field ... but that may confuse a lot of people who might assume it is
giving them all docs in the index (and since they are going to get results
instead of errors, they won't have any indication that they are wrong)

: So should we deprecate facet.missing, or is it only really used with
: facet.field queries, and often enough we would want it *in* that list?

well yeah, there's that too ... if you are parsing the facet counts,
dealing with the missing count in the list for each facet field is easier
than correlating back to a facet query -- which involves some annoying
string manipulation.  (where by easier and annoying i mean if i had to do
this in XSLT how painful would it be?)



-Hoss



Re: facet.missing?

2007-01-22 Thread Yonik Seeley

On 1/22/07, Chris Hostetter [EMAIL PROTECTED] wrote:

actually, that makes me realize: if you support zero width prefix
queries, then * is going to be parsed as a zero width prefix on whatever
the defaultSearchField is and return all results which have a value in
that field


Hmmm, right, if the QueryParser actually supports parsing that syntax.
I haven't tried it out.

It's just that it normally surprises people that they can do foo:a*
and not foo:*

-Yonik


[jira] Updated: (SOLR-117) constrain field faceting to a prefix

2007-01-22 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-117:
--

Attachment: facet_prefix.patch

Full patch w/ tests attached.

This version also implements facet.prefix for the FieldCache method.  It also 
lowers the memory used per-request for that method (because int[] count is 
smaller since we know the max number of terms beforehand that match the 
prefix).  A binary search is used to find the start and end terms for the 
prefix.
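The start/end lookup described above can be sketched with two binary searches over sorted terms (a hypothetical String[] in place of Lucene's term enumeration):

```java
import java.util.*;

public class PrefixRangeSketch {
    // Find the contiguous run of sorted terms matching a prefix.
    // Returns {start, end} such that terms in [start, end) match.
    static int[] prefixRange(String[] sorted, String prefix) {
        // Start: index of the first term >= prefix.
        int start = Arrays.binarySearch(sorted, prefix);
        if (start < 0) start = -start - 1;

        // End: index of the first term >= the upper bound, i.e. the
        // prefix with its last character incremented, which sorts
        // after every string carrying this prefix.
        String upper = prefix.substring(0, prefix.length() - 1)
                     + (char) (prefix.charAt(prefix.length() - 1) + 1);
        int end = Arrays.binarySearch(sorted, upper);
        if (end < 0) end = -end - 1;

        return new int[]{start, end};
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "applet", "apply", "banana", "band"};
        int[] r = prefixRange(terms, "app");
        System.out.println(r[0] + ".." + r[1]); // 0..3
    }
}
```

With the term range known up front, a counting array only needs `end - start` slots, which is the per-request memory saving the comment describes.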


 constrain field faceting to a prefix
 

 Key: SOLR-117
 URL: https://issues.apache.org/jira/browse/SOLR-117
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Attachments: facet_prefix.patch, facet_prefix.patch


 Useful for faceting as someone is typing, autocompletion, etc

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.