Newspad using Solr

2007-10-06 Thread Jed Reynolds
PRWeb's Newspad.com search has been using a replicated Solr setup since 
June 11, 2007. In that time (I'm just checking the admin page on the 
query server) it has handled 3,000,000 requests across 350,000 documents. 
This hardly taxes the server; its load is about 0.20 with 20 rather 
sleepy Apache workers. Newspad is up to about 51,000 searches a day. 
It's ready for more :-)


Thank you so much for this software! This is good stuff, truly!

Jed


Re: Multiple Values - Structured?

2007-09-03 Thread Jed Reynolds

Bharani wrote:

Hi,

I have got two sets of documents:

1) Primary document
2) Occurrences of the primary document

Since there is no such thing as "join" I can either

a) Post the primary document with the occurrences as a multi-valued field
 or
b) Post the primary document once for every occurrence, i.e. the classic
de-normalized route

My problem with each:

Option a) This works great as long as the occurrence is a single field, but
if I have a group of fields that describe the occurrence then the search
returns wrong results because of the nature of text search,

i.e.

   date: 1 Jan 2007   type: review
   date: 2 Jan 2007   type: revision

If I search for the date 2 Jan 2007 together with the type review I will
get a hit (which is wrong) because there is no grouping of fields to
associate date and type as one unit. If I merge them into one entity then
I can't use range queries on the date.

Option b) This would result in a large number of documents, and even if I
index the fields without storing them I still have to deal with duplicate
hits, because all I want back is the primary document.


Is there a better approach to the problem?
  


Are you concerned about the size of your index?

One of the difficulties that you're going to find with multi-valued 
fields is that they are an unordered collection without relation. If you 
have a document with a list of editors and revisions, the two fields 
have no inherent correlation unless your application can extract it from 
the data itself.


<doc>
  <str name="id">123</str>
  <str name="name">hello world</str>
  <arr name="editor">
    <str>Fred</str>
    <str>Bob</str>
  </arr>
  <arr name="revisiondate">
    <date>2006-01-01T00:00:00Z</date>
    <date>2006-01-02T00:00:00Z</date>
  </arr>
</doc>

If your application can decipher that and do a slice on it showing a 
revision...then brilliant! But if the multi-valued fields come back out 
of order, that could make a significant difference.


I would create a document per revision and take advantage of range 
queries and sorting available at the query level.
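
For instance, one update document per revision might look like this (the
primary_id field and the id convention here are just for illustration):

<add>
  <doc>
    <field name="id">123-1</field>
    <field name="primary_id">123</field>
    <field name="name">hello world</field>
    <field name="editor">Fred</field>
    <field name="revisiondate">2006-01-01T00:00:00Z</field>
  </doc>
  <doc>
    <field name="id">123-2</field>
    <field name="primary_id">123</field>
    <field name="name">hello world</field>
    <field name="editor">Bob</field>
    <field name="revisiondate">2006-01-02T00:00:00Z</field>
  </doc>
</add>

Each revision can then be range-filtered and sorted on revisiondate
directly, and your application can collapse hits back to the primary
document by primary_id.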





Jed


Re: minimum occurrences of term in document

2007-08-30 Thread Jed Reynolds

Mike Klaas wrote:


On 30-Aug-07, at 4:01 PM, Chris Hostetter wrote:



You could accomplish the goal without any coding by using phrase 
queries: "calico calico calico"~1 will match only documents 
that have at least three occurrences of calico.  If this is 
performant enough, you are done. Otherwise, you'll have to do some 
custom coding.


I'll be searching article content so literals like "cat cat cat" are 
improbable.


i think you misunderstood Mike's point ... the query string...
 foo:"cat cat cat"~1

...will only match documents containing three instances of the term 
"cat" in the field "foo" where those instances are all within 1 
term position of each other ... the idea being that as long as the 
"slop" (number) used is bigger than the largest document you expect 
to deal with, this will essentially give you what you want.


Note too that by default solr only indexes the first 10k tokens, so 
this should work for all documents in the index.


-Mike




Whoa! When I first read the original suggestion, I was thinking ^1 
because I happened to be googling "solr filter by score" (another topic 
I learned is hardly worth pursuing).


Yeah, I'm going to try that right now
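
For anyone following along, the query I'm about to try looks roughly like
this (the field name and slop value are my own guesses; the slop just has
to exceed the largest document's token count):

   q=body:"calico calico calico"~10000

And if I understand Mike's note correctly, the 10k-token default cap
comes from the maxFieldLength setting in solrconfig.xml.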

Jed


Re: minimum occurrences of term in document

2007-08-30 Thread Jed Reynolds

Mike Klaas wrote:

On 30-Aug-07, at 1:22 PM, Jed Reynolds wrote:


Jed Reynolds wrote:


Apologies if this is in the Lucene FAQ, but I was looking thru the 
Lucene syntax and I just didn't see it.


Is there a way to search for documents that have a certain number of 
occurrences of a term in the document? Like, I want to find all 
documents that have the term Calico mentioned three or more times in 
the document?


Apologies for the ignorant question. I believe what I'm looking to do 
is filter results on term frequency.  I of course can get term 
frequency data from the debug output, but I'd rather not engage in 
application-level filtering by parsing the debug output.


It looks like there could be a few ways to pursue incorporating a term 
frequency modifier into a search. I'd think that results could be 
filtered through the fq step, if I could change the fq step to filter on 
term frequency. I presume a QueryHandler could be made to do that, too, 
or that a QueryParser and a Searcher could do the job.


Any suggestions about a reasonable way to go about this would be 
appreciated.


You could accomplish the goal without any coding by using phrase 
queries: "calico calico calico"~1 will match only documents that 
have at least three occurrences of calico.  If this is performant 
enough, you are done. Otherwise, you'll have to do some custom coding.


I'll be searching article content so literals like "cat cat cat" are 
improbable.



One way would be to create your own Query subclass (similar to 
TermQuery) that returned a score of zero for docs below a certain tf 
threshold.  This is probably the most efficient.  Rather than creating 
a custom queryparser, it probably would be easier to add an extra 
parameter to a custom request handler that parsed something like 
field:term:mincount into your custom query class and added it in 
the appropriate place (e.g. as a filter).


A Query subclass sounds the most efficient, and probably the most 
accurate way to control results.
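
For the record, here is roughly the shape of the filter idea against the
Lucene 2.x API (a sketch only: the class name is made up, and a full
Query subclass as Mike describes would also need its own Weight and
Scorer):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// Keeps only documents in which the given term occurs at least minTf times.
public class MinTfFilter extends Filter {
    private final Term term;
    private final int minTf;

    public MinTfFilter(Term term, int minTf) {
        this.term = term;
        this.minTf = minTf;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermDocs td = reader.termDocs(term);
        try {
            while (td.next()) {
                if (td.freq() >= minTf) {  // term frequency within this doc
                    bits.set(td.doc());
                }
            }
        } finally {
            td.close();
        }
        return bits;
    }
}

Handing new MinTfFilter(new Term("body", "calico"), 3) to
Searcher.search(query, filter) would then restrict results to documents
passing the threshold.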


Thanks for the suggestions!


Jed


Re: minimum occurrences of term in document

2007-08-30 Thread Jed Reynolds

Jed Reynolds wrote:


Apologies if this is in the Lucene FAQ, but I was looking thru the 
Lucene syntax and I just didn't see it.


Is there a way to search for documents that have a certain number of 
occurrences of a term in the document? Like, I want to find all 
documents that have the term Calico mentioned three or more times in 
the document?


Apologies for the ignorant question. I believe what I'm looking to do is 
filter results on term frequency.  I of course can get term frequency 
data from the debug output, but I'd rather not engage in 
application-level filtering by parsing the debug output.


It looks like there could be a few ways to pursue incorporating a term 
frequency modifier into a search. I'd think that results could be 
filtered through the fq step, if I could change the fq step to filter on 
term frequency. I presume a QueryHandler could be made to do that, too, 
or that a QueryParser and a Searcher could do the job.


Any suggestions about a reasonable way to go about this would be 
appreciated.


Thanks!

Jed


minimum occurrences of term in document

2007-08-30 Thread Jed Reynolds


Apologies if this is in the Lucene FAQ, but I was looking thru the 
Lucene syntax and I just didn't see it.


Is there a way to search for documents that have a certain number of 
occurrences of a term in the document? Like, I want to find all 
documents that have the term Calico mentioned three or more times in the 
document?


Thanks

Jed


Re: Replication script file issues..

2007-07-19 Thread Jed Reynolds

Matthew Runo wrote:

It seems that as soon as I get a commit, snapshooter goes wild.

I have 1107 running instances of snapshooter right now..



I suspect you've got pathing and/or permissions issues.

First try running snapshooter -v, and it will be louder. I've often had 
to dig in deeper, tho.


I'd kill them all off. Edit the snapshooter script and add "set -x" to 
line two of the script and run it by hand. Make sure to run it by hand 
as the user (which might be tomcat, I don't know your setup) that would 
be running it from cron.
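
Something along these lines, with the path and user adjusted for your
setup:

   sudo -u tomcat bash -x /opt/solr/bin/snapshooter -v

The -x trace will show you exactly which command hangs or fails.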


It might be that you have a disk performance issue, or too much data to 
transfer in 5 minutes or whatever your cron period is set to. If you've 
got multiple snapshooters hogging the master rsync at once, you'll very 
likely run into some blockage.





success! Newspad lives anew!

2007-07-18 Thread Jed Reynolds
I'd like to thank everyone that created and helped bring us Solr. 
Newspad is working awesomely.


http://www.newspad.com/

And sorting in 1.2.0 is going to be such a bonus!

Thanks!

Jed


Re: Restrict Servlet Access

2007-03-14 Thread Jed Reynolds

Gunther, Andrew wrote:


What are people doing to restrict UpdateServlet access on production
installs of Solr?  Are people removing that option and rotating in a new
index, or restricting access from the Jetty side?
 



I'm putting Solr on my DMZ without direct WAN access. If I had to put it 
on a WAN-facing server, I'd hide it behind Apache, proxying to it with 
mod_rewrite's [P] directive. By having mod_rewrite ignore the /foo/update 
URI, there's no external access to updates at all.
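
A rough sketch of the rules I mean (hostname, port and context path are
placeholders, and mod_proxy has to be loaded for [P] to work):

   RewriteEngine On
   # refuse the update URI outright
   RewriteRule ^/solr/update - [F]
   # proxy everything else through to the internal Solr instance
   RewriteRule ^/solr/(.*)$ http://solr-internal:8080/solr/$1 [P]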


Jed


Re: Federated Search

2007-03-10 Thread Jed Reynolds

   Venkatesh Seetharam wrote:

The hash idea sounds really interesting, and if I had a fixed number of
indexes it would be perfect. I'm in fact looking around for a reverse-hash
algorithm where, given a docId, I can find which partition contains the
document, so I can save cycles on broadcasting to the slaves.


Many large databases partition their data either by load or in some other 
logical manner, like by alphabet. I hear that Hotmail, for instance, 
partitions its users alphabetically. Having a broker will certainly 
abstract this mechanism, and of course your application(s) will want to 
be able to bypass the broker when necessary.



I mean, even if you use a DB, how have you solved the problem of
distribution when a new server is added into the mix?


http://www8.org/w8-papers/2a-webserver/caching/paper2.html

I saw this link on the memcached list, and the thread surrounding it 
certainly covered some similar ground. Ideas discussed included:

- high availability of memcached, with redundant entries
- scaling out clusters and facing the need to rebuild the entire cache 
on all nodes, depending on your bucketing

I see some similarities between maintaining multiple indices/Lucene 
partitions and running a memcached deployment: mostly, if you are hashing 
your keys to partitions (or buckets, or machines), then you might be faced 
with a) availability issues if there's a machine/partition outage, and b) 
rebuilding partitions if adding a partition/bucket changes the hash mapping.
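
To make the rebuild problem concrete, here is the naive scheme in
miniature (names are mine, and this sketch is the problem, not the fix):

public class Buckets {
    // Naive modulo bucketing: changing numPartitions remaps almost
    // every docId, which is the wholesale-rebuild problem above.
    static int partitionFor(String docId, int numPartitions) {
        return (docId.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("doc-42", 4));
        System.out.println(partitionFor("doc-42", 5)); // usually different
    }
}

Consistent hashing, as in the paper above, is one way out: adding a node
only remaps the keys nearest to it.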


I can think of two ways to scale out to new indexes. The first would be 
to have your application maintain two sets of bucket mappings from ids to 
indexes; the second would be to key your documents and partition them by 
date. The former would let you rebuild a second set of repartitioned 
indexes and buckets, then switch your application to the new bucket 
mapping once all the indexes have been rebuilt. The latter would only 
apply if you could organize your document ids by date and only added new 
documents at the 'now' end, or evenly across most dates. You'd add a new 
partition onto the end as time progressed, and rarely rebuild old indexes 
unless your documents grow unevenly.


Interesting topic! I don't yet need to run multiple Lucene partitions, 
but I have a few memcached servers, and I expect that increasing their 
number will force my site to take a performance hit while I rebuild the 
caches. Similarly, if I had multiple Lucene partitions and had to fission 
some of them, rebuilding the resulting partitions would be time-intensive, 
and I'd want procedures in place for availability, scaling out, and 
changing application code as necessary. Having one fail-over Solr index 
is just so easy in comparison.


Jed


Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-03 Thread Jed Reynolds

Chris Hostetter wrote:

: I almost didn't notice the exception fly by because there's so much
: log output, and I can see why I might not have noticed. Yay for
  



you should be able to configure it to put WARNING and SEVERE messages in a
separate log file even.
  


Certainly! I learned to reconfigure tomcat's logging when I was doing my 
Nutch deployment. I'm very likely going to reconfigure my logging.



i've been thinking a Servlet that didn't depend on any special Solr code
(so it will work even if SolrCore isn't initialized) but registers a log
handler and records the last N messages from Solr above a certain level
would be handy to refer people to when they are having issues and aren't
overly comfortable with log files.
  


Yeah, like a ring buffer for the last N warning/severe messages.
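
In java.util.logging terms, something quick like this might do (a sketch
only; the class name and default level are made up):

import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;

// Keeps the last N WARNING-and-above records for an admin page to render.
public class RingBufferHandler extends Handler {
    private final LogRecord[] buffer;
    private int next = 0;

    public RingBufferHandler(int size) {
        buffer = new LogRecord[size];
        setLevel(Level.WARNING);
    }

    public synchronized void publish(LogRecord record) {
        if (!isLoggable(record)) {
            return;
        }
        buffer[next] = record;
        next = (next + 1) % buffer.length;
    }

    public void flush() {}

    public void close() {}
}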

I'm pretty used to looking at Apache log files.  Some errors pointing 
out configuration or operational failures (like running out of file 
descriptors) on the admin and status pages would be helpful, because I 
think some people are going to check those pages first, possibly because 
they're deving and not necessarily watching logs. I'd still use Solr even 
if it didn't have a logging servlet, tho ;-)


Jed


Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-03 Thread Jed Reynolds

Yonik Seeley wrote:

On 3/2/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

How do you all feel about returning an error when you add a document
with unknown fields?


+1

dynamicField definitions can be used if desired (including "*" to
match every undefined field).


If dynamicField definitions are removed from the schema.xml file (and 
your fields are not referencing them), does this have the same effect of 
disabling unknown-field generation?
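
For concreteness, the catch-all definition I mean looks something like
this in schema.xml (the type here is whatever your schema defines):

   <dynamicField name="*" type="string" indexed="true" stored="true"/>

My assumption is that with no dynamicField entries left, an add with an
unknown field has nothing to match against and could then be rejected.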


Jed


Re: JVM random crashes

2007-03-03 Thread Jed Reynolds

Yonik Seeley wrote:

On 3/3/07, Dimitar Ouzounov <[EMAIL PROTECTED]> wrote:

But what hardware problem could it be? Tomorrow I'll make sure that the
memory is fine, but nothing
else comes to my mind.


Memory, motherboard, etc.
Try http://www.memtest86.com/ to test this.


It may be OS-related - probably a buggy version of
some library. But which library?


Yep, we've seen that in the past.
I'd recommend going with OS versions that vendors test with.
The commercial RHEL or the free clone of it http://www.centos.org/,
would be my recommendation.



I'm running a lot of CentOS 4.4 myself, on i686 and x86_64 processors. 
I'm testing out Solr on an i686 with JDK 1.5, and I'm running a 
production copy of Nutch on x86_64 with JDK 1.5 and Tomcat 5.5. It's 
been rock solid.


From trying to install Java on FC5 in the past, I read a lot about how 
you had to be rather careful to make absolutely certain that you had no 
conflicting gcj libs in your path. If this is a production box, I'd go 
with a longer-supported OS than FC6. If the server is only for searching 
and Apache, I don't think FC6 will give you any noticeable performance 
boost over CentOS 4.4; FC6's glibc hash-binding performance enhancements 
won't affect a JVM.



Jed


Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-03 Thread Jed Reynolds

Bertrand Delacretaz wrote:

On 3/3/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

...The rationale with the solrconfig stuff is that a broken config should
behave as best it can.  This is great if you are running a real site
with people actively using it - it is a pain in the ass if you are
getting started and don't notice errors


I think it's a PITA in any case, I like my systems to fail loudly when
something's wrong in the configs (with details about what's happening,
of course).

-Bertrand

I think it's interesting seeing the difference. The system at CNET 
obviously needed to fail gracefully before it needed to fail fast. I have 
the luxury of a dev environment, and fail-fast is exactly the kinda thing 
I want, so that I learn about as many limitations and problems as early 
as possible.


Having this behavior be a toggle would be ideal. Version the 
solrconfig.xml between fail-graceful for your production branch and 
fail-fast for your dev branch.
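
A sketch of what the toggle could look like in solrconfig.xml (the
element name here is my own invention):

   <!-- dev branch: true (fail fast); production branch: false -->
   <abortOnConfigurationError>true</abortOnConfigurationError>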


Jed


Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-02 Thread Jed Reynolds

Ryan McKinley wrote:

On 3/2/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 3/2/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> The rationale with the solrconfig stuff is that a broken config should
> behave as best it can.

I don't think that's what I was actually going for in this instance
(the schema).
I was focused on getting correct stuff to work correctly, and worry
about incorrect stuff later :-)



sorry, I was referring to solrconfig.xml... if something goes wrong
loading handlers it continues but prints out some log messages.  I
(think) there are code comments somewhere about how it should be ok to
have an error and still keep a working system...  I'd like to be able
to configure a "strict" mode so it does not continue.



> The other one that can confuse you is if you add documents with fields
> that are undefined - rather than getting an error, solr adds the
> fields that are defined (it may print out an exception somewhere, but
> i've never noticed it)

Also unintended.



How do you all feel about returning an error when you add a document
with unknown fields?


That sounds like a good option to specify in solrconfig.xml.



I spent a long time tracking down an error where a document set used an
uppercase field name against a schema configured with the lowercase field.



Isn't this the kind of error that XML validation is supposed to address? 
I completely understand the appeal of loosely validating XML documents, 
of course. However, since adding a document to an index is not a 
lightweight operation, adding validation doesn't seem unreasonable. If 
writing a schema is required for validation, I'm willing to endure that 
step. I can certainly see many instances when components in my system 
written by other staff won't fit into my Solr schema. A way to enforce a 
schema, strictly, in a dev environment, is entirely appropriate for me.



Jed


Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-02 Thread Jed Reynolds

Ryan McKinley wrote:


I almost didn't notice the exception fly by because there's so much
log output, and I can see why I might not have noticed. Yay for
scrollback! (Hrm, I might not have wanted to watch logging for 4
instances of solr all at once. Might explain why so much logging.)


This has bitten me more than once too!

The rationale with the solrconfig stuff is that a broken config should
behave as best it can.  This is great if you are running a real site
with people actively using it - it is a pain in the ass if you are
getting started and don't notice errors.

I'd like to see a "strict" configuration parameter.  If something
fails on startup, nothing would work until it was fixed.  If there is
any interest, I can put this together.


That would be helpful.


The other one that can confuse you is if you add documents with fields
that are undefined - rather than getting an error, solr adds the
fields that are defined (it may print out an exception somewhere, but
i've never noticed it)



I've read about this capability but I haven't experienced its effects yet.



Another helpful modification would be returning 500 error codes in the
header. ...


The 'new' RequestHandler framework (apache-solr-1.2-dev) returns a
proper response code (400,500,etc).  It is not (yet) the default
handler for /select, but I hope it gets to be soon.


Bitchen! Looking forward to that.


However, I've got a lot more learning and testing to do. Don't rush 
anything on account of me.


Jed


Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-02 Thread Jed Reynolds

Yonik Seeley wrote:


If the actual schema was null, then that was probably some problem
parsing the schema.
If that's the case, hopefully you saw an exception in the logs on 
startup?



Using apache-solr-1.1.0-incubating.


Actually not at first, but now I do. I've gone back and re-created the 
(or a similar) error, and the problem turned out to be the way I was 
watching my logs. When I first started, I was just doing a tail -F on 
catalina.out, but the exception was going to the logfile 
localhost.2007-03-01.log. Ah, tomcat, my best old buddy old pal. I've 
learned to just do a "tail -F *". I've obviously grown desensitized by 
other java projects throwing exceptions to logs, and by so much logging 
duplication between catalina.out and the tomcat contextual logs.


I almost didn't notice the exception fly by because there's so much 
log output, and I can see why I might not have noticed. Yay for 
scrollback! (Hrm, I might not have wanted to watch logging for 4 
instances of solr all at once. Might explain why so much logging.)


Another helpful modification would be returning 500 error codes in the 
header. This would help a script detect errors without needing to grep 
or DOM-process the result element. The output of my PHP script that 
loads documents was showing me the snippet below. Possibly making the 
error code configurable might help too (I can see cases where forcing a 
200 response is useful).




Array
(
   [errno] => 0
   [errstr] =>
   [response] => HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: text/xml;charset=UTF-8
Content-Length: 1329
Date: Sat, 03 Mar 2007 02:04:12 GMT
Connection: close

java.lang.NullPointerException
   at org.apache.solr.core.SolrCore.update(SolrCore.java:763)
   at 
org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:53)

--snip--

)






Anyway, I agree that some config errors could be handled in a more
user-friendly manner, and it would be nice if config failures could
make it to the front-page admin screen or something.



That would be groovy!

I was able to see instances where a field was not defined. Now that I'm 
looking at all the log files, I'm seeing the error I should have seen 
earlier.


Thanks guys!

Jed

PS Last night I was able to index about 180,000 documents in about 2.5 
hours. The resulting index is a bit over 800M. Compared to my 
self-crawling with Nutch, this is 1/4 the time to index and 1/30th the 
disk space used by the indices. I am really impressed. I threw four 
concurrent scripts, making 50,000 distinct (select distinct tag from 
taglist;) requests, at this Solr instance; it served 50 requests per 
second per script with a server load average of about 3.2. That's 200 
requests per second against a 4-core box. The Tomcat instance was taking 
606M of RAM, resident.





merely a suggestion: schema.xml validator or better schema validation logging

2007-03-01 Thread Jed Reynolds

First time user. Not interested in flamewar, just making a suggestion.

I just got Solr working with my own schema, and it was only a little more 
mysterious than I expected, having previously dealt with Nutch. Solr is 
exactly what I wanted in terms of (theoretical) ease of configurability.


However, my first try at defining a schema.xml file was tough, because my 
only feedback for a long time was "NullPointerException" from SolrCore 
when I was trying to add content. I deduce that when SolrCore tried 
invoking methods on the schema instance, the schema instance was null.


From a design point of view, this could easily be modeled with the 
NullObject pattern: an InvalidSchema object could be substituted as the 
default schema object, and method invocations on it would appropriately 
log why the proper schema failed to validate and instantiate.


I'd think that since the capacity to define a schema via XML is so 
attractively powerful, providing feedback on bad schemata would really 
speed deployment and adoption.  It turned out that I had misspelled the 
unique key field reference. Silly, but that can't be uncommon for a 
first time user.
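
In case it saves the next person some time, the relevant bit of
schema.xml is below (field name illustrative); the uniqueKey value has
to match a declared field name exactly, including case:

   <field name="id" type="string" indexed="true" stored="true"/>

   <uniqueKey>id</uniqueKey>

A typo in either place was enough, in my case, to leave the schema null.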


If there is already a method of pre-validating the schema, noting it on 
the wiki would be really helpful.


So far, that has been my only hangup. This has been so much easier and 
more appropriate than Nutch that I've been gung-ho all week setting this 
up. Thank you!



Jed