Newspad using solr
PRWeb's Newspad.com search has been using a replicated Solr setup since June 11, 2007. In that time (I'm just checking the admin page on the query server) we've seen 3,000,000 requests since June across 350,000 documents. This hardly taxes the server: its load is about 0.20 with 20 rather sleepy Apache workers. Newspad is up to about 51,000 searches a day. It's ready for more :-) Thank you so much for this software! This is good stuff, truly! Jed
Re: Multiple Values -Structured?
Bharani wrote: Hi, I have two sets of documents: 1) the primary document, and 2) occurrences of the primary document. Since there is no such thing as "join", I can either a) post the primary document with the occurrences as a multi-valued field, or b) post the primary document once for every occurrence, i.e. the classic denormalized route.

My problem with option a): this works great as long as the occurrence is a single field, but if I have a group of fields that describe the occurrence, then the search returns wrong results because of the nature of text search. E.g. with "1 Jan 2007 review" and "2 Jan 2007 revision", if I search for 2 Jan 2007 and 1 Jan 2007 I will get a hit (which is wrong) because there is no grouping of fields to associate date and type as one unit. If I merge them as one entity, then I can't use range queries for the date.

Option b): this would result in a large number of documents, and even if I index only and don't store, I still have to deal with duplicate hits, because all I want is the primary document. Is there a better approach to the problem?

Are you concerned about the size of your index? One of the difficulties that you're going to find with multi-valued fields is that they are an unordered collection without relation. If you have a document with a list of editors and revisions, the two fields have no inherent correlation unless your application can extract it from the data itself.

  [doc]
    [id]123[/id]
    [str name=name]hello world[/str]
    [array name=editor]
      [str name=editor]Fred[/str]
      [str name=editor]Bob[/str]
    [/array]
    [array name=revisiondate]
      [date name=revisiondate]2006-01-01T00:00:00Z[/date]
      [date name=revisiondate]2006-01-02T00:00:00Z[/date]
    [/array]
  [/doc]

If your application can decipher that and do a slice on it showing a revision, then brilliant! But if the multi-value fields are out of order, that might make a significant difference. I would create a document per revision and take advantage of the range queries and sorting available at the query level. Jed
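To make the denormalized route (option b) concrete, here is a minimal Python sketch that turns one primary record with parallel editor/revision lists into one document per revision. The field names and the id scheme are illustrative assumptions, not Solr requirements.

```python
def denormalize(primary):
    """Turn one primary record with parallel editor/revision lists
    into one document per revision, so the fields stay correlated."""
    docs = []
    pairs = zip(primary["editors"], primary["revisiondates"])
    for i, (editor, date) in enumerate(pairs):
        docs.append({
            "id": "%s-rev%d" % (primary["id"], i),  # unique key per revision doc
            "primary_id": primary["id"],            # lets the app group hits back
            "name": primary["name"],
            "editor": editor,
            "revisiondate": date,                   # now usable in range queries
        })
    return docs

record = {
    "id": "123",
    "name": "hello world",
    "editors": ["Fred", "Bob"],
    "revisiondates": ["2006-01-01T00:00:00Z", "2006-01-02T00:00:00Z"],
}
docs = denormalize(record)
```

Each emitted document keeps its editor and revision date together, at the cost of more documents and a `primary_id` the application must use to de-duplicate hits.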
Re: minimum occurances of term in document
Mike Klaas wrote: On 30-Aug-07, at 4:01 PM, Chris Hostetter wrote: You could accomplish the goal without any coding by using phrase queries: "calico calico calico"~1 will match only documents that have at least three occurrences of calico. If this is performant enough, you are done. Otherwise, you'll have to do some custom coding. I'll be searching article content so literals like "cat cat cat" are improbable.

I think you misunderstood Mike's point ... the query string foo:"cat cat cat"~1 will only match documents containing three instances of the term "cat" in the field "foo" where those instances are all within 1 term position of each other ... the idea being that as long as the "slop" (number) used is bigger than the largest document you expect to deal with, this will essentially give you what you want. Note too that by default Solr only indexes the first 10k tokens, so this should work for all documents in the index. -Mike

Whoa! When I first read the original suggestion, I was thinking ^1 because I happened to be googling "solr filter by score" (another topic I learned is hardly worth pursuing). Yeah, I'm going to try that right now. Jed
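The slop trick is easy to generate programmatically. A small Python sketch; the 10,000 default slop is an assumption chosen to exceed the 10k-token indexing limit mentioned in the thread:

```python
def min_occurrence_query(field, term, n, slop=10000):
    """Build a Lucene/Solr phrase query that matches documents whose
    `field` contains at least n occurrences of `term`, by repeating
    the term n times with a slop bigger than any document."""
    phrase = " ".join([term] * n)
    return '%s:"%s"~%d' % (field, phrase, slop)

q = min_occurrence_query("foo", "cat", 3)
# q == 'foo:"cat cat cat"~10000'
```

Sent as the `q` parameter of a normal select request, this needs no custom Query subclass at all.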
Re: minimum occurances of term in document
Mike Klaas wrote: On 30-Aug-07, at 1:22 PM, Jed Reynolds wrote: Jed Reynolds wrote: Apologies if this is in the Lucene FAQ, but I was looking thru the Lucene syntax and I just didn't see it. Is there a way to search for documents that have a certain number of occurrences of a term in the document? Like, I want to find all documents that have the term Calico mentioned three or more times in the document?

Apologies for the ignorant question. I believe what I'm looking to do is filter results on term frequency. I of course can get term frequency data from the debug output, but I'd rather not engage in application-level filtering by parsing the debug output. It looks like there could be a few ways to pursue incorporating a term frequency modifier into a search. I'd think that results could be filtered thru the fq step, if I could change the fq step to filter on term freq. I presume a QueryHandler could be made to do that, too. I presume that a QueryParser and a Searcher could do the job. Any suggestions about a reasonable way to go about this would be appreciated.

You could accomplish the goal without any coding by using phrase queries: "calico calico calico"~1 will match only documents that have at least three occurrences of calico. If this is performant enough, you are done. Otherwise, you'll have to do some custom coding.

I'll be searching article content so literals like "cat cat cat" are improbable.

One way would be to create your own Query subclass (similar to TermQuery) that returned a score of zero for docs below a certain tf threshold. This is probably the most efficient. Rather than creating a custom query parser, it probably would be easier to add an extra parameter to a custom request handler that parsed (::) into your custom query class and added it in the appropriate place (eg. as a filter).

A Query subclass sounds the most efficient, and probably allows the most accurate way to control results. Thanks for the suggestions! Jed
Re: minimum occurances of term in document
Jed Reynolds wrote: Apologies if this is in the Lucene FAQ, but I was looking thru the Lucene syntax and I just didn't see it. Is there a way to search for documents that have a certain number of occurrences of a term in the document? Like, I want to find all documents that have the term Calico mentioned three or more times in the document?

Apologies for the ignorant question. I believe what I'm looking to do is filter results on term frequency. I of course can get term frequency data from the debug output, but I'd rather not engage in application-level filtering by parsing the debug output. It looks like there could be a few ways to pursue incorporating a term frequency modifier into a search. I'd think that results could be filtered thru the fq step, if I could change the fq step to filter on term freq. I presume a QueryHandler could be made to do that, too. I presume that a QueryParser and a Searcher could do the job. Any suggestions about a reasonable way to go about this would be appreciated. Thanks! Jed
minimum occurances of term in document
Apologies if this is in the Lucene FAQ, but I was looking thru the Lucene syntax and I just didn't see it. Is there a way to search for documents that have a certain number of occurrences of a term in the document? Like, I want to find all documents that have the term Calico mentioned three or more times in the document? Thanks Jed
Re: Replication script file issues..
Matthew Runo wrote: It seems that as soon as I get a commit, snapshooter goes wild. I have 1107 running instances of snapshooter right now.

I suspect you've got pathing and/or permissions issues. First try running snapshooter -v, and it will be louder. I've often had to dig in deeper, tho. I'd kill them all off. Edit the snapshooter script, add "set -x" to line two of the script, and run it by hand. Make sure to run it by hand as the user (which might be tomcat, I don't know your setup) that would be running it from cron. It might be that you have a disk performance issue, or too much data to transfer in 5 minutes or whatever your cron period is set to. If you've got multiple snapshooters hogging the master rsync at once, you'll very likely run into some blockage.
success! Newspad lives anew!
I'd like to thank everyone that created and helped bring us Solr. Newspad is working awesomely. http://www.newspad.com/ And sorting in 1.2.0 is going to be such a bonus! Thanks! Jed
Re: Restrict Servlet Access
Gunther, Andrew wrote: What are people doing to restrict UpdateServlet access on production installs of Solr? Are people removing that option and rotating in a new index, or restricting access from the Jetty side?

I'm putting Solr on my DMZ without direct WAN access. If I had to put it on a WAN-facing server, I'd hide it behind Apache and access it using mod_rewrite with the [P] proxy directive. With mod_rewrite, by simply never proxying the /foo/update URI, you leave no external access to it. Jed
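As a sketch of that setup (the hostname, port, and paths here are hypothetical examples, not from the original post), the Apache side might look like:

```apache
RewriteEngine On
# Proxy only the search URI through to the internal Solr box
RewriteRule ^/solr/select(.*)$ http://solr-internal:8080/solr/select$1 [P,L]
# Anything else under /solr (including /solr/update) is refused
RewriteRule ^/solr/ - [F]
```

Since /solr/update never matches a proxy rule, outside clients have no way to post documents, while searches pass through untouched.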
Re: Federated Search
Venkatesh Seetharam wrote: The hash idea sounds really interesting and if I had a fixed number of indexes it would be perfect. I'm in fact looking around for a reverse-hash algorithm where, given a docId, I should be able to find which partition contains the document, so I can save cycles on broadcasting to slaves.

Many large databases partition their data either by load or in another logical manner, like by alphabet. I hear that Hotmail, for instance, partitions its users alphabetically. Having a broker will certainly abstract this mechanism, and of course your application(s) want to be able to bypass a broker when necessary. I mean, even if you use a DB, how have you solved the problem of distribution when a new server is added into the mix?

http://www8.org/w8-papers/2a-webserver/caching/paper2.html

I saw this link on the memcached list, and the thread surrounding it certainly covered some similar ground. Some of the ideas discussed:
- high availability of memcached, redundant entries
- scaling out clusters and facing the need to rebuild the entire cache on all nodes, depending on your bucketing.

I see some similarities between maintaining multiple indices/Lucene partitions and running a memcached deployment: mostly, if you are hashing your keys to partitions (or buckets, or machines) then you might be faced with a) availability issues if there's a machine/partition outage, and b) rebuilding partitions if adding a partition/bucket changes the hash mapping.

The ways I can think of to scale out new indexes: the first would be to have your application maintain two sets of bucket mappings from ids to indexes, and the second would be to key your documents and partition them by date. The former method would allow you to rebuild a second set of repartitioned indexes and buckets, and then update your application to use the new bucket mapping (once all the indexes have been rebuilt).
The latter method would only apply if you could organize your document ids by date and only added new documents to the 'now' end, or evenly across most dates. You'd have to add a new partition onto the end as time progressed, and rarely rebuild old indexes unless your documents grow unevenly.

Interesting topic! I don't yet need to run multiple Lucene partitions, but I have a few memcached servers, and I expect that increasing their number will force my site to take a performance hit as I am forced to rebuild the caches. Similarly, I can see that if I had multiple Lucene partitions and had to fission some of them, rebuilding the resulting partitions would be time intensive, and I'd want procedures in place for availability, scaling out, and changing application code as necessary. Just having one fail-over Solr index is so easy in comparison. Jed
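The rebuild-on-rehash pain described above is easy to demonstrate. A minimal Python sketch of naive modulo bucketing (the document ids and bucket counts are made up for illustration):

```python
import hashlib

def bucket_for(doc_id, n_buckets):
    """Map a document id to an index partition with a stable hash."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % n_buckets

ids = ["doc%d" % i for i in range(1000)]
before = {d: bucket_for(d, 4) for d in ids}   # 4 partitions
after = {d: bucket_for(d, 5) for d in ids}    # add a 5th partition
moved = sum(1 for d in ids if before[d] != after[d])
# With plain modulo hashing, roughly 4 out of every 5 documents land in a
# different partition after adding one bucket, so most indexes get rebuilt.
```

This is exactly why schemes like consistent hashing (as in the WWW8 paper linked above) were developed: they bound how many keys move when a node is added.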
Re: merely a suggestion: schema.xml validator or better schema validation logging
Chris Hostetter wrote: : I almost didn't notice the exception fly by because there's so much : log output, and I can see why I might not have noticed. Yay for scrollback! You should be able to configure it to put WARNING and SEVERE messages in a separate log file, even.

Certainly! I learned to reconfigure tomcat's logging when I was doing my Nutch deployment. I'm very likely going to reconfigure my logging.

I've been thinking a Servlet that didn't depend on any special Solr code (so it will work even if SolrCore isn't initialized) but registers a log handler and records the last N messages from Solr above a certain level would be handy to refer people to when they are having issues and aren't overly comfortable with log files.

Yeah, like a ring buffer for the last N warning|severe messages. I'm pretty used to looking at apache log files. Some errors pointing out configuration or operational failure (like running out of file descriptors) on the admin and status pages would be helpful, because I think some people will probably check those pages first, possibly because they're deving and not necessarily watching logs. I'd still use Solr even if it didn't have a logging servlet, tho ;-) Jed
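The ring-buffer idea can be sketched in a few lines with a standard logging framework. Solr itself is Java; this Python sketch just illustrates the shape of such a handler, with made-up logger names:

```python
import logging
from collections import deque

class RingBufferHandler(logging.Handler):
    """Keep only the last N WARNING-or-worse messages in memory,
    e.g. for display on an admin/status page."""
    def __init__(self, capacity=100):
        super().__init__(level=logging.WARNING)
        self.records = deque(maxlen=capacity)  # old entries fall off the front

    def emit(self, record):
        self.records.append(self.format(record))

log = logging.getLogger("solr-demo")
log.propagate = False
handler = RingBufferHandler(capacity=3)
log.addHandler(handler)

log.warning("one")
log.warning("two")
log.warning("three")
log.error("four")
# handler.records now holds only the last 3 messages: two, three, four
```

An admin page can then dump `handler.records` without anyone needing to hunt through log files.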
Re: merely a suggestion: schema.xml validator or better schema validation logging
Yonik Seeley wrote: On 3/2/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: How do you all feel about returning an error when you add a document with unknown fields? +1 dynamicField definitions can be used if desired (including "*" to match every undefined field).

If dynamicField definitions are removed from the schema.xml file (and your fields are not referencing them), does this have the same effect as disabling unknown-field generation? Jed
Re: JVM random crashes
Yonik Seeley wrote: On 3/3/07, 	Dimitar Ouzounov <[EMAIL PROTECTED]> wrote: But what hardware problem could it be? Tomorrow I'll make sure that the memory is fine, but nothing else comes to my mind. Memory, motherboard, etc. Try http://www.memtest86.com/ to test this. It may be OS-related - probably a buggy version of some library. But which library? Yep, we've seen that in the past.

I'd recommend going with OS versions that vendors test with. The commercial RHEL, or the free clone of it, http://www.centos.org/, would be my recommendation. I'm running a lot of CentOS 4.4 myself, on i686 and x86_64 processors. I'm testing out Solr on an i686 with JDK 1.5, and I'm running a production copy of Nutch on x86_64 with JDK 1.5 and Tomcat 1.5. It's been rock solid. From trying to install Java in the past on FC5, I read a lot about how you had to be rather careful to make absolutely certain that you had no conflicting gcj libs in your path. If this is a production box, I'd go with a longer-supported OS than FC6. If the server is only for searching and Apache, I don't think FC6 will give you any noticeable performance boost over CentOS 4.4. FC6's performance enhancements with glibc-hash-binding won't affect a JVM. Jed
Re: merely a suggestion: schema.xml validator or better schema validation logging
Bertrand Delacretaz wrote: On 3/3/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: ...The rationale with the solrconfig stuff is that a broken config should behave as best it can. This is great if you are running a real site with people actively using it - it is a pain in the ass if you are getting started and don't notice errors. I think it's a PITA in any case; I like my systems to fail loudly when something's wrong in the configs (with details about what's happening, of course). -Bertrand

I think it's interesting seeing the difference. The system at CNET obviously needed to fail gracefully before it needed to fail fast. I have the luxury of a dev environment, and fail-fast is exactly the kind of thing I want, so I know about as many limitations and problems as soon as possible. Having this behavior toggled would be ideal: version the solrconfig.xml between fail-graceful for your production branch and fail-fast for your dev branch. Jed
Re: merely a suggestion: schema.xml validator or better schema validation logging
Ryan McKinley wrote: On 3/2/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 3/2/07, Ryan McKinley <[EMAIL PROTECTED]> wrote: > The rationale with the solrconfig stuff is that a broken config should > behave as best it can. I don't think that's what I was actually going for in this instance (the schema). I was focused on getting correct stuff to work correctly, and worrying about incorrect stuff later :-)

Sorry, I was referring to solrconfig.xml... if something goes wrong loading handlers, it continues but prints out some log messages. I (think) there are code comments somewhere about how it should be OK to have an error and still keep a working system... I'd like to be able to configure a "strict" mode so it does not continue.

> The other one that can confuse you is if you add documents with fields > that are undefined - rather than getting an error, solr adds the > fields that are defined (it may print out an exception somewhere, but > i've never noticed it) Also unintended. How do you all feel about returning an error when you add a document with unknown fields?

That sounds like a good option to specify in solrconfig.xml. I spent a long time tracking down an error with a document set using an uppercase field name against a schema configured with a lowercase field. Isn't this the kind of error that XML validation is supposed to address? I completely understand the appeal of loosely validating XML documents, of course. However, since adding a document to an index is not a lightweight operation, adding validation doesn't seem unreasonable. If writing a schema is required for validation, I'm willing to endure that step. I can certainly see many instances where components in my system written by other staff won't fit into my Solr schema. A way to enforce a schema, strictly, in a dev environment, is entirely appropriate for me. Jed
Re: merely a suggestion: schema.xml validator or better schema validation logging
Ryan McKinley wrote: I almost didn't notice the exception fly by because there's so much log output, and I can see why I might not have noticed. Yay for scrollback! (Hrm, I might not have wanted to watch logging for 4 instances of solr all at once. Might explain why so much logging.) This has bitten me more than once too! The rationale with the solrconfig stuff is that a broken config should behave as best it can. This is great if you are running a real site with people actively using it - it is a pain in the ass if you are getting started and don't notice errors. I'd like to see a "strict" configuration parameter. If something fails on startup, nothing would work until it was fixed. If there is any interest, I can put this together.

That would be helpful.

The other one that can confuse you is if you add documents with fields that are undefined - rather than getting an error, solr adds the fields that are defined (it may print out an exception somewhere, but i've never noticed it).

I've read about this capability but I haven't experienced its effects yet.

Another helpful modification would be returning 500 error codes in the header. ... The 'new' RequestHandler framework (apache-solr-1.2-dev) returns a proper response code (400, 500, etc). It is not (yet) the default handler for /select, but I hope it gets to be soon.

Bitchen! Looking forward to that. However, I've got a lot more learning and testing to do. Don't rush anything on account of me. Jed
Re: merely a suggestion: schema.xml validator or better schema validation logging
Yonik Seeley wrote: If the actual schema was null, then that was probably some problem parsing the schema. If that's the case, hopefully you saw an exception in the logs on startup?

Using apache-solr-1.1.0-incubating. Actually, not at first, but now I do. I've gone back and re-created the (or a similar) error, and the problem happened to be the way I was watching my logs. When I first started, I was just doing a tail -F on catalina.out, but the exception was being thrown to the logfile localhost.2007-03-01.log. Ah, tomcat, my best old buddy old pal. I've learned to just do a "tail -F *". I've obviously grown desensitized by other java projects throwing exceptions to logs, and by so much logging duplication between catalina.out and the tomcat contextual logs. I almost didn't notice the exception fly by because there's so much log output, and I can see why I might not have noticed. Yay for scrollback! (Hrm, I might not have wanted to watch logging for 4 instances of solr all at once. Might explain why so much logging.)

Another helpful modification would be returning 500 error codes in the header. This would help a script detect error codes without needing to grep or DOM-process the result element. The output of my PHP script to load documents was showing me the snippet below. Possibly making the error code configurable might help (I can see cases where forcing a 200 response is useful).

Array (
    [errno] => 0
    [errstr] =>
    [response] => HTTP/1.1 200 OK
      Server: Apache-Coyote/1.1
      Content-Type: text/xml;charset=UTF-8
      Content-Length: 1329
      Date: Sat, 03 Mar 2007 02:04:12 GMT
      Connection: close

      java.lang.NullPointerException
        at org.apache.solr.core.SolrCore.update(SolrCore.java:763)
        at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:53)
      --snip--
)

Anyway, I agree that some config errors could be handled in a more user-friendly manner, and it would be nice if config failures could make it to the front-page admin screen or something.
That would be groovy! I was able to see instances where a field was not defined. Now that I'm looking at all the log files, I'm seeing the error I should have seen earlier. Thanks, guys! Jed

PS Last night I was able to index about 180,000 documents in about 2.5 hours. The resulting index is a bit over 800M. Compared to my self-crawling with Nutch, this is 1/4 the time to index and 1/30th the disk space used by indices. I am really impressed. I threw four concurrent scripts making 50,000 distinct (select distinct tag from taglist;) requests at this Solr instance; each script was served 50 requests per second, and the Solr server's load average was about 3.2. That's 200 requests per second against a 4-core box. The Tomcat instance was taking 606M RAM, resident.
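Since the 1.1-era update servlet can return HTTP 200 with an exception in the body (as in the PHP snippet above), a client has to sniff the body as well as the status line. A hedged Python sketch of such a check; the substrings it looks for are assumptions based on the response shown in this thread, not a documented contract:

```python
def update_ok(status_code, body):
    """Heuristic success test for an old-style Solr update response:
    the status code alone is not trustworthy, so also scan the body
    for signs of a server-side exception."""
    if status_code != 200:
        return False
    lowered = body.lower()
    # A stack trace in the body means the add failed even with "200 OK"
    return "exception" not in lowered and "severe" not in lowered
```

With the 1.2 request handlers returning real 400/500 codes, this kind of body-sniffing becomes unnecessary.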
merely a suggestion: schema.xml validator or better schema validation logging
First time user. Not interested in a flamewar, just making a suggestion. I just got Solr working with my own schema, and it was only a little more mysterious than I expected, having previously dealt with Nutch. Solr is exactly what I wanted in terms of (theoretical) ease of configurability.

However, my first try at defining a schema.xml file was tough, because my only feedback for a long time was "NullPointerException" from SolrCore when I was trying to add content. I deduce that what was happening was that when SolrCore tried invoking methods on the schema instance, the schema instance was null. From a design point of view, this could easily be modeled with the NullObject pattern: an InvalidSchema object could be substituted as a default schema object, and method invocations on that schema would appropriately log why the proper schema failed to validate and instantiate. I'd think that since the capacity to define a schema via XML is so attractively powerful, providing feedback on bad schemata would really speed deployment and adoption.

It turned out that I had misspelled the unique key field reference. Silly, but it can't be uncommon for a first-time user. If there is already a method of pre-validating the schema, noting it on the wiki would be really helpful. So far, that has been my only hangup. This has been so much easier and more appropriate than Nutch that I've been gung-ho all week setting this up. Thank you! Jed