Re: [CODE4LIB] Getting started with SOLR
On Nov 22, 2007, at 3:41 PM, Kent Fitch wrote:

> On Nov 23, 2007 4:11 AM, Binkley, Peter [EMAIL PROTECTED] wrote:
> ... If you use boost on the date field the way you suggest, remember
> you'll have to reindex from scratch every year to adjust the boost as
> items age.
>
> Or maybe just use a method such that 2007 dates boost the document by
> 3.0, 2008 dates by 3.1, 2009 by 3.2 ... Whether this is feasible
> depends on how you expect other scoring boosts to interact with
> document-intrinsic boosts.

Don't do date boosting at index time; rather, tune things on the query end, using FunctionQuery and such. That'll give you maximum flexibility.

Erik
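For the archives, a query-time date boost along the lines Erik suggests might look like the following with the dismax handler. This is a sketch only - the field name `pub_date` and the constants are assumptions, not from the thread:

```
q=some search terms
&qt=dismax
&bf=recip(rord(pub_date),1,1000,1000)
```

Here `rord()` is the reverse ordinal of the field value (the newest date gets 1), and `recip(x,m,a,b)` computes `a/(m*x+b)`, so the newest documents get an additive boost near 1.0 that decays smoothly with age. Because the ordinal is computed at query time, no reindexing is needed as the years pass.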
Re: [CODE4LIB] Getting started with SOLR
On 23.11.2007 10:39 Erik Hatcher wrote:

> As far as I know there is no way to avoid POSTing the XML - no direct
> import of an XML file without HTTP (short of getting down and dirty
> and writing to the embedded Solr API, which is discouraged for many
> reasons).

Ok.

> You can also bring data into Solr using the CSV importer. I highly
> recommend folks take a good look at this route. It's clean, easy,
> fast: http://wiki.apache.org/solr/UpdateCSV

That sounds like what I need. The only problem I see: what about escapes? I don't know my data well enough to be sure that any possible delimiter will never occur within the data. Most exotic characters will probably be errors, but I still don't want SOLR to choke on them. Can I use escapes for the separator and/or encapsulator? If so, is it \ or "" (backslash or doubling)? I found nothing in the docs about it.

> For 700,000 records, one nice first step to try is to convert that
> data into CSV and feed it into Solr. Create a CSV file on the file
> system with all those records and use the CSV importer. I think
> you'll find that the absolute fastest way to bring data in.

It even looks like the direct way (almost) without HTTP, since the file is read directly from the file system and doesn't have to be squeezed through a socket connection.

To Peter: Thanks for the books. I think I will have something to do for some time now ;-)

To Ewout: Thanks for the script. I will have a look at it, but I think I will try the CSV route first.

Thanks for the help

-Michael
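On the escaping question: in standard CSV the encapsulator (quote character) is escaped by doubling it, and Solr's CSV loader also accepts `separator`, `encapsulator`, and `escape` request parameters (the UpdateCSV wiki page has the specifics). If you generate the file with a real CSV library, the quoting is handled for you. A minimal Python sketch - the field names are made up for illustration:

```python
import csv
import io

# Hypothetical records; field names are illustrative, not from the thread.
records = [
    {"id": "1", "title": 'A "quoted" title',
     "author": "Smith, James ; Miller, Steve"},
    {"id": "2", "title": "Plain title", "author": "Doe, Jane"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "title", "author"])
writer.writeheader()
writer.writerows(records)

# The csv module quotes any field containing the delimiter and escapes an
# embedded quote by doubling it, which is what CSV parsers expect:
print(buf.getvalue())

# The file could then be loaded with something like (URL assumed):
#   curl 'http://localhost:8983/solr/update/csv?stream.file=/path/books.csv&commit=true'
```

Writing to a real file instead of `io.StringIO` works the same way; the point is to let the library decide what needs quoting rather than guessing about your data.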
[CODE4LIB] Getting started with SOLR
Hello,

I am just getting my feet wet with SOLR and have a couple of questions about how others have done certain things.

I created a schema.xml where basically every field is of type text for the beginning. Do you use specialized types for authors or ISBNs or other fields?

How do you handle multi-value fields? Do you feed everything into a single field (like "Smith, James ; Miller, Steve", as I have seen in a pure Lucene implementation of a colleague), or do you use the multiValued feature of SOLR?

What about boosting? I thought of giving the current year a boost=3.0 and then 0.1 less for every year the title is older, down to 1.0 for a 21-year-old book. The idea is to have a sort that tends to promote recent titles but still respects other aspects. Does this sound reasonable, or are there other ideas? I would be very interested in an actual boosting scheme from which I could start.

We have a couple of databases that should eventually be indexed. Do you build one huge database with an additional database field, or is it better to have every database in its own SOLR instance?

How do you fill the index? Our main database has about 700,000 records, and I don't know if I should build one huge XML file and feed that into SOLR, or use a script that sends one record at a time with a commit after every 1000 records or so. Or do something in between and split it into chunks of a few thousand records each? What are your experiences? What if a record gives an error? Will the whole file be rejected, or just that one record?

Are there alternatives to the HTTP gateway? Are there any Perl scripts around that could help? I built a little script that uses LWP to feed my test records into the database. It works, but I don't have any error handling yet and the XML creation is very quick and dirty, so if there is something more mature I would like to use that.

Any other ideas, further reading, experiences...?
I know these are a lot of questions, but after the conference last year I think there is lots of expertise in this group, and perhaps I can avoid a few beginner mistakes with your help.

Thanks in advance

- Michael
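On the chunking question above, a middle road is easy to script: batch the records into `<add>` messages of a thousand or so and POST each batch, committing once at the end. A hedged Python sketch - the Solr URL and field names are assumptions, and this is not the LWP script from the thread:

```python
import urllib.request
from xml.sax.saxutils import escape

SOLR_UPDATE = "http://localhost:8983/solr/update"  # assumed default Solr URL


def to_add_xml(batch):
    """Render a batch of {field: value} dicts as one Solr <add> message."""
    docs = []
    for rec in batch:
        fields = "".join(
            '<field name="%s">%s</field>' % (name, escape(str(value)))
            for name, value in rec.items()
        )
        docs.append("<doc>%s</doc>" % fields)
    return "<add>%s</add>" % "".join(docs)


def post(body):
    """POST one XML update message to Solr."""
    req = urllib.request.Request(
        SOLR_UPDATE, body.encode("utf-8"),
        {"Content-Type": "text/xml; charset=utf-8"})
    urllib.request.urlopen(req).read()


def feed(records, batch_size=1000):
    """Send records in batches of batch_size, then commit once at the end."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            post(to_add_xml(batch))
            batch = []
    if batch:
        post(to_add_xml(batch))
    post("<commit/>")


# Usage (needs a running Solr): feed(my_records)
print(to_add_xml([{"id": "1", "title": "A & B"}]))
```

Committing once at the end, rather than every batch, keeps the indexing fast; a commit per batch works too if you want intermediate visibility at the cost of speed.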
Re: [CODE4LIB] Getting started with SOLR
Hi, Michael,

> I created a schema.xml where basically every field is of type text
> for the beginning. Do you use specialized types for authors or ISBNs
> or other fields?

I use a different field for every MARC field I want to search. Moreover, there is a UDC notation field which is split up into atomic notations, so one complex UDC notation becomes 3+ Solr fields.

> How do you handle multi-value fields? Do you feed everything into a
> single field (like "Smith, James ; Miller, Steve", as I have seen in
> a pure Lucene implementation of a colleague), or do you use the
> multiValued feature of SOLR?

I usually create different fields with the same name. I do it in Lucene as well. There is no problem with repeating fields (same name, different values, of course).

> What about boosting? I thought of giving the current year a boost=3.0
> and then 0.1 less for every year the title is older, down to 1.0 for
> a 21-year-old book. The idea is to have a sort that tends to promote
> recent titles but still respects other aspects. Does this sound
> reasonable, or are there other ideas?

That sounds reasonable.

> We have a couple of databases that should eventually be indexed. Do
> you build one huge database with an additional database field, or is
> it better to have every database in its own SOLR instance?

Our projects usually build one index from different sources - but it depends on the nature of your project. We built an application into which we converted 110+ CD-ROMs (originally in a Folio database) - this covers 2,200,000+ XHTML pages, and there are separate search forms for the different DBs. It is a Lucene project, not Solr.

> How do you fill the index? Our main database has about 700,000
> records, and I don't know if I should build one huge XML file and
> feed that into SOLR, or use a script that sends one record at a time
> with a commit after every 1000 records or so. Or do something in
> between and split it into chunks of a few thousand records each?
> What are your experiences? What if a record gives an error? Will the
> whole file be rejected, or just that one record?

There is a Java command-line tool, or you can look at VuFind's solution. If you can, I suggest you prefer a pure Java solution, writing directly to the Solr index (with the Solr API), because it is much, much quicker than the PHP (Rails, Perl) solutions, which are based on the web service (and so pay for the PHP parsing and the HTTP request round trip). The PHP solution does nothing with Solr directly; it uses the web service, and all that code could be rewritten in Perl.

> Any other ideas, further reading, experiences...?

See the source files of the existing solutions based on Solr - there are some, even in the library scene (PHP, Rails, Python). More info: http://del.icio.us/popular/solr

Peter Kiraly
http://www.tesuji.eu
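As a footnote to the multi-value discussion earlier in the thread: the repeated-fields approach Peter describes is what Solr's `multiValued` attribute declares in schema.xml. A sketch - the field names and types here are assumptions, not from anyone's actual schema:

```xml
<!-- schema.xml: fields that may appear more than once per document -->
<field name="author" type="text"   indexed="true" stored="true" multiValued="true"/>
<field name="isbn"   type="string" indexed="true" stored="true" multiValued="true"/>
```

With this in place, each author gets its own `<field name="author">...</field>` element in the add message, and Solr stores and searches them all, rather than requiring the values to be concatenated into one delimited string.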