Re: [CODE4LIB] Getting started with SOLR

2007-11-23 Thread Erik Hatcher

On Nov 22, 2007, at 3:41 PM, Kent Fitch wrote:

On Nov 23, 2007 4:11 AM, Binkley, Peter [EMAIL PROTECTED]
wrote:


...

If you use a boost on the date field the way you suggest, remember
you'll
have to reindex from scratch every year to adjust the boost as items
age.



Or maybe just use a method such that 2007 dates boost the document
by 3.0,
2008 dates by 3.1, 2009 by 3.2, and so on. Whether this is feasible
depends on how you expect other scoring boosts to interact with
document-intrinsic boosts.


Don't do date boosting at index time, but rather tune things on the
query end, using FunctionQuery and such.   That'll give you maximum
flexibility.
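
For illustration, a query-time date boost might look roughly like the
sketch below (a sketch only: the dismax handler, the "pubdate" field,
and the recip() parameters are assumptions, not a recipe from this
thread):

  use URI::Escape qw(uri_escape);

  # Sketch: boost recent documents at query time with a dismax boost
  # function; recip(rord(pubdate),1,1000,1000) decays as pubdate ages,
  # so no yearly reindexing is needed.
  my $bf  = uri_escape('recip(rord(pubdate),1,1000,1000)');
  my $url = "http://localhost:8983/solr/select?qt=dismax&q=fish&bf=$bf";
  print "$url\n";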

   Erik


Re: [CODE4LIB] Getting started with SOLR

2007-11-23 Thread Michael Lackhoff
On 23.11.2007 10:39 Erik Hatcher wrote:

 As far as I know there is no way to avoid POSTing the XML - no
 direct import of an XML file without HTTP (short of getting down and
 dirty and writing to the embedded Solr API, which is a bit
 discouraged for many reasons).

Ok.

 You can also bring data into Solr using the CSV importer.  I highly
 recommend folks take a good look at this route.  It's clean, easy, fast:
 http://wiki.apache.org/solr/UpdateCSV

That sounds like what I need. The only problem I see: what about
escapes? I don't know my data well enough to be sure that any possible
delimiter will never occur within the data. Most exotic characters will
probably be errors, but I still don't want SOLR to choke on them.
Can I use escapes for the separator and/or encapsulator? If so, is it \
or "" (backslash or doubling)? I found nothing in the docs about it.

 For 700,000 records, a nice first step to try is to convert that
 data into CSV and feed it into Solr.  Create a CSV file on the file
 system with all those records and use the CSV importer.  I think
 you'll find that's the absolute fastest way to bring data in.   But

It even looks like the (almost) direct way without HTTP, since the file
is read directly from the file system and doesn't have to be squeezed
through a socket connection.
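
Concretely, that local-file route might look like the sketch below
(path and parameter values are illustrative; stream.file only works
when remote streaming is enabled in solrconfig.xml):

  use LWP::Simple qw(get);

  # Sketch: ask Solr to read a CSV file straight from its own disk
  # instead of POSTing it over the socket. separator/encapsulator
  # are the CSV handler's delimiter and quoting knobs.
  my $url = 'http://localhost:8983/solr/update/csv'
          . '?stream.file=/data/books.csv'
          . '&separator=%2C&encapsulator=%22&commit=true';
  print get($url), "\n";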

To Peter: Thanks for the books. I think I will have something to do for
some time now ;-)
To Ewout: Thanks for the script. I will have a look at it, but I think I
will try the CSV route first.

Thanks for the help
-Michael


[CODE4LIB] Getting started with SOLR

2007-11-22 Thread Michael Lackhoff
Hello,

I am just getting my feet wet with SOLR and have a couple of questions
about how others have done certain things.

I created a schema.xml where basically every field is of type text to
begin with. Do you use specialized types for authors or ISBNs or
other fields?
How do you handle multi-valued fields? Do you feed everything into a
single field (like "Smith, James ; Miller, Steve", as I have seen in a
colleague's pure Lucene implementation) or do you use the multiValued
feature of SOLR?

What about boosting? I thought of giving the current year a boost=3.0
and then 0.1 less for every year the title is older, down to 1.0 for a
21-year-old book. The idea is to have a sort that tends to promote
recent titles but still respects other aspects. Does this sound
reasonable, or are there other ideas? I would be very interested in an
actual boosting scheme from which I could start.
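
As arithmetic, the proposed scheme would be simply (a hypothetical
helper, not code from this thread):

  # Proposed index-time boost: 3.0 for the current year, 0.1 less
  # per year of age, floored at 1.0.
  sub year_boost {
      my ($pub_year, $current_year) = @_;
      my $boost = 3.0 - 0.1 * ($current_year - $pub_year);
      return $boost < 1.0 ? 1.0 : $boost;
  }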

We have a couple of databases that should eventually be indexed. Do you
build one huge index with an additional field identifying the source
database, or is it better to have every database in its own SOLR
instance?

How do you fill the index? Our main database has about 700,000 records
and I don't know if I should build one huge XML file and feed that into
SOLR or use a script that sends one record at a time with a commit after
every 1000 records or so. Or do something in between and split it into
chunks of a few thousand records each? What are your experiences? What
if a record gives an error? Will the whole file be rejected or just
that one record?
Are there alternatives to the HTTP gateway?
Are there any Perl scripts around that could help? I built a little
script that uses LWP to feed my test records into the database. It
works, but I don't have any error handling yet and the XML creation is
very quick and dirty, so if there is something more mature I would like
to use that.
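
For what it's worth, a minimal LWP feeder of this kind might look like
the sketch below (the Solr URL, the tab-separated input format, and the
batch size are assumptions, not the script described above):

  use strict;
  use warnings;
  use LWP::UserAgent;

  my $solr = 'http://localhost:8983/solr/update';
  my $ua   = LWP::UserAgent->new;

  sub post_xml {
      my ($xml) = @_;
      my $resp = $ua->post($solr, Content_Type => 'text/xml',
                           Content => $xml);
      die 'Solr error: ' . $resp->status_line unless $resp->is_success;
  }

  sub xml_escape {
      my ($s) = @_;
      $s =~ s/&/&amp;/g; $s =~ s/</&lt;/g; $s =~ s/>/&gt;/g;
      return $s;
  }

  # Read tab-separated "id<TAB>title" records from STDIN, send them
  # in batches of 1000, and commit after each batch.
  my @batch;
  while (my $line = <STDIN>) {
      chomp $line;
      my ($id, $title) = split /\t/, $line, 2;
      $title = '' unless defined $title;  # tolerate title-less records
      push @batch, '<doc><field name="id">' . xml_escape($id)
                 . '</field><field name="title">' . xml_escape($title)
                 . '</field></doc>';
      next if @batch < 1000;
      post_xml('<add>' . join('', @batch) . '</add>');
      post_xml('<commit/>');
      @batch = ();
  }
  post_xml('<add>' . join('', @batch) . '</add>') if @batch;
  post_xml('<commit/>');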

Any other ideas, further reading, experiences...?

I know these are a lot of questions, but after the conference last year
I think there is lots of expertise in this group, and perhaps I can
avoid a few beginner mistakes with your help.

thanks in advance
- Michael


Re: [CODE4LIB] Getting started with SOLR

2007-11-22 Thread Peter Kiraly

Hi, Michael,


I created a schema.xml where basically every field is of type text to
begin with. Do you use specialized types for authors or ISBNs or
other fields?

I use a different field for every MARC field I want to search;
moreover, there is a UDC notation field which is split up into its
atomic notations, so one complex UDC number will become three or more
Solr fields.


How do you handle multi-valued fields? Do you feed everything into a
single field (like "Smith, James ; Miller, Steve", as I have seen in a
colleague's pure Lucene implementation) or do you use the multiValued
feature of SOLR?

I usually create multiple fields with the same name.
I do the same in Lucene as well. There is no problem with
repeating fields (same name, different values, of course).
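
In Solr's update XML (assuming the field is declared
multiValued="true" in schema.xml) such a repeated field would simply
look like this, with illustrative field names:

  <add>
    <doc>
      <field name="id">12345</field>
      <field name="author">Smith, James</field>
      <field name="author">Miller, Steve</field>
    </doc>
  </add>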


What about boosting? I thought of giving the current year a boost=3.0
and then 0.1 less for every year the title is older, down to 1.0 for a
21-year-old book. The idea is to have a sort that tends to promote
recent titles but still respects other aspects. Does this sound
reasonable, or are there other ideas? I would be very interested in an
actual boosting scheme from which I could start.

That sounds reasonable.


We have a couple of databases that should eventually be indexed. Do you
build one huge index with an additional field identifying the source
database, or is it better to have every database in its own SOLR
instance?

Our projects usually build one index from different
sources - but it depends on the nature of your project.
We built an application into which we converted 110+
CD-ROMs (originally in a Folio database) - this covers
2,200,000+ XHTML pages, and there are search forms
for the different DBs. It is a Lucene project, not Solr.


How do you fill the index? Our main database has about 700,000 records
and I don't know if I should build one huge XML file and feed that into
SOLR or use a script that sends one record at a time with a commit after
every 1000 records or so. Or do something in between and split it into
chunks of a few thousand records each? What are your experiences? What
if a record gives an error? Will the whole file be rejected or just
that one record?

There is a Java command-line tool, or you can look at VuFind's
solution. If you can, I suggest preferring a pure Java solution
that writes directly to the Solr index (with the Solr API), because
it is much, much quicker than the PHP (or Rails, or Perl) solutions,
which are based on the web service (and so pay for the PHP parsing
and the HTTP request overhead). The PHP solution does nothing with
Solr directly; it uses the web service, and all of that code could
be rewritten in Perl.


Any other ideas, further reading, experiences...?

See the source files of the solutions based on Solr; there
are some, even in the library world (PHP, Rails, Python).
More info:
http://del.icio.us/popular/solr


Peter Kiraly
http://www.tesuji.eu