Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Kevin S. Clarke

On 11/27/06, Ross Singer [EMAIL PROTECTED] wrote:

On 11/27/06, Kevin S. Clarke [EMAIL PROTECTED] wrote:

Seriously, please don't get hung up on the 'proprietary'-ness of
Lucene's query syntax.  It's open, it's widely used, and has been
ported to a handful of languages.  I mean, why would you trade off
something that works well /now/ and will most likely only get better
for something that you admit sort of sucks?


It's not that fulltext for XQuery sucks... it just doesn't exist
(right now people do it through extensions to the language).  I would
expect that the spec that gets written will not be that far from
Lucene's syntax.  You are talking about the syntax that goes into the
search box right?  I don't expect an XQuery fulltext spec will change
that -- it is just how you pass that along to Lucene that will be
different (e.g., do you do it in Java, in Ruby, in XML via Solr, do
you do it in XQuery, etc.)
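
For concreteness, here is a rough Java sketch (not from the thread; the
field name is invented, and it assumes a Lucene 2.x-era classpath plus a
Solr instance at the stock example URL) of the same user-typed syntax
reaching Lucene by two different routes:

import java.net.URL;
import java.net.URLEncoder;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SameSyntaxTwoRoutes {
    public static void main(String[] args) throws Exception {
        String userQuery = "title:lucene AND subject:\"full text\"";

        // Route 1: embedded Lucene -- parse the string with the QueryParser.
        QueryParser parser = new QueryParser("text", new StandardAnalyzer());
        Query q = parser.parse(userQuery);
        System.out.println("Parsed Lucene query: " + q);

        // Route 2: Solr over HTTP -- the same string goes into the 'q'
        // parameter and Solr's query parser handles it on the server side.
        String solrUrl = "http://localhost:8983/solr/select?q="
                + URLEncoder.encode(userQuery, "UTF-8");
        System.out.println("Equivalent Solr request: " + new URL(solrUrl));
    }
}

Either way, what the user types into the search box stays the same.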


And I agree with Erik's assessment that it's better to keep your
repository and index separated for exactly the sort of scenario you
worry about.  If a super-duper new indexer comes along, you can always
just switch to it, then.


How do you switch to it?  How do the pieces talk?  This is the point
of standards.  If there is a standard way of addressing an index then
you don't have to care what the newest greatest indexer is.  This
paragraph seems in contrast to your one above.

Kevin


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Andrew Nagy

Erik Hatcher wrote:


"What if" games are mostly just guessing games in the high-tech
world.  Agility is the trait our projects need.  Software is just
that... soft.  And malleable.  Sure, we can code ourselves into a
corner, but generally we can code ourselves right back out of it
too.  If software is built with decent separation of concerns, we can
adapt to changes readily.


I completely agree, but you can't deny it's a valid concern.  I am
always thinking about the future and making sure my software is modular
and flexible so any part can easily be replaced.  So I would hope it's
as easy as writing a new driver for whatever new system you want to
switch to.

Anyway, you have all convinced me to give Solr a whirl ... I'm
downloading it right now.

Andrew


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Andrew Nagy

Art Rhyno wrote:


I made a big mistake along the way in trying to work with Voyager's call
number setup in Oracle, and dragged Ross along in an attempt to get past
Oracle's constant quibbles with rogue characters in call number ranges.
The idea was to expose the library catalogue as a series of folders using
said call number ranges. This part works well enough when the characters
are dealt with, but breaks down a bit for certain formats. For example,
the University of Windsor lumps most of its microfiche holdings in one
call number with an accession number, and Georgia Tech does something
similar with maps. This can mean individual webdav folders with many
thousands of entries, and some less than elegant workarounds.



So you are replacing SQL calls with WebDAV?  Can you explain this a bit
further?

Andrew


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Kevin S. Clarke

On 11/28/06, Kevin S. Clarke [EMAIL PROTECTED] wrote:

it is just how you pass that along to Lucene that will be
different (e.g., do you do it in Java, in Ruby, in XML via Solr, do
you do it in XQuery, etc.)


By the way, I see a very interesting intersection between Solr and
XQuery because both speak XML.  You may have XQueries that generate
the XML that makes Solr do its magic, for instance.  This is an
alternative to fulltext in XQuery, sure... but it is something that is
here today (which doesn't mean I'll stop thinking about tomorrow, though).
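
As a rough illustration (the URL, field names and record are invented;
imagine the XML string being the output of an XQuery run against the
repository), the Java side of that hand-off could be as small as a POST
to Solr's update handler:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PostToSolr {
    public static void main(String[] args) throws Exception {
        // Pretend this string is the result of an XQuery.
        String addDoc =
            "<add><doc>" +
            "<field name=\"id\">rec-0001</field>" +
            "<field name=\"title\">Sample record</field>" +
            "</doc></add>";

        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:8983/solr/update").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(addDoc.getBytes("UTF-8"));
        out.close();
        System.out.println("Solr responded: " + conn.getResponseCode());
    }
}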

Kevin


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Clay Redding

I'm sure most of you have seen this, but there is a lot of good work
going on regarding XQuery full text searching by the W3C.  LC is pushing
a lot of the activity in this group, and using hefty document-centric
EAD examples in the testing.

http://www.w3.org/TR/xquery-full-text/

FWIW, traditionally I've been a fan of using an indexing tool that is
independent of my storage.  But the indexing (a subset of Lucene) that
is embedded in the NXDB (X-Hive) in use at Princeton, and expressed in
XQuery, is good.  It changed my opinion a bit about having the layers
separated, and I now think that XQuery Full Text has a chance.  We only
had to switch to the full, independent Lucene to implement features
such as weighting that the NXDB didn't include off the shelf.

Regardless, though, having a standards-based syntax for querying is a
good thing.  Or, to put it another way, at least it doesn't hurt.
Those who don't wish to interact with an index through the standard,
because of the overhead, don't have to.  But for some, it will fit the
bill, letting them drop in new backends and simply plug into the
standard syntax.

Clay

Kevin S. Clarke wrote:

On 11/27/06, Ross Singer [EMAIL PROTECTED] wrote:

On 11/27/06, Kevin S. Clarke [EMAIL PROTECTED] wrote:

Seriously, please don't get hung up on the 'proprietary'-ness of
Lucene's query syntax.  It's open, it's widely used, and has been
ported to a handful of languages.  I mean, why would you trade off
something that works well /now/ and will most likely only get better
for something that you admit sort of sucks?


It's not that fulltext for XQuery sucks... it just doesn't exist
(right now people do it through extensions to the language).  I would
expect that the spec that gets written will not be that far from
Lucene's syntax.  You are talking about the syntax that goes into the
search box right?  I don't expect an XQuery fulltext spec will change
that -- it is just how you pass that along to Lucene that will be
different (e.g., do you do it in Java, in Ruby, in XML via Solr, do
you do it in XQuery, etc.)


And I agree with Erik's assessment that it's better to keep your
repository and index separated for exactly the sort of scenario you
worry about.  If a super-duper new indexer comes along, you can always
just switch to it, then.


How do you switch to it?  How do the pieces talk?  This is the point
of standards.  If there is a standard way of addressing an index then
you don't have to care what the newest greatest indexer is.  This
paragraph seems in contrast to your one above.

Kevin


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Andrew Nagy

Kevin S. Clarke wrote:


By the way, I see a very interesting intersection between Solr and
XQuery because both speak XML.  You may have XQueries that generate
the XML that makes Solr do its magic, for instance.  This is an
alternative to fulltext in XQuery, sure... but it is something that is
here today (which doesn't mean I'll stop thinking about tomorrow, though).


There is a good intersection, but if you look at the roadmap for eXist
(a native XML database), it has many of the features that Solr offers
(I'm still in the process of setting up Solr, so I'm not too in-depth
with the features yet).  eXist is basically an attempt at this
intersection.  Too bad it's just too damn slow and still in its
infancy.

Andrew


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Art Rhyno
So you are replacing SQL calls with WebDAV?  Can you explain this a bit
further?

Hi,

No, WebDAV is, among other things, an XML representation of a folder
structure, and we were using SQL to help build the XML needed for WebDAV
support, not replacing one with the other. Voyager stores normalized call
numbers in a table, and  SQL was used to pull out records before
transforming the results to the XML layout required. In Windows, WebDAV is
accessed as a web folder, and the result was to expose the library
catalogue as a series of nested folders in call number order. My big
interest was to make the catalogue an extension of the desktop, and open
up the possibility of using desktop indexers for catalogue content. There
is more information in a submission Ross and I did for the Talis mashup
competition:

http://librarycog.uwindsor.ca/indexcat

We were able to use caching to minimize the overhead of the SQL queries,
but a better method would be to work directly with MARC files since there
wouldn't be a Voyager dependency, and the approach would be open to any
system that can export the catalogue.
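
For anyone who hasn't looked at WebDAV at that level, here is a
simplified sketch (not Art's actual code; the call-number ranges and
paths are made up) of the kind of DAV: multistatus document that makes
a range show up as a folder of narrower ranges:

public class CallNumberFolders {
    static String folder(String href) {
        return "  <D:response>\n"
             + "    <D:href>" + href + "</D:href>\n"
             + "    <D:propstat>\n"
             + "      <D:prop><D:resourcetype><D:collection/></D:resourcetype></D:prop>\n"
             + "      <D:status>HTTP/1.1 200 OK</D:status>\n"
             + "    </D:propstat>\n"
             + "  </D:response>\n";
    }

    public static void main(String[] args) {
        StringBuilder multistatus = new StringBuilder();
        multistatus.append("<D:multistatus xmlns:D=\"DAV:\">\n");
        // Each call-number range built from the SQL results becomes one "folder".
        for (String range : new String[] {"/catalogue/QA/", "/catalogue/QA76/", "/catalogue/QA76.9/"}) {
            multistatus.append(folder(range));
        }
        multistatus.append("</D:multistatus>");
        System.out.println(multistatus.toString());
    }
}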

art


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Ross Singer

On 11/28/06, Kevin S. Clarke [EMAIL PROTECTED] wrote:

How do you switch to it?  How do the pieces talk?  This is the point
of standards.  If there is a standard way of addressing an index then
you don't have to care what the newest greatest indexer is.  This
paragraph seems in contrast to your one above.


Well, what's the guarantee that the next great indexer isn't going to
be using /some other standard/ than the one you're using?

My only point is, it's a whole lot easier to refactor your application
to benefit from a different indexing engine than it is to export all
of your data out of one system and potentially remodel it to work in
another.

I suppose it all breaks down to how much work you're willing to invest
to keep up with the Joneses (after all, you could just stay with
Lucene), but I don't really see the "XQuery is a standard" argument.
Just because it's a standard (vs. a semi-ubiquitous API) doesn't mean
it will have the best tools for a particular problem area.

-Ross.


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Gabriel Farrell
On Tue, Nov 28, 2006 at 10:27:22AM -0500, Ross Singer wrote:
 On 11/28/06, Kevin S. Clarke [EMAIL PROTECTED] wrote:
 How do you switch to it?  How do the pieces talk?  This is the point
 of standards.  If there is a standard way of addressing an index then
 you don't have to care what the newest greatest indexer is.  This
 paragraph seems in contrast to your one above.
 
 Well, what's the guarantee that the next great indexer isn't going to
 be using /some other standard/ than the one you're using?

 My only point is, it's a whole lot easier to refactor your application
 to benefit from a different indexing engine than it is to export all
 of your data out of one system and potentially remodel it to work in
 another.

 I suppose it all breaks down to how much work you're willing to invest
 to keep up with the Joneses (after all, you could just stay with
 Lucene), but I don't really see the "XQuery is a standard" argument.
 Just because it's a standard (vs. a semi-ubiquitous API) doesn't mean
 it will have the best tools for a particular problem area.

 -Ross.


Can't we stay with Lucene *and* keep up with the Joneses?  What's been
referred to in this conversation as Lucene's "Standard Query Language"
is just the syntax used by Lucene's default Query Parser, and, as noted
in the overview[1], "Although Lucene provides the ability to create
your own queries through its API, it also provides a rich query
language through the Query Parser, a lexer which interprets a string
into a Lucene Query using JavaCC."

It's nice that Lucene ships with a Query Parser, but it is by no means
the only way to parse queries for Lucene.  A Google search on lucene
xquery parser (no quotes) brings up Nux and Jackrabbit.  I don't know
much about either project, but they seem to be working already on the
future we're talking about.

Gabe

[1] http://lucene.apache.org/java/docs/queryparsersyntax.html#Overview


[CODE4LIB] Opening for Technology Project Manager

2006-11-28 Thread Stephen Hedges
The Ohio Public Library Information Network has an opening for a
Library Technology Project Manager.  Please see the posting at
http://statejobs.ohio.gov/applicant/results2.asp?postingID=167148

--
Stephen Hedges, Executive Director
Ohio Public Library Information Network (OPLIN)
2323 W. 5th Avenue, Suite 130
Columbus, Ohio  43204
614-728-5250


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Andrew Nagy

Casey Durfee wrote:



I thought that was the point of using interfaces?  I guess I don't get why you 
need a standard to be compelled to do something you should be doing anyway -- 
coding to interfaces, not implementations.



Interfaces work well with like products (a database abstraction library
is a great example); however, interfaces don't lend themselves well to
products that achieve a similar goal but work differently altogether.
Relational databases all work the same way: there are databases, each
database has tables, views, procedures, etc., and each table has
columns, and so on.  Less mature systems such as XML storage systems,
however, are hard to map in a similar fashion.  I ran into this exact
problem: I developed a system around eXist, with an interface for the
data layer and a driver for interacting with eXist.  I then wanted to
compare other databases such as Berkeley DB XML.  I quickly found that
they achieve a common goal but do not implement the same concepts,
making them very hard to compare.  eXist has "collections" to group
your XML into distinct groupings and DB XML does not.  My interface
had a method called getCollections, but since DB XML has nothing like
this, I could not use that method.  So how would you develop an
interface that covers various XML databases as well as full-text index
systems such as Lucene?  I would imagine this would be very
challenging.
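
To make the problem concrete, a hypothetical Java sketch (all names
invented) of what such an interface tends to end up looking like once
collection handling becomes an optional capability rather than a
required method:

import java.util.Collections;
import java.util.List;

interface XmlIndex {
    void store(String id, String xml);          // both eXist and DB XML can do this
    List<String> query(String expression);      // XPath/XQuery-ish query, returns matches
}

interface SupportsCollections {                 // optional, eXist-style capability
    List<String> getCollections();
}

class FlatIndex implements XmlIndex {           // a DB XML-like backend: no collections
    public void store(String id, String xml) { /* ... */ }
    public List<String> query(String expression) { return Collections.emptyList(); }
}

class Client {
    static void listCollections(XmlIndex index) {
        if (index instanceof SupportsCollections) {
            System.out.println(((SupportsCollections) index).getCollections());
        } else {
            System.out.println("Backend has no notion of collections.");
        }
    }
}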


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Kevin S. Clarke

In this respect, "standard" just means a programming interface.  I'm
suggesting that using XQuery is like using interfaces in Java (a
defined way of accessing something independent of the implementation).
You could do this in Java (there is an XQJ... I think you can use it
independently of a textual XQuery statement) or you could do this in
XQuery.

XQuery is just an interface to XML data, regardless of the backend
storage mechanism; with XQuery, you see the world through XML-colored
glasses (which some think is a good idea and others don't, granted).
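
XQJ was still a draft at the time, but the older XML:DB API that eXist
ships gives the flavor of the same idea in Java terms: the query text
and the calling code stay constant, and only the driver class and URI
name the backend.  A minimal sketch (the collection URI and query are
placeholders):

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.base.Resource;
import org.xmldb.api.base.ResourceIterator;
import org.xmldb.api.base.ResourceSet;
import org.xmldb.api.modules.XPathQueryService;

public class XmlDbQuery {
    public static void main(String[] args) throws Exception {
        // The only backend-specific pieces: the driver class and the URI.
        Database driver = (Database) Class.forName("org.exist.xmldb.DatabaseImpl").newInstance();
        DatabaseManager.registerDatabase(driver);
        Collection col = DatabaseManager.getCollection(
            "xmldb:exist://localhost:8080/exist/xmlrpc/db/records");

        XPathQueryService service =
            (XPathQueryService) col.getService("XPathQueryService", "1.0");
        ResourceSet results = service.query("//record[title = 'Hamlet']");

        ResourceIterator it = results.getIterator();
        while (it.hasMoreResources()) {
            Resource r = it.nextResource();
            System.out.println(r.getContent());
        }
        col.close();
    }
}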

Kevin

On 11/28/06, Casey Durfee [EMAIL PROTECTED] wrote:


I thought that was the point of using interfaces?  I guess I don't get why you 
need a standard to be compelled to do something you should be doing anyway -- 
coding to interfaces, not implementations.

--Casey

 [EMAIL PROTECTED] 11/28/2006 11:14 AM 
The point with a standard is you
shouldn't have to refactor your application just because you want to
change a component on the backend... you shouldn't have to care
whether you are storing in Oracle or MarkLogic.



[CODE4LIB] eXist 1.1

2006-11-28 Thread Binkley, Peter
Re the eXist 1.1 development line: I'm tinkering with that now.  I
tried populating two different collections at the same time over
WebDAV connections from two different machines, and ended up with a
corrupt DB (content from one source ended up in documents supposedly
written by the other).  Darn.

Peter


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Kevin S. Clarke

On 11/28/06, Erik Hatcher [EMAIL PROTECTED] wrote:


Is there a standard for specifying how textual analysis works as
well, so that tokenization can be standardized across these XQuery
engines as well?


Not that I know of.  What I've seen so far is that tokenization is
implementation-specific.  Perhaps this is something that is
configurable, so that implementations can be set up and then queried
consistently.  Any indexing engine worth its salt should be
configurable, I'd think.  There is nothing I'm aware of in the fulltext
work, though, that defines how things are indexed.
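
In Lucene terms, "configurable" analysis just means wiring up whatever
tokenizer/filter chain you want.  A sketch with the stock Lucene
2.x-era classes (the particular chain is only an example):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ConfigurableAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);   // split into terms
        stream = new LowerCaseFilter(stream);                 // case-fold
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);  // drop stopwords
        return stream;
    }
}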


That's an easy bet... of course Lucene will be part of it.  It's
already implemented as extensions to XQuery engines (Nux, I know of,
and surely others).


As you can tell, I'm not really a gambler :-)

Our native XML database vendor has committed to the fulltext spec
(once it becomes a spec) and since they are using Lucene already I'd
say I don't have anything to worry about.

Interestingly, as a side note, a quick search turned up an eXist
presentation from Prague06 saying that eXist's text analysis classes
would be replaced by a modular analyzer provided by Apache's Lucene.
Neat.

All this talk is just me looking forward (with optimism).  It is
possible to use fulltext with XQuery now either through an
intermediary layer like we currently have (Lucene search is done and
the results passed to XQuery and our native XML database for retrieval
and munging) or by creating fulltext extensions (like eXist db and our
native XML database vendor have done).
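
A compressed sketch of that intermediary-layer arrangement (index path,
field names and the XQuery are placeholders; Lucene 2.x-era API): the
Lucene hits supply identifiers, and those get spliced into the XQuery
that the XML database then runs for retrieval and munging.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class SearchThenXQuery {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(
            new QueryParser("text", new StandardAnalyzer()).parse("rossetti"));

        StringBuilder ids = new StringBuilder();
        for (int i = 0; i < hits.length(); i++) {
            if (i > 0) ids.append(",");
            ids.append("'").append(hits.doc(i).get("id")).append("'");
        }

        // Hand this to the XQuery servlet / processor for the retrieval step.
        String xquery = "for $r in collection('/db/records')/record[@id = (" + ids + ")] return $r";
        System.out.println(xquery);
        searcher.close();
    }
}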

Personally, I wish we had taken the extension route, but it was just
quicker for me to do something in Java and have the search and XQ
servlets chain rather than adding the extra extension layer through
our XQuery processor.  Quicker isn't always better/cleaner/nicer
though...

Kevin


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Andrew Nagy

Kevin S. Clarke wrote:


Have you had a chance yet to evaluate the 1.1 development line?  It is
supposed to have solved the scaling issues.  I haven't tried it myself
(and remain skeptical that it can scale up to the level that we talk
about with Lucene (but, as you point out, it is trying to do more than
Lucene too)).


I gave the 1.1 line a shot, but still saw abysmal results ... I sent
Wolfgang (the lead developer) my MARCXML records, and he implemented it
in my development environment and found the same issues.  The major
problem with it all is the ugly mess that is MARCXML and its
incompatibility with native XML DBs.  I still have some ideas, though,
that I have not had a chance to test under the 1.1 branch.

I just finished coding our beta OPAC, so I am now heading back into my
load and scalability testing.  I am using Berkeley DB XML, which beats
the pants off of eXist in performance but has nowhere near the feature
set of eXist.  I plan to re-test eXist 1.1 on my production server so I
can get a better handle on the speeds on a machine with a bit more
beef.

I am also going to give Nux a shot.  Anyone out there using it?
http://dsd.lbl.gov/nux/index.html


Re: [CODE4LIB] XQuery

2006-11-28 Thread Kevin S. Clarke

On 11/28/06, Ross Singer [EMAIL PROTECTED] wrote:

but I don't really see the "XQuery is a standard" argument.  Just
because it's a standard (vs. a semi-ubiquitous API) doesn't mean it
will have the best tools for a particular problem area.


As I think back over these posts I think I've probably failed to
communicate that it is not because XQuery is a *S*tandard that I find
it interesting but because it is a *s*tandard (way of working with XML
(designed specifically for XML)).  After all, it really isn't a
Standard yet anyway (it is in the final stages and should become one
by January, though).

Those who know me know I've been advocating non-Standards for a while
now, precisely because I think they *are* sometimes better alternatives
to the Standards (XOBIS over MARCXML/MODS, RELAX NG over W3C Schema,
etc. -- though RELAX NG is a standard now:
http://cafe.elharo.com/xml/relax-wins/).

I think what interests me about XQuery isn't that it is a W3C endorsed
Standard, but that it is a standard way of working with XML regardless
of backend particulars (or, at least, that is the promise... it is not
always the case (but that doesn't mean it should be thrown out
either... it is still evolving)).

Perhaps, stealing a page from Roy's phrasebook, I should have named my
proposed presentation: XQuery: A Better Digital Library Hammer.  After
all, XML does not *do* anything (like a hammer would imply) but XQ
does (XML is really the nail).  Anyway, I'll stop my evangelizing for
now.  I can only attribute this annoying trait to the fact that I come
from a long line of missionaries... perhaps I've missed my calling
:-)

Kevin


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Erik Hatcher

On Nov 28, 2006, at 5:44 PM, Kevin S. Clarke wrote:

Is there a standard for specifying how textual analysis works as
well, so that tokenization can be standardized across these XQuery
engines as well?


Not that I know of.  What I've seen so far is that tokenization is
implementation-specific.  Perhaps this is something that is
configurable, so that implementations can be set up and then queried
consistently.  Any indexing engine worth its salt should be
configurable, I'd think.  There is nothing I'm aware of in the fulltext
work, though, that defines how things are indexed.


If the XQuery standard leaves out the configurability of tokenization
for indexing and querying, then concrete implementations will surely
need extensions to allow this stuff to be specified.  Interesting
issue.

For all you Java-savvy folks out there, how about standards like J2EE
that are supposed to make it easy to move an application from one
vendor's app server to another?  That works for the simplest of
applications, but all vendors have their own custom deployment
descriptors too.

   Erik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Erik Hatcher

On Nov 28, 2006, at 3:28 PM, Andrew Nagy wrote:

The major problem
with it all is the ugly mess that is MARCXML


This brings up an interesting point about just dropping our source
XML data into an XML-savvy database and using XQuery on it.

Maybe y'all have much cleaner data than I've seen, but my experience
with the Rossetti Archive has involved many XML data hurdles.  When I
came on board, Tamino was being used for the search engine, with XPath
queries all over the place.  The raw data is not consistent, and a
single-word query expanded into an enormous XPath query that looked at
many elements and attributes, not to mention it was SLOW.  After
analyzing the user interface and the real-world searching needs, I
wrote Java code that normalized the data for searching purposes into a
much coarser-grained set of fields, indexed it into Lucene, and voila:
http://www.rossettiarchive.org/rose
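
The indexing side of that normalization is not much code.  A
stripped-down sketch (paths, field names and values invented; Lucene
2.x-era API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class CoarseFieldIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        Document doc = new Document();
        // Stored identifier used to fetch the full XML later.
        doc.add(new Field("id", "rose-0001", Field.Store.YES, Field.Index.UN_TOKENIZED));
        // One coarse searchable field built from many elements/attributes of
        // the raw XML, with dates and other messy values cleaned up beforehand.
        doc.add(new Field("text", "The Blessed Damozel 1850 Dante Gabriel Rossetti poem",
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }
}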

The point is that even with super-fast full-text searching in XQuery,
most of our archives are probably going to require hideous expressions
to query them using their raw structure, especially if you have to
account for data cleanup too (such as date formatting issues, which we
also have in the RA raw data).

I realize I'm sounding anti-XQuery, which is sorta true, but only
because in the real world in which I work it works better to have some
custom digesting of the raw data than to just toss it in and work with
standards.  Indexing is lossy -- it's about keying things the way they
need to be looked up.  If your data is clean, you're in better shape
than I am.  And if XQuery on your raw data does what you need, by all
means I recommend it.

   Erik


Re: [CODE4LIB] code4lib lucene pre-conference

2006-11-28 Thread Kevin S. Clarke

On 11/28/06, Erik Hatcher [EMAIL PROTECTED] wrote:

And if XQuery on your raw data does what you
need, by all means I recommend it.


Well-structured data and a good language for working with XML are two
completely different things, in my opinion.  Even XQuery doesn't make
MARCXML a pleasure to work with.  The structure of our bibliographic
and authority data is a different issue (*cough* XOBIS *cough*) from
what we should use to interact with our XML.
Kevin