Re: [CODE4LIB] code4lib lucene pre-conference
On 11/27/06, Ross Singer [EMAIL PROTECTED] wrote:
> On 11/27/06, Kevin S. Clarke [EMAIL PROTECTED] wrote:
> Seriously, please don't get hung up on the 'proprietary'-ness of Lucene's query syntax. It's open, it's widely used, and has been ported to a handful of languages. I mean, why would you trade off something that works well /now/ and will most likely only get better for something that you admit sort of sucks?

It's not that fulltext for XQuery sucks... it just doesn't exist (right now people do it through extensions to the language). I would expect that the spec that gets written will not be that far from Lucene's syntax. You are talking about the syntax that goes into the search box, right? I don't expect an XQuery fulltext spec will change that -- it is just how you pass that along to Lucene that will be different (e.g., do you do it in Java, in Ruby, in XML via Solr, do you do it in XQuery, etc.).

> And I agree with Erik's assessment that it's better to keep your repository and index separated for exactly the sort of scenario you worry about. If a super-duper new indexer comes along, you can always just switch to it, then.

How do you switch to it? How do the pieces talk? This is the point of standards. If there is a standard way of addressing an index then you don't have to care what the newest, greatest indexer is. This paragraph seems in contrast to your one above.

Kevin
Re: [CODE4LIB] code4lib lucene pre-conference
Erik Hatcher wrote:
> "What if" games are mostly just guessing games in the high tech world. Agility is the trait our projects need. Software is just that... soft. And malleable. Sure, we can code ourselves into a corner, but generally we can code ourselves right back out of it too. If software is built with decent separation of concerns, we can adapt to changes readily.

I completely agree, but you can't deny it's a valid concern. I am always thinking about the future and making sure my software is modular and flexible, so that any part can easily be replaced. So I would hope it's as easy as just writing a new driver for the new system you want to swap in. Anyway, you have all convinced me to give Solr a whirl... I'm downloading it right now.

Andrew
Re: [CODE4LIB] code4lib lucene pre-conference
Art Rhyno wrote:
> I made a big mistake along the way in trying to work with Voyager's call number setup in Oracle, and dragged Ross along in an attempt to get past Oracle's constant quibbles with rogue characters in call number ranges. The idea was to expose the library catalogue as a series of folders using said call number ranges. This part works well enough when the characters are dealt with, but breaks down a bit for certain formats. For example, the University of Windsor lumps most of its microfiche holdings in one call number with an accession number, and Georgia Tech does something similar with maps. This can mean individual webdav folders with many thousands of entries, and some less than elegant workarounds.

So you are replacing SQL calls with WebDAV? Can you explain this a bit further?

Andrew
Re: [CODE4LIB] code4lib lucene pre-conference
On 11/28/06, Kevin S. Clarke [EMAIL PROTECTED] wrote:
> it is just how you pass that along to Lucene that will be different (e.g., do you do it in Java, in Ruby, in XML via Solr, do you do it in XQuery, etc.)

By the way, I see a very interesting intersection between Solr and XQuery because both speak XML. You may have XQueries that generate the XML that makes Solr do its magic, for instance. This is an alternative to fulltext in XQuery, sure... it is something that is here today (doesn't mean I'll stop thinking about tomorrow though).

Kevin
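[Editor's sketch: the "generate the XML that makes Solr do its magic" idea above amounts to emitting Solr's `<add><doc>` update format from your records. A minimal illustration in Python follows; the field names ("id", "title") are hypothetical and would be defined by your own Solr schema, and a real pipeline might generate this from XQuery instead.]

```python
import xml.etree.ElementTree as ET

def solr_add_doc(records):
    """Build a Solr <add><doc>... update message from a list of dicts.

    This is the XML format Solr's update handler accepts; each dict
    becomes one <doc> with one <field name="..."> per key."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml = solr_add_doc([{"id": "b1", "title": "Moby Dick"}])
print(xml)
# <add><doc><field name="id">b1</field><field name="title">Moby Dick</field></doc></add>
```

The same document, POSTed to Solr's `/update` handler, is what actually triggers indexing; whether it is produced by Java, Ruby, or an XQuery is exactly the interchangeability Kevin describes.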
Re: [CODE4LIB] code4lib lucene pre-conference
I'm sure most of you have seen this, but there is a lot of good work going on regarding XQuery full-text searching at the W3C. LC is pushing a lot of the activity in this group, and using hefty document-centric EAD examples in the testing. http://www.w3.org/TR/xquery-full-text/

FWIW, traditionally I've been a fan of utilizing an indexing tool that is independent from my storage. But the indexing (a subset of Lucene) that is embedded in the NXDB (X-Hive) and expressed in XQuery in use at Princeton is good. It changed my opinion a bit about having the layers separated, and I now think that XQuery Full Text has a chance. We only had to switch to the full, independent Lucene to implement some features, such as weighting, that the NXDB didn't include off the shelf.

Regardless, though, having a standards-based syntax for querying is a good thing. Or, to put it another way, at least it doesn't hurt. Those who don't wish to interact with an index through the standard (because of its overhead) don't have to do so, but for some it will fit the bill, allowing them to drop in new backends by simply plugging into the standard syntax.

Clay
Re: [CODE4LIB] code4lib lucene pre-conference
Kevin S. Clarke wrote:
> By the way, I see a very interesting intersection between Solr and XQuery because both speak XML. You may have XQueries that generate the XML that makes Solr do its magic, for instance. This is an alternative to fulltext in XQuery, sure... it is something that is here today (doesn't mean I'll stop thinking about tomorrow though).

There is a good intersection, but if you look at the roadmap for eXist (a native XML database) they have many of the features that Solr offers (I'm still in the process of setting up Solr, so I am not too in-depth with the features yet). eXist is basically an attempt at this intersection. Too bad it's just too damn slow and still in its infancy.

Andrew
Re: [CODE4LIB] code4lib lucene pre-conference
> So you are replacing SQL calls with WebDAV? Can you explain this a bit further?

Hi,

No -- WebDAV is, among other things, an XML representation of a folder structure, and we were using SQL to help build the XML needed for WebDAV support, not replacing one with the other. Voyager stores normalized call numbers in a table, and SQL was used to pull out records before transforming the results into the XML layout required. In Windows, WebDAV is accessed as a web folder, and the result was to expose the library catalogue as a series of nested folders in call number order. My big interest was to make the catalogue an extension of the desktop, and to open up the possibility of using desktop indexers for catalogue content. There is more information in a submission Ross and I did for the Talis mashup competition: http://librarycog.uwindsor.ca/indexcat

We were able to use caching to minimize the overhead of the SQL queries, but a better method would be to work directly with MARC files, since there wouldn't be a Voyager dependency and the approach would be open to any system that can export the catalogue.

art
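[Editor's sketch: the "XML representation of a folder structure" Art mentions is WebDAV's PROPFIND multistatus response (the `DAV:` namespace is standard; the call-number folder paths below are made up for illustration). A minimal generator in Python:]

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"
ET.register_namespace("D", DAV)

def multistatus(hrefs):
    """Build a minimal WebDAV PROPFIND multistatus body: one
    <D:response>/<D:href> pair per folder being exposed."""
    ms = ET.Element(f"{{{DAV}}}multistatus")
    for href in hrefs:
        resp = ET.SubElement(ms, f"{{{DAV}}}response")
        ET.SubElement(resp, f"{{{DAV}}}href").text = href
    return ET.tostring(ms, encoding="unicode")

# Hypothetical call-number-range folders, as in the Windsor/Georgia Tech setup:
print(multistatus(["/catalogue/QA76/", "/catalogue/QA76.73.J38/"]))
```

A WebDAV client (e.g. a Windows web folder) renders such a response as a browsable directory, which is how SQL results end up looking like nested call-number folders on the desktop.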
Re: [CODE4LIB] code4lib lucene pre-conference
On 11/28/06, Kevin S. Clarke [EMAIL PROTECTED] wrote:
> How do you switch to it? How do the pieces talk? This is the point of standards. If there is a standard way of addressing an index then you don't have to care what the newest greatest indexer is. This paragraph seems in contrast to your one above.

Well, what's the guarantee that the next great indexer isn't going to use /some other standard/ than the one you're using? My only point is, it's a whole lot easier to refactor your application to benefit from a different indexing engine than it is to export all of your data out of something and potentially remodel it to work in another. I suppose it all breaks down to how much work you're willing to invest to keep up with the Joneses (after all, you could just stay with Lucene), but I don't really buy the "XQuery is a standard" argument. Just because it's a standard (vs. a semi-ubiquitous API) doesn't mean it will have the best tools for a particular problem area.

-Ross
Re: [CODE4LIB] code4lib lucene pre-conference
On Tue, Nov 28, 2006 at 10:27:22AM -0500, Ross Singer wrote:
> Just because it's a standard (vs. semi-ubiquitous API) doesn't mean it will have the best tools for a particular problem area.

Can't we stay with Lucene *and* keep up with the Joneses? What's been referred to in this conversation as Lucene's standard query language is just the syntax used by Lucene's default query parser and, as noted in the overview[1]: "Although Lucene provides the ability to create your own queries through its API, it also provides a rich query language through the Query Parser, a lexer which interprets a string into a Lucene Query using JavaCC." It's nice that Lucene ships with a query parser, but it is by no means the only way to parse queries for Lucene. A Google search on lucene xquery parser (no quotes) brings up Nux and Jackrabbit. I don't know much about either project, but they seem to be working already on the future we're talking about.

Gabe

[1] http://lucene.apache.org/java/docs/queryparsersyntax.html#Overview
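[Editor's sketch: the query-parser syntax Gabe cites reserves a set of special characters (`+ - && || ! ( ) { } [ ] ^ " ~ * ? : \`, per the linked syntax page), which any front end accepting user input has to escape before building a query string. A small Python illustration, not tied to any particular Lucene binding:]

```python
# Characters Lucene's default query parser treats as special
# (the && and || operators are escaped character-by-character here).
LUCENE_SPECIAL = set('+-!(){}[]^"~*?:\\&|')

def escape_lucene(term):
    """Backslash-escape user input before embedding it in a Lucene query string."""
    return "".join("\\" + c if c in LUCENE_SPECIAL else c for c in term)

query = f'title:{escape_lucene("C++ (2nd ed.)")}'
print(query)  # title:C\+\+ \(2nd ed.\)
```

The point of the interchangeability argument is that this query *string* stays the same whether it reaches Lucene via Java, Solr, or an XQuery extension; only the transport differs.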
[CODE4LIB] Opening for Technology Project Manager
The Ohio Public Library Information Network has an opening for a Library Technology Project Manager. Please see the posting at http://statejobs.ohio.gov/applicant/results2.asp?postingID=167148 -- Stephen Hedges, Executive Director Ohio Public Library Information Network (OPLIN) 2323 W. 5th Avenue, Suite 130 Columbus, Ohio 43204 614-728-5250
Re: [CODE4LIB] code4lib lucene pre-conference
Casey Durfee wrote:
> I thought that was the point of using interfaces? I guess I don't get why you need a standard to be compelled to do something you should be doing anyway -- coding to interfaces, not implementations.

Interfaces work well with like products (a database abstraction library is a great example), but they don't lend themselves well to products that achieve a similar goal while working differently altogether. Relational databases all work the same way: there are databases, each database has tables, views, procedures, etc., and each table has columns, and so on. Less mature systems, such as XML storage systems, are much harder to map in a similar fashion.

I ran into this exact problem: I developed a system around eXist, with an interface for the data layer and a driver for interacting with eXist. I then wanted to compare other databases such as Berkeley DB XML. I quickly found that they achieve a common goal but do not implement the same concepts, making them very hard to compare. eXist has collections to group your XML into distinct groupings, and DB XML does not. In my interface I had a method called getCollections, but since DB XML has nothing like this, I could not use that method. So how would you develop an interface that covers various XML databases as well as full-text index systems such as Lucene? I would imagine this would be very challenging.
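[Editor's sketch: the getCollections mismatch above is the classic capability-gap problem with interfaces. One common workaround is to let backends declare which optional features they support rather than forcing every method into the shared interface. The class and feature names below are hypothetical, in Python rather than the Java Andrew was presumably using:]

```python
class XmlStore:
    """Hypothetical least-common-denominator interface for XML stores."""
    def query(self, xquery):
        raise NotImplementedError
    def supports(self, feature):
        # Backends override this to advertise optional capabilities.
        return False

class ExistStore(XmlStore):
    """eXist-like backend: has a collection concept."""
    def supports(self, feature):
        return feature in {"collections", "fulltext"}
    def get_collections(self):
        return ["/db/marc", "/db/ead"]  # illustrative collection paths

class DbXmlStore(XmlStore):
    """Berkeley DB XML-like backend: no collection concept."""
    def supports(self, feature):
        return feature in {"fulltext"}

for store in (ExistStore(), DbXmlStore()):
    if store.supports("collections"):
        print(store.get_collections())
    else:
        print("backend has no collections")
```

Callers probe `supports()` before using an optional method, so the interface stays honest about what each backend can actually do instead of papering over the differences.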
Re: [CODE4LIB] code4lib lucene pre-conference
In this respect "standard" just means a programming interface. I'm suggesting that using XQuery is like using interfaces in Java (a defined way of accessing something independent of implementation). You could do this in Java (there is an XQJ... I think you can use this independent of a textual XQuery statement) or you could do this in XQuery. XQuery is just an interface to XML data, regardless of the backend storage mechanism; with XQuery, you see the world through XML-colored glasses (which some think is a good idea and others don't like, granted).

Kevin

On 11/28/06, Casey Durfee [EMAIL PROTECTED] wrote:
> I thought that was the point of using interfaces? I guess I don't get why you need a standard to be compelled to do something you should be doing anyway -- coding to interfaces, not implementations. --Casey
> [EMAIL PROTECTED] 11/28/2006 11:14 AM: The point with a standard is you shouldn't have to refactor your application just because you want to change a component on the backend... you shouldn't have to care whether you are storing in Oracle or MarkLogic.
[CODE4LIB] eXist 1.1
Re the eXist 1.1 development line: I'm tinkering with that now - tried populating two different collections at the same time over webdav connections from two different machines, and ended up with a corrupt db (content from one source ended up in documents supposedly written by the other). Darn. Peter
Re: [CODE4LIB] code4lib lucene pre-conference
On 11/28/06, Erik Hatcher [EMAIL PROTECTED] wrote:
> Is there a standard for specifying how textual analysis works as well, so that tokenization can be standardized across these XQuery engines as well?

Not that I know of. What I've seen so far is that tokenization is implementation specific. Perhaps this is something that is configurable so that implementations can be set up and then queried consistently. Any indexing engine worth its salt should be configurable, I'd think. There is nothing I'm aware of in the fulltext work, though, that defines how things are indexed.

That's an easy bet... of course Lucene will be part of it. It's already implemented as extensions to XQuery engines (Nux, I know of, and surely others). As you can tell, I'm not really a gambler :-) Our native XML database vendor has committed to the fulltext spec (once it becomes a spec), and since they are using Lucene already I'd say I don't have anything to worry about. Interestingly, as a side note, a quick search turned up an eXist presentation from Prague06 saying that eXist's text analysis classes would be replaced by a modular analyzer provided by Apache's Lucene. Neat.

All this talk is just me looking forward (with optimism). It is possible to use fulltext with XQuery now, either through an intermediary layer like we currently have (Lucene search is done and the results passed to XQuery and our native XML database for retrieval and munging) or by creating fulltext extensions (like eXist and our native XML database vendor have done). Personally, I wish we had taken the extension route, but it was just quicker for me to do something in Java and have the search and XQ servlets chain rather than adding the extra extension layer through our XQuery processor. Quicker isn't always better/cleaner/nicer though...

Kevin
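[Editor's sketch: the "tokenization is implementation specific but configurable" point can be made concrete with a toy analyzer factory, loosely modeled on how Lucene analyzers chain a tokenizer with filters. Everything here is illustrative, not any engine's actual API:]

```python
import re

def make_tokenizer(lowercase=True, min_len=1, stopwords=frozenset()):
    """Return a tokenizer configured like a small analyzer chain:
    split on non-word characters, optionally lowercase, drop stopwords."""
    def tokenize(text):
        tokens = re.findall(r"\w+", text)
        if lowercase:
            tokens = [t.lower() for t in tokens]
        return [t for t in tokens if len(t) >= min_len and t not in stopwords]
    return tokenize

default = make_tokenizer(stopwords={"the", "of"})
print(default("The Index of the Catalogue"))  # ['index', 'catalogue']
```

Two engines configured with the *same* chain would index and match terms identically; leave the chain unspecified in the standard, as the fulltext draft apparently does, and the same query can return different results on different backends.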
Re: [CODE4LIB] code4lib lucene pre-conference
Kevin S. Clarke wrote:
> Have you had a chance yet to evaluate the 1.1 development line? It is supposed to have solved the scaling issues. I haven't tried it myself (and remain skeptical that it can scale up to the level that we talk about with Lucene (but, as you point out, it is trying to do more than Lucene too)).

I gave the 1.1 line a shot, but still saw abysmal results... I sent Wolfgang (the lead guy) my MARCXML records and he implemented it in my development environment and found the same issues. The major problem with it all is the ugly mess that is MARCXML and its incompatibility with native XML DBs. Although, I still have some ideas that I have not had a chance to test yet under the 1.1 branch. I just finished coding our beta OPAC, so I am now heading back into my load scalability testing.

I am using Berkeley DB XML, which beats the pants off eXist in performance but has nowhere near the feature set of eXist. I plan to re-test eXist 1.1 on my production server so I can get a better handle on the speeds on a machine with a bit more beef. I am also going to give this Nux a shot too. Anyone out there using it? http://dsd.lbl.gov/nux/index.html
Re: [CODE4LIB] XQuery
On 11/28/06, Ross Singer [EMAIL PROTECTED] wrote:
> but I don't really see the argument of XQuery is a standard. Just because it's a standard (vs. semi-ubiquitous API) doesn't mean it will have the best tools for a particular problem area.

As I think back over these posts, I think I've probably failed to communicate that it is not because XQuery is a *S*tandard that I find it interesting but because it is a *s*tandard way of working with XML (designed specifically for XML). After all, it really isn't a Standard yet anyway (it is in the final stages and should be by January, though). Those who know me know I've been advocating non-Standards for a while now precisely because I think they sometimes *are* better alternatives to the Standards (XOBIS over MARCXML/MODS, RELAX NG over W3C Schema, etc. -- though RELAX NG is a standard now: http://cafe.elharo.com/xml/relax-wins/).

I think what interests me about XQuery isn't that it is a W3C-endorsed Standard, but that it is a standard way of working with XML regardless of backend particulars (or, at least, that is the promise... it is not always the case, but that doesn't mean it should be thrown out either... it is still evolving). Perhaps, stealing a page from Roy's phrasebook, I should have named my proposed presentation "XQuery: A Better Digital Library Hammer." After all, XML does not *do* anything (as a hammer would imply) but XQuery does (XML is really the nail). Anyway, I'll stop my evangelizing for now. I can only attribute this annoying trait to the fact that I come from a long line of missionaries... perhaps I've missed my calling :-)

Kevin
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 28, 2006, at 5:44 PM, Kevin S. Clarke wrote:
>> Is there a standard for specifying how textual analysis works as well, so that tokenization can be standardized across these XQuery engines as well?
> Not that I know of. What I've seen so far is that tokenization is implementation specific. Perhaps this is something that is configurable so that implementations can be set up and then queried consistently. Any indexing engine worth its salt should be configurable, I'd think. There is nothing I'm aware of in the fulltext work though that defines how things are indexed.

If you leave out all the configurability in tokenization for indexing and querying from the XQuery standard, then there will surely be extensions needed for concrete implementations to allow this stuff to be specified. Interesting issue. For all you Java-savvy folks out there, compare standards like J2EE that make it easy to move an application from one vendor's app server to another: it works for the simplest of applications, but all vendors have their own specific custom deployment descriptors too.

Erik
Re: [CODE4LIB] code4lib lucene pre-conference
On Nov 28, 2006, at 3:28 PM, Andrew Nagy wrote:
> The major problem with it all is the ugly mess that is marcxml

This brings up an interesting point about just dropping our source XML data into an XML-savvy database and using XQuery on it. Maybe y'all have much cleaner data than I've seen, but my experience with the Rossetti Archive has had many XML data hurdles. When I came on board, Tamino was being used for the search engine, with XPath queries all over the place. The raw data is not consistent, and a single-word query expanded into an enormous XPath query looking at many elements and attributes, not to mention it was SLOW. Analyzing the user interface and the real-world searching needs, I wrote Java code that normalized the data for searching purposes into a much coarser-grained set of fields, indexing it into Lucene, and voila: http://www.rossettiarchive.org/rose

The point is that even with super-fast full-text searching with XQuery, most of our archives are probably going to require hideous expressions to query them using their raw structure, especially if you have to account for data cleanup too (such as date formatting issues, which we also have in RA raw data). I realize I'm sounding anti-XQuery, which is sorta true, but only because in the real world in which I work it works better to do some custom digesting of the raw data than to just toss it in and work with standards. Indexing is lossy -- it's about keying things the way they need to be looked up. If your data is clean, you're in better shape than me. And if XQuery on your raw data does what you need, by all means I recommend it.

Erik
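[Editor's sketch: the normalization Erik describes -- digesting many fine-grained elements into a few coarse search fields before indexing -- can be illustrated with a toy record. The `<work>` markup below is invented for the example; real Rossetti Archive documents are far messier, which is exactly his point.]

```python
import xml.etree.ElementTree as ET

# Invented record standing in for inconsistent archive markup.
raw = """<work>
  <title type="main">The Blessed Damozel</title>
  <title type="alt">Damozel, The Blessed</title>
  <creator>D. G. Rossetti</creator>
  <date>1850?</date>
</work>"""

def normalize(xml_text):
    """Digest fine-grained elements into coarse fields for indexing:
    all titles collapse into one 'title' field, everything searchable
    lands in a catch-all 'text' field."""
    root = ET.fromstring(xml_text)
    return {
        "title": " ".join(t.text for t in root.findall("title")),
        "text": " ".join(e.text.strip() for e in root.iter()
                         if e.text and e.text.strip()),
    }

fields = normalize(raw)
print(fields["title"])  # The Blessed Damozel Damozel, The Blessed
```

A single-word user query then hits one or two fields instead of expanding into an enormous XPath over every element and attribute -- lossy, as Erik says, but keyed the way lookups actually happen.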
Re: [CODE4LIB] code4lib lucene pre-conference
On 11/28/06, Erik Hatcher [EMAIL PROTECTED] wrote:
> And if XQuery on your raw data does what you need, by all means I recommend it.

Well-structured data and a good language for working with XML are two completely different things, in my opinion. Even XQuery doesn't make MARCXML a pleasure to work with. The structure of our bibliographic and authority data is a different issue (*cough* XOBIS *cough*) from what we should use to interact with our XML, in my opinion.

Kevin