Re: [CODE4LIB] Enterprise Search and library collection
In short, I think a Google Appliance is an expensive but viable option. Relative to other commercial products in the space, the GA or G-mini is actually very inexpensive. Another option to add to Eric's list is the All Access Connector, which adds MuseGlobal's federated search technology to the Google appliance. Of course, it also adds $40K or more to the total price.

http://wire.jstirnaman.com/2008/05/23/federated-search-for-google-search-appliance/

Jason

--
Jason Stirnaman
Digital Projects Librarian/School of Medicine Support
A.R. Dykes Library, University of Kansas Medical Center
[EMAIL PROTECTED]
913-588-7319

On 7/10/2008 at 10:25 PM, in message [EMAIL PROTECTED], Eric Lease Morgan [EMAIL PROTECTED] wrote:

At the risk of interpreting the original question incorrectly, we have had decent success using the Google Search Appliance to facilitate search across the enterprise (university):

* Buy the Appliance.
* Feed it one or more URLs.
* Wait for it to crawl.
* Customize the user interface.
* Allow people to use it.

While we haven't done so, it would not be too difficult to implement a sort of federated search within the Appliance's interface. This could be done in a number of ways:

1. Acquire bibliographic data and feed it directly to the Appliance via the (poorly) documented SQL interface.
2. Acquire bibliographic data, save it as HTML files, and allow the Appliance to crawl the HTML.
3. License access to bibliographic data, making sure it is accessible through some sort of API, and write a Google OneBox module that queries the data and returns results as part of a normal Google Appliance search. (See the sketch after this message.)

The larger Google Appliance costs about $30,000, but you purchase it, not license it. No annual fees. That will buy you the ability to index 500,000 documents, which, when it comes to a bibliographic database (such as a subject index or a library catalog), is not really very much.

We here at Notre Dame did implement Option #3, but it queries the local LDAP server to return names and addresses of people, not bibliographic citations. [1, 2] I did write a OneBox module to query our catalog, but we haven't implemented it yet. It will probably appear as part of the library's Search This Site functionality.

In short, I think a Google Appliance is an expensive but viable option.

[1] Search for a name (ex: Hesburgh) at http://search.nd.edu/
[2] OneBox source code - http://tinyurl.com/6ktxot
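For anyone curious what Option #3 involves in practice, below is a minimal, hypothetical sketch of a OneBox provider. The appliance calls the provider over HTTP with the user's query and expects an XML payload back. The element names (OneBoxResults, resultCode, MODULE_RESULT, U, Title) are from my reading of the OneBox developer guide and should be verified against Google's documentation; search_catalog() is an imaginary stand-in for a real LDAP or catalog lookup, and none of this is Notre Dame's actual module.

    # Hypothetical GSA OneBox provider sketch (not a definitive
    # implementation): answer the appliance's HTTP GET with OneBox XML.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs
    from xml.sax.saxutils import escape

    def search_catalog(query):
        # Imaginary lookup; replace with a real query against LDAP,
        # the catalog, or a licensed bibliographic API.
        return [{"title": "Sample record matching '%s'" % query,
                 "url": "http://catalog.example.edu/record/1"}]

    class OneBoxHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # The appliance passes the user's terms as a query parameter.
            qs = parse_qs(urlparse(self.path).query)
            query = qs.get("query", [""])[0]
            hits = "".join(
                "<MODULE_RESULT><U>%s</U><Title>%s</Title></MODULE_RESULT>"
                % (escape(h["url"]), escape(h["title"]))
                for h in search_catalog(query))
            body = ('<?xml version="1.0" encoding="UTF-8"?>'
                    "<OneBoxResults><resultCode>success</resultCode>"
                    "%s</OneBoxResults>" % hits).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/xml")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), OneBoxHandler).serve_forever()

Point a OneBox module definition on the appliance at a provider like this, and the appliance merges the provider's results into its regular results page.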
[CODE4LIB] anyone know about Inera?
I recently became aware of a company that provides what it terms "reference correction" software: Inera. This is the company that powers the CrossRef Simple Text Query box (http://www.crossref.org/freeTextQuery). See http://www.inera.com/refcorrection.shtml for more details.

Does anyone on this list have any knowledge of this company? I'm just wondering if it would be better to use what they have rather than continue to possibly reinvent the wheel for citation parsing.

Steve
Re: [CODE4LIB] anyone know about Inera?
Steve,

If you need citation parsing, rather than reference correction, maybe this will work for you: http://aye.comp.nus.edu.sg/parsCit/ I haven't had a chance to try it yet, though.

Jason

On Fri, Jul 11, 2008 at 11:51 AM, Steve Oberg [EMAIL PROTECTED] wrote:

I recently became aware of a company that provides what it terms "reference correction" software: Inera. This is the company that powers the CrossRef Simple Text Query box (http://www.crossref.org/freeTextQuery). See http://www.inera.com/refcorrection.shtml for more details.

Does anyone on this list have any knowledge of this company? I'm just wondering if it would be better to use what they have rather than continue to possibly reinvent the wheel for citation parsing.

Steve
Re: [CODE4LIB] anyone know about Inera?
Jason,

Thanks, yes, I knew of this effort and have actually spent a lot of time working with this same software (or rather the same underlying software). But I'm not sure it does enough, or does it well enough, for me at this point. I'd like to take a list of anywhere from one or two up to hundreds of citations, dump it into a web form, and get SFX URLs as output.

Steve

On Fri, Jul 11, 2008 at 1:51 PM, Jason Ronallo [EMAIL PROTECTED] wrote:

Steve,

If you need citation parsing, rather than reference correction, maybe this will work for you: http://aye.comp.nus.edu.sg/parsCit/ I haven't had a chance to try it yet, though.

Jason

On Fri, Jul 11, 2008 at 11:51 AM, Steve Oberg [EMAIL PROTECTED] wrote:

I recently became aware of a company that provides what it terms "reference correction" software: Inera. This is the company that powers the CrossRef Simple Text Query box (http://www.crossref.org/freeTextQuery). See http://www.inera.com/refcorrection.shtml for more details.

Does anyone on this list have any knowledge of this company? I'm just wondering if it would be better to use what they have rather than continue to possibly reinvent the wheel for citation parsing.

Steve
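For concreteness, here is a rough, hypothetical sketch of that workflow: read a batch of pasted citations, parse each one, and emit one SFX OpenURL per citation. parse_citation() is a placeholder for ParsCit or whatever parser proves accurate enough, and the resolver base URL is invented; the rft.* keys are standard OpenURL 1.0 journal-format keys.

    # Sketch only: batch citations in, SFX OpenURLs out.
    from urllib.parse import urlencode

    SFX_BASE = "http://sfx.example.edu/local"  # hypothetical resolver

    def parse_citation(raw):
        # Placeholder for a real parser; hard-coded here only to show
        # the shape of the mapping from parser output to OpenURL keys.
        return {"rft.jtitle": "American Journal of Kidney Diseases",
                "rft.date": "2005", "rft.volume": "46",
                "rft.issue": "5", "rft.spage": "925"}

    def sfx_url(raw):
        params = {"ctx_ver": "Z39.88-2004",
                  "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal"}
        params.update(parse_citation(raw))
        return SFX_BASE + "?" + urlencode(params)

    if __name__ == "__main__":
        # Citations pasted into a web form would arrive as one blob;
        # assume blank-line separation here.
        blob = open("citations.txt").read()
        for raw in filter(None, (c.strip() for c in blob.split("\n\n"))):
            print(sfx_url(raw))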
Re: [CODE4LIB] anyone know about Inera?
Just out of curiosity, what makes ParsCit not optimal for this purpose? Is it too slow? Not accurate enough? I ask, as I've thought of doing similar things but haven't explored the software deeply enough to know if it'd work.

Cheers,
-Nate

On Fri, Jul 11, 2008 at 2:44 PM, Steve Oberg [EMAIL PROTECTED] wrote:

Jason,

Thanks, yes, I knew of this effort and have actually spent a lot of time working with this same software (or rather the same underlying software). But I'm not sure it does enough, or does it well enough, for me at this point. I'd like to take a list of anywhere from one or two up to hundreds of citations, dump it into a web form, and get SFX URLs as output.
Re: [CODE4LIB] Enterprise Search and library collection
I did not have stellar results experimenting with a similar approach to Eric's. The crawler we use is from Thunderstone, and it does a fine job of indexing web content, with very nice relevancy ranking and "did you mean" spell-checking. What I found when trying to let it loose against multiple servers is that when it hits our OPAC, it sees several different formats per record and ends up more than triple-indexing each title. It does have a lot of flexibility in the indexing options, though, so I could try it again and set it to ignore URL patterns that refer to the MARC display, etc. (see the sketch after the quoted message below).

Still, the total cost to index a couple million pages (which would be needed in order to include all the records in the OPAC, plus the website pages, plus the Syndetics added content) is a bit of a steep one-time outlay. I'm sure there's some other way to go about this with Thunderstone's TEXIS rather than using their Webinator product, but then you have a substantially higher development effort, I think.

Thunderstone now offers faceted search (they call it parametric search). They also make search appliances at different capacity levels. Pricing is really pretty reasonable for what you get. http://www.thunderstone.com/texis/site/pages/Products.html

Genny Engel
Internet Librarian
Sonoma County Library
[EMAIL PROTECTED]
707 545-0831 x581
www.sonomalibrary.org

[EMAIL PROTECTED] 07/11/08 07:36AM

In short, I think a Google Appliance is an expensive but viable option. Relative to other commercial products in the space, the GA or G-mini is actually very inexpensive. Another option to add to Eric's list is the All Access Connector, which adds MuseGlobal's federated search technology to the Google appliance. Of course, it also adds $40K or more to the total price.

http://wire.jstirnaman.com/2008/05/23/federated-search-for-google-search-appliance/

Jason

--
Jason Stirnaman
Digital Projects Librarian/School of Medicine Support
A.R. Dykes Library, University of Kansas Medical Center
[EMAIL PROTECTED]
913-588-7319

On 7/10/2008 at 10:25 PM, in message [EMAIL PROTECTED], Eric Lease Morgan [EMAIL PROTECTED] wrote:

At the risk of interpreting the original question incorrectly, we have had decent success using the Google Search Appliance to facilitate search across the enterprise (university):

* Buy the Appliance.
* Feed it one or more URLs.
* Wait for it to crawl.
* Customize the user interface.
* Allow people to use it.

While we haven't done so, it would not be too difficult to implement a sort of federated search within the Appliance's interface. This could be done in a number of ways:

1. Acquire bibliographic data and feed it directly to the Appliance via the (poorly) documented SQL interface.
2. Acquire bibliographic data, save it as HTML files, and allow the Appliance to crawl the HTML.
3. License access to bibliographic data, making sure it is accessible through some sort of API, and write a Google OneBox module that queries the data and returns results as part of a normal Google Appliance search.

The larger Google Appliance costs about $30,000, but you purchase it, not license it. No annual fees. That will buy you the ability to index 500,000 documents, which, when it comes to a bibliographic database (such as a subject index or a library catalog), is not really very much.

We here at Notre Dame did implement Option #3, but it queries the local LDAP server to return names and addresses of people, not bibliographic citations. [1, 2] I did write a OneBox module to query our catalog, but we haven't implemented it yet. It will probably appear as part of the library's Search This Site functionality.

In short, I think a Google Appliance is an expensive but viable option.

[1] Search for a name (ex: Hesburgh) at http://search.nd.edu/
[2] OneBox source code - http://tinyurl.com/6ktxot
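To make the URL-pattern idea concrete, here is a small, hypothetical sketch of the filtering involved: keep only one view of each catalog record so the crawler doesn't triple-index every title. The patterns are invented placeholders for whatever a given OPAC uses for its MARC and alternate displays; Webinator itself would do the equivalent through its own exclusion settings.

    # Sketch: drop duplicate record views before (or instead of) indexing.
    import re

    EXCLUDE = [re.compile(p) for p in (
        r"format=marc",      # hypothetical raw MARC display
        r"format=citation",  # hypothetical citation-style display
    )]

    def should_index(url):
        return not any(p.search(url) for p in EXCLUDE)

    urls = [
        "http://opac.example.org/record/123",
        "http://opac.example.org/record/123?format=marc",
        "http://opac.example.org/record/123?format=citation",
    ]
    print([u for u in urls if should_index(u)])  # keeps only the plain view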
Re: [CODE4LIB] Enterprise Search and library collection [SEC=UNCLASSIFIED]
Hi Steve,

Thanks for a full reply. We actually do combine data within enterprises, including from their ILS, subscription sources (article databases), and internal repositories. Of course we claim we do it well - and I think we do. A library background will enable you to face almost any shape of data with aplomb, if not equanimity.

Data from varied sources is varied in structure, type of content, and level of detail, as you say. It *is* possible to combine it, but it works best when there is some sort of commonality across the sources. Fortunately, most people provide that focus when searching, so the theoretical problem is very rarely a practical one - and this business is all about practical solutions.

We do actually have a fair number of the enterprise search engine vendors as partners, where we act as a selective harvesting capability for them and convert the syntax and semantics of the harvested records into a uniformity they can easily ingest and work their indexing magic on.

Fence sitting has a long and honourable tradition (both in the UK and the US), and we 'back both horses' ourselves by being in both the federated search and content integration space - thus involved in both the just-in-case harvesting and the just-in-time fed searching.

Final thought is that almost everybody we have dealt with is a special case - most of them in the nicest possible way - so, even for systems like ours, customization is the order of the day. But that's what computers allow us to do - adapt to users.

Peter

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Steve Oberg
Sent: Friday, July 11, 2008 12:15 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Enterprise Search and library collection [SEC=UNCLASSIFIED]

Peter,

Use a search engine and create an aggregated database/index of all the material from the organization, or use a federated search system to search the repositories/catalogs/databases/etc. in real time? Did you consider both? And why the choice you made?

I was not involved in the initial planning. I came in sort of halfway through and had to make a lot of the initial planning decisions work (even while I disagreed with some of those decisions). Again, my perspective relates mostly to use of catalog data. However, I would add that we did in fact have a federated search tool when I came, but we quickly discarded it because it couldn't do the more limited functionality we were hoping for it to accomplish (present best options to users for where to search among our databases and collections according to subject), let alone aggregate or search across disparate data repositories.

Personally, I find it very difficult to believe that a federated search such as what you provide at MuseGlobal can do this sort of enterprise combination of data well. The data is not well structured (except for catalog data) and includes an extreme range of completeness and little commonality. What is interesting to note, however, is that on the one hand, a vendor such as yourself may claim that you can do this sort of stuff well (I'm not saying you said that, just that you might say that). On the other hand, I find it interesting that the enterprise search tool vendor we have, coming from a completely different market and perspective, would readily claim they can do all that library stuff -- that they do in fact offer true federated search. Which, in my personal opinion, isn't true at all.

But ideally I would answer your question in this way: I think there should be a combination of the two approaches, and that this would be more practical and workable than just one or the other. How's that for sitting on the fence :-)

Build vs. Buy? It obviously has taken Steve and his colleagues a lot of hard work to produce a nice looking system (except for all those big black bits on the screen!) and it obviously takes maintenance (it is 'fragile'). Do you think it was/is worth it, and if so, why?

My answer is, it is too soon to tell. There are many reasons why our implementation is probably unique (and I don't mean to imply that it is better than someone else's, just that I doubt it could readily be replicated elsewhere). We have a number of very different requirements and use cases than what some other library settings might have. We have a large number of constraints on the IT side. We have had to do a lot of custom stuff as a result. This is probably why it is fragile, more than because of deficiencies in any one piece such as the search tool itself. But we are still, in my view, only at the very early stages of assessing the whole package's value for our users. And we have very particular, demanding users.

In sum, we have had to buy AND build, and so it isn't, again, a question of one versus the other.

Steve
Re: [CODE4LIB] anyone know about Inera?
Actually, SFX is probably not going to care what the title is. It's much more likely to care about the ISSN, volume, and issue. Now, if the matching targets are EBSCO or ProQuest, you might have a problem (since they accept inbound OpenURLs from SFX), but I'm not sure, exactly. How many of these things do you have?

-Ross.

On Fri, Jul 11, 2008 at 3:55 PM, Steve Oberg [EMAIL PROTECTED] wrote:

One example. Here's the citation I have in hand:

Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al. The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone Metabolism and Disease in CKD: association with mortality in dialysis patients. American Journal of Kidney Diseases 2005; 46(5):925-932.

Here's the output from ParsCit. Note the problem with the article title:

<algorithm name="ParsCit" version="1.0">
  <citationList>
    <citation>
      <authors>
        <author>M Noordzij</author>
        <author>Korevaar JC</author>
      </authors>
      <volume>2005</volume>
      <title>Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al. The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone Metabolism and Disease in CKD: association with mortality in dialysis patients</title>
      <journal>American Journal of Kidney Diseases</journal>
      <pages>46--5</pages>
    </citation>
  </citationList>
</algorithm>

There's more, but basically it isn't accurate enough. It's very good, but not good enough for what I need at this juncture. OpenURL resolvers like SFX are generally only as good as the metadata they are given to parse. I need a high level of accuracy. Maybe that's a pipe dream.

Steve

On Fri, Jul 11, 2008 at 2:48 PM, Nate Vack [EMAIL PROTECTED] wrote:

Just out of curiosity, what makes ParsCit not optimal for this purpose? Is it too slow? Not accurate enough? I ask, as I've thought of doing similar things but haven't explored the software deeply enough to know if it'd work.

Cheers,
-Nate

On Fri, Jul 11, 2008 at 2:44 PM, Steve Oberg [EMAIL PROTECTED] wrote:

Jason,

Thanks, yes, I knew of this effort and have actually spent a lot of time working with this same software (or rather the same underlying software). But I'm not sure it does enough, or does it well enough, for me at this point. I'd like to take a list of anywhere from one or two up to hundreds of citations, dump it into a web form, and get SFX URLs as output.
Re: [CODE4LIB] anyone know about Inera?
Ross,

Actually, SFX is probably not going to care what the title is. It's much more likely to care about the ISSN, volume and issue.

Yes, true. But linking to full text is only part of the issue when it comes to using SFX in this way. I also want to ensure that those articles that we don't already have available in full text are directly routed to our internal doc delivery form (in SFX speak, using svc.ill=yes in the OpenURL). This would of course mean that however the citation got parsed is how that form is filled out. Incorrect title information is a problem in this case.

Now, if the matching targets are EBSCO or ProQuest, you might have a problem (since they accept inbound OpenURLs from SFX), but I'm not sure, exactly. How many of these things do you have?

Literally, possibly thousands. I can't divulge much detail about the exact use (there we go again with that restriction on info). But let's just say there are many very large documents (in PDF or Word), each of which contains between 100 and 400 article citations, that I am working with. Why on earth try to provide article-level OpenURLs? Well, for many reasons. I fully realize how much of a risk that is in terms of reliability and maintenance. But right now I just want a way to do this in bulk with a high level of accuracy.

Steve
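To illustrate the svc.ill=yes idea, here is a minimal OpenURL in KEV form built from the Noordzij citation earlier in the thread, with the SFX document delivery service requested. The resolver hostname is invented; the rft.* keys are standard OpenURL 1.0 journal-format keys, while svc.ill is the SFX-specific service flag named above.

    # Sketch: one OpenURL routed to doc delivery via svc.ill=yes.
    from urllib.parse import urlencode

    params = {
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.jtitle": "American Journal of Kidney Diseases",
        "rft.atitle": "The Kidney Disease Outcomes Quality Initiative "
                      "(K/DOQI) Guideline for Bone Metabolism and Disease "
                      "in CKD: association with mortality in dialysis patients",
        "rft.date": "2005", "rft.volume": "46", "rft.issue": "5",
        "rft.spage": "925", "rft.epage": "932",
        "svc.ill": "yes",  # SFX-specific: send to the internal doc delivery form
    }
    print("http://sfx.example.edu/local?" + urlencode(params))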
Re: [CODE4LIB] anyone know about Inera?
At Fri, 11 Jul 2008 14:55:18 -0500, Steve Oberg [EMAIL PROTECTED] wrote:

One example. Here's the citation I have in hand:

Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al. The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone Metabolism and Disease in CKD: association with mortality in dialysis patients. American Journal of Kidney Diseases 2005; 46(5):925-932.

Here's the output from ParsCit. Note the problem with the article title:

[…]

The output is a little different from what I get from the parsCit web service. The parsCit authors recently published a paper on a new version of their system with a new engine, which you might want to look at. [1]

There's more, but basically it isn't accurate enough. It's very good, but not good enough for what I need at this juncture. OpenURL resolvers like SFX are generally only as good as the metadata they are given to parse. I need a high level of accuracy. Maybe that's a pipe dream.

I doubt that the software provided by Inera performs better than parsCit. Inera does find a DOI for that citation, but that is not nearly so hard as determining which parts of a citation are which. parsCit is pretty cutting edge and provides some of the best numbers I have seen. The Flux-CiM system [2] also has pretty good numbers, but the code for it is not available. I've also done a little bit of work on this, which you might want to have a look at. [3]

One of the problems may be that the parsCit you are dealing with has been trained on the Cora dataset of computer science citations. It is a reasonably heterogeneous dataset of citations, but it doesn't have a lot that looks like that health sciences format. If your citations are largely drawn from the health sciences, you might see about training it on a health sciences dataset; you will probably get much better results.

best, Erik Hetzner

1. Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Language Resources and Evaluation Conference (LREC 08), Marrakech, Morocco, May. Available from http://wing.comp.nus.edu.sg/parsCit/#p
2. Eli Cortez C. Vilarinho, Altigran Soares da Silva, Marcos André Gonçalves, Filipe de Sá Mesquita, Edleno Silva de Moura. FLUX-CIM: flexible unsupervised extraction of citation metadata. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2007), pp. 215-224.
3. A simple method for citation metadata extraction using hidden Markov models. In Proc. of the Joint Conf. on Digital Libraries (JCDL 2008), Pittsburgh, Pa., 2008. http://gales.cdlib.org/~egh/hmm-citation-extractor/

;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3
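As a footnote, here is a toy sketch of the kind of field-level accuracy check that makes "good numbers" concrete when comparing parsers, or a retrained model, against a hand-labeled sample. The gold and parsed records below just mirror the ParsCit example earlier in the thread; they are not real evaluation data.

    # Sketch: per-field accuracy over (gold, parsed) citation pairs.
    def field_accuracy(pairs, fields=("title", "journal", "volume", "pages")):
        total = len(pairs) * len(fields)
        hits = sum(1 for gold, parsed in pairs
                   for f in fields if gold.get(f) == parsed.get(f))
        return hits / total

    gold = {"title": "The Kidney Disease Outcomes Quality Initiative ...",
            "journal": "American Journal of Kidney Diseases",
            "volume": "46", "pages": "925-932"}
    parsed = {"title": "Boeschoten EW, Dekker FW, Bos WJ ...",
              "journal": "American Journal of Kidney Diseases",
              "volume": "2005", "pages": "46--5"}
    print(field_accuracy([(gold, parsed)]))  # -> 0.25: only the journal matched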
Re: [CODE4LIB] anyone know about Inera?
On Fri, Jul 11, 2008 at 3:57 PM, Steve Oberg [EMAIL PROTECTED] wrote:

I fully realize how much of a risk that is in terms of reliability and maintenance. But right now I just want a way to do this in bulk with a high level of accuracy.

How bad is it, really, if you get some (5%?) bad requests into your document delivery system? Customers submit poor-quality requests by hand with some frequency, last I checked... Especially if you can hack your system to deliver the original citation all the way into your doc delivery system, you may be able to make the case that "this is a good service to offer; let's just deal with the bad parses manually."

Trying to solve this via pure technology is gonna get into a world of diminishing returns. A surprising number of citations in references sections are wrong. Some correct citations are really hard to parse, even by humans who look at a lot of citations. ParsCit has, in my limited testing, worked as well as anything I've seen (commercial or OSS), and much better than most.

My $0.02,
-Nate
Re: [CODE4LIB] Webfeet, Encompass WAS: Ser Sol 360 Search
Hi Dave,

The National Library of New Zealand still uses Encompass for their Discover service: http://discover.natlib.govt.nz/

They were Encompass development partners with us way back when, and have a ton invested in this. Not sure of a contact person anymore, but I can probably rustle someone up if you need specifics.

Cheers,
Joe Shubitowski

--
Joseph M. Shubitowski
Head, Information Systems
Getty Research Institute
1200 Getty Center Drive, Suite 1100
Los Angeles CA 90049-1688
Voice: 310-440-6394
Fax: 310-440-7780
[EMAIL PROTECTED]

On 7/11/2008 at 8:55 AM, in message [EMAIL PROTECTED], Walker, David [EMAIL PROTECTED] wrote:

Thanks to everyone who responded to my earlier request. On to the next system: if your library licenses the WebFeat metasearch system, would you mind contacting me off-list? I have similar questions to ask you all.

Also -- and I realize I'm reaching here -- if you happen to have the now-defunct Endeavor Encompass system still up and running somewhere (even if it's out of public view), would you mind contacting me.

Thanks!

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Walker, David
Sent: Monday, July 07, 2008 8:57 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Ser Sol 360 Search

Hi All,

I'm giving a conference presentation later this month on metasearch. If your library licenses Serials Solutions' metasearch system, would you mind contacting me off-list? I'd like to ask a couple of questions.

Thanks!

--Dave

---
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu