Re: [CODE4LIB] Enterprise Search and library collection

2008-07-11 Thread Jason Stirnaman
 In short, I think a Google Appliance is an expensive but viable
option.
Relative to other commercial products in the space, the GA or G-mini is
actually very inexpensive.  Another option to add to Eric's list is the
All Access Connector, which adds MuseGlobal's fed search technology to
the Google Appliance.  Of course, it also adds $40K or more to the total
price.
http://wire.jstirnaman.com/2008/05/23/federated-search-for-google-search-appliance/

Jason
-- 

Jason Stirnaman
Digital Projects Librarian/School of Medicine Support
A.R. Dykes Library, University of Kansas Medical Center
[EMAIL PROTECTED]
913-588-7319


 On 7/10/2008 at 10:25 PM, in message
[EMAIL PROTECTED], Eric Lease Morgan
[EMAIL PROTECTED] wrote:
 At the risk of interpreting the original question incorrectly, we  
 have had decent success using the Google Search Appliance to  
 facilitate search across the enterprise (university):
 
* Buy the Appliance.
* Feed it one or more URLs.
* Wait for it to crawl.
* Customize the user interface.
* Allow people to use it.
 
 While we haven't done so, it would not be too difficult to implement
 a sort of federated search within the Appliance's interface. This
 could be done in a number of ways:
 
1. Acquire bibliographic data and
   feed it directly to the Appliance
   via the (poorly) documented SQL
   interface.
 
2. Acquire bibliographic data, save
   it as HTML files, and allow the
   Appliance to crawl the HTML (see
   the sketch after this list).
 
3. License access to bibliographic data,
   making sure it is accessible through
   some sort of API, and write a Google
   OneBox module that queries the data
   and returns results as a part of a
   normal Google Appliance search.
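A minimal sketch of option #2, assuming the bibliographic records have
already been exported as simple field/value pairs (the field names, sample
record, and output directory are made up for illustration); each record
becomes a small static HTML page the Appliance can crawl:

    # Sketch only: write one crawlable HTML page per bibliographic record.
    # Records are plain dicts here; the field names are hypothetical.
    import html
    import os

    records = [
        {"id": "000123", "title": "An example title", "author": "Doe, J.",
         "year": "2005", "issn": "1234-5678"},
    ]

    os.makedirs("bib_html", exist_ok=True)
    for rec in records:
        rows = "\n".join(
            "<p><b>%s:</b> %s</p>" % (field, html.escape(str(value)))
            for field, value in rec.items()
        )
        page = ("<html><head><title>%s</title></head><body>%s</body></html>"
                % (html.escape(rec["title"]), rows))
        with open(os.path.join("bib_html", rec["id"] + ".html"),
                  "w", encoding="utf-8") as out:
            out.write(page)

The resulting directory only has to be exposed at a URL that is then fed to
the Appliance as a start point.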
 
 The larger Google Appliance costs about $30,000, but you purchase it,
 not license it. No annual fees. That will buy you the ability to
 index 500,000 documents. When it comes to a bibliographic database
 (such as a subject index or a library catalog) that is not really
 very much.
 
 We here at Notre Dame did implement Option #3, but it queries the
 local LDAP server to return names and addresses of people, not
 bibliographic citations. [1, 2] I did write a OneBox module to query
 our catalog, but we haven't implemented it yet. It will probably
 appear as a part of the library's "Search This Site" functionality.
 
 In short, I think a Google Appliance is an expensive but viable
option.
 
 [1] Search for a name (ex: Hesburgh) at http://search.nd.edu/ 
 [2] OneBox source code - http://tinyurl.com/6ktxot


[CODE4LIB] anyone know about Inera?

2008-07-11 Thread Steve Oberg
I recently became aware of a company that provides what it terms "reference
correction" software: Inera.  This is the company that powers the CrossRef
Simple Text Query box (http://www.crossref.org/freeTextQuery).

See http://www.inera.com/refcorrection.shtml for more details

Does anyone on this list have any knowledge of this company? I'm just
wondering if it would be better to use what they have rather than continue
to possibly reinvent the wheel for citation parsing.

Steve


Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Jason Ronallo
Steve,
If you need citation parsing, rather than reference correction, maybe
this will work for you:
http://aye.comp.nus.edu.sg/parsCit/

I haven't had a chance to try it yet, though.

Jason

On Fri, Jul 11, 2008 at 11:51 AM, Steve Oberg [EMAIL PROTECTED] wrote:
 I recently became aware of a company that provides what it terms reference
 correction software:  Inera.  This is the company that powers the crossRef
 Simple Text Query box (http://www.crossref.org/freeTextQuery).

 See http://www.inera.com/refcorrection.shtml for more details

 Does anyone on this list have any knowledge of this company? I'm just
 wondering if it would be better to use what they have rather than continue
 to possibly reinvent the wheel for citation parsing.

 Steve



Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Steve Oberg
Jason,

Thanks, yes, I knew of this effort and have actually spent a lot of time
working with this same software (or rather the same underlying software).
But I'm not sure it does enough, or does it well enough, for me at this point.
I'd like to take a list of anywhere from one or two up to hundreds of
citations, dump it into a web form, and get SFX URLs as output.
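A minimal sketch of that kind of batch pipeline, assuming some citation
parser that returns a dict of fields (the parse_citation() stub and the
resolver base URL below are placeholders, not any particular product's API):

    # Sketch only: turn a list of citation strings into SFX OpenURLs.
    # parse_citation() stands in for whatever parser is actually used;
    # a real OpenURL 1.0 request would also carry url_ver, rft_val_fmt, etc.
    from urllib.parse import urlencode

    SFX_BASE = "http://sfx.example.edu/sfx_local"  # placeholder resolver URL

    def parse_citation(text):
        # Placeholder: return whatever fields the parser extracts.
        return {"atitle": "Article title", "jtitle": "Journal title",
                "issn": "", "volume": "", "issue": "", "spage": "", "date": ""}

    def sfx_url(fields):
        params = {"rft." + key: value for key, value in fields.items() if value}
        params["rft.genre"] = "article"
        return SFX_BASE + "?" + urlencode(params)

    for line in ["First citation string ...", "Second citation string ..."]:
        print(sfx_url(parse_citation(line)))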

Steve

On Fri, Jul 11, 2008 at 1:51 PM, Jason Ronallo [EMAIL PROTECTED] wrote:

 Steve,
 If you need citation parsing, rather than reference correction, maybe
 this will work for you:
 http://aye.comp.nus.edu.sg/parsCit/

 I haven't had a chance to try it yet, though.

 Jason

 On Fri, Jul 11, 2008 at 11:51 AM, Steve Oberg [EMAIL PROTECTED] wrote:
  I recently became aware of a company that provides what it terms
 reference
  correction software:  Inera.  This is the company that powers the
 crossRef
  Simple Text Query box (http://www.crossref.org/freeTextQuery).
 
  See http://www.inera.com/refcorrection.shtml for more details
 
  Does anyone on this list have any knowledge of this company? I'm just
  wondering if it would be better to use what they have rather than
 continue
  to possibly reinvent the wheel for citation parsing.
 
  Steve
 



Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Nate Vack
Just out of curiosity, what makes parscit not optimal for this
purpose? Is it too slow? Not accurate enough?

I ask, as I've thought of doing similar things but haven't explored
the software deeply enough to know if it'd work.

Cheers,
-Nate

On Fri, Jul 11, 2008 at 2:44 PM, Steve Oberg [EMAIL PROTECTED] wrote:
 Jason,

 Thanks, yes, I knew of this effort and have actually spent a lot of time
 working with this same software (or rather the same underlying software).
 But I'm not sure it does enough or does it well enough for me at this point.
 I'd like to take a list of one or two, up to hundreds of citations and dump
 it into a web form and output SFX URLs as a result.


Re: [CODE4LIB] Enterprise Search and library collection

2008-07-11 Thread Genny Engel
I did not have stellar results experimenting with a similar approach to
Eric's.  The crawler we use is from Thunderstone, and it does a fine job
of indexing web content with very nice relevancy ranking and "did you
mean" spell-check.  What I found when trying to let it loose against
multiple servers is, when it hits our OPAC, it sees several different
formats per record and ends up more than triple-indexing each title.  It
does have a lot of flexibility in the indexing options, though, so I
could try it again and set it to ignore URL patterns that refer to the
MARC display, etc.  Still, the total cost to index a couple million
pages (which would be needed in order to include all the records in the
OPAC plus the website pages plus the Syndetics added content) is a bit
of a steep one-time outlay.  I'm sure there's some other way to go about
this with Thunderstone's TEXIS rather than using their Webinator
product, but then you have a substantially higher development effort, I
think.  
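For what it's worth, the exclusion idea above is just URL pattern filtering;
a tiny sketch of the logic (the OPAC URL patterns here are invented, and in
practice this would be expressed in Webinator's own include/exclude settings
rather than code):

    # Sketch only: drop duplicate OPAC display formats (e.g. MARC view)
    # from a list of candidate URLs so each title is indexed once.
    import re

    EXCLUDE_PATTERNS = [
        re.compile(r"format=marc", re.I),    # hypothetical MARC-display URL
        re.compile(r"/marc_display", re.I),  # hypothetical alternate form
    ]

    def indexable(url):
        return not any(p.search(url) for p in EXCLUDE_PATTERNS)

    urls = [
        "http://opac.example.org/record=b1234567",
        "http://opac.example.org/record=b1234567?format=marc",
    ]
    print([u for u in urls if indexable(u)])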

Thunderstone now offers faceted search (they call it "parametric
search").  They also make search appliances at different capacity levels.
 Pricing is really pretty reasonable for what you get.
http://www.thunderstone.com/texis/site/pages/Products.html
 
 
 
Genny Engel
Internet Librarian
Sonoma County Library
[EMAIL PROTECTED]
707 545-0831 x581
www.sonomalibrary.org
 


 [EMAIL PROTECTED] 07/11/08 07:36AM 
 In short, I think a Google Appliance is an expensive but viable option.
Relative to other commercial products in the space, the GA or G-mini is
actually very inexpensive.  Another option to add to Eric's list is the
All Access Connector, which adds MuseGlobal's fed search technology to
the Google Appliance.  Of course, it also adds $40K or more to the total
price.
http://wire.jstirnaman.com/2008/05/23/federated-search-for-google-search-appliance/


Jason
-- 

Jason Stirnaman
Digital Projects Librarian/School of Medicine Support
A.R. Dykes Library, University of Kansas Medical Center
[EMAIL PROTECTED] 
913-588-7319




Re: [CODE4LIB] Enterprise Search and library collection [SEC=UNCLASSIFIED]

2008-07-11 Thread Peter Noerr
Hi Steve,

Thanks for a full reply.

We actually do combine data within enterprises, including from their ILS
and subscription Sources (article databases), and internal repositories.
Of course we claim we do it well - and I think we do. A library
background will enable you to face almost any shape of data with aplomb,
if not equanimity.

Data from varied sources is varied in structure, type of content and
level of detail, as you say. It *is* possible to combine it, but it
works best when there is some sort of commonality across the sources.
Fortunately most people when searching provide that focus, so the
theoretical problem is very rarely a practical one - and this business
is all about practical solutions. We do actually have a fair number of
the enterprise search engine vendors as partners where we act as a
selective harvesting capability for them and convert the syntax and
semantics of the harvested records into a uniformity they can easily
ingest and work their indexing magic on.
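As a toy illustration of that normalization step (not MuseGlobal's actual
implementation; the source shapes and field names are invented), records
from differently structured sources can be mapped onto one common set of
fields before indexing:

    # Sketch only: map differently shaped source records onto one
    # common schema so a downstream indexer sees uniform fields.
    COMMON_FIELDS = ("title", "creator", "date", "identifier")

    FIELD_MAPS = {
        "ils": {"245a": "title", "100a": "creator",
                "260c": "date", "001": "identifier"},
        "repository": {"dc:title": "title", "dc:creator": "creator",
                       "dc:date": "date", "dc:identifier": "identifier"},
    }

    def normalize(record, source):
        mapping = FIELD_MAPS[source]
        normalized = {field: "" for field in COMMON_FIELDS}
        for key, value in record.items():
            if key in mapping:
                normalized[mapping[key]] = value
        return normalized

    print(normalize({"dc:title": "Some article", "dc:date": "2008"},
                    "repository"))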

Fence sitting has a long and honourable tradition (both in the UK and
the US), and we 'back both horses' ourselves by being in both the
federated search and content integration space. Thus involved in both
the just-in-case harvesting, and the just-in-time fed searching.

Final thought is that almost everybody we have dealt with is a special
case - most of them in the nicest possible way - so, even for systems
like ours, customization is the order of the day. But that's what
computers allow us to do - adapt to users.

Peter  

 -Original Message-
 From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf
Of
 Steve Oberg
 Sent: Friday, July 11, 2008 12:15 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Enterprise Search and library collection
 [SEC=UNCLASSIFIED]
 
 Peter,
 
   Use a search engine and create an aggregated database/index of all the
   material from the organization, or use a federated search system to
   search the repositories/catalogs/databases/etc. in real time? Did you
   consider both? And why the choice you made?
 
 
  I was not involved in the initial planning. I came in sort of halfway
  through and had to make a lot of the initial planning decisions work.
  (Even while I disagreed with some of those decisions.)  Again, my
  perspective relates mostly to use of catalog data.  However, I would add
  that we did in fact have a federated search tool when I came, but quickly
  discarded it because it couldn't do the more limited functionality we
  were hoping for it to accomplish (present best options to users for where
  to search among our databases and collections according to subject), let
  alone aggregate or search across disparate data repositories.
 
  Personally I find it very difficult to believe that a federated search
  such as what you provide at MuseGlobal can do this sort of enterprise
  combination of data well.  The data is not well structured (except for
  catalog data) and includes an extreme range of completeness and little
  commonality.  What is interesting to note, however, is that on the one
  hand, a vendor such as yourself may claim that you can do this sort of
  stuff well (I'm not saying you said that, just that you might say that).
  On the other hand I find it interesting that the enterprise search tool
  vendor we have, coming from a completely different market and
  perspective, would readily claim they can do all that library stuff --
  that they do in fact offer true federated search.  Which in my personal
  opinion isn't true at all.

  But ideally I would answer your question in this way. I think there
  should be a combination of the two approaches, and that this would be
  more practical and workable than just one or the other.  How's that for
  sitting on the fence :-)
 
 
   Build vs. Buy? It obviously has taken Steve and his colleagues a lot of
   hard work to produce a nice-looking system (except for all those big
   black bits on the screen!) and it obviously takes maintenance (it is
   'fragile'). Do you think it was/is worth it, and if so why?
 
 
  My answer is, it is too soon to tell.  There are many reasons why our
  implementation is probably unique (and I don't mean to imply that it is
  better than someone else's, just that I doubt it could readily be
  replicated elsewhere).  We have a number of very different requirements
  and use cases than what some other library settings might have.  We have
  a large number of constraints on the IT side.  We have had to do a lot
  of custom stuff as a result.  This is probably why it is fragile, more
  than because of deficiencies in any one piece such as the search tool
  itself.

  But we are still, in my view, only at the very early stages of assessing
  the whole package's value for our users.  And we have very particular,
  demanding users.
 
  In sum, we have had to buy AND build, and so it isn't, again, a question
  of one versus the other.
 
 Steve


Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Ross Singer
Actually, SFX is probably not going to care what the title is.

It's much more likely to care about the ISSN, volume and issue.

Now, if the matching targets are EBSCO or Proquest, you might have a
problem (since they accept inbound OpenURLs from SFX), but I'm not
sure, exactly.

How many of these things do you have?
-Ross.

On Fri, Jul 11, 2008 at 3:55 PM, Steve Oberg [EMAIL PROTECTED] wrote:
 One example:

 Here's the citation I have in hand:

 Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al.
 The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone
 Metabolism and Disease in CKD: association with mortality in dialysis
 patients. American Journal of Kidney Diseases 2005; 46(5):925-932.

 Here's the output from ParsCit. Note the problem with the article title:

 <algorithm name="ParsCit" version="1.0">
 <citationList>
 <citation>
 <authors>
 <author>M Noordzij</author>
 <author>Korevaar JC</author>
 </authors>
 <volume>2005</volume>
 <title>Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al. The Kidney
 Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone
 Metabolism and Disease in CKD: association with mortality in dialysis
 patients</title>
 <journal>American Journal of Kidney Diseases</journal>
 <pages>46--5</pages>
 </citation>
 </citationList>
 </algorithm>

 There's more but basically it isn't accurate enough. It's very good but not
 good enough for what I need at this juncture.  OpenURL resolvers like SFX
 are generally only as good as the metadata they are given to parse.  I need
 a high level of accuracy.

 Maybe that's a pipe dream.
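One mechanical way to spot this kind of failure is to read the fields back
out of the XML above and flag suspicious titles; a small sketch, assuming
the ParsCit output has been saved to a local file (the file name is
hypothetical):

    # Sketch only: inspect parsed fields from ParsCit's XML output and
    # flag titles that appear to have swallowed part of the author list.
    import xml.etree.ElementTree as ET

    tree = ET.parse("parscit_output.xml")  # hypothetical file name
    for citation in tree.findall(".//citation"):
        authors = [a.text for a in citation.findall(".//author")]
        title = citation.findtext("title", default="")
        journal = citation.findtext("journal", default="")
        print("authors:", authors)
        print("journal:", journal)
        print("title:  ", title)
        if " et al" in title or len(title) > 200:
            print("  -> title looks like it absorbed trailing author names")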

 Steve

 On Fri, Jul 11, 2008 at 2:48 PM, Nate Vack [EMAIL PROTECTED] wrote:

 Just out of curiosity, what makes parscit not optimal for this
 purpose? Is it too slow? Not accurate enough?

 I ask, as I've thought of doing similar things but haven't explored
 the software deeply enough to know if it'd work.

 Cheers,
 -Nate

 On Fri, Jul 11, 2008 at 2:44 PM, Steve Oberg [EMAIL PROTECTED] wrote:
  Jason,
 
  Thanks, yes, I knew of this effort and have actually spent a lot of time
  working with this same software (or rather the same underlying software).
  But I'm not sure it does enough or does it well enough for me at this
 point.
  I'd like to take a list of one or two, up to hundreds of citations and
 dump
  it into a web form and output SFX URLs as a result.




Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Steve Oberg
Ross,


Actually, SFX is probably not going to care what the title is.

 It's much more likely to care about the ISSN, volume and issue.


Yes, true. But linking to full text is only partly the issue when it comes
to using SFX in this way.  I also want to ensure that those articles that we
don't already have available in full text are directly routed to our
internal doc. delivery form (in SFX speak, using svc.ill=yes in the
OpenURL).  This would of course mean that however the citation got parsed is
how that form is filled out.  Incorrect title information is a problem in
this case.
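For illustration only, a minimal sketch of adding that flag when the
OpenURL is built (the resolver base URL and the citation values are
placeholders; svc.ill=yes is the SFX flag mentioned above):

    # Sketch only: add svc.ill=yes so unfilled citations route to the
    # internal document delivery form. Values below are placeholders.
    from urllib.parse import urlencode

    params = {
        "rft.genre": "article",
        "rft.jtitle": "Journal title here",
        "rft.volume": "46",
        "rft.issue": "5",
        "rft.spage": "925",
        "rft.date": "2005",
        "svc.ill": "yes",  # request the doc delivery service
    }
    print("http://sfx.example.edu/sfx_local?" + urlencode(params))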

Now, if the matching targets are EBSCO or Proquest, you might have a
 problem (since they accept inbound OpenURLs from SFX), but I'm not
 sure, exactly.

 How many of these things do you have?


Literally, possibly thousands. I can't divulge a great amount of detail
about the exact use (there we go again with that restriction on info.). But
let's just say there are many very large documents (in PDF or Word), each of
which contains between 100 and 400 article citations, that I am working with.
Why on earth try to provide article-level OpenURLs? Well, for many reasons.
I fully realize how much of a risk that is in terms of reliability and
maintenance.  But right now I just want a way to do this in bulk with a high
level of accuracy.

Steve


Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Erik Hetzner
At Fri, 11 Jul 2008 14:55:18 -0500,
Steve Oberg [EMAIL PROTECTED] wrote:
 
 One example:
 
 Here's the citation I have in hand:
 
 Noordzij M, Korevaar JC, Boeschoten EW, Dekker FW, Bos WJ, Krediet RT et al.
 The Kidney Disease Outcomes Quality Initiative (K/DOQI) Guideline for Bone
 Metabolism and Disease in CKD: association with mortality in dialysis
 patients. American Journal of Kidney Diseases 2005; 46(5):925-932.
 
 Here's the output from ParsCit. Note the problem with the article title:

 […]

The output is a little different from what I get from the parsCit web
service. The parsCit authors recently published a new paper on a new
version of their system with a new engine, which you might want to
look at [1].

 There's more but basically it isn't accurate enough. It's very good but not
 good enough for what I need at this juncture.  OpenURL resolvers like SFX
 are generally only as good as the metadata they are given to parse.  I need
 a high level of accuracy.
 
 Maybe that's a pipe dream.

I doubt that the software provided by Inera performs better than
parsCit. Inera does find a DOI for that citation but that is not
nearly so hard as determining which parts of a citation are which.
parsCit is pretty cutting edge and provides some of the best numbers I
have seen. The Flux-CiM system [2] also has pretty good numbers, but
the code for it is not available. I’ve also done a little bit of work
on this, which you might want to have a look at. [3]

One of the problems may be that the parsCit you are dealing with has
been trained on the Cora dataset of computer science citations. It is
a reasonably heterogeneous dataset of citations but it doesn’t have a
lot that looks like that health sciences format. If your citations are
largely drawn from the health sciences you might see about training it
on a health sciences dataset; you will probably get much better
results.

best,
Erik Hetzner

1. Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit: An
open-source CRF reference string parsing package. In Proceedings of
the Language Resources and Evaluation Conference (LREC 08), Marrakesh,
Morocco, May. Available from http://wing.comp.nus.edu.sg/parsCit/#p

2. Eli Cortez C. Vilarinho, Altigran Soares da Silva, Marcos André
Gonçalves, Filipe de Sá Mesquita, Edleno Silva de Moura. FLUX-CIM:
flexible unsupervised extraction of citation metadata. In Proceedings
of the 8th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2007),
pp. 215-224.

3. Erik Hetzner. A simple method for citation metadata extraction using
hidden Markov models. In Proc. of the Joint Conf. on Digital Libraries
(JCDL 2008), Pittsburgh, Pa., 2008.
http://gales.cdlib.org/~egh/hmm-citation-extractor/
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3




Re: [CODE4LIB] anyone know about Inera?

2008-07-11 Thread Nate Vack
On Fri, Jul 11, 2008 at 3:57 PM, Steve Oberg [EMAIL PROTECTED] wrote:

 I fully realize how much of a risk that is in terms of reliability and
 maintenance.  But right now I just want a way to do this in bulk with a high
 level of accuracy.

How bad is it, really, if you get some (5%?) bad requests into your
document delivery system? Customers submit poor quality requests by
hand with some frequency, last I checked...

Especially if you can hack your system to deliver the original
citation all the way into your doc delivery system, you may be able to
make the case that 'this is a good service to offer; let's just deal
with the bad parses manually.'

Trying to solve this via pure technology is gonna get into a world of
diminishing returns. A surprising number of citations in references
sections are wrong. Some correct citations are really hard to parse,
even by humans who look at a lot of citations.

ParsCit has, in my limited testing, worked as well as anything I've
seen (commercial or OSS), and much better than most.

My $0.02,
-Nate


Re: [CODE4LIB] Webfeet, Encompass WAS: Ser Sol 360 Search

2008-07-11 Thread Joe Shubitowski
Hi Dave,

National Library of New Zealand still uses Encompass for their Discover service:
http://discover.natlib.govt.nz/

They were Enc development partners with us way back when, and have a ton
invested in this. Not sure of a contact person anymore, but I can probably
rustle someone up if you need specifics.

Cheers,
Joe Shubitowski

-- 

Joseph M. Shubitowski
Head, Information Systems
Getty Research Institute
1200 Getty Center Drive, Suite 1100
Los Angeles CA 90049-1688
Voice: 310-440-6394
Fax: 310-440-7780
[EMAIL PROTECTED] 

 On 7/11/2008 at 8:55 AM, in message
[EMAIL PROTECTED], Walker, David
[EMAIL PROTECTED] wrote:
 Thanks to everyone who responded to my earlier request.
 
 On to the next system: If your library licenses the Webfeat metasearch 
 system, would you mind contacting me off-list?  I have similar questions to 
 ask you all.
 
 Also -- and I realize I'm reaching here -- if you happen to have the
 now-defunct Endeavor Encompass system still up and running somewhere
 (even if it's out of public view) would you mind contacting me.
 
 Thanks!
 
 --Dave
 
 
 ==
 David Walker
 Library Web Services Manager
 California State University
 http://xerxes.calstate.edu 
 
 From: Walker, David
 Sent: Monday, July 07, 2008 8:57 AM
 To: CODE4LIB@LISTSERV.ND.EDU 
 Subject: Ser Sol 360 Search
 
 Hi All,
 
 I'm giving a conference presentation later this month on metasearch.  If 
 your library licenses Serial Solutions' metasearch system, would you mind 
 contacting me off-list?  I'd like to ask a couple of questions.  Thanks!
 
 --Dave
 
 
 ---
 David Walker
 Library Web Services Manager
 California State University
 http://xerxes.calstate.edu