Re: [CODE4LIB] Strategy for assigning DOIs?
Hi, Jason, I strongly suggest keeping your DOI namespace/naming schema totally independent of your choice of repository/system. DOI is an infrastructure thing, and the main reason for assigning DOIs is persistence and permanence. At some point any repository system will go away and be replaced by another one. Secondly, I do not think "data" is a good namespace. I suggest using something persistent that can stand alone even if you did not have the DOI prefix 10.17348. Yan On 2/9/16, 11:56 AM, "Code for Libraries on behalf of Jason Best" wrote: >We recently started assigning DOIs to articles published in one of our >journals using Open Journal System which generates the DOI and metadata within >a namespace dedicated to that journal. We don’t yet have an institutional >repository, but are moving in that direction and I hope we have one in a >couple of years. But in the meantime, how could we go about issuing DOIs for >items that aren’t related to the journal, but that we’d hope to eventually >have handled by our IR? For example, we have a handful of datasets for which >we’d like to issue DOIs so I planned on creating a “data” namespace then just >adding a serial number for each dataset (e.g. 10.17348/data.01 ) which >would resolve to a page (with metadata and download links) in our Drupal CMS >until we get an IR. Will such an approach allow us to eventually use an IR to >1) become the repository for the items with DOIs previously issued in the >“data” namespace and 2) continue issuing DOIs for new items within > the “data” namespace? I know the answer is going to depend on the IR > platform we use, so I’m asking this in the broad sense to get your input > about your experiences. > >But since DSpace is one of the likely candidates for our IR, I’ll use it as a >more concrete example. From my limited understanding (just reading the >documentation), items deposited in a DSpace instance will all share the same >DOI namespace. The namespace and an internal identifier are then concatenated >with the DOI prefix to create the DOI. If we’ve already issued a DOI outside >of DSpace, would we have any control over the identifier that was assigned to >a newly-deposited item allowing us to control the DOI that is generated? > >Any thoughts or suggestions? > >Thanks, >Jason > >Jason Best >Biodiversity Informatician >Botanical Research Institute of Texas >1700 University Drive >Fort Worth, Texas 76107 > >817-332-4441 ext. 230 >http://www.brit.org
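A minimal sketch of the kind of opaque, repository-independent suffix being recommended above; the prefix 10.17348 comes from the thread, while the suffix length, alphabet, and the idea of keeping a local suffix-to-landing-URL table are illustrative assumptions rather than an established policy:

import java.security.SecureRandom;

// Mint opaque, repository-independent DOI suffixes (e.g. 10.17348/b4k7q2ndr).
// The suffix carries no meaning ("data", serial numbers, system IDs), so the
// DOI survives any future migration from Drupal to DSpace or another IR.
public class DoiMinter {
    private static final String PREFIX = "10.17348";  // prefix from the thread
    // digits plus consonants; vowels and easily-confused letters left out
    private static final String ALPHABET = "0123456789bcdfghjkmnpqrstvwxz";
    private static final SecureRandom RANDOM = new SecureRandom();

    public static String mint(int length) {
        StringBuilder suffix = new StringBuilder();
        for (int i = 0; i < length; i++) {
            suffix.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        // record suffix -> current landing URL in a table you control
        return PREFIX + "/" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(mint(9));  // e.g. 10.17348/b4k7q2ndr
    }
}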
[CODE4LIB]
Yes. Use iText or PDFBox; these are common PDF libraries. On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" wrote: >Hi all, > >I am working with PDF files in some South Asian and South East Asian >languages. Each PDF has ActualText added for each tag in the PDF. Each PDF >has ActualText as an alternative for the visible text layer in the PDF. > >Is anyone aware of tools that will allow me to index and search PDFs based >on the ActualText content rather than the visible text layers in the PDF? > >Andrew > >-- >Andrew Cunningham >lang.supp...@gmail.com
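A rough sketch of the PDFBox route, assuming PDFBox 2.x; the class and method names (PDFMarkedContentExtractor, PDMarkedContent.getActualText()) should be verified against the release actually used, and the collected string would be handed to whatever indexer (Solr, Lucene, etc.) sits behind the search:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;

// Collect the ActualText entries of a tagged PDF so they can be indexed
// instead of the visible text layer.
public class ActualTextDump {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            StringBuilder indexable = new StringBuilder();
            for (PDPage page : doc.getPages()) {
                PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
                extractor.processPage(page);
                for (PDMarkedContent mc : extractor.getMarkedContents()) {
                    String actual = mc.getActualText();  // null when the tag has no /ActualText
                    if (actual != null) {
                        indexable.append(actual).append(' ');
                    }
                }
            }
            System.out.println(indexable);  // hand this string to your search engine
        }
    }
}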
[CODE4LIB] EMPLOYMENT OPPORTUNITY: Department Head, Office of Digital Innovation and Stewardship (ODIS)
Please share the posting with interested parties. Tucson has mild winter and dry / warm summer. The person will be working with engaged and nice colleagues. EMPLOYMENT OPPORTUNITY Department Head, Office of Digital Innovation and Stewardship The University of Arizona Libraries, Digital Innovation/Stewardship (Dept. 1705) Classification: Administrator/Appointed Professional; Full-Time; Exempt Location: Main Campus, Tucson Position Summary: The University Libraries seek a dynamic, innovative Head of the Office of Digital Innovation and Stewardship (ODIS), a position with the primary responsibility of providing leadership and strategic direction for digital innovation and stewardship within the broader context of the strategic plans of the University Libraries and the University of Arizona. ODIS provides a broad range of services including digital collections, data management, campus repository, metadata, journal hosting and publishing, copyright and scholarly communication, open access, and geospatial data. In overseeing several areas of strategic importance, the Department Head must be forward thinking and willing to take strategic risks in the development of services. The Department Head will be a member of the Libraries Cabinet (leadership, policy and management team) and reports to the Vice Dean of Libraries. The Department Head of ODIS will be responsible for leadership, management, and planning for the services and functions of the Office of Digital Innovation and Stewardship, which includes 8 FTE permanent professionals and a large team of students and temporary employees. ODIS members work collaboratively, engaging the strengths and knowledge of all members of the department. The Department Head will coordinate and facilitate leadership currently in place among ODIS faculty and staff. As UA librarians have faculty status, the Department Head is responsible for coaching and guiding librarians through the promotion and continuing status process. The Department Head will also be responsible for ensuring that department planning furthers the strategic goals for the Libraries and campus. This is a continuing-eligible, academic professional position. Incumbents are members of the general faculty and are entitled to all accompanying rights and privileges granted by the Arizona Board of Regents and the University of Arizona. Retention and promotion are earned through achievement of a record of excellence in position effectiveness, scholarship, and service. The Office of Digital Innovation and Stewardship (ODIS) at the University of Arizona Libraries engages and innovates across a range of services and content in support of the University’s mission and strategic plan. ODIS provides services to the University community that encompass data management, campus repository, metadata, journal hosting and publishing, copyright and scholarly communication, open access, and geospatial data. ODIS is responsible for programmatic planning and oversight of the Libraries digital collections and digitization activities, including digital preservation and digital asset management efforts. ODIS coordinates strategies for exposing unique and local digital collections. ODIS also leads and contributes to a variety of national and international collaborative efforts, including TRAIL (Technical Report Archive and Image Library) and the Afghanistan Digital Collections. 
ODIS is active in campus-wide efforts related to scholarly activity and research data, participates in the University’s Research Computing Governance Committee, leads the institution’s faculty activity reporting efforts, and collaborates with the University’s Office of Research and Discovery, and University Information Technology Services. In this process, ODIS collaborates with faculty and staff throughout the University Libraries and across campus. The University of Arizona has been recognized on Forbes 2015 list of America’s Best Employers in the United States and has been awarded the 2015 Work-Life Seal of Distinction by the Alliance for Work-Life Progress! For more information about working at the University Libraries, see http://www.library.arizona.edu/about/employment/why. Diversity Commitment: At the University of Arizona, we value our inclusive climate because we know that diversity in experiences and perspectives is vital to advancing innovation, critical thinking, solving complex problems, and creating an inclusive academic community. Diversity in our environment embraces the acceptance of a multiplicity of cultural heritages, lifestyles and worldviews. We translate these values into action by seeking individuals who have experience and expertise working with diverse students, colleagues and constituencies, as we believe that such experiences are both institutional and service imperatives. Because we seek a workforce with diverse perspectives and experiences, we encourage applications
Re: [CODE4LIB] Amazon Glacier - tracking deposits
Be aware of data transfer costs if you are using Glacier. Glacier is an excellent choice for archival use, but you want to be sure these files will not be accessed often. You should consider the total cost of ownership, including data transfer costs, which could be very expensive if you retrieve more than roughly 5% of your data; it adds up quickly if you do not check carefully. I have a to-be-published article in Library Hi Tech discussing Amazon S3 and Glacier, including the history of data transfer and storage costs over the past 7 years. For IDs, I designed and implemented a unique persistent ID system for all the digital files (which is also used as the DOI if needed). Yan Han The University of Arizona Libraries On 4/9/15, 4:13 AM, Scancella, John j...@loc.gov wrote: Have you looked at google's cloud storage nearline? it is about $0.01 per gigabyte per month with about 3 second access time http://googlecloudplatform.blogspot.com/2015/03/introducing-Google-Cloud-Storage-Nearline-near-online-data-at-an-offline-price.html -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Cary Gordon Sent: Wednesday, April 08, 2015 7:49 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Amazon Glacier - tracking deposits We have been playing with Glacier, but so far neither us nor our clients have been convinced of its cost-effectiveness. A while back, we were discussing a project with 15 PB of archival assets, and that would certainly have made Glacier cost-effective, saving about $30k/mo. over S3, although requests could cut into that. The Glacier location is in the format /Account ID/vaults/Vault Name/archives/Archive ID, so you might want to consider using the whole string. Thanks, Cary On Apr 8, 2015, at 3:32 PM, Sara Amato sam...@willamette.edu wrote: Has anyone leapt on board with Glacier? We are considering using it for long term storage of high res archival scans. We have derivative copies for dissemination, so don't intend touching these often, if ever. The question I have is how to best track the Archive ID that glacier attaches to deposits, as it looks like that is the only way to retrieve information if needed (though you can attach a brief description also that appears on the inventory along with the id.) We're considering putting the ID in Archivist Toolkit, where the location of the dissemination copies is noted, but am wondering if there are other tools out there specific for this scenario that people are using.
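A back-of-the-envelope sketch of the total-cost-of-ownership point; every rate below is an illustrative placeholder from the era of this thread, not current AWS pricing, so plug in the figures from your own bill:

// Rough monthly cost model for Glacier: storage plus retrieval and transfer out.
// All rates are assumed placeholders, not quoted AWS prices.
public class GlacierCostSketch {
    public static void main(String[] args) {
        double storedTb = 50.0;            // archive size in TB
        double retrievedTbPerMonth = 5.0;  // how much you pull back out per month
        double storagePerGbMonth = 0.01;   // $/GB-month (assumed)
        double retrievalPerGb = 0.01;      // $/GB retrieval fee (assumed)
        double transferOutPerGb = 0.09;    // $/GB transfer to the internet (assumed)

        double storage = storedTb * 1024 * storagePerGbMonth;
        double retrieval = retrievedTbPerMonth * 1024 * (retrievalPerGb + transferOutPerGb);
        System.out.printf("storage $%.2f + retrieval/transfer $%.2f = $%.2f per month%n",
                storage, retrieval, storage + retrieval);
    }
}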
Re: [CODE4LIB] : Persian Romanization table
Hello, Charles, The plan is to write a program which can use a pre-defined language mapping XML file. One language needs one pre-defined mapping XML file, so that any language can have its own mapping (extensible for future language transliteration). In this case, a Persian language mapping XML file, and a Pashuto language mapping XML file. Thanks for the language tool. I will take a look. Yan -Original Message- From: Riley, Charles [mailto:charles.ri...@yale.edu] Sent: Wednesday, April 17, 2013 5:31 PM To: lit...@ala.org; Jacobs, Jane W; Code for Libraries (CODE4LIB@LISTSERV.ND.EDU) Cc: Seyede Pouye Khoshkhoosani Subject: [lita-l] RE: : Persian Romanization table Hi Yan, Sounds like a really interesting project. Is the intent to support going from Persian to Pashto directly, as well as from each language to Roman script? Among the natural language processing tools found here-- http://www.ling.ohio-state.edu/~jonsafari/persian_nlp.html --the one that *might* be the most helpful is the link to the Persian Lexical Project, where the romanized orthography used is one that accounts for vowels inserted between the consonants. It's not a large dataset, but carries a GPLv2 license--maybe useful in some testing, and see if it's worth expanding on the effort. Best, Charles Riley From: Han, Yan [h...@u.library.arizona.edu] Sent: Wednesday, April 17, 2013 8:14 PM To: Jacobs, Jane W; Code for Libraries (CODE4LIB@LISTSERV.ND.EDU); lit...@ala.org Cc: Seyede Pouye Khoshkhoosani Subject: [lita-l] RE: : Persian Romanization table Hello, All and Jane First I would like to appreciate Jane Jacobs at Queens Library providing me Urdu Romanization table. As we are working on creating Persian/Pushutu transliterate software, my Persian language expert has the following question : In according to our conversation for transliterating Persian to Roman letters, I faced a big problem: As the short vowels do not show up on or under the letters in Persian, how a machine can read a word in Persian. For example we have the word “پدر ; to the machine this word is PDR, because it cannot read the vowels. There is no rule for the short vowels in the Persian language; so the machine does not understand if the first letter is “pi”, “pa” or “po”. Is there any way to overcome this obstacle? This seems to me that we missed a critical piece of information here. (Something like a dictionary). Without it, there is no way to have good translation from computer. We will have to have a Persian speaker to check/correct the computer's transliteration. Any suggestions ? Thanks, Yan -Original Message- From: Jacobs, Jane W [mailto:jane.w.jac...@queenslibrary.org] Sent: Wednesday, January 23, 2013 6:28 AM To: Han, Yan Subject: RE: : Persian Romanization table Hi Yan, As per my message to the listserve, here are the config files for Urdu. If you do a Persian config file, I d love to get it and if possible add it to the MARC::Detrans site. Let me know if you want to follow this road. JJ -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Han, Yan Sent: Tuesday, January 22, 2013 5:31 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] : Persian Romanization table Hello, All, I have a project to deal with Persian materials. I have already uses Google Translate API to translate. Now I am looking for an API to transliterate /Romanize (NOT Translate) Persian to English (not English to Persian). In other words, I have Persian in, and English out. 
There is a Romanization table (Persian romanization table - Library of Congress, http://www.loc.gov/catdir/cpso/romanization/persian.pdf). For example, كتاب should output as Kitāb. My finding is that existing tools only do the opposite: 1. Google Transliterate: you enter English, output Persian (Input “Bookmark”, output “بوکمارک”, Input “بوکمارک”, output “بوکمارک”) 2. OCLC language: the same as Google Transliterate. 3. http://mylanguages.org/persian_romanization.php : works, but no API. Does anyone know if such an API exists? Thanks much, Yan To maximize your use of LITA-L or to unsubscribe, see http://www.ala.org/lita/involve/email
Re: [CODE4LIB] : Persian Romanization table
Hello, All and Jane, First I would like to thank Jane Jacobs at Queens Library for providing me with the Urdu Romanization table. As we are working on creating Persian/Pashto transliteration software, my Persian language expert has the following question: Following our conversation about transliterating Persian to Roman letters, I face a big problem: since the short vowels do not show up on or under the letters in Persian, how can a machine read a word in Persian? For example, we have the word “پدر”; to the machine this word is PDR, because it cannot read the vowels. There is no rule for the short vowels in the Persian language, so the machine does not know whether the first letter is “pi”, “pa”, or “po”. Is there any way to overcome this obstacle? It seems to me that we are missing a critical piece of information here (something like a dictionary). Without it, there is no way to get a good transliteration from the computer; we will have to have a Persian speaker check/correct the computer's transliteration. Any suggestions? Thanks, Yan -Original Message- From: Jacobs, Jane W [mailto:jane.w.jac...@queenslibrary.org] Sent: Wednesday, January 23, 2013 6:28 AM To: Han, Yan Subject: RE: : Persian Romanization table Hi Yan, As per my message to the listserve, here are the config files for Urdu. If you do a Persian config file, I'd love to get it and if possible add it to the MARC::Detrans site. Let me know if you want to follow this road. JJ -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Han, Yan Sent: Tuesday, January 22, 2013 5:31 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] : Persian Romanization table Hello, All, I have a project to deal with Persian materials. I have already used the Google Translate API to translate. Now I am looking for an API to transliterate/romanize (NOT translate) Persian to English (not English to Persian). In other words, I have Persian in, and English out. There is a Romanization table (Persian romanization table - Library of Congress, http://www.loc.gov/catdir/cpso/romanization/persian.pdf). For example, كتاب should output as Kitāb. My finding is that existing tools only do the opposite: 1. Google Transliterate: you enter English, output Persian 2. OCLC language: the same as Google Transliterate. 3. http://mylanguages.org/persian_romanization.php : works, but no API. Does anyone know if such an API exists? Thanks much, Yan
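A minimal sketch of the mapping-file-plus-exception-dictionary approach being discussed; the sample letter mappings, the romanized form "pidar", and the idea of loading the table from a per-language XML file are illustrative assumptions, and a real implementation would need the full LC romanization rules plus a much larger human-built dictionary:

import java.util.LinkedHashMap;
import java.util.Map;

// Letter-by-letter mapping with a small exception dictionary for words whose
// short (unwritten) vowels cannot be recovered mechanically.
public class PersianTranslitSketch {
    private static final Map<Character, String> LETTERS = new LinkedHashMap<>();
    private static final Map<String, String> EXCEPTIONS = new LinkedHashMap<>();
    static {
        LETTERS.put('پ', "p");
        LETTERS.put('د', "d");
        LETTERS.put('ر', "r");
        // ... in practice, loaded from the per-language mapping XML file
        EXCEPTIONS.put("پدر", "pidar");  // "father": vowels supplied by a dictionary entry
    }

    public static String romanize(String word) {
        if (EXCEPTIONS.containsKey(word)) {
            return EXCEPTIONS.get(word);  // dictionary lookup wins
        }
        // fallback: consonant skeleton only, e.g. "pdr", flagged for human review
        StringBuilder out = new StringBuilder();
        for (char c : word.toCharArray()) {
            out.append(LETTERS.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(romanize("پدر"));
    }
}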
[CODE4LIB] III loading module cannot handle non-English characters
Hello, We have problems using the III loading module to load MARC files (.mrc) into our catalog. This is done using Data Exchange Load Electronic Records (itm). Basically, non-English characters (French, Arabic, etc.) are changed to unknown symbols. The MARC files (.mrk and .mrc) are verified before loading into III. I see only two possibilities: 1. the III configuration might be wrong; 2. the III loading module has a bug and probably does not know how to deal with non-English characters. Has anyone had a similar experience or resolved it? Thanks, Yan
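One quick diagnostic before deciding between the two possibilities: MARC leader position 09 is the character coding scheme ('a' = UCS/Unicode, blank = MARC-8). If the file says one thing and the Data Exchange load profile is configured for the other, accented and Arabic characters will come out as unknown symbols. A minimal check using plain byte inspection (no MARC library assumed):

import java.io.FileInputStream;
import java.io.IOException;

// Read the first record's 24-byte leader from a .mrc file and report the
// character coding scheme (leader/09): 'a' = UCS/Unicode (UTF-8), ' ' = MARC-8.
public class MarcEncodingCheck {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream(args[0])) {
            byte[] leader = new byte[24];
            if (in.read(leader) != 24) {
                throw new IOException("File too short to contain a MARC leader");
            }
            char scheme = (char) leader[9];
            System.out.println(scheme == 'a' ? "UTF-8 (leader/09 = 'a')"
                    : scheme == ' ' ? "MARC-8 (leader/09 = blank)"
                    : "Unexpected leader/09 value: '" + scheme + "'");
        }
    }
}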
[CODE4LIB] : Persian Romanization table
Hello, All, I have a project dealing with Persian materials. I have already used the Google Translate API to translate. Now I am looking for an API to transliterate/romanize (NOT translate) Persian to English (not English to Persian). In other words, I have Persian in, and English out. There is a Romanization table (Persian romanization table - Library of Congress, http://www.loc.gov/catdir/cpso/romanization/persian.pdf). For example, كتاب should output as Kitāb. My finding is that existing tools only do the opposite: 1. Google Transliterate: you enter English, output Persian (Input “Bookmark”, output “بوکمارک “, Input “بوکمارک “, output “بوکمارک “) 2. OCLC language: the same as Google Transliterate. 3. http://mylanguages.org/persian_romanization.php : works, but no API. Does anyone know if such an API exists? Thanks much, Yan
Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?
You can just buy a node from a variety of cloud providers such as Amazon EC2, Linode etc. (It is very easy to build anything you want). Yan -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Cindy Harper Sent: Sunday, March 06, 2011 10:54 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] LAMP Hosting service that supports php_yaz? At the risk of exhausting my quota of messages for the month - Our LAMP hosting service does not support PECL extension php_yaz. Does anyone know of a service that does? Cindy Harper, Systems Librarian Colgate University Libraries char...@colgate.edu 315-228-7363
Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?
Updating L, A, and M is easy with Ubuntu/Debian; I am not sure about the PHP part. If you are worried about hacking/security, you can use a monitoring service. I know Google has a Java/Python platform; I do not know who provides a PHP/Perl one at that level. (Yes, it is a little easier when someone takes care of updating for you.) Yan Han, Associate Librarian The University of Arizona Libraries Phone: (520)307-2823 Email: h...@u.library.arizona.edu From: Cindy Harper [mailto:char...@colgate.edu] Sent: Monday, March 07, 2011 11:18 AM To: Code for Libraries Cc: Han, Yan Subject: Re: [CODE4LIB] LAMP Hosting service that supports php_yaz? I guess I was hoping to have service such as that provided by my current hosting service, where security, etc., updates for L A M P are all taken care of by the host. Any recommendations along those lines? One that provides that and still lets me install what I want? My service suggested that I go to a VPS account, where I'd have to do my own updates. Cindy Harper, Systems Librarian Colgate University Libraries char...@colgate.edu 315-228-7363 On Mon, Mar 7, 2011 at 11:00 AM, Han, Yan h...@u.library.arizona.edu wrote: You can just buy a node from a variety of cloud providers such as Amazon EC2, Linode etc. (It is very easy to build anything you want). Yan -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Cindy Harper Sent: Sunday, March 06, 2011 10:54 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] LAMP Hosting service that supports php_yaz? At the risk of exhausting my quota of messages for the month - Our LAMP hosting service does not support PECL extension php_yaz. Does anyone know of a service that does? Cindy Harper, Systems Librarian Colgate University Libraries char...@colgate.edu 315-228-7363
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
I would suggest looking at DSpace, Fedora, and EPrints. DSpace is fairly easy to implement and has embargo support in 1.6 (https://wiki.duraspace.org/display/DSTEST/Embargo ). I have an article comparing DSpace and Fedora, but it was written 6 years ago; DSpace has not changed much since then, but Fedora is a different story. Yan -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Deng, Sai Sent: Wednesday, October 20, 2010 10:33 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)? Hello, list, Do you know the Digital Library systems which can search within the documents (e.g. PDFs) and handle access restrictions (e.g. DRM)? Has any of you compared these DL systems? Thanks for any information! Sophie
Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?
DSpace does full-text search; you need to turn it on in the configuration file. See UAL's instance at http://arizona.openrepository.com/arizona/ Yan -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Deng, Sai Sent: Wednesday, October 20, 2010 2:14 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)? For access restriction, I mean we would like to have certain documents open only to certain communities (UpLib cannot do that, right?). I don't know how DRM affects file indexing. On second thought, I searched for DSpace full text search and found this: https://wiki.duraspace.org/display/DSPACE/Configure+full+text+indexing However, I haven't seen any instance which shows the full text search results as I would see from vendor databases. Any idea on what system might be good/best for search within documents and DRM? Thank you for the reply! Sophie From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bill Janssen [jans...@parc.com] Sent: Wednesday, October 20, 2010 4:01 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)? Deng, Sai sai.d...@wichita.edu wrote: Do you know the Digital Library systems which can search within the documents (e.g. PDFs) and handle access restrictions (e.g. DRM)? Not sure what you mean by handle access restrictions. Do you mean it can index the documents put into it even if they have DRM encumbrances? UpLib has search within the documents -- if you search for a word or phrase, it shows you all the documents which match, but also all the pages in each document which match. Supports a wide variety of document formats, from JPEG2000 to PDF to Powerpoint. But as far as I know it doesn't deal with DRM restrictions. Bill
[CODE4LIB] Amazon EC2 ports: only 80 and 8080?
Hello, We would currently like to have an Amazon EC2 node hosting two applications, DSpace and Koha (so we need four ports). However, it seems that only ports 80 and 8080 are available; any other ports are not accessible from outside. Does anyone have similar experience and know how to open other ports? Thanks, Yan
[CODE4LIB] OCR for handwritten pages
Hello, Colleagues, Does anyone know of or use any OCR software that works on handwritten pages, or that is at least better than hiring a student to key the text in? I know of OCR packages such as ABBYY, but they do not work on handwriting. Thanks, Yan
Re: [CODE4LIB] Assigning DOI for local content
Please explain in more detail; that would be more helpful. It has been a while, but back in 2007 I checked PURL's architecture, and it strictly handled web addresses only. Of course, the current HTTP protocol is not going to last forever, and there are other protocols on the Internet, so PURL's coverage is not enough. From PURL's website, it still says "PURLs (Persistent Uniform Resource Locators) are Web addresses that act as permanent identifiers in the face of a dynamic and changing Web infrastructure." I am not sure what "web addresses" means here. http://www.purl.org/docs/help.html#overview says "PURLs are Persistent Uniform Resource Locators (URLs). A URL is simply an address on the World Wide Web." We all know that the World Wide Web is not the Internet. What if an information resource can be accessed through other Internet protocols (FTP, VoIP, etc.)? This is the limitation of PURL. PURL is being re-architected, though I cannot find more documentation. The Handle System, by contrast, describes itself as "a general purpose distributed information system that provides efficient, extensible, and secure HDL identifier and resolution services for use on networks such as the Internet." http://www.handle.net/index.html Notice the difference in definition. Yan -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Ross Singer Sent: Wednesday, November 18, 2009 8:11 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Assigning DOI for local content On Wed, Nov 18, 2009 at 12:19 PM, Han, Yan h...@u.library.arizona.edu wrote: Currently DOI uses Handle (technology) with it social framework (i.e. administrative body to manage DOI). In technical sense, PURL is not going to last long. I'm not entirely sure what this is supposed to mean (re: purl), but I'm pretty sure it's not true. I'm also pretty sure there's little to no direct connection between purl and doi despite a superficial similarity in scope. -Ross.
Re: [CODE4LIB] Assigning DOI for local content
Currently DOI uses Handle (the technology) with its own social framework (i.e., an administrative body to manage DOIs). In a technical sense, PURL is not going to last long. CrossRef handles DOI registration in the U.S.; in Europe and Asia, other organizations handle it. DOI is also currently going through the ISO standardization process. The other fact is that DOI has greater usage than any other persistent identifier. More info can be found at http://www.doi.org/faq.html Yan -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Jodi Schneider Sent: Tuesday, November 17, 2009 4:59 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Assigning DOI for local content The first question is: what are they trying to accomplish by having DOIs? Do they have a long-term plan for persistence of their content? Financial plan? If they're looking for persistent identifiers, I don't understand (a priori), why DOI is better, as an identifier scheme, than any other 'persistent identifier scheme' (ARK [1], PURL, Handle, etc[2]). (Though I really like CrossRef and the things they're doing.) [1] http://www.cdlib.org/inside/diglib/ark/ [2] http://www.persistent-identifier.de/english/204-examples.php -Jodi On Tue, Nov 17, 2009 at 11:44 PM, Bucknell, Terry t.d.buckn...@liverpool.ac.uk wrote: You should be able to find all the information you need about CrossRef fees and rules at: http://www.crossref.org/02publishers/20pub_fees.html and http://www.crossref.org/02publishers/59pub_rules.html Information about the system of registering and maintaining DOIs is at: http://www.crossref.org/help/ Note that as well as registering DOIs for the articles in LLT, LLT would be obliged to link to the articles cited by LLT articles (for cited articles that have DOIs too). Looking at the LLT site, it looks like they would have to change their 'abstract' pages to 'abstract plus cited refs', or change the way that their PDFs are created so that they include DOI links for cited references. (Without this the whole system would fail: publishers would expect traffic to come to them, but wouldn't have to send traffic elsewhere). I'd agree that DOIs are in general a Good Thing (and for e-books / e-book chapters, and reference work entries as well as e-journal articles). The CrossRef fees are deliberately set so as not to exclude single-title publishers. Here's an example of a single-title, university-based e-journal in the UK that provides DOIs, so it must be a CrossRef member: http://www.bioscience.heacademy.ac.uk/journal/. Terry Bucknell Electronic Resources Manager University of Liverpool -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Jonathan Rochkind Sent: 17 November 2009 23:20 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Assigning DOI for local content So I have no actual experience with this. But you have to pay for DOI's. I've never done it, but I don't think you necessarily have to run your own purl server -- CrossRef takes care of it. Of course, if your documents are going to be moving all over the place, if you run your own purl server and register your purls with CrossRef, then when a document moves, you can update your local purl server; otherwise, you can update CrossRef, heh. It certainly is useful to have DOIs, I agree. I would suggest they should just contact cross-ref and get information on the cost, and what their responsibilities are, and then they'll be able to decide.
If the 'structure of their content' is journal articles, then, sure DOI is pretty handy for people wanting to cite or link to those articles. Jonathan Ranti Junus wrote: Hi All, I was asked by somebody from a college @ my institution whether they should go with assigning DOI for their journal articles: http://llt.msu.edu/ I can see the advantage of this approach and my first thought is more about whether they have resources in running their purl server, or whether they would need to do it through crossref (or any other agency.) Has anybody had any experience about this? Moreover, are there other factors that one should consider (pros and cons) about this? Or, looking at the structure of their content, whether they ever need DOI? Any ideas and/or suggestions? Any insights about this is much appreciated. thanks, ranti.
Re: [CODE4LIB] Digital imaging questions
There are at least two things about archival images that I can think of at the moment: 1. Resolution: different sizes/materials require different resolutions; there is no one-size-fits-all. To make a judgment, I would like to know more about the image (color?) and the size of the material. 2. File format: TIFF is the recommended format due to its openness, stability, and lossless data over time. If you believe that your JPEG file has enough resolution, I do not see any problem with converting it to TIFF. Yan From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Deng, Sai [sai.d...@wichita.edu] Sent: Thursday, June 18, 2009 7:33 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Digital imaging questions Hi, list, A while ago, I read some interesting discussion on how to use camera to produce archival-quality images from this list. Now, I have some imaging questions and I think this might be a good list to turn to. Thank you in advance! We are trying to add some herbarium images to our DSpace. The specimen pictures will be taken at the Biology department and the library is responsible for depositing the images and transferring/mapping/adding metadata. On the testing stage, they use Fujifilm FinePix S8000fd digital camera (http://www.fujifilmusa.com/support/ServiceSupportSoftwareContent.do?dbid=874716&prodcat=871639&sscucatid=664260). It produces 8 megapixel images, and it doesn't have raw/tiff support. It seems that it cannot produce archival quality images. Before we persuade the Biology department to switch their camera, I want to make sure it is absolutely necessary. The pictures they took look fine with human eyes, see an example at: http://library.wichita.edu/techserv/test/herbarium/Astenophylla1-02710.jpg In order to make master images from a camera, it should be capable of producing raw or tiff images with 12 or above megapixels? A related archiving question, the biology field standard is DarwinCore, however, DSpace doesn't support it. The Biology Dept. already has some data in spreadsheet. In this case, when it is impossible to map all the elements to Dublin Core, is it a good practice for us to set up several local elements mapped from DarwinCore? Thanks a million, Sai Sai Deng Metadata Catalog Librarian Wichita State University Libraries 1845 Fairmount Wichita, KS 67260-0068 Phone: (316) 978-5138 Fax: (316) 978-3048 Email: sai.d...@wichita.edu said...@gmail.com
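If the JPEG really does have enough resolution, the conversion itself is trivial; a sketch using javax.imageio is below (the bundled TIFF writer assumes Java 9 or later, and converting cannot add detail the camera never captured):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

// Convert a JPEG derivative to TIFF. This only changes the container/compression;
// it cannot add resolution or undo JPEG artifacts already baked into the file.
public class JpegToTiff {
    public static void main(String[] args) throws Exception {
        BufferedImage image = ImageIO.read(new File(args[0]));         // e.g. Astenophylla1-02710.jpg
        boolean ok = ImageIO.write(image, "TIFF", new File(args[1]));  // e.g. Astenophylla1-02710.tif
        if (!ok) {
            System.err.println("No TIFF writer available (requires Java 9+ or an ImageIO TIFF plugin)");
        }
    }
}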
Re: [CODE4LIB] Recommend book scanner?
The National Archives has the guideline which describes target that you can use for scanning comparison. There are other targets used in other books/articles. I suggest that you check the National Archives' guidelines. http://www.archives.gov/preservation/technical/guidelines.html -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Lars Aronsson Sent: Friday, May 01, 2009 8:27 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Recommend book scanner? Mike Taylor wrote: Or not. Cheap cameras may well produce JPEGs that contain eight million pixels, but that doesn't mean that they are using all or even much of that resolution. Does anybody have a printed test sheet that we can scan or photo, and then compare the resulting digital images? It should have lines at various densities and areas of different colours, just like an old TV test image. Can you buy such calibration sheets? We could make it a standard routine, to always shoot such a sheet at the beginning of any captured book, to give the reader an idea of the digitization quality of the used equipment. They are called technical target in figure 14, page 149, of Lisa L. Fox (ed.), Preservation Microfilming, 2nd ed. (1996), ISBN 0-8389-0653-2. The example there is manufactured by AP International, http://www.a-p-international.com/ However, their price list is $100-400 per package of 50 sheets. I wouldn't pay more for the calibration targets than for the camera, if I could avoid it. -- Lars Aronsson (l...@aronsson.se) Aronsson Datateknik - http://aronsson.se Project Runeberg - free Nordic literature - http://runeberg.org/
Re: [CODE4LIB] Recommend book scanner?
That is right. In addition, for certain printing (gold seal), digital camera delivers better result than scanners. -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Jonathan Rochkind Sent: Friday, May 01, 2009 2:38 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Recommend book scanner? Yeah, I don't think people use cameras instead of flatbed scanners because they produce superior results, or are cheaper: They use them because they're _faster_ for large-scale digitization, and also make it possible to capture pages from rare/fragile materials with less damage to the materials. (Flatbeds are not good on bindings, if you want to get a good image). If these things don't apply, is there any reason not to use a flatbed scanner? Not that I know of? Jonathan Randy Stern wrote: My understanding is that a flatbed or sheetfed document scanner that produces 300 dpi will produce much better OCR results than a cheap digital camera that produces 300 dpi. The reasons have to do with the resolution and distortion of the resulting image, where resolution is defined as the number of line pairs per mm can be resolved (for example when scanning a test chart) - in other words the details that will show up for character images, and distortion is image aberration that can appear at the edges of the page image areas, particularly when illumination is not even. A scanner has much more even illumination. At 11:21 AM 5/1/2009 -0700, Erik Hetzner wrote: At Fri, 1 May 2009 09:51:19 -0500, Amanda P wrote: On the other hand, there are projects like bkrpr [2] and [3], home-brew scanning stations build for marginally more than the cost of a pair of $100 cameras. Cameras around $100 dollars are very low quality. You could get no where near the dpi recommended for materials that need to be OCRed. The quality of images from cameras would be not only low, but the OCR (even with the best software) would probably have many errors. For someone scanning items at home this might be ok, but for archival quality, I would not recommend cameras. If you are grant funded and the grant provider requires a certain level of quality, you need to make sure the scanning mechanism you use can scan at that quality. I know very little about digital cameras, so I hope I get this right. According to Wikipedia, Google uses (or used) an 11MP camera (Elphel 323). You can get a 12MP camera for about $200. With a 12MP camera you should easily be able to get 300 DPI images of book pages and letter size archival documents. For a $100 camera you can get more or less 300 DPI images of book pages. * The problems I have always seen with OCR had much to do with alignment and artifacts than with DPI. 300 DPI is fine for OCR as far as my (limited) experience goes - as long as you have quality images. If your intention is to scan items for preservation, then, yes, you want higher quality - but I can’t imagine any setup for archival quality costing anywhere near $1000. If you just want to make scans full text OCR available, these setups seem worth looking at - especially if the software workflow can be improved. best, Erik * 12 MP seems to equal 4256 x 2848 pixels. To take a ‘scan’ (photo) of a page at 300 DPI, that page would need to be 14.18 x 9.49 (dividing pixels / 300). As long as you can get the camera close enough to the image to not waste much space you will be getting in the close to 300 DPI range for images of size 8.5 x 11 or less. 
;; Erik Hetzner, California Digital Library ;; gnupg key id: 1024D/01DB07E3
Re: [CODE4LIB] You got it!!!!! Re: [CODE4LIB] Something completely different
Bill and Peter, Very nice posts. XML, RDF, MARC, and DC are all different ways to present information (of course, XML, RDF, and DC are easier for a machine to read/process). However, I think it goes deeper than that, down to the fundamentals: basically the data structures and algorithms that make things work. RDF (with triples) is a directed graph. A graph is a powerful (the most powerful?) data structure with which you can model everything. However, some graph-theory problems are NP-hard. Fundamentally we are talking about math, so a balance needs to be made between how complex the model is and how easy (or possible) it is to implement. As computing power grows, complex data modeling and data mining are on the horizon. Yan -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Peter Schlumpf Sent: Thursday, April 09, 2009 10:09 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] You got it! Re: [CODE4LIB] Something completely different Bill, You have hit the nail on the head! This is EXACTLY what I am trying to do! It's the underlying stuff that I am trying to get at. Looking at RDF may yield some good ideas. But I am not thinking in terms of RDF or XML, triples, or MARC, standards, or any of that stuff that gets thrown around here. Even the Internet is not terribly necessary. I am thinking in terms of data structures, pointers, sparse matrices, relationships between objects and yes, set theory too -- things like that. The former is pretty much cruft that lies upon the latter, and it mostly just gets in the way. Noise, as you put it, Bill! A big problem here is that Libraryland has a bad habit of getting itself lost in the details and going off on all kinds of tangents. As I said before, the biggest prison is between the ears Throw out all that junk in there and just start over! When I begin programming this thing my only tools will be a programming language (C or Java) a text editor (vi) and my head. But before I really start that, right now I am writing a paper that explains how this stuff works at a very low level. It's mostly an effort to get my thoughts down clearly, but I will share a draft of it with y'all on here soon. Peter Schlumpf -Original Message- From: Bill Dueber b...@dueber.com Sent: Apr 9, 2009 10:37 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Something completely different On Thu, Apr 9, 2009 at 10:26 AM, Mike Taylor m...@indexdata.com wrote: I'm not sure what to make of this except to say that Yet Another XML Bibliographic Format is NOT the answer! I recognize that you're being flippant, and yet think there's an important nugget in here. When you say it that way, it makes it sound as if folks are debating the finer points of OAI-MARC vs MARC-XML -- that it's simply syntactic sugar (although I'm certainly one to argue for the importance of syntactic sugar) over the top of what we already have. What's actually being discussed, of course, is the underlying data model. E-R pairs primarily analyzed by set theory, triples forming directed graphs, whether or not links between data elements can themselves have attributes -- these are all possible characteristics of the fundamental underpinning of a data model to describe the data we're concerned with. The fact that they all have common XML representations is noise, and referencing the currently-most-common xml schema for these things is just convenient shorthand in a community that understands the exemplars.
The fact that many in the library community don't understand that syntax is not the same as a data model is how we ended up with RDA. (Mike: I don't know your stuff, but I seriously doubt you're among that group. I'm talkin' in general, here.) Bibliographic data is astoundingly complex, and I believe wholeheartedly that modeling it sufficiently is a very, very hard task. But no matter the underlying model, we should still insist on starting with the basics that computer science folks have been using for decades now: uids (and, these days, guids) for the important attributes, separation of data and display, definition of sufficient data types and reuse of those types whenever possible, separation of identity and value, full normalization of data, zero ambiguity in the relationship diagram as a fundamental tenet, and a rigorous mathematical model to describe how it all fits together. This is hard stuff. But it's worth doing right. -- Bill Dueber Library Systems Programmer University of Michigan Library
Re: [CODE4LIB] Something completely different
Well, the future of the ILS is to use general computing standards rather than making the library's own. Essentially, from a computing-theory view, a graph is the way to represent all of the information (i.e., a graph can represent a tree, or a line; when you look at MARC, it is a linear computing model). Graphs are powerful, but graph theory can be difficult and extremely complex; some of its problems are NP-hard. I think that RDF-based standards (DC? Or something else, or maybe there is no need for just one metadata standard) can be used to maximize interoperability, allow further information discovery, and at the same time provide suitable description for different types of materials. Yan -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Karen Coyle Sent: Monday, April 06, 2009 10:49 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Something completely different Cloutman, David wrote: I'm open to seeing new approaches to the ILS in general. A related question I had the other day, speaking of MARC, is what would an alternative bibliographic data format look like if it was designed with the intent for opening access to the data our ILS systems to developers in a more informal manner? I was thinking of an XML format that a developer could work with without formal training, Well, speaking of 'without formal training' -- I posted this to the Open Library technology list, but using the OL, which is triple-based and open access, I was able to create a simple demo Pipe of how you could determine the earliest date of publication of a book (with an interest in looking at potential copyright status). Caveat is that the API I'm using is still pretty stubby, so it only retrieves on exact title (this will be fixed sometime in the future). The pipe is here: http://pipes.yahoo.com/pipes/pipe.info?_id=216efa8c3b04764ca77ad181b1cc66e4 kc the basics of which could be learned in an hour, and could reasonably represent the essential fields of the 90% of records that are most likely to be viewed by a public library patron. In my mind, such a format would allow creators of community-based web sites to pull data from their local library, and repurpose it without having to learn a lot of arcane formats (e.g. MARC) or esoteric protocols (e.g. Z39.50). The sacrifice, of course, would be loosing some of the richness MARC allows, but I think in many common situations the really complex records are not what patrons are interested in. You may want to consider prototyping this in your application. I see such an effort to be vital in making our systems relevant in future computing environments, and I am skeptical that a simple, workable solution would come out the initial efforts of a standardization committee. Just my 2 cents. - David --- David Cloutman dclout...@co.marin.ca.us Electronic Services Librarian Marin County Free Library -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Peter Schlumpf Sent: Sunday, April 05, 2009 8:40 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Something completely different Greetings! I have been lurking on (or ignoring) this forum for years. And libraries too. Some of you may know me. I am the Avanti guy. I am, perhaps, the first person to try to produce an open source ILS back in 1999, though there is a David Duncan out there who tried before I did. I was there when all this stuff was coming together. Since then I have seen a lot of good things happen. There's Koha. There's Evergreen. They are good things.
I have also seen first hand how libraries get screwed over and over by commercial vendors with their crappy software. I believe free software is the answer to that. I have neglected Avanti for years, but now I am ready to return to it. I want to get back to simple things. Imagine if there were no Marc records. Minimal layers of abstraction. No politics. No vendors. No SQL straightjacket. What would an ILS look like without those things? Sometimes the biggest prison is between the ears. I am in a position to do this now, and that's what I have decided to do. I am getting busy. Peter Schlumpf Email Disclaimer: http://www.co.marin.ca.us/nav/misc/EmailDisclaimer.cfm -- --- Karen Coyle / Digital Library Consultant kco...@kcoyle.net http://www.kcoyle.net ph.: 510-540-7596 skype: kcoylenet fx.: 510-848-3913 mo.: 510-435-8234
Re: [CODE4LIB] OCR engine for Persian/Dari
Mark, Many thanks for your input. This is one of the packages that I am thinking of. Good to know its accuracy. Yan -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Mark Jordan Sent: Tuesday, February 03, 2009 5:36 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] OCR engine for Persian/Dari Hi again Yan, There's this one: http://www.worldlanguage.com/Products/Readiris-Pro-11-Middle-East-Edition-ArabicReadiris-Farsi-Persian-Arabic-Farsi-110226.htm We have a copy of the Traditional Chinese version of Readiris and find its accuracy to be fairly poor (and its performance on latin characters was poor as well IIRC), but I can't comment on how this product works with other languages. Mark Mark Jordan Head of Library Systems W.A.C. Bennett Library, Simon Fraser University Burnaby, British Columbia, V5A 1S6, Canada Voice: 778.782.5753 / Fax: 778.782.3023 mjor...@sfu.ca - Yan Han h...@u.library.arizona.edu wrote: Hello, Do you know an OCR engine for Persian/Dari ? If so, what is the accurate rate? Thanks, Yan
[CODE4LIB] Linux tools for making PDFs
Hello, Do you know of a tool running under Linux to make PDFs from images? I use Adobe Acrobat Professional on Windows to create PDFs from image files; however, Acrobat does not handle image files with East Asian characters. Yan
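Since iText is mentioned elsewhere in this digest, here is a sketch of one way to do this on Linux; the class names are from the iText 5.x API and should be checked against the installed version, and pages are simply emitted in argument order:

import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.Image;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfWriter;

// Build a PDF with one page per input image, each page sized to the image.
public class ImagesToPdf {
    public static void main(String[] args) throws Exception {
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream("out.pdf"));
        document.open();
        for (String path : args) {
            Image img = Image.getInstance(path);  // TIFF, JPEG, PNG ...
            document.setPageSize(new Rectangle(img.getWidth(), img.getHeight()));
            document.newPage();
            img.setAbsolutePosition(0, 0);
            document.add(img);
        }
        document.close();
    }
}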
[CODE4LIB] OCR engine for Persian/Dari
Hello, Do you know of an OCR engine for Persian/Dari? If so, what is its accuracy rate? Thanks, Yan
Re: [CODE4LIB] MARC 21 and MODS
I clicked 2 URLs, and they are broken. What happened? 404 Not Found There is no SKOS Concept, ConceptScheme, or Collection instance in the registry available using this resource URI. -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Tim Cornwell Sent: Thursday, January 29, 2009 8:46 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARC 21 and MODS snip ... As a starting point in exploring semantic web types of technologies we are establishing a registry for controlled values used in various standards-- MARC, MODS, PREMIS. See the text at: http://id.loc.gov In the meantime we have a prototype at: http://www.loc.gov:8081/standards/registry/lists.html Rebecca Rebecca S. Guenther FYI: The notion of a vocabulary registry has been investigated and implemented to some extent by the folks here: http://metadataregistry.org/ ...not sure where they stand currently. -Tim Timothy Cornwell, Programmer/Analyst National Science Digital Library (http://nsdl.org) 301 College Avenue Ithaca, NY 14850 (607)255-3297
Re: [CODE4LIB] Is there a utility to open a folder of many pdfs and determine if each one will open? (eom)
try PDFBox. It can index PDF documents. -Original Message- From: Code for Libraries on behalf of Thomas Dowling Sent: Wed 1/28/2009 2:37 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Is there a utility to open a folder of many pdfs and determine if each one will open? (eom) On 01/28/2009 04:31 PM, Stockwell, Chris wrote: Chris Stockwell Library Systems Programmer Analyst Montana State Library cstockw...@mt.gov 406-444-5352 Your shell of choice should let you run pdfinfo on each one. It will either give you sensible information about the PDF file (in which case, you can assume it's good), or give you an error message. -- Thomas Dowling tdowl...@ohiolink.edu
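A sketch of the PDFBox route for the original question, assuming PDFBox 2.x (where PDDocument.load() throws on files it cannot parse): walk the folder and report any PDF that fails to open.

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;

// Try to open every .pdf in a directory; print the ones PDFBox cannot parse.
public class PdfOpenCheck {
    public static void main(String[] args) {
        File dir = new File(args[0]);
        File[] pdfs = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".pdf"));
        if (pdfs == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (File f : pdfs) {
            try (PDDocument doc = PDDocument.load(f)) {
                System.out.println("OK   " + f.getName() + " (" + doc.getNumberOfPages() + " pages)");
            } catch (Exception e) {
                System.out.println("FAIL " + f.getName() + ": " + e.getMessage());
            }
        }
    }
}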
[CODE4LIB] ETD package for ProQuest/UMI old and new delivery platforms
Hello, All, As mentioned before, I have received quite a few inquiries about the packages, so I have created a web page where you can download them. I have also made some fixes to the package. The software package does the following:
* Unzips ProQuest/UMI ETD delivery ZIP files and creates one directory per ETD.
* Renames these ETDs to other preferred file names (in my case, Wang_arizona_0009D_10075.xml -> azu_etd_10075_sip1_m.xml).
* Generates digital signatures for digital preservation.
* Creates MARC records from ProQuest/UMI XML files (i.e., an MRK file is generated for direct loading to the catalog; I load the MRC file into Innovative and Koha).
* Creates embargo notifications and moves embargoed ETDs to a different directory for future loading.
Note: My package is based on files received by the University of Arizona. I do not have access to other universities'/colleges' files, so I am not sure whether your university/college has a completely different file naming/structure. If you can email me your university/college file pattern, I might be able to generate something more flexible. The download page is available at http://www.afghanresource.org/joomla/index.php?option=com_content&task=view&id=37&Itemid=52 . There are two packages; please make sure you have a file pattern similar to the one listed on the page. Thanks, Yan Han
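A sketch of the fixity step, assuming the package's "digital signature" is a per-file checksum manifest; the SHA-256 choice and the manifest layout are illustrative, not necessarily what the package actually emits:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.stream.Stream;

// Write "checksum  filename" lines for every file under an ETD directory,
// so fixity can be re-verified later in the preservation workflow.
public class EtdFixity {
    public static void main(String[] args) throws Exception {
        Path etdDir = Paths.get(args[0]);  // e.g. the directory for one unzipped ETD
        try (Stream<Path> files = Files.walk(etdDir)) {
            files.filter(Files::isRegularFile).forEach(EtdFixity::printChecksum);
        }
    }

    private static void printChecksum(Path file) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(Files.readAllBytes(file));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(hex + "  " + file.getFileName());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}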
[CODE4LIB] software package for Elec. Theses/dissertations
Hello, Colleagues, As ProQuest/UMI has switched its delivery platform for electronic theses and dissertations (ETDs), I have developed a small software package to process ETDs. The software package does the following:
1. Unzips ProQuest/UMI ETD delivery ZIP files and creates one directory per ETD.
2. Renames these ETDs to other preferred file names (in my case, Wang_arizona_0009D_10075.xml -> azu_etd_10075_sip1_m.xml).
3. Generates digital signatures for digital preservation.
4. Creates MARC records from ProQuest/UMI XML files (i.e., an MRK file is generated for direct loading to the catalog; I use Innovative (III) and Koha).
5. Creates embargo notifications and moves embargoed ETDs to a different directory for future loading.
This package saves me a lot of time processing hundreds of ETDs. The package (about 50 KB) consists of compiled Java code (class files) and Perl scripts. Currently I run it on Linux, but it can be run on Windows. If anyone wants to have it or give it a try, please contact me. P.S. I also have a package handling ETD files from ProQuest's old platform (BePress). Thanks, Yan Han The University of Arizona Libraries
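A sketch of the renaming step using the example file name from this post; the regular expression assumes ProQuest names always end in an underscore-delimited numeric ID before the extension, which should be verified against your own deliveries:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Turn ProQuest delivery names like "Wang_arizona_0009D_10075.xml"
// into local SIP names like "azu_etd_10075_sip1_m.xml".
public class EtdRename {
    private static final Pattern PROQUEST_NAME = Pattern.compile(".*_(\\d+)\\.(\\w+)$");

    public static String localName(String proquestName) {
        Matcher m = PROQUEST_NAME.matcher(proquestName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected ProQuest file name: " + proquestName);
        }
        return "azu_etd_" + m.group(1) + "_sip1_m." + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(localName("Wang_arizona_0009D_10075.xml"));  // azu_etd_10075_sip1_m.xml
    }
}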