Re: [CODE4LIB] Strategy for assigning DOIs?

2016-02-09 Thread Han, Yan - (yhan)
Hi, Jason,

I strongly suggest keeping your DOI namespace/naming scheme totally 
independent of your choice of repository/system. DOI is an infrastructure 
concern, and the main reason for assigning DOIs is persistence and 
permanence. At some point any repository system will go away and will be 
replaced by another one. 

Secondly, I do not think “data” is a good namespace. I suggest something 
persistent that can stand alone even without the DOI prefix 10.17348. 
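For instance, the suffix can be minted from a plain accession counter that carries no repository semantics at all, so the same DOI survives any future platform migration. A minimal sketch (the prefix 10.17348 comes from the thread; the zero-padded suffix pattern is just an illustration, not a recommendation of any particular scheme):

```python
def mint_doi(prefix: str, accession: int) -> str:
    """Build a DOI from a repository-independent accession number.

    The suffix encodes nothing about the hosting platform, so the DOI
    can follow the item through any future repository migration.
    """
    return f"{prefix}/{accession:08d}"

# Illustrative output: 10.17348/00000042
print(mint_doi("10.17348", 42))
```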

Yan





On 2/9/16, 11:56 AM, "Code for Libraries on behalf of Jason Best" 
 wrote:

>We recently started assigning DOIs to articles published in one of our 
>journals using Open Journal System which generates the DOI and metadata within 
>a namespace dedicated to that journal. We don’t yet have an institutional 
>repository, but are moving in that direction and I hope we have one in a 
>couple of years. But in the meantime, how could we go about issuing DOIs for 
>items that aren’t related to the journal, but that we’d hope to eventually 
>have handled by our IR? For example, we have a handful of datasets for which 
>we’d like to issue DOIs, so I planned on creating a “data” namespace then just 
>adding a serial number for each dataset (e.g. 10.17348/data.01) which 
>would resolve to a page (with metadata and download links) in our Drupal CMS 
>until we get an IR. Will such an approach allow us to eventually use an IR to 
>1) become the repository for the items with DOIs previously issued in the 
>“data” namespace and 2) continue issuing DOIs for new items within the 
>“data” namespace? I know the answer is going to depend on the IR 
> platform we use, so I’m asking this in the broad sense to get your input 
> about your experiences. 
>
>But since DSpace is one of the likely candidates for our IR, I’ll use it as a 
>more concrete example. From my limited understanding (just reading the 
>documentation), items deposited in a DSpace instance will all share the same 
>DOI namespace. The namespace and an internal identifier are then concatenated 
>with the DOI prefix to create the DOI. If we’ve already issued a DOI outside 
>of DSpace, would we have any control over the identifier that was assigned to 
>a newly-deposited item allowing us to control the DOI that is generated?
>
>Any thoughts or suggestions?
>
>Thanks,
>Jason
>
>Jason Best
>Biodiversity Informatician
>Botanical Research Institute of Texas
>1700 University Drive
>Fort Worth, Texas 76107
>
>817-332-4441 ext. 230
>http://www.brit.org


[CODE4LIB]

2016-02-08 Thread Han, Yan - (yhan)
Yes. Use iText or PDFBox.

Both are common Java PDF libraries.
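For the ActualText question below, the core task is walking the PDF's structure tree and collecting the /ActualText entry of each tagged element instead of the page's visible text layer. A minimal sketch of that traversal, using plain nested dicts to stand in for the parsed structure tree (a real implementation would use PDFBox's structure-tree API or iText's tagged-PDF support, not these dicts):

```python
def collect_actual_text(node: dict) -> list[str]:
    """Depth-first walk of a (parsed) structure tree, collecting
    /ActualText values for indexing instead of the visible text layer."""
    texts = []
    if "/ActualText" in node:
        texts.append(node["/ActualText"])
    for child in node.get("/K", []):  # /K holds the element's children
        texts.extend(collect_actual_text(child))
    return texts

# Toy structure tree standing in for what a PDF library would return.
tree = {"/K": [
    {"/ActualText": "සිංහල", "/K": []},
    {"/K": [{"/ActualText": "မြန်မာ", "/K": []}]},
]}
print(" ".join(collect_actual_text(tree)))
```

The collected strings can then be fed to any full-text indexer in place of the extracted page text.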





On 2/6/16, 2:24 PM, "Code for Libraries on behalf of Andrew Cunningham" 
 wrote:

>Hi all,
>
>I am working with PDF files in some South Asian and South East Asian
>languages. Each PDF has ActualText added for each tag in the PDF. Each PDF
>has ActualText as an alternative for the visible text layer in the PDF.
>
>Is anyone aware of tools that will allow me to index and search PDFs based
>on the ActualText content rather than the visible text layers in the PDF?
>
>Andrew
>
>-- 
>Andrew Cunningham
>lang.supp...@gmail.com


[CODE4LIB] EMPLOYMENT OPPORTUNITY: Department Head, Office of Digital Innovation and Stewardship (ODIS)

2016-01-26 Thread Han, Yan - (yhan)

Please share the posting with interested parties. Tucson has mild winters and 
dry, warm summers. The person will be working with engaged and friendly colleagues.



EMPLOYMENT OPPORTUNITY

Department Head, Office of Digital Innovation and Stewardship
The University of Arizona Libraries, Digital Innovation/Stewardship (Dept. 1705)
Classification: Administrator/Appointed Professional; Full-Time; Exempt
Location: Main Campus, Tucson

Position Summary:
The University Libraries seek a dynamic, innovative Head of the Office of 
Digital Innovation and Stewardship (ODIS), a position with the primary 
responsibility of providing leadership and strategic direction for digital 
innovation and stewardship within the broader context of the strategic plans of 
the University Libraries and the University of Arizona. ODIS provides a broad 
range of services including digital collections, data management, campus 
repository, metadata, journal hosting and publishing, copyright and scholarly 
communication, open access, and geospatial data. In overseeing several areas of 
strategic importance, the Department Head must be forward thinking and willing 
to take strategic risks in the development of services. The Department Head 
will be a member of the Libraries Cabinet (leadership, policy and management 
team) and reports to the Vice Dean of Libraries.

The Department Head of ODIS will be responsible for leadership, management, and 
planning for the services and functions of the Office of Digital Innovation and 
Stewardship, which includes 8 FTE permanent professionals and a large team of 
students and temporary employees. ODIS members work collaboratively, engaging 
the strengths and knowledge of all members of the department. The Department 
Head will coordinate and facilitate leadership currently in place among ODIS 
faculty and staff. As UA librarians have faculty status, the Department Head is 
responsible for coaching and guiding librarians through the promotion and 
continuing status process. The Department Head will also be responsible for 
ensuring that department planning furthers the strategic goals for the 
Libraries and campus.

This is a continuing-eligible, academic professional position. Incumbents are 
members of the general faculty and are entitled to all accompanying rights and 
privileges granted by the Arizona Board of Regents and the University of 
Arizona. Retention and promotion are earned through achievement of a record of 
excellence in position effectiveness, scholarship, and service.

The Office of Digital Innovation and Stewardship (ODIS) at the University of 
Arizona Libraries engages and innovates across a range of services and content 
in support of the University’s mission and strategic plan. ODIS provides 
services to the University community that encompass data management, campus 
repository, metadata, journal hosting and publishing, copyright and scholarly 
communication, open access, and geospatial data. ODIS is responsible for 
programmatic planning and oversight of the Libraries digital collections and 
digitization activities, including digital preservation and digital asset 
management efforts. ODIS coordinates strategies for exposing unique and local 
digital collections. ODIS also leads and contributes to a variety of national 
and international collaborative efforts, including TRAIL (Technical Report 
Archive and Image Library) and the Afghanistan Digital Collections. ODIS is 
active in campus-wide efforts related to scholarly activity and research data, 
participates in the University’s Research Computing Governance Committee, leads 
the institution’s faculty activity reporting efforts, and collaborates with the 
University’s Office of Research and Discovery, and University Information 
Technology Services. In this process, ODIS collaborates with faculty and staff 
throughout the University Libraries and across campus.


The University of Arizona has been recognized on Forbes 2015 list of America’s 
Best Employers in the United States and has been awarded the 2015 Work-Life 
Seal of Distinction by the Alliance for Work-Life Progress! For more 
information about working at the University Libraries, see 
http://www.library.arizona.edu/about/employment/why.


Diversity Commitment: At the University of Arizona, we value our inclusive 
climate because we know that diversity in experiences and perspectives is vital 
to advancing innovation, critical thinking, solving complex problems, and 
creating an inclusive academic community. Diversity in our environment embraces 
the acceptance of a multiplicity of cultural heritages, lifestyles and 
worldviews. We translate these values into action by seeking individuals who 
have experience and expertise working with diverse students, colleagues and 
constituencies, as we believe that such experiences are both institutional and 
service imperatives. Because we seek a workforce with diverse perspectives and 
experiences, we encourage applications 

Re: [CODE4LIB] Amazon Glacier - tracking deposits

2015-04-09 Thread Han, Yan - (yhan)
Be aware of data transfer costs if you are using Glacier.
Glacier is an excellent choice for archival use, but you want to be sure these
files will not be accessed often.

You should consider the total cost of ownership, including data transfer
costs, which can be very expensive if you retrieve more than about 5% of your
data. It adds up quickly if you do not check carefully.
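The total-cost-of-ownership point can be made concrete with back-of-the-envelope arithmetic. A sketch with illustrative prices only (roughly in line with 2015 list prices, but check current AWS pricing; Glacier's actual retrieval fee was based on peak retrieval rate, which this deliberately ignores):

```python
def glacier_tco(gb_stored: float, gb_retrieved: float, months: int,
                storage_per_gb_month: float = 0.01,
                transfer_out_per_gb: float = 0.09) -> float:
    """Rough total cost: storage over time plus data transferred out.

    Prices are illustrative assumptions, not current AWS rates, and the
    separate Glacier retrieval fee (peak-rate based) is omitted.
    """
    return (gb_stored * storage_per_gb_month * months
            + gb_retrieved * transfer_out_per_gb)

# 1 TB stored for a year, with 10% retrieved once:
print(round(glacier_tco(1000, 100, 12), 2))  # → 129.0
```

Even at these toy prices, a single large retrieval can rival months of storage cost, which is the "check carefully" point above.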

I have a forthcoming article in Library Hi Tech discussing Amazon S3 and
Glacier, including the history of data transfer and storage costs over the
past 7 years.

For IDs, I designed and implemented a unique persistent ID system for all
the digital files (which is also used as the DOI if needed).


Yan Han
The University of Arizona Libraries




On 4/9/15, 4:13 AM, Scancella, John j...@loc.gov wrote:

Have you looked at Google's Cloud Storage Nearline? It is about $0.01
per gigabyte per month with about 3-second access time:
http://googlecloudplatform.blogspot.com/2015/03/introducing-Google-Cloud-Storage-Nearline-near-online-data-at-an-offline-price.html


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Cary Gordon
Sent: Wednesday, April 08, 2015 7:49 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Amazon Glacier - tracking deposits

We have been playing with Glacier, but so far neither we nor our clients
have been convinced of its cost-effectiveness. A while back, we were
discussing a project with 15 PB of archival assets, and that would
certainly have made Glacier cost-effective, saving about $30k/mo. over
S3, although requests could cut into that.

The Glacier location is in the format /Account ID/vaults/Vault
Name/archives/Archive ID, so you might want to consider using the
whole string.
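If you do record the whole location string, splitting it back into its parts later is trivial. A small sketch of parsing the format described above (the field names are my own, not Glacier's):

```python
def parse_glacier_location(location: str) -> dict:
    """Split '/AccountId/vaults/VaultName/archives/ArchiveId' into parts."""
    parts = location.lstrip("/").split("/")
    if len(parts) != 5 or parts[1] != "vaults" or parts[3] != "archives":
        raise ValueError(f"unexpected Glacier location: {location!r}")
    return {"account_id": parts[0], "vault": parts[2], "archive_id": parts[4]}

loc = "/123456789012/vaults/scans/archives/AbC123xyz"  # made-up example
print(parse_glacier_location(loc)["vault"])  # → scans
```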

Thanks,

Cary


 On Apr 8, 2015, at 3:32 PM, Sara Amato sam...@willamette.edu wrote:
 
 Has anyone leapt on board with Glacier? We are considering using it
for long-term storage of high-res archival scans. We have derivative
copies for dissemination, so we don't intend to touch these often, if ever.
The question I have is how best to track the Archive ID that Glacier
attaches to deposits, as it looks like that is the only way to retrieve
information if needed (though you can also attach a brief description
that appears on the inventory along with the ID). We're considering
putting the ID in Archivist Toolkit, where the location of the
dissemination copies is noted, but I am wondering if there are other tools
out there specific to this scenario that people are using. 


Re: [CODE4LIB] : Persian Romanization table

2013-04-19 Thread Han, Yan
Hello, Charles,
The plan is to write a program that can use a pre-defined language mapping XML 
file. One language needs one pre-defined mapping XML file, so that any 
language can have its own mapping (extensible for future language 
transliteration). In this case, a Persian language mapping XML file and a 
Pashto language mapping XML file.  
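The per-language mapping file idea can be sketched in a few lines: each language ships an XML file of source-to-Roman pairs, and the engine just loads whichever file applies. The element and attribute names below are hypothetical, not from any actual schema, and the entries are illustrative:

```python
import xml.etree.ElementTree as ET

# Hypothetical schema with a few illustrative Persian entries.
PERSIAN_XML = """<mapping lang="fa">
  <map from="پ" to="p"/><map from="د" to="d"/><map from="ر" to="r"/>
</mapping>"""

def load_mapping(xml_text: str) -> dict:
    """Parse a language mapping file into a source→Roman lookup table."""
    root = ET.fromstring(xml_text)
    return {m.get("from"): m.get("to") for m in root.iter("map")}

def transliterate(text: str, mapping: dict) -> str:
    # Character-by-character substitution; unmapped characters pass through.
    return "".join(mapping.get(ch, ch) for ch in text)

print(transliterate("پدر", load_mapping(PERSIAN_XML)))  # → pdr
```

Note the output is the bare consonant skeleton "pdr", which is exactly the short-vowel problem discussed in the quoted messages below: character-level mapping alone cannot recover "pedar".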
Thanks for the language tool.  I will take a look.
Yan


-Original Message-
From: Riley, Charles [mailto:charles.ri...@yale.edu] 
Sent: Wednesday, April 17, 2013 5:31 PM
To: lit...@ala.org; Jacobs, Jane W; Code for Libraries 
(CODE4LIB@LISTSERV.ND.EDU)
Cc: Seyede Pouye Khoshkhoosani
Subject: [lita-l] RE: : Persian Romanization table

Hi Yan,

Sounds like a really interesting project.  Is the intent to support going from 
Persian to Pashto directly, as well as from each language to Roman script?

Among the natural language processing tools found here-- 
http://www.ling.ohio-state.edu/~jonsafari/persian_nlp.html

--the one that *might* be the most helpful is the link to the Persian Lexical 
Project, where the romanized orthography used is one that accounts for vowels 
inserted between the consonants.  It's not a large dataset, but carries a GPLv2 
license--maybe useful in some testing, and see if it's worth expanding on the 
effort.

Best,
Charles Riley


From: Han, Yan [h...@u.library.arizona.edu]
Sent: Wednesday, April 17, 2013 8:14 PM
To: Jacobs, Jane W; Code for Libraries (CODE4LIB@LISTSERV.ND.EDU); 
lit...@ala.org
Cc: Seyede Pouye Khoshkhoosani
Subject: [lita-l] RE: : Persian Romanization table

Hello, All and Jane,
First I would like to thank Jane Jacobs at Queens Library for providing me the 
Urdu romanization table.
As we are working on creating Persian/Pashto transliteration software, my 
Persian language expert has the following question:
 According to our conversation about transliterating Persian to Roman 
letters, I face a big problem: as the short vowels do not show up on or under 
the letters in Persian, how can a machine read a word in Persian? For example, 
we have the word “پدر”; to the machine this word is PDR, because it cannot 
read the vowels. There is no rule for the short vowels in the Persian language, 
so the machine does not understand if the first letter is “pi”, “pa” or “po”. 
Is there any way to overcome this obstacle? 
 It seems to me that we are missing a critical piece of information here 
(something like a dictionary). Without it, there is no way to get a good 
transliteration from the computer. We will have to have a Persian speaker 
check/correct the computer's transliteration.
Any suggestions? 
Thanks,
Yan


-Original Message-
From: Jacobs, Jane W [mailto:jane.w.jac...@queenslibrary.org]
Sent: Wednesday, January 23, 2013 6:28 AM
To: Han, Yan
Subject: RE: : Persian Romanization table

Hi Yan,

As per my message to the listserv, here are the config files for Urdu.  If you 
do a Persian config file, I'd love to get it and if possible add it to the 
MARC::Detrans site.

Let me know if you want to follow this road.
JJ

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Han, Yan
Sent: Tuesday, January 22, 2013 5:31 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] : Persian Romanization table

Hello, All,
I have a project dealing with Persian materials. I have already used the Google 
Translate API to translate. Now I am looking for an API to transliterate / 
romanize (NOT translate) Persian to English (not English to Persian). In other 
words, I have Persian in, and English out.
There is a romanization table (Persian romanization table - Library of 
Congress: http://www.loc.gov/catdir/cpso/romanization/persian.pdf).

For example:

كتاب  should output as  Kitāb.
My finding is that existing tools only do the opposite:

1.  Google Transliterate: you enter English, output Persian (Input 
“Bookmark”, output “بوکمارک “, Input “بوکمارک “, output “بوکمارک “)

2.  OCLC language: the same as Google Transliterate.

3.  http://mylanguages.org/persian_romanization.php  : works, but no API.

Does anyone know if such an API exists?

Thanks much,

Yan




To maximize your use of LITA-L or to unsubscribe, see 
http://www.ala.org/lita/involve/email


Re: [CODE4LIB] : Persian Romanization table

2013-04-17 Thread Han, Yan
Hello, All and Jane,
First I would like to thank Jane Jacobs at Queens Library for providing me the 
Urdu romanization table.
As we are working on creating Persian/Pashto transliteration software, my 
Persian language expert has the following question:
 According to our conversation about transliterating Persian to Roman 
letters, I face a big problem: as the short vowels do not show up on or under 
the letters in Persian, how can a machine read a word in Persian? For example, 
we have the word “پدر”; to the machine this word is PDR, because it cannot 
read the vowels. There is no rule for the short vowels in the Persian language, 
so the machine does not understand if the first letter is “pi”, “pa” or “po”. 
Is there any way to overcome this obstacle? 
 It seems to me that we are missing a critical piece of information here 
(something like a dictionary). Without it, there is no way to get a good 
transliteration from the computer. We will have to have a Persian speaker 
check/correct the computer's transliteration. 
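One way to make the dictionary idea concrete: keep a lookup table from a word's consonant skeleton to its fully vowelled romanization, fall back to the bare consonants when the word is unknown, and flag those for human review. Entirely a sketch; the lexicon entries are illustrative:

```python
# Hypothetical skeleton→romanization dictionary (illustrative entries only).
LEXICON = {"pdr": "pedar", "ktb": "ketab"}

def romanize(skeleton: str) -> tuple[str, bool]:
    """Return (romanization, needs_review). Unknown skeletons keep the
    bare consonants and are flagged for a human Persian speaker."""
    if skeleton in LEXICON:
        return LEXICON[skeleton], False
    return skeleton, True

print(romanize("pdr"))   # → ('pedar', False)
print(romanize("xyz"))   # → ('xyz', True)
```

The flag makes the human-review workflow explicit: only words the dictionary cannot resolve go to the Persian speaker.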
Any suggestions? 
Thanks,
Yan


-Original Message-
From: Jacobs, Jane W [mailto:jane.w.jac...@queenslibrary.org] 
Sent: Wednesday, January 23, 2013 6:28 AM
To: Han, Yan
Subject: RE: : Persian Romanization table

Hi Yan,

As per my message to the listserv, here are the config files for Urdu.  If you 
do a Persian config file, I'd love to get it and if possible add it to the 
MARC::Detrans site.  

Let me know if you want to follow this road.
JJ

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Han, Yan
Sent: Tuesday, January 22, 2013 5:31 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] : Persian Romanization table

Hello, All,
I have a project dealing with Persian materials. I have already used the Google 
Translate API to translate. Now I am looking for an API to transliterate / 
romanize (NOT translate) Persian to English (not English to Persian). In other 
words, I have Persian in, and English out.
There is a romanization table (Persian romanization table - Library of 
Congress: http://www.loc.gov/catdir/cpso/romanization/persian.pdf).

For example:

كتاب  should output as  Kitāb.
My finding is that existing tools only do the opposite:

1.  Google Transliterate: you enter English, output Persian (Input 
“Bookmark”, output “بوکمارک “, Input “بوکمارک “, output “بوکمارک “)

2.  OCLC language: the same as Google Transliterate.

3.  http://mylanguages.org/persian_romanization.php  : works, but no API.

Does anyone know if such an API exists?

Thanks much,

Yan


[CODE4LIB] III loading module cannot handle non-English characters

2013-01-22 Thread Han, Yan
Hello, 
We have problems using the III loading module to load MARC files (.mrc) to our 
catalog. This is using Data Exchange  Load Electronic Records (itm). 
Basically, non-English characters (French, Arabic, ...) are changed to unknown 
symbols. The MARC files (.mrk and .mrc) are verified before loading to III. 
I can see only two possible causes:
1. The III configuration might be wrong. 
2. The III loading module has a bug and probably does not know how to deal 
with non-English characters.  

Has anyone had a similar experience or resolved it? 
Thanks,
Yan


[CODE4LIB] : Persian Romanization table

2013-01-22 Thread Han, Yan
Hello, All,
I have a project dealing with Persian materials. I have already used the Google 
Translate API to translate. Now I am looking for an API to transliterate / 
romanize (NOT translate) Persian to English (not English to Persian). In other 
words, I have Persian in, and English out.
There is a romanization table (Persian romanization table - Library of 
Congress: http://www.loc.gov/catdir/cpso/romanization/persian.pdf).

For example:

كتاب  should output as  Kitāb.
My finding is that existing tools only do the opposite:

1.  Google Transliterate: you enter English, output Persian (Input 
“Bookmark”, output “بوکمارک “, Input “بوکمارک “, output “بوکمارک “)

2.  OCLC language: the same as Google Transliterate.

3.  http://mylanguages.org/persian_romanization.php  : works, but no API.

Does anyone know if such an API exists?

Thanks much,

Yan



Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?

2011-03-07 Thread Han, Yan
You can just buy a node from a variety of cloud providers such as Amazon EC2, 
Linode etc. (It is very easy to build anything you want). 


Yan


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Cindy 
Harper
Sent: Sunday, March 06, 2011 10:54 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] LAMP Hosting service that supports php_yaz?

At the risk of exhausting my quota of messages for the month - Our LAMP hosting 
service does not support PECL extension php_yaz. Does anyone know of a service 
that does?

Cindy Harper, Systems Librarian
Colgate University Libraries
char...@colgate.edu
315-228-7363


Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?

2011-03-07 Thread Han, Yan
Updating L, A, and M is easy with Ubuntu/Debian. Not sure about PHP. If you are 
worried about hacking/security, you can use a monitoring service.

I know Google has a Java/Python platform. I do not know who provides a PHP/Perl 
one at this level. (Yes, it is a little easier when someone takes care of 
updating for you.)


Yan Han, Associate Librarian
The University of Arizona Libraries
Phone: (520)307-2823
Email: h...@u.library.arizona.edu

From: Cindy Harper [mailto:char...@colgate.edu]
Sent: Monday, March 07, 2011 11:18 AM
To: Code for Libraries
Cc: Han, Yan
Subject: Re: [CODE4LIB] LAMP Hosting service that supports php_yaz?

I guess I was hoping to have a service such as that provided by my current 
hosting service, where security updates, etc., for L, A, M, and P are all taken 
care of by the host. Any recommendations along those lines? One that provides 
that and still lets me install what I want? My service suggested that I go to a 
VPS account, where I'd have to do my own updates.

Cindy Harper, Systems Librarian
Colgate University Libraries
char...@colgate.edu
315-228-7363


On Mon, Mar 7, 2011 at 11:00 AM, Han, Yan 
h...@u.library.arizona.edu wrote:
You can just buy a node from a variety of cloud providers such as Amazon EC2, 
Linode etc. (It is very easy to build anything you want).


Yan


-Original Message-
From: Code for Libraries 
[mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
Cindy Harper
Sent: Sunday, March 06, 2011 10:54 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] LAMP Hosting service that supports php_yaz?

At the risk of exhausting my quota of messages for the month - Our LAMP hosting 
service does not support PECL extension php_yaz. Does anyone know of a service 
that does?

Cindy Harper, Systems Librarian
Colgate University Libraries
char...@colgate.edu
315-228-7363


Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

2010-10-20 Thread Han, Yan
I would suggest DSpace, Fedora, and EPrints. DSpace is fairly easy to implement 
and has embargo support in 1.6 
(https://wiki.duraspace.org/display/DSTEST/Embargo).
I have an article comparing DSpace and Fedora, but it was written 6 years ago. 
DSpace has not changed much, but Fedora is a different story. 
Yan
-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Deng, 
Sai
Sent: Wednesday, October 20, 2010 10:33 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] DL Systems (allowing search within documents and access 
restrictions)?

Hello, list,
Do you know the Digital Library systems which can search within the documents 
(e.g. PDFs) and handle access restrictions (e.g. DRM)?
Has any of you compared these DL systems?

Thanks for any information!
Sophie


Re: [CODE4LIB] DL Systems (allowing search within documents and access restrictions)?

2010-10-20 Thread Han, Yan
DSpace does full-text search; you need to turn it on in the configuration file. 
See UAL: http://arizona.openrepository.com/arizona/ 
Yan

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Deng, 
Sai
Sent: Wednesday, October 20, 2010 2:14 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access 
restrictions)?

For access restriction, I mean we would like to have certain documents open 
only to certain communities (UpLib cannot do that, right?). I don't know how 
DRM affects file indexing.

On second thought, I searched for DSpace full text search and found this: 
https://wiki.duraspace.org/display/DSPACE/Configure+full+text+indexing
However, I haven't seen any instance which shows the full text search results 
as I would see from vendor databases.

Any idea on what system might be good/best for search within documents and DRM?
Thank you for the reply!
Sophie


From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bill Janssen 
[jans...@parc.com]
Sent: Wednesday, October 20, 2010 4:01 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] DL Systems (allowing search within documents and access 
restrictions)?

Deng, Sai sai.d...@wichita.edu wrote:

 Do you know the Digital Library systems which can search within the 
 documents (e.g. PDFs) and handle access restrictions (e.g. DRM)?

Not sure what you mean by “handle access restrictions.”  Do you mean it can 
index the documents put into it even if they have DRM encumbrances?

UpLib has search within the documents -- if you search for a word or phrase, 
it shows you all the documents which match, but also all the pages in each 
document which match.  Supports a wide variety of document formats, from 
JPEG2000 to PDF to Powerpoint.  But as far as I know it doesn't deal with DRM 
restrictions.

Bill


[CODE4LIB] Amazon EC2 ports: only 80 and 8080?

2010-07-06 Thread Han, Yan
Hello,
Currently we would like to have an Amazon EC2 node hosting 2 applications: DSpace 
and Koha (so we need 4 ports). However, it seems that only ports 80 
and 8080 are available; other ports are not accessible from outside.
Has anyone had a similar experience and knows how to open other ports?
Thanks,
Yan
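The usual cause is the instance's security group: EC2 blocks every inbound port that has not been explicitly opened, independent of any firewall on the node itself. Opening more ports is a one-off API call. A sketch with boto3 (the group ID and the extra port numbers are placeholders; at the time of this thread the equivalent was the `ec2-authorize` command-line tool):

```python
def tcp_ingress(ports, cidr="0.0.0.0/0"):
    """Build the IpPermissions list that opens each TCP port to `cidr`."""
    return [{"IpProtocol": "tcp", "FromPort": p, "ToPort": p,
             "IpRanges": [{"CidrIp": cidr}]} for p in ports]

# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(
#     GroupId="sg-0123456789abcdef0",                 # placeholder
#     IpPermissions=tcp_ingress([80, 8080, 8081, 8443]))
print(tcp_ingress([8081])[0]["FromPort"])  # → 8081
```

Restricting `cidr` to a known address range instead of 0.0.0.0/0 is generally safer for admin ports.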


[CODE4LIB] OCR for handwritten pages

2010-01-13 Thread Han, Yan
Hello, Colleagues,
Does anyone know of or use any OCR software that works on handwritten pages, or 
at least works well enough to beat hiring a student to key the text in?
I know OCR software such as ABBYY, but it does not work on handwriting.

Thanks,
Yan


Re: [CODE4LIB] Assigning DOI for local content

2009-11-19 Thread Han, Yan
Please explain in more detail; that would be more helpful. 
It has been a while. Back in 2007, I checked PURL's architecture, and it 
strictly handled web addresses only. Of course, the current HTTP protocol is not 
going to last forever, and there are other protocols on the Internet. The 
coverage of PURL is not enough. 
PURL's website still says “PURLs (Persistent Uniform Resource 
Locators) are Web addresses that act as permanent identifiers in the face of a 
dynamic and changing Web infrastructure.” I am not sure what “web addresses” 
means. http://www.purl.org/docs/help.html#overview says “PURLs are 
Persistent Uniform Resource Locators (URLs). A URL is simply an address on the 
World Wide Web.” We all know that the World Wide Web is not the Internet. What 
if an info resource can be accessed through other Internet protocols (FTP, VoIP, 
etc.)? This is the limitation of PURL. 
PURL is undergoing a re-architecture, though I cannot find much documentation.
By contrast, “The Handle System is a general purpose distributed 
information system that provides efficient, extensible, and secure HDL 
identifier and resolution services for use on networks such as the Internet” 
(http://www.handle.net/index.html). Notice the difference in definition. 

Yan


-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Ross 
Singer
Sent: Wednesday, November 18, 2009 8:11 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Assigning DOI for local content

On Wed, Nov 18, 2009 at 12:19 PM, Han, Yan h...@u.library.arizona.edu wrote:
 Currently DOI uses Handle (the technology) with its social framework (i.e., an 
 administrative body to manage DOIs). In a technical sense, PURL is not going to 
 last long.

I'm not entirely sure what this is supposed to mean (re: purl), but
I'm pretty sure it's not true.

I'm also pretty sure there's little to no direct connection between
purl and doi despite a superficial similarity in scope.

-Ross.


Re: [CODE4LIB] Assigning DOI for local content

2009-11-18 Thread Han, Yan
Currently DOI uses Handle (the technology) with a social framework (i.e., an 
administrative body to manage DOIs). In a technical sense, PURL is not going to 
last long. 
CrossRef handles DOI registration in the U.S.; in Europe and Asia, other 
organizations handle it. DOI is also currently going through the ISO 
standardization process. The other fact is that DOI has the largest usage of 
any persistent identifier. More info can be found at 
http://www.doi.org/faq.html 

Yan

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Jodi 
Schneider
Sent: Tuesday, November 17, 2009 4:59 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Assigning DOI for local content

The first question is: what are they trying to accomplish by having DOIs?

Do they have a long-term plan for persistence of their content? Financial
plan?

If they're looking for persistent identifiers, I don't understand (a
priori), why DOI is better, as an identifier scheme, than any other
'persistent identifier scheme' (ARK [1], PURL, Handle, etc[2]). (Though I
really like CrossRef and the things they're doing.)

[1] http://www.cdlib.org/inside/diglib/ark/
[2] http://www.persistent-identifier.de/english/204-examples.php

-Jodi

On Tue, Nov 17, 2009 at 11:44 PM, Bucknell, Terry 
t.d.buckn...@liverpool.ac.uk wrote:

 You should be able to find all the information you need about CrossRef fees
 and rules at:

 http://www.crossref.org/02publishers/20pub_fees.html

 and

 http://www.crossref.org/02publishers/59pub_rules.html

 Information about the system of registering and maintaining DOIs is at:

 http://www.crossref.org/help/

 Note that as well as registering DOIs for the articles in LLT, LLT would be
 obliged to link to the articles cited by LLT articles (for cited articles
 that have DOIs too). Looking at the LLT site, it looks like they would have
 to change their 'abstract' pages to 'abstract plus cited refs', or change
 the way that their PDFs are created so that they include DOI links for cited
 references. (Without this the whole system would fail: publishers would
 expect traffic to come to them, but wouldn't have to send traffic
 elsewhere).

 I'd agree that DOIs are in general a Good Thing (and for e-books / e-book
 chapters, and reference work entries as well as e-journal articles). The
 CrossRef fees are deliberately set so as not to exclude single-title
 publishers. Here's an example of a single-title, university-based e-journal
 in the UK that provides DOIs, so it must be a CrossRef member:
 http://www.bioscience.heacademy.ac.uk/journal/.


 Terry Bucknell
 Electronic Resources Manager
 University of Liverpool


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Jonathan Rochkind
 Sent: 17 November 2009 23:20
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Assigning DOI for local content

 So I have no actual experience with this.

 But you have to pay for DOIs.  I've never done it, but I don't think
 you necessarily have to run your own purl server -- CrossRef takes care
 of it.  Of course, if your documents are going to be moving all over the
 place, if you run your own purl server and register your purls with
 CrossRef, then when a document moves, you can update your local purl
 server; otherwise, you can update CrossRef, heh.

 It certainly is useful to have DOIs, I agree.  I would suggest they
 should just contact CrossRef and get information on the cost, and what
 their responsibilities are, and then they'll be able to decide.  If the
 'structure of their content' is journal articles, then, sure DOI is
 pretty handy for people wanting to cite or link to those articles.

 Jonathan

 Ranti Junus wrote:
  Hi All,
 
  I was asked by somebody from a college @ my institution whether they
  should go with assigning DOI for their journal articles:
  http://llt.msu.edu/
 
  I can see the advantage of this approach and my first thought is more
  about whether they have resources in running their purl server, or
  whether they would need to do it through crossref (or any other
  agency.) Has anybody had any experience about this?
 
  Moreover, are there other factors that one should consider (pros and
  cons) about this? Or, looking at the structure of their content,
  whether they ever need DOI? Any ideas and/or suggestions?
 
 
  Any insights about this are much appreciated.
 
 
  thanks,
  ranti.
 
 



Re: [CODE4LIB] Digital imaging questions

2009-06-18 Thread Han, Yan
There are at least two things about archival images that I can think of at the 
moment:
1. The resolution: different sizes/materials require different resolutions. 
There is no one-size-fits-all. To make a judgment, I would like to know about 
the image (color?) and the size of the material.
2. The file format: TIFF is the recommended format due to its openness, 
stability, and lossless data over time. If you believe that your JPEG file 
has enough resolution, I do not see any problem with converting it to TIFF.
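For what it's worth, the JPEG-to-TIFF conversion described above can be sketched with the Pillow imaging library (my choice of library, not necessarily what was used). Note this only repackages the already-decoded JPEG pixels losslessly; it cannot recover detail that JPEG compression discarded.

```python
from pathlib import Path
from PIL import Image  # Pillow, assumed available

def jpeg_to_tiff(jpeg_path, tiff_dir):
    """Re-save a JPEG as an uncompressed TIFF for archival storage.

    The pixels stay exactly as decoded from the JPEG -- the lossy
    compression already applied can't be undone, but TIFF then stores
    the result without further loss.
    """
    src = Path(jpeg_path)
    dst = Path(tiff_dir) / (src.stem + ".tif")
    with Image.open(src) as im:
        im.save(dst, format="TIFF")  # Pillow's TIFF default is uncompressed
    return dst
```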

Yan


From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Deng, Sai 
[sai.d...@wichita.edu]
Sent: Thursday, June 18, 2009 7:33 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Digital imaging questions

Hi, list,



A while ago, I read some interesting discussion on how to use camera to produce 
archival-quality images from this list. Now, I have some imaging questions and 
I think this might be a good list to turn to. Thank you in advance! We are 
trying to add some herbarium images to our DSpace. The specimen pictures will 
be taken at the Biology department and the library is responsible for 
depositing the images and transferring/mapping/adding metadata. On the testing 
stage, they use Fujifilm FinePix S8000fd digital camera

(http://www.fujifilmusa.com/support/ServiceSupportSoftwareContent.do?dbid=874716&prodcat=871639&sscucatid=664260).
It produces 8-megapixel images, and it doesn't have raw/TIFF support. It seems 
that it cannot produce archival-quality images. Before we persuade the Biology 
department to switch cameras, I want to make sure it is absolutely necessary. 
The pictures they took look fine to human eyes; see an example at: 
http://library.wichita.edu/techserv/test/herbarium/Astenophylla1-02710.jpg

In order to make master images from a camera, should the camera be capable of 
producing raw or TIFF images at 12 megapixels or above?



A related archiving question: the biology field standard is Darwin Core; 
however, DSpace doesn't support it. The Biology Dept. already has some data in 
a spreadsheet. In this case, when it is impossible to map all the elements to 
Dublin Core, is it good practice for us to set up several local elements 
mapped from Darwin Core?

Thanks a million,

Sai


Sai Deng
Metadata Catalog Librarian
Wichita State University Libraries
1845 Fairmount
Wichita, KS 67260-0068
Phone: (316) 978-5138
Fax:   (316) 978-3048
Email: sai.d...@wichita.edu
 said...@gmail.com


Re: [CODE4LIB] Recommend book scanner?

2009-05-04 Thread Han, Yan
The National Archives has guidelines describing targets that you can use for 
scanning comparison. There are other targets used in other books/articles. 
I suggest that you check the National Archives' guidelines.
http://www.archives.gov/preservation/technical/guidelines.html 

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Lars Aronsson
Sent: Friday, May 01, 2009 8:27 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Recommend book scanner?

Mike Taylor wrote:

 Or not.  Cheap cameras may well produce JPEGs that contain eight 
 million pixels, but that doesn't mean that they are using all or 
 even much of that resolution.

Does anybody have a printed test sheet that we can scan or photo, 
and then compare the resulting digital images?  It should have 
lines at various densities and areas of different colours, just 
like an old TV test image.  Can you buy such calibration sheets?

We could make it a standard routine, to always shoot such a sheet 
at the beginning of any captured book, to give the reader an idea 
of the digitization quality of the used equipment.

They are called technical target in figure 14, page 149, of
Lisa L. Fox (ed.), Preservation Microfilming, 2nd ed. (1996), 
ISBN 0-8389-0653-2.  The example there is manufactured by AP 
International, http://www.a-p-international.com/

However, their price list is $100-400 per package of 50 sheets.
I wouldn't pay more for the calibration targets than for the
camera, if I could avoid it.


-- 
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se

  Project Runeberg - free Nordic literature - http://runeberg.org/


Re: [CODE4LIB] Recommend book scanner?

2009-05-01 Thread Han, Yan
That is right. 
In addition, for certain printing (e.g., gold seals), a digital camera delivers 
better results than a scanner. 

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of 
Jonathan Rochkind
Sent: Friday, May 01, 2009 2:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Recommend book scanner?

Yeah, I don't think people use cameras instead of flatbed scanners 
because they produce superior results, or are cheaper: They use them 
because they're _faster_ for large-scale digitization, and also make it 
possible to capture pages from rare/fragile materials with less damage 
to the materials. (Flatbeds are not good on bindings, if you want to get 
a good image).

If these things don't apply, is there any reason not to use a flatbed 
scanner? Not that I know of?

Jonathan

Randy Stern wrote:
 My understanding is that a flatbed or sheetfed document scanner that 
 produces 300 dpi will produce much better OCR results than a cheap digital 
 camera that produces 300 dpi. The reasons have to do with the resolution 
 and distortion of the resulting image, where resolution is defined as the 
 number of line pairs per mm that can be resolved (for example when scanning a 
 test chart) - in other words the details that will show up for character 
 images, and distortion is image aberration that can appear at the edges of 
 the page image areas, particularly when illumination is not even. A scanner 
 has much more even illumination.

 At 11:21 AM 5/1/2009 -0700, Erik Hetzner wrote:
   
 At Fri, 1 May 2009 09:51:19 -0500,
 Amanda P wrote:
 
 On the other hand, there are projects like bkrpr [2] and [3],
 home-brew scanning stations built for marginally more than the cost of
 a pair of $100 cameras.

 Cameras around $100 are very low quality. You could get nowhere near
 the DPI recommended for materials that need to be OCRed. The quality of
 images from cameras would be not only low, but the OCR (even with the
 best software) would probably have many errors. For someone scanning
 items at home this might be OK, but for archival quality, I would not
 recommend cameras. If you are grant funded and the grant provider
 requires a certain level of quality, you need to make sure the scanning
 mechanism you use can scan at that quality.
 I know very little about digital cameras, so I hope I get this right.

 According to Wikipedia, Google uses (or used) an 11MP camera (Elphel
 323). You can get a 12MP camera for about $200.

 With a 12MP camera you should easily be able to get 300 DPI images of
 book pages and letter size archival documents. For a $100 camera you
 can get more or less 300 DPI images of book pages. *

 The problems I have always seen with OCR had more to do with alignment
 and artifacts than with DPI. 300 DPI is fine for OCR as far as my
 (limited) experience goes - as long as you have quality images.

 If your intention is to scan items for preservation, then, yes, you
 want higher quality - but I can’t imagine any setup for archival
 quality costing anywhere near $1000. If you just want to make scans &
 full text OCR available, these setups seem worth looking at -
 especially if the software & workflow can be improved.

 best,
 Erik

 * 12 MP seems to equal 4256 x 2848 pixels. To take a ‘scan’ (photo) of
 a page at 300 DPI, that page would need to be at most 14.18 x 9.49
 inches (dividing pixels by 300). As long as you can get the camera
 close enough to the page not to waste much of the frame, you will be
 getting close to 300 DPI for images of size 8.5 x 11 inches or less.
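The arithmetic in that footnote can be written as a tiny helper (the function name is mine, for illustration only):

```python
def max_page_inches(px_width, px_height, dpi=300):
    """Largest page (in inches) a sensor of the given pixel dimensions
    can cover at the target DPI -- pure sampling math, ignoring lens
    quality, alignment, and lighting."""
    return px_width / dpi, px_height / dpi

# The 12 MP example: 4256 x 2848 pixels at 300 DPI
w, h = max_page_inches(4256, 2848)  # about 14.19 x 9.49 inches
```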
 ;; Erik Hetzner, California Digital Library
 ;; gnupg key id: 1024D/01DB07E3
 

   


Re: [CODE4LIB] You got it!!!!! Re: [CODE4LIB] Something completely different

2009-04-10 Thread Han, Yan
Bill and Peter,

Very nice posts. XML, RDF, MARC, and DC are all different ways to present 
information (and of course XML, RDF, and DC are easier for machines to read 
and process). 

However, at the fundamentals I think it goes deeper, basically to the data 
structures and algorithms that make things work. RDF (with triples) is a 
directed graph. A graph is a powerful (the most powerful?) data structure 
with which you can model everything. However, some graph-theory problems are 
NP-hard; fundamentally we are talking about math. So a balance needs to be 
struck between how complex the model is and how easy (or possible) it is to 
implement. As computing power grows, complex data modeling and data mining 
are on the horizon.
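To make the "triples are a directed graph" point concrete, here is a toy sketch (the data is entirely made up): each (subject, predicate, object) triple is a labeled directed edge, and a set of triples is just an edge list.

```python
from collections import defaultdict

# Made-up triples: each one is a labeled, directed edge.
triples = [
    ("book:1", "dc:creator", "person:twain"),
    ("book:1", "dc:title", "Roughing It"),
    ("person:twain", "foaf:name", "Mark Twain"),
]

# Adjacency structure for the directed graph:
# subject -> list of (predicate, object) edges leaving it.
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

# Follow the edges out of a node, e.g. everything said about book:1.
for predicate, obj in graph["book:1"]:
    print(predicate, "->", obj)
```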

Yan

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Peter 
Schlumpf
Sent: Thursday, April 09, 2009 10:09 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] You got it! Re: [CODE4LIB] Something completely 
different

Bill,

You have hit the nail on the head!  This is EXACTLY what I am trying to do! 
 It's the underlying stuff that I am trying to get at.   Looking at RDF may 
yield some good ideas.  But I am not thinking in terms of RDF or XML, triples, 
or MARC, standards, or any of that stuff that gets thrown around here.  Even 
the Internet is not terribly necessary.  I am thinking in terms of data 
structures, pointers, sparse matrices, relationships between objects and yes, 
set theory too -- things like that.  The former is pretty much cruft that lies 
upon the latter, and it mostly just gets in the way.  Noise, as you put it, 
Bill!

A big problem here is that Libraryland has a bad habit of getting itself lost 
in the details and going off on all kinds of tangents. As I said before, the 
biggest prison is between the ears. Throw out all that junk in there and 
just start over! When I begin programming this thing my only tools will be a 
programming language (C or Java), a text editor (vi), and my head. But before I 
really start that, right now I am writing a paper that explains how this stuff 
works at a very low level. It's mostly an effort to get my thoughts down 
clearly, but I will share a draft of it with y'all on here soon.

Peter Schlumpf


-Original Message-
From: Bill Dueber b...@dueber.com
Sent: Apr 9, 2009 10:37 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Something completely different

On Thu, Apr 9, 2009 at 10:26 AM, Mike Taylor m...@indexdata.com wrote:

 I'm not sure what to make of this except to say that Yet Another XML
 Bibliographic Format is NOT the answer!


I recognize that you're being flippant, and yet think there's an important
nugget in here.

When you say it that way, it makes it sound as if folks are debating the
finer points of OAI-MARC vs MARC-XML -- that it's simply syntactic sugar
(although I'm certainly one to argue for the importance of syntactic sugar)
over the top of what we already have.

What's actually being discussed, of course, is the underlying data model.
E-R pairs primarily analyzed by set theory, triples forming directed graphs,
whether or not links between data elements can themselves have attributes --
these are all possible characteristics of the fundamental underpinning of a
data model to describe the data we're concerned with.

The fact that they all have common XML representations is noise, and
referencing the currently-most-common xml schema for these things is just
convenient shorthand in a community that understands the exemplars. The fact
that many in the library community don't understand that syntax is not the
same as a data model is how we ended up with RDA.  (Mike: I don't know your
stuff, but I seriously doubt you're among that group. I'm talkin' in
general, here.)

Bibliographic data is astoundingly complex, and I believe wholeheartedly
that modeling it sufficiently is a very, very hard task. But no matter the
underlying model, we should still insist on starting with the basics that
computer science folks have been using for decades now: uids  (and, these
days, guids) for the important attributes, separation of data and display,
definition of sufficient data types and reuse of those types whenever
possible, separation of identity and value, full normalization of data, zero
ambiguity in the relationship diagram as a fundamental tenet, and a rigorous
mathematical model to describe how it all fits together.

This is hard stuff. But it's worth doing right.




-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] Something completely different

2009-04-06 Thread Han, Yan
Well, the future of the ILS is to use general computing standards rather than
making the library's own. 

Essentially, from a computing-theory view, a graph is the way to represent
all information (i.e., a graph can represent a tree or a line; when you
look at MARC, it is a linear computing model). Graphs are powerful, but
graph theory can be difficult and extremely complex; some problems are
NP-hard. 

I think that RDF-based standards (DC? or something else, or maybe no need
for just one metadata standard) can be used to maximize interoperability,
allow further information discovery, and at the same time provide suitable
description for different types of materials. 

Yan  

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Karen Coyle
Sent: Monday, April 06, 2009 10:49 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Something completely different

Cloutman, David wrote:
 I'm open to seeing new approaches to the ILS in general. A related
 question I had the other day, speaking of MARC, is what would an
 alternative bibliographic data format look like if it was designed
with
 the intent for opening access to the data our ILS systems to
developers
 in a more informal manner? I was thinking of an XML format that a
 developer could work with without formal training, 

Well, speaking of 'without formal training' -- I posted this to the Open 
Library technology list, but using the OL, which is triple-based and 
open access, I was able to create a simple demo Pipe of how you could 
determine the earliest date of publication of a book (with an interest 
in looking at potential copyright status). The caveat is that the API I'm 
using is still pretty stubby, so it only retrieves on exact title (this 
will be fixed sometime in the future).

The pipe is here:

http://pipes.yahoo.com/pipes/pipe.info?_id=216efa8c3b04764ca77ad181b1cc66e4

kc

 the basics of which
 could be learned in an hour, and could reasonably represent the
 essential fields of the 90% of records that are most likely to be
viewed
 by a public library patron. In my mind, such a format would allow
 creators of community-based web sites to pull data from their local
 library, and repurpose it without having to learn a lot of arcane
 formats (e.g. MARC) or esoteric protocols (e.g. Z39.50). The
sacrifice,
 of course, would be losing some of the richness MARC allows, but I
 think in many common situations the really complex records are not
what
 patrons are interested in. You may want to consider prototyping this
in
 your application. I see such an effort to be vital in making our
systems
 relevant in future computing environments, and I am skeptical that a
 simple, workable solution would come out the initial efforts of a
 standardization committee.

 Just my 2 cents.

 - David

 ---
 David Cloutman dclout...@co.marin.ca.us
 Electronic Services Librarian
 Marin County Free Library 

 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf
Of
 Peter Schlumpf
 Sent: Sunday, April 05, 2009 8:40 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] Something completely different


 Greetings!

 I have been lurking on (or ignoring) this forum for years.  And
 libraries too.  Some of you may know me.  I am the Avanti guy.  I am,
 perhaps, the first person to try to produce an open source ILS back in
 1999, though there is a David Duncan out there who tried before I did.
I
 was there when all this stuff was coming together.

 Since then I have seen a lot of good things happen.  There's Koha.
 There's Evergreen.  They are good things.  I have also seen first hand
 how libraries get screwed over and over by commercial vendors with
their
 crappy software.  I believe free software is the answer to that.  I
have
 neglected Avanti for years, but now I am ready to return to it.

 I want to get back to simple things.  Imagine if there were no MARC
 records.  Minimal layers of abstraction.  No politics.  No vendors.  No
 SQL straitjacket.  What would an ILS look like without those things?
 Sometimes the biggest prison is between the ears.

 I am in a position to do this now, and that's what I have decided to
do.
 I am getting busy.

 Peter Schlumpf

 Email Disclaimer:
http://www.co.marin.ca.us/nav/misc/EmailDisclaimer.cfm


   


-- 
---
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234



Re: [CODE4LIB] OCR engine for Persian/Dari

2009-02-04 Thread Han, Yan
Mark,

Many thanks for your input. This is one of the packages that I am thinking of. 
Good to know its accuracy. 

Yan

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Mark 
Jordan
Sent: Tuesday, February 03, 2009 5:36 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] OCR engine for Persian/Dari

Hi again Yan,

There's this one:

http://www.worldlanguage.com/Products/Readiris-Pro-11-Middle-East-Edition-ArabicReadiris-Farsi-Persian-Arabic-Farsi-110226.htm

We have a copy of the Traditional Chinese version of Readiris and find its 
accuracy to be fairly poor (and its performance on latin characters was poor as 
well IIRC), but I can't comment on how this product works with other languages.

Mark

Mark Jordan
Head of Library Systems
W.A.C. Bennett Library, Simon Fraser University
Burnaby, British Columbia, V5A 1S6, Canada
Voice: 778.782.5753 / Fax: 778.782.3023
mjor...@sfu.ca

- Yan Han h...@u.library.arizona.edu wrote:

 Hello, 
 
  
 
 Do you know an OCR engine for Persian/Dari? If so, what is its accuracy
 rate?

 Thanks,

 Yan


[CODE4LIB] Linux tools for making PDFs

2009-02-03 Thread Han, Yan
Hello, 

Do you know a tool running under Linux to make PDFs from images? I use Adobe 
Acrobat Professional on Windows to create PDFs from image files. However, 
Acrobat does not handle image files with East Asian characters.

Yan
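As an editorial aside, one option (not from the thread) is the Pillow imaging library, which can write a multi-page PDF directly from image files; whether it copes with any particular set of East Asian scans is worth testing. A minimal sketch, assuming Pillow is installed:

```python
from pathlib import Path
from PIL import Image  # Pillow, assumed available

def images_to_pdf(image_dir, out_pdf, pattern="*.jpg"):
    """Bundle a directory of images into one multi-page PDF,
    one image per page, in sorted filename order."""
    paths = sorted(Path(image_dir).glob(pattern))
    if not paths:
        raise ValueError(f"no files matching {pattern} in {image_dir}")
    pages = [Image.open(p).convert("RGB") for p in paths]
    first, rest = pages[0], pages[1:]
    first.save(out_pdf, save_all=True, append_images=rest)
    return out_pdf
```

On the command line, ImageMagick (`convert page*.jpg book.pdf`) and the img2pdf tool are common Linux alternatives.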


[CODE4LIB] OCR engine for Persian/Dari

2009-02-03 Thread Han, Yan
Hello, 

Do you know an OCR engine for Persian/Dari? If so, what is its accuracy rate?

Thanks,

Yan


Re: [CODE4LIB] MARC 21 and MODS

2009-01-29 Thread Han, Yan
I clicked 2 URLs, and they are broken. What happened? 


404 Not Found
There is no SKOS Concept, ConceptScheme, or Collection instance in the
registry available using this resource URI.

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Tim Cornwell
Sent: Thursday, January 29, 2009 8:46 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC 21 and MODS

snip
...
 
 As a starting point in exploring semantic web types of 
 technologies we are establishing a registry for controlled 
 values used in various standards-- MARC, MODS, PREMIS. See 
 the text at:
 http://id.loc.gov 
 In the meantime we have a prototype at:
 http://www.loc.gov:8081/standards/registry/lists.html
 
 Rebecca
 
 Rebecca S. Guenther   
 


FYI:

The notion of a vocabulary registry has been investigated and
implemented to some extent
by the folks here:

  http://metadataregistry.org/

...not sure where they stand currently.


-Tim

Timothy Cornwell, Programmer/Analyst
National Science Digital Library (http://nsdl.org)
301 College Avenue
Ithaca,  NY 14850
(607)255-3297


Re: [CODE4LIB] Is there a utility to open a folder of many pdfs and determine if each one will open? (eom)

2009-01-29 Thread Han, Yan
try PDFBox. It can index PDF documents. 


-Original Message-
From: Code for Libraries on behalf of Thomas Dowling
Sent: Wed 1/28/2009 2:37 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Is there a utility to open a folder of many pdfs and 
determine if each one will open? (eom)
 
On 01/28/2009 04:31 PM, Stockwell, Chris wrote:
 Chris Stockwell
 Library Systems Programmer Analyst
 Montana State Library
 cstockw...@mt.gov
 406-444-5352


Your shell of choice should let you run pdfinfo on each one.  It will
either give you sensible information about the PDF file (in which case,
you can assume it's good), or give you an error message.
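A minimal sketch of that loop in Python (the `pdfinfo` binary from poppler/xpdf is assumed to be on PATH; the checker command is a parameter, so anything that exits non-zero on a bad file works):

```python
import subprocess
from pathlib import Path

def check_pdfs(folder, checker=("pdfinfo",)):
    """Run a checker command over every *.pdf in a folder.

    Returns {filename: True/False}; True means the checker exited 0,
    which for pdfinfo means it could parse the file.
    """
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        proc = subprocess.run(
            [*checker, str(pdf)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[pdf.name] = proc.returncode == 0
    return results
```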

-- 
Thomas Dowling
tdowl...@ohiolink.edu


[CODE4LIB] ETD package for ProQuest/UMI old and new delivery platforms

2009-01-21 Thread Han, Yan
Hello, All, 

 

As mentioned before, I have received quite a few inquiries about the
packages. I have created a web page so that you can download them. I
have also made some fixes to the package. The software package does:

* Unzip ProQuest/UMI ETD delivery Zipped files, and create one
directory per ETD. 

* Rename these ETDs into other preferred file names (in my case,
Wang_arizona_0009D_10075.xml -- azu_etd_10075_sip1_m.xml)

* Generate digital signature for digital preservation.

*Create MARC records from ProQuest/UMI XML files. (i.e., an MRK
file will be generated for direct loading to the catalog. I load the MRC file
into Innovative and Koha)

* Create embargo notification and moving embargo ETDs to a
different directory for future loading
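The first three steps above can be sketched in a few lines of stdlib Python (the azu-style naming and the SHA-256 choice are my assumptions for illustration, not necessarily what the Java/Perl package does):

```python
import hashlib
import re
import zipfile
from pathlib import Path

def process_etd_zip(zip_path, out_root):
    """Unzip one ProQuest/UMI delivery into one directory per ETD,
    rename each file by its serial number, and record a checksum
    (fixity value for preservation) per file."""
    out_root = Path(out_root)
    checksums = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # e.g. Wang_arizona_0009D_10075.xml -> serial "10075"
            m = re.search(r"_(\d+)\.(\w+)$", name)
            if not m:
                continue
            serial, ext = m.groups()
            etd_dir = out_root / f"azu_etd_{serial}"
            etd_dir.mkdir(parents=True, exist_ok=True)
            data = zf.read(name)
            new_name = f"azu_etd_{serial}_sip1_m.{ext}"
            (etd_dir / new_name).write_bytes(data)
            checksums[new_name] = hashlib.sha256(data).hexdigest()
    return checksums
```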

Note: My package is based on files received by the U. of Arizona. I do not
have access to other universities'/colleges' files, so I am not sure whether
your university/college has a completely different file naming structure. If
you can email me your university/college file pattern, I might be able
to generate something more flexible.  

 

The download page is available at
http://www.afghanresource.org/joomla/index.php?option=com_content&task=view&id=37&Itemid=52 .
There are two packages available. Please make sure you have a file pattern
similar to the ones listed on the page. 

 

Thanks,

 

Yan Han 

 


[CODE4LIB] software package for Elec. Theses/dissertations

2009-01-07 Thread Han, Yan
Hello, Colleagues,

 

As ProQuest/UMI switched its delivery platform for Electronic Theses and 
Dissertations (ETD), I have developed a small software package to process ETDs. 
The software package does:

1.   Unzip ProQuest/UMI ETD delivery Zipped files, and create one directory 
per ETD.

2.   Rename these ETDs into other preferred file names (in my case, 
Wang_arizona_0009D_10075.xml -> azu_etd_10075_sip1_m.xml) 

3.   Generate digital signature for digital preservation.

4.   Create MARC records from ProQuest/UMI XML files. (i.e., an MRK file 
will be generated for direct loading to the catalog. I use III Innovative and Koha)

5.   Create embargo notification and moving embargo ETDs to a different 
directory for future loading

 

This package saves me a lot of time processing hundreds of ETDs. The package 
(about 50 KB) contains compiled Java code (class files) and Perl scripts. 
Currently I run it on Linux, but it can also be run on Windows. 

 

If anyone wants to have it or give it a try, please contact me. 

 

p.s. I also have a package handling ProQuest old platform (BePress) ETD files. 

 

Thanks, 

 

Yan Han

The University of Arizona Libraries