Hi Everyone,
I posted this over on the Archivists' Toolkit listserv and got no response
(yet), so I thought I might try here as well.
I have a large number (300+) of digital objects that I need to add
to Archivists' Toolkit. I think I've figured out what queries I need to
run in order
Two-Year Research Fellowship in Digital Curation
Journalism and Mass Communication
University of Colorado at Boulder
We are seeking to hire a research fellow with a degree in Library and/or
Information Science, or an arts, humanities or social science discipline in
which the candidate has
It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory
structure to the byte-offsets in the fixed fields. The values in these places
all assume 8-bit character data; it's completely baked into the file format.
-Tod
On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:
We cried our eyes out in 1976 when this first came to our attention at the BL.
Even more crying when we couldn't get rid of it in the MARC-I to MARC-II
conversion (well before MARC21 was even a twinkle) - a lot of tears are
gathering somewhere.
Peter
-Original Message-
From: Code
On 4/18/2012 6:04 AM, Tod Olson wrote:
It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory
structure to the byte-offsets in the fixed fields. The values in these places
all assume 8-bit character data; it's completely baked into the file format.
I'm not sure that
Hi Tod,
I'm not understanding how UTF-8 would be considered 8-bit character data (other
than the ASCII-range of the Unicode repertoire, natch). I don't think ISO 2709
knows from characters, only bytes.
-- Michael
# Michael Doran, Systems Librarian
# University of Texas at Arlington
#
In fact, I worry that the standard may pre-date UTF-8, with its
reference to UCS --- if I understand things right, at one point
there was only one Unicode encoding, called UCS, which is basically a
backwards-compatible subset of what became UTF-16.
So I worry the standard really means
UTF-8 was the MARC standard from the beginning:
http://www.loc.gov/marc/marbi/1998/98-18.html
The first proposals were a character mapping between Unicode and MARC-8
and didn't mention the character encodings, thus the term UCS which
was a common term for Unicode at that time. (see:
I could be mistaken (never having had the pleasure of reading it), but
isn't ISO-2709 specified as a fixed number of characters, and any
conflation of characters and 8-bit bytes is on the part of users and
implementations?
I think ISO 2709 might not know from bytes, only characters.
I could be mistaken (never having had the pleasure of reading it), but
isn't ISO-2709 specified as a fixed number of characters, and any
conflation of characters and 8-bit bytes is on the part of users and
implementations?
I don't believe that is the case. Take UTF-8 out of the picture, and
Rosalyn,
I've written a number of scripts of this nature. Here's a quick one I wrote
recently to add DAOs to our AT for an audio digitization project (note it does
not include file versions, just Components, Instances and DAOs).
It starts at the ResourceComponent identified by the long at the
I don't know about ISO 2709 itself, but the MARC21 implementation of
it refers to octets, aka 8-bit bytes:
http://www.loc.gov/marc/specifications/specrecstruc.html
Characters may be encoded using one or more than one octet, depending
on the character set. All ASCII characters are encoded using
-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, April 17, 2012 19:55
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] more on MARC char encoding: Now we're about
ISO_2709 and MARC21
Okay, forget XML for a
ISO 2709 doesn't care how many bytes your characters are. The directory
and offsets and other things count bytes, not characters.
That was exactly my point. (Which I am stating since you quoted me and I
couldn't tell if you were refuting my point, or using it to support your
conclusion.)
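
To make the bytes-versus-characters distinction concrete, here is a minimal sketch in Python. It assumes the MARC 21 directory entry layout (3-character tag, 4-digit field length, 5-digit starting position) and a made-up 100 field; the helper name `directory_entry` and the sample heading are hypothetical, chosen only for illustration.

```python
# Sketch only: shows that a directory entry's length counts octets,
# not characters, once the data is UTF-8.

FIELD_TERMINATOR = b"\x1e"    # ends every variable field
SUBFIELD_DELIMITER = b"\x1f"  # precedes each subfield code


def directory_entry(tag: str, field: bytes, start: int) -> str:
    """Build a 12-character directory entry (hypothetical helper).

    The length is taken from the encoded bytes, terminator included.
    """
    return f"{tag}{len(field):04d}{start:05d}"


# Sample heading with three non-ASCII letters (written as escapes so the
# precomposed forms are unambiguous).
name = "Dvo\u0159\u00e1k, Anton\u00edn"

char_count = len(name)                  # Unicode code points
byte_count = len(name.encode("utf-8"))  # octets as stored in the record

# Indicators "1 ", subfield $a, the UTF-8 data, then the field terminator.
field = b"1 " + SUBFIELD_DELIMITER + b"a" + name.encode("utf-8") + FIELD_TERMINATOR
entry = directory_entry("100", field, 0)

print(char_count, byte_count, entry)
```

The heading is 15 characters but 18 octets, so any code that fills in the directory from a character count would produce offsets that are too short.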
* Apologies for cross-posting *
We're excited to invite you all to the third annual Islandora Camp
(Aug 1-3, 2012). Islandora Camp welcomes developers, administrators,
and users of Islandora to meet, learn, and grow the ecosystem!
Registration for Islandora Camp is now open, and is available
In practice it seems to mean UTF-8. At least I've only seen UTF-8, and I can't
imagine the code that processes this stuff being safe for UTF-16 or UTF-32. All
of the offsets are byte-oriented, and there's too much legacy code that makes
assumptions about null-terminated strings.
-Tod
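
A small Python illustration of the null-terminated-string concern, as a sketch rather than a claim about any particular MARC library: the function `c_strlen` is a hypothetical stand-in for C-style `strlen`, which stops at the first zero byte.

```python
# Sketch only: why code that assumes null-terminated strings copes with
# UTF-8 but not with UTF-16.


def c_strlen(data: bytes) -> int:
    """Mimic C strlen: count bytes up to the first 0x00 (hypothetical helper)."""
    idx = data.find(b"\x00")
    return len(data) if idx == -1 else idx


text = "MARC"
utf8 = text.encode("utf-8")       # 4D 41 52 43
utf16 = text.encode("utf-16-be")  # 00 4D 00 41 00 52 00 43

print(c_strlen(utf8), c_strlen(utf16))
```

UTF-8 never produces a zero byte except for U+0000 itself, so the legacy assumption holds; in UTF-16 every ASCII character carries a zero byte, and the string appears to end immediately.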
On Apr
The Johns Hopkins University Sheridan Libraries is hiring a Records Management
Archivist to work with the University Archivist to develop an innovative
approach to records management with the purpose of improving our stewardship
of a university history that exists in print, digitized, and