Re: [CODE4LIB] free source for issn-periodical-type data?
Just a quick note: the correct URL for ONIX for Serials is http://www.editeur.org/17/ONIX-for-Serials/ - note that this is a family of standards, so it covers a very wide range of data types and content. The code lists Tom mentioned are available there in human-readable form. Also: it sounded to me as though Ken was after an actual database of the journal product type information - something like a serials in print database? -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Tom Pasley Sent: 16 April 2012 22:15 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] free source for issn-periodical-type data? Hi Ken, Actually, I'm not sure this will answer all of your needs - although it does cover peer-review: Metadata fields for an ISSN A number of metadata fields can be associated with an ISSN number: - form: Each ISSN has a production form, indicated by an ONIX production form code (http://www.editeur.org/onixserials.html). Currently supported values include: JB (Printed serial), JC (Serial distributed electronically by carrier), JD (Electronic serial distributed online), MA (Microform) - oclcnum: Oclcnum - peerreview: Peerreview, 'Y' if the ISSN is peer-reviewed, 'N' if the ISSN is not peer-reviewed. - publisher: Publisher - rawcoverage: Human-readable Coverage - title: Title - issnl: Linking ISSN, as defined here: http://www.issn.org/2-22637-What-is-an-ISSN-L.php - rssurl: Journal feed URL, data obtained from ticTOCS (http://www.tictocs.ac.uk/) T. On Tue, Apr 17, 2012 at 1:33 AM, Ken Irwin kir...@wittenberg.edu wrote: Hi folks, Does anyone know of a free data source that correlates ISSNs with data that includes what kind of publication is this? e.g. *Academic journal (+/- peer review?) *Popular magazine *Newspaper *Trade journal *Etc Obviously, there's some wiggle room in these designations, and I don't need a super-solid answer. 
I've been asked to supply information about our academic journal collection, and I don't have a particularly good way of differentiating between our e-journals and e-magazines, for instance. Individual suppliers might make these distinctions, but I'm really hoping that a query-able (or, better: downloadable) file exists. Any ideas? Thanks Ken
[CODE4LIB] Job: Archivist, Institute of Jazz Studies at Rutgers-Newark
RESPONSIBILITIES: The Rutgers University Libraries seek an experienced, innovative, and service-oriented librarian to fill the position of Archivist in the Institute of Jazz Studies, John Cotton Dana Library on the Newark Campus of Rutgers, The State University of New Jersey. Reporting to the Director of the Institute of Jazz Studies, the Archivist will take a leadership role in the management and oversight of archival and research collections in IJS in ensuring the effective provision of library and information services to the diverse community of users. Will receive, arrange, describe, preserve and create finding aids using best practices and cutting-edge techniques for the Institute's archival collections, which consist of music manuscripts, personal papers, photographs, memorabilia, and other materials. Will provide in-depth assistance to visiting researchers and scholars as well as respond to requests by mail, email and phone. Will provide materials for the media, performing arts and other organizations. Will identify, solicit and steward donors, and advise the Dana Library Director on the acceptance of gift collections for the Institute. Will oversee the activities of grant-funded archivists. Will supervise student workers and interns, including the provision of training in archival practices. Will provide outreach, enhancing the visibility of the Institute and its collections, by conducting tours of the Institute and preparing exhibits, and represent the Institute at professional meetings and conferences. Will collaborate in the Libraries' digitization efforts. As a member of a university-wide faculty, the Archivist is expected to participate in system-wide initiatives, committees, and task forces, and to demonstrate commitment to continual professional development through scholarly research relevant to areas of responsibility, including publications, presentations and participation and leadership in the work of relevant professional associations. 
QUALIFICATIONS: A record of professional experience in an academic or research library, archives, or similar setting, with emphasis on experience in archival processing, management and preservation. Extensive knowledge of and experience in the development of EAD finding aids. Extensive knowledge of issues relating to managing and preserving digital collections. Awareness of national issues and trends in archives and in collections services. Must have the ability and desire to meet tenure and promotion requirements. Thus, the successful candidate will have a Master's degree from an ALA-accredited institution and/or a Master's degree in Archival Studies. Knowledge of or familiarity with jazz history is desired. SALARY: Salary and rank will be commensurate with qualifications and experience. STATUS/BENEFITS: Faculty status, calendar year appointment, retirement plans, life/health insurance, prescription drug, dental and eyeglass plans, tuition remission, one month vacation. LIBRARY AND UNIVERSITY PROFILE: Rutgers University is a member of the Association of American Universities. The university, spread over three regional campuses, includes over 50,000 graduate and undergraduate students and 2,500 faculty, engaged in numerous degree-granting, research and professional programs in all disciplines, as well as a broad spectrum of service programs for the state. Situated on 35 acres in downtown Newark, Rutgers-Newark is part of a dynamic urban environment and is positioned to take a leading role in the further revitalization of Newark. The Newark campus is a doctoral-degree granting research institution, classified as a Carnegie Research Intensive institution. Rutgers-Newark offers 14 doctoral programs: American studies, applied physics, behavioral and neural science, biology, chemistry, criminal justice, environmental science, global affairs, management, mathematical sciences, nursing, psychology, public administration, and urban systems. 
With more than 11,000 graduate and undergraduate students and anticipated Brought to you by code4lib jobs: http://jobs.code4lib.org/job/891/
[CODE4LIB] Job: Senior Web Development and User Experience Technician, Discovery Systems at Queen's University
**Description and Duties:** Within the framework of established policies, regulations and procedures, in consultation with the Systems Coordinator, the Division Head of Discovery Systems and other Discovery Systems staff, the incumbent provides technical expertise and support for the Library's web presence. Duties include development of new web applications and maintenance and enhancement of existing web applications from simple to complex; ensuring the smooth operation of designated Library web software systems (e.g. Drupal, WordPress, DokuWiki, and in-house web applications) with appropriate documentation, back-up, maintenance and upgrades; diagnosis, research and troubleshooting of problems with the Library's web presence, escalating issues as necessary, and documenting solutions; exploring new web software systems; providing user support for web systems; assisting in development of web analytics tracking reports and in implementing analytics and usability driven web page modifications; providing back-up for other senior Discovery Systems technicians in the areas of user support, database administration, and user support. **Qualifications** Recent college diploma or other post-secondary education specializing in web application development and user driven design, and a minimum of one year of proven and recent experience in complex database driven website development, preferably in a high-demand user-centred environment OR The equivalent combination of education and experience which must include a minimum of two years' proven and recent experience in complex database and user focused website development, preferably in a high-demand user-centred environment. Proven proficiency with PHP, SQL, relational databases (e.g. MySQL, Oracle, PostgreSQL), HTML, CSS, CVS/SVN. Desirable experience: Unix system administration, Apache administration, experience with web application performance tuning. 
**Desirable experience:** Proven experience with library specific web applications (e.g. ILS, Discovery Layer, Open Journal Software, Institutional Repository, OpenURL resolver). Proven experience working in a team environment an asset. Familiarity with Queen's computing infrastructure an asset. Brought to you by code4lib jobs: http://jobs.code4lib.org/job/892/
[CODE4LIB] MarcXML and char encodings
I know how char encodings work in MARC ISO binary -- the encoding can legally be either Marc8 or UTF8 (nothing else). The encoding of a record is specified in its header. In the wild, specified encodings are frequently wrong, or data includes weird mixed encodings. Okay! But what's going on with MarcXML? What are the legal encodings for MarcXML? Only Marc8 and UTF8, or anything that can be expressed in XML? The MARC header is (or can be) present in MarcXML -- trust the MARC header, or trust the XML doctype char encoding? What's the legal thing to do? What's actually found 'in the wild' with MarcXML? Can anyone advise? Jonathan
Re: [CODE4LIB] MarcXML and char encodings
There are probably a couple of answers to that. XML rules define what character set is used. The encoding attribute on the <?xml?> header is where you find out what character set is being used. I've always gone under the assumption that if an encoding wasn't specified, then UTF-8 is in effect and that has always worked for me. It turns out the standard says US-ASCII is the default encoding. But, ignoring the encoding, the original MarcXML rules were the same as the MARC-21 rules for character repertoire and you were supposed to restrict yourself to characters that could be mapped back into MARC-8. I don't know if that rule is still in force, but everyone ignores it. I hope that helps! Ralph -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind Sent: Tuesday, April 17, 2012 12:35 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: MarcXML and char encodings I know how char encodings work in MARC ISO binary -- the encoding can legally be either Marc8 or UTF8 (nothing else). The encoding of a record is specified in its header. In the wild, specified encodings are frequently wrong, or data includes weird mixed encodings. Okay! But what's going on with MarcXML? What are the legal encodings for MarcXML? Only Marc8 and UTF8, or anything that can be expressed in XML? The MARC header is (or can be) present in MarcXML -- trust the MARC header, or trust the XML doctype char encoding? What's the legal thing to do? What's actually found 'in the wild' with MarcXML? Can anyone advise? Jonathan
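As a rough illustration of the "no encoding specified" case Ralph describes, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The tool choice and the sample `title` element are assumptions made for illustration; nobody in the thread uses them. With this parser, bytes without an encoding declaration are read as UTF-8, and the same text saved as ISO-8859-1 is rejected unless the declaration names that encoding.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_bytes: bytes):
    """Return the root element's text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(xml_bytes).text
    except ET.ParseError:
        return None


# Hypothetical sample data: the same short document in three byte forms.
sample = "<title>Caf\u00e9</title>"
no_decl_utf8 = sample.encode("utf-8")
no_decl_latin1 = sample.encode("iso-8859-1")
declared_latin1 = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>' + sample
).encode("iso-8859-1")

print(try_parse(no_decl_utf8))     # decoded as UTF-8
print(try_parse(no_decl_latin1))   # rejected: not valid UTF-8
print(try_parse(declared_latin1))  # decoded using the declared encoding
```

Other XML libraries may behave differently on malformed input, so treat this as one data point rather than a statement of what the standard requires.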
Re: [CODE4LIB] MarcXML and char encodings
What's the legal thing to do? What's actually found 'in the wild' with MarcXML? In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it. kyle -- -- Kyle Banerjee Digital Services Program Manager Orbis Cascade Alliance baner...@uoregon.edu / 503.999.9787
Re: [CODE4LIB] MarcXML and char encodings
So what if the <?xml?> declaration says one charset encoding, but the MARC header included in the MarcXML says a different encoding... which one is the 'legal' one to believe? Is it legal to have MarcXML that is not UTF-8 _or_ Marc8, that is an entirely different charset that is legal in XML? If you did that, what should the MARC header included in the XML say? I know how char encodings work in XML. I don't understand what the standards say about how that interacts with the MARC data in MarcXML. Jonathan On 4/17/2012 1:51 PM, LeVan,Ralph wrote: There are probably a couple of answers to that. XML rules define what character set is used. The encoding attribute on the <?xml?> header is where you find out what character set is being used. I've always gone under the assumption that if an encoding wasn't specified, then UTF-8 is in effect and that has always worked for me. It turns out the standard says US-ASCII is the default encoding. But, ignoring the encoding, the original MarcXML rules were the same as the MARC-21 rules for character repertoire and you were supposed to restrict yourself to characters that could be mapped back into MARC-8. I don't know if that rule is still in force, but everyone ignores it. I hope that helps! Ralph -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind Sent: Tuesday, April 17, 2012 12:35 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: MarcXML and char encodings I know how char encodings work in MARC ISO binary -- the encoding can legally be either Marc8 or UTF8 (nothing else). The encoding of a record is specified in its header. In the wild, specified encodings are frequently wrong, or data includes weird mixed encodings. Okay! But what's going on with MarcXML? What are the legal encodings for MarcXML? Only Marc8 and UTF8, or anything that can be expressed in XML? The MARC header is (or can be) present in MarcXML -- trust the MARC header, or trust the XML doctype char encoding? 
What's the legal thing to do? What's actually found 'in the wild' with MarcXML? Can anyone advise? Jonathan
Re: [CODE4LIB] MarcXML and char encodings
On 4/17/2012 1:57 PM, Kyle Banerjee wrote: In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it. kyle So would you use the Marc header payload instead? Or you're just saying you wouldn't trust _any_ encoding declarations you find anywhere? When writing a library to handle marc, I think the base line should be making it do the official legal standards-compliant right thing. Extra heuristics to deal with invalid data can be added on top. But my trouble here is I can't even figure out what the official legal standards-compliant thing is. Maybe that's because the MarcXML standard simply doesn't address it, and it's all implementation dependent. sigh. The problem is how the XML document's own char encoding is supposed to interact with the MARC header; especially because there's no way to put Marc8 in an XML char encoding doctype (is there?); and whether encodings other than Marc8 or UTF8 are legal in MarcXML, even though they aren't in MARC ISO binary. I think the answer might be nobody knows, and there is no standard right way to do it. Which is unfortunate.
Re: [CODE4LIB] MarcXML and char encodings
Okay, maybe here's another way to approach the question. If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML -- is this legal at all? And if so, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? On 4/17/2012 1:57 PM, Kyle Banerjee wrote: What's the legal thing to do? What's actually found 'in the wild' with MarcXML? In some cases, invalid XML. In an ideal world, the encoding should be included in the declaration. But I wouldn't trust it. kyle
[CODE4LIB] Job: Director of Library Information Technology Production Services at University of Illinois at Urbana-Champaign
**Director of Library Information Technology Production Services** Academic Professional Position University of Illinois at Urbana-Champaign **Position Available**: This position is available July, 2012. This is a 100%-time, twelve-month appointment Academic Professional position. **Duties and Responsibilities**: The University of Illinois at Urbana-Champaign seeks an innovative, collaborative, and service-oriented professional for the position of Director of Library Information Technology Production Services. The University Library maintains a robust infrastructure for digital collections and services that supports the needs of nearly 100 million virtual visitors each year. Reporting to the Associate University Librarian for Information Technology Planning and Policy, the successful candidate will oversee the staff, technology support, networking, infrastructure, and applications support for Library enterprise IT systems, including Infrastructure Management and Support (IMS), Workstation and Network Support (WNS), and Help Desk (HD) services. See https://jobs.illinois.edu for complete list of duties. 
**Qualifications: _Required_:** A Bachelor's degree; experience in a library or academic computing services setting; demonstrated experience implementing user-focused or customer service technology services in a high-volume academic setting; project management experience in substantial computing or information system implementations or migrations; experience supervising and mentoring technical professionals; ability to facilitate effective prioritization and collaboration on projects with multiple customer groups, including domain experts, academic users at various skill levels, and IT professionals; demonstrated ability to lead and manage professional staff, to make decisions in a collaborative team environment, and successfully direct and support multiple production operations; excellent oral and written communication skills; familiarity with data storage requirements. See https://jobs.illinois.edu for list of preferred qualifications. **To Apply**: To ensure full consideration, please complete your candidate profile at https://jobs.illinois.edu and upload a letter of interest, resume, and contact information including email addresses for three professional references. Applications not submitted through this website will not be considered. For questions, please call: 217-333-8169. **Deadline**: In order to ensure full consideration, applications must be received by May 14, 2012. Illinois is an Affirmative Action /Equal Opportunity Employer and welcomes individuals with diverse backgrounds, experiences, and ideas who embrace and value diversity and inclusivity. www.inclusiveillinois.illinois.edu Brought to you by code4lib jobs: http://jobs.code4lib.org/job/893/
Re: [CODE4LIB] MarcXML and char encodings
If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? I'm going out on a limb here, but I don't think it is legal. There is no formal encoding that corresponds to MARC-8, so there's no way to tell XML tools how to interpret the bytes. If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? <?xml version="1.0" encoding="UTF-8"?> I suppose you'll want to set the leader to UTF-8 as well, but it doesn't really matter to any XML tools. If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML -- is this legal at all? And if so, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? I'd claim this is legal, if it is legal XML. Set your encoding to anything that is valid. As a Java programmer, using Java XML tools, the encoding is just a hint to the tools. I end up with Unicode strings after the XML is read. So I always ignore the encoding byte in the leader. Following that logic, that byte is about encoding. It has meaning when ISO 2709 is the transfer mechanism. But, in this case, XML is the transfer mechanism and its rules for identifying the encoding are what matter. I'm proposing that the encoding byte in the leader is meaningless. Ralph
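Ralph's point, that an XML parser hands back Unicode strings according to the XML declaration and never looks at the leader, can be sketched in Python with the standard-library `xml.etree.ElementTree`. This is only an illustration under assumptions: the record is made up, `make_record` and `read_title` are hypothetical helper names, and it shows what one parser does, not what any standard mandates.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.loc.gov/MARC21/slim}"


def make_record(xml_encoding: str, leader09: str) -> bytes:
    """Serialize a tiny, made-up MarcXML record in the given XML encoding."""
    leader = "00000nam " + leader09 + "2200000 a 4500"  # 24 characters
    text = (
        f'<?xml version="1.0" encoding="{xml_encoding}"?>'
        '<record xmlns="http://www.loc.gov/MARC21/slim">'
        f"<leader>{leader}</leader>"
        '<datafield tag="245" ind1="0" ind2="0">'
        '<subfield code="a">Caf\u00e9</subfield>'
        "</datafield></record>"
    )
    return text.encode(xml_encoding)


def read_title(xml_bytes: bytes):
    """Return (leader position 09, 245$a) as decoded by the XML parser."""
    root = ET.fromstring(xml_bytes)
    leader = root.find(f"{NS}leader").text
    title = root.find(f"{NS}datafield/{NS}subfield").text
    return leader[9], title


# Leader/09 blank (MARC-8 in binary MARC) while the XML is really UTF-8.
print(read_title(make_record("UTF-8", " ")))
# Leader/09 'a' (UTF-8 in binary MARC) while the XML is really ISO-8859-1.
print(read_title(make_record("ISO-8859-1", "a")))
```

In both calls the title comes back as the same Unicode string, whatever leader position 09 claims. A MARC library could still choose to inspect that byte itself; the XML layer simply does not.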
[CODE4LIB] Code4Lib West Registration Form: July 30, 2012
The University of Oregon Libraries and Oregon State University Libraries invite you to code4lib west, Monday, July 30, 2012, at the UO Knight Library. There is no registration fee for this conference. Registration is limited to 50 participants. All participants are expected to deliver a lightning talk. In the event registration fills up quickly, limits on participation per institution may be employed. Your registration is not confirmed until you receive an email. Registrations will be confirmed by April 30, 2012. URL: https://docs.google.com/spreadsheet/viewform?formkey=dGRFM0Zob1dsNEE2RU9VY25SNlllUEE6MQ --TR *** Terry Reese, Associate Professor Gray Family Chair for Innovative Library Services 121 Valley Library Corvallis, OR 97331 tel: 541.737.6384 ***
Re: [CODE4LIB] MarcXML and char encodings
Hi Ralph, But, ignoring the encoding, the original MarcXML rules were the same as the MARC-21 rules for character repertoire and you were supposed to restrict yourself to characters that could be mapped back into MARC-8. I don't know if that rule is still in force, but everyone ignores it. That rule no longer applies per the December 2007 revision of the MARC 21 Specifications: "To facilitate the movement of records between MARC-8 and Unicode environments, it was recommended for an initial period that the use of Unicode be restricted to a repertoire identical in extent to the MARC-8 repertoire. [...] however, such a restriction is no longer appropriate. The full UCS repertoire, as currently defined at the Unicode web site, is valid for encoding MARC 21 records subject only to the constraints described [in the current MARC 21 Specifications]." -- from MARC 21 Specifications (revised December 2007) [1] -- Michael [1] http://www.loc.gov/marc/specifications/speccharucs.html -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of LeVan,Ralph Sent: Tuesday, April 17, 2012 12:51 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MarcXML and char encodings There are probably a couple of answers to that. XML rules define what character set is used. The encoding attribute on the <?xml?> header is where you find out what character set is being used. I've always gone under the assumption that if an encoding wasn't specified, then UTF-8 is in effect and that has always worked for me. It turns out the standard says US-ASCII is the default encoding. But, ignoring the encoding, the original MarcXML rules were the same as the MARC-21 rules for character repertoire and you were supposed to restrict yourself to characters that could be mapped back into MARC-8. I don't know if that rule is still in force, but everyone ignores it. I hope that helps! 
Ralph -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind Sent: Tuesday, April 17, 2012 12:35 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: MarcXML and char encodings I know how char encodings work in MARC ISO binary -- the encoding can legally be either Marc8 or UTF8 (nothing else). The encoding of a record is specified in its header. In the wild, specified encodings are frequently wrong, or data includes weird mixed encodings. Okay! But what's going on with MarcXML? What are the legal encodings for MarcXML? Only Marc8 and UTF8, or anything that can be expressed in XML? The MARC header is (or can be) present in MarcXML -- trust the MARC header, or trust the XML doctype char encoding? What's the legal thing to do? What's actually found 'in the wild' with MarcXML? Can anyone advise? Jonathan
Re: [CODE4LIB] MarcXML and char encodings
Thanks, this is helpful feedback at least. I think it's completely irrelevant, when determining what is legal under standards, to talk about what certain Java tools happen to do, though; I don't care too much what some tool you happen to use does. In this case, I'm _writing_ the tools. I want to make them do 'the right thing', with some mix of what's actually official legally correct and what's practically useful. What your Java tools do is more or less irrelevant to me. I certainly _could_ make my tool respect the Marc leader encoded in MarcXML over the XML declaration if I wanted to. I could even make it assume the data is Marc8 in XML, even though there's no XML charset type for it, if the leader says it's Marc8. But do others agree that there is in fact no legal way to have Marc8 in MarcXML? Do others agree that you can use non-UTF8 encodings in MarcXML, so long as they are legal XML? I won't even ask someone to cite standards documents, because it's pretty clear that LC forgot to consider this when establishing MarcXML. (And I have no faith that one could get LC to make a call on this and publish it any time this century). Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How is it represented with regard to the XML leader and the Marc header? Has anyone seen any MarcXML with char encodings that are neither Marc8 nor UTF8 in the wild? Are they common? How are they represented with regard to XML leader and Marc header? On 4/17/2012 2:32 PM, LeVan,Ralph wrote: If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? I'm going out on a limb here, but I don't think it is legal. There is no formal encoding that corresponds to MARC-8, so there's no way to tell XML tools how to interpret the bytes. If I want to have a MarcXML document encoded in UTF8, what should it look like? 
What should be in the XML declaration? What should be in the MARC header embedded in the XML? <?xml version="1.0" encoding="UTF-8"?> I suppose you'll want to set the leader to UTF-8 as well, but it doesn't really matter to any XML tools. If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML
Re: [CODE4LIB] MarcXML and char encodings
Re: But do others agree that there is in fact no legal way to have Marc8 in MarcXML? No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog, and you will want to be aware that XML processors are only REQUIRED to process UTF-8 and UTF-16 -- in practice many (including Java-based ones) can handle other encodings -- but you will have to make sure whatever XML processor you use, in whatever language it is written, has a handy-dandy MARC8 coder/decoder ring. Sheila -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind Sent: Tuesday, April 17, 2012 2:46 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MarcXML and char encodings Thanks, this is helpful feedback at least. I think it's completely irrelevant, when determining what is legal under standards, to talk about what certain Java tools happen to do, though; I don't care too much what some tool you happen to use does. In this case, I'm _writing_ the tools. I want to make them do 'the right thing', with some mix of what's actually official legally correct and what's practically useful. What your Java tools do is more or less irrelevant to me. I certainly _could_ make my tool respect the Marc leader encoded in MarcXML over the XML declaration if I wanted to. I could even make it assume the data is Marc8 in XML, even though there's no XML charset type for it, if the leader says it's Marc8. But do others agree that there is in fact no legal way to have Marc8 in MarcXML? Do others agree that you can use non-UTF8 encodings in MarcXML, so long as they are legal XML? I won't even ask someone to cite standards documents, because it's pretty clear that LC forgot to consider this when establishing MarcXML. (And I have no faith that one could get LC to make a call on this and publish it any time this century). Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? 
How is it represented with regard to the XML leader and the Marc header? Has anyone seen any MarcXML with char encodings that are neither Marc8 nor UTF8 in the wild? Are they common? How are they represented with regard to XML leader and Marc header? On 4/17/2012 2:32 PM, LeVan,Ralph wrote: If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? I'm going out on a limb here, but I don't think it is legal. There is no formal encoding that corresponds to MARC-8, so there's no way to tell XML tools how to interpret the bytes. If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? <?xml version="1.0" encoding="UTF-8"?> I suppose you'll want to set the leader to UTF-8 as well, but it doesn't really matter to any XML tools. If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML
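The practical side of the "coder/decoder ring" remark can be checked for one toolchain. The sketch below assumes Python's standard library only, and the encoding label `x-marc-8` is a hypothetical name chosen for illustration, not a registered or agreed one. It asks whether a codec by that name exists and whether `xml.etree.ElementTree` will parse a document that declares it.

```python
import codecs
import xml.etree.ElementTree as ET


def codec_known(name: str) -> bool:
    """True if Python's codec registry can resolve this encoding name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def parser_accepts(xml_bytes: bytes) -> bool:
    """True if ElementTree parses the document without raising an error."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except (LookupError, ValueError, ET.ParseError):
        return False


# "x-marc-8" is a made-up label; the document body itself is plain ASCII.
marc8_doc = b'<?xml version="1.0" encoding="x-marc-8"?><record/>'
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><record/>'

print(codec_known("x-marc-8"), parser_accepts(marc8_doc))
print(codec_known("utf-8"), parser_accepts(utf8_doc))
```

In a stock Python installation the first line reports that neither the codec registry nor the parser recognizes the label. A toolchain that shipped its own MARC-8 decoder and wired it into its XML parser could behave differently.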
Re: [CODE4LIB] MarcXML and char encodings
Jonathan Rochkind Sent: Tuesday, April 17, 2012 14:18 Subject: Re: [CODE4LIB] MarcXML and char encodings Okay, maybe here's another way to approach the question. If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all? If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML -- is this legal at all? And if so, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? You cannot have a MARC-XML document encoded in MARC-8, well sort of, but it's not standard. To answer your questions you have to refer to a variety of standards: http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncodingDecl In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix. 
XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings). In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration. 1) The above says that <?xml version="1.0"?> means the same as <?xml version="1.0" encoding="utf-8"?> and if you prefer you can omit the XML declaration and that is assumed to be UTF-8 unless there is a BOM (Byte Order Mark) which determines UTF-8 vs UTF-16BE vs UTF-16LE. 2) If you really wanted to encode the XML in MARC-8 you need to specify x- since if you refer to: http://www.iana.org/assignments/character-sets MARC-8 isn't a registered character set, hence cannot be specified in the encoding attribute unless the name was prefixed with x-. Which implies that no standard XML library will know how to convert the MARC-8 characters into Unicode so the XML DOM can be used. So unless you want to write your own MARC-8/Unicode conversion routines and integrate them into your preferred XML library it isn't going to work out of the box for anyone else but yourself. When dealing with MARC-XML you should ignore the values in LDR/00-04, LDR/10, LDR/11, LDR/12-16, LDR/20-23. If you look at the MARC-XML schema you will note that the definition for leaderDataType specifies LDR/00-04 [\d ]{5}, LDR/10 and LDR/11 (2| ), LDR/12-16 [\d ]{5}, LDR/20-23 (4500|    ). 
Note the MARC-XML schema allows spaces in those positions because they are not relevant in the XML format, though very relevant in the binary format. You probably should ignore LDR/09 since most MARC to MARC-XML converters do not change this value to 'a' although many converters do change the value when converting MARC binary between MARC-8 and UTF-8. The only valid character set for MARC-XML is Unicode and it *should* be encoded in UTF-8 in Unicode normalization form D (NFD) although most XML libraries will not know the difference if it was encoded as UTF-16BE or UTF-16LE in Unicode normalization form D since the XML libraries internally work with Unicode. I could have sworn that this information was specified on LC's site at one point in time, but I'm having trouble finding the documentation. Hope this helps, Andy.
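The rule from the XML Recommendation that Andy summarises in point 1 can be sketched in a few lines of Python. This is only an illustration, assuming we look at nothing but the first bytes of the file (no HTTP or MIME header) and that the input is ASCII-compatible or starts with a BOM; the function name `declared_encoding` is made up for this example, and a real XML parser does considerably more.

```python
import codecs
import re

# Encoding pseudo-attribute of an XML declaration at the very start of the input.
_DECL = re.compile(rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def declared_encoding(raw: bytes) -> str:
    """Return the encoding an XML processor would assume from the leading bytes.

    Simplified on purpose: UTF-32 BOMs and non-ASCII-compatible encodings
    without a BOM are not handled in this sketch.
    """
    # A byte order mark wins over everything else.
    if raw.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    # Otherwise use the name in the encoding declaration, if there is one.
    match = _DECL.match(raw)
    if match:
        return match.group(1).decode("ascii")
    # No BOM and no declaration: the document has to be UTF-8.
    return "UTF-8"
```

Note that the function will happily return a made-up name such as x-marc-8; whether any parser can then decode the bytes is a separate question, which is Andy's point 2.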
[CODE4LIB] Job: Web Developer at Michigan Technological University
Michigan Technological University's Van Pelt and Opie Library seeks an energetic, user-focused and collegial Web developer who enjoys working with library and IT staff, faculty, and students on a variety of projects that support library services, instruction and research.

Michigan Technological University (mtu.edu) is a leading public research university developing new technologies and preparing students to create the future for a prosperous and sustainable world. Michigan Tech offers more than 130 undergraduate and graduate degree programs in engineering; forest resources; computing; technology; business; economics; natural, physical and environmental sciences; arts; humanities; and social sciences.

Located on the Keweenaw Peninsula in Michigan's picturesque and peaceful Upper Peninsula, Houghton was recently named one of America's best small towns, as well as one of the 10 Top Adrenaline Outposts by National Geographic. Four-season outdoor activities range from downhill and cross-country skiing to hiking, birding and fishing. The university's Rozsa Center for the Performing Arts, the Calumet Theatre, Pine Mountain Festival and numerous year-round heritage and international festivities provide a range of art, music and craft opportunities.

For more information and to apply, go to: https://www.jobs.mtu.edu/postings/468

Michigan Technological University is an equal opportunity educational institution/equal opportunity employer, committed to excellence through diversity in education and employment.

Brought to you by code4lib jobs: http://jobs.code4lib.org/job/894/
Re: [CODE4LIB] MarcXML and char encodings
> So would you use the Marc header payload instead? Or you're just saying you wouldn't trust _any_ encoding declarations you find anywhere?

This. The short version is that too many vendors and systems just supply some value without making sure that's what they're spitting out. I haven't had to mess with this stuff for a few years, so I'm hoping Terry Reese weighs in on this conversation -- he has a lot of experience dealing with encoding headaches.

However, the bottom line is that the most reliable method is to use heuristics to detect what's going on. Yeah, that totally kills the point of listing encodings in the first place, but as with any unreliably used data point, it's all GIGO.

> When writing a library to handle marc, I think the base line should be making it do the official legal standards-compliant right thing. Extra heuristics to deal with invalid data can be added on top.

I'm hoping things have improved, but if heuristics are more reliable than reading the right areas of the record, you have to ignore what's there (which makes even reading it pointless). I do think there is value in encouraging vendors to actually pay attention to this stuff, as such basic screwups undermine both the credibility of the data source and the service that depends on the data.

> But my trouble here is I can't even figure out what the official legal standards-compliant thing is. Maybe that's because the MarcXML standard simply doesn't address it, and it's all implementation dependent. sigh.
>
> The problem is how the XML document's own char encoding is supposed to interact with the MARC header; especially because there's no way to put Marc8 in an XML char encoding declaration (is there?); and whether encodings other than Marc8 or UTF8 are legal in MarcXML, even though they aren't in MARC ISO binary.
>
> I think the answer might be nobody knows, and there is no standard right way to do it. Which is unfortunate.

A good summary of the situation as I understand it.

kyle

--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / 503.999.9787
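As a rough idea of what "use heuristics" could mean in practice, here is a minimal Python sketch. It is not taken from any existing MARC library; the function name `guess_encoding` and the labels it returns are invented for illustration. It leans on two things only: MARC-8 switches character sets with escape sequences (byte 0x1B), and arbitrary legacy bytes rarely happen to form valid UTF-8.

```python
ESC = 0x1B  # MARC-8 uses escape sequences to switch between character sets


def guess_encoding(data: bytes) -> str:
    """Make a very rough guess at how the bytes of a MARC field are encoded."""
    # An escape byte is a strong hint that the data is MARC-8.
    if ESC in data:
        return "marc-8"
    # Pure ASCII reads the same under MARC-8 and UTF-8, so nothing to decide.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # High bytes that form valid UTF-8 sequences are probably UTF-8.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "marc-8-or-other"
    return "utf-8"
```

A guess like this can still be wrong (a short MARC-8 string can accidentally be valid UTF-8), so it belongs on top of, not instead of, whatever the record itself declares.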
Re: [CODE4LIB] MarcXML and char encodings
The discussions at the MARC standards group relating to Unicode all had to do with using Unicode *within* ISO2709. I can't find any evidence that MARCXML ever went through the standards process. (This may not be a bad thing.) So none of what we know about the MARBI discussions and resulting standards can really help us here, except perhaps by analogy.

In LC's own example on the MARCXML page (the Sandburg example) the Leader is copied without change from the ISO2709/MARC-8 record to the MARCXML/Unicode record -- in other words, it still has a blank in offset 09, which means MARC-8. (The XML record is UTF-8.)

My gut feeling is that the Leader in MARCXML should be treated like the human appendix -- something that once had a use, but is now just being carried along for historical reasons. I would not expect it to reflect the XML record within which it is embedded. Unfortunately, it is the only source of some key information, like type of record.

The more I think about it, the more MARCXML strikes me as a really messed-up format.

kc

On 4/17/12 11:46 AM, Jonathan Rochkind wrote:
> Thanks, this is helpful feedback at least.
>
> I think it's completely irrelevant, when determining what is legal under standards, to talk about what certain Java tools happen to do, though. I don't care too much what some tool you happen to use does.
>
> In this case, I'm _writing_ the tools. I want to make them do 'the right thing', with some mix of what's actually official legally correct and what's practically useful. What your Java tools do is more or less irrelevant to me. I certainly _could_ make my tool respect the Marc leader encoded in MarcXML over the XML declaration if I wanted to. I could even make it assume the data is Marc8 in XML, even though there's no XML charset type for it, if the leader says it's Marc8.
>
> But do others agree that there is in fact no legal way to have Marc8 in MarcXML?
>
> Do others agree that you can use non-UTF8 encodings in MarcXML, so long as they are legal XML?
>
> I won't even ask someone to cite standards documents, because it's pretty clear that LC forgot to consider this when establishing MarcXML. (And I have no faith that one could get LC to make a call on this and publish it any time this century).
>
> Has anyone seen any Marc8-encoded MarcXML in the wild? Is it common? How is it represented with regard to the XML declaration and the Marc header?
>
> Has anyone seen any MarcXML with char encodings that are neither Marc8 nor UTF8 in the wild? Are they common? How are they represented with regard to the XML declaration and the Marc header?
>
> On 4/17/2012 2:32 PM, LeVan,Ralph wrote:
>>> If I want to have a MarcXML document encoded in Marc8 -- what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML? Or is it not in fact legal at all?
>>
>> I'm going out on a limb here, but I don't think it is legal. There is no formal encoding that corresponds to MARC-8, so there's no way to tell XML tools how to interpret the bytes.
>>
>>> If I want to have a MarcXML document encoded in UTF8, what should it look like? What should be in the XML declaration? What should be in the MARC header embedded in the XML?
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>>
>> I suppose you'll want to set the leader to UTF-8 as well, but it doesn't really matter to any XML tools.
>>
>>> If I want to have a MarcXML document with a char encoding that is _neither_ Marc8 nor UTF8, but something else generally legal for XML

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
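Karen's point that the Leader is still the only source of things like type of record can be illustrated with a small Python sketch. It reads LDR/06 (type of record) and deliberately never looks at LDR/09. The sample record and the function name `record_type` are invented for illustration; only the namespace URI is the standard MARCXML one.

```python
import xml.etree.ElementTree as ET

# The MARCXML ("MARC 21 slim") namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Illustrative record only; LDR/09 is blank here although the XML is UTF-8.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>01142cam  2200301 a 4500</leader>
</record>"""


def record_type(raw: bytes) -> str:
    """Return LDR/06 (type of record) from a single MARCXML record."""
    root = ET.fromstring(raw)  # the parser handles the character encoding
    leader = root.findtext("marc:leader", namespaces=MARC_NS)
    if leader is None or len(leader) != 24:
        raise ValueError("missing or malformed leader")
    # LDR/09 is intentionally not consulted: the XML prolog already told
    # the parser how to decode the bytes.
    return leader[6]
```

With the sample above, `record_type(SAMPLE)` gives the single character at offset 06 of the leader, regardless of what offset 09 claims about the encoding.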
Re: [CODE4LIB] MarcXML and char encodings
From: Karen Coyle
Sent: Tuesday, April 17, 2012 15:41
Subject: Re: [CODE4LIB] MarcXML and char encodings

> The discussions at the MARC standards group relating to Unicode all had to do with using Unicode *within* ISO2709. I can't find any evidence that MARCXML ever went through the standards process. (This may not be a bad thing.) So none of what we know about the MARBI discussions and resulting standards can really help us here, except perhaps by analogy.

Well, I can confirm that MARCXML didn't go through MARBI, since I was one of OCLC's representatives who solidified MARCXML. MARCXML came out of a meeting at LC between the MARC Standards office, OCLC, RLG, and one or two other interested parties whom I cannot remember or find in my emails or notes about the meeting.

Andy.
Re: [CODE4LIB] MarcXML and char encodings
Let me make some recommendations. These are what I would consider best practices for interoperability.

1) Never put MARC-8 in XML. Just don't do it. No one expects it. Few will be willing to bother with it.

2) Always prefer UTF-8 for MARCXML. You can use any standard charset if you need to, but without special circumstances, use UTF-8.

3) Ignore leader/09 in MARCXML. Only consider the prolog (consider, not trust). If you reasonably can, fail when the charset is wrong.

/dev
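Recommendation 3 ("fail when the charset is wrong") more or less falls out of a conforming XML parser. The Python sketch below only illustrates that: the wrapper name `load_marcxml` is hypothetical, and it relies on the standard library's ElementTree parser rejecting bytes that are not valid in the encoding the prolog declares (or in UTF-8 when nothing is declared).

```python
import xml.etree.ElementTree as ET


def load_marcxml(raw: bytes) -> ET.Element:
    """Parse MARCXML, considering only the XML prolog and failing on bad bytes."""
    try:
        # The parser decodes according to the prolog; leader/09 is never read.
        return ET.fromstring(raw)
    except (ET.ParseError, LookupError) as exc:
        # Bytes that do not match the declared encoding, or an encoding name
        # the parser does not know, both end up here.
        raise ValueError(f"cannot parse in the declared encoding: {exc}") from exc
```

For example, a document that declares UTF-8 but actually contains a Latin-1 e-acute (byte 0xE9) is rejected instead of being silently mangled.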
Re: [CODE4LIB] MarcXML and char encodings
On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote:
> No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog,

Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? The things that appear there need to be from a specific list, and I didn't think Marc8 was on that list.

Can you give me an example? And, if you happen to have it, a link to the XML standard that says this is legal?
Re: [CODE4LIB] MarcXML and char encodings
>> No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog,
>
> Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called?

Nope, you can't do that. There is no approved name for the MARC-8 encoding. As Andy said, the closest you could get would be to make up an experimental name, like x-marc-8, but no tool in the world would recognize that.

Ralph
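Ralph's claim is easy to check against at least one tool. The Python sketch below asks the standard library's codec registry whether it knows an encoding name; the helper name `is_known_encoding` is made up, and x-marc-8 is the experimental name from the message above, not a registered charset.

```python
import codecs


def is_known_encoding(name: str) -> bool:
    """Report whether Python's codec registry can resolve this encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True
```

In a stock Python installation, `is_known_encoding("utf-8")` and `is_known_encoding("iso-8859-1")` are true while `is_known_encoding("x-marc-8")` is false, so anything built on those codecs would need a custom MARC-8 decoder registered first.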
Re: [CODE4LIB] MarcXML and char encodings
In the XML standard:

    It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an "x-" prefix. XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).

As I suggested -- since MARC8 isn't (so far as I know) registered -- you won't get far with most standard tools, in whatever language. You'll have to extend them, first to recognize the encoding name, and second to decode the content.

smm

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Tuesday, April 17, 2012 4:19 PM
To: Code for Libraries
Cc: Sheila M. Morrissey
Subject: Re: [CODE4LIB] MarcXML and char encodings

On 4/17/2012 3:01 PM, Sheila M. Morrissey wrote:
> No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog,

Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called? The things that appear there need to be from a specific list, and I didn't think Marc8 was on that list.

Can you give me an example? And, if you happen to have it, a link to the XML standard that says this is legal?
Re: [CODE4LIB] MarcXML and char encodings
MARC-8. Cool in its time. Dumb now. Typical. --ELM
Re: [CODE4LIB] MarcXML and char encodings
I think this is a case of being in violent agreement -- see some earlier replies in this thread.

Pragmatically, if you are going to hew to MARC-8 encoding transported in XML, you are losing the usefulness of standard tools for XML.

smm

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of LeVan,Ralph
Sent: Tuesday, April 17, 2012 4:21 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MarcXML and char encodings

>> No -- it is perfectly legal -- but you MUST declare the encoding to BE Marc8 in the XML prolog,
>
> Wait, how can you declare a Marc8 encoding in an XML declaration/prolog/whatever it's called?

Nope, you can't do that. There is no approved name for the MARC-8 encoding. As Andy said, the closest you could get would be to make up an experimental name, like x-marc-8, but no tool in the world would recognize that.

Ralph
[CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
Okay, forget XML for a moment, let's just look at marc 'binary'.

First, for Anglophone-centric MARC21. The LC docs don't actually say quite what I thought about leader byte 09, used to advertise encoding:

    a - UCS/Unicode
    Character coding in the record makes use of characters from the Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset.

That doesn't say UTF-8. It says UCS or Unicode. What does that actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be called UCS, I think)? Whatever it actually means, do people violate it in the wild?

Now we get to non-Anglophone-centric MARC, all of which I think is ISO 2709? That's a standard which of course is not open access, so I can't get it to see what it says.

But leader 09 being used for encoding -- is that MARC21-specific, or is it true of any ISO 2709 format? Marc8 and Unicode being the only valid encodings can't be true of every ISO 2709 format, right?

Is there a generic ISO 2709 way to deal with this, or not so much?
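For reference, reading that flag from a binary MARC21 record is a one-byte lookup. The Python sketch below assumes a MARC21 record (not some other ISO 2709 flavour) whose first 24 bytes are the leader, and uses the two values discussed in this thread: blank for MARC-8 and 'a' for UCS/Unicode. The function name `leader_encoding` is invented for the example, and it says nothing about whether the record's data actually matches the flag.

```python
def leader_encoding(record: bytes) -> str:
    """Interpret leader byte 09 of a binary MARC21 record."""
    if len(record) < 24:
        raise ValueError("record is shorter than the 24-byte leader")
    flag = record[9:10]  # slice so we compare bytes with bytes
    if flag == b"a":
        return "UCS/Unicode"
    if flag == b" ":
        return "MARC-8"
    return "undefined"
```

Anything other than those two values is reported as "undefined" rather than guessed at.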
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
On Tue, Apr 17, 2012 at 7:55 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
> Okay, forget XML for a moment, let's just look at marc 'binary'. First, for Anglophone-centric MARC21.

Actually Anglo- and Francophone-centric. And the USMARC-style 245 was a poor replacement for the UKMARC approach (someone at the British Library-hosted Linked Data meeting wondered why there were punctuation characters included in the data in the title field. The catalogers wept slightly).

Simon
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
On Tue, Apr 17, 2012 at 8:46 PM, Simon Spero sesunc...@gmail.com wrote:
> Actually Anglo- and Francophone-centric. And the USMARC-style 245 was a poor replacement for the UKMARC approach (someone at the British Library-hosted Linked Data meeting wondered why there were punctuation characters included in the data in the title field. The catalogers wept slightly).
>
> Simon

Slightly? I cry my eyes out *every single day* about that. Well, every weekday, anyway.

--
Bill Dueber
Library Systems Programmer
University of Michigan Library