Re: [CODE4LIB] HTML mark-up in MARC records
Jonathan Rochkind wrote:
> Ere Maijala wrote:
>> That shouldn't be a problem as any sane OAI-PMH provider, unAPI or ATOM serializer would escape the contents. Things that resemble HTML tags could be present in MARC records without any HTML-in-MARC too.
>
> Sure, and then, if you have html tags in your marc, that system doing the re-use is going to present content to users with escaped HTML in it, which isn't desirable either!

How the content is stored in the transport format is separate from how it is used. Whatever the re-using system does is not related to how the data was transferred to it. If it extracts the stuff from the XML, it will of course unescape the content, but what happens after that is up to the system and unrelated to the transport mechanism.

So here is an example of the whole process:

MARC with embedded HTML
-> OAI-PMH provider escapes the MARC in some XML format
-> OAI-PMH harvester (the re-using system) unescapes the data from the XML format
-> Something is done with the data

It's the same as if the source system stores the data internally in MARCXML. The content must be escaped so that it can be stored in MARCXML and doesn't mess up the markup, but when the system uses the data, e.g. for display, it's first retrieved from XML and unescaped, and massaged to the desired display format only after that. If you use DOM to do the XML manipulation, all this will happen automatically. You just write and read strings and DOM manipulation takes care of escaping and unescaping.

You could substitute XML with e.g. Base64 encoding if it makes thinking about this stuff easier. For instance, email clients often send binary files in Base64, but it doesn't mean the file is ruined, as the receiving email client can decode it back to the original binary.

--Ere
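
The round trip described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element is a bare `subfield` without the MARCXML namespace, the function names are made up for this sketch, and the image path is a hypothetical value, not one taken from a real record.

```python
import base64
import xml.etree.ElementTree as ET


def to_xml_subfield(code, value):
    """Serialize one subfield; the XML library escapes the text for us."""
    elem = ET.Element("subfield", {"code": code})
    elem.text = value  # we only ever hand over a plain string
    return ET.tostring(elem, encoding="unicode")


def from_xml_subfield(xml_text):
    """Parse the subfield back; the XML library unescapes the text for us."""
    return ET.fromstring(xml_text).text


# Hypothetical subfield content with embedded HTML.
original = '<img src="/images/example.gif" alt="Off campus">'

wire = to_xml_subfield("z", original)   # what travels over OAI-PMH etc.
restored = from_xml_subfield(wire)      # what the harvester gets back

# The Base64 analogy: encoding for transport does not damage the payload.
payload = b"\x00\x01binary data\xff"
decoded = base64.b64decode(base64.b64encode(payload))
```

On the wire the angle brackets appear as `&lt;` and `&gt;`, while `restored` is identical to `original`. What the harvester then does with that string is a separate decision, which is the point the rest of the thread argues about.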
Re: [CODE4LIB] HTML mark-up in MARC records
> You could substitute XML with e.g. Base64 encoding if it makes thinking about this stuff easier. For instance email clients often send binary files in Base64, but it doesn't mean the file is ruined, as the receiving email client can decode it back to the original binary.

A bit of an ironic statement, considering that a regular, constant complaint on several library-related mailing lists I'm on is that emails are coming in garbled or need to be sent again in plain text. Without fail it's because the person is using a client that won't or can't deal with Base64. Yes, silly in this day and age.

Perhaps I'm just jaded from working in libraries for too long, but your examples assume some logical, consistent control throughout the process of dealing with MARC data. Let's think of this scenario instead:

You're using your vendor's ILS. You stick some html tags into a record. The vendor's ILS does several different things with it, like indexing it, storing the complete record for later retrieval, and pulling data from the record into a semi-normalized scheme in a database. Now the librarians who have just enough training to run some reports against these systems start running them via Access and start shifting the data around in Access, Excel, and Word. Then a little while later they start raising alarms, either because they see the markup in the record and wonder what's happening and how to remove it, or because one of those tools treats that area as text, another as xml content, and somewhere along the way it gets messed up.

The above is not really all that uncommon a scenario. Or how about this scenario:

You add some html internally to a MARC record. You then add it to your ILS. A few years later you go to export and decide to do it in MARCXML. Unknown to you, the ILS doesn't do a sane translation process, but rather rebuilds the MARCXML from information in the database that was put there by the original MARC. The code is horribly set up and hackish, and for certain fields it does not bother to escape what it's retrieving from the record. You then go to import to your new ILS, which validates the MARCXML. It of course now croaks because you have something like <marc:subfield code="a"><div class="foo">pretty</div>.

Would you count on having someone on the staff who will be able to fix those MARCXML files? Or did you have someone like that and they burned out? How long before the support contract on your old ILS forces you to abandon it?

Plus there are still unresolved questions. Let's take RSS as an example of a format that has been abused by html in the past. If you find html in RSS, you can't be sure if it is valid or well-formed. Frequently there's no way to know what version of html it is. Yes, in the end you can just throw it at an html parser or a browser and hope for the best, but we have to consider the input mechanisms here. Are folks going to be entering the html by hand? Are there going to be some sort of macros? Some sort of batch change process? Each has a different level of risk of producing bad html. How much extra processing are you going to want to do for each record each time you might end up displaying it? What do you do with a mistake? Let your parser decide, or the user's browser?

It's great to say we should simply re-write our tools, but many of us work with tools supplied by vendors. We may be trying to move to more open tools and the like, but ultimately we're constrained by what our upper management dictates. There are both practical issues (untrustworthy systems) and more abstract issues (how do we communicate which version? namespaces? etc.) at play here.

Ultimately I do agree that, if it cannot be avoided, it is worth trying to put the html into the record itself. A gain of better usability and functionality over a couple of years is probably worth it, as the chance of a large issue later on is quite small. (Higher chance of small issues, though.)

I mainly sent out this email, though, because I don't think the folks who have been pointing out issues are confused. It's not that we don't understand that it should be able to round-trip or that we haven't played around with html in other data formats. I think we've used enough software in the library world to not trust that all the layers will work as they should.

Jon Gorman
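
The second scenario (an exporter that pastes stored values into MARCXML without escaping) can be illustrated with a short Python sketch. Everything here is hypothetical: `naive_subfield` stands in for the kind of hackish export code described above, not for any particular vendor's implementation, and the check only approximates validation by requiring that a subfield parses and contains nothing but text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def naive_subfield(code, value):
    """Hypothetical buggy exporter: drops the stored value straight into the markup."""
    return f'<subfield code="{code}">{value}</subfield>'


def safe_subfield(code, value):
    """Same exporter, but the value is escaped before it is embedded."""
    return f'<subfield code="{code}">{escape(value)}</subfield>'


def is_text_only_subfield(xml_text):
    """Rough stand-in for validation: must parse, and must have no child elements."""
    try:
        elem = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed XML
    return len(elem) == 0


balanced = '<div class="foo">pretty</div>'  # well-formed HTML fragment
unbalanced = "see <br> below"               # typical hand-entered HTML

naive_results = [is_text_only_subfield(naive_subfield("a", v))
                 for v in (balanced, unbalanced)]
safe_results = [is_text_only_subfield(safe_subfield("a", v))
                for v in (balanced, unbalanced)]
```

With the naive exporter both values fail the check: the balanced fragment turns into unexpected child elements, and the unbalanced one is not well-formed at all. With escaping both pass, but that only helps if every layer that touches the record actually does it.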
Re: [CODE4LIB] HTML mark-up in MARC records
But here's my point. There is no way for a consumer of MARC records to know if the MARC records contain HTML or not.

If a downstream consumer wants to display MARC in an html environment, the consumer can either assume the records contain html, and then end up displaying MARC _wrong_ if it has html special chars like < or > but does not have html. Or it can assume it does _not_ have HTML, and end up displaying escaped html tags to the user if it really DOES have html.

This really applies no matter what presentation format the downstream consumer wants to display in. Plain text? Assume it is html, and strip out html tags, potentially accidentally stripping out actual information if it wasn't html but contained html special chars. Or assume it's not html and just plain text, and just display it, and show the user html tags.

There's no way for a downstream consumer of MARC records to know if data is in html or just plain text. In general, I think this is because the assumption is that it's always just plain text. If you start putting html in there, there's no way for a downstream consumer to predict whether it's going to be html or not, because advertising that is not part of the MARC standard, so there's no way for a downstream consumer to reliably display it correctly.

You've put html in counting on your current local system being specifically configured to expect html in certain MARC fields. Fine. But as soon as you start distributing that MARC to downstream consumers, you've made things awfully confusing and unpredictable.

Jonathan
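
The ambiguity described in this message can be made concrete with a small Python sketch. The two sample values and the function names are invented for illustration, and the tag-stripping regex is deliberately crude, which is roughly what a consumer guessing at the content would end up writing.

```python
import html
import re


def render_assuming_text(value):
    """Consumer assumes plain text: escape everything before putting it in a page."""
    return html.escape(value)


def render_assuming_html(value):
    """Consumer assumes HTML: pass the value through untouched."""
    return value


def to_plain_assuming_html(value):
    """Consumer wants plain text and assumes HTML: strip anything tag-shaped."""
    return re.sub(r"<[^>]*>", "", value)


# Two hypothetical subfield values; nothing in the record says which is which.
plain_title = "Growth where x < y and y > z"         # plain text with special chars
html_note = '<img src="offcamp.gif" alt="Off campus">'  # deliberately embedded HTML

shown_tags = render_assuming_text(html_note)      # user sees literal tag text
lost_text = to_plain_assuming_html(plain_title)   # real content is stripped
```

Each assumption handles one of the two values correctly and mangles the other: `shown_tags` begins with `&lt;img`, so the user sees markup, while `lost_text` has silently lost the middle of the title. Without an agreed signal in the record, the consumer cannot pick the right function per field.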
Re: [CODE4LIB] HTML mark-up in MARC records
Michael,

A lot of the discussion so far seems to have missed the opportunity to find out exactly what you are hoping to accomplish (unless I missed that post :) ), and to perhaps suggest alternative ways of doing that.

My first question is: What is the image an image of?

This is shortly followed by several more:

What is the name of the image (perhaps it is an ISBN)?
Where is the actual image going to be stored?
Does it actually make any sense to embed the URI of an image in the data?
What happens if the image name changes, or becomes unavailable?
What happens if the marc record is exported/consumed by another system? (setting aside completely the issue of markup within markup discussed elsewhere in this thread)

If this is simply a way to get an image into the catalogue display, I think that there might be better ways. For example, Juice [1] could allow you to dynamically load an image (from a suitably maintained source) based on an identifier somewhere within the page.

Best,
Tim

[1] http://code.google.com/p/juice-project/

--
Tim Hodson
http://informationtakesover.co.uk

2009/6/21 Doran, Michael D do...@uta.edu:
> Is anybody else embedding HTML mark-up code in MARC records [1]? We're currently including an img tag in some MARC Holdings records in the 856z [2].
>
> I'm inclined to think that HTML mark-up does not belong anywhere in MARC records, but am looking for other opinions (preferably with the reasoning behind the opinions), both pro and con. I'm asking on code4lib as well as the voyager-l list in order to get a mix of ILS-specific and ILS-agnostic opinions (I'm not on any cataloging lists, or would probably ask there, too).
>
> I tried googling this topic, but couldn't find anything of consequence; so if I've missed something there, and you could point me to it, I'd be obliged.
>
> -- Michael
>
> [1] http://en.wikipedia.org/wiki/HTML
> [2] http://www.loc.gov/marc/holdings/hd856.html
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # do...@uta.edu
> # http://rocky.uta.edu/doran/
Re: [CODE4LIB] HTML mark-up in MARC records
Hi Tim,

> A lot of the discussion so far seems to have missed the opportunity to find out exactly what you are hoping to accomplish.

The discussion itself was what I was looking for. I'm actually fairly aware of the problems inherent in this practice (although I'm not nearly as eloquent nor as thorough as some of the thread responders). What I was hoping to accomplish was to convince some cataloging decision makers here that our current practice of including HTML mark-up in MARC records is not a good idea and that we should stop doing it. I was looking to the discussion to either buttress my arguments with expert opinion or get a really convincing reason why my rationale was not valid.

> My first question is: What is the image an image of?

It's a bit ironic perhaps, but most of the images are essentially *text* -- e.g. the words "UTA Plus OffCampus" in a gif [1].

> Where is the actual image going to be stored?

The image example above was stored on our OPAC server. With a new version of the OPAC, the path to images changed, and 250,000+ holdings records needed to be edited. This new OPAC version is more flexible and it would probably be possible to add the images we currently have encoded via HTML tags in the MARC holdings record to the OPAC record view on the fly.

-- Michael

[1] http://pulse.uta.edu/images/offcampPulse.gif

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
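
The "on the fly" approach mentioned above might look something like the following Python sketch. It is an assumption-laden illustration, not a description of how any OPAC actually works: the lookup table, the base path, and the function name are all hypothetical, and only the file name `offcampPulse.gif` comes from the thread. The idea is that the record stores plain text, and the display layer decides whether to show an image for it.

```python
import html

# Hypothetical display-layer configuration; changing the image path means
# editing this one value rather than every holdings record.
IMAGE_BASE_URL = "/images/"

# Hypothetical mapping from the plain-text note in the record to an image file.
NOTE_IMAGES = {"UTA Plus OffCampus": "offcampPulse.gif"}


def render_public_note(note_text, base_url=IMAGE_BASE_URL):
    """Render a plain-text note as HTML, swapping in an image when one is mapped."""
    safe_text = html.escape(note_text)
    filename = NOTE_IMAGES.get(note_text)
    if filename is None:
        return safe_text  # no image known: show the escaped text itself
    src = html.escape(base_url + filename)
    return f'<img src="{src}" alt="{safe_text}">'


current = render_public_note("UTA Plus OffCampus")
after_upgrade = render_public_note("UTA Plus OffCampus", base_url="/newpath/img/")
fallback = render_public_note("Ask at desk <Room 2>")
```

Because the record holds only text, an export to another system carries no markup, and a path change like the one described above touches configuration instead of 250,000+ records.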