Re: [CODE4LIB] HTML mark-up in MARC records

2009-06-25 Thread Ere Maijala

Jonathan Rochkind wrote:

Ere Maijala wrote:


That shouldn't be a problem as any sane OAI-PMH provider, unAPI or ATOM
serializer would escape the contents. Things that resemble HTML tags
could be present in MARC records without any HTML-in-MARC too.
  
Sure, and then, if you have html tags in your marc, that system doing 
the re-use is going to present content to users with escaped HTML in it, 
which isn't desirable either!


How the content is stored in the transport format is separate from how 
it is used. Whatever the re-using system does is not related to how the 
data was transferred to it. If it extracts the stuff from the XML, it 
will of course unescape the content, but what happens after that is up 
to the system and unrelated to the transport mechanism. So here is an 
example of the whole process:


MARC with embedded HTML
->
OAI-PMH provider escapes the MARC in some XML format
->
OAI-PMH harvester (the re-using system) unescapes the data from the XML format
->
Something is done with the data

It's the same as if the source system stores the data internally in 
MARCXML. The content must be escaped so that it can be stored in MARCXML 
without breaking the markup, but when the system uses the data, e.g. for 
display, it is first retrieved from the XML and unescaped, and only then 
massaged into the desired display format. If you use DOM to do the XML 
manipulation, all this happens automatically: you just write and read 
strings, and the DOM takes care of escaping and unescaping.
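
The round trip described above can be sketched in a few lines. This is only
an illustration, assuming Python's standard xml.dom.minidom as the DOM
implementation; the helper names to_xml and from_xml and the bare "subfield"
element are made up for the example and are not part of any MARCXML toolkit.

```python
from xml.dom.minidom import Document, parseString


def to_xml(value: str) -> str:
    """Store a plain string in a subfield element; the DOM escapes it on output."""
    doc = Document()
    subfield = doc.createElement("subfield")  # hypothetical, un-namespaced element
    subfield.setAttribute("code", "z")
    subfield.appendChild(doc.createTextNode(value))
    doc.appendChild(subfield)
    return subfield.toxml()


def from_xml(xml_text: str) -> str:
    """Read the string back; the parser unescapes it."""
    element = parseString(xml_text).documentElement
    return "".join(
        node.data for node in element.childNodes if node.nodeType == node.TEXT_NODE
    )


original = 'Off campus <img src="offcamp.gif"> & more'
serialized = to_xml(original)

# The serialized form carries no raw "<img", only the escaped "&lt;img".
print(serialized)
# Reading it back yields exactly the string that was written.
print(from_xml(serialized) == original)
```

Neither helper escapes or unescapes anything by hand; that work is left to
the serializer and the parser.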


You could substitute XML with e.g. Base64 encoding if it makes thinking 
about this stuff easier. For instance email clients often send binary 
files in Base64, but it doesn't mean the file is ruined, as the 
receiving email client can decode it back to the original binary.
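
The same point in code, as a minimal sketch using Python's standard base64
module (the function name round_trip is made up for the example):

```python
import base64


def round_trip(payload: bytes) -> bytes:
    """Encode binary data for transport, then decode it on the receiving side."""
    encoded = base64.b64encode(payload)  # ASCII-safe transport form
    return base64.b64decode(encoded)     # original bytes recovered


binary = b"GIF89a\x00\xff\x10"
print(round_trip(binary) == binary)
```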


--Ere


Re: [CODE4LIB] HTML mark-up in MARC records

2009-06-25 Thread Jon Gorman
 You could substitute XML with e.g. Base64 encoding if it makes thinking
 about this stuff easier. For instance email clients often send binary files
 in Base64, but it doesn't mean the file is ruined, as the receiving email
 client can decode it back to the original binary.

A bit of an ironic statement, considering a regular, constant
complaint on several library-related mailing lists I'm on is that
emails are coming in garbled or need to be sent again in plain
text.   Without fail it's because the person is using a client that
won't or can't deal with Base64.  Yes, silly in this day and age.

Perhaps I'm just jaded from working in libraries for too long, but
your examples assume some logically consistent control throughout the
process of dealing with MARC data.

Let's think of this scenario instead:
You're using your vendor's ILS system.  You stick some html tags into
a record.  The vendor's ILS does several different things with it, like
indexing it, storing the complete record for later retrieval, and
pulling data from the record into a semi-normalized scheme in a
database.  Now the librarians who have just enough training to run
some reports against these systems start running them via Access and
start shifting the data around in Access, Excel, and Word.  Then a
little while later they start raising alarms, either because they see
the markup in the record and wonder what's happening and how to remove
it, or because one of those tools treats that area as text, another as
xml content, and somewhere along the way it gets messed up.

The above is not really all that uncommon of a scenario.

Or how about this scenario:
You add some html internally to a MARC record.  You then add it to
your ILS system.  A few years later you go to export and decide to do
it in MARCXML.  Unknown to you, the ILS doesn't do a sane translation
process, but rather rebuilds the MARCXML from information in the
database that was put there by the original MARC.  The code is
horribly set up and hackish, and for certain fields it does not bother
to escape what it retrieves from the record.  You then go to import
into your new ILS, which validates the MARCXML.  It of course now
croaks because you have something like
<marc:subfield code="a"><div class="foo">pretty</div>.
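
That failure mode can be reproduced in a few lines. This is a rough sketch in
Python, not any vendor's actual export code: naive_subfield, safe_subfield
and is_well_formed are hypothetical names, and the element is left
un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def naive_subfield(code: str, value: str) -> str:
    """Hackish export: pastes the stored value into the markup unescaped."""
    return f'<subfield code="{code}">{value}</subfield>'


def safe_subfield(code: str, value: str) -> str:
    """Same export, but the value is escaped before it is embedded."""
    return f'<subfield code="{code}">{escape(value)}</subfield>'


def is_well_formed(xml_text: str) -> bool:
    """True if an XML parser accepts the text at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


unclosed = '<img src="offcamp.gif">'        # fine as HTML, not as XML
balanced = '<div class="foo">pretty</div>'  # happens to be well-formed XML

print(is_well_formed(naive_subfield("z", unclosed)))  # parser rejects it
print(is_well_formed(safe_subfield("z", unclosed)))   # parser accepts it

# A balanced tag slips past the parser, but the subfield no longer holds
# text: it holds a child element that the record format does not define.
leaked = ET.fromstring(naive_subfield("a", balanced))
print(leaked.text, len(leaked))
```

The balanced case is arguably the nastier one: nothing fails until a
validating importer, or a consumer that expects text, meets the unexpected
child element.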

Would you count on having someone on the staff who will be able to fix
those MARCXML files?  Or did you have someone like that and they
burned out?  How long before the support contract on your old ILS
forces you to abandon it?


Plus, there are still unresolved questions.  Let's take RSS as
an example of a format that has been abused by html in the past.  If
you find html in RSS, you can't be sure if it is valid or well-formed.
Frequently there's no way to know what version of html it is.  Yes, in
the end you can just throw it at an html parser or a browser and hope
for the best, but we have to consider the input mechanisms here. Are
folks going to be entering the html by hand?  Are there going to be
some sort of macros?  Some sort of batch change process?  Each has
a different level of risk for having bad html.  How much extra
processing are you going to want to do for each record each time you
might end up displaying it?  What to do with a mistake?  Let your
parser decide, or their browser?

It's great to say we should simply re-write our tools, but many of us
work with tools supplied by vendors.  We may be trying to move to more
open tools and the like but ultimately we're constrained by what our
upper managements dictate.

There are both practical issues (untrustworthy systems) and more
abstract issues (how do we communicate which version?  namespaces?
etc.) at play here.  Ultimately I do agree that, if it can't be
avoided, it's reasonable to try putting the html into the record
itself.  A gain of better usability and functionality over a couple of
years is probably worth it, as the chance of a large issue later on is
quite small.  (Higher chance of small issues, though.)

I mainly sent out this email, though, because I don't think the folks
who have been pointing out issues are confused.  It's not that we
don't understand that it should be able to round-trip or that we
haven't played around with html in other data formats.  I think we've
used enough software in the library world not to trust that all the
layers will work as they should.

Jon Gorman


Re: [CODE4LIB] HTML mark-up in MARC records

2009-06-25 Thread Jonathan Rochkind
But here's my point. 

There is no way for a consumer of MARC records to know if the MARC 
records contain HTML or not.  If a downstream consumer wants to display 
MARC in an html environment, the consumer can either assume they contain 
html, and then end up displaying MARC _wrong_ if it has html special 
chars like < or > but does not have html. Or it can assume it does 
_not_ have HTML, and end up displaying escaped html tags to the user if 
it really DOES have html.  This really applies no matter what 
presentation format the downstream consumer wants to display in. Plain 
text?  Assume it is html, and strip out html tags, potentially 
accidentally stripping out actual information if it wasn't html but 
contained html special chars. Or assume it's not html and just plain 
text, and just display it, and show the user html tags.
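
To make the dilemma concrete, here is a small sketch in Python. The three
function names are made up for the example, and the regex tag-stripper is a
deliberately crude stand-in for whatever a real consumer would use.

```python
import html
import re


def render_assuming_plain_text(value: str) -> str:
    """Consumer treats the field as plain text and escapes it for an HTML page."""
    return html.escape(value)


def render_assuming_html(value: str) -> str:
    """Consumer treats the field as HTML and passes it through untouched."""
    return value


def strip_tags_assuming_html(value: str) -> str:
    """Consumer wants plain text and removes anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", value)


really_html = '<img src="offcamp.gif">'
really_text = "if a < b and c > d"

# Wrong guess 1: the HTML field is shown to the user as literal, escaped tags.
print(render_assuming_plain_text(really_html))
# Wrong guess 2: the raw "<" and ">" of a plain-text field land in the page unescaped.
print(render_assuming_html(really_text))
# Wrong guess 3: real information between "<" and ">" is silently dropped.
print(strip_tags_assuming_html(really_text))
```

Nothing in the record tells the consumer which of these functions to call,
and that is the whole problem.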


There's no way for a downstream consumer of MARC records to know if data 
is in html or just plain text.  In general, I think this is because the 
assumption is it's always just plain text.  If you start putting html in 
there, there's no way for a downstream consumer to predict whether it's 
going to be html or not, because that's not part of the MARC standard to 
advertise that, so there's no way for a downstream consumer to reliably 
display it correctly.  You've put html in counting on your current local 
system being specifically configured to expect html in certain MARC 
fields. Fine. But as soon as you start distributing that MARC to 
downstream consumers, you've made things awfully confusing and 
unpredictable.


Jonathan



Re: [CODE4LIB] HTML mark-up in MARC records

2009-06-25 Thread Tim Hodson
Michael,

A lot of the discussion so far seems to have missed the opportunity to
find out exactly what you are hoping to accomplish. (unless I missed
that post :) ). And to perhaps suggest alternative ways of doing that.

My first question is: What is the image an image of?

This is shortly followed by several more:
What is the name of the image (perhaps it is an ISBN)?
Where is the actual image going to be stored?
Does it actually make any sense to embed the URI of an image in the data?
What happens if the image name changes, or becomes unavailable?
What happens if the marc record is exported/consumed by another system?
(setting aside completely the issue of markup within markup discussed
elsewhere in this thread)

If this is simply a way to get an image into the catalogue display, I
think there might be better ways.
For example, Juice [1] could allow you to dynamically load an
image (from a suitably maintained source) based on an identifier
somewhere within the page.

Best,
Tim

[1] http://code.google.com/p/juice-project/

--
Tim Hodson
http://informationtakesover.co.uk




2009/6/21 Doran, Michael D do...@uta.edu:
 Is anybody else embedding HTML mark-up code in MARC records [1]?  We're 
 currently including an img tag in some MARC Holdings records in the 856z 
 [2].   I'm inclined to think that HTML mark-up does not belong anywhere in 
 MARC records, but am looking for other opinions (preferably with the 
 reasoning behind the opinions), both pro and con.

 I'm asking on code4lib as well as the voyager-l list in order to get a mix of 
 ILS-specific and ILS-agnostic opinions (I'm not on any cataloging lists, or 
 would probably ask there, too).  I tried googling this topic, but couldn't 
 find anything of consequence; so if I've missed something there, and you 
 could point me to it, I'd be obliged.

 -- Michael

 [1] http://en.wikipedia.org/wiki/HTML

 [2] http://www.loc.gov/marc/holdings/hd856.html

 # Michael Doran, Systems Librarian
 # University of Texas at Arlington
 # 817-272-5326 office
 # 817-688-1926 mobile
 # do...@uta.edu
 # http://rocky.uta.edu/doran/



Re: [CODE4LIB] HTML mark-up in MARC records

2009-06-25 Thread Doran, Michael D
Hi Tim,

 A lot of the discussion so far seems to have missed the opportunity to
 find out exactly what you are hoping to accomplish.

The discussion itself was what I was looking for.  I'm actually fairly aware of 
the problems inherent in this practice (although I'm not nearly as eloquent or 
as thorough as some of the thread responders).  What I was hoping to accomplish 
was to convince some cataloging decision makers here that our current practice 
of including HTML mark-up in MARC records is not a good idea and that we should 
stop doing it.  I was looking to the discussion to either buttress my arguments 
with expert opinion or give me a really convincing reason why my rationale was 
not valid.

 My first question is: What is the image an image of?

It's a bit ironic perhaps, but most of the images are essentially *text* -- 
e.g. the words "UTA Plus OffCampus" in a gif [1].

 Where is the actual image going to be stored?

The image example above was stored on our OPAC server.  With a new version of 
the OPAC, the path to images changed, and 250,000+ holdings records needed to 
be edited.  This new OPAC version is more flexible and it would probably be 
possible to add the images we currently have encoded via HTML tags in the MARC 
holdings record, to the OPAC record view on the fly.
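
As a rough illustration of the "on the fly" approach, the record could keep
only plain text and the display layer could decide where the image lives.
The sketch below is Python and entirely hypothetical: IMAGE_BASE_URL,
NOTE_IMAGES and render_public_note are invented names, not part of any OPAC,
and the note text and file name are simply borrowed from the example above.

```python
import html

# Assumption: the display layer owns the image location, so a path change is
# a one-line configuration edit instead of a batch update of the records.
IMAGE_BASE_URL = "/images/"

# Assumption: plain-text notes that should be shown as an image.
NOTE_IMAGES = {"UTA Plus OffCampus": "offcampPulse.gif"}


def render_public_note(note: str, base_url: str = IMAGE_BASE_URL) -> str:
    """Turn a plain-text public note into display HTML at render time."""
    text = html.escape(note)
    filename = NOTE_IMAGES.get(note.strip())
    if filename is None:
        return text  # ordinary note: just escaped text
    return f'<img src="{html.escape(base_url + filename)}" alt="{text}">'


print(render_public_note("UTA Plus OffCampus"))
print(render_public_note("Shelved at desk < ask staff"))
```

The MARC data then stays plain text for every downstream consumer, and only
this one display layer knows about the image.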

-- Michael

[1] http://pulse.uta.edu/images/offcampPulse.gif

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
  
