Re: reading and writing of utf-8 with marc::batch [resolved; gigo]

2013-03-28 Thread Eric Lease Morgan
Thank you for all the input, and I think I have resolved my particular issue. Battle won. War still raging. Using the script suggested by Galen as a starting point, I wrote the following hack outputting integers denoting MARC records containing non-UTF-8 characters, but the script output noth

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-28 Thread Ashley Sanders
Eric, > How can I figure out whether or not a MARC record contains ONLY characters > from the UTF-8 character set? You can use a regex to check if a string is utf-8. There are various examples floating around the internet. An example is the one here: http://www.w3.org/International/questions
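
As an alternative to the regex approach mentioned above, here is a minimal sketch that asks Perl's core Encode module whether a byte string is well-formed UTF-8. The helper name is_valid_utf8 is hypothetical and is not part of MARC::Batch. A passing result only shows that the bytes are well-formed UTF-8; plain ASCII passes too, so this cannot prove which encoding was intended.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return 1 if the byte string is well-formed
# UTF-8, 0 otherwise. The strict 'UTF-8' encoding name is used on
# purpose; the lax 'utf8' name accepts more than the standard allows.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9"), "\n";    # UTF-8 bytes for "café": prints 1
print is_valid_utf8("caf\xE9"), "\n";        # lone Latin-1 byte for "é": prints 0
```

To audit a file, one could presumably call the helper on each raw record and print the record number whenever it returns 0.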

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Galen Charlton
Hi, On Wed, Mar 27, 2013 at 2:11 PM, Eric Lease Morgan wrote: > Put another way, how can I determine whether or not position #9 of a given > MARC leader is accurate? If position #9 is an "a", then how can I read the > balance of the record to determine whether or not all the characters really >

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Shelley Doljack
helley - Original Message - > From: "Eric Lease Morgan" > To: perl4lib@perl.org > Sent: Wednesday, March 27, 2013 2:11:26 PM > Subject: Re: reading and writing of utf-8 with marc::batch [double encoding] > > > On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan > wro

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Eric Lease Morgan
On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan wrote: > When it calls as_usmarc, I think MARC::Batch tries to honor the value set in > position #9 of the leader. In other words, if the leader is empty, then it > tries to output records as MARC-8, and when the leader is a value of "a", it > tr

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Eric Lease Morgan
On Mar 27, 2013, at 2:20 PM, Eric Lease Morgan wrote: > A number of people have alluded to the problem of double encoding, and I'm > beginning to think this is true. When it calls as_usmarc, I think MARC::Batch tries to honor the value set in position #9 of the leader. In other words, if the

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Galen Charlton
Hi, On Wed, Mar 27, 2013 at 11:20 AM, Eric Lease Morgan wrote: > I have isolated a number of problem records. They all contain diacritics, > but they do not have an "a" in position #9 of the leader -- > http://dh.crc.nd.edu/tmp/original.marc Can someone verify that the file > contains UTF-8 cha

Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Eric Lease Morgan
A number of people have alluded to the problem of double encoding, and I'm beginning to think this is true. I have isolated a number of problem records. They all contain diacritics, but they do not have an "a" in position #9 of the leader -- http://dh.crc.nd.edu/tmp/original.marc Can someone

Re: reading and writing of utf-8 with marc::batch

2013-03-27 Thread Shelley Doljack
Whenever I see characters like é, I consult this website http://www.i18nqa.com/debug/bug-utf-8-latin1.html to help me figure out what's going on. You might find it helpful too. Shelley - Original Message - > From: "Eric Lease Morgan" > To: perl4lib@perl.org > Sent: Tuesday, March 26,
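
The "é" pattern is what "é" looks like after its UTF-8 bytes have been read as Latin-1 and encoded as UTF-8 a second time. The sketch below reproduces that mistake with the core Encode module and then undoes one layer of it. The function names are hypothetical, and nothing here claims that this is where the double encoding happened in the records under discussion.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: reproduce the mistake. UTF-8 bytes are wrongly
# treated as Latin-1 characters and then encoded as UTF-8 again.
sub double_encode {
    my ($utf8_bytes) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $utf8_bytes));
}

# Hypothetical helper: undo exactly one layer of that mistake.
sub undo_one_layer {
    my ($double_bytes) = @_;
    return encode('ISO-8859-1', decode('UTF-8', $double_bytes));
}

my $good   = "\xC3\xA9";               # UTF-8 bytes for "é"
my $double = double_encode($good);     # the four bytes that display as "é"
my $fixed  = undo_one_layer($double);

print length($double), "\n";                        # prints 4
print $fixed eq $good ? "repaired\n" : "differs\n"; # prints "repaired"
```

Running undo_one_layer on text that was never double encoded would damage it, so a repair like this is only safe on records already known to be affected.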

Re: reading and writing of utf-8 with marc::batch [terminal]

2013-03-27 Thread Galen Charlton
Hi Eric, On Wed, Mar 27, 2013 at 10:26 AM, Eric Lease Morgan wrote: > While I'm not positive my terminal is doing UTF-8, I think it is. When I > dump in the beginning the output to the terminal is correct. After I run my > script the output to the same terminal is incorrect. > Would you be will

Re: reading and writing of utf-8 with marc::batch [terminal]

2013-03-27 Thread Eric Lease Morgan
On Mar 26, 2013, at 5:57 PM, Leif Andersson wrote: > my first guess would be your terminal is not utf8. While I'm not positive my terminal is doing UTF-8, I think it is. When I dump in the beginning the output to the terminal is correct. After I run my script the output to the same terminal i

Re: reading and writing of utf-8 with marc::batch

2013-03-27 Thread Galen Charlton
Hi, On Wed, Mar 27, 2013 at 7:01 AM, Jon Gorman wrote: > One piece of advice is not to trust the terminal directly but pipe > into xxd. (And if possible, just try transforming the offending > record). Or use yaz-marcdump -v, which will also give the hex if I > remember correctly. (If it's c3 a9
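
When xxd is not at hand, the same check can be sketched in Perl itself with unpack. The helper name hex_bytes is hypothetical. For reference, "é" (U+00E9) is the two bytes c3 a9 in UTF-8, and the four bytes c3 83 c2 a9 once it has been double encoded.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex
# pairs, similar in spirit to what a hex dump tool shows.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack('(H2)*', $bytes);
}

print hex_bytes("\xC3\xA9"), "\n";            # prints "c3 a9"
print hex_bytes("\xC3\x83\xC2\xA9"), "\n";    # prints "c3 83 c2 a9"
print hex_bytes("\xE9"), "\n";                # Latin-1 "é": prints "e9"
```

Looking at the bytes this way takes the terminal's own encoding settings out of the picture.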

Re: reading and writing of utf-8 with marc::batch

2013-03-27 Thread Jon Gorman
Ok, I can't claim to be an expert, but from my own experience, I'd say Paul is very likely right about double-encoding occurring. However, the question ends up being where that happens, and in this case I suspect how MARC::Batch will work could depend heavily on what version of perl you're running

RE: reading and writing of utf-8 with marc::batch

2013-03-27 Thread KREYCHE, MICHAEL
e-first-time Morgan!" Mike > -Original Message- > From: Leif Andersson [mailto:leif.anders...@sub.su.se] > Sent: Tuesday, March 26, 2013 5:57 PM > To: Eric Lease Morgan; perl4lib@perl.org > Subject: Re: reading and writing of utf-8 with marc::batch > > Hi Eric, >

Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Leif Andersson
Hi Eric, my first guess would be your terminal is not utf8. If you comment out #binmode( STDOUT, ":utf8" ); and that does the trick, then you can start looking for how to change your terminal settings. (And that can sometimes be a rather frustrating task, I'm afraid) /Leif Andersson Stockholm UL
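
One reason commenting out the binmode line can change the output: a ":utf8" layer encodes whatever it is given, so a string that already holds UTF-8 bytes (rather than decoded characters) gets encoded a second time. The sketch below shows this with an in-memory filehandle standing in for STDOUT. The helper name bytes_written is hypothetical, and whether the strings in the original script were bytes or characters is an assumption, not something the thread confirms.

```perl
use strict;
use warnings;

# Hypothetical helper: print $data through the given PerlIO layer into
# an in-memory buffer and return the resulting bytes as hex pairs.
sub bytes_written {
    my ($layer, $data) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $data;
    close($fh);
    return join ' ', unpack('(H2)*', $buf);
}

my $already_encoded = "\xC3\xA9";    # UTF-8 bytes for "é", never decoded

print bytes_written(':raw',  $already_encoded), "\n";    # bytes pass through unchanged
print bytes_written(':utf8', $already_encoded), "\n";    # each byte is encoded again
```

If the string had been decoded into characters first, the ":utf8" layer would be the right choice and the raw layer the wrong one.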

Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Timothy Prettyman
Do your records have the utf8 encoding byte set in the LDR? (Byte 9 should be 'a' for utf8). -Tim Timothy Prettyman University of Michigan Library/LIT On Tue, Mar 26, 2013 at 4:22 PM, Eric Lease Morgan wrote: > > For the life of me I can't figure out how to do reading and writing of > UTF-8
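
In a raw MARC 21 record the leader is the first 24 bytes, and position 9 (counting from 0) holds the character coding scheme: a blank for MARC-8 and "a" for UCS/Unicode. A minimal sketch of the check follows. The helper name leader_claims_utf8 and the two sample leaders are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if position 9 of the leader (the first
# 24 bytes of a raw MARC record) is 'a', 0 otherwise.
sub leader_claims_utf8 {
    my ($raw_record) = @_;
    return 0 if length($raw_record) < 24;    # too short to hold a leader
    return substr($raw_record, 9, 1) eq 'a' ? 1 : 0;
}

# Illustrative leaders only, not taken from the records in this thread.
my $unicode_leader = '00714cam a2200205 a 4500';
my $marc8_leader   = '00714cam  2200205 a 4500';

print leader_claims_utf8($unicode_leader), "\n";    # prints 1
print leader_claims_utf8($marc8_leader), "\n";      # prints 0
```

With a MARC::Record object in hand, the same test could presumably be written against the string returned by its leader method. The flag only says what the record claims, so it is worth comparing it with what the bytes actually contain.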

Re: reading and writing of utf-8 with marc::batch

2013-03-26 Thread Paul Hoffman
On Tue, Mar 26, 2013 at 04:22:03PM -0400, Eric Lease Morgan wrote: > For the life of me I can't figure out how to do reading and writing of > UTF-8 with MARC::Batch. > > I have a UTF-8 encoded file of MARC records. Dumping the records and > grepping for a particular string illustrates the validit