Coming from nowhere on this...is there a place where it would be
convenient to flag which behavior the user (of the library) wants? I
think you're correct that most of the time you'd just want to blow
through it (or replace it), but for the situation where this isn't the
case, I think the
We run into this problem fairly regularly, and in fact, ran into it on
Monday with ruby-marc.
The way we've traditionally handled it is to put our MARC stream through
a cleanup preprocessor before passing it off to a MARC parser (ruby-marc
or marc4j).
The preprocessor can do one of two
I am not sure how you ran into this problem on Monday with ruby-marc,
since ruby-marc doesn't currently handle Marc8 conversion to UTF-8 at
all -- how could you have run into a problem with Marc8 to UTF8
conversion? But that is what I am adding.
But yeah, using a preprocessor is certainly
Not sure what the details of our issue were on Monday -- but we do have
records that are supposedly encoded in UTF-8, but nonetheless contain
invalid characters.
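
To make the "supposedly UTF-8 but contains invalid characters" case concrete, here is a minimal sketch in plain Ruby. It does not use ruby-marc at all; the sample string and the helper names (check_field, later_failure, replace_bad_bytes) are hypothetical and only illustrate how bad bytes can sit unnoticed in a String until something tries to transcode it. String#scrub needs Ruby 2.1 or later.

```ruby
# A string tagged as UTF-8 that contains a stray 0xFF byte,
# which can never appear in well-formed UTF-8.
raw = "Caf\xFF au lait".dup.force_encoding(Encoding::UTF_8)

# Hypothetical helper: report whether a field value is well-formed
# in its declared encoding.
def check_field(value)
  value.valid_encoding? ? :ok : :invalid
end

# Hypothetical helper: show the kind of error that surfaces later
# when the bad bytes are left in place and the string is transcoded.
def later_failure(value)
  value.encode(Encoding::UTF_16LE)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.class
end

# Hypothetical helper: replace each invalid byte sequence with a
# replacement string (U+FFFD by default).
def replace_bad_bytes(value, replacement = "\uFFFD")
  value.scrub(replacement)
end

puts check_field(raw)                          # validity check only
puts later_failure(raw).inspect                # error class, or nil
puts replace_bad_bytes(raw).valid_encoding?    # cleaned copy is valid
```

A preprocessor along these lines could either reject the record or clean it before handing it to the parser; which one is right depends on the application.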
I think raising an exception is fine, as long as we can still continue
to walk the records with the reader. The right thing for
Yeah, the default in ruby-marc for encodings that _aren't_ MARC8 is to
ignore bad bytes entirely -- leave them in the MARC::Record as bad
bytes. This is likely to end up raising an exception later when you try to
DO something with those Strings, but was left this way for backwards
compatibility
On 11/20/13 11:40 AM, Scott Prater wrote:
Not sure what the details of our issue were on Monday -- but we do have
records that are supposedly encoded in UTF-8, but nonetheless contain
invalid characters.
Oh, and I'd clarify, if you haven't figured it out already, if those are
ISO 2709 binary
On 11/20/2013 11:18 AM, Jonathan Rochkind wrote:
On 11/20/13 11:40 AM, Scott Prater wrote:
I would suggest one or the other -- the default of leaving bad bytes in
your ruby strings is asking for trouble, and you probably don't want to
do it, but was made the default for backwards compat
When I first started working on marc4j, it behaved as suggested here,
i.e., it expected the records to be correctly formed in almost every
respect and threw an exception when an error was encountered. It was
done in a way that didn't even allow the processing to continue
with the
On 11/20/13 12:51 PM, Scott Prater wrote:
I think the issue comes down to a distinction between a stream and a
record. Ideally, the ruby-marc library would keep pointers to which
record it is in, where the record begins, and where the record ends in
the stream. If a valid header and
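
As a rough illustration of the stream-versus-record distinction, the sketch below walks a binary ISO 2709 stream one record at a time, remembers where each record starts, and keeps going when a single record fails to parse. It relies only on the fact that ISO 2709 records end with the record terminator byte 0x1D. The names each_raw_record and walk are hypothetical, the parsing step is whatever block the caller supplies, and none of this is ruby-marc's actual reader API.

```ruby
require "stringio"

# ISO 2709 record terminator byte.
RECORD_TERMINATOR = "\x1D".b

# Hypothetical helper: yield each raw record together with the byte
# offset at which it starts in the stream.
def each_raw_record(io)
  return enum_for(:each_raw_record, io) unless block_given?

  offset = 0
  io.each_line(RECORD_TERMINATOR) do |chunk|
    yield offset, chunk
    offset += chunk.bytesize
  end
end

# Hypothetical helper: parse every record with the caller's block.
# A failure in one record is logged with its position and does not
# stop the walk over the remaining records.
def walk(io)
  good = []
  bad = []
  each_raw_record(io) do |offset, raw|
    begin
      good << yield(raw)
    rescue StandardError => e
      bad << { offset: offset, length: raw.bytesize, error: e.message }
    end
  end
  [good, bad]
end

# Usage with a stand-in parser that rejects one record.
stream = StringIO.new("one\x1DBAD\x1Dthree\x1D".b)
parsed, failures = walk(stream) do |raw|
  raise ArgumentError, "bad record" if raw.start_with?("BAD")
  raw.bytesize
end
puts parsed.inspect
puts failures.inspect
```

Splitting on the terminator is the simplest approach; a stricter reader could also compare each chunk's size against the five-digit record length at the start of the leader.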
Thanks, Jonathan. We'll definitely check it out.
-- Scott
On 11/20/2013 12:13 PM, Jonathan Rochkind wrote:
On 11/20/13 12:51 PM, Scott Prater wrote:
I think the issue comes down to a distinction between a stream and a
record. Ideally, the ruby-marc library would keep pointers to which
ruby-marc users, a question.
I am working on some Marc8 to UTF-8 conversion for ruby-marc.
Sometimes, what appears to be an illegal byte will appear in the Marc8
input, and it cannot be converted to UTF-8.
The software will support two alternatives when this happens: 1) Raising
an