Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Scott Prater


Thanks, Jonathan.  We'll definitely check it out.

-- Scott




--
Scott Prater
Shared Development Group
General Library System
University of Wisconsin - Madison
pra...@wisc.edu
5-5415

Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Jonathan Rochkind


On 11/20/13 12:51 PM, Scott Prater wrote:

I think the issue comes down to a distinction between a stream and a
record.  Ideally, the ruby-marc library would keep pointers to which
record it is in, where the record begins, and where the record ends in
the stream.  If a valid header and end-of-record delimiter are in place,
then the library should be able to reject the record if it contains
garbage in between those two points, without compromising the integrity
of the entire stream.


I understand what you're saying, and why it's attractive. I am not sure 
if ruby-marc can do that right now. I am not personally interested in 
adding that at this particular time -- I just spent a couple days adding 
Marc8 support in the first place, and that's enough for me for now.  I 
was just soliciting some feedback on a point I wasn't sure about with 
the new MARC8 api, honestly.


But pull requests are always welcome!  Also feel free to check out
ruby-marc to see if it accommodates your desired usage already or not,
and let us know, even without a pull request!


If you (or anyone) are interested in checking out the MARC8 support 
added to ruby-marc, it's currently in a branch not yet merged in or 
released, but probably will be soon.


https://github.com/ruby-marc/ruby-marc/tree/marc8
https://github.com/ruby-marc/ruby-marc/pull/23

Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Robert Haschart

When I first started working on marc4j, it behaved as suggested here:
it expected the records to be correctly formed in almost every respect
and threw an exception when an error was encountered.  It did so in a
way that didn't even allow the processing to continue with the next
record, since the state of the Reader when the exception was detected
was inconsistent.


The approach that I took in creating the MarcPermissiveStreamReader was
to move as far as possible towards the other approach being suggested
here, i.e., flag the error, fix it as best it can, and allow the
program to proceed.  To this end, marc4j has an ErrorHandler class
that tracks all of the errors it encounters as it is processing a
record.  The ErrorHandler is used by the MarcPermissiveStreamReader in
general, as well as by the Marc8 to UTF-8 translation code, to note what
errors were encountered, how severe they are, and what corrective
action was taken.
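
As a rough illustration of that kind of per-record error tracking, here is a minimal Ruby sketch. It is not marc4j's ErrorHandler API: the class name RecordErrorLog, the severity names, and the entry fields are all assumptions made for the example.

```ruby
# Hypothetical per-record error collector; not marc4j's actual API.
class RecordErrorLog
  # Ordered from least to most severe (names are assumptions).
  SEVERITIES = %i[info minor major fatal].freeze

  Entry = Struct.new(:severity, :field, :message, :action)

  attr_reader :entries

  def initialize
    @entries = []
  end

  # Note a problem and the corrective action the reader took.
  def add(severity, field, message, action)
    raise ArgumentError, "unknown severity #{severity}" unless SEVERITIES.include?(severity)
    @entries << Entry.new(severity, field, message, action)
  end

  def errors?
    !@entries.empty?
  end

  # Most severe level seen so far, or nil if the record was clean.
  def worst
    @entries.map(&:severity).max_by { |s| SEVERITIES.index(s) }
  end
end

log = RecordErrorLog.new
log.add(:minor, '245', 'illegal byte 0xFF in subfield a', 'replaced with U+FFFD')
log.add(:major, '700', 'missing subfield delimiter', 'field text kept as subfield a')
puts log.worst
```

An application could then decide per record whether to index it, index it with the messages attached, or drop it, depending on the worst severity.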


In our implementation at UVa these error messages are included in the 
records that are built and sent to the solr index, so that they can be 
later reviewed and (perhaps) eventually fixed.   I think at present our 
index of 6.3M records has close to 600K records containing errors of one 
sort or another.


-Bob Haschart


Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Scott Prater


On 11/20/2013 11:18 AM, Jonathan Rochkind wrote:

On 11/20/13 11:40 AM, Scott Prater wrote:



I would suggest one or the other -- the default of leaving bad bytes in
your ruby strings is asking for trouble, and you probably don't want to
do it, but was made the default for backwards compat reasons with older
versions of ruby-marc. (See why I am reluctant to add another default
that we don't think hardly anyone would actually want? :) )


Thanks for your usage suggestions and work on this, Jonathan.  I work 
mostly with marc4j, not ruby-marc, so I'm pretty unfamiliar with the 
capabilities of the gem.  My comments are more oriented towards general 
error handling when processing MARC streams.


I think the issue comes down to a distinction between a stream and a 
record.  Ideally, the ruby-marc library would keep pointers to which 
record it is in, where the record begins, and where the record ends in 
the stream.  If a valid header and end-of-record delimiter are in place, 
then the library should be able to reject the record if it contains 
garbage in between those two points, without compromising the integrity 
of the entire stream.  So my final output would not contain bad data; 
it would simply be missing some records, records that contained bad data.


Here's some (partial) pseudo ruby code of how I'd like to handle it:

require 'marc'

count = 0
reader = MARC::Reader.new('marc8.dat')
writer = MARC::XMLWriter.new('marc-utf8.xml')
reader.each do |record|
  count += 1
  begin
    # convert_to_utf is hypothetical, like the rest of this sketch
    utf8rec = record.convert_to_utf
    writer.write(utf8rec)
  rescue => exception
    warn "Skipping record #{count}: #{exception.message}"
  end
  # ... now read the next record ...
end
writer.close

This example doesn't capture the exception if the next record can't be 
retrieved, because the stream is corrupt, but that would be the other 
addition I'd make.  The larger point is that reading a MARC stream 
should be handled as reading a sequence of MARC records encoded in that 
stream -- one bad record does not automatically invalidate the entire 
stream; it only invalidates it if the next record can't be found.
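
The stream-versus-record distinction can be sketched outside of ruby-marc. The following is a minimal sketch, assuming ISO 2709 framing (the first five bytes of the leader give the record length, and each record ends with the 0x1D record terminator). The method name each_raw_record and the fake records in the usage lines are made up for the example, and this is not how ruby-marc's reader is implemented.

```ruby
require 'stringio'

RECORD_TERMINATOR = "\x1D".b # ISO 2709 end-of-record byte

# Yields [index, raw_bytes, problem] for each chunk up to a terminator.
# problem is nil when the framing looks sane, otherwise a short description.
# A stray 0x1D inside garbage shows up as a length mismatch.
def each_raw_record(io)
  return enum_for(:each_raw_record, io) unless block_given?
  index = 0
  while (raw = io.gets(RECORD_TERMINATOR))
    index += 1
    raw = raw.b
    problem =
      if !raw.end_with?(RECORD_TERMINATOR)
        'no record terminator before end of stream'
      elsif raw[0, 5] !~ /\A\d{5}\z/
        'leader does not start with a 5-digit length'
      elsif raw[0, 5].to_i != raw.bytesize
        'leader length does not match bytes read'
      end
    yield index, raw, problem
  end
end

# Fake records (length prefix + body + terminator), not real MARC.
stream = StringIO.new("00009abc\x1D".b + "00099xyz\x1D".b + "00011hello\x1D".b)
each_raw_record(stream) do |index, raw, problem|
  if problem
    warn "Skipping record #{index}: #{problem}"
  else
    puts "record #{index}: #{raw.bytesize} bytes"
  end
end
```

Because the framing is checked before any field is parsed, a bad record costs only that record as long as the next terminator can be found.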


-- Scott


Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Jonathan Rochkind


On 11/20/13 11:40 AM, Scott Prater wrote:

Not sure what the details of our issue was on Monday -- but we do have
records that are supposedly encoded in UTF-8, but nonetheless contain
invalid characters.


Oh, and I'd clarify, if you haven't figured it out already: if those are
ISO 2709 binary records, you can ask the reader to do different things
in that case (already available in the current ruby-marc release):


# raise:
MARC::Reader.new("something.marc", :validate_encoding => true)

# replace with unicode replacement char:
MARC::Reader.new("something.marc", :invalid => :replace)

I would suggest one or the other -- the default of leaving bad bytes in 
your ruby strings is asking for trouble, and you probably don't want to 
do it, but was made the default for backwards compat reasons with older 
versions of ruby-marc. (See why I am reluctant to add another default 
that we don't think hardly anyone would actually want? :) )
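
The three behaviors (leave alone, raise, replace) can be seen on a bare Ruby String, independent of ruby-marc. This is a minimal sketch using only core String methods; the helper name check_encoding! is made up for the example, and nothing here claims to mirror what ruby-marc does internally.

```ruby
# Plain Ruby Strings only; nothing here calls ruby-marc.
bad = "caf\xE9 au lait".dup.force_encoding(Encoding::UTF_8) # \xE9 is not valid UTF-8 here

# Leave alone: nothing fails yet, but the String is flagged as broken...
puts bad.valid_encoding? # false

# ...and a later operation that has to decode it raises.
begin
  bad.encode(Encoding::UTF_16LE)
rescue Encoding::InvalidByteSequenceError => e
  puts "failed later: #{e.message}"
end

# Raise up front instead of later.
def check_encoding!(str)
  raise ArgumentError, 'invalid byte sequence' unless str.valid_encoding?
  str
end

# Replace each bad byte with U+FFFD.
fixed = bad.scrub("\uFFFD")
puts fixed.valid_encoding? # true
```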


Oh, and you may also want to explicitly specify the expected encoding to
avoid confusion:


MARC::Reader.new("something.marc", :external_encoding => "UTF-8",
                 :validate_encoding => true)


(It will also work with any other encoding recognized by ruby, for those 
with legacy, possibly international, data).


This stuff is confusing to explain; there are so many permutations and
combinations of circumstances involved.  But I'll try to improve the
ruby-marc docs on this stuff, as part of adding yet more options for
MARC8 handling.

Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Jonathan Rochkind

Yeah, the default in ruby-marc for encodings that _aren't_ MARC8 is to
ignore bad bytes entirely -- leave them in the MARC::Record as bad
bytes. This is likely to end up raising an exception later when you try
to DO something with those Strings, but was left this way for backwards
compatibility reasons.


You can optionally tell ruby-marc to raise or 'fix' these bad bytes 
instead, but the default is to leave them alone.


However, that's not really possible for MARC8->UTF8 conversion. Since a 
conversion is going on, bad bytes can't be 'left alone', something has 
to be done with them -- raise or replace.
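
To make the "raise or replace, nothing else" point concrete, here is a minimal sketch of a transcoding helper with exactly those two modes. Ruby's String#encode has no MARC8 encoding, so US-ASCII stands in as the source encoding. The method name to_utf8 and the on_invalid option are assumptions for the example, not the API being added to ruby-marc.

```ruby
# Hypothetical helper. US-ASCII stands in for MARC8, which String#encode
# does not know about; the shape of the choice is the same.
def to_utf8(bytes, source: 'US-ASCII', on_invalid: :raise)
  str = bytes.dup.force_encoding(source)
  case on_invalid
  when :raise
    # Raises an EncodingError subclass on bytes it cannot convert.
    str.encode('UTF-8')
  when :replace
    str.encode('UTF-8', invalid: :replace, undef: :replace, replace: "\uFFFD")
  else
    # There is no :leave mode: a conversion has to produce something.
    raise ArgumentError, "on_invalid must be :raise or :replace, got #{on_invalid.inspect}"
  end
end

puts to_utf8("caf\xE9".b, on_invalid: :replace)
```

Whichever value on_invalid defaults to is the API decision under discussion; the sketch picks :raise only to have a default.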


My question here is solely about MARC8->UTF8 conversion, I am not 
changing anything else about the ruby-marc API at this time.


"I think raising an exception is fine, as long as we can still continue
 to walk the records with the reader."  Honestly, I'm not sure if 
that's true, I'm not sure how easy it's going to be to continue 
iterating through the records after an exception, I think the exception 
gets raised in a place that leaves the reader inconsistent. If so, there 
may not be any easy way to fix that. Bah. Scott, you want to beta test 
this new version of ruby-marc?


At any rate, pull requests always welcome once it gets released, having 
some MARC8->UTF8 conversion seems an improvement even if the details 
aren't right. We've always placed a premium on backwards compat in 
ruby-marc though, so I wanted to try and avoid making api/default 
choices we'd later regret but not want to change for backwards compat.




Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Scott Prater

Not sure what the details of our issue were on Monday -- but we do have
records that are supposedly encoded in UTF-8, but nonetheless contain
invalid characters.


I think raising an exception is fine, as long as we can still continue 
to walk the records with the reader.  The right thing for application 
code to do then would be to catch the exception, log it, and continue to 
the next record.  The more information in the exception, the better.
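
One way to picture "catch, log, continue" with an informative exception is sketched below. The class RecordEncodingError, its fields, and the convert_all helper are hypothetical names for the example; ruby-marc does not define them, and the sketch assumes the reader can keep going after a bad record.

```ruby
# Hypothetical exception that carries enough context to find the record again.
class RecordEncodingError < StandardError
  attr_reader :record_index, :record_id, :byte_offset

  def initialize(message, record_index:, record_id: nil, byte_offset: nil)
    @record_index = record_index
    @record_id = record_id
    @byte_offset = byte_offset
    super("record #{record_index} (id #{record_id || 'unknown'}), " \
          "byte #{byte_offset || '?'}: #{message}")
  end
end

# Runs the given block on each raw record; returns [converted, skipped_errors].
def convert_all(raw_records)
  converted = []
  skipped = []
  raw_records.each_with_index do |raw, i|
    begin
      converted << yield(raw, i + 1)
    rescue RecordEncodingError => e
      warn "Skipping: #{e.message}" # log it and move on to the next record
      skipped << e
    end
  end
  [converted, skipped]
end

done, errors = convert_all(%w[one BAD three]) do |raw, n|
  raise RecordEncodingError.new('illegal byte', record_index: n, byte_offset: 0) if raw == 'BAD'
  raw.upcase
end
puts "#{done.size} converted, #{errors.size} skipped"
```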


-- Scott









Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Jonathan Rochkind

I am not sure how you ran into this problem on Monday with ruby-marc, 
since ruby-marc doesn't currently handle Marc8 conversion to UTF-8 at 
all -- how could you have run into a problem with Marc8 to UTF8 
conversion?  But that is what I am adding.


But yeah, using a preprocessor is certainly one option, that will not be 
taken away from people. Although hopefully adding Marc8->UTF8 conversion 
to ruby-marc might remove the need for a preprocessor in many cases.


So again, we have a bit of a paradox, that I have in my own head too.
Scott suggests that "In either case, what we DON'T want is to halt the
processing altogether."  And yet, still, that the default behavior
should be raising an exception -- that is, halting processing
altogether, right?


So hardly anyone hardly ever is going to want the default behavior, but 
everyone thinks it should be default anyway, to force people to realize 
what they're doing? I am not entirely objecting to that -- it's why I 
brought it up here, but it does seem odd, doesn't it?  To say something 
should be default that hardly anyone hardly ever will want?




Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Scott Prater

We run into this problem fairly regularly, and in fact, ran into it on 
Monday with ruby-marc.


The way we've traditionally handled it is to put our marc stream through
a cleanup preprocessor before passing it off to a marc parser (ruby-marc
or marc4j).


The preprocessor can do one of two things:

  1)  Skip the bad record in the marc stream and move on; or
  2)  Substitute the bad characters with some default character, and 
write it out.


In both cases we log the error as a warning, and include the byte offset
where the bad character occurs, and the record ID, if possible.  This
allows us to go back and fix the errors in a stream in a batch;
generally, the bad encodings fall into four or five common categories
(cutting and pasting data from Windows is a typical cause).
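
The substitution option, with byte offsets for the log, might look roughly like the following. This is a minimal sketch that assumes the data is supposed to be UTF-8; the method name substitute_bad_bytes and the '?' default are made up for the example, and it is not the preprocessor described above.

```ruby
# Hypothetical preprocessor step: find the bad bytes, substitute them, and
# return [cleaned_string, offsets], where offsets are byte positions in the input.
def substitute_bad_bytes(raw, replacement = '?')
  offsets = []
  pos = 0
  cleaned = +''
  str = raw.dup.force_encoding(Encoding::UTF_8)
  # each_char hands back undecodable bytes one at a time as invalid "characters".
  str.each_char do |ch|
    if ch.valid_encoding?
      cleaned << ch
    else
      offsets << pos
      cleaned << replacement
    end
    pos += ch.bytesize
  end
  [cleaned, offsets]
end

cleaned, offsets = substitute_bad_bytes("ab\xE9cd\xFF".b)
offsets.each { |o| warn "bad byte at offset #{o}" }
puts cleaned
```

Option 1 (skip the record) would use the same scan and simply drop the record whenever offsets is non-empty.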


In either case, what we DON'T want is to halt the processing altogether. 
 Generally, we're dealing with thousands, sometimes millions, of MARC 
records in a stream;  it's very frustrating to get halfway through the 
stream, then have the parser throw an exception and halt.  Halting the 
processing should be the strategy of last resort, to be called only when 
the stream has become so corrupted you can't go on to the next record.


I'd want the default to be option 1.  Let the user determine what 
changes need to be made to the data;  the parser's job is to parse, not 
infer and create.  Overwriting data could also lead to the misperception 
that everything is okay, when it really isn't.


-- Scott





Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Jonathan Rochkind


Yes, it will be configurable -- "the assumption can be turned off."

The question is which the default should be.

Any opinions, especially from users of ruby-marc, or other MARC parsing 
libraries?


On 11/20/13 9:32 AM, Jon Stroop wrote:

Coming from nowhere on this...is there a place where it would be
convenient to flag which behavior the user (of the library) wants? I
think you're correct that most of the time you'd just want to blow
through it (or replace it), but for the situation where this isn't the
case, I think the Right Thing to do is raise the exception. I don't
think you would want to bury it in some assumption made internal to the
library unless that assumption can be turned off.

-Jon


On 11/19/2013 07:51 PM, Jonathan Rochkind wrote:

ruby-marc users, a question.

I am working on some Marc8 to UTF-8 conversion for ruby-marc.

Sometimes, what appears to be an illegal byte will appear in the Marc8
input, and it can not be converted to UTF8.

The software will support two alternatives when this happens: 1)
Raising an exception. 2) Replacing the illegal byte with a replacement
char and/or omitting it.

I feel like most of the time, users are going to want #2.  I know
that's what I'm going to want nearly all the time.

Yet, still, I am feeling uncertain whether that should be the default.
Which should be the default behavior, #1 or #2?  If most people most
of the time are going to want #2 (is this true?), then should that be
the default behavior?   Or should #1 still be the default behavior,
because by default bad input should raise, not be silently recovered
from, even though most people most of the time won't want that, heh.

Jonathan

Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Timothy Prettyman

Could you possibly replace the bad bytes with their numeric character
reference (NCR) values, and raise a warning?
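
A minimal sketch of that idea in plain Ruby, assuming the text is meant to be UTF-8: each undecodable byte becomes a hex NCR of its raw byte value (not a guess at the intended character), and a warning reports how many were replaced. The method name replace_with_ncr is made up for the example.

```ruby
# Hypothetical helper: swap each undecodable byte for a hex numeric character
# reference of its byte value, and warn about how many bytes were replaced.
def replace_with_ncr(raw)
  replaced = 0
  text = raw.dup.force_encoding(Encoding::UTF_8).scrub do |bad|
    replaced += bad.bytesize
    bad.bytes.map { |b| format('&#x%02X;', b) }.join
  end
  warn "replaced #{replaced} bad byte(s) with NCRs" if replaced > 0
  text
end

puts replace_with_ncr("caf\xE9".b)
```

Unlike a single replacement character, this keeps the original byte value visible in the output, which may help when fixing records later.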

-Tim


On Wed, Nov 20, 2013 at 9:32 AM, Jon Stroop  wrote:

> Coming from nowhere on this...is there a place where it would be
> convenient to flag which behavior the user (of the library) wants? I think
> you're correct that most of the time you'd just want to blow through it (or
> replace it), but for the situation where this isn't the case, I think the
> Right Thing to do is raise the exception. I don't think you would want to
> bury it in some assumption made internal to the library unless that
> assumption can be turned off.
>
> -Jon
>
>
>
> On 11/19/2013 07:51 PM, Jonathan Rochkind wrote:
>
>> ruby-marc users, a question.
>>
>> I am working on some Marc8 to UTF-8 conversion for ruby-marc.
>>
>> Sometimes, what appears to be an illegal byte will appear in the Marc8
>> input, and it cannot be converted to UTF-8.
>>
>> The software will support two alternatives when this happens: 1) Raising
>> an exception. 2) Replacing the illegal byte with a replacement char and/or
>> omitting it.
>>
>> I feel like most of the time, users are going to want #2.  I know that's
>> what I'm going to want nearly all the time.
>>
>> Yet, still, I am feeling uncertain whether that should be the default.
>> Which should be the default behavior, #1 or #2?  If most people most of the
>> time are going to want #2 (is this true?), then should that be the default
>> behavior?   Or should #1 still be the default behavior, because by default
>> bad input should raise, not be silently recovered from, even though most
>> people most of the time won't want that, heh.
>>
>> Jonathan
>>
>

Re: [CODE4LIB] ruby-marc api design feedback wanted

2013-11-20 Thread Jon Stroop

Coming from nowhere on this...is there a place where it would be 
convenient to flag which behavior the user (of the library) wants? I 
think you're correct that most of the time you'd just want to blow 
through it (or replace it), but for the situation where this isn't the 
case, I think the Right Thing to do is raise the exception. I don't 
think you would want to bury it in some assumption made internal to the 
library unless that assumption can be turned off.
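To make the "can be turned off" idea concrete, here is a minimal sketch of a per-call flag. Everything in it is an assumption for illustration: the method name to_utf8, the :invalid and :replacement options, and BadByteError are all hypothetical, not the ruby-marc API. It also validates UTF-8 as a stand-in, since a real MARC8 decoder is out of scope here.

```ruby
# Hypothetical converter, not ruby-marc's API: the caller picks the policy.
class BadByteError < StandardError; end

# invalid: :raise (default) or :replace; replacement: string used for bad bytes.
def to_utf8(bytes, invalid: :raise, replacement: "\uFFFD")
  str = bytes.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?

  case invalid
  when :raise
    raise BadByteError, "invalid byte sequence in input"
  when :replace
    # String#scrub swaps each invalid sequence for the replacement string.
    str.scrub(replacement)
  else
    raise ArgumentError, "unknown :invalid option #{invalid.inspect}"
  end
end

puts to_utf8("caf\xFF", invalid: :replace)  # lenient, opted in by the caller
```

With the default left at :raise, nothing is silently altered unless the caller asks for it, and the lenient path is one keyword away.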


-Jon


On 11/19/2013 07:51 PM, Jonathan Rochkind wrote:

ruby-marc users, a question.

I am working on some Marc8 to UTF-8 conversion for ruby-marc.

Sometimes, what appears to be an illegal byte will appear in the Marc8 
input, and it cannot be converted to UTF-8.


The software will support two alternatives when this happens: 1) 
Raising an exception. 2) Replacing the illegal byte with a replacement 
char and/or omitting it.


I feel like most of the time, users are going to want #2.  I know 
that's what I'm going to want nearly all the time.


Yet, still, I am feeling uncertain whether that should be the default. 
Which should be the default behavior, #1 or #2?  If most people most 
of the time are going to want #2 (is this true?), then should that be 
the default behavior?   Or should #1 still be the default behavior, 
because by default bad input should raise, not be silently recovered 
from, even though most people most of the time won't want that, heh.


Jonathan

[CODE4LIB] ruby-marc api design feedback wanted

2013-11-19 Thread Jonathan Rochkind


ruby-marc users, a question.

I am working on some Marc8 to UTF-8 conversion for ruby-marc.

Sometimes, what appears to be an illegal byte will appear in the Marc8 
input, and it cannot be converted to UTF-8.


The software will support two alternatives when this happens: 1) Raising 
an exception. 2) Replacing the illegal byte with a replacement char 
and/or omitting it.


I feel like most of the time, users are going to want #2.  I know that's 
what I'm going to want nearly all the time.


Yet, still, I am feeling uncertain whether that should be the default. 
Which should be the default behavior, #1 or #2?  If most people most of 
the time are going to want #2 (is this true?), then should that be the 
default behavior?   Or should #1 still be the default behavior, because 
by default bad input should raise, not be silently recovered from, even 
though most people most of the time won't want that, heh.
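For comparison, Ruby's own String#encode faces the same choice: it raises by default and only replaces when asked. The snippet below uses only core Ruby, with invalid UTF-8 as a stand-in for bad Marc8 input, so it says nothing about how the Marc8 converter itself behaves.

```ruby
# A byte string that is not valid UTF-8 (0xFF never appears in UTF-8).
bad = "caf\xFF".dup.force_encoding("UTF-8")

# Option 1: the default. Transcoding raises on the invalid byte.
raised = false
begin
  bad.encode("UTF-16LE")
rescue Encoding::InvalidByteSequenceError => e
  raised = true
  warn "default behavior raised: #{e.class}"
end

# Option 2: opt in to replacement. For Unicode targets the default
# replacement character is U+FFFD.
replaced = bad.encode("UTF-16LE", invalid: :replace).encode("UTF-8")
puts replaced
```

The round trip through UTF-16LE is only there to force a real transcode, so the invalid byte is actually examined.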


Jonathan
