Seems like a lively party, so I'll join in.
I think all the match points are bad. Ultimately we used many different
factors to arrive at a sort of "match score". This seemed to be a good
approach. Obviously, we can't plan for all the possible variations and
inconsistencies in MARC, but we can do a
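One way to picture the "match score" idea, as a minimal Python sketch and not the code actually used: each factor on which two records agree contributes a weight, and the total is compared against a threshold. The Bib fields, the weights, and the norm() helper are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bib:
    title: str
    author: str
    pub_year: str
    isbns: frozenset


# Hypothetical weights; real values would need tuning against
# cataloger-reviewed merges.
WEIGHTS = {"title": 0.4, "author": 0.3, "pub_year": 0.1, "isbn": 0.2}


def norm(text: str) -> str:
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def match_score(a: Bib, b: Bib) -> float:
    """Sum the weights of the factors on which the two records agree."""
    score = 0.0
    if norm(a.title) and norm(a.title) == norm(b.title):
        score += WEIGHTS["title"]
    if norm(a.author) and norm(a.author) == norm(b.author):
        score += WEIGHTS["author"]
    if a.pub_year and a.pub_year == b.pub_year:
        score += WEIGHTS["pub_year"]
    if a.isbns & b.isbns:  # at least one ISBN in common
        score += WEIGHTS["isbn"]
    return score
```

No single factor decides the outcome here: two records with no ISBN in common can still score high on title, author, and date, and a shared ISBN alone contributes only its own weight.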
> We liked your fingerprinting idea. We expanded it a bit:
Awesome.
There was another idea we had (and implemented) back when I worked for
PINES, though I don't know how worthwhile it is these days:
A dedupe interface that can allow/expedite user processing of proposed
merges from algorithms sim
All,
I meant to share the results (the fruits of our labor):
1. 169,206 bibs were deduped.
2. We had 1,000,234 non-deleted bibs and now we have 832,915 (there were
new bibs getting added to the system for the duration of the dedupe)
3. 16.9% duplication resolution
4. 12,381 bibs were NOT merge
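The figures above can be cross-checked with a little arithmetic. This is a small Python sketch using only the numbers quoted in the list; the count of bibs added during the run is derived, and it assumes each deduped bib removed exactly one record and that nothing else was deleted in the meantime.

```python
deduped = 169_206
bibs_before = 1_000_234
bibs_after = 832_915

# Share of the starting catalog that was merged away.
rate = deduped / bibs_before * 100

# If nothing had been added during the run, this many bibs would remain.
expected_after = bibs_before - deduped

# The difference is attributed to bibs added while the dedupe ran
# (an assumption; see the note above).
added_during_run = bibs_after - expected_after

print(f"{rate:.1f}% of the starting bibs were deduped")
print(f"{added_during_run:,} bibs apparently added during the run")
```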
Jason,
We liked your fingerprinting idea. We expanded it a bit:
$fingerprints{alternate} = join("\t",
$marc{item_form}, $marc{date1}, $marc{record_type},
$marc{bib_lvl}, $marc{title}, $marc{subtitle}.$marc{subtitlep},
$marc{author} ? $marc{author} : '',
$marc{audioformat}, $m
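The fingerprint approach in the fragment above amounts to joining selected fields into one tab-separated key and treating records with identical keys as merge candidates. A minimal Python sketch of that grouping step follows. The field list is an illustrative subset of the keys visible in the truncated Perl, and fingerprint() and group_candidates() are hypothetical names, not functions from migration-tools.

```python
from collections import defaultdict

# Illustrative subset of the keys visible in the Perl fragment; the
# fragment is cut off, so this is not the full fingerprint.
FIELDS = ("item_form", "date1", "record_type", "bib_lvl", "title", "author")


def fingerprint(marc: dict) -> str:
    """Join the selected fields with tabs; missing fields become ''."""
    return "\t".join(str(marc.get(field) or "") for field in FIELDS)


def group_candidates(records: dict) -> list:
    """Return lists of record ids that share an identical fingerprint."""
    groups = defaultdict(list)
    for rec_id, marc in records.items():
        groups[fingerprint(marc)].append(rec_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Because the key is an exact string, any difference in a single field (say, a different date1) keeps two records apart, which is what makes this kind of fingerprint conservative.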
For what it's worth, this is the fairly conservative algorithm used by
the default fingerprinter in the migration-tools repository:
https://docs.google.com/document/d/1tvuA0Os3W0B2Fl_GvO_Z6ZG6ZHecg8JtTRMz3QUktK8/edit?usp=sharing
Comments welcome.
--
Jason Etheridge
| Community and Migration Man
We will have to agree to disagree.
On Tue, Apr 26, 2016 at 2:32 PM, Elaine Hardy
wrote:
> For cataloging, ISBN is not a match point. For data cleanup and migration,
> it is, at the very least, a bad match point. There are too many potential
> errors with 020s to use it as a main match point. We
For cataloging, ISBN is not a match point. For data cleanup and migration,
it is, at the very least, a bad match point. There are too many potential
errors with 020s to use it as a main match point. We still have mismatches
in our catalog where a vendor used it as the main match point -- as a
resu
I disagree that the 020 can't be used as a match point. I don't think it
should be used as the only match point. It is possible to generate errors
with the method described in that code. In my experience the benefits of
the high number of accurate matches outweighed the bad matches.
CiL publish
Keep in mind that an ISBN (MARC field 020) is not a match point. It is a
finding aid. Publishers do reuse ISBNs or use a different ISBN for what is
a new printing rather than a new publication (meaning no change in
information). Not only can ISBNs for all formats of a title be present on a
bib reco
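If 020s are used at all, one cheap safeguard is to discard values that fail the standard ISBN check-digit tests before comparing them. The Python sketch below does only that. It catches transcription errors, not the publisher reuse described above, so it does not make ISBN a reliable match point on its own. The clean_020() helper and its handling of qualifiers such as "(pbk.)" are assumptions for illustration.

```python
DIGITS = "0123456789"


def is_valid_isbn10(isbn: str) -> bool:
    """ISBN-10 check: weighted sum (weights 10 down to 1) divisible by 11."""
    if len(isbn) != 10:
        return False
    total = 0
    for i, ch in enumerate(isbn):
        if ch in DIGITS:
            value = int(ch)
        elif ch in "Xx" and i == 9:  # 'X' means 10, check digit only
            value = 10
        else:
            return False
        total += (10 - i) * value
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    """ISBN-13 check: alternating weights 1 and 3, sum divisible by 10."""
    if len(isbn) != 13 or any(ch not in DIGITS for ch in isbn):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(isbn))
    return total % 10 == 0


def clean_020(value: str):
    """Return the normalized ISBN from an 020 $a value, or None if invalid."""
    parts = value.split()
    if not parts:
        return None
    token = parts[0].replace("-", "").upper()  # drop qualifiers and hyphens
    if is_valid_isbn10(token) or is_valid_isbn13(token):
        return token
    return None
```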
That is one thing to point out: when it was originally written, electronic
records were still fairly rare. The consortium it was written for still
only uses them in very small numbers, and I set those up as distinct bib
sources, which I modified the bib selection code to exclude. Those are
things to l
Whatever method you use, I heartily recommend running it on a testing
system and having catalogers look over the results first.
You may have already done all the due diligence, but I say it for
anyone reading along as well. I've never had problems with
this method and heard back from others with pos
; Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Programmatic Merging of Bibliographic Records
Hi Jim,
It is available. To be clear I helped create the de-duplication algorithm
but the actual coding was done by Galen Charlton of Equinox. You can find
it here:
http://git.esilibrary.com/?p=migration-tools.git;h=300a04108fc6a3d14424c6d365329be334114f7d
The full scope of the script goes a
On 04/25/2016 02:45 PM, Jim Taylor wrote:
Yes. Thank you. I should have written that down right after we talked but
didn't. I assume this will also take care of any holds and other related
links?
It does.
Thanks.
Jim
You're welcome.
Rogan Hamby shared his work with me. It's a set of SQL procedures that produce a 'best bib' and then identify the less interesting duplicate, and it seems to work well. I modified it so that it produces the candidates but doesn't actually do the merge since we like to have that personal touch
lf Of
Jason Stephenson
Sent: Monday, April 25, 2016 1:42 PM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Programmatic Merging of Bibliographic
Records
On 04/25/2016 02:24 PM, Jim Taylor wrote:
I raised the question at the conference regarding the ability to merge
records outside the program interface and was told there was a
procedure/function that would allow this to be done. Does anyone know
where I can find this function? My searching has
-boun...@list.georgialibraries.org] On Behalf Of
Janet Schrader
Sent: Monday, April 25, 2016 1:35 PM
To: 'Evergreen Discussion Group'
Subject: Re: [OPEN-ILS-GENERAL] Programmatic Merging of Bibliographic
Records
Jim,
Do you mean something other than record buckets? You can put bib records in a
bucket and then select to merge them. The records open tiled vertically and you
select the lead (retained) one. In this interface you can edit the record if
you want to include something from the merged record in