On Wed, Nov 24, 2010 at 10:05 AM, Karen Coyle <kco...@kcoyle.net> wrote:
> It might be necessary to drop them out of the Amazon data gathering,
> although it would be a shame because they also contribute some of the
> "long tail" books to the database. I wonder if it wouldn't at least be
> possible to drop all of the instances of
> "(translator)" (case insensitive)
> from the author strings and see how much that clears these up. (I also
> saw a few cases of "[translator]" and there may be other patterns as
> well.)
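For what it's worth, here is a rough sketch of the kind of pattern matching involved. It assumes author strings are plain text with the role marker at the very end, in parentheses or square brackets. The names ROLE_PATTERN and split_role are made up for illustration and are not part of any Open Library tool. It separates the role from the name instead of discarding it, so the role could be kept or dropped afterwards.

```python
import re

# Assumption: the marker is trailing and wrapped in () or [].
# Only "translator" is covered, since that is the case seen so far;
# other roles or bracket styles would need their own patterns.
ROLE_PATTERN = re.compile(r"\s*[\(\[]\s*(translator)\s*[\)\]]\s*$", re.IGNORECASE)


def split_role(author_string):
    """Return (name, role); role is None when no trailing marker is found."""
    match = ROLE_PATTERN.search(author_string)
    if match is None:
        return author_string.strip(), None
    # Everything before the marker is treated as the name.
    name = author_string[:match.start()].strip()
    return name, match.group(1).lower()
```

Calling split_role on each author string from a dump would give a list of candidate changes to review by hand before anything is edited.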
Personally, I don't think we should automate dropping them; it is good metadata. Rather, I think we should automate moving it into the additional people list. The trick will be coming up with some judicious pattern matching smarts. (But here is another fun one that probably should just be dropped: http://openlibrary.org/search/authors?q=from+old+catalog :-)

I see quite a few cases where useful metadata could be moved from one field to another, such as book titles with series or edition suffixes like "(Great Classics Series)" or http://openlibrary.org/search?q=large+print+edition. These follow fairly regular patterns, so the moves could be automated with supervision.

I'd like to automate some of that myself, but I haven't come across any references to bulk update tools for users. I've downloaded the dumps and grep'ed through them to gather information for author merges, but I haven't seen any way to do the actual updates other than through a real browser. The API docs indicate the API is read-only for remote users.

Is anyone currently using any techniques for mass updates?

- Alan