I'm trying to interpret the contents of the JSON full dumps found at
https://openlibrary.org/developers/dumps . Unfortunately, that page just
has a bunch of :TODO: placeholders under the JSON Format section, and I
haven't found any other relevant documentation for the dumps yet.
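In case it helps anyone else, here's what I've pieced together by inspecting a dump file directly: each line seems to be five tab-separated columns (type, key, revision, last_modified, and a JSON blob). A minimal reader sketch under that assumption; the column layout is my own inference from the file, not from any official docs:

```python
import gzip
import json

# Inferred layout of each ol_dump_*.txt.gz line (tab-separated):
#   type    key    revision    last_modified    JSON record
def iter_records(path):
    """Yield (type, key, revision, last_modified, parsed_json) per dump line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            rtype, key, revision, modified, blob = line.rstrip("\n").split("\t", 4)
            yield rtype, key, int(revision), modified, json.loads(blob)

# Example: print the title of every work record
# for rtype, key, rev, mod, rec in iter_records("ol_dump_latest.txt.gz"):
#     if rtype == "/type/work":
#         print(key, rec.get("title"))
```

If the real dumps deviate from this (extra columns, uncompressed files), the split count and the gzip call are the two things to adjust.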
As I work with OL, more inconsistencies keep showing up. For
example, roughly 1% of the searches I perform by title simply
fail, even though the exact title is in the full dump and the work page
can be reached by going directly to the corresponding URL. Consider:
I'd like to try building a local database from the full OL dumps,
since grepping a 5 GB file is less than speedy even on an SSD. I looked
on the website and first found
https://openlibrary.org/about/tech
which has a link under "Read the source code" to
https://github.com/openlibrary
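For the local-database idea, a small SQLite index over the dump already beats grep for exact-title lookups. A sketch, assuming the tab-separated dump layout (type, key, revision, last_modified, JSON); the file and table names here are just placeholders:

```python
import gzip
import json
import sqlite3

def build_db(dump_path, db_path="ol.db"):
    """Index a dump into SQLite: key -> type, revision, title, raw JSON."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS records (
                       key TEXT PRIMARY KEY,
                       type TEXT,
                       revision INTEGER,
                       title TEXT,
                       json TEXT)""")
    con.execute("CREATE INDEX IF NOT EXISTS idx_title ON records(title)")
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        rows = []
        for line in f:
            rtype, key, rev, _modified, blob = line.rstrip("\n").split("\t", 4)
            title = json.loads(blob).get("title")
            rows.append((key, rtype, int(rev), title, blob))
            if len(rows) >= 10000:  # batch inserts so the load stays fast
                con.executemany(
                    "INSERT OR REPLACE INTO records VALUES (?,?,?,?,?)", rows)
                rows = []
        if rows:
            con.executemany(
                "INSERT OR REPLACE INTO records VALUES (?,?,?,?,?)", rows)
    con.commit()
    return con

# An exact-title lookup is then an indexed query instead of a 5 GB grep:
# con.execute("SELECT key FROM records WHERE title = ?", ("Some Title",))
```

The title index only helps exact matches; for fuzzy or substring search you'd want SQLite's FTS5 module or an external search engine on top of the same table.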
On Thu, Nov 05, 2015 at 08:45:13PM -0500, Tom Morris wrote:
> Thanks for the quick reply, Jon. It really needs to be something which
> doesn't require borrowing, because, as far as I know, the source for those
> will always be unavailable.
>
> For this example, the directory of files is:
>
On Thu, Nov 05, 2015 at 10:16:04PM -0500, Tom Morris wrote:
> I've got a fix in hand and will generate a pull request as soon as I have
> some test data to test with.
It looks like the 'epub' project requires 'abbyy' OCR output as a
starting point. Is the toolchain for going from raw scans to
On Fri, Sep 25, 2015 at 03:58:40PM -0400, Tom Morris wrote:
> Someone asked me off-list what types of OpenLibrary data cleanups I'd
> suggest. Below is the list that I came up with off the top of my head.
> What others would folks suggest? What do you think is more important?
>
> Possible data