[ol-tech] Documentation of JSON name/values in OL full dumps?

2015-02-18 Thread Jon Leech
I'm trying to interpret the contents of the JSON full dumps found at https://openlibrary.org/developers/dumps . Unfortunately the page just has a bunch of :TODO: under the JSON Format section, and I haven't found any other relevant docs on the dumps yet. It looks like the documentation on

[ol-tech] Is the OL database considered to be internally consistent?

2015-03-03 Thread Jon Leech
As I work with OL, more inconsistencies continue to show up. For example, something like 1% of the searches I perform by title simply fail, even though the exact title is in the full dump, and the work page can be reached by going directly to the corresponding URL. Consider:

[ol-tech] Database schema current dev code repo

2015-02-25 Thread Jon Leech
I'd like to try building a local database from the full OL dumps, since grepping a 5 GB file is less than speedy even on an SSD. I looked on the website and first found https://openlibrary.org/about/tech which has a link under "Read the source code" to https://github.com/openlibrary
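
For anyone wanting to try the same thing, here is a minimal sketch of loading dump lines into SQLite so title lookups hit an index instead of a grep. It assumes each dump line has five tab-separated fields (record type, key, revision, last-modified timestamp, JSON document) and that the JSON may carry a "title" name; both are assumptions to check against the actual file. The table, column, and function names are made up for illustration and are not part of any OL code.

```python
import json
import sqlite3


def parse_rows(lines):
    """Yield one tuple per well-formed dump line (assumed 5 tab-separated fields)."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5:
            continue  # skip lines that do not match the assumed layout
        rec_type, key, revision, last_modified, raw = parts
        try:
            revision = int(revision)
            doc = json.loads(raw)
        except ValueError:
            continue  # skip non-numeric revisions and unparseable JSON
        title = doc.get("title") if isinstance(doc, dict) else None
        yield (key, rec_type, revision, last_modified, title, raw)


def load_dump(lines, conn):
    """Insert parsed rows into a hypothetical 'records' table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "key TEXT PRIMARY KEY, type TEXT, revision INTEGER, "
        "last_modified TEXT, title TEXT, json TEXT)"
    )
    # executemany consumes the generator lazily, so the dump is never held in memory
    conn.executemany(
        "INSERT OR REPLACE INTO records VALUES (?, ?, ?, ?, ?, ?)",
        parse_rows(lines),
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_title ON records(title)")
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


def find_by_title(conn, title):
    """Return the keys of all records whose title matches exactly."""
    cur = conn.execute("SELECT key FROM records WHERE title = ? ORDER BY key", (title,))
    return [row[0] for row in cur]
```

To use it, pass an open file object for the dump as `lines` and a connection from `sqlite3.connect("ol.db")` (the filename is arbitrary). An exact-match index like this only helps with exact titles; anything fuzzier would need a different approach.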

Re: [ol-tech] Epubs with missing pages (was Re: What data cleanups would you like to see?)

2015-11-05 Thread Jon Leech
On Thu, Nov 05, 2015 at 08:45:13PM -0500, Tom Morris wrote:
> Thanks for the quick reply, Jon. It really needs to be something which
> doesn't require borrowing, because, as far as I know, the source for those
> will always be unavailable.
>
> For this example, the directory of files is:
>

[ol-tech] How are 'abbyy' files generated? (was Re: Epubs with missing pages)

2015-11-06 Thread Jon Leech
On Thu, Nov 05, 2015 at 10:16:04PM -0500, Tom Morris wrote:
> I've got a fix in hand and will generate a pull request as soon as I have
> some test data to test with.
It looks like the 'epub' project requires 'abbyy' OCR output as a starting point. Is the toolchain for going from raw scans to

Re: [ol-tech] What data cleanups would you like to see?

2015-09-25 Thread Jon Leech
On Fri, Sep 25, 2015 at 03:58:40PM -0400, Tom Morris wrote:
> Someone asked me off-list what types of OpenLibrary data cleanups I'd
> suggest. Below is the list that I came up with off the top of my head.
> What others would folks suggest? What do you think is more important?
>
> Possible data