Re: [Wikitech-l] Extension:Pdfhandler

Luiz Augusto Thu, 25 Dec 2008 11:41:17 -0800

On Thu, Dec 25, 2008 at 3:52 PM, Ilmari Karonen <[email protected]> wrote:

> Luiz Augusto wrote:
> >
> > I'm asking because I have approximately 30 GB of public domain scans in
> > .pdf format to upload to Commons over the next few months (see
> > http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Archive_only_for_3_Months_FREE
> > for further information) and because I fully agree with the reasons
> > listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3
>
> Assuming that these are scanned documents that haven't been vectorized,
> have you considered converting them to DjVu format?  Not only does
> Wikimedia currently have better support for it than PDF, but you might
> realize some file size savings.  Apparently, there's software out there
> to more or less automate it.

Someone asked the same question on en.wikisource, and I replied with this:
http://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=928130

DjVu (or at least every conversion tool and configuration option that I've
tried in the past months, including the LizardTech Document Express Enterprise
pdf2djvu and png2djvu options) is a lossy format. If I convert a .pdf
downloaded from Google Book Search, I get a low-quality file (70 dpi or
150 dpi per page), but if I extract the images from the same .pdf file using
Adobe Acrobat Pro 8, I get a 600 dpi JPEG for each page (OCR software
normally recommends using 300 dpi images).
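
As a rough illustration of the resolution figures above: the effective dpi of
a page image is its pixel size divided by the physical page size in inches.
The sketch below is only an assumption-laden example. The function names, the
6 x 9 inch page, and the pixel dimensions are hypothetical and were not
measured from any actual Google Book Search file.

```python
def effective_dpi(pixel_width: int, pixel_height: int,
                  page_width_in: float, page_height_in: float) -> float:
    """Return the lower of the horizontal and vertical resolutions, in dpi."""
    if page_width_in <= 0 or page_height_in <= 0:
        raise ValueError("page dimensions must be positive")
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def meets_ocr_minimum(dpi: float, minimum: float = 300.0) -> bool:
    """True if the resolution reaches the 300 dpi that OCR software
    normally recommends (the default threshold is taken from that figure)."""
    return dpi >= minimum


# Hypothetical 6 x 9 inch page, for illustration only.
extracted = effective_dpi(3600, 5400, 6.0, 9.0)   # image extracted from the .pdf
converted = effective_dpi(900, 1350, 6.0, 9.0)    # page after a lossy conversion

print(extracted, meets_ocr_minimum(extracted))
print(converted, meets_ocr_minimum(converted))
```

With these made-up numbers, the extracted image clears the 300 dpi threshold
and the converted page does not, which is the gap I am describing.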

>
> Of course, that doesn't in any way preclude or remove the need for
> _also_ improving our PDF support.

Surely :)

> But PDF, as common and useful as it
> is, might not be the optimal format here.
>

Well, all digitized works from all the libraries that I know of (in Europe,
the United States and Brazil) are available only in .pdf format. The
Internet Archive is the only one that makes both .pdf and .djvu available for
the same book (the .djvu version from IA is also a low-quality file, but at
least it is delivered with high-quality OCR embedded in the .djvu file, thanks
to some closed-source, paid OCR software [ABBYY FineReader, I believe]).
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
