Re: PDF files handling

Andre Garzia Fri, 06 Jul 2007 07:27:13 -0700

Bryan,

there's still hope! :D


Two new tricks. First, does the filename contain meaningful data about the title?
If so, check for the presence of those words and what is near them. Second, I believe
titles use big font faces and appear alone on a page, or at least stand out
on the page, so look for text set in a large font size.
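
Here is a rough sketch of how the two tricks could be combined, in Python just for illustration. It assumes you already have the first page as a list of (text, font size) pairs from whatever extraction tool you use; that input format and the function names are made up for this example.

```python
import re
from pathlib import Path
from typing import List, Optional, Set, Tuple

# Hypothetical input: (text, font_size) pairs for the first page.
# How they are extracted from the PDF is not shown here.
Span = Tuple[str, float]


def words_of(text: str) -> Set[str]:
    """Lowercase alphanumeric words found in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filename_keywords(filename: str) -> Set[str]:
    """Words from the filename stem, ignoring very short ones."""
    return {w for w in words_of(Path(filename).stem) if len(w) > 2}


def guess_title(spans: List[Span], filename: str) -> Optional[str]:
    """Prefer spans that share a word with the filename, then take
    the one with the largest font size. Returns None if no spans."""
    if not spans:
        return None
    keywords = filename_keywords(filename)
    matching = [s for s in spans if keywords & words_of(s[0])]
    candidates = matching or spans  # fall back to font size alone
    return max(candidates, key=lambda s: s[1])[0]
```

When no span shares a word with the filename, this falls back to the largest text on the page, which is only a guess and is exactly the case where a human check helps.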

What you need is a conflict resolution screen. For the PDFs where the process
works, you're fine. For those where the process gets lost, just launch them
in Preview or your favorite application and tell the "user" to
highlight/select the text of the title in Preview. In Rev, use a simple
AppleScript to get the selected text of Preview. This way, for the few
cases where your software does not work, you have a quick fix that involves
only a human selecting the text of the title and pressing a button.

Well, I am assuming you are using Mac OS X. If you're actually on Windows, then
someone here with more Windows experience may know how to get the
selected text of Adobe Reader using VBScript or shell commands or something
like that.

A system-agnostic approach would be to ask the user to select and copy the
title to the clipboard. This way, you just need to check
clipboarddata["text"] to get your title.
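
The fallback flow above could look roughly like this, again sketched in Python for illustration only. Both callables are hypothetical stand-ins: auto_extract is whatever automatic method you use, and ask_user would open the PDF, wait for the user to copy the title, and return the clipboard text (in Rev, the value of clipboarddata["text"]).

```python
from typing import Callable, Optional, Tuple


def clean(text: Optional[str]) -> str:
    """Collapse runs of whitespace, since a selected title may span lines."""
    return " ".join((text or "").split())


def resolve_title(
    pdf_path: str,
    auto_extract: Callable[[str], Optional[str]],  # hypothetical automatic step
    ask_user: Callable[[str], Optional[str]],      # hypothetical manual step
) -> Tuple[Optional[str], str]:
    """Return (title, source) where source is 'auto', 'manual' or 'unresolved'."""
    title = clean(auto_extract(pdf_path))
    if title:
        return title, "auto"
    # Automatic extraction got lost: hand the file to a human.
    title = clean(ask_user(pdf_path))
    if title:
        return title, "manual"
    return None, "unresolved"
```

The human is only asked when the automatic step returns nothing, so the manual work stays limited to the problem files.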


Cheers
andre "this is a hack" garzia

On 7/6/07, Bryan McCormick <[EMAIL PROTECTED]> wrote:


Ken, Andre

Thanks for taking the time on this vexing PDF issue. Either solution
does appear to work some of the time at least. The better formatted
papers that have a standard form (JEL classification, etc.) can easily be
read by Andre's solution, and sometimes by Ken's as well. However, many
papers were created in other countries, though published in English, so
the file structure seems radically different and honked the app in
some cases.

On some level the problem is simply not a technical one. Many
papers that were published "pretty print" don't have any explicit
structure. So, for example, using the direct read method you'd never find
a title element. And when read in using the pdftohtml conversion (cool
trick!) there is nothing, nada, rien du tout that suggests where the
title is on the page. So for automatic indexing or scraping of the page,
it's a no go.

Unfortunately this appears to be a result of the publishers not thinking
through the implications of needing a machine to read a file. The
worst offenders have no consistent structure and assumed one person
at a time sitting at a machine with the leisure to actually read
something. What the heck were they thinking?

This is one area where Google wins. Thanks guys.
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your
subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
