Bryan, there's still hope! :D
Two new tricks.

First, does the filename contain meaningful data about the title? If so, check for the presence of those words and what is near them.

Second, I believe titles use big font faces and appear alone on a page, or at least have prominence on a page. Look for text in a big font size.

What you need is a conflict resolution screen. For the PDFs where the process works, fine. For those where the process gets lost, just launch them in Preview or your favorite application and tell the "user" to highlight/select the text of the title in Preview. In Rev, use a simple AppleScript to get the selected text of Preview. This way, for the few cases where your software does not work, you have a quick fix that only involves a human selecting the text of the title and pressing a button.

Well, I am assuming you are using Mac OS X. If you're on Windows, then someone here with more Windows experience may know how to get the selected text from Adobe Reader using VBScript or shell commands or something like that.

A system-agnostic approach would be to ask the user to select and copy the title to the clipboard; this way, you just need to check clipboardData["text"] to get your title.

Cheers
andre "this is a hack" garzia

On 7/6/07, Bryan McCormick <[EMAIL PROTECTED]> wrote:
Ken, Andre

Thanks for taking the time on this vexing PDF issue. Either solution does appear to work at least some of the time. The better-formatted papers that follow a standard form (JEL classification, etc.) can easily be read by Andre's solution, and sometimes by Ken's as well. However, many papers were created in other countries, though published in English, so the file structure is radically different and honked the app in some cases.

On some level the problem is simply not technical. Many papers that were published "pretty print" don't have any explicit structure. So, for example, using the direct read method you'd never find a title element. And when read in using the pdftohtml conversion (cool trick!) there is nothing, nada, rien du tout that suggests where the title is on the page. So for automatic indexing or scraping of the page, it's a no go.

Unfortunately this appears to be a result of the publishers not thinking through the implications of needing a machine to read a file. The worst offenders have no consistent structure and assumed one person sitting at a machine at a time, with the leisure to actually read something. What the heck were they thinking? This is one area where Google wins.

Thanks guys.

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
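
For what it's worth, the filename and big-font tricks from the reply at the top of this message could be sketched roughly like this. It is in Python rather than Rev, purely as an illustration. The function names, the scoring weights and the 0.5 cutoff are all made up for the example, and it assumes you already have the first page as a list of (text, font size) pairs from whatever extractor you use. When nothing stands out it returns None, which is the cue to fall back to the human-selects-the-title screen.

```python
import re
from pathlib import Path


def filename_words(filename):
    """Split a name like 'growth_and_trade_2007.pdf' into lowercase words.

    Words of three letters or fewer are dropped (hypothetical cutoff) so
    that fillers such as 'and' or 'the' do not count as matches.
    """
    stem = Path(filename).stem.lower()
    return {w for w in re.split(r"[^a-z]+", stem) if len(w) > 3}


def guess_title(filename, lines):
    """Return the most title-like line, or None if nothing stands out.

    `lines` is a list of (text, font_size) pairs taken from the first page.
    How those pairs are extracted from the PDF is outside this sketch.
    """
    if not lines:
        return None
    words = filename_words(filename)
    sizes = [size for _, size in lines]
    biggest = max(sizes)
    # A big font only counts as a signal if a single line uses it.
    unique_big = sizes.count(biggest) == 1

    best, best_score = None, 0.0
    for text, size in lines:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        # Fraction of the filename words that appear in this line.
        overlap = len(words & tokens) / len(words) if words else 0.0
        font_bonus = 1.0 if (unique_big and size == biggest) else 0.0
        score = overlap + font_bonus
        if score > best_score:
            best, best_score = text, score

    # Arbitrary cutoff: below it, hand the PDF to the human.
    return best if best_score >= 0.5 else None


if __name__ == "__main__":
    page = [
        ("Working Paper No. 12", 10.0),
        ("Growth and Trade in Small Economies", 18.0),
        ("Abstract", 10.0),
    ]
    print(guess_title("growth_and_trade_2007.pdf", page))
```

A scanned file with a meaningless name and uniform font sizes gets no score at all, so it lands in the manual queue instead of producing a wrong title.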
