Ken, Andre

Thanks for taking the time on this vexing PDF issue. Either solution does appear to work, at least some of the time. The better-formatted papers that follow a standard form (JEL classification, etc.) can easily be read by Andre's solution, and sometimes by Ken's as well. Many papers, though, were created in other countries even though they were published in English, so the file structure seems radically different, and in some cases that honked the app.

On some level the problem is simply not a technical one. Many papers that were published "pretty print" don't have any explicit structure. So, for example, using the direct read method you'd never find a title element. And when the file is read in using the pdftohtml conversion (cool trick!), there is nothing, nada, rien du tout that suggests where the title is on the page. So for automatic indexing or scraping of the page, it's a no go.

Unfortunately this appears to be the result of the publishers not thinking through the implications of needing a machine to read a file. The worst offenders have no consistent structure and assumed one person sitting at a machine at a time, with the leisure to actually read something. What the heck were they thinking?

This is one area where Google wins. Thanks guys.
_______________________________________________
use-revolution mailing list
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
