PDF files handling

Bryan McCormick Fri, 06 Jul 2007 04:58:01 -0700

Ken, Andre

Thanks for taking the time on this vexing PDF issue. Either solution does appear to work, at least some of the time. The better-formatted papers that follow a standard form (JEL classification, etc.) can easily be read by Andre's solution, and sometimes by Ken's as well. However, many papers were created in other countries, though published in English, so the file structure seems radically different and honked the app in some cases.

On some level the problem is simply not a technical one. Many papers that were published "pretty print" don't have any explicit structure. So, for example, using the direct read method you'd never find a title element. And when read in using the pdftohtml conversion (cool trick!) there is nothing, nada, rien du tout that suggests where the title is on the page. So for automatic indexing or scraping of the page, it's a no go.

Unfortunately this appears to be a result of the publishers not thinking through the implications of needing a machine to read a file. These worst offenders have no consistent structure and assumed one person sitting at a machine at a time, having the leisure to actually read something. What the heck were they thinking?


This is one area where Google wins. Thanks guys.
