You can check out NIEM (I mistyped before) at 
https://www.niem.gov/Pages/default.aspx

I fully agree about PDF.  PDFs are just a bag of words and images.  Their 
structure is there for presentation, not for machine-readable content.

The eXist XML database has an optional extension module that is supposed to 
do content extraction, even from PDFs.

To turn it on, change the following value from false to true in 
extensions/build.properties and recompile.

> # Binary Content and Metadata Extraction Module
> include.feature.contentextraction = false
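
If you would rather script that change than edit the file by hand, something 
along these lines should work.  It is only a rough sketch in Python: the 
function name enable_feature is my own, and I am assuming the file uses plain 
"key = value" lines like the excerpt above.

```python
PROPERTY = "include.feature.contentextraction"


def enable_feature(text, name=PROPERTY):
    """Return the properties text with `name` set to true.

    Comment lines and all other properties are passed through unchanged.
    """
    out = []
    for line in text.splitlines(keepends=True):
        key, sep, _value = line.partition("=")
        # Only touch the line whose key is exactly the property we want;
        # a commented-out copy ("# name = false") does not match.
        if sep and key.strip() == name:
            newline = "\n" if line.endswith("\n") else ""
            line = f"{key.rstrip()} = true{newline}"
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    sample = (
        "# Binary Content and Metadata Extraction Module\n"
        "include.feature.contentextraction = false\n"
    )
    print(enable_feature(sample), end="")
```

To apply it for real, read extensions/build.properties, pass its contents 
through enable_feature, write the result back, and then recompile as before.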


I tried it a little bit and it seems to be functional.



On Apr 22, 2012, at 5:16 PM, Michael Sokolov <soko...@ifactory.com> wrote:

> On 4/22/2012 4:02 PM, daniela florescu wrote:
>> MIchael,
>> 
>> while XML isn't obviously the panacea for the world's evil, I don't think 
>> anybody can deny the fact that a uniform use
>> of XML through the IT layers would allow MUCH more information to flow among 
>> participants.
>> 
>> And that can be golden in many circumstances...
>> 
>> Best
>> Dana
>> 
> Oh I completely agree.  A standardized XML format (is it NEIM?) will be far 
> more useful in many ways than mere PDFs, I'm sure.  I just think it's a bit 
> funny how the effort required from the information *providers* tends to get 
> glossed over.  I don't really know much about the adoption of this standard - 
> I just know that starting from PDF, generating useful XML is a nontrivial 
> task.
> 
> -Mike
> _______________________________________________
> talk@x-query.com
> http://x-query.com/mailman/listinfo/talk

