entity extraction from pdf

McGibbney, Lewis John Mon, 29 Nov 2010 07:06:36 -0800

Good Afternoon List,

I am currently trying to extract entities from PDF documents in order to 
construct a domain ontology. I do not need to index anything, but wish to 
extract the entities and output them as plain text. My requirements are as 
follows:



* drop stop words
* pick up bigrams or n-grams, as well as hyphenated terms such as 
  "U-Values", "super-insulated", etc.
* apply a lower-case filter
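
To make the three requirements above concrete, here is a minimal sketch in 
plain Java of the kind of output described. It assumes the text has already 
been extracted from the PDF by some other means; the class name 
TermExtractor, its two methods and the very short stop list are hypothetical 
and purely illustrative, and none of them belong to Tika or Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika or Lucene.
class TermExtractor {

    // Illustrative stop list only; a real one would be far longer.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "are", "as", "in", "is", "of", "the", "to");

    // Lower-cases the text, splits on anything that is not a letter,
    // digit or hyphen (so "U-Values" stays one token), drops stop words.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}-]+")) {
            t = t.replaceAll("^-+|-+$", ""); // trim stray leading/trailing hyphens
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Joins every run of n adjacent tokens with a single space.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> toks = tokens("The U-Values of super-insulated walls are low.");
        toks.forEach(System.out::println);            // single terms
        ngrams(toks, 2).forEach(System.out::println); // bigrams
    }
}
```

Note that in this sketch the bigrams are built after stop words are removed, 
so a bigram can join two words that were separated by a stop word in the 
original text. Whether that is desirable for ontology building is an open 
choice.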

My intention is to pass the PDF document as input and receive the above as 
output, which I can then use to manually construct my ontology from the 
entities and their relationships. I have been using Lucene 3.0.1 and Luke as 
an alternative approach; however, this is time-consuming and requires a lot 
of work that does not directly address the problem.

I apologise if this is not a query for the Tika mailing list.

Any help would be great ;0) Thanks

Lewis


