Good afternoon list,

I am trying to extract entities from PDF documents in order to construct a domain ontology. I do not need to index anything; I only want to extract the terms and output them as plain text. My requirements are as follows:
* drop stop words
* be able to pick up bigrams or n-grams such as "U-Values", "super-insulated", etc.
* apply a lower-case filter

My intention is to pass the PDF document as input and receive the above as output, which I can then use to manually construct my ontology from the entities and their relationships.

I have been using Lucene 3.0.1 and Luke as one way of solving my problem; however, this is time-consuming and requires a lot of work that does not directly address the task.

I apologise if this is not a query for the Tika mailing list. Any help would be great ;0)

Thanks
Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474
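
P.S. To make the requirements above concrete, here is a minimal sketch in plain Java of the kind of output I am after. It assumes the text has already been extracted from the PDF (for example by Tika) and that a recent JDK is available. The class TermSketch, its methods and the tiny stop list are hypothetical names made up for illustration; they are not part of Tika or Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not a Tika or Lucene class.
final class TermSketch {

    // Tiny illustrative stop list (an assumption); a real list would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "in", "is", "of", "the", "to", "with"));

    // Lower-case the text, split on anything that is not a letter, digit or
    // hyphen, and drop stop words. Hyphens are kept so that "U-Values" and
    // "super-insulated" each survive as a single token.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}-]+")) {
            // Trim hyphens left dangling at either end, e.g. from " - ".
            String t = raw.replaceAll("^-+|-+$", "");
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Build word n-grams of size n over the already filtered tokens.
    // Note: words that were separated only by a stop word become adjacent.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The U-Values of super-insulated walls";
        List<String> tokens = tokens(text);
        System.out.println(tokens);            // [u-values, super-insulated, walls]
        System.out.println(ngrams(tokens, 2)); // [u-values super-insulated, super-insulated walls]
    }
}
```

Whether n-grams should be built before or after stop-word removal is an open choice; the sketch builds them afterwards, which is only one option.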
