content extraction for pdf links

McGibbney, Lewis John Thu, 20 Jan 2011 04:34:06 -0800

Hello list,

I have been using Nutch 1.2 to crawl the web for a small number of very 
relevant HTML pages and the associated URLs that point to PDF documents. I 
have then been using Luke v 1.0.1 to look inside my index and confirm that the 
specific PDF documents linked from these web pages have been indexed. When I 
search my index via my web application interface, I am returned a hyperlink 
(amongst other information) for each relevant hit. It is my intention to 
implement a content extraction mechanism that also provides relevant 
information from within the PDF documents in my index whenever a user submits 
a query. E.g. if someone were to submit a query relating to a clause within a 
legal document, the content extraction tool would parse the PDF file and 
provide a snippet of the relevant text from the PDF document in the search 
result.
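
To make the idea concrete, here is a minimal sketch of the snippet step only. 
It assumes the PDF text has already been extracted into a plain string (for 
example by a parser such as Tika) and is available at query time. The class 
name SnippetExtractor, the snippet method and the context parameter are 
hypothetical names used for illustration; they are not part of Nutch, Tika or 
Luke.

```java
/** Hypothetical helper: builds a short snippet around a query match. */
final class SnippetExtractor {

    private SnippetExtractor() {
        // static helper, not meant to be instantiated
    }

    /**
     * Returns up to `context` characters either side of the first
     * case-insensitive occurrence of `query` in `text`, or an empty
     * string when there is no match or the arguments are unusable.
     */
    static String snippet(String text, String query, int context) {
        if (text == null || query == null || query.isEmpty() || context < 0) {
            return "";
        }

        // Scan for the first case-insensitive match without copying the text.
        int hit = -1;
        for (int i = 0; i + query.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, query, 0, query.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return "";
        }

        // Clamp the window to the bounds of the extracted text.
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hit + query.length() + context);

        // Extracted PDF text often contains stray line breaks, so collapse
        // runs of whitespace before showing the snippet.
        String body = text.substring(start, end).replaceAll("\\s+", " ").trim();

        // Mark truncation at either end with an ellipsis.
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("...");
        }
        sb.append(body);
        if (end < text.length()) {
            sb.append("...");
        }
        return sb.toString();
    }
}
```

A call such as SnippetExtractor.snippet(extractedText, userQuery, 80) would 
then give the text to show next to the hyperlink for a hit. This matches the 
query as a single literal phrase; a real search front end would more likely 
highlight the individual query terms.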


I hope I have explained my problem properly. I am posting here because I have 
been aware for some time that Tika is possibly the solution, but I am only just 
getting round to working on this now.

Thank you

Lewis


