Re: Word / PDF document snippet rendering in search

Charlie Hull Fri, 02 Mar 2018 02:06:49 -0800

On 02/03/2018 00:15, T Wild wrote:

I'm interested in building a software system which will connect to various
document sources, extract the content from the documents contained within
each source, and make the extracted content available to a search engine
such as Solr. This search engine will serve as the back-end for a web-based
search application.

This is basically an 'enterprise search' system. You use 'connectors' to
get text out of the source documents - in Solr applications we often use
Apache Tika to extract text from common formats like Office or PDF.
Apache ManifoldCF is another useful project for connecting to repositories.
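
As a rough illustration of the Tika extraction step, here is a minimal sketch. It assumes a Tika server is already running locally on its default port (9998); the function names are made up for this example, and a real connector would add error handling and pass the text on to Solr.

```python
import urllib.request

# Assumption: a Tika server is reachable locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(doc_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika for the plain-text body of a document."""
    return urllib.request.Request(
        tika_url,
        data=doc_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, tika_url=TIKA_URL):
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

The text returned by `extract_text` is what you would then index into a Solr field.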


I'm interested in rendering snippets of these documents in the search
results for well-known types, such as Microsoft Word and PDF. How would one
go about implementing document snippet rendering in search?

If you just want the snippets as text, you can use Solr highlighters
which can provide contextual snippets (i.e. chunks of text around the
query matches).
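
To make the highlighter suggestion concrete, here is a small sketch of building a highlighting query and reading the snippets back. The core URL and the `content` field name are assumptions for illustration only; `hl`, `hl.fl`, `hl.snippets` and `hl.fragsize` are standard Solr highlighting parameters, and the field needs to be stored for highlighting to work.

```python
from urllib.parse import urlencode


def highlight_query_url(solr_core_url, user_query, field="content"):
    """Return a /select URL that asks Solr for highlighted snippets."""
    params = {
        "q": user_query,
        "hl": "true",         # switch highlighting on
        "hl.fl": field,       # field to take snippets from (assumed name)
        "hl.snippets": 2,     # at most two snippets per document
        "hl.fragsize": 150,   # approximate snippet length in characters
        "wt": "json",
    }
    return solr_core_url.rstrip("/") + "/select?" + urlencode(params)


def snippets_by_doc(solr_response, field="content"):
    """Map each document id to its list of snippets from a parsed JSON response."""
    highlighting = solr_response.get("highlighting", {})
    return {doc_id: fields.get(field, []) for doc_id, fields in highlighting.items()}
```

Solr returns the snippets in a separate `highlighting` section keyed by each document's unique key, which is why `snippets_by_doc` reads from there rather than from the result list.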


I'd be happy with serving up these snippets in any format, including as
images. I just want to be able to give my users some kind of formatted
preview of their results for well-known types.

If, however, you want to show bits of the original documents, that's more
difficult. You'll need to store a reference to the original document in
Solr and use an external system to display it - you'll need specific
systems for different doc types: PDFs can be shown in various browser
plugins for example. Another approach is illustrated in this open source
code we wrote a while ago - it uses OpenOffice in 'headless' mode to
provide images of the source document:

https://github.com/flaxsearch/flaxcode/tree/master/flax_basic/libs/previewgen
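
For a feel of the headless approach, here is a minimal sketch. It is not the code from the repository above: it assumes a LibreOffice-style `soffice` binary on the PATH and its `--headless --convert-to` command-line options (older OpenOffice builds may spell these differently), and the function names are invented for this example.

```python
import subprocess
from pathlib import Path


def preview_command(source, out_dir, target_format="png", soffice="soffice"):
    """Build the headless conversion command for one source document."""
    return [
        soffice,
        "--headless",                    # run without any GUI
        "--convert-to", target_format,   # assumed: an export filter for this format is installed
        "--outdir", str(out_dir),
        str(source),
    ]


def render_preview(source, out_dir, target_format="png"):
    """Run the conversion and return the path where the output is expected."""
    subprocess.run(
        preview_command(source, out_dir, target_format),
        check=True,
        timeout=120,  # conversions can hang on odd documents
    )
    return Path(out_dir) / (Path(source).stem + "." + target_format)
```

The search front end can then show the generated image next to each result, using the document reference stored in Solr to find it. How many pages an image export covers depends on the office suite and format, so check that against your own installation.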

Hope this helps!

Cheers

Charlie


Thank you!



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
