On 02/03/2018 00:15, T Wild wrote:
I'm interested in building a software system which will connect to various
document sources, extract the content from the documents contained within
each source, and make the extracted content available to a search engine
such as Solr. This search engine will serve as the back-end for a web-based
search application.
This is basically an 'enterprise search' system. You use 'connectors' to get text out of the source documents. In Solr applications we often use Apache Tika to extract text from common formats like Office or PDF, and Apache ManifoldCF is another useful project for connecting to repositories.
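
As a rough sketch of the indexing side, assuming the text has already been extracted (by Tika, for instance), the step that hands it to Solr might look like the following. The core URL and the field names `content` and `source_uri` are illustrative assumptions and would need to match your own schema.

```python
import json
from urllib.request import Request


def to_solr_doc(doc_id, source_uri, text):
    """Build a Solr document from text extracted elsewhere (e.g. by Tika).

    The field names 'source_uri' and 'content' are assumptions for this
    sketch; use whatever your Solr schema defines.
    """
    return {
        "id": doc_id,
        "source_uri": source_uri,           # reference back to the original file
        "content": " ".join(text.split()),  # collapse runs of whitespace
    }


def build_update_request(core_url, docs):
    """Prepare (but do not send) a JSON update request for a Solr core."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        core_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    doc = to_solr_doc("doc-1", "file:///shared/report.docx", "Quarterly\n\n report  text")
    request = build_update_request("http://localhost:8983/solr/docs", [doc])
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would send it. Committing on every request is only reasonable for small experiments.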


I'm interested in rendering snippets of these documents in the search
results for well-known types, such as Microsoft Word and PDF. How would one
go about implementing document snippet rendering in search?

If you just want the snippets as text, you can use Solr's highlighters, which provide contextual snippets (i.e. chunks of text around the query matches).
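
A minimal sketch of a highlighting request, assuming the extracted text is stored in a field called `content` (that name and the core URL are assumptions): `hl`, `hl.fl`, `hl.snippets` and `hl.fragsize` are standard Solr highlighting parameters, and the snippets come back in a separate `highlighting` section of the response, keyed by document id.

```python
from urllib.parse import urlencode


def highlight_query_url(core_url, query, field="content", snippets=2, fragsize=150):
    """Build a Solr select URL that asks for highlighted snippets."""
    params = {
        "q": query,
        "hl": "true",             # switch highlighting on
        "hl.fl": field,           # field(s) to take snippets from
        "hl.snippets": snippets,  # maximum snippets per field
        "hl.fragsize": fragsize,  # approximate snippet length in characters
        "wt": "json",
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def snippets_by_doc(response, field="content"):
    """Map each document id to its list of snippets from a parsed JSON response."""
    highlighting = response.get("highlighting", {})
    return {doc_id: fields.get(field, []) for doc_id, fields in highlighting.items()}


if __name__ == "__main__":
    print(highlight_query_url("http://localhost:8983/solr/docs", "enterprise search"))
    # Hand-written sample in the shape Solr uses for its highlighting section.
    sample = {"highlighting": {"doc-1": {"content": ["an <em>enterprise</em> system"]}}}
    print(snippets_by_doc(sample))
```

Note that the highlighted field generally needs to be stored in Solr for snippets to be returned.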

I'd be happy with serving up these snippets in any format, including as
images. I just want to be able to give my users some kind of formatted
preview of their results for well-known types.

If, however, you want to show parts of the original documents, that's more difficult. You'll need to store a reference to the original document in Solr and use an external system to display it. Different document types need different systems: PDFs, for example, can be shown by various browser plugins. Another approach is illustrated in this open source code we wrote a while ago, which uses OpenOffice in 'headless' mode to produce images of the source document:
https://github.com/flaxsearch/flaxcode/tree/master/flax_basic/libs/previewgen
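
To give a flavour of the headless approach, here is a minimal sketch that is not taken from the code linked above. It assumes a LibreOffice `soffice` binary is on the PATH and uses its `--headless --convert-to` options to turn an Office document into a PDF, which can then be shown in a browser or rasterised to an image by another tool. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, out_dir, target_format="pdf", soffice="soffice"):
    """Command line for a headless conversion (assumes LibreOffice's soffice)."""
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(source),
    ]


def expected_output_path(source, out_dir, target_format="pdf"):
    """soffice writes <source stem>.<format> into the output directory."""
    return Path(out_dir) / (Path(source).stem + "." + target_format)


def convert(source, out_dir, target_format="pdf", timeout=120):
    """Run the conversion and return the path of the generated file."""
    command = build_convert_command(source, out_dir, target_format)
    subprocess.run(command, check=True, timeout=timeout)
    return expected_output_path(source, out_dir, target_format)


if __name__ == "__main__":
    # Only prints the command; call convert() to actually run soffice.
    print(" ".join(build_convert_command("report.docx", "previews")))
```

Conversions like this are slow, so previews are probably best generated at indexing time and stored alongside the reference held in Solr.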

Hope this helps!

Cheers

Charlie

Thank you!



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
