On 02/03/2018 00:15, T Wild wrote:
I'm interested in building a software system which will connect to various
document sources, extract the content from the documents contained within
each source, and make the extracted content available to a search engine
such as Solr. This search engine will serve as the back-end for a web-based
search application.
This is basically an 'enterprise search' system. You use 'connectors' to
get text out of the source documents - in Solr applications we often use
Apache Tika to extract text from common formats like Office or PDF.
Apache ManifoldCF is another useful project for connecting to repositories.
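
As a rough sketch of the extraction step: Tika can be run as a standalone server, and a document sent to its /tika endpoint with a PUT request comes back as plain text. The example below assumes a Tika server is already running on localhost at port 9998; the function names are made up for illustration.

```python
import urllib.request

# Assumption: a Tika server is running locally on port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(doc_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        tika_url,
        data=doc_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, tika_url=TIKA_URL):
    """Send the file at `path` to Tika and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

The text returned by extract_text() is what you would then index into a Solr field.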
I'm interested in rendering snippets of these documents in the search
results for well-known types, such as Microsoft Word and PDF. How would one
go about implementing document snippet rendering in search?
If you just want the snippets as text, you can use Solr highlighters
which can provide contextual snippets (i.e. chunks of text around the
query matches).
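
For illustration, here is a minimal sketch of asking Solr for highlighted snippets over HTTP. The base URL, the core name "docs" and the field name "content" are assumptions, not anything from your setup, and the field being highlighted needs to be stored in the index.

```python
import json
import urllib.parse
import urllib.request


def build_highlight_url(solr_base, core, query, field, snippets=2, fragsize=150):
    """Build a /select URL that asks Solr to highlight matches in `field`."""
    params = {
        "q": query,
        "wt": "json",
        "hl": "true",             # switch highlighting on
        "hl.fl": field,           # field(s) to take snippets from
        "hl.snippets": snippets,  # maximum snippets per field
        "hl.fragsize": fragsize,  # approximate snippet length in characters
    }
    return f"{solr_base}/{core}/select?{urllib.parse.urlencode(params)}"


def fetch_snippets(url):
    """Return Solr's 'highlighting' section: {doc_id: {field: [snippet, ...]}}."""
    with urllib.request.urlopen(url) as response:
        body = json.load(response)
    return body.get("highlighting", {})


# Hypothetical usage against a local Solr instance:
# url = build_highlight_url("http://localhost:8983/solr", "docs", "enterprise search", "content")
# for doc_id, fields in fetch_snippets(url).items():
#     print(doc_id, fields.get("content", []))
```

The snippets come back keyed by document id, separately from the main result list, so the front end has to join the two when rendering results.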
I'd be happy with serving up these snippets in any format, including as
images. I just want to be able to give my users some kind of formatted
preview of their results for well-known types.
If, however, you want to show bits of the original documents, that's more
difficult. You'll need to store a reference to the original document in
Solr and use an external system to display it - you'll need specific
systems for different doc types: PDFs can be shown in various browser
plugins for example. Another approach is illustrated in this open source
code we wrote a while ago - it uses OpenOffice in 'headless' mode to
provide images of the source document:
https://github.com/flaxsearch/flaxcode/tree/master/flax_basic/libs/previewgen
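
To give a feel for the headless approach, here is a small sketch that shells out to an office suite to render a preview image. This is not how the previewgen code linked above works internally; it assumes a LibreOffice-style "soffice" binary on the PATH that accepts the --headless, --convert-to and --outdir options, and the function names are hypothetical. As far as I know, converting a text document to PNG this way renders only the first page.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, out_dir, fmt="png", soffice="soffice"):
    """Build the command line for a headless conversion (assumed LibreOffice CLI)."""
    return [
        soffice,
        "--headless",
        "--convert-to", fmt,
        "--outdir", str(out_dir),
        str(source),
    ]


def preview_path(source, out_dir, fmt="png"):
    """Where the converted file is expected to land: same base name, new extension."""
    return Path(out_dir) / f"{Path(source).stem}.{fmt}"


def make_preview(source, out_dir, fmt="png", soffice="soffice", timeout=120):
    """Run the conversion and return the expected path of the preview file."""
    command = build_convert_command(source, out_dir, fmt, soffice)
    subprocess.run(command, check=True, timeout=timeout)
    return preview_path(source, out_dir, fmt)
```

You would run something like this at indexing time, store the preview's location in Solr alongside the reference to the original document, and serve the image from the search results page.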
Hope this helps!
Cheers
Charlie
Thank you!
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk