Highlighting can be done as a three-step process:

Prerequisite: Run OCR on the image PDF to get a PDF with a text layer.

Step 1:
To send the text extracted from the text PDF to Solr, use a low-level
PDF converter such as poppler-utils (pdftotext or pdftohtml), which
reports the page number and coordinates of each word. Store these in a
separate file as a word map. The word map links each word and its
occurrence number to a page and coordinates.
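
As a rough sketch of Step 1 (not our actual code), the word map could be
built from the XHTML that "pdftotext -bbox" writes. The function names,
the dictionary layout, and the lower-casing of words are assumptions made
for illustration; the normalization has to match whatever the Solr
analyzer does.

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def run_pdftotext_bbox(pdf_path, out_path):
    """Run poppler's pdftotext with -bbox and return the XHTML output as bytes."""
    subprocess.run(["pdftotext", "-bbox", pdf_path, out_path], check=True)
    return Path(out_path).read_bytes()


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def build_word_map(bbox_xml):
    """Map (word, occurrence number) -> (page, xMin, yMin, xMax, yMax)."""
    root = ET.fromstring(bbox_xml)
    seen = defaultdict(int)   # how many times each word has appeared so far
    word_map = {}
    page_no = 0
    for elem in root.iter():  # document order: each <page> precedes its <word>s
        name = local_name(elem.tag)
        if name == "page":
            page_no += 1
        elif name == "word":
            token = (elem.text or "").strip().lower()  # assumed normalization
            if not token:
                continue
            seen[token] += 1
            box = tuple(float(elem.get(k)) for k in ("xMin", "yMin", "xMax", "yMax"))
            word_map[(token, seen[token])] = (page_no,) + box
    return word_map
```

Occurrence numbers here are counted across the whole document, so the
second "solr" in the file gets the key ("solr", 2) whichever page it is on.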

Step 2:
The Solr highlighter needs to be changed to return each hit as a word and
its occurrence number in the text document, rather than as character
offsets.
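
To show what "word and occurrence number" means compared with a character
offset, here is a small sketch that converts one into the other outside
Solr. It is not the highlighter change itself. It assumes the indexed text
is split on whitespace the same way the word map was built, and the
function name is made up for the example.

```python
import re


def occurrence_at_offset(text, start):
    """Return (word, occurrence number) for the token starting at offset `start`.

    Returns None if no whitespace-delimited token starts at that offset.
    """
    counts = {}
    for match in re.finditer(r"\S+", text):
        token = match.group().lower()  # assumed normalization
        counts[token] = counts.get(token, 0) + 1
        if match.start() == start:
            return token, counts[token]
    return None
```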

Step 3:
Combine the Solr output with the word map created in step 1 to get the
page and coordinates of each hit in the original PDF document, which any
viewer can then highlight.
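
A minimal sketch of this lookup, assuming the hits arrive as
(word, occurrence number) pairs and the word map is a dictionary keyed the
same way. Both shapes and all values below are illustrative, not taken
from our application.

```python
def locate_hits(hits, word_map):
    """Resolve (word, occurrence) hits to page/box records; skip unknown hits."""
    located = []
    for word, occurrence in hits:
        entry = word_map.get((word.lower(), occurrence))
        if entry is None:
            continue  # hit not present in the word map
        page, x_min, y_min, x_max, y_max = entry
        located.append({"page": page, "box": (x_min, y_min, x_max, y_max)})
    return located


# Hypothetical word map: (word, occurrence) -> (page, xMin, yMin, xMax, yMax)
WORD_MAP = {
    ("solr", 1): (1, 10.0, 20.0, 40.0, 30.0),
    ("solr", 2): (2, 50.0, 60.0, 80.0, 70.0),
}

# Hypothetical highlighter output for a query on "solr"
HITS = [("solr", 2), ("lucene", 1)]

RESULT = locate_hits(HITS, WORD_MAP)
```

The viewer then only needs the page number and the rectangle for each hit
to draw the highlight over the scanned page.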

We have successfully implemented this in our own application.

Thanks,
Gopal


On Thu, Dec 26, 2013 at 3:56 PM, Gora Mohanty <g...@mimirtech.com> wrote:

> On 26 December 2013 15:44, Fatima Issawi <issa...@qu.edu.qa> wrote:
> > Hi,
> >
> > I should clarify. We have another application extracting the text from
> > the document. The full text from each document will be stored in a
> > database either at the document level or page level (this hasn't been
> > decided yet). We will also be storing word location of each word on the
> > page in the database.
>
> What do you mean by "word location"? The number on the page? What purpose
> would this serve?
>
> > What I'm having problems with is deciding on the schema. We want a user
> > to be able to search for a word in the database, have a list of
> > documents that word is located in, and location in the document that
> > word is located in. When he selects the search results, we want the
> > scanned picture to have that word highlighted on the page.
> [...]
>
> I think that you might be confusing things:
> * If you have the full-text, you can highlight where the word was found.
>   Solr highlighting handles this for you, and there is no need to store
>   word location
> * You can have different images (presumably, individual scanned pages)
>   linked to different sections of text, and show the entire image.
>   Highlighting in the image is not possible, unless by "word location"
>   you mean the (x, y) coordinates of the word on the page. Even then:
>   - It will be prohibitively expensive to store the location of every
>     word in every image for a large number of documents
>   - Some image processing will be required to handle the highlighting
>     after the scanned image is retrieved
>
> Regards,
> Gora
>
