Hi,

I should clarify. We have another application extracting the text from the documents. The full text of each document will be stored in a database, either at the document level or at the page level (this hasn't been decided yet). We will also be storing the location of each word on the page in the database.
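To make that concrete, here is a rough Python sketch of the data as I picture it, assuming we go with page-level storage. The names (`WordBox`, `Page`, `find_word`) and the pixel-box representation of a word location are placeholders made up for illustration; none of this is a decided design, and none of it is Solr API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBox:
    """One extracted word and where it sits on the scanned page (assumed pixel box)."""
    word: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class Page:
    """One scanned page: which document it belongs to and the words found on it."""
    doc_id: str
    page_no: int
    words: List[WordBox]

    @property
    def text(self) -> str:
        # Full page text, rebuilt from the stored words in reading order.
        return " ".join(w.word for w in self.words)


def find_word(pages: List[Page], term: str) -> List[Tuple[str, int, WordBox]]:
    """Return (doc_id, page_no, box) for every exact occurrence of term."""
    hits = []
    for page in pages:
        for box in page.words:
            if box.word == term:
                hits.append((page.doc_id, page.page_no, box))
    return hits
```

Each hit carries the document, the page, and the box to draw on the scanned picture, which is the information I would like a search to give back.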
What I'm having problems with is deciding on the schema. We want a user to be able to search for a word in the database and get back a list of the documents that word appears in, along with its location in each document. When the user selects a search result, we want the scanned picture to be shown with that word highlighted on the page. I want to index the documents using Solr, but I'm having trouble figuring out how to design the schema so that it returns the "word location" of a search term on the scanned picture in order to highlight it.

Does this make more sense?

Fatima

-----Original Message-----
From: Gora Mohanty [mailto:g...@mimirtech.com]
Sent: Thursday, December 26, 2013 1:00 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Solr in my project

On 26 December 2013 10:54, Fatima Issawi <issa...@qu.edu.qa> wrote:
> Hello,
>
> First off, I apologize if this was sent twice. I was having issues
> subscribing to the list.
>
> I'm a complete noob in Solr (and indexing), so I'm hoping someone can help me
> figure out how to implement Solr in my project. I have gone through some
> tutorials online and I was able to import and query text in some Arabic PDF
> documents.
>
> We have some scans of Historical Handwritten Arabic documents that will have
> text extracted into a database (or PDF). We would like the user to be able to
> search the document for text, then have the scanned image show up in a viewer
> with the text highlighted.

This will not work for scanned images which do not actually contain the text. If you have the text of the documents, the best that you can do is break the text into pages corresponding to the scanned images, and index into Solr the text from each page together with the scanned image that should be linked to it. For a user search, you will need to show the scanned image for the entire page: highlighting of the search term in an image is not possible without optical character recognition (OCR).
Similarly, if you are indexing from PDFs, you will need to ensure that they contain text, and not just images.

Regards,
Gora