Re: How to use Solr in my project

2013-12-30 Thread Gora Mohanty
On 30 December 2013 11:27, Fatima Issawi issa...@qu.edu.qa wrote:
 Hi again,

 We have another program that will be extracting the text, and it will be 
 extracting the top right and bottom left corners of the words. You are right, 
 I do expect to have a lot of data.

 When would solr start experiencing issues in performance? Is it better to:

 INDEX:
 - document metadata
 - words

 STORE:
 - document metadata
 - words
 - coordinates

 in Solr rather than in the database? How would I set up the schema in order 
 to store the coordinates?

You do not mention the number of documents, but for a few
tens of thousands of documents, your problem should be tractable
in Solr. Not sure what document metadata you have, and if you need
to search through it, but what I would do is index the words, and
store the coordinates in Solr, the assumption being that words are
searched but not retrieved from Solr, while coordinates are retrieved
but never searched.

Off the top of my head, each record can be:
doc1 pg1 word1 coord_x1 coord_y1 coord_x2 coord_y2
doc1 pg1 word2 
...
doc1 pg2 ...
...
doc2 ...
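
A minimal sketch of that record layout, assuming one Solr document per word occurrence. All field names and the field type names ("string", "pint", "text_general") are illustrative assumptions to be adapted to the actual schema; only the `indexed` and `stored` attributes are standard Solr field properties. The word is indexed but not stored, and the coordinates are stored but not indexed:

```python
import json

# Hypothetical field names; the types ("string", "pint", "text_general") are
# assumed to exist in the target schema and should be adjusted to match it.
SCHEMA_FIELDS = [
    {"name": "doc_id", "type": "string", "indexed": True, "stored": True},
    {"name": "pg_id", "type": "pint", "indexed": True, "stored": True},
    # Words are searched but not retrieved: indexed, not stored.
    {"name": "word", "type": "text_general", "indexed": True, "stored": False},
] + [
    # Coordinates are retrieved but never searched: stored, not indexed.
    {"name": name, "type": "pint", "indexed": False, "stored": True}
    for name in ("coord_x1", "coord_y1", "coord_x2", "coord_y2")
]


def word_record(doc_id, pg_id, seq, word, box):
    """Build one Solr document for a single word occurrence."""
    x1, y1, x2, y2 = box
    return {
        "id": f"{doc_id}_{pg_id}_{seq}",  # unique key: document, page, position
        "doc_id": doc_id,
        "pg_id": pg_id,
        "word": word,
        "coord_x1": x1, "coord_y1": y1,
        "coord_x2": x2, "coord_y2": y2,
    }


# Made-up sample values, only to show the shape of the records.
records = [
    word_record("doc1", 1, 0, "word1", (120, 40, 180, 60)),
    word_record("doc1", 1, 1, "word2", (50, 40, 110, 60)),
]
print(json.dumps(records, indent=2))
```

The `seq` counter is an assumption added here so that each record has a unique key; any other unique identifier would do.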

* doc_id and pg_id from Solr search results let you retrieve the image
  from the filesystem
* The coordinates allow post-processing to highlight the word in the image
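
As a rough illustration of the post-processing step, the two stored corners can be normalised into a rectangle before drawing. This is only a sketch: it assumes pixel coordinates with the origin at the top left of the image, and the helper names are made up for the example:

```python
def to_box(corner_a, corner_b):
    """Return (left, top, right, bottom) from any two opposite corners.

    Assumes image coordinates with the origin at the top-left and y growing
    downward, which is an assumption about the extraction program's output.
    """
    (xa, ya), (xb, yb) = corner_a, corner_b
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))


def pad_box(box, margin, width, height):
    """Grow the box by `margin` pixels, clamped to the image bounds."""
    left, top, right, bottom = box
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


# Top-right and bottom-left corners of one word (made-up values).
box = to_box((180, 40), (120, 60))
print(box)
print(pad_box(box, 5, width=200, height=62))
```

The resulting box can then be handed to whatever image library does the drawing, for example Pillow's `ImageDraw.rectangle`.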

As always, set up a prototype system with a subset of the records in order
to measure performance.

 If storing the coordinates in solr is not recommended, what would be the best 
 process to get the coordinates after indexing the words and metadata? Do I 
 search in solr and then use the documentID to then search the database for 
 the words and coordinates?

You could do that, but Solr by itself should be fine.
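
If you do go the database route, the flow would look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the real one, and the table name, column names, and the hard-coded list of Solr hits are all hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_coords ("
    " doc_id TEXT, pg_id INTEGER, word TEXT,"
    " x1 INTEGER, y1 INTEGER, x2 INTEGER, y2 INTEGER)"
)
conn.executemany(
    "INSERT INTO word_coords VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("doc1", 1, "word1", 120, 40, 180, 60),
        ("doc1", 2, "word1", 30, 90, 95, 110),
        ("doc2", 1, "word2", 10, 10, 70, 30),
    ],
)
# Index the columns used in the lookup so it stays fast as the table grows.
conn.execute("CREATE INDEX idx_word_doc ON word_coords (doc_id, word)")


def coords_for_hit(conn, doc_id, word):
    """Return (pg_id, x1, y1, x2, y2) rows for one Solr hit."""
    cur = conn.execute(
        "SELECT pg_id, x1, y1, x2, y2 FROM word_coords"
        " WHERE doc_id = ? AND word = ? ORDER BY pg_id",
        (doc_id, word),
    )
    return cur.fetchall()


# Stand-in for the documentIDs returned by a Solr search for "word1".
solr_hits = ["doc1"]
for doc_id in solr_hits:
    print(doc_id, coords_for_hit(conn, doc_id, "word1"))
```

This costs a second round trip per search, which is the main reason to prefer keeping the coordinates in Solr.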

Regards,
Gora


RE: How to use Solr in my project

2013-12-30 Thread Fatima Issawi
I think we may have up to 100,000 books, but I don't think the site will have a 
lot of traffic.

Thank you for your help. I think it is a little more clear and will try to 
implement it now.


Re: How to use Solr in my project

2013-12-29 Thread Gora Mohanty
On 29 December 2013 11:10, Fatima Issawi issa...@qu.edu.qa wrote:
[...]
 We will have the full text stored, but we want to highlight the text in the 
 original image. I expect to process the image after retrieval. We do plan on 
 storing the (x, y) coordinates of the words in a database - I suspected that 
 it would be too expensive to store them in Solr. I guess I'm still confused 
 about how to use Solr to index the document, but then retrieve the (x, y) 
 coordinates of the search term from the database. Is this possible? If so, 
 can you give an example of how this can be done?

Storing and retrieving the coordinates from Solr will likely be
faster than from the database. However, I still think that you
should think more carefully about your use case of highlighting
the images. It can be done, but it is a significant amount of work,
and will need storage and computational resources.
1. For highlighting in the image, you will need to store two sets
of coordinates (e.g., top right and bottom left corners), as you
do not know the length of the word in the image. Thus, say with
15 words per line, 50 lines per page, 100 pages per document,
you will need to store:
  4 x 15 x 50 x 100 = 300,000 coordinates/document
2. Also, how are you going to get the coordinates in the first
place?
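
The estimate in point 1 can be checked, and rescaled for other page layouts, with a few lines of Python. The number of documents at the end is a hypothetical figure, used only to show how the total grows:

```python
def coordinates_per_document(words_per_line, lines_per_page, pages, values_per_word=4):
    """Two (x, y) corners per word means four stored values per word."""
    return values_per_word * words_per_line * lines_per_page * pages


per_doc = coordinates_per_document(15, 50, 100)
print(per_doc)  # 300000

# Hypothetical collection size, only to show how the total scales.
documents = 10_000
print(per_doc * documents)  # 3000000000
```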

Regards,
Gora


RE: How to use Solr in my project

2013-12-29 Thread Fatima Issawi
Hi again,

We have another program that will be extracting the text, and it will be 
extracting the top right and bottom left corners of the words. You are right, I 
do expect to have a lot of data.

When would solr start experiencing issues in performance? Is it better to:

INDEX: 
- document metadata 
- words  

STORE: 
- document metadata
- words 
- coordinates 

in Solr rather than in the database? How would I set up the schema in order to 
store the coordinates?

If storing the coordinates in solr is not recommended, what would be the best 
process to get the coordinates after indexing the words and metadata? Do I 
search in solr and then use the documentID to then search the database for the 
words and coordinates?

Thanks for your patience. I don't have much choice in the use case. 


RE: How to use Solr in my project

2013-12-28 Thread Fatima Issawi
 What do you mean by word location? The number on the page? What
 purpose would this serve?

I mean the (x, y) coordinates of the word on the page. We want to be able to 
highlight the image of the word that was extracted from the text.

 I think that you might be confusing things:
 * If you have the full-text, you can highlight where the word was found. Solr
   highlighting handles this for you, and there is no need to store word
   location
 * You can have different images (presumably, individual scanned pages) linked
   to different sections of text, and show the entire image. Highlighting in
   the image is not possible, unless by word location you mean the (x, y)
   coordinates of the word on the page. Even then:
   - It will be prohibitively expensive to store the location of every word
     in every image for a large number of documents
   - Some image processing will be required to handle the highlighting after
     the scanned image is retrieved

We will have the full text stored, but we want to highlight the text in the 
original image. I expect to process the image after retrieval. We do plan on 
storing the (x, y) coordinates of the words in a database - I suspected that it 
would be too expensive to store them in Solr. I guess I'm still confused about 
how to use Solr to index the document, but then retrieve the (x, y) coordinates 
of the search term from the database. Is this possible? If so, can you give 
an example of how this can be done?

Thank you!


RE: How to use Solr in my project

2013-12-28 Thread Fatima Issawi
Hello,

Our pages are images of handwritten text in Arabic so OCR'ing is not possible. 
We will be extracting the text during pre-processing and storing the words and 
(x, y) coordinates in a database. Would your process apply to our images?

 Step 1:
 For sending the extracted text content from text pdf to solr, use a low level
 pdf converter such as poppler-utils (pdftotext or pdftohtml) to correctly get
 the coordinates and page no. of each word. Store it in a separate file as word
 map. This word map will contain page+coordinates mapping to occurrence
 number for word.

Can we generate a word map manually? Is it used by Solr, and does it require a 
specific format?

 Step 2:
 Solr highlighter needs to be changed to get the word and their occurrence
 number in the text document, rather than the character offsets for each hit.

How is this done? I read the solr highlighting wiki, but don't see how this can 
be done.

 Step 3:
 Combine the solr output to the word map created in step 1 and the pdf page
 and coordinates can be generated for original pdf document which can be
 highlighted by any viewer.

Can I get more information about how to do this?

Thanks!


Re: How to use Solr in my project

2013-12-27 Thread Gopal Agarwal
Highlighting can be done as a three-step process:

Pre-requisite: Get the pdf with text after the OCR of the image pdf.

Step 1:
To send the extracted text content from the text PDF to Solr, use a
low-level PDF converter such as poppler-utils (pdftotext or pdftohtml) to
correctly get the coordinates and page number of each word. Store this in a
separate file as a word map. This word map will contain the page+coordinates
mapping for each occurrence number of a word.

Step 2:
The Solr highlighter needs to be changed to return each word and its
occurrence number in the text document, rather than the character offsets
for each hit.

Step 3:
Combine the Solr output with the word map created in step 1, and the PDF page
and coordinates can be generated for the original PDF document, which can be
highlighted by any viewer.
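
To make steps 1 and 3 concrete, here is a small sketch of a word map and the final lookup. The in-memory format is purely illustrative (it is not the output format of pdftotext or of any Solr component), and the `hits` list stands in for what a modified highlighter from step 2 is assumed to return:

```python
from collections import defaultdict

# Step 1 output (hypothetical format): words in reading order with page and box.
extracted = [
    ("solr", 1, (10, 10, 50, 20)),
    ("index", 1, (60, 10, 110, 20)),
    ("solr", 2, (15, 40, 55, 50)),
]


def build_word_map(words):
    """Map (word, occurrence number) -> (page, box), counting from 0."""
    seen = defaultdict(int)
    word_map = {}
    for word, page, box in words:
        word_map[(word, seen[word])] = (page, box)
        seen[word] += 1
    return word_map


def locate_hits(word_map, hits):
    """Step 3: turn (word, occurrence) hits into page + coordinates."""
    return [word_map[hit] for hit in hits if hit in word_map]


word_map = build_word_map(extracted)
# Step 2 output (assumed): the modified highlighter reports word + occurrence.
hits = [("solr", 1)]
print(locate_hits(word_map, hits))
```

The occurrence numbers only line up if the text sent to Solr and the word map are built from the same extraction run, in the same word order.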

We have successfully implemented this for our own application.

Thanks,
Gopal


Re: How to use Solr in my project

2013-12-26 Thread Gora Mohanty
On 26 December 2013 10:54, Fatima Issawi issa...@qu.edu.qa wrote:
 Hello,

 First off, I apologize if this was sent twice. I was having issues 
 subscribing to the list.

 I'm a complete noob in Solr (and indexing), so I'm hoping someone can help me 
 figure out how to implement Solr in my project. I have gone through some 
 tutorials online and I was able to import and query text in some Arabic PDF 
 documents.

 We have some scans of Historical Handwritten Arabic documents that will have 
 text extracted into a database (or PDF). We would like the user to be able to 
 search the document for text, then have the scanned image show up in a viewer 
 with the text highlighted.

This will not work for scanned images which do not actually contain the
text. If you have the text of the documents, the best that you can do is
break the text into pages corresponding to the scanned images, and
index into Solr the text from the pages and the scanned image that should
be linked to the text. For a user search, you will need to show the scanned
image for the entire page: Highlighting of the search term in an image is not
possible without optical character recognition (OCR).
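
A minimal sketch of that page-level layout, assuming the text has already been split into pages. The field names and the image path pattern are hypothetical and would need to match your own schema and file layout:

```python
def page_documents(doc_id, page_texts, image_pattern="{doc_id}/page_{page:04d}.png"):
    """Yield one Solr document per page, linking text to its scanned image."""
    for page, text in enumerate(page_texts, start=1):
        yield {
            "id": f"{doc_id}_pg{page}",
            "doc_id": doc_id,
            "page": page,
            "text": text,  # indexed for search
            "image_path": image_pattern.format(doc_id=doc_id, page=page),
        }


pages = ["text of the first page", "text of the second page"]
docs = list(page_documents("doc1", pages))
for doc in docs:
    print(doc["id"], doc["image_path"])
```

A search hit on `text` then gives you `image_path`, which is enough to show the scanned image for the whole page.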

Similarly, if you are indexing from PDFs, you will need to ensure that they
contain text, and not just images.

Regards,
Gora


RE: How to use Solr in my project

2013-12-26 Thread Fatima Issawi
Hi,

I should clarify. We have another application extracting the text from the 
document. The full text from each document will be stored in a database either 
at the document level or page level (this hasn't been decided yet). We will 
also be storing word location of each word on the page in the database. 

What I'm having problems with is deciding on the schema. We want a user to be 
able to search for a word in the database, have a list of documents that word 
is located in, and the location in the document where that word is located. When he 
selects the search results, we want the scanned picture to have that word 
highlighted on the page. 

I want to index the document using Solr, but I'm having trouble figuring out 
how to design the schema to return that word location of a search term on the 
scanned picture in order to highlight it.

Does this make more sense?

Fatima


Re: How to use Solr in my project

2013-12-26 Thread Gora Mohanty
On 26 December 2013 15:44, Fatima Issawi issa...@qu.edu.qa wrote:
 Hi,

 I should clarify. We have another application extracting the text from the 
 document. The full text from each document will be stored in a database 
 either at the document level or page level (this hasn't been decided yet). We 
 will also be storing word location of each word on the page in the database.

What do you mean by word location? The number on the page? What purpose
would this serve?

 What I'm having problems with is deciding on the schema. We want a user to be 
 able to search for a word in the database, have a list of documents that word 
 is located in, and the location in the document where that word is located. When he 
 selects the search results, we want the scanned picture to have that word 
 highlighted on the page.
[...]

I think that you might be confusing things:
* If you have the full-text, you can highlight where the word was found. Solr
  highlighting handles this for you, and there is no need to store word location
* You can have different images (presumably, individual scanned pages) linked
  to different sections of text, and show the entire image. Highlighting in
  the image is not possible, unless by word location you mean the (x, y)
  coordinates of the word on the page. Even then:
  - It will be prohibitively expensive to store the location of every word
    in every image for a large number of documents
  - Some image processing will be required to handle the highlighting after
    the scanned image is retrieved

Regards,
Gora