Re: How to use Solr in my project
On 30 December 2013 11:27, Fatima Issawi issa...@qu.edu.qa wrote:
> Hi again,
>
> We have another program that will be extracting the text, and it will be
> extracting the top right and bottom left corners of the words.
>
> You are right, I do expect to have a lot of data. When would solr start
> experiencing issues in performance? Is it better to:
>
> INDEX:
> - document metadata
> - words
>
> STORE:
> - document metadata
> - words
> - coordinates
>
> in Solr rather than in the database? How would I set up the schema in
> order to store the coordinates?

You do not mention the number of documents, but for a few tens of thousands of documents, your problem should be tractable in Solr. Not sure what document metadata you have, and if you need to search through it, but what I would do is index the words, and store the coordinates in Solr, the assumption being that words are searched but not retrieved from Solr, while coordinates are retrieved but never searched.

Off the top of my head, each record can be:

doc1 pg1 word1 coord_x1 coord_y1 coord_x2 coord_y2
doc1 pg1 word2 ...
doc1 pg2 ...
...
doc2 ...

* doc_id and pg_id from Solr search results let you retrieve the image from the filesystem
* The coordinates allow post-processing to highlight the word in the image

As always, set up a prototype system with a subset of the records in order to measure performance.

> If storing the coordinates in solr is not recommended, what would be the
> best process to get the coordinates after indexing the words and
> metadata? Do I search in solr and then use the documentID to then search
> the database for the words and coordinates?

You could do that, but Solr by itself should be fine.

Regards,
Gora
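The record layout suggested in this message could be built along the following lines. This is only a minimal sketch: the field names (`id`, `doc_id`, `pg_id`, `word`, `coord_x1`, ...) and the helper `make_word_records` are hypothetical, and the comments about which fields are indexed or stored describe the intent from the thread, which would still have to be expressed in the Solr schema itself.

```python
from typing import Dict, List, Tuple

# (text, x1, y1, x2, y2) for one word occurrence; the layout is an assumption.
Word = Tuple[str, int, int, int, int]


def make_word_records(doc_id: str, page_id: int, words: List[Word]) -> List[Dict]:
    """Return one flat record per word occurrence on a page."""
    records = []
    for position, (text, x1, y1, x2, y2) in enumerate(words):
        records.append({
            # Hypothetical unique key: document, page, position on the page.
            "id": f"{doc_id}_{page_id}_{position}",
            "doc_id": doc_id,   # lets the application find the document
            "pg_id": page_id,   # lets the application find the page image
            "word": text,       # intended to be indexed (searched)
            "coord_x1": x1,     # intended to be stored only (retrieved)
            "coord_y1": y1,
            "coord_x2": x2,
            "coord_y2": y2,
        })
    return records


records = make_word_records(
    "doc1", 1, [("word1", 410, 120, 350, 150), ("word2", 340, 120, 300, 150)]
)
for record in records:
    print(record)
```

Each dictionary corresponds to one line of the layout above, so a search hit on `word` would carry back everything needed to locate the page image and the region to highlight.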
RE: How to use Solr in my project
I think we may have up to 100,000 books, but I don't think the site will have a lot of traffic.

Thank you for your help. I think it is a little more clear and will try to implement it now.

-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com]
Sent: Monday, December 30, 2013 11:46 AM
To: solr-user@lucene.apache.org
Subject: Re: How to use Solr in my project

[...]
Re: How to use Solr in my project
On 29 December 2013 11:10, Fatima Issawi issa...@qu.edu.qa wrote:
[...]
> We will have the full text stored, but we want to highlight the text in
> the original image. I expect to process the image after retrieval. We do
> plan on storing the (x, y) coordinates of the words in a database - I
> suspected that it would be too expensive to store them in Solr. I guess
> I'm still confused about how to use Solr to index the document, but then
> retrieve the (x, y) coordinates of the search term from the database. Is
> this possible? If so, can you give an example of how this can be done?

Storing and retrieving the coordinates from Solr will likely be faster than from the database. However, I still think that you should think more carefully about your use case of highlighting the images. It can be done, but it is a significant amount of work, and will need storage and computational resources.

1. For highlighting in the image, you will need to store two sets of coordinates (e.g., top right and bottom left corners), as you do not know the length of the word in the image. Thus, say with 15 words per line, 50 lines per page, 100 pages per document, you will need to store:
   4 x 15 x 50 x 100 = 300,000 coordinates/document
2. Also, how are you going to get the coordinates in the first place?

Regards,
Gora
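The per-document estimate in this message is plain multiplication, and it can be handy to re-run it with your own page statistics. A small sketch follows; the function name and its parameters are made up for illustration, and the defaults are the figures assumed in the message.

```python
def coordinates_per_document(
    words_per_line: int = 15,
    lines_per_page: int = 50,
    pages_per_document: int = 100,
    values_per_word: int = 4,  # two corners, each with an x and a y value
) -> int:
    """Number of coordinate values that must be stored for one document."""
    return values_per_word * words_per_line * lines_per_page * pages_per_document


per_document = coordinates_per_document()
print(per_document)  # 300000
```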
RE: How to use Solr in my project
Hi again,

We have another program that will be extracting the text, and it will be extracting the top right and bottom left corners of the words.

You are right, I do expect to have a lot of data. When would solr start experiencing issues in performance? Is it better to:

INDEX:
- document metadata
- words

STORE:
- document metadata
- words
- coordinates

in Solr rather than in the database? How would I set up the schema in order to store the coordinates?

If storing the coordinates in solr is not recommended, what would be the best process to get the coordinates after indexing the words and metadata? Do I search in solr and then use the documentID to then search the database for the words and coordinates?

Thanks for your patience. I don't have much choice in the use case.

-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com]
Sent: Sunday, December 29, 2013 2:48 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Solr in my project

[...]
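Since the extraction program is said to report the top right and bottom left corners of each word, the post-processing step will probably need them turned into a rectangle before anything can be drawn on the page image. A minimal sketch, assuming integer pixel coordinates with y increasing downward (the usual image convention, but an assumption here); `Box` and `corners_to_box` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    width: int
    height: int


def corners_to_box(x1: int, y1: int, x2: int, y2: int) -> Box:
    """Convert two opposite corners of a word into a left/top/width/height box.

    The corners may be given in either order, so the same function works for
    (top right, bottom left) as well as (top left, bottom right).
    """
    left, right = min(x1, x2), max(x1, x2)
    top, bottom = min(y1, y2), max(y1, y2)
    return Box(left, top, right - left, bottom - top)


# Top right corner at (410, 120), bottom left corner at (350, 150).
print(corners_to_box(410, 120, 350, 150))
```

The resulting box is what an image library or a browser overlay would typically take as input for drawing the highlight.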
RE: How to use Solr in my project
> What do you mean by word location? The number on the page? What purpose
> would this serve?

I mean the (x, y) coordinates of the word on the page. We want to be able to highlight the image of the word that was extracted from the text.

> I think that you might be confusing things:
> * If you have the full-text, you can highlight where the word was found.
>   Solr highlighting handles this for you, and there is no need to store
>   word location
> * You can have different images (presumably, individual scanned pages)
>   linked to different sections of text, and show the entire image.
>   Highlighting in the image is not possible, unless by word location you
>   mean the (x, y) coordinates of the word on the page. Even then:
>   - It will be prohibitively expensive to store the location of every
>     word in every image for a large number of documents
>   - Some image processing will be required to handle the highlighting
>     after the scanned image is retrieved

We will have the full text stored, but we want to highlight the text in the original image. I expect to process the image after retrieval. We do plan on storing the (x, y) coordinates of the words in a database - I suspected that it would be too expensive to store them in Solr.

I guess I'm still confused about how to use Solr to index the document, but then retrieve the (x, y) coordinates of the search term from the database. Is this possible? If so, can you give an example of how this can be done?

Thank you!
RE: How to use Solr in my project
Hello,

Our pages are images of handwritten text in Arabic, so OCR'ing is not possible. We will be extracting the text during pre-processing and storing the words and (x, y) coordinates in a database. Would your process apply to our images?

> Step 1: For sending the extracted text content from text pdf to solr,
> use a low level pdf converter such as poppler-utils (pdftotext or
> pdftohtml) to correctly get the coordinates and page no. of each word.
> Store it in a separate file as word map. This word map will contain
> page+coordinates mapping to occurrence number for word.

Can we generate a word map manually? Is this used by Solr, and does it require a specific format?

> Step 2: Solr highlighter needs to be changed to get the word and their
> occurrence number in the text document, rather than the character
> offsets for each hit.

How is this done? I read the Solr highlighting wiki, but don't see how this can be done.

> Step 3: Combine the solr output to the word map created in step 1 and
> the pdf page and coordinates can be generated for original pdf document
> which can be highlighted by any viewer.

Can I get more information about how to do this?

Thanks!
Re: How to use Solr in my project
Highlighting can be done as a three-step process:

Pre-requisite: Get the PDF with text after OCR of the image PDF.

Step 1: For sending the extracted text content from the text PDF to Solr, use a low-level PDF converter such as poppler-utils (pdftotext or pdftohtml) to correctly get the coordinates and page number of each word. Store these in a separate file as a word map. This word map maps each word's occurrence number to its page and coordinates.

Step 2: The Solr highlighter needs to be changed to return the words and their occurrence numbers in the text document, rather than the character offsets for each hit.

Step 3: Combine the Solr output with the word map created in step 1 to generate the page and coordinates in the original PDF document, which can then be highlighted by any viewer.

We were able to implement this successfully for our own application.

Thanks,
Gopal

On Thu, Dec 26, 2013 at 3:56 PM, Gora Mohanty g...@mimirtech.com wrote:
[...]
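The word-map idea from the three steps in this message can be sketched as a simple lookup. This is not the poster's implementation, only an illustration under assumptions: the extraction step yields words in reading order with a page number and a coordinate tuple, and the modified highlighter reports each hit as a (word, occurrence number) pair with occurrences counted from 1. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Coords = Tuple[int, int, int, int]          # (x1, y1, x2, y2)
Extracted = Tuple[str, int, Coords]         # (word, page, coords), reading order
Hit = Tuple[str, int]                       # (word, occurrence number)


def build_word_map(extracted: Iterable[Extracted]) -> Dict[Hit, Tuple[int, Coords]]:
    """Step 1: map (word, occurrence number) to (page, coordinates)."""
    seen = defaultdict(int)
    word_map = {}
    for word, page, coords in extracted:
        seen[word] += 1                      # occurrence numbers start at 1
        word_map[(word, seen[word])] = (page, coords)
    return word_map


def locate_hits(word_map: Dict[Hit, Tuple[int, Coords]],
                hits: Iterable[Hit]) -> List[Tuple[int, Coords]]:
    """Step 3: turn search hits into (page, coordinates) to highlight."""
    return [word_map[hit] for hit in hits if hit in word_map]


extracted = [
    ("solr", 1, (10, 10, 50, 20)),
    ("index", 1, (60, 10, 110, 20)),
    ("solr", 2, (10, 40, 50, 50)),
]
word_map = build_word_map(extracted)
print(locate_hits(word_map, [("solr", 2), ("index", 1)]))
```

Step 2, changing the highlighter so that it reports occurrence numbers, is the part that needs work inside Solr and is not covered by this sketch.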
Re: How to use Solr in my project
On 26 December 2013 10:54, Fatima Issawi issa...@qu.edu.qa wrote:
> Hello,
>
> First off, I apologize if this was sent twice. I was having issues
> subscribing to the list.
>
> I'm a complete noob in Solr (and indexing), so I'm hoping someone can
> help me figure out how to implement Solr in my project. I have gone
> through some tutorials online and I was able to import and query text in
> some Arabic PDF documents.
>
> We have some scans of Historical Handwritten Arabic documents that will
> have text extracted into a database (or PDF). We would like the user to
> be able to search the document for text, then have the scanned image
> show up in a viewer with the text highlighted.

This will not work for scanned images which do not actually contain the text. If you have the text of the documents, the best that you can do is break the text into pages corresponding to the scanned images, and index into Solr the text from the pages and the scanned image that should be linked to the text. For a user search, you will need to show the scanned image for the entire page: Highlighting of the search term in an image is not possible without optical character recognition (OCR).

Similarly, if you are indexing from PDFs, you will need to ensure that they contain text, and not just images.

Regards,
Gora
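The page-level approach described in this message, one Solr document per page carrying the page text and a link to the scanned image, might look roughly like this. It is a sketch under assumptions: the field names and the image path pattern are invented for illustration, and the text is assumed to be already split per page.

```python
from typing import Dict, List


def build_page_docs(
    doc_id: str,
    page_texts: List[str],
    image_pattern: str = "{doc_id}/page_{page:04d}.png",  # hypothetical layout
) -> List[Dict]:
    """Return one record per page: its text plus the path of the scanned image."""
    docs = []
    for page, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{doc_id}_p{page}",
            "doc_id": doc_id,
            "pg_id": page,
            "text": text,  # searched by the user
            "image_path": image_pattern.format(doc_id=doc_id, page=page),
        })
    return docs


docs = build_page_docs("doc1", ["text of page one", "text of page two"])
for doc in docs:
    print(doc["id"], doc["image_path"])
```

A search hit then identifies the page, and the application shows the whole scanned image found at `image_path`.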
RE: How to use Solr in my project
Hi,

I should clarify. We have another application extracting the text from the document. The full text from each document will be stored in a database either at the document level or page level (this hasn't been decided yet). We will also be storing the location of each word on the page in the database.

What I'm having problems with is deciding on the schema. We want a user to be able to search for a word in the database, and get a list of the documents that word is located in and the location in each document where it occurs. When he selects a search result, we want the scanned picture to have that word highlighted on the page.

I want to index the document using Solr, but I'm having trouble figuring out how to design the schema to return the location of a search term on the scanned picture in order to highlight it.

Does this make more sense?

Fatima

-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com]
Sent: Thursday, December 26, 2013 1:00 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Solr in my project

[...]
Re: How to use Solr in my project
On 26 December 2013 15:44, Fatima Issawi issa...@qu.edu.qa wrote:
> Hi,
>
> I should clarify. We have another application extracting the text from
> the document. The full text from each document will be stored in a
> database either at the document level or page level (this hasn't been
> decided yet). We will also be storing word location of each word on the
> page in the database.

What do you mean by word location? The number on the page? What purpose would this serve?

> What I'm having problems with is deciding on the schema. We want a user
> to be able to search for a word in the database, have a list of
> documents that word is located in, and location in the document that
> word is located in. When he selects the search results, we want the
> scanned picture to have that word highlighted on the page.
[...]

I think that you might be confusing things:
* If you have the full-text, you can highlight where the word was found. Solr highlighting handles this for you, and there is no need to store word location
* You can have different images (presumably, individual scanned pages) linked to different sections of text, and show the entire image. Highlighting in the image is not possible, unless by word location you mean the (x, y) coordinates of the word on the page. Even then:
  - It will be prohibitively expensive to store the location of every word in every image for a large number of documents
  - Some image processing will be required to handle the highlighting after the scanned image is retrieved

Regards,
Gora