Hi: this is slightly long. I am posting this question to this list because there seem to be people of diverse backgrounds here, and hopefully someone can come up with an idea, or even a solution!
This is for a legislative automation project. We are converting a corpus of legislative acts and bills into digital format. We will scan all the documents to TIFF; each TIFF image is then OCRed, and after a series of verifications aimed at increasing OCR accuracy we end up with a bi-tonal (JBIG2) searchable-image PDF. The goal is that the final PDF shows and prints the text of the document as a 100% true copy of the paper original.

The main problem I can't get a handle on is ascertaining the accuracy of the OCR data. Manual verification seems to be what many of the outsourcing companies do, but we don't want to go that way, as it is quite expensive (I am talking about digitizing legislative acts for over 5 parliaments), and to get better accuracy you need to do a triple-compare (i.e., keying in the same information three times).

I have been trying to find a "brighter" solution along these lines:

a) We have a TIFF image of the document, scanned and virtually re-scanned (a technology that cleans up crimps, creases and ink blotches on the scanned image). This is a 100% true copy of the document.

b) After the OCR process on the TIFF we have a text-over-image PDF, which is searchable for the text in the document and also highlights the text (like Google search highlighting) within the PDF.

The question is: how do we verify the accuracy of the text in (b) against the 100% true TIFF created in (a)? Clearly a plain image comparison will generate a huge amount of difference, so is there a way to do it more smartly?

One possible approach builds on the JBIG2 format (used by Adobe to compress and store text images inside a PDF). It seems that this format actually builds a table of letter shapes, storing each shape only once per page.
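To make the shape-comparison idea concrete, here is a minimal, hypothetical sketch in Python (all names are mine, not from any real OCR product): crop each binarized glyph to its ink bounding box, resample it to a common grid, and count the cells where the two shapes disagree. Two renderings of the same letter at different positions or sizes should score near zero, while different letters score higher. This is only a crude stand-in for the symbol matching a JBIG2 encoder performs, but it shows how "pattern" comparison can ignore the positional and resolution noise that defeats pixel-by-pixel diffing.

```python
# Hypothetical sketch: position- and scale-tolerant glyph comparison.
# A bitmap is a list of rows of 0/1 values, and is assumed to contain
# at least one set ("ink") pixel.

def bounding_box(bitmap):
    """Smallest (top, left, bottom, right) box containing any ink pixel."""
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def normalize(bitmap, size=16):
    """Crop to the ink and resample to a size x size grid (nearest pixel)."""
    top, left, bottom, right = bounding_box(bitmap)
    h, w = bottom - top, right - left
    return [[bitmap[top + r * h // size][left + c * w // size]
             for c in range(size)]
            for r in range(size)]

def shape_distance(a, b, size=16):
    """Fraction of grid cells where the two normalized shapes disagree."""
    na, nb = normalize(a, size), normalize(b, size)
    diff = sum(na[r][c] != nb[r][c]
               for r in range(size) for c in range(size))
    return diff / (size * size)
```

In a verification pass, each glyph rendered from the OCR text would be scored this way against the corresponding glyph cut from the cleaned TIFF, and only the high-distance pairs would be queued for an operator to look at.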
So we would like to build a verification process that takes the original document image (a) and compares it to the optically recognized data rendered back into an image (b). The application should somehow compare the "patterns"/"shapes" (thereby eliminating the noise of a pixel-by-pixel comparison) and point out the parts that show a mismatch, for visual interpretation and verification by experienced operators. Does that make any sense? If you have domain experience with something like this, or know of software built on these fundamentals, please get in touch with me!

Thanks,
Ashok
