Hi all, I keep getting the error "Recognition of image failed" and I am unable to figure out the root cause.

What I am trying to do: I have a 12-page PDF document on which I am trying to perform OCR.

Step 1: Split the entire PDF document into individual pages.
Step 2: Convert each page to a Bitmap image.
Step 3: Feed each Bitmap to Tesseract so that it performs the OCR and returns the result.

Where I am getting the error: The error does not occur consistently on a specific page. It occurs randomly at any page number (e.g. page 7, 8, or 11).

Exact line of code where I'm getting the error:

    using (ResultIterator r = currentPage.GetIterator())

More observations: I get this error only when the PDF I upload is fairly large, e.g. more than 10 pages or so.

Looking for some input from you guys. Thanks in advance.

Code at a high level:

    Tesseract.Page page;
    TesseractEngine ocr;

    public void SplitFile(string path)
    {
        Spire.Pdf.PdfDocument document = new Spire.Pdf.PdfDocument(path);
        System.Drawing.Bitmap bitmap;
        int pageNumber = 0;
        int pageCount = document.Pages.Count;
        try
        {
            for (int i = 0; i < pageCount; i++)
            {
                bitmap = (Bitmap)document.SaveAsImage(pageNumber, PdfImageType.Bitmap, 450, 450); // 450 is the DPI
                BitmapToPixConverter b = new BitmapToPixConverter();
                Tesseract.Pix pix = b.Convert(bitmap);
                ProcessOCR(pix, documentId, pageNumber, pageCount);
                ocr.Dispose();
                pageNumber++;
            }
            document.Close();
        }
    }

    public void ProcessOCR(Pix image, int pageNumber, int pageCount)
    {
        List<Coordinates> lstCoordinates;
        ocr = new TesseractEngine(HttpContext.Current.Server.MapPath(@"~/tessdata"), "eng", EngineMode.Default);
        using (page = ocr.Process(image, PageSegMode.SingleColumn))
        {
            lstCoordinates = GetWordsWithCoordinates(page, image.Width, image.Height, pageNumber);
        }
    }

    public List<Coordinates> GetWordsWithCoordinates(Page currentPage, int width, int height, int pageNumber)
    {
        List<Coordinates> words = new List<Coordinates>();
        using (ResultIterator r = currentPage.GetIterator())
        {
            do
            {
                string word = r.GetText(PageIteratorLevel.Word);
                if (word != null)
                {
                    // fetch the coordinates of the word and do something
                }
            } while (r.Next(PageIteratorLevel.Word));
        }
    }