As usual:
- try to reproduce the problem with the tesseract executable if you are using something else (a wrapper, or in some cases the API)
- send the input image
Zdenko

On Thu, Feb 14, 2013 at 5:13 PM, Markus Austin <[email protected]> wrote:

> Hi All,
>
> I currently have Tesseract implemented within a Perl module with the
> import "use Image::OCR::Tesseract 'get_ocr';". The Perl module is designed
> to do web scraping, in particular to scrape data from the website
> www.hoteltravel.com. Tesseract comes into play when extracting rates
> from the rate breakup (the per-day price breakdown of a room). The
> website stores its room pricing data in a PNG image file, and
> Tesseract is used to extract the text from that image.
>
> The issue I'm currently facing is the improper conversion of these
> rates and currencies when converting the image to text using
> 'get_ocr'. Sample code is provided below showing how I'm using Tesseract
> and how the extracted rates are cleaned up. Further specifics on the
> issue are stated below the sample code.
>
> $agent->save_content("/tmp/rate_img.png"); # Saving the content in temporary file
>
> my $rate_img = get_ocr('/tmp/rate_img.png'); # Convert image to text
>
> system ("rm /tmp/rate_img.png"); # Deleting the temporary file
> my ($rate_per_day) = $rate_img =~ m!U\d{2}\s*([\d\,\.]+)!is; # Extract Rate from text
>
> ##############################
> #Perform Cleanup of Extracted Rate
> ##############################
>
> ($rate_per_day) = $rate_img =~ m!U\w{1}\$([^.]*).!is if (!$rate_per_day);
> ($rate_per_day) = $rate_img =~ m!U\d{2}|\$\s*([\d\,\.]+)!is if (!$rate_per_day);
> ($rate_per_day) = $rate_img =~ m!U\w{2}\s*([\w\,\.]+)!is if (!$rate_per_day);
> ($rate_per_day) = $rate_img =~ m!£([\d\,\.]+)!is if (!$rate_per_day);
> $rate_per_day =~ s!E!8!isg;
> $rate_per_day =~ s!L!4.!isg;
> $rate_per_day =~ s!\,!!sg;
> $rate_img = &make_ascii_text($rate_img, 'utf-8');
>
> $currency = 'EUR' if ($rate_img =~ m!ae?!is);
> $currency = 'THB' if ($rate_img =~ m!as!is);
> $currency = 'USD' if ($rate_img =~ m!USS!is);
> $currency = 'USD' if ($rate_img =~ m!U55|US\$!is);
> $currency = 'GBP' if ($rate_img =~ m!a?!is);
>
> S. No  Criteria                                                              Image       Text after conversion
> 1      (run module) --arv_dt=2013-02-24 --los=11 --guests=1                  US$118.00   U5511E.OO
> 2      (run module) --arv_dt=2013-02-24 --prop_id=15 --los=11 --guests=1     EUR 97.00   ⬠97.00
> 3      (run module) --arv_dt=2013-02-24 --prop_id=16316 --los=11 --guests=1  USD 789.75  U55 7E9.75
> 4      (run module) --arv_dt=2013-02-24 --prop_id=1553 --los=11 --guests=1   US$ 84.15   U55 EL15
>                                                                              US$ 96.82   U55 96.52
>
> (los = Length of Stay)
>
> *Main Issues:*
>
> It converts ‘*EUR*’ to ‘*â¬*’; this kind of change happens for almost
> all currencies.
>
> It converts ‘*US$ 84.15*’ to ‘*U55 EL15*’; here it converts ‘*4.*’ to ‘*L*’.
>
> It converts ‘*US$ 96.82*’ to ‘*U55 96.52*’; here it converts ‘*8*’ to ‘*5*’.
>
> If anyone has encountered an issue like this, or knows of a more
> flexible solution, any help would be much appreciated.
>
> Thanks
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
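
On the first main issue: the ‘â¬’ output may not be a recognition error at all. If the image shows the euro sign and the OCR text is UTF-8, then reading those bytes as Latin-1 gives ‘â’ and ‘¬’ with an invisible control character between them. A minimal sketch of that assumption follows, in Python only for illustration (the module in the thread is Perl); `visible` and `repair` are hypothetical helper names, not part of Tesseract or Image::OCR::Tesseract.

```python
# Assumption (not confirmed in the thread): the OCR text contains the euro
# sign encoded as UTF-8, and those bytes are later interpreted as Latin-1.
euro_utf8 = "€".encode("utf-8")        # three bytes: b'\xe2\x82\xac'
misread = euro_utf8.decode("latin-1")  # 'â', U+0082 (control char), '¬'


def visible(text):
    """Drop non-printable characters, which most displays do not show."""
    return "".join(ch for ch in text if ch.isprintable())


def repair(text):
    """Reverse a UTF-8-read-as-Latin-1 mix-up; return text unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes are
        # not valid UTF-8, so it was not produced by this kind of misreading.
        return text


if __name__ == "__main__":
    print(visible(misread))             # what a terminal would typically show
    print(repair(misread + " 97.00"))   # the intended text, if the assumption holds
    print(repair("US$ 84.15"))          # plain ASCII passes through untouched
```

If this is what is happening, decoding the OCR output as UTF-8 before matching would let the currency be detected from the ‘€’ sign directly. The digit confusions (‘8’ read as ‘E’ or ‘5’, ‘4.’ read as ‘L’) are a separate matter and still need the input image to investigate.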

