Hi All,
I currently use Tesseract from a Perl module via the import
"use Image::OCR::Tesseract 'get_ocr';". The module does web scraping,
specifically scraping data from the website www.hoteltravel.com.
Tesseract comes into play when extracting rates from the rate breakup
(the per-day price breakdown of a room). The website stores its room
pricing data in a PNG image file, and Tesseract is used to extract the
text from that image.
The issue I'm facing is that the rates and currencies come out wrong when
the image is converted to text with 'get_ocr'. The sample code below shows
how I'm using Tesseract and how the extracted rates are cleaned up. The
specifics of the issue follow the sample code.
$agent->save_content("/tmp/rate_img.png"); # Save the content to a temporary file
my $rate_img = get_ocr('/tmp/rate_img.png'); # Convert image to text
system ("rm /tmp/rate_img.png"); # Delete the temporary file
my ($rate_per_day) = $rate_img =~ m!U\d{2}\s*([\d\,\.]+)!is; # Extract rate from text
##############################
#Perform Cleanup of Extracted Rate
##############################
($rate_per_day) = $rate_img =~ m!U\w{1}\$([^.]*).!is if (!$rate_per_day);
# Alternation grouped so that the capture applies to both branches
($rate_per_day) = $rate_img =~ m!(?:U\d{2}|\$)\s*([\d\,\.]+)!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!U\w{2}\s*([\w\,\.]+)!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!£([\d\,\.]+)!is if (!$rate_per_day);
$rate_per_day =~ s!E!8!isg;
$rate_per_day =~ s!L!4.!isg;
$rate_per_day =~ s!\,!!sg;
$rate_img = &make_ascii_text($rate_img, 'utf-8');
$currency = 'EUR' if ($rate_img =~ m!ae?!is);
$currency = 'THB' if ($rate_img =~ m!as!is);
$currency = 'USD' if ($rate_img =~ m!USS!is);
$currency = 'USD' if ($rate_img =~ m!U55|US\$!is);
$currency = 'GBP' if ($rate_img =~ m!a?!is);
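
For reference, the character substitutions above can be gathered into one helper. This is only a minimal sketch: normalize_rate is a hypothetical name (it is not part of Image::OCR::Tesseract), and the mapping of "O" to "0" is an assumption taken from the first sample in the results below rather than something the code above already does.

```perl
use strict;
use warnings;

# Hypothetical helper: map characters that the OCR output confuses with
# digits back to the digits they replaced, then validate the result.
sub normalize_rate {
    my ($raw) = @_;
    return undef unless defined $raw;
    my $rate = $raw;
    $rate =~ s/L/4./g;    # "4." was read as "L"  (EL15 -> E4.15)
    $rate =~ tr/EO/80/;   # "8" was read as "E"; "0" as "O" (assumed)
    $rate =~ s/,//g;      # drop thousands separators
    # Accept only a plain decimal number; anything else is reported as undef.
    return $rate =~ /\A\d+(?:\.\d+)?\z/ ? $rate : undef;
}

print normalize_rate('EL15'), "\n";
```

A substitution table like this cannot recover a digit that was read as another valid digit (for example 8 read as 5), so it only helps with the letter-for-digit confusions.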
*S. No* / *Criteria* / *Image* / *Text after conversion*
(los = Length of Stay in every command below)

1. Criteria: (run module) --arv_dt=2013-02-24 --los=11 --guests=1
   Image: *US$118.00*
   Text after conversion: *U5511E.OO*

2. Criteria: (run module) --arv_dt=2013-02-24 --prop_id=15 --los=11 --guests=1
   Image: *EUR 97.00*
   Text after conversion: *⬠97.00*

3. Criteria: (run module) --arv_dt=2013-02-24 --prop_id=16316 --los=11 --guests=1
   Image: *USD 789.75*
   Text after conversion: *U55 7E9.75*

4. Criteria: (run module) --arv_dt=2013-02-24 --prop_id=1553 --los=11 --guests=1
   Image: *US$ 84.15* and *US$ 96.82*
   Text after conversion: *U55 EL15* and *U55 96.52*

*Main Issues:*
It converts ‘*EUR*’ to ‘*â¬*’, and the currency is garbled in a similar way for almost all currencies.
It converts ‘*US$ 84.15*’ to ‘*U55 EL15*’: ‘*8*’ becomes ‘*E*’ and ‘*4.*’ becomes ‘*L*’.
It converts ‘*US$ 96.82*’ to ‘*U55 96.52*’: ‘*8*’ becomes ‘*5*’.
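
One possible explanation for the first issue, which I have not confirmed: the UTF-8 encoding of the euro sign is the three bytes E2 82 AC, and those bytes look like ‘â¬’ when read as Latin-1. The sketch below assumes the OCR text arrives as raw UTF-8 bytes (I have not verified what get_ocr returns) and decodes it before matching. The %code_for table is hypothetical and covers only two symbols.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the OCR text is a string of raw UTF-8 bytes. These are the
# three bytes of the euro sign followed by the amount from sample 2.
my $ocr_bytes = "\xE2\x82\xAC 97.00";

# Decode the bytes into Perl characters before matching on them.
my $ocr_text = decode('UTF-8', $ocr_bytes);

# Hypothetical symbol-to-code table (euro sign and pound sign only).
my %code_for = ("\x{20AC}" => 'EUR', "\x{00A3}" => 'GBP');

my $currency;
for my $symbol (keys %code_for) {
    $currency = $code_for{$symbol} if index($ocr_text, $symbol) >= 0;
}

print defined $currency ? "$currency\n" : "unknown\n";
```

If the text is already decoded, or is in a different encoding, this step would not apply as written.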
If anyone has encountered an issue like this, or knows of a more
robust way to solve it, any help would be much appreciated.
Thanks
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en