Hi All,
I currently have Tesseract implemented in a Perl module via "use
Image::OCR::Tesseract 'get_ocr';".
The Perl module is used for web scraping, specifically scraping
hotel room rates from the website www.hoteltravel.com. The site stores and
displays its daily rate breakdown (the rate of a given room per day) within a PNG
image file.
The issues I'm running into mainly involve Tesseract misrecognizing
certain characters. The code snippet below shows
me grabbing the specified image, extracting the rate data, and performing
some cleanup of that data. A breakdown of the issues I'm facing follows
the code snippet.
$agent->save_content("/tmp/rate_img.png"); # Save the content to a temporary file
my $rate_img = get_ocr('/tmp/rate_img.png'); #converts image to text
##############################
#
# Perform Cleanup of Extracted Text
#
##############################
my ($rate_per_day) = $rate_img =~ m!U\d{2}\s*([\d\,\.]+)!is;
($rate_per_day) = $rate_img =~ m!U\w{1}\$([^.]*).!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!U\d{2}|\$\s*([\d\,\.]+)!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!U\w{2}\s*([\w\,\.]+)!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!£([\d\,\.]+)!is if (!$rate_per_day);
$rate_per_day =~ s!E!8!isg;
$rate_per_day =~ s!L!4.!isg;
$rate_per_day =~ s!\,!!sg;
$rate_img = &make_ascii_text($rate_img, 'utf-8');
$currency = 'EUR' if ($rate_img =~ m!ae?!is);
$currency = 'THB' if ($rate_img =~ m!as!is);
$currency = 'USD' if ($rate_img =~ m!USS!is);
$currency = 'USD' if ($rate_img =~ m!U55|US\$!is);
$currency = 'GBP' if ($rate_img =~ m!a?!is);
$rate_per_day =~ s!,\s*!!sg;
S. No | Criteria                                             | Image      | Text after conversion
1.    | (run module) --arv_dt=2013-02-24 --los=11 --guests=1 | US$118.00  | U5511E.OO
2.    | (run module) --arv_dt=2013-02-24 --los=11 --guests=1 | EUR 97.00  | â¬ 97.00
3.    | (run module) --arv_dt=2013-02-24 --los=11 --guests=1 | USD 789.75 | U55 7E9.75
4.    | (run module) --arv_dt=2013-02-24 --los=11 --guests=1 | US$ 84.15  | U55 EL15
      |                                                      | US$ 96.82  | U55 96.52
(los = Length of Stay)
*Main Issues:*
It converts ‘*EUR*’ to ‘*â¬*’; the currency symbol is garbled in a similar way for almost all currencies.
It converts ‘*US$ 84.15*’ to ‘*U55 EL15*’: ‘*8*’ becomes ‘*E*’ and ‘*4.*’ becomes ‘*L*’.
It converts ‘*US$ 96.82*’ to ‘*U55 96.52*’: ‘*8*’ becomes ‘*5*’.
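A minimal sketch of how the character substitutions above could be collected into one lookup table, assuming only the confusions listed in this post (O for 0, E for 8, L for "4."). The names normalize_rate and %LOOKALIKE are hypothetical and not part of Image::OCR::Tesseract, and the sketch only handles the US dollar prefixes seen here. It cannot recover the 8 read as 5 in the last example, because "96.52" is itself a valid amount.

```perl
use strict;
use warnings;

# Hypothetical lookup table: characters the OCR output above confuses
# inside the amount, mapped back to what the image actually shows.
my %LOOKALIKE = (
    'O' => '0',     # letter O read for zero, as in "U5511E.OO"
    'E' => '8',     # "E" read for 8, as in "U55 7E9.75"
    'L' => '4.',    # "L" read for "4.", as in "U55 EL15"
);

# Hypothetical helper: returns (currency, amount) or an empty list.
sub normalize_rate {
    my ($ocr_text) = @_;

    # Accept the observed renderings of the dollar prefix: US$, USD, USS, U55.
    my ($raw) = $ocr_text =~ m/U(?:S\$|SD|SS|55)\s*([0-9OEL.,]+)/i
        or return;

    $raw = uc $raw;
    $raw =~ s/([OEL])/$LOOKALIKE{$1}/g;    # apply the look-alike table
    $raw =~ s/,//g;                         # drop thousands separators

    # Only return something that now looks like a plain decimal amount.
    return $raw =~ m/^\d+(?:\.\d+)?$/ ? ('USD', $raw) : ();
}

for my $sample ('U5511E.OO', 'U55 7E9.75', 'U55 EL15', 'U55 96.52') {
    my ($currency, $amount) = normalize_rate($sample);
    printf "%-12s => %s\n", $sample,
        defined $amount ? "$currency $amount" : 'unrecognised';
}
```

Keeping the mapping in one hash makes it easier to add a newly observed confusion than chaining more s!!! lines, but any digit-for-digit misread still has to be fixed at the OCR stage.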
Has anyone faced a similar issue before, and does anyone have any insight
into more flexible ways to solve this conversion issue?
Thanks
--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en