Hi All,

I currently have Tesseract implemented in a Perl module via
"use Image::OCR::Tesseract 'get_ocr';".

The Perl module is used for web scraping, specifically scraping hotel room
rates from the website www.hoteltravel.com. The site stores and displays its
daily rate breakdown (the rate of a given room per day) in a PNG image file.

The issues I'm running into mainly involve Tesseract incorrectly converting
certain characters. The code snippet below shows me grabbing the specified
image, extracting the rate data, and cleaning that data. A breakdown of the
issues I'm facing follows the code snippet.

$agent->save_content("/tmp/rate_img.png"); # save the image to a temporary file

my $rate_img = get_ocr('/tmp/rate_img.png'); # convert the image to text


##############################
#
# Perform Cleanup of Extracted Text
#
##############################

# Try progressively looser patterns until one captures the amount.
my ($rate_per_day) = $rate_img =~ m!U\d{2}\s*([\d\,\.]+)!is;
($rate_per_day) = $rate_img =~ m!U\w{1}\$([^.]*).!is if (!$rate_per_day);
# Group the alternation so the capture applies to both prefixes.
($rate_per_day) = $rate_img =~ m!(?:U\d{2}|\$)\s*([\d\,\.]+)!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!U\w{2}\s*([\w\,\.]+)!is if (!$rate_per_day);
($rate_per_day) = $rate_img =~ m!£([\d\,\.]+)!is if (!$rate_per_day);
# Undo known OCR misreads: 'E' for '8' and 'L' for '4.'.
$rate_per_day =~ s!E!8!isg;
$rate_per_day =~ s!L!4.!isg;
$rate_per_day =~ s!\,!!sg;
$rate_img = &make_ascii_text($rate_img, 'utf-8');

$currency = 'EUR' if ($rate_img =~ m!ae?!is);
$currency = 'THB' if ($rate_img =~ m!as!is);
$currency = 'USD' if ($rate_img =~ m!USS!is);
$currency = 'USD' if ($rate_img =~ m!U55|US\$!is);
$currency = 'GBP' if ($rate_img =~ m!a?!is);
$rate_per_day =~ s!,\s*!!sg;
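
The cleanup above could also be written as a small helper that applies the character fixes only to the amount, not to the whole OCR string. This is a minimal sketch, not the module's actual code: `normalize_amount` and `parse_rate` are hypothetical names, and the substitutions cover only the misreads listed in this post ('E' for '8', 'L' for '4.', 'O' for '0', and '55'/'SS' for 'S$').

```perl
use strict;
use warnings;

# Hypothetical helper: map the OCR misreads seen in this post back to digits.
# It is applied to the amount only, so letters in the prefix are untouched.
sub normalize_amount {
    my ($raw) = @_;
    return undef unless defined $raw;
    my $amount = $raw;
    $amount =~ s/L/4./g;       # 'L' was read in place of '4.'
    $amount =~ tr/EOo/800/;    # 'E' -> 8, 'O' and 'o' -> 0
    $amount =~ s/[,\s]//g;     # drop thousands separators and spaces
    # Accept the result only if it now looks like a plain decimal number.
    return $amount =~ /^\d+(?:\.\d+)?$/ ? $amount : undef;
}

# Hypothetical helper: split e.g. "U55 7E9.75" into currency and amount.
sub parse_rate {
    my ($text) = @_;
    # Prefix variants seen in the samples: US$, USD, and the misreads U55, USS.
    if ($text =~ /\b(?:US\$|USD|U55|USS)\s*([\dEOoL.,]+)/) {
        return ('USD', normalize_amount($1));
    }
    return;    # empty list: no recognizable rate
}

for my $sample ('U5511E.OO', 'U55 7E9.75', 'U55 EL15') {
    my ($currency, $amount) = parse_rate($sample);
    print "$sample => $currency ", (defined $amount ? $amount : 'unparsed'), "\n";
}
```

A misread of one digit as another digit (such as '8' read as '5') cannot be repaired this way, because the result is still a valid number.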


S. No | Criteria                                             | Image      | Text after conversion
------+------------------------------------------------------+------------+----------------------
1.    | (run module) --arv_dt=2013-02-24 --los=11 --guests=1 | US$118.00  | U5511E.OO
2.    | (run module) --arv_dt=2013-02-24 --los=11 --guests=1 | EUR 97.00  | â¬ 97.00
3.    | (run module) --arv_dt=2013-02-24 --los=11 --guests=1 | USD 789.75 | U55 7E9.75
4.    | (run module) --arv_dt=2013-02-24 --los=11 --guests=1 | US$ 84.15  | U55 EL15
      |                                                      | US$ 96.82  | U55 96.52

(los = Length of Stay)

*Main Issues:*

It converts ‘*EUR*’ to ‘*â¬*’; this happens for almost all currencies.

It converts ‘*US$ 84.15*’ to ‘*U55 EL15*’; here it converts ‘*4.*’ to ‘*L*’.

It converts ‘*US$ 96.82*’ to ‘*U55 96.52*’; here it converts ‘*8*’ to ‘*5*’.

Has anyone faced a similar issue before, and does anyone have any insight
into more flexible solutions to this conversion issue?
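
One thing I have not confirmed: the bytes 0xE2 0x82 0xAC are the UTF-8 encoding of the euro sign, and they show up as 'â¬' when displayed as Latin-1, so the 'EUR' case may be an encoding problem rather than a misread. A minimal sketch of checking for the symbol after decoding, assuming the OCR step returns raw UTF-8 bytes (`currency_from_symbol` is a hypothetical helper name):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the OCR output arrives as raw UTF-8 bytes, not decoded characters.
# currency_from_symbol is a hypothetical helper, not part of any module.
sub currency_from_symbol {
    my ($ocr_bytes) = @_;
    my $text = decode('UTF-8', $ocr_bytes);    # bytes -> Perl characters
    return 'EUR' if $text =~ /\x{20AC}/;       # euro sign
    return 'GBP' if $text =~ /\x{00A3}/;       # pound sign
    return;                                    # no known symbol found
}

# The three bytes that display as 'â¬', followed by the amount.
my $sample   = "\xE2\x82\xAC 97.00";
my $currency = currency_from_symbol($sample);
print defined $currency ? "$currency\n" : "unknown\n";
```

If this assumption holds, the symbol check would have to run before any step that converts the text to ASCII.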

Thanks

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
