As usual:

   - try to reproduce the problem with the tesseract executable if you use
   something else (a wrapper, in some cases the API)
   - send the input image
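
For the first point, here is a minimal sketch of calling the executable directly from Perl instead of going through Image::OCR::Tesseract. It assumes `tesseract` is on your PATH and uses the classic command form `tesseract <image> <outputbase> [-l <lang>]`, which writes `<outputbase>.txt`. The helper names `tesseract_command` and `run_tesseract`, and the output path in the usage comment, are made up for illustration.

```perl
use strict;
use warnings;

# Build the argument list for the classic CLI form:
#   tesseract <image> <outputbase> [-l <lang>]
# Tesseract appends ".txt" to <outputbase> itself.
sub tesseract_command {
    my ($image, $outbase, $lang) = @_;
    my @cmd = ('tesseract', $image, $outbase);
    push @cmd, ('-l', $lang) if defined $lang;
    return @cmd;
}

# Run the executable and return the recognized text, decoded as UTF-8.
sub run_tesseract {
    my ($image, $outbase, $lang) = @_;
    my @cmd = tesseract_command($image, $outbase, $lang);
    # List form of system() avoids passing the paths through a shell.
    system(@cmd) == 0 or die "tesseract failed: exit status $?\n";
    open my $fh, '<:encoding(UTF-8)', "$outbase.txt"
        or die "cannot read $outbase.txt: $!\n";
    local $/;    # slurp the whole file at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Hypothetical usage with the file from the original post:
# print run_tesseract('/tmp/rate_img.png', '/tmp/rate_out', 'eng');
```

If the text from this matches what the wrapper returns, the problem is in Tesseract or the image; if not, look at the wrapper. Reading the output file explicitly as UTF-8 may also help tell whether the odd euro characters come from Tesseract or from how the output is decoded afterwards.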


Zdenko


On Thu, Feb 14, 2013 at 5:13 PM, Markus Austin <[email protected]> wrote:

> Hi All,
>
> I currently have Tesseract implemented within a Perl module with the
> import "use Image::OCR::Tesseract 'get_ocr';". The Perl module is designed
> to do web scraping, specifically to scrape data from the website
> www.hoteltravel.com. Tesseract comes into play when extracting rates
> from the rate breakup (the per-day price breakdown of a room). The
> website stores its room pricing data in a PNG image file, and
> Tesseract is used to extract the text from that image.
>
> The issue I'm currently facing revolves around the improper conversion of
> these rates and currencies when converting the image to text using
> 'get_ocr'. Sample code is provided below showing how I'm using Tesseract
> and the clean up of the extracted rates. Further specifics on the issue at
> hand are stated below the sample code.
>
> $agent->save_content("/tmp/rate_img.png"); # Save the content to a temporary file
>
> my $rate_img = get_ocr('/tmp/rate_img.png'); # Convert image to text
>
> system ("rm /tmp/rate_img.png"); # Delete the temporary file
> my ($rate_per_day) = $rate_img =~ m!U\d{2}\s*([\d\,\.]+)!is; # Extract rate from text
>
> ##############################
> #Perform Cleanup of Extracted Rate
> ##############################
>
> ($rate_per_day) = $rate_img =~ m!U\w{1}\$([^.]*).!is if (!$rate_per_day);
> ($rate_per_day) = $rate_img =~ m!U\d{2}|\$\s*([\d\,\.]+)!is if (!$rate_per_day);
> ($rate_per_day) = $rate_img =~ m!U\w{2}\s*([\w\,\.]+)!is if (!$rate_per_day);
> ($rate_per_day) = $rate_img =~ m!£([\d\,\.]+)!is if (!$rate_per_day);
> $rate_per_day    =~ s!E!8!isg;
> $rate_per_day    =~ s!L!4.!isg;
> $rate_per_day =~ s!\,!!sg;
> $rate_img = &make_ascii_text($rate_img, 'utf-8');
>
> $currency = 'EUR' if ($rate_img =~ m!ae?!is);
> $currency = 'THB' if ($rate_img =~ m!as!is);
> $currency = 'USD' if ($rate_img =~ m!USS!is);
> $currency = 'USD' if ($rate_img =~ m!U55|US\$!is);
> $currency = 'GBP' if ($rate_img =~ m!a?!is);
>
> S. No | Criteria                                                             | Image                | Text after conversion
> 1     | (run module) --arv_dt=2013-02-24 --los=11 --guests=1                 | US$118.00            | U5511E.OO
> 2     | (run module) --arv_dt=2013-02-24 --prop_id=15 --los=11 --guests=1    | EUR 97.00            | ⬠97.00
> 3     | (run module) --arv_dt=2013-02-24 --prop_id=16316 --los=11 --guests=1 | USD 789.75           | U55 7E9.75
> 4     | (run module) --arv_dt=2013-02-24 --prop_id=1553 --los=11 --guests=1  | US$ 84.15, US$ 96.82 | U55 EL15, U55 96.52
>
> (los = Length of Stay)
>
> Main issues:
>
> It converts ‘EUR’ to ‘â¬’, and similar changes occur for almost all currencies.
>
> It converts ‘US$ 84.15’ to ‘U55 EL15’; here ‘4.’ becomes ‘L’.
>
> It converts ‘US$ 96.82’ to ‘U55 96.52’; here ‘8’ becomes ‘5’.
>
> If anyone has encountered an issue like this, or knows of a more
> flexible way to solve it, any help would be much appreciated.
>
> Thanks
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
