Interesting paper on TICCL. I wonder whether Tesseract uses a similar
approach for the 3.04 language data, with its unigram and bigram lists along
with 'clean' word lists ...

see section 4.4 processing steps
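
To make the word-list idea concrete, here is a minimal Python sketch of frequency-based correction against a 'clean' unigram list. It is only an illustration: it is not TICCL's algorithm from the paper, nor a description of what Tesseract does internally. The names edits1 and correct and the toy counts are made up for the example; a real list would be built from a large clean corpus.

```python
from collections import Counter
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, unigrams):
    """Keep known words; otherwise pick the most frequent known neighbour."""
    if word in unigrams:
        return word
    candidates = [w for w in edits1(word) if w in unigrams]
    if not candidates:
        return word  # nothing plausible found, leave the token untouched
    return max(candidates, key=lambda w: unigrams[w])


# Toy 'clean' unigram counts (hypothetical values, for illustration only).
unigrams = Counter({"substantial": 12, "numbers": 40, "errors": 25, "of": 300})

print([correct(w, unigrams) for w in "subtantial numbers of errers".split()])
```

Bigram counts could then be used to choose between several candidates by looking at the neighbouring words, which this sketch does not attempt.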

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Fri, Sep 5, 2014 at 6:44 PM, Rick Leir <[email protected]> wrote:

> Here is something about automated corrections:
>
>
> http://ilk.uvt.nl/downloads/pub/papers/CICLING08.TICCL.MRE.postpublication.pdf
>
> Unrelated to the above, I would like to use languagetool.org to automate
> corrections.  So much to do, so little time..
>
>
> On Tuesday, September 2, 2014 11:13:50 AM UTC-4, Pierre Lison wrote:
>>
>>
>> Hi,
>>
>> I'm a researcher in statistical machine translation, and for my work I use
>> a bunch of translated texts (in multiple languages), some of which were
>> automatically generated via OCR.  I recently noticed that some texts
>> include substantial numbers of OCR errors, which I would of course like to
>> correct to improve the quality of my data.
>>
>> I was therefore wondering if I could use tesseract or some related
>> software tool in order to correct at least some of these OCR-generated
>> errors (through e.g. statistical language modelling techniques).  Note that
>> I unfortunately don't have access to the original scans, I only have the
>> raw, OCR-produced text.
>>
>> Any suggestions?
>>
>> Thanks!
>>
>> Pierre
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/715ce30f-c574-446a-997a-d5dfb137d89b%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout.
>

