Hello, Chirag. I am also trying to find a way to detect the script of the input document.
Kindly let me know if you have made any progress.

Thanks and regards,
Alex

On Wednesday, March 14, 2012 at 7:17:29 PM UTC+8, Chirag Jain wrote:
>
> With -psm 3, I got non-empty files (test_osd.txt), which were empty with
> -psm 0. This is true both with and without the -l option.
>
> However, the results of detectOS are the same for both -psm options (0 and 3),
> with or without -l.
>
> Please note that I have modified the code slightly to call detectOS
> separately, which has been doing a good job of orientation detection given
> the script. I am struggling to detect the script of the input document.
>
> Regards,
> Chirag
>
>
> On Wed, Mar 14, 2012 at 4:05 PM, Sriranga(78yrsold) <[email protected]> wrote:
>
>> One more important point: please test again as follows:
>> 1st test: tesseract.exe japanese_doc.tif test_osd -l jpn -psm 3
>> 2nd test: tesseract.exe japanese_doc.tif test_osd -psm 3
>> Please check the output text files "test_osd"; you will find a difference
>> in script between the two.
>>
>> On Wed, Mar 14, 2012 at 3:51 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>
>>> I noticed that "-l lang" before "-psm 0" is missing from your command line. In
>>> the absence of "-l lang", tesseract will always assume "-l eng".
>>>
>>> An extract of the help output is reproduced below:
>>>
>>> M:\>tesseract.exe -h
>>> Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
>>> pagesegmode values are:
>>> 0 = Orientation and script detection (OSD) only.
>>> 1 = Automatic page segmentation with OSD.
>>> 2 = Automatic page segmentation, but no OSD, or OCR
>>> 3 = Fully automatic page segmentation, but no OSD. (Default)
>>> 4 = Assume a single column of text of variable sizes.
>>> 5 = Assume a single uniform block of vertically aligned text.
>>> 6 = Assume a single uniform block of text.
>>> 7 = Treat the image as a single text line.
>>> 8 = Treat the image as a single word.
>>> 9 = Treat the image as a single word in a circle.
>>> 10 = Treat the image as a single character.
>>> -l lang and/or -psm pagesegmode must occur before any configfile.
>>>
>>> On Wed, Mar 14, 2012 at 3:22 PM, Chirag <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I was able to successfully test orientation detection (after stepping
>>>> through the code) for various scripts using the following commands:
>>>>
>>>> English: tesseract.exe english_doc.tif test_osd -l eng -psm 0
>>>> Japanese: tesseract.exe japanese_doc.tif test_osd -l jpn -psm 0
>>>> Korean: tesseract.exe korean_doc.tif test_osd -l kor -psm 0
>>>>
>>>> In these cases, the executable searches for eng.traineddata,
>>>> jpn.traineddata and kor.traineddata respectively, along with
>>>> osd.traineddata.
>>>>
>>>> The performance is really good.
>>>>
>>>> However, it seems that Tesseract is detecting orientation given the script.
>>>>
>>>> If I run the executable as follows:
>>>>
>>>> Japanese: tesseract.exe japanese_doc.tif test_osd -psm 0
>>>> Korean: tesseract.exe korean_doc.tif test_osd -psm 0
>>>>
>>>> The results are not good. It seems that script detection is not robust.
>>>>
>>>> Am I missing some step? Kindly clarify.
>>>>
>>>> Regards,
>>>> Chirag
>>>>
>>>> On Sat, Mar 3, 2012 at 7:12 PM, koray <[email protected]> wrote:
>>>>
>>>>> OSD returns empty text when I tried it. Can anyone please clarify whether
>>>>> this is a bug or I'm doing something wrong?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected]
>>>>> To unsubscribe from this group, send email to [email protected]
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
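
For anyone repeating the comparison discussed in this thread (the same image with and without -l, under -psm 0 and -psm 3), the four runs can be scripted. The following is only a minimal sketch, assuming the command-line syntax shown in the help extract quoted above (imagename, outputbase, then -l and -psm). The helper names build_command, comparison_runs and run_all are hypothetical, as are the output base names, which are chosen so that each run writes to its own file. It also assumes the executable is on the PATH as "tesseract".

```python
import subprocess


def build_command(image, outputbase, lang=None, psm=0, exe="tesseract"):
    """Build the argument list for one invocation.

    Follows the usage line quoted in the thread:
    imagename outputbase [-l lang] [-psm pagesegmode]
    (Assumption: single-dash -psm, as in that help text; other
    Tesseract versions may spell the option differently.)
    """
    cmd = [exe, image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    cmd += ["-psm", str(psm)]
    return cmd


def comparison_runs(image, lang, exe="tesseract"):
    """Return the four commands: psm 0 and 3, each with and without -l."""
    runs = []
    for psm in (0, 3):
        for use_lang in (lang, None):
            tag = "lang" if use_lang else "nolang"
            # Hypothetical naming scheme so runs do not overwrite each other.
            outputbase = f"test_osd_psm{psm}_{tag}"
            runs.append(build_command(image, outputbase, use_lang, psm, exe))
    return runs


def run_all(image, lang, exe="tesseract"):
    """Print and execute each command; a failing run does not stop the rest."""
    for cmd in comparison_runs(image, lang, exe):
        print(" ".join(cmd))
        subprocess.run(cmd, check=False)
```

Calling run_all("japanese_doc.tif", "jpn") would then produce four separate output files whose contents can be compared side by side, which is the check suggested earlier in the thread.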

