Thanks, Sriranga, for the response. I was able to perform automatic orientation detection along with page segmentation, but for non-English scripts I had to supply the script information through the -l argument.
I still could not perform automatic script detection.

Regards,
Chirag

2012/3/14 Sriranga(78yrsold) <[email protected]>

> I tested using my own lang.tif as follows:
> 1) with the -l option and -psm 3: please see attached testtif-osd.txt
> (non-English)
> 2) without the -l option and -psm 3: please see attached 2testtif-osd.txt
> (in English)
> In both cases the output is *not empty*, but it is in a different language.
>
> Extract of the cmd session reproduced below, when -psm 0 is used:
> M:\>tesseract.exe test.tif 2testtif-osd -psm 0
> Tesseract Open Source OCR Engine v3.02 with Leptonica
> Error during processing.
>
> M:\>tesseract.exe test.tif 2testtif-osd -l k27 -psm 0
> Tesseract Open Source OCR Engine v3.02 with Leptonica
> Error during processing.
>
> On Wed, Mar 14, 2012 at 4:47 PM, Chirag <[email protected]> wrote:
>
>> With -psm 3, I got non-empty files (test_osd.txt), which were empty with
>> -psm 0. This is true both with and without the -l option.
>>
>> However, the results of detectOS are the same for both -psm options
>> (0 and 3), with or without the -l option.
>>
>> Please note that I have modified the code slightly to call detectOS
>> separately, which has been doing a good job of orientation detection given
>> the script. I am struggling to detect the script of the input document.
>>
>> Regards,
>> Chirag
>>
>> On Wed, Mar 14, 2012 at 4:05 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>
>>> One more important point: please test again as follows:
>>> 1st test: tesseract.exe japanese_doc.tif test_osd -l jpn -psm 3
>>> 2nd test: tesseract.exe japanese_doc.tif test_osd -psm 3
>>> Please check the output text files "test_osd"; you will find a
>>> difference in script between the two.
>>>
>>> On Wed, Mar 14, 2012 at 3:51 PM, Sriranga(78yrsold) <[email protected]> wrote:
>>>
>>>> I noticed "-l lang" before "-psm 0" is missing in your command line.
>>>> In the absence of "-l lang" tesseract will always assume "-l eng".
>>>>
>>>> Extract of the help output is reproduced below:
>>>>
>>>> M:\>tesseract.exe -h
>>>> Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
>>>> pagesegmode values are:
>>>> 0 = Orientation and script detection (OSD) only.
>>>> 1 = Automatic page segmentation with OSD.
>>>> 2 = Automatic page segmentation, but no OSD, or OCR
>>>> 3 = Fully automatic page segmentation, but no OSD. (Default)
>>>> 4 = Assume a single column of text of variable sizes.
>>>> 5 = Assume a single uniform block of vertically aligned text.
>>>> 6 = Assume a single uniform block of text.
>>>> 7 = Treat the image as a single text line.
>>>> 8 = Treat the image as a single word.
>>>> 9 = Treat the image as a single word in a circle.
>>>> 10 = Treat the image as a single character.
>>>> -l lang and/or -psm pagesegmode must occur before any configfile.
>>>>
>>>> On Wed, Mar 14, 2012 at 3:22 PM, Chirag <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I was able to successfully test orientation detection (after stepping
>>>>> through the code) for various scripts using the following commands:
>>>>>
>>>>> English: tesseract.exe english_doc.tif test_osd -l eng -psm 0
>>>>> Japanese: tesseract.exe japanese_doc.tif test_osd -l jpn -psm 0
>>>>> Korean: tesseract.exe korean_doc.tif test_osd -l kor -psm 0
>>>>>
>>>>> In these cases, the executable searches for eng.traineddata,
>>>>> jpn.traineddata and kor.traineddata respectively, along with
>>>>> osd.traineddata.
>>>>>
>>>>> The performance is really good.
>>>>>
>>>>> However, it seems that Tesseract is detecting orientation given the script.
>>>>>
>>>>> If I run the executable as follows:
>>>>>
>>>>> Japanese: tesseract.exe japanese_doc.tif test_osd -psm 0
>>>>> Korean: tesseract.exe korean_doc.tif test_osd -psm 0
>>>>>
>>>>> The results are not good. It seems that script detection is not robust.
>>>>>
>>>>> Am I missing some step? Kindly clarify.
>>>>>
>>>>> Regards,
>>>>> Chirag
>>>>>
>>>>> On Sat, Mar 3, 2012 at 7:12 PM, koray <[email protected]> wrote:
>>>>>
>>>>>> OSD returns empty text when I tried it. Can anyone please clarify whether
>>>>>> this is a bug or I'm doing something wrong?
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To post to this group, send email to [email protected]
>>>>>> To unsubscribe from this group, send email to
>>>>>> [email protected]
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
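
For anyone scripting the test runs discussed in this thread, here is a minimal Python sketch that assembles the argument list in the order the quoted v3.02 usage line asks for: image, output base, then -l and -psm, then any config files. The function name build_tesseract_args, the constant names, and the default executable name "tesseract" are hypothetical and chosen only for illustration. The -l and -psm flags are taken from the help text quoted above, so other Tesseract versions may spell them differently. The sketch only builds the arguments and does not run OCR.

```python
from typing import List, Optional, Sequence

# Page segmentation modes as listed in the v3.02 help output quoted in this thread.
# The constant names are hypothetical; only the numeric values come from the help text.
PSM_OSD_ONLY = 0
PSM_AUTO_WITH_OSD = 1
PSM_FULLY_AUTO = 3


def build_tesseract_args(
    image: str,
    outputbase: str,
    lang: Optional[str] = None,
    psm: Optional[int] = None,
    configfiles: Sequence[str] = (),
    exe: str = "tesseract",  # assumption: adjust to tesseract.exe or a full path
) -> List[str]:
    """Build an argv list with -l and -psm placed before any config file."""
    if psm is not None and not 0 <= psm <= 10:
        raise ValueError("pagesegmode must be between 0 and 10")
    args = [exe, image, outputbase]
    if lang is not None:
        args += ["-l", lang]
    if psm is not None:
        args += ["-psm", str(psm)]
    # Config files go last, as the usage text requires.
    args += list(configfiles)
    return args


if __name__ == "__main__":
    # The same shape as the commands compared in the thread.
    print(" ".join(build_tesseract_args("japanese_doc.tif", "test_osd", lang="jpn", psm=PSM_OSD_ONLY)))
    print(" ".join(build_tesseract_args("japanese_doc.tif", "test_osd", psm=PSM_FULLY_AUTO)))
```

The returned list can be passed to subprocess.run, assuming the executable is on the PATH and the traineddata files mentioned in the thread are installed.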

