On Mon, May 20, 2013 at 2:49 AM, zdenko podobny <[email protected]> wrote:
> On Sat, May 18, 2013 at 2:57 PM, sdk <[email protected]> wrote:
>
>> Hi,
>>
>> I have used QT Box Editor 1.10 on Windows 7. It works fine on .png files
>> (does not open .tif files).
>
> QT Box Editor (QBE) uses QT4 functionality to load images. QT4 does not
> support multipage tif. I did not experience problems with tiff on my
> Windows XP. I have no possibility to test it on Win 7. Anyway, I plan to use
> leptonica to import images (at least for tiff ;-) ) - this would also bring
> multipage tiff support, but I do not have time for this development (you know,
> it works for me at this stage, and there is a bunch of other tasks...)

Multi tiff support will be good to have. For now, I just convert the tif to
png if I want to use QBE.

>> I had a question regarding its import / export feature.
>>
>> I have generated a box file using QT Box Editor with Hindi traineddata
>> and want to fix the errors in it, as in some areas boxes are being marked
>> erroneously.
>>
>> 1. If I export the text from box file 'Line by Line', is there a way to
>> import it back? I am getting the error that number of boxes don't match.
>
> Maybe it is possible. Import feature is quite simple:
>
> 1. QBE expects that the number of boxes (already in table view) is equal
>    to the number of symbols excluding spaces and linebreaks (\n). Otherwise
>    you get an error (there are more symbols than boxes, or there are more
>    boxes than symbols).
> 2. QBE expects for import format: one box = one symbol

I guess the problem with Devanagari is that symbols as chopped may be
different from symbols defined in unichar or training data (because of
combining characters). So, the number of lines does not usually match.

>> 2. It seems from the little reading of the source that I have done, that
>> there is an option for box files which handle a line at a time and segment
>> at word level. Is there some feature like that?
>
> It is not clear to me what you need (or expect): a tesseract box file
> can have one box per line. There is no information about words.
> QBE offers an export where it tries to identify words based on the space
> between boxes (6pt - this can be adjusted in settings). It is not perfect,
> but it works in most situations (problems could occur on historical
> documents with inconsistent spacing).

I am just fishing :-) to see if there is an alternative method that I have
not tried. My question was based on the following comments in
http://code.google.com/p/tesseract-ocr/source/browse/trunk/ccmain/tesseractclass.h?r=820

//// applybox.cpp //////////////////////////////////////////////////////
// Applies the box file based on the image name fname, and resegments
// the words in the block_list (page), with:
// blob-mode: one blob per line in the box file, words as input.
// word/line-mode: one blob per space-delimited unit after the #, and one word
// per line in the box file. (See comment above for box file format.)
// If find_segmentation is true, (word/line mode) then the classifier is used
// to re-segment words/lines to match the space-delimited truth string for
// each box. In this case, the input box may be for a word or even a whole
// text line, and the output words will contain multiple blobs corresponding
// to the space-delimited input string.
// With find_segmentation false, no classifier is needed, but the chopper
// can still be used to correctly segment touching characters with the help
// of the input boxes.
// In the returned PAGE_RES, the WERD_RES are setup as they would be returned
// from normal classification, ie. with a word, chopped_word, rebuild_word,
// seam_array, denorm, box_word, and best_state, but NO best_choice or
// raw_choice, as they would require a UNICHARSET, which we aim to avoid.
// Instead, the correct_text member of WERD_RES is set, and this may be later
// converted to a best_choice using CorrectClassifyWords. CorrectClassifyWords
// is not required before calling ApplyBoxTraining.
PAGE_RES* ApplyBoxes(const STRING& fname, bool find_segmentation,
                     BLOCK_LIST *block_list);

// Builds a PAGE_RES from the block_list in the way required for ApplyBoxes:
// All fuzzy spaces are removed, and all the words are maximally chopped.

There is also a config called rebox. I was just looking to see if the export
of one line at a time was related to a new box file format in any way.

I am trying to create some data using scanned images, and it is faster to
edit it a line at a time rather than on a box basis, hence looking for
alternatives.

Thanks,
Shree

>> Can it be used with QT Box editor?
>
> Generally: QBE was built for my needs (Latin-based script - so I have no
> clue how it behaves in other scripts). Improvements (code, patches) are
> welcome ;-)
>
>> Shree
>>
>> On Friday, November 16, 2012 6:02:34 PM UTC+5:30, zdenop wrote:
>>>
>>> QT Box Editor 1.10 was released. It is a multi-platform visual editor
>>> for tesseract-ocr <http://code.google.com/p/tesseract-ocr/> box
>>> files <http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
>>> (used for OCR training) based on QT4 library
>>> <http://qt.nokia.com/products/>.
>>>
>>> Several problems were fixed, so an upgrade is recommended.
>>>
>>> New features:
>>>
>>> - implemented bbox drag resizing (thanks to D. Silaev), e.g. the user
>>>   can change a box rectangle on the image with the mouse
>>> - reload image, reload box file from disk
>>> - implemented 'regenerate box file'
>>> - implemented 'convert image to binary image' so the user can see the
>>>   image the way tesseract will use it in the OCR process
>>> - implemented 'zoom in/out' with CTRL + mouse wheel
>>> - watch for box file modified outside of the program
>>>
>>> For windows users there are binary files (qt-box-editor-1.10.exe +
>>> qt-box-editor-dependecies-1.09.zip) created with mingw32, QT 4.8.1,
>>> leptonica 1.69 and tesseract 3.02 on Windows XP SP3 (32bit).
>>> For other platforms you need to compile it from source.
>>>
>>> Homepage: http://zdenop.github.com/qt-box-editor/
>>> Code: https://github.com/zdenop/qt-box-editor
>>> Changelog: https://github.com/zdenop/qt-box-editor/blob/master/CHANGELOG
>>>
>>> --
>>> Zdenko
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
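
For anyone who wants to check a text file before importing it, here is a
minimal sketch of the two rules discussed in this thread: the import check
(number of boxes must equal the number of symbols, ignoring spaces and line
breaks) and the gap-based word grouping used on export. It is not QBE code.
The names Box, parse_box_line, count_symbols, can_import and group_words are
made up for illustration. It assumes that a "symbol" is counted as one
Unicode code point and that the gap threshold is in the same units as the box
coordinates. QBE itself may count and measure differently.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a Tesseract box file: symbol left bottom right top page."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the five numeric fields are always separated
    # from the symbol, whatever the symbol contains.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def count_symbols(text: str) -> int:
    """Count code points that are not whitespace (spaces, line breaks)."""
    # Assumption: one symbol == one code point. A Devanagari syllable such
    # as "कि" is two code points, so it counts as two symbols here.
    return sum(1 for ch in text if not ch.isspace())


def can_import(text: str, boxes: List[Box]) -> bool:
    """True if the text has exactly one non-space symbol per box."""
    return count_symbols(text) == len(boxes)


def group_words(boxes: List[Box], max_gap: int = 6) -> List[str]:
    """Join symbols into words, starting a new word at a wide gap.

    Assumes all boxes belong to a single text line, in reading order.
    """
    words: List[str] = []
    current: List[Box] = []
    for box in boxes:
        if current and box.left - current[-1].right > max_gap:
            words.append("".join(b.symbol for b in current))
            current = []
        current.append(box)
    if current:
        words.append("".join(b.symbol for b in current))
    return words


if __name__ == "__main__":
    lines = ["a 10 10 20 30 0", "b 22 10 32 30 0", "c 50 10 60 30 0"]
    boxes = [parse_box_line(l) for l in lines]
    print(can_import("ab c\n", boxes))  # True: 3 symbols, 3 boxes
    print(group_words(boxes))           # ['ab', 'c']
```

Under the code point assumption, a single box holding "कि" does not match
the text "कि" (one box, two symbols), which would be consistent with the
mismatch reported above for Devanagari.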

