On Saturday, 26 March 2016 at 11:56:17 UTC+1, Kim Rönnberg wrote:
>
> Is there a way to make Tesseract produce "real" XML instead of the (X)HTML
> hOCR it produces, i.e. to create XML tags like <ocr_page id='page_1'
> title='...'> instead of "<div class='ocr_page' id='page_1'...", <ocr_area
> id='...' title='...'> instead of "<div class='ocr_carea' id='block_1_1'..."
> etc.?
>
> Or is there a "ready" tool somewhere that can convert the (X)HTML hOCR
> output to a more easily parseable XML format, or, even better, something
> that would give me the divs, spans and ps grouped per word, line, area and
> page, ready to put into a (PHP) array for inserting into a database, in
> the data format hOCR produces now?
>
> Like "file_name", "page_nr", "area_id", "line_nr", "word_nr", "word bbox
> x1 y1 x2 y2", "the word value", for each word? I realise this means a lot
> of rows (one per word in a document), but this is something I need.
>
> I have spent some days on this, trying to find something that works in
> PHP, but have not managed to find anything.
>
> Regards
>
> Kim Rönnberg
There are some tools to convert hOCR to XCES (XML Corpus Encoding Standard):

https://bitbucket.org/jwilk/marasca-wbl/

Regards
Janusz
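For the per-word rows described in the question, a small script over the hOCR file may already be enough, since Tesseract's hOCR output is normally well-formed XHTML. The following is only a minimal sketch in Python (not PHP, and not taken from the tool linked above). It assumes the usual hOCR class names `ocr_page`, `ocr_carea`, `ocr_line` and `ocrx_word`, and a `bbox x1 y1 x2 y2` entry in each word's `title` attribute. The function name `hocr_to_rows`, the row keys and the sample markup are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches the "bbox x1 y1 x2 y2" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_to_rows(hocr_text, file_name):
    """Flatten hOCR markup into one dict per word (hypothetical helper)."""
    root = ET.fromstring(hocr_text)
    rows = []
    page_nr, area_id, line_nr, word_nr = 0, None, 0, 0
    # iter() walks the tree in document order, so the counters below
    # always describe the page/area/line the current word belongs to.
    for elem in root.iter():
        classes = elem.get("class", "").split()
        if "ocr_page" in classes:
            page_nr += 1
            area_id, line_nr, word_nr = None, 0, 0
        elif "ocr_carea" in classes:
            area_id = elem.get("id")
        elif "ocr_line" in classes:
            line_nr += 1
            word_nr = 0
        elif "ocrx_word" in classes:
            word_nr += 1
            match = BBOX_RE.search(elem.get("title", ""))
            x1, y1, x2, y2 = (
                (int(v) for v in match.groups()) if match else (None,) * 4
            )
            rows.append({
                "file_name": file_name,
                "page_nr": page_nr,
                "area_id": area_id,
                "line_nr": line_nr,
                "word_nr": word_nr,
                "x1": x1, "y1": y1, "x2": x2, "y2": y2,
                # itertext() also picks up text inside <strong>/<em> children.
                "word": "".join(elem.itertext()).strip(),
            })
    return rows


# Hand-written sample imitating Tesseract's hOCR layout (illustrative only).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class='ocr_page' id='page_1' title='bbox 0 0 640 480'>
<div class='ocr_carea' id='block_1_1' title='bbox 10 10 200 40'>
<p class='ocr_par' id='par_1_1'>
<span class='ocr_line' id='line_1_1' title='bbox 10 10 200 40'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 10 90 40; x_wconf 95'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 100 10 200 40; x_wconf 91'><strong>world</strong></span>
</span></p></div></div></body></html>"""

for row in hocr_to_rows(SAMPLE, "scan.html"):
    print(row)
```

Each dict maps directly onto one database row. The same walk should port to PHP's DOM or SimpleXML extensions, though that is untested here.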

