On Saturday, 26 March 2016 at 11:56:17 UTC+1, Kim Rönnberg wrote:
>
> Is there a way to make Tesseract produce "real" xml instead of the (x)html
> hOCR produces, i.e. to create xml tags like <ocr_page id='page_1'
> title='...'> instead of "<div class='ocr_page' id='page_1'...", <ocr_area
> id='...' title='...'> instead of "<div class='ocr_carea' id='block_1_1'..."
> etc.?
>
> Or is there somewhere a "ready" something with which the (x)html hOCR
> produces can be converted to a more "easily" xml parseable format, or, even
> better, a something that would give me the div's, span's and p's grouped per
> word, line, area and page readily insertable to a (php) array for inserting
> into a database, of the data format the hOCR produces now?
>
> Like "file_name", "page_nr", "area_id", "line_nr", "word_nr", "word bbox
> x1 y1 x2 y2", "the word value", for each word? I realise this means a lot
> of rows (one per word in a document), but this is something I need.
>
> I have spent some days on this, trying to find something that works on
> php, but have not managed to find anything.
>
> Regards
>
> Kim Rönnberg
>
I usually use Perl for such tasks:

    use strict;
    use warnings;
    use Mojo::DOM;

    # Path to the hOCR file (assumed here to come from the command line).
    my $hocr_file = shift @ARGV or die "usage: $0 file.hocr\n";

    # Read the whole hOCR file as UTF-8.
    open(my $hocr_fh, "<:encoding(UTF-8)", $hocr_file)
        or die "cannot open $hocr_file: $!";
    my $html = do { local $/; <$hocr_fh> };
    close($hocr_fh);

    my $dom = Mojo::DOM->new($html);
    my $ocr_page = {};

    # Example page element:
    # <div class='ocr_page' id='page_1'
    #      title='image "isisvonoken00oken_0153.png"; bbox 0 0 2321 2817; ppageno 0'>
    for my $e ($dom->find('div.ocr_page')->each) {
        my $title = $e->{'title'};
        print 'page title: ', $title, "\n";
        # Pull the page bounding box out of the title attribute.
        if ($title =~ m/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/) {
            $ocr_page->{x1} = $1;
            $ocr_page->{y1} = $2;
            $ocr_page->{x2} = $3;
            $ocr_page->{y2} = $4;
        }
    }

Mojo::DOM is an HTML/XML parser that lets you navigate the document with
CSS selectors (like jQuery).
Of course, there are dozens of other XML parsers available in Perl.
I'm sure there are similar parsers that can be used from PHP.
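
The same approach could be extended to the one-row-per-word layout asked for above. This is only a minimal sketch: it assumes the hOCR uses the usual ocr_carea, ocr_line and ocrx_word classes, and the embedded sample, the file name "sample.png" and the row layout are made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical hOCR fragment, trimmed to the classes used below.
my $hocr = <<'HOCR';
<div class='ocr_page' id='page_1' title='image "sample.png"; bbox 0 0 2321 2817; ppageno 0'>
 <div class='ocr_carea' id='block_1_1' title='bbox 10 10 300 60'>
  <p class='ocr_par' id='par_1_1' title='bbox 10 10 300 60'>
   <span class='ocr_line' id='line_1_1' title='bbox 10 10 300 60'>
    <span class='ocrx_word' id='word_1_1' title='bbox 10 10 120 60; x_wconf 93'>Hello</span>
    <span class='ocrx_word' id='word_1_2' title='bbox 140 10 300 60; x_wconf 90'>world</span>
   </span>
  </p>
 </div>
</div>
HOCR

my $dom = Mojo::DOM->new($hocr);
my @rows;       # one array ref per word
my $page_nr = 0;

for my $page ($dom->find('div.ocr_page')->each) {
    $page_nr++;
    # The image name is stored in the page's title attribute.
    my ($file_name) = ($page->{title} // '') =~ m/image\s+"([^"]+)"/;
    $file_name //= '';

    for my $area ($page->find('div.ocr_carea')->each) {
        my $line_nr = 0;    # line counter restarts in every area
        for my $line ($area->find('span.ocr_line')->each) {
            $line_nr++;
            my $word_nr = 0;
            for my $word ($line->find('span.ocrx_word')->each) {
                $word_nr++;
                # Skip words whose title carries no bounding box.
                my ($x1, $y1, $x2, $y2) =
                    ($word->{title} // '') =~ m/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/
                    or next;
                push @rows, [ $file_name, $page_nr, $area->{id}, $line_nr,
                              $word_nr, $x1, $y1, $x2, $y2, $word->all_text ];
            }
        }
    }
}

# Tab-separated output, one word per line.
print join("\t", @$_), "\n" for @rows;
```

Each entry in @rows holds file name, page number, area id, line number, word number, the four bbox values and the word text, so it could be passed to a database insert instead of being printed.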
Helmut Wollmersdorfer