font metrics issue

Francisco Andrés Fernández Tue, 23 Feb 2016 07:00:30 -0800

Hi all,
I'm extracting text from PDF files. As a result, some important words end up
with spaces between their characters. For example, the word "Subtitle", which
I want to detect, may come out as "S u b t i t l e". If I parse the text with
a standard tokenizer, the word is lost.
I think (after asking on the Solr list) that this might be related to the
fonts.
Is there any way to cope with this through Tika configuration?
Many thanks,


Francisco
