Hi all,
I'm extracting text from PDF files. As a result, some important words end up
with spaces between their characters. For example, the word "Subtitle", which
I want to detect, may come out as "S u b t i t l e". If I parse the text with
a standard tokenizer, the word is lost.
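To illustrate, here is a minimal Python sketch of the problem and of a naive post-processing workaround. It is not a Tika feature, and the helper name `collapse_spaced_letters` is hypothetical. It assumes that a run of three or more single ASCII letters separated by single spaces is really one word.

```python
import re

# Assumption: a run of 3+ single ASCII letters, each separated by exactly one
# space, is one letter-spaced word (e.g. "S u b t i t l e").
_SPACED_WORD = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def collapse_spaced_letters(text: str) -> str:
    """Join letter-spaced runs back into single words (hypothetical helper)."""
    return _SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)


extracted = "S u b t i t l e of the report"

# A whitespace tokenizer splits the heading into single-letter tokens.
tokens_before = extracted.split()

# After collapsing, the heading survives as one token.
tokens_after = collapse_spaced_letters(extracted).split()
```

This heuristic is fragile: it would also merge genuine single-letter sequences such as "a b c", and two letter-spaced words separated by only one space would be fused into one. That is why a fix at extraction time would be preferable.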
After asking on the Solr list, I think this might be related to the fonts
used in the PDF.
Is there any way to cope with this through Tika configuration?
Many Thanks,

Francisco
