Hi all,

I'm extracting text from PDF files. In the extracted output, some important words come out with spaces between their characters. For example, the word "Subtitle", which I want to detect, is written as "S u b t i t l e". If I parse the text with a standard tokenizer, the word is lost.

After asking on the Solr list, I think this might be related to fonts. Is there any way to handle this through Tika configuration?

Many thanks,
Francisco
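
P.S. A minimal sketch of what I mean, in Python. A plain whitespace split stands in for the "standard tokenizer" here (an assumption for illustration only; a real analyzer may behave differently), and `whitespace_tokenize` is just a hypothetical helper name.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace (a stand-in for a standard tokenizer)."""
    return text.split()


# The word as it comes out of the PDF extraction, with a space after each letter.
extracted = "S u b t i t l e"

tokens = whitespace_tokenize(extracted)

# Every letter becomes its own token, so the word itself is never seen.
print(tokens)
print("Subtitle" in tokens)
```

So a search or rule that looks for the token "Subtitle" never matches, even though the word is visibly there in the PDF.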
