Nick, Marcus, thank you for your help. It works great, and one of the problems that I saw was indeed with my code, not Tika.
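
For anyone who finds this thread later, here is a minimal sketch of the fix Nick suggested (quoted below): put the filename into the Metadata object before calling Tika, so detection can use the extension as well as the content. The class name, helper method names, and the sample file name "Example.java" are made up for illustration. The sketch also assumes a Tika release where the key is exposed as TikaCoreProperties.RESOURCE_NAME_KEY; on the 1.x releases current at the time of this thread the same key is Metadata.RESOURCE_NAME_KEY. It is not the exact code from my project.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

// Hypothetical example class; the names below are illustrative only.
public class FilenameHintExample {

    // Detection from content alone, which is all the original
    // parseToString(inputStream, metadata) call had to go on.
    static String detectWithoutName(Tika tika, byte[] content) throws IOException {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return tika.detect(in, new Metadata());
        }
    }

    // Detection with the filename supplied as a hint, so the extension
    // can be matched in addition to any magic bytes.
    static String detectWithName(Tika tika, byte[] content, String fileName)
            throws IOException {
        Metadata metadata = new Metadata();
        // Assumption: Tika 2.x constant; on 1.x use Metadata.RESOURCE_NAME_KEY.
        metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, fileName);
        try (InputStream in = new ByteArrayInputStream(content)) {
            return tika.detect(in, metadata);
        }
    }

    // Same call as in the thread, but with the filename set first.
    // Actual text extraction needs the Tika parser modules on the classpath.
    static String extractText(Tika tika, byte[] content, String fileName)
            throws IOException, TikaException {
        Metadata metadata = new Metadata();
        metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, fileName);
        try (InputStream in = new ByteArrayInputStream(content)) {
            return tika.parseToString(in, metadata);
        }
    }

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        byte[] source = "public class Example {}\n".getBytes(StandardCharsets.UTF_8);

        System.out.println("without name: " + detectWithoutName(tika, source));
        System.out.println("with name:    " + detectWithName(tika, source, "Example.java"));
    }
}
```

With the name supplied I would expect the detected type to be the Java source type rather than generic plain text, but the exact result depends on the Tika version and on which parser modules are on the classpath.
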

Mark

Mark Kerzner, SHMsoft <http://shmsoft.com/>,
Book a call with me here <http://www.meetme.so/markkerzner>
Mobile: 713-724-2534
Skype: mark.kerzner1

On Thu, Sep 29, 2016 at 5:21 PM, Nick Burch <[email protected]> wrote:

> On Wed, 28 Sep 2016, Mark Kerzner wrote:
>
>> probably yes, but how do I tell it which parser to use? Today, I just do that
>>
>> String text = tika.parseToString(inputStream, metadata);
>>
>> and it knows the parser.
>>
>
> That might be your issue. It's quite hard to identify the language of a
> piece of source code from just the first few hundred bytes of text. If you
> tell Tika the filename, including the extension, it'll have much more luck
> spotting the file is code and using the appropriate parser!
>
> (Binary files often have common magic at/near the start that helps Tika
> identify the file type, source code is text based and lacks that)
>
> Nick
>
