Hello all,

I have a Spring controller that handles uploads, and I would like to extract
the contents of a PDF, DOC, TXT, or HTML file as it is uploaded.

The problem is that I can see the file being uploaded and I can see the byte
payload, but when I try to use the AutoDetectParser I cannot get the
contents. Here is what I am doing.

Notice that I extract the bytes from the MultipartFile and then build a
ByteArrayInputStream from them.

I can see in the debugger that the stream is not empty and holds the file
contents. But when I try to extract them with Tika, I get an empty string
and no errors.

@Controller
@RequestMapping(value = "/documents")
public class DocumentController {

    @RequestMapping(value = "/parse", method = RequestMethod.POST)
    public @ResponseBody String handleFileUpload(
            @RequestParam("file") MultipartFile file) {
        if (!file.isEmpty()) {
            try {
                byte[] source = file.getBytes();
                long size = file.getSize();
                ByteArrayInputStream is = new ByteArrayInputStream(source);
                Metadata metadata = new Metadata();
                //metadata.set(Metadata.RESOURCE_NAME_KEY, file.getOriginalFilename());

                Parser parser = new AutoDetectParser(); // Should auto-detect!
                StringWriter textBuffer = new StringWriter();
                BodyContentHandler handler = new BodyContentHandler(textBuffer);
                ParseContext context = new ParseContext();
                parser.parse(is, handler, metadata, context);
                String content2 = textBuffer.toString();
                String content1 = handler.toString();

                Tika tk = new Tika();
                String text = tk.parseToString(is, metadata);
                is.close();
                // TODO : return structure instead of text
                return text;

            } catch (Exception e) {
                return "You failed to upload the file => " + e.getMessage();
            }
        } else {
            return "You failed to upload the file because the file was empty.";
        }
    }

}

I thought that either calling handler.toString(), or passing a textBuffer to
the handler constructor and then calling textBuffer.toString(), would give
me the contents of the text file or PDF being uploaded.

I get an empty string instead.
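
One thing I am not sure about is whether it matters that I hand the same
ByteArrayInputStream to both parser.parse() and tk.parseToString(). Below is
a minimal sketch, using only the JDK and no Tika at all, of how a stream
behaves when it is read twice. The class name StreamReuseSketch and the helper
drain() are hypothetical names made up for this illustration, and
readAllBytes() assumes Java 9 or later. I do not know whether this is the
actual cause of my problem.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamReuseSketch {

    // Hypothetical helper: reads whatever is left in the stream as UTF-8 text.
    static String drain(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] source = "hello tika".getBytes(StandardCharsets.UTF_8);

        ByteArrayInputStream is = new ByteArrayInputStream(source);
        String first = drain(is);   // consumes the whole stream
        String second = drain(is);  // same stream again: nothing left to read

        // A fresh stream over the same byte[] starts from the beginning.
        String third = drain(new ByteArrayInputStream(source));

        System.out.println("first=[" + first + "]");
        System.out.println("second=[" + second + "]");
        System.out.println("third=[" + third + "]");
    }
}
```

In this sketch the second read comes back empty while the third does not, so
in my controller the value I return (text) may be empty for that reason alone.
It would not explain why content1 and content2 are empty as well.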

I do not want to save the file, just extract its text content. How should
I do it?

Thanks.
