Hello all,
I have a Spring controller that handles uploads, and I would like to extract
the contents of a PDF, DOC, TXT, or HTML file as it is uploaded.
The problem is that I can see the file being uploaded and I can see the byte
payload, but when I try to use the AutoDetectParser I cannot get the
contents. Here is what I am doing.
Note that I extract the bytes from the MultipartFile and then build a
ByteArrayInputStream from them.
I can see in the debugger that the byte array is not empty and holds the
file contents. But when I try to extract the text with Tika I get an empty
string and no errors.
@Controller
@RequestMapping(value = "/documents")
public class DocumentController {

    @RequestMapping(value = "/parse", method = RequestMethod.POST)
    public @ResponseBody String handleFileUpload(
            @RequestParam("file") MultipartFile file) {
        if (!file.isEmpty()) {
            try {
                byte[] source = file.getBytes();
                long size = file.getSize();
                ByteArrayInputStream is = new ByteArrayInputStream(source);
                Metadata metadata = new Metadata();
                //metadata.set(Metadata.RESOURCE_NAME_KEY, file.getOriginalFilename());
                Parser parser = new AutoDetectParser(); // Should auto-detect!
                StringWriter textBuffer = new StringWriter();
                BodyContentHandler handler = new BodyContentHandler(textBuffer);
                ParseContext context = new ParseContext();
                parser.parse(is, handler, metadata, context);
                String content2 = textBuffer.toString();
                String content1 = handler.toString();
                Tika tk = new Tika();
                String text = tk.parseToString(is, metadata);
                is.close();
                // TODO : return structure instead of text
                return text;
            } catch (Exception e) {
                return "You failed to upload the file => " + e.getMessage();
            }
        } else {
            return "You failed to upload the file because the file was empty.";
        }
    }
}
I thought that either calling handler.toString(), or passing a textBuffer to
the handler constructor and then calling textBuffer.toString(), would give
me the text of the uploaded file, whether it is a plain-text file or a PDF.
I get an empty string instead.
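One possible factor, offered as a guess rather than a confirmed diagnosis:
the controller hands the same ByteArrayInputStream first to
parser.parse(...) and then to tk.parseToString(...). If the first call reads
the stream to its end, the second call would find no bytes left, which would
at least explain why the returned text is empty. The sketch below shows that
behaviour using plain java.io only. StreamReuseDemo and readAll are
hypothetical names, and readAll is only a stand-in for "something that
consumes the whole stream", not a Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamReuseDemo {

    // Hypothetical stand-in for a parser: reads the stream to its end
    // and returns everything it read as UTF-8 text.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] source = "hello tika".getBytes(StandardCharsets.UTF_8);
        ByteArrayInputStream is = new ByteArrayInputStream(source);

        // First read consumes every byte in the stream.
        String first = readAll(is);
        // Second read on the same stream starts at the end, so nothing is left.
        String second = readAll(is);

        // Option 1: rewind. With no mark set, reset() on a
        // ByteArrayInputStream built from a whole array goes back to offset 0.
        is.reset();
        String afterReset = readAll(is);

        // Option 2: wrap the same byte array in a fresh stream.
        String fresh = readAll(new ByteArrayInputStream(source));

        System.out.println("first      = [" + first + "]");
        System.out.println("second     = [" + second + "]");
        System.out.println("afterReset = [" + afterReset + "]");
        System.out.println("fresh      = [" + fresh + "]");
    }
}
```

If that is what happens in the controller, either keeping only one of the
two extraction calls or giving each call its own
new ByteArrayInputStream(source) might be worth trying. This would not by
itself explain content1 and content2 also being empty.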
I do not want to save the file, just extract its text content. How should
I do it?
Thanks.