Hi,

I'm working on an open source project that converts the raw content of a PDF (stored as a DataByteArray) into plain text using a Pig UDF and Apache Tika, and I could use your help. For some reason the UDF isn't working: the script succeeds, but no output is written.

*This is the Pig script I'm following:*

    register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
    DEFINE ExtractTextFromPDFs org.warcbase.pig.piggybank.ExtractTextFromPDFs();
    DEFINE ArcLoader org.warcbase.pig.ArcLoader();

    raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray, date: chararray, mime: chararray, content: bytearray); --load the data
    a = FILTER raw BY (url matches '.*\\.pdf$'); --gets all PDF pages from the arc file
    b = LIMIT a 2; --limit to 2 pages to speed up testing time
    c = foreach b generate url, ExtractTextFromPDFs(content);
    store c into 'output/pdf_test';

*This is the UDF I wrote:*

    public class ExtractTextFromPDFs extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            String pdfText = "";

            if (input == null || input.size() == 0 || input.get(0) == null) {
                return "N/A";
            }

            DataByteArray dba = (DataByteArray) input.get(0);
            pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written

            InputStream is = new ByteArrayInputStream(dba.get());

            ContentHandler contenthandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            DefaultDetector detector = new DefaultDetector();
            AutoDetectParser pdfparser = new AutoDetectParser(detector);

            try {
                pdfparser.parse(is, contenthandler, metadata, new ParseContext());
            } catch (SAXException | TikaException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }

            pdfText.concat(" : "); //another attempt at debugging. Still nothing written
            pdfText.concat(contenthandler.toString());

            //close the input stream
            if (is != null) {
                is.close();
            }
            return pdfText;
        }
    }

Thank you for your assistance,
Ryan
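
A possible cause worth checking in the UDF above: Java strings are immutable, so `String.concat` returns a new string and leaves `pdfText` untouched unless the result is assigned back. The sketch below isolates that behaviour outside of Pig and Tika. The class name `ConcatCheck` and its two helper methods are hypothetical and exist only for illustration; they are not part of warcbase.

```java
// Hypothetical standalone check; not part of warcbase or the UDF itself.
class ConcatCheck {

    // Mirrors the UDF's pattern: the result of each concat() call is discarded.
    static String discardResult(String size, String body) {
        String text = "";
        text.concat(size);   // returns a new String; 'text' is unchanged
        text.concat(" : ");
        text.concat(body);
        return text;         // still the empty string
    }

    // Keeps every appended piece by accumulating in a StringBuilder.
    static String keepResult(String size, String body) {
        StringBuilder text = new StringBuilder();
        text.append(size).append(" : ").append(body);
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + discardResult("1024", "hello") + "]"); // prints "[]"
        System.out.println("[" + keepResult("1024", "hello") + "]");    // prints "[1024 : hello]"
    }
}
```

If this is what is happening, `exec` would return an empty string for every record regardless of what Tika extracts, which would be consistent with the debugging text never showing up. Assigning the result back (`pdfText = pdfText.concat(...)`) or using a `StringBuilder` as in `keepResult` would be one way to test that theory.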