Hi,

I'm working on an open source project that attempts to convert raw content
from a PDF (stored as a DataByteArray) into plain text using a Pig UDF and
Apache Tika, and I could use your help. For some reason, the UDF I'm using
isn't working: the script succeeds, but no output is written. *This is the
Pig script I'm following:*

register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
DEFINE ExtractTextFromPDFs
 org.warcbase.pig.piggybank.ExtractTextFromPDFs();
DEFINE ArcLoader org.warcbase.pig.ArcLoader();

raw = load '/data/arc/sample.arc' using ArcLoader
      as (url: chararray, date: chararray, mime: chararray, content: bytearray); --load the data

a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from the arc file
b = LIMIT a 2; --limit to 2 pages to speed up testing time
c = foreach b generate url, ExtractTextFromPDFs(content);
store c into 'output/pdf_test';


*This is the UDF I wrote:*

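package org.warcbase.pig.piggybank;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExtractTextFromPDFs extends EvalFunc<String> {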

  @Override
  public String exec(Tuple input) throws IOException {
      String pdfText = "";

      if (input == null || input.size() == 0 || input.get(0) == null) {
          return "N/A";
      }

      DataByteArray dba = (DataByteArray)input.get(0);
      pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written

      InputStream is = new ByteArrayInputStream(dba.get());

      ContentHandler contenthandler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      DefaultDetector detector = new DefaultDetector();
      AutoDetectParser pdfparser = new AutoDetectParser(detector);

      try {
        pdfparser.parse(is, contenthandler, metadata, new ParseContext());
      } catch (SAXException | TikaException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
      }
      pdfText.concat(" : "); //another attempt at debugging. Still nothing written
      pdfText.concat(contenthandler.toString());

      //close the input stream
      if(is != null){
        is.close();
      }
      return pdfText;
  }

}
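
One thing that may be worth checking in the UDF above is how pdfText is
built. Java strings are immutable, so String.concat returns a new String and
leaves the receiver unchanged; if the return value is discarded, pdfText
would still be "" when exec returns. The standalone sketch below reproduces
only that string handling, with no Pig or Tika dependency. The class name
ConcatCheck and its two helper methods are hypothetical, made up for
illustration, and this may not be the only reason no output appears.

```java
public class ConcatCheck {

    // Mirrors the pattern used in exec: each concat result is discarded,
    // so pdfText never changes from its initial empty value.
    static String discarded(String size, String body) {
        String pdfText = "";
        pdfText.concat(size);
        pdfText.concat(" : ");
        pdfText.concat(body);
        return pdfText;
    }

    // Same pieces, but accumulated in a StringBuilder so they are kept.
    static String accumulated(String size, String body) {
        StringBuilder pdfText = new StringBuilder();
        pdfText.append(size).append(" : ").append(body);
        return pdfText.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + discarded("1024", "hello") + "]");   // prints []
        System.out.println("[" + accumulated("1024", "hello") + "]"); // prints [1024 : hello]
    }
}
```

If the same thing is happening in exec, assigning the result back
(pdfText = pdfText.concat(...)) or switching to a StringBuilder should make
the debugging text and the extracted content show up in the returned value.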

Thank you for your assistance,
Ryan
