Got it, thanks! Any idea why Tika might not be working? I've been testing,
and while no exceptions are thrown, nothing is appended either when I call
pdfText.append(contenthandler.toString());
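
A minimal sketch of the instance-variable approach suggested in the quoted
reply below. ExpensiveParser and ExtractTextSketch are hypothetical stand-ins
(not Tika or Pig classes) so that the example runs with only the JDK; the
point is just that the parser is built once per UDF instance and reused
across calls:

```java
import java.util.concurrent.atomic.AtomicInteger;

class ReuseDemo {
    public static void main(String[] args) {
        ExtractTextSketch udf = new ExtractTextSketch();
        // Simulate the framework calling exec() once per input record.
        for (int i = 1; i <= 3; i++) {
            System.out.println(udf.exec(new byte[i * 10]));
        }
        System.out.println("parsers constructed: " + ExpensiveParser.CONSTRUCTED.get());
    }
}

// Hypothetical stand-in for a parser that is expensive to construct.
class ExpensiveParser {
    static final AtomicInteger CONSTRUCTED = new AtomicInteger();

    ExpensiveParser() {
        CONSTRUCTED.incrementAndGet(); // count how often the setup cost is paid
    }

    String parse(byte[] content) {
        return "parsed " + content.length + " bytes";
    }
}

// Hypothetical stand-in for the UDF: the parser is a plain instance field.
class ExtractTextSketch {
    private final ExpensiveParser parser = new ExpensiveParser();

    String exec(byte[] content) {
        if (content == null) {
            return "N/A";
        }
        return parser.parse(content); // reuses the same parser on every call
    }
}
```

In the real UDF the field would presumably hold the AutoDetectParser, with
only the per-document objects created inside exec.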

On Fri, Dec 5, 2014 at 6:21 PM, Pradeep Gollakota <pradeep...@gmail.com>
wrote:

> A static variable is not necessary... a simple instance variable is just
> fine.
>
> On Fri Dec 05 2014 at 2:27:53 PM Ryan <freelanceflashga...@gmail.com>
> wrote:
>
> > After running it with the updated code, the problem seems to be related
> > to Tika, since my output shows that the input has the correct number of
> > bytes (i.e. it's actually being passed in correctly). Going to test
> > further to narrow down the problem.
> >
> > Pradeep, would you recommend using a static variable inside the
> > ExtractTextFromPDFs class to store the PdfParser after it has been
> > initialized? I'm still learning how best to do things within the
> > Pig/MapReduce/Hadoop framework.
> >
> > Ryan
> >
> > On Fri, Dec 5, 2014 at 1:35 PM, Ryan <freelanceflashga...@gmail.com>
> > wrote:
> >
> > > Thanks Pradeep! I'll give it a try and report back
> > >
> > > Ryan
> > >
> > > On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > > wrote:
> > >
> > > >> I forgot to mention earlier that you should probably move the PdfParser
> > > >> initialization code out of the exec method. Initializing it on every
> > > >> call will probably cause significant overhead, both in terms of GC and
> > > >> runtime performance. You'll want to initialize your parser once and
> > > >> evaluate all your docs against it.
> > >>
> > >> - Pradeep
> > >>
> > > >> On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <pradeep...@gmail.com>
> > > >> wrote:
> > >>
> > > >> > Java strings are immutable, so "pdfText.concat()" returns a new string
> > > >> > and leaves the original string untouched. At the end, all you're doing
> > > >> > is returning an empty string. Instead, you can do "pdfText =
> > > >> > pdfText.concat(...)". But the better way to write it is to use a
> > > >> > StringBuilder.
> > >> >
> > >> > StringBuilder pdfText = ...;
> > >> > pdfText.append(...);
> > >> > pdfText.append(...);
> > >> > ...
> > >> > return pdfText.toString();
> > >> >
> > > >> > On Fri Dec 05 2014 at 9:12:37 AM Ryan <freelanceflashga...@gmail.com>
> > > >> > wrote:
> > >> >
> > >> >> Hi,
> > >> >>
> > > >> >> I'm working on an open source project attempting to convert raw content
> > > >> >> from a PDF (stored as a DataByteArray) into plain text using a Pig UDF
> > > >> >> and Apache Tika. I could use your help. For some reason, the UDF I'm
> > > >> >> using isn't working. The script succeeds but no output is written.
> > > >> >> *This is the Pig script I'm following:*
> > >> >>
> > >> >> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> > >> >> DEFINE ExtractTextFromPDFs
> > >> >>  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> > >> >> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
> > >> >>
> > > >> >> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
> > > >> >> date: chararray, mime: chararray, content: bytearray); --load the data
> > > >> >>
> > > >> >> a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from the arc file
> > >> >> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> > >> >> c = foreach b generate url, ExtractTextFromPDFs(content);
> > >> >> store c into 'output/pdf_test';
> > >> >>
> > >> >>
> > >> >> *This is the UDF I wrote:*
> > >> >>
> > >> >> public class ExtractTextFromPDFs extends EvalFunc<String> {
> > >> >>
> > >> >>   @Override
> > >> >>   public String exec(Tuple input) throws IOException {
> > >> >>       String pdfText = "";
> > >> >>
> > >> >>       if (input == null || input.size() == 0 || input.get(0) ==
> > null) {
> > >> >>           return "N/A";
> > >> >>       }
> > >> >>
> > >> >>       DataByteArray dba = (DataByteArray)input.get(0);
> > >> >>       pdfText.concat(String.valueOf(dba.size())); //my attempt at
> > >> >> debugging. Nothing written
> > >> >>
> > >> >>       InputStream is = new ByteArrayInputStream(dba.get());
> > >> >>
> > >> >>       ContentHandler contenthandler = new BodyContentHandler();
> > >> >>       Metadata metadata = new Metadata();
> > >> >>       DefaultDetector detector = new DefaultDetector();
> > >> >>       AutoDetectParser pdfparser = new AutoDetectParser(detector);
> > >> >>
> > >> >>       try {
> > >> >>         pdfparser.parse(is, contenthandler, metadata, new
> > >> ParseContext());
> > >> >>       } catch (SAXException | TikaException e) {
> > >> >>         // TODO Auto-generated catch block
> > >> >>         e.printStackTrace();
> > >> >>       }
> > >> >>       pdfText.concat(" : "); //another attempt at debugging. Still
> > >> nothing
> > >> >> written
> > >> >>       pdfText.concat(contenthandler.toString());
> > >> >>
> > >> >>       //close the input stream
> > >> >>       if(is != null){
> > >> >>         is.close();
> > >> >>       }
> > >> >>       return pdfText;
> > >> >>   }
> > >> >>
> > >> >> }
> > >> >>
> > >> >> Thank you for your assistance,
> > >> >> Ryan
> > >> >>
> > >> >
> > >>
> > >
> > >
> >
>
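
The String immutability point made earlier in the thread can be checked with
a small standalone sketch. ConcatDemo and its method names are made up for
illustration and are not part of the UDF; buggy() mirrors the pattern of
discarding concat()'s return value, and fixed() uses a StringBuilder:

```java
class ConcatDemo {
    // Mirrors the original pattern: each concat() result is thrown away.
    static String buggy(String size, String body) {
        String text = "";
        text.concat(size);   // returns a new String; 'text' is unchanged
        text.concat(" : ");
        text.concat(body);
        return text;         // still the empty string
    }

    // StringBuilder mutates its own buffer, so the appends accumulate.
    static String fixed(String size, String body) {
        StringBuilder text = new StringBuilder();
        text.append(size).append(" : ").append(body);
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + buggy("1024", "hello") + "]"); // prints []
        System.out.println("[" + fixed("1024", "hello") + "]"); // prints [1024 : hello]
    }
}
```

Note that with the StringBuilder version an empty result can only mean that
the appended value itself was empty, which is why the remaining question in
the thread points at what the content handler returns.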
