Got it, thanks! Any idea why Tika might not be working? I've been testing, and although no exceptions are thrown, nothing is appended when I call pdfText.append(contenthandler.toString());

On Fri, Dec 5, 2014 at 6:21 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:

> A static variable is not necessary... a simple instance variable is just
> fine.
>
> On Fri Dec 05 2014 at 2:27:53 PM Ryan <freelanceflashga...@gmail.com>
> wrote:
>
> > After running it with updated code, it seems like the problem has to do
> > with something related to Tika since my output says that my input is the
> > correct number of bytes (i.e. it's actually being sent in correctly). Going
> > to test further to narrow down the problem.
> >
> > Pradeep, would you recommend using a static variable inside the
> > ExtractTextFromPDFs function to store the PdfParser once it has been
> > initialized once? I'm still learning how to best do things within the
> > Pig/MapReduce/Hadoop framework
> >
> > Ryan
> >
> > On Fri, Dec 5, 2014 at 1:35 PM, Ryan <freelanceflashga...@gmail.com>
> > wrote:
> >
> > > Thanks Pradeep! I'll give it a try and report back
> > >
> > > Ryan
> > >
> > > On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota <pradeep...@gmail.com>
> > > wrote:
> > >
> > >> I forgot to mention earlier that you should probably move the PdfParser
> > >> initialization code out of the evaluate method. This will probably cause a
> > >> significant overhead both in terms of gc and runtime performance. You'll
> > >> want to initialize your parser once and evaluate all your docs against it.
> > >>
> > >> - Pradeep
> > >>
> > >> On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <pradeep...@gmail.com>
> > >> wrote:
> > >>
> > >> > Java strings are immutable. So "pdfText.concat()" returns a new string
> > >> > and the original string is left unmolested. So at the end, all you're doing
> > >> > is returning an empty string. Instead, you can do "pdfText =
> > >> > pdfText.concat(...)". But the better way to write it is to use a
> > >> > StringBuilder.
> > >> >
> > >> > StringBuilder pdfText = ...;
> > >> > pdfText.append(...);
> > >> > pdfText.append(...);
> > >> > ...
> > >> > return pdfText.toString();
> > >> >
> > >> > On Fri Dec 05 2014 at 9:12:37 AM Ryan <freelanceflashga...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Hi,
> > >> >>
> > >> >> I'm working on an open source project attempting to convert raw content
> > >> >> from a pdf (stored as a databytearray) into plain text using a Pig UDF and
> > >> >> Apache Tika. I could use your help. For some reason, the UDF I'm using
> > >> >> isn't working. The script succeeds but no output is written. *This is the
> > >> >> Pig script I'm following:*
> > >> >>
> > >> >> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> > >> >> DEFINE ExtractTextFromPDFs org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> > >> >> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
> > >> >>
> > >> >> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
> > >> >>     date: chararray, mime: chararray, content: bytearray); --load the data
> > >> >>
> > >> >> a = FILTER raw BY (url matches '.*\\.pdf$'); --gets all PDF pages from the arc file
> > >> >> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> > >> >> c = foreach b generate url, ExtractTextFromPDFs(content);
> > >> >> store c into 'output/pdf_test';
> > >> >>
> > >> >>
> > >> >> *This is the UDF I wrote:*
> > >> >>
> > >> >> public class ExtractTextFromPDFs extends EvalFunc<String> {
> > >> >>
> > >> >>   @Override
> > >> >>   public String exec(Tuple input) throws IOException {
> > >> >>     String pdfText = "";
> > >> >>
> > >> >>     if (input == null || input.size() == 0 || input.get(0) == null) {
> > >> >>       return "N/A";
> > >> >>     }
> > >> >>
> > >> >>     DataByteArray dba = (DataByteArray)input.get(0);
> > >> >>     pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written
> > >> >>
> > >> >>     InputStream is = new ByteArrayInputStream(dba.get());
> > >> >>
> > >> >>     ContentHandler contenthandler = new BodyContentHandler();
> > >> >>     Metadata metadata = new Metadata();
> > >> >>     DefaultDetector detector = new DefaultDetector();
> > >> >>     AutoDetectParser pdfparser = new AutoDetectParser(detector);
> > >> >>
> > >> >>     try {
> > >> >>       pdfparser.parse(is, contenthandler, metadata, new ParseContext());
> > >> >>     } catch (SAXException | TikaException e) {
> > >> >>       // TODO Auto-generated catch block
> > >> >>       e.printStackTrace();
> > >> >>     }
> > >> >>     pdfText.concat(" : "); //another attempt at debugging. Still nothing written
> > >> >>     pdfText.concat(contenthandler.toString());
> > >> >>
> > >> >>     //close the input stream
> > >> >>     if(is != null){
> > >> >>       is.close();
> > >> >>     }
> > >> >>     return pdfText;
> > >> >>   }
> > >> >>
> > >> >> }
> > >> >>
> > >> >> Thank you for your assistance,
> > >> >> Ryan
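
For reference, here is a minimal sketch of the two suggestions made in the quoted thread: accumulate output with a StringBuilder rather than discarding the result of String.concat(), and hold the parser in an instance variable so it is built once per UDF instance. It uses only the standard library so it can be run on its own. UdfSketch and FakeParser are hypothetical stand-ins invented for illustration, not the real ExtractTextFromPDFs or Tika classes, and the sketch does not try to explain why Tika itself returns no text.

```java
public class UdfSketch {

    /** Hypothetical stand-in for an expensive-to-build parser; counts constructions. */
    public static class FakeParser {
        public static int created = 0;

        public FakeParser() {
            created++;
        }

        public String parse(byte[] content) {
            return "text of " + content.length + " bytes";
        }
    }

    // Mirrors the original bug: concat() returns a new String that is
    // thrown away, so pdfText is still "" when it is returned.
    public static String buggy(byte[] content) {
        String pdfText = "";
        pdfText.concat(String.valueOf(content.length));
        return pdfText;
    }

    // Instance variable: built once per UdfSketch, reused by every exec() call.
    private final FakeParser parser = new FakeParser();

    // StringBuilder is mutable, so each append() is kept.
    public String exec(byte[] content) {
        StringBuilder pdfText = new StringBuilder();
        pdfText.append(content.length);
        pdfText.append(" : ");
        pdfText.append(parser.parse(content));
        return pdfText.toString();
    }

    public static void main(String[] args) {
        byte[] doc = {1, 2, 3};
        UdfSketch udf = new UdfSketch();
        System.out.println("[" + buggy(doc) + "]"); // prints "[]"
        System.out.println(udf.exec(doc));          // prints "3 : text of 3 bytes"
        System.out.println(udf.exec(doc));          // same parser instance is reused
        System.out.println(FakeParser.created);     // prints "1"
    }
}
```

In the real UDF the instance variable would presumably hold the AutoDetectParser, while objects that collect per-document state (the content handler and metadata) would still be created inside exec().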