Hi Lewis,

This parsed for me using 1.7-SNAPSHOT:

[chipotle:~/tmp/tika] mattmann% tika -t "http://www.who.int/about/who_reform/who-internal-control-framework.pdf"
WARN - Count in xref table is 0 at offset 651997
Internal Control Framework November 2013 2 ANNEX ANNEXES
Table of Contents
Table of Contents ........................................................................... ....................................... 2
1. INTRODUCTION ........................................................................... ................................................................ 3
2. SCOPE AND DEFINITION OF INTERNAL CONTROL ......................................................................... 4
3. THE FIVE COMPONENTS AND EIGHTEEN PRINCIPLES OF INTERNAL CONTROL: ............... 5
I/ Internal Environment ........................................................................... ............................ 5
II/ Risk Assessment ........................................................................... ................................... 6
III/ Control Activities ........................................................................... ................................. 6
IV/ Information and Communication ........................................................................... ......... 7
..more snipped

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Saturday, October 11, 2014 at 5:30 PM
To: "[email protected]" <[email protected]>
Subject: Problematic PDF

>Hi Folks,
>
>I have a problematic PDF which keeps on crashing my Nutch crawl.
>I am trying to get all data from the PDF, so content is not truncated at
>all.
>http://www.who.int/about/who_reform/who-internal-control-framework.pdf
>
>Can someone please try to see if they have any issues parsing this
>document with Tika 1.6?
>
>I have tried it locally, and it seems OK. If I can confirm this with some
>other folks then I can isolate this to my Nutch crawl.
>Thank you
>Lewis
>
>--
>Lewis
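
Since the PDF parses fine standalone, one thing that may help isolate the problem to the crawl is checking that the bytes the crawler stored really are the whole file. The sketch below is a heuristic, not part of Tika or Nutch: it only looks for the `%PDF-` header near the start and the `%%EOF` marker near the end, which a truncated download would normally lack. The function names and the 1024-byte window are assumptions made for illustration, and it presumes you have already dumped the fetched content to a local file by whatever means you prefer.

```python
def looks_like_complete_pdf(data: bytes, window: int = 1024) -> bool:
    """Heuristic check that a byte buffer is an untruncated PDF.

    A well-formed PDF carries a '%PDF-' header near the start and a
    '%%EOF' marker near the end. A fetch that was cut short usually
    keeps the header but loses the trailer.
    """
    has_header = b"%PDF-" in data[:window]
    # Slicing with a negative start is safe on short or empty buffers.
    has_trailer = b"%%EOF" in data[-window:]
    return has_header and has_trailer


def check_file(path: str) -> bool:
    """Read a locally saved copy of the fetched content and test it.

    'path' is a hypothetical location where the crawled bytes were dumped.
    """
    with open(path, "rb") as fh:
        return looks_like_complete_pdf(fh.read())
```

If `check_file` returns False for the crawled copy but True for a copy downloaded directly, truncation during the fetch is a likely suspect; if both return True, the difference is more likely in how the parser is invoked from the crawl.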
