Interestingly, the Tika jar I downloaded separately is able to parse all the text from the PDF files, while the Nutch Tika parser fails for some of them. I have set the content.limit to -1.
For the failed PDF files, the error message is:

2012-10-30 09:30:37,382 WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf of type application/pdf

When debugging in Eclipse I can see some title and text being extracted, but the parse is still reported as failed because of the parse codes.

Thank you.
Kiran

On Tue, Oct 30, 2012 at 8:58 AM, kiran chitturi <[email protected]> wrote:

> Hi
>
> I did not set the content limit to -1, but I have set it high enough to be
> able to go through the documents that I am parsing. I could see some title
> and text, but I am not sure how much it is able to do. I am gonna try using
> Tika separately and try to process the documents. If all of it goes through
> tika-1.2 separately, then I have to try to debug where I am getting the
> error here.
>
> Many Thanks,
> Kiran.
>
> On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche <[email protected]> wrote:
>
>> Hi
>>
>> Look at the code for the class ParseStatusCodes. This simply indicates that
>> the parsing failed and is not the cause for the failing itself. Do you get
>> the entire text for the document or just what the parser managed to process
>> until it failed? Did you set the content limit to -1?
>>
>> Thanks
>>
>> Julien
>>
>> On 29 October 2012 19:17, kiran chitturi <[email protected]> wrote:
>>
>> > Hi!
>> >
>> > I am debugging Nutch with Eclipse and I have found out that some PDF files
>> > which are not successfully parsed have majorCode as 2 and minorCode as 200,
>> > and files which are successfully parsed have majorCode 1 and minorCode 0.
>> >
>> > Can someone please explain to me, or point me to, what these codes mean?
>> >
>> > Actually, the title, text and everything is parsed in the failed parses, but
>> > somehow because of the codes it is not saving the fields and is returning as
>> > failed parsing.
>> >
>> > Thanks for your help.
>> >
>> > Regards,
>> > --
>> > Kiran Chitturi
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>
> --
> Kiran Chitturi

--
Kiran Chitturi
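
For anyone reading this thread later: the majorCode/minorCode pairs mentioned above (1/0 for the good parses, 2/200 for the failed ones) can be turned into readable names with a small lookup. The sketch below is a hypothetical standalone helper, not Nutch code. The numeric values are what I believe ParseStatusCodes defines (NOTPARSED=0, SUCCESS=1, FAILED=2; SUCCESS_OK=0, SUCCESS_REDIRECT=100, FAILED_EXCEPTION=200), so please verify them against the Nutch version you are running. If they hold, 2/200 would mean the parser threw an exception, which fits Julien's point that the code reports the failure rather than causing it.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Nutch) that maps the numeric
 * majorCode/minorCode pair seen in the debugger to readable names.
 * The values below are assumed from Nutch's ParseStatusCodes;
 * check them against your own Nutch source before relying on them.
 */
class ParseCodeDecoder {
    private static final Map<Integer, String> MAJOR = new HashMap<>();
    private static final Map<Integer, String> MINOR = new HashMap<>();

    static {
        // Major codes: overall outcome of the parse (assumed values).
        MAJOR.put(0, "NOTPARSED");
        MAJOR.put(1, "SUCCESS");
        MAJOR.put(2, "FAILED");

        // Minor codes: detail for the outcome (assumed values, not exhaustive).
        MINOR.put(0, "SUCCESS_OK");
        MINOR.put(100, "SUCCESS_REDIRECT");
        MINOR.put(200, "FAILED_EXCEPTION");
    }

    /** Returns "MAJOR / MINOR", or UNKNOWN(n) for codes not in the tables. */
    static String describe(int major, int minor) {
        String majorName = MAJOR.getOrDefault(major, "UNKNOWN(" + major + ")");
        String minorName = MINOR.getOrDefault(minor, "UNKNOWN(" + minor + ")");
        return majorName + " / " + minorName;
    }

    public static void main(String[] args) {
        System.out.println(describe(1, 0));   // prints "SUCCESS / SUCCESS_OK"
        System.out.println(describe(2, 200)); // prints "FAILED / FAILED_EXCEPTION"
    }
}
```

If 2/200 does map to an exception in your version, the next step would be to find the stack trace behind it (for example in the Nutch log at a more verbose level) rather than looking at the codes themselves.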

