Hi Julien,

The parsechecker works fine for me too, but parsing fails when I run the complete crawl and try to save the results in the database. I do not know where it is failing. I can check back if you want me to.
Thanks!
Kiran

On Tue, Oct 30, 2012 at 11:06 AM, Julien Nioche <[email protected]> wrote:

> *./nutch parsechecker -D http.agent.name="tralala" -D http.content.limit=-1
> -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf*
>
> works absolutely fine in both the trunk and 2.x branch. Try from the
> runtime/local/bin directory and check the logs for more details.
>
> On 30 October 2012 13:54, kiran chitturi <[email protected]> wrote:
>
> > Interestingly, the tika jar I have downloaded separately is able to parse
> > all the text from the pdf files while the nutch tika parser is failing for
> > some of the files. I have set the content.limit to -1.
> >
> > The error message is '2012-10-30 09:30:37,382 WARN parse.ParseUtil -
> > Unable to successfully parse content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf of type
> > application/pdf'
> >
> > for the failed pdf files. I could see some title and text when I am
> > debugging in Eclipse but I could see it failing due to the parseCodes.
> >
> > Thank you.
> > Kiran
> >
> > On Tue, Oct 30, 2012 at 8:58 AM, kiran chitturi <[email protected]> wrote:
> >
> > > Hi
> > >
> > > I did not set the content limit to -1 but I have set it high enough to be
> > > able to go through the documents that I am parsing. I could see some title
> > > and text but I am not sure how much it is able to do. I am going to try using
> > > tika separately and try to process the documents. If all of it goes through
> > > tika-1.2 separately then I have to try to debug where I am getting the
> > > error here.
> > >
> > > Many Thanks,
> > > Kiran.
> > >
> > > On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche <[email protected]> wrote:
> > >
> > >> Hi
> > >>
> > >> Look at the code for the class ParseStatusCodes. This simply indicates that
> > >> the parsing failed and is not the cause of the failure itself. Do you get
> > >> the entire text for the document or just what the parser managed to process
> > >> until it failed? Did you set the content limit to -1?
> > >>
> > >> Thanks
> > >>
> > >> Julien
> > >>
> > >> On 29 October 2012 19:17, kiran chitturi <[email protected]> wrote:
> > >>
> > >> > Hi!
> > >> >
> > >> > I am debugging nutch with eclipse and I have found out that some pdf files
> > >> > which are not successfully parsed have majorCode as 2 and minorCode as 200
> > >> > and files which are successfully parsed have majorCode 1 and minorCode 0.
> > >> >
> > >> > Can someone please explain or point me to what these codes mean?
> > >> >
> > >> > Actually, the title, text and everything is parsed in the failed parses but
> > >> > somehow because of the codes it is not saving the fields and returning as
> > >> > failed parsing.
> > >> >
> > >> > Thanks for your help.
> > >> >
> > >> > Regards,
> > >> > --
> > >> > Kiran Chitturi
> > >>
> > >> --
> > >> Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> > >> http://twitter.com/digitalpebble
> > >
> > > --
> > > Kiran Chitturi
> >
> > --
> > Kiran Chitturi
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Kiran Chitturi
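
For anyone reading this thread later, the two code pairs reported above (majorCode 1 / minorCode 0 and majorCode 2 / minorCode 200) can be turned into readable names with a small standalone sketch. It does not use Nutch itself. The constant names and values are an assumption about what the ParseStatusCodes class defines, so please check them against the class in your own checkout. ParseCodeDecoder and describe are hypothetical names used only for illustration.

```java
import java.util.Map;

// Standalone sketch, independent of Nutch. The name/value pairs below are
// ASSUMED to mirror ParseStatusCodes; verify them against your Nutch source.
class ParseCodeDecoder {

    // Assumed major codes: overall outcome of the parse.
    static final Map<Integer, String> MAJOR = Map.of(
            0, "NOTPARSED",
            1, "SUCCESS",
            2, "FAILED");

    // Assumed minor codes: only the two values seen in this thread.
    static final Map<Integer, String> MINOR = Map.of(
            0, "SUCCESS_OK",
            200, "FAILED_EXCEPTION");

    // Hypothetical helper: maps a (major, minor) pair to a readable label,
    // falling back to UNKNOWN(n) for codes not listed above.
    static String describe(int majorCode, int minorCode) {
        String major = MAJOR.getOrDefault(majorCode, "UNKNOWN(" + majorCode + ")");
        String minor = MINOR.getOrDefault(minorCode, "UNKNOWN(" + minorCode + ")");
        return major + "/" + minor;
    }

    public static void main(String[] args) {
        System.out.println(describe(1, 0));   // SUCCESS/SUCCESS_OK
        System.out.println(describe(2, 200)); // FAILED/FAILED_EXCEPTION
    }
}
```

If the assumed mapping holds, 2/200 would mean that an exception was thrown during parsing. That fits Julien's point that the codes only report the failure, and that the cause has to be found in the logs.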

