Interestingly, the Tika jar I downloaded separately is able to parse all
the text from the PDF files, while the Nutch Tika parser fails for some of
the files. I have set the content.limit to -1.

The error message is '2012-10-30 09:30:37,382 WARN  parse.ParseUtil -
Unable to successfully parse content
http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf of type
application/pdf'

for the failed PDF files. I can see some title and text when debugging in
Eclipse, but the parse is still reported as failed because of the parse
codes.
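
For reference, a minimal sketch of how the major/minor codes mentioned in this thread could be turned into readable names. ParseCodeDecoder and describe are hypothetical names, not part of Nutch, and the numeric values are assumed to mirror the constants in Nutch's ParseStatusCodes, so they should be checked against the source of the Nutch version in use.

```java
import java.util.Map;

// Hypothetical helper, not part of Nutch. The numeric values below are
// assumed to mirror the constants in Nutch's ParseStatusCodes; verify them
// against the source of the Nutch version in use.
final class ParseCodeDecoder {
    // Major codes: the overall outcome of the parse.
    private static final Map<Integer, String> MAJOR = Map.of(
        0, "NOTPARSED",
        1, "SUCCESS",
        2, "FAILED");

    // Minor codes: a finer-grained reason attached to the major code.
    private static final Map<Integer, String> MINOR = Map.of(
        0, "SUCCESS_OK",
        100, "SUCCESS_REDIRECT",
        200, "FAILED_EXCEPTION",
        202, "FAILED_TRUNCATED",
        203, "FAILED_INVALID_FORMAT",
        204, "FAILED_MISSING_PARTS",
        205, "FAILED_MISSING_CONTENT");

    // Returns "MAJOR / MINOR", falling back to UNKNOWN(n) for unmapped values.
    static String describe(int major, int minor) {
        return MAJOR.getOrDefault(major, "UNKNOWN(" + major + ")")
            + " / "
            + MINOR.getOrDefault(minor, "UNKNOWN(" + minor + ")");
    }

    public static void main(String[] args) {
        System.out.println(describe(2, 200)); // the failing PDFs
        System.out.println(describe(1, 0));   // the successfully parsed files
    }
}
```

Under these assumptions, majorCode 2 with minorCode 200 would mean the parser threw an exception, which fits the point that the code only reports the failure and the underlying exception is the thing to look for.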

Thank you.
Kiran

On Tue, Oct 30, 2012 at 8:58 AM, kiran chitturi
<[email protected]> wrote:

> Hi
>
> I did not set the content limit to -1, but I have set it high enough to
> cover the documents that I am parsing. I can see some title and text, but
> I am not sure how much of each document is being parsed. I am going to try
> running Tika separately on the documents. If all of them go through
> tika-1.2 on its own, then I will have to debug where the error is coming
> from here.
>
> Many Thanks,
> Kiran.
>
>
> On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi
>>
>> Look at the code for the class ParseStatusCodes. The code simply
>> indicates that the parsing failed; it is not the cause of the failure
>> itself. Do you get the entire text for the document, or just what the
>> parser managed to process until it failed? Did you set the content limit
>> to -1?
>>
>> Thanks
>>
>> Julien
>>
>>
>> On 29 October 2012 19:17, kiran chitturi <[email protected]>
>> wrote:
>>
>> > Hi!
>> >
>> > I am debugging Nutch with Eclipse and I have found that some PDF files
>> > which are not successfully parsed have majorCode 2 and minorCode 200,
>> > while files which are successfully parsed have majorCode 1 and
>> > minorCode 0.
>> >
>> > Can someone please explain, or point me to, what these codes mean?
>> >
>> > Actually, the title, text and everything else are parsed in the failed
>> > parses, but somehow because of the codes the fields are not saved and
>> > the parse is returned as failed.
>> >
>> > Thanks for your help.
>> >
>> > Regards,
>> > --
>> > Kiran Chitturi
>> >
>>
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
>
> --
> Kiran Chitturi
>
>


-- 
Kiran Chitturi
