Re: Nutch 2.x parse MajorCode, MinorCode

kiran chitturi Tue, 30 Oct 2012 08:50:40 -0700

Hi Julien,

The parsechecker works fine for me too, but parsing fails when I do the
complete crawl and try to save it in the database. I do not know where it is
failing. I can check back if you want me to.


Thanks!
Kiran

On Tue, Oct 30, 2012 at 11:06 AM, Julien Nioche <
[email protected]> wrote:

> *./nutch parsechecker -D http.agent.name="tralala" -D
> http.content.limit=-1
> -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf*
>
> works absolutely fine in both the trunk and the 2.x branch. Try from the
> runtime/local/bin directory and check the logs for more details.
>
> On 30 October 2012 13:54, kiran chitturi <[email protected]>
> wrote:
>
> > Interestingly, the Tika jar I downloaded separately is able to parse
> > all the text from the PDF files, while the Nutch Tika parser is failing
> > for some of the files. I have set the content.limit to -1.
> >
> > The error message is '2012-10-30 09:30:37,382 WARN  parse.ParseUtil -
> > Unable to successfully parse content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf of type
> > application/pdf'
> >
> > for the failed PDF files. I could see some title and text when debugging
> > in Eclipse, but I could see it failing due to the parse codes.
> >
> > Thank you.
> > Kiran
> >
> > On Tue, Oct 30, 2012 at 8:58 AM, kiran chitturi
> > <[email protected]> wrote:
> >
> > > Hi
> > >
> > > I did not set the content limit to -1, but I have set it high enough to
> > > be able to go through the documents that I am parsing. I could see some
> > > title and text, but I am not sure how much it is able to do. I am going
> > > to try using Tika separately to process the documents. If all of it goes
> > > through tika-1.2 separately, then I will have to debug where I am getting
> > > the error here.
> > >
> > > Many Thanks,
> > > Kiran.
> > >
> > >
> > > On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche <
> > > [email protected]> wrote:
> > >
> > >> Hi
> > >>
> > >> Look at the code for the class ParseStatusCodes. These codes simply
> > >> indicate that the parsing failed; they are not the cause of the failure
> > >> itself. Do you get the entire text for the document or just what the
> > >> parser managed to process until it failed? Did you set the content limit
> > >> to -1?
> > >>
> > >> Thanks
> > >>
> > >> Julien
> > >>
> > >>
> > >> On 29 October 2012 19:17, kiran chitturi <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi!
> > >> >
> > >> > I am debugging Nutch with Eclipse and I have found that some PDF
> > >> > files which are not successfully parsed have majorCode 2 and
> > >> > minorCode 200, and files which are successfully parsed have
> > >> > majorCode 1 and minorCode 0.
> > >> >
> > >> > Can someone please explain, or point me to, what these codes mean?
> > >> >
> > >> > Actually, the title, text and everything else is parsed in the failed
> > >> > parses, but somehow, because of the codes, it is not saving the fields
> > >> > and is returning as a failed parse.
> > >> >
> > >> > Thanks for your help.
> > >> >
> > >> > Regards,
> > >> > --
> > >> > Kiran Chitturi
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> > >> http://twitter.com/digitalpebble
> > >>
> > >
> > >
> > >
> > > --
> > > Kiran Chitturi
> > >
> > >
> >
> >
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Kiran Chitturi
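
For reference, on the original question of what majorCode 2 / minorCode 200
and majorCode 1 / minorCode 0 mean: below is a minimal sketch that maps the
numeric codes to names. The constants are copied by hand, as I recall them
from the ParseStatusCodes class mentioned above, so treat the values as an
assumption and check them against the class in your own checkout. The class
name ParseCodeNames and the method describe() are made up for this example
and are not part of Nutch.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch. The code values below are
// assumed to mirror the constants in Nutch's ParseStatusCodes; verify
// them against your own source tree before relying on them.
public class ParseCodeNames {

    static final Map<Integer, String> MAJOR = new TreeMap<>();
    static final Map<Integer, String> MINOR = new TreeMap<>();

    static {
        // Major codes: overall outcome of the parse.
        MAJOR.put(0, "NOTPARSED");
        MAJOR.put(1, "SUCCESS");
        MAJOR.put(2, "FAILED");

        // Minor codes: more detail about the outcome.
        MINOR.put(0, "SUCCESS_OK");
        MINOR.put(100, "SUCCESS_REDIRECT");
        MINOR.put(200, "FAILED_EXCEPTION");
        MINOR.put(202, "FAILED_TRUNCATED");
        MINOR.put(203, "FAILED_INVALID_FORMAT");
        MINOR.put(204, "FAILED_MISSING_PARTS");
        MINOR.put(205, "FAILED_MISSING_CONTENT");
    }

    // Returns "MAJOR/MINOR" names, or UNKNOWN for codes not in the tables.
    static String describe(int major, int minor) {
        return MAJOR.getOrDefault(major, "UNKNOWN")
                + "/" + MINOR.getOrDefault(minor, "UNKNOWN");
    }

    public static void main(String[] args) {
        System.out.println(describe(2, 200)); // FAILED/FAILED_EXCEPTION
        System.out.println(describe(1, 0));   // SUCCESS/SUCCESS_OK
    }
}
```

If those values are right, 2/200 means the parser threw an exception at some
point during the parse. That fits Julien's remark that the codes only report
the failure and are not its cause; the exception itself should be in the logs.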
