I have just tested with HBase as the Gora backend, and all the PDF files are
parsed on the first attempt. This is an SQL-backend problem, as Julien
noted before.
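
For anyone who wants to reproduce the comparison, switching the backend only
needs the Gora storage class changed. A minimal sketch of the relevant
nutch-site.xml entry (assuming Nutch 2.x, with gora-hbase enabled as a
dependency and a local HBase instance running):

```xml
<!-- nutch-site.xml: point the Gora storage layer at HBase instead of the
     SQL store. Sketch only; assumes gora-hbase is on the classpath and a
     standalone HBase from the quick-start tutorial is up. -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Use HBase as the Gora backend.</description>
</property>
```

With that in place, the same 'bin/nutch parse <batchId>' run parsed all the
PDFs in one go.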

Thanks,
Kiran.


On Thu, Nov 1, 2012 at 1:49 PM, <[email protected]> wrote:

> Hi,
>
> I think that in order to be sure this is a gora-sql problem, you need to do
> the same crawl with Nutch/HBase. It should not take much time if you run
> it in local mode. Simply install HBase and follow the quick-start tutorial.
>
> Alex.
>
> -----Original Message-----
> From: kiran chitturi <[email protected]>
> To: user <[email protected]>
> Sent: Thu, Nov 1, 2012 9:29 am
> Subject: Re: bin/nutch parsechecker -dumpText works but bin/nutch parse
> fails
>
>
> Hi,
>
> I have created an issue (https://issues.apache.org/jira/browse/NUTCH-1487).
>
> Do you think this is because of the SQL backend? It is failing for PDF files
> but working for HTML files.
>
> Could the problem be due to a bug in the Tika parser code (since the Tika
> plugin handles the PDF parsing)?
>
> I am interested in fixing this problem if I can find out where the issue
> starts.
>
> Does anyone have input on this?
>
> Thanks,
> Kiran.
>
>
>
> On Thu, Nov 1, 2012 at 10:15 AM, Julien Nioche <
> [email protected]> wrote:
>
> > Hi
> >
> > Yes, please do open an issue. The docs should be parsed in one go, and I
> > suspect (yet another) issue with the SQL backend.
> >
> > Thanks
> >
> > J
> >
> > On 1 November 2012 13:48, kiran chitturi <[email protected]>
> > wrote:
> >
> > > Thank you, alxsss, for the suggestion. It displays the actualSize and
> > > inHeaderSize for every file, plus two more lines in the logs, but it did
> > > not give much information even when I set ParserJob to DEBUG.
> > >
> > > I had the same problem when I re-compiled everything today. I have to
> > > run the parse command multiple times to get all the files parsed.
> > >
> > > I am using SQL with Gora; it is a MySQL database.
> > >
> > > For now, at least the files are getting parsed. Do I need to open an
> > > issue for this?
> > >
> > > Thank you,
> > >
> > > Regards,
> > > Kiran.
> > >
> > >
> > > On Wed, Oct 31, 2012 at 4:36 PM, Julien Nioche <
> > > [email protected]> wrote:
> > >
> > > > Hi Kiran
> > > >
> > > > Interesting. Which backend are you using with Gora? The SQL one? It
> > > > could be a problem at that level.
> > > >
> > > > Julien
> > > >
> > > > On 31 October 2012 17:01, kiran chitturi <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Julien,
> > > > >
> > > > > I have just noticed something when running the parse.
> > > > >
> > > > > First, when I ran the parse command 'sh bin/nutch parse
> > > > > 1351188762-1772522488', the parsing of all the PDF files failed.
> > > > >
> > > > > When I ran the command again, one PDF file got parsed. The next
> > > > > time, another PDF file got parsed.
> > > > >
> > > > > Once I had run the parse command as many times as there are PDF
> > > > > files, all the PDF files were parsed.
> > > > >
> > > > > In my case, I ran it 17 times before all the PDF files were parsed.
> > > > >
> > > > > This sounds strange; do you think it is some configuration problem?
> > > > >
> > > > > I have tried this twice, and the same thing happened both times.
> > > > >
> > > > > I am not sure why this is happening.
> > > > >
> > > > > Thanks for your help.
> > > > >
> > > > > Regards,
> > > > > Kiran.
> > > > >
> > > > >
> > > > > On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > >
> > > > > > > Sorry about that. I did not notice the parse codes are actually
> > > > > > > Nutch's and not Tika's.
> > > > > >
> > > > > > No problem!
> > > > > >
> > > > > >
> > > > > > > The setup is local on a Mac desktop, and I am using it through
> > > > > > > the command line and remote debugging through Eclipse (
> > > > > > > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
> > > > > > > ).
> > > > > > >
> > > > > >
> > > > > > OK
> > > > > >
> > > > > > >
> > > > > > > I have set both http.content.limit and file.content.limit to
> > > > > > > -1. The logs just say 'WARN  parse.ParseUtil - Unable to
> > > > > > > successfully parse content
> > > > > > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf
> > > > > > > of type application/pdf'.
> > > > > > >
> > > > > >
> > > > > > You set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml,
> > > > > > right? (Not in $NUTCH_HOME/conf/nutch-site.xml, unless you call
> > > > > > 'ant clean runtime'.)
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > All the HTML files are getting parsed, and when I crawl this
> > > > > > > page (http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the
> > > > > > > HTML files and some of the PDF files get parsed. Roughly half
> > > > > > > of the PDF files get parsed, and the other half do not.
> > > > > > >
> > > > > >
> > > > > > Do the ones that are not parsed have something in common? Length?
> > > > > >
> > > > > >
> > > > > > > I am not sure what is causing the problem, since, as you said,
> > > > > > > parsechecker actually works. I want the parser to crawl the
> > > > > > > full text of the PDFs as well as the metadata and title.
> > > > > > >
> > > > > >
> > > > > > OK
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > The meta tags are also getting crawled for the PDFs whose
> > > > > > > parsing failed.
> > > > > > >
> > > > > >
> > > > > > Indeed, they would be discarded because of the failure even if
> > > > > > they were successfully extracted. The current mechanism does not
> > > > > > cater for semi-failures.
> > > > > >
> > > > > > J.
> > > > > >
> > > > > > --
> > > > > > Open Source Solutions for Text Engineering
> > > > > >
> > > > > > http://digitalpebble.blogspot.com/
> > > > > > http://www.digitalpebble.com
> > > > > > http://twitter.com/digitalpebble
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Kiran Chitturi
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>
>
>
>


-- 
Kiran Chitturi
