Re: How to get page content of crawled pages

Lewis John Mcgibbney Sat, 09 Feb 2013 10:39:11 -0800

Hi,
Once I get access to my office I am going to build the patches from trunk.
Is it trunk that you are using?
Thanks
Lewis


On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarrett...@gmail.com> wrote:

> Hi Lewis,
>
> I managed to get the code working by adding the function below to
> MongodbWriter.java, inside the class declared as
> public class MongodbWriter implements NutchIndexWriter:
>
>     public void delete(String key) throws IOException {
>         // No-op: present only to satisfy NutchIndexWriter; deletes are ignored.
>         return;
>     }
>
> With that, the crawled data was stored in MongoDB.
> The only issue is that it stores only the text of the page and not the
> full HTML content of the page.
> How do I store the full HTML content of the page as well?
> Hope to see the patches soon.
> Thanks
>
>
>
> lewis john mcgibbney wrote
> > Certainly.
> > I am currently reviewing the code and will hopefully have patches for
> > Nutch trunk cooked up for tomorrow.
> > I'll update this thread accordingly.
> > Thanks
> > Lewis
> >
> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote:
> >> Hi Lewis,
> >>
> >> I am new to Java and I don't know how to inherit all public methods from
> >> NutchIndexWriter.
> >> Can you help me with that? Then I can rebuild and check if it works.
> >>
> >>
> >> lewis john mcgibbney wrote
> >>> As you will see, the code has not been amended in a year or so.
> >>> The positive side is that you only seem to be getting one issue with
> >>> javac.
> >>>
> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:
> >>>
> >>>>
> >>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
> >>>> error: MongodbWriter is not abstract and does not override abstract
> >>>> method delete(String) in NutchIndexWriter
> >>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
> >>>>
> >>> Sort this error out by inheriting all public methods from
> >>> NutchIndexWriter for starters. I take it you are not developing from
> >>> within Eclipse? It would have flagged this up immediately. Fixing it
> >>> should at least enable you to compile the code.
> >>>
> >>>
> >>>>
> >>>> I have already crawled some URLs and I need to move those to MongoDB.
> >>>> Is there easy-to-use code to do that?
> >>>
> >>>
> >>> Not apart from hacking the code as you are already doing. The code you
> >>> are pulling is not part of the official Nutch codebase and, to be honest,
> >>> a few of us didn't even know about it until you brought it to our
> >>> attention :0)
> >>>
> >>> There is no silver bullet here, just take your time and we will get it
> >>> working.
> >>> Lewis
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
> > --
> > Lewis
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*
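
For readers following the javac error quoted in this thread: Java requires
a non-abstract class to implement every abstract method of an interface it
names, which is why adding delete(String) let MongodbWriter compile. Below
is a minimal, self-contained sketch of that rule. SketchIndexWriter,
SketchWriter and SketchDemo are hypothetical stand-ins, not the real
NutchIndexWriter or MongodbWriter, and the in-memory map only stands in for
a MongoDB collection.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an index-writer interface. The real
// NutchIndexWriter declares its own method set; these names are assumptions.
interface SketchIndexWriter {
    void write(String key, String text) throws IOException;
    void delete(String key) throws IOException;
}

// A concrete class must implement every method of the interface.
// Removing delete(String) below reproduces the "is not abstract and does
// not override abstract method" compile error.
class SketchWriter implements SketchIndexWriter {
    // In-memory map used in place of a MongoDB collection (assumption).
    final Map<String, String> store = new LinkedHashMap<>();

    @Override
    public void write(String key, String text) throws IOException {
        store.put(key, text);
    }

    @Override
    public void delete(String key) throws IOException {
        // Deliberate no-op, as in the thread: the method exists only so the
        // class compiles, so deletions are silently ignored.
    }
}

class SketchDemo {
    public static void main(String[] args) throws IOException {
        SketchWriter writer = new SketchWriter();
        writer.write("http://example.com/", "page text");
        writer.delete("http://example.com/");
        // The entry is still present because delete does nothing.
        System.out.println(writer.store.size());
    }
}
```

A no-op delete is enough to get past the compiler, but entries are never
removed from the store. A writer that should honour deletions would need to
remove the matching document instead.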
