Re: How to get page content of crawled pages

peterbarretto Sun, 10 Feb 2013 20:41:55 -0800

Hi Lewis,

I downloaded the Nutch copy from
http://apache.techartifact.com/mirror/nutch/1.6/



lewis john mcgibbney wrote
> Hi,
> Once I get access to my office I am going to build the patches from trunk.
> Is it trunk that you are using?
> Thanks
> Lewis
> 
> On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@> wrote:
> 
>> Hi Lewis,
>>
>> I managed to get the code working by adding the function below to
>> MongodbWriter.java, inside the class declared as
>> "public class MongodbWriter implements NutchIndexWriter":
>>
>>     // No-op: present only so the class satisfies NutchIndexWriter.
>>     public void delete(String key) throws IOException {
>>         return;
>>     }
>>
>> And the crawled data was getting stored in mongodb.
>> The only issue is that it stores only the text of the page and not the
>> full HTML content of the page.
>> How do I store the full HTML content of the page as well?
>> Hope to see the patches soon.
>> Thanks
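
The workaround quoted above (adding a delete(String) method so that the class satisfies the interface it implements) can be sketched in isolation. This is only an illustration under stated assumptions: IndexWriterSketch is a hypothetical stand-in for NutchIndexWriter, its write(String) method is invented for the example, and InMemoryWriterSketch stores strings in a list rather than talking to MongoDB. The real interface declares other methods with other signatures.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for NutchIndexWriter. Only delete(String) is taken
// from the javac error in the thread; write(String) is an assumption.
interface IndexWriterSketch {
    void write(String doc) throws IOException;
    void delete(String key) throws IOException;
}

// Hypothetical writer that keeps documents in memory instead of MongoDB.
class InMemoryWriterSketch implements IndexWriterSketch {
    final List<String> stored = new ArrayList<>();

    @Override
    public void write(String doc) throws IOException {
        stored.add(doc);
    }

    // No-op, as in the workaround: removing this method would bring back the
    // "is not abstract and does not override abstract method" error.
    @Override
    public void delete(String key) throws IOException {
        // intentionally empty: nothing is removed from the store
    }
}

class WriterSketchDemo {
    public static void main(String[] args) throws IOException {
        InMemoryWriterSketch writer = new InMemoryWriterSketch();
        writer.write("page text");
        writer.delete("http://example.com/"); // compiles and runs, does nothing
        System.out.println("stored documents: " + writer.stored.size());
    }
}
```

A no-op delete is enough to get past the compiler, but it also means that anything the indexer asks to delete would stay in the store, so a fuller version would presumably remove the matching record instead.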
>>
>>
>>
>> lewis john mcgibbney wrote
>> > Certainly.
>> > I am currently reviewing the code and will hopefully have patches for
>> > Nutch trunk cooked up for tomorrow.
>> > I'll update this thread likewise.
>> > Thanks
>> > Lewis
>> >
>> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote:
>> >> Hi Lewis,
>> >>
>> >> I am new to Java and I don't know how to inherit all public methods
>> >> from NutchIndexWriter.
>> >> Can you help me with that? Then I can rebuild and check if it works.
>> >>
>> >>
>> >> lewis john mcgibbney wrote
>> >>> As you will see the code has not been amended in a year or so.
>> >>> The positive side is that you only seem to be getting one issue with
>> >>> javac.
>> >>>
>> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> >>>> error: MongodbWriter is not abstract and does not override abstract
>> >>>> method delete(String) in NutchIndexWriter
>> >>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>> >>>>
>> >>> Sort this error out by inheriting all public methods from
>> >>> NutchIndexWriter for starters. I take it you are not developing from
>> >>> within Eclipse? As this would have been flagged up immediately. This
>> >>> should at least enable you to compile the code.
>> >>>
>> >>>
>> >>>>
>> >>>> I have already crawled some URLs and I need to move those to MongoDB.
>> >>>> Is there easy-to-use code to do that?
>> >>>
>> >>>
>> >>> Not apart from hacking the code as you are already doing. The code you
>> >>> are pulling is not part of the official Nutch codebase and, to be
>> >>> honest, a few of us didn't even know about it until you brought it to
>> >>> our attention :0)
>> >>>
>> >>> There is no silver bullet here, just take your time and we will get it
>> >>> working.
>> >>> Lewis
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>> >
>> >
>> > --
>> > Lewis
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> *Lewis*





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039613.html
Sent from the Nutch - User mailing list archive at Nabble.com.
