Re: How to get page content of crawled pages

peterbarretto Fri, 08 Feb 2013 21:04:46 -0800

Hi Lewis,

I managed to get the code working by adding the method below to the
MongodbWriter class (public class MongodbWriter implements
NutchIndexWriter) in MongodbWriter.java:


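        public void delete(String key) throws IOException {
                // Intentionally does nothing; added only so that the class
                // overrides the abstract delete(String) in NutchIndexWriter.
                return;
        }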
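
For anyone following along, here is a minimal, self-contained sketch of why an empty delete() is enough to make the class compile. It does not use the real Nutch or MongoDB classes: IndexWriterSketch, MongodbWriterSketch and the write(String, String) method are hypothetical stand-ins, and the real NutchIndexWriter declares other methods as well.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for NutchIndexWriter (the real interface differs).
interface IndexWriterSketch {
    void write(String key, String text) throws IOException;
    void delete(String key) throws IOException;
}

// Hypothetical stand-in for MongodbWriter; keeps documents in a list
// instead of sending them to MongoDB.
class MongodbWriterSketch implements IndexWriterSketch {
    final List<String> stored = new ArrayList<>();

    @Override
    public void write(String key, String text) throws IOException {
        stored.add(key + " -> " + text);
    }

    // No-op: without this override javac rejects the class as
    // "not abstract and does not override abstract method delete(String)".
    @Override
    public void delete(String key) throws IOException {
        // Deletions are silently ignored in this sketch.
    }
}

class MongodbWriterSketchDemo {
    public static void main(String[] args) throws IOException {
        MongodbWriterSketch writer = new MongodbWriterSketch();
        writer.write("http://example.com/", "page text");
        writer.delete("http://example.com/"); // has no effect
        System.out.println(writer.stored.size());
    }
}
```

Note the trade-off: with a no-op delete(), documents that are later removed from the crawl would presumably stay in the store, so this is a way to get compiling rather than a complete implementation.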

After that, the crawled data was stored in MongoDB.
The only issue is that it stores only the text of the page and not the full
HTML content of the page.
How do I store the full HTML content of the page as well?
I hope to see the patches soon.
Thanks



lewis john mcgibbney wrote
> Certainly.
> I am currently reviewing the code and will hopefully have patches for
> Nutch trunk cooked up for tomorrow.
> I'll update this thread likewise.
> Thanks
> Lewis
> 
> On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote:
>> Hi Lewis,
>>
>> I am new to Java and I don't know how to inherit all public methods from
>> NutchIndexWriter.
>> Can you help me with that? Then I can rebuild and check if it works.
>>
>>
>> lewis john mcgibbney wrote
>>> As you will see the code has not been amended in a year or so.
>>> The positive side is that you only seem to be getting one issue with javac.
>>>
>>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:
>>>
>>>>
>>>>
>>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>>>> error: MongodbWriter is not abstract and does not override abstract method
>>>> delete(String) in NutchIndexWriter
>>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>>>>
>>> Sort this error out by inheriting all public methods from NutchIndexWriter
>>> for starters. I take it you are not developing from within Eclipse? There
>>> this would have been flagged up immediately. This should at least enable
>>> you to compile the code.
>>>
>>>
>>>>
>>>> I have already crawled some URLs and I need to move those to MongoDB.
>>>> Is there any easy-to-use code to do that?
>>>
>>>
>>> Not apart from hacking the code as you are already doing. The code you are
>>> pulling is not part of the official Nutch codebase and, to be honest, a few
>>> of us didn't even know about it until you brought it to our attention :0)
>>>
>>> There is no silver bullet here, just take your time and we will get it
>>> working.
>>> Lewis
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
> 
> 
> -- 
> Lewis





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
Sent from the Nutch - User mailing list archive at Nabble.com.
