Re: How to get page content of crawled pages

Mattmann, Chris A (388J) Wed, 30 Jan 2013 21:14:44 -0800

Hey Guys,

I'm working on a tool to grab the file content of the crawled pages. I
created a JIRA ticket and Review Board for this:


https://issues.apache.org/jira/browse/NUTCH-1526

https://reviews.apache.org/r/9119/


I am still finishing the patch, but you can see the sketch on my GitHub, as
well as in my conversation on the Nutch mailing list and in the interface
spec on the JIRA ticket.

Hopefully will have this done before next week.

Cheers,
Chris

On 1/30/13 11:05 PM, "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com>
wrote:

>As you will see, the code has not been amended in a year or so.
>The positive side is that you only seem to be getting one issue with javac.
>
>On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto
><peterbarrett...@gmail.com> wrote:
>
>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> error: MongodbWriter is not abstract and does not override abstract method
>> delete(String) in NutchIndexWriter
>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>
>Sort this error out by implementing all the methods declared in
>NutchIndexWriter, for starters. I take it you are not developing from within
>Eclipse? That would have flagged this immediately. Implementing them should
>at least enable you to compile the code.
>
>
>>
>> I have already crawled some URLs and I need to move those to MongoDB.
>> Is there any easy-to-use code to do that?
>
>
>Not apart from hacking the code as you are already doing. The code you are
>pulling is not part of the official Nutch codebase and, to be honest, a few
>of us didn't even know about it until you brought it to our attention :0)
>
>There is no silver bullet here, just take your time and we will get it
>working.
>Lewis
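
For reference, the javac error quoted above means that a concrete class must
declare every method of the interface it implements. The following is a
minimal sketch in plain Java, not the Nutch code: IndexWriterSketch and
InMemoryWriter are hypothetical stand-ins, and only the delete(String)
signature is taken from the compiler message. The write method is an
assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the real interface. Only delete(String) comes
// from the javac message; write(...) is an illustrative assumption.
interface IndexWriterSketch {
    void write(String key, String content);
    void delete(String key);
}

// A concrete (non-abstract) class compiles only if it declares every
// method of the interface it implements.
class InMemoryWriter implements IndexWriterSketch {
    private final List<String> keys = new ArrayList<>();

    @Override
    public void write(String key, String content) {
        // Content is ignored in this sketch; only the key is recorded.
        keys.add(key);
    }

    @Override
    public void delete(String key) {
        // Removes the first matching key, if present.
        keys.remove(key);
    }

    List<String> keys() {
        return keys;
    }
}
```

Removing the delete method from InMemoryWriter reproduces the same kind of
"is not abstract and does not override abstract method" error. In the real
MongodbWriter, the body of delete(String) would need to remove the matching
document from MongoDB; that part is not shown here.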
