Re: How to get page content of crawled pages

peterbarretto Sun, 17 Feb 2013 08:29:52 -0800

Thanks for the patch, Lewis.

Where do I make the pom.xml changes? I can't find the file.


Also, in 1.6, if I run the command below, it returns the HTML content:
./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch \
  -nogenerate -noparse -noparsedata -noparsetext

I haven't built the patch changes because I can't find the pom.xml file.
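
For anyone following along, the compile fix quoted further down this thread (adding a do-nothing delete(String) to MongodbWriter) can be sketched roughly as below. This is only an illustration: IndexWriterSketch and MongoWriterSketch are made-up stand-ins, not the real NutchIndexWriter or MongodbWriter, and the method parameters are simplified assumptions. The point is just that a non-abstract class has to provide every method its interface declares.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Nutch's NutchIndexWriter; the real interface
// is assumed to have more methods and different parameter types.
interface IndexWriterSketch {
    void write(String doc) throws IOException;
    void delete(String key) throws IOException;
    void close() throws IOException;
}

// Hypothetical stand-in for MongodbWriter. Removing delete(String) from this
// class reproduces javac's "is not abstract and does not override abstract
// method delete(String)" error.
class MongoWriterSketch implements IndexWriterSketch {
    private final List<String> stored = new ArrayList<>();

    @Override
    public void write(String doc) throws IOException {
        stored.add(doc); // a real writer would insert into MongoDB here
    }

    @Override
    public void delete(String key) throws IOException {
        // Intentionally a no-op, as in the thread: nothing is removed.
    }

    @Override
    public void close() throws IOException {
        // Nothing to release in this sketch.
    }

    int storedCount() {
        return stored.size();
    }
}

class NoOpDeleteDemo {
    public static void main(String[] args) throws IOException {
        MongoWriterSketch writer = new MongoWriterSketch();
        writer.write("page text");
        writer.delete("http://example.com/"); // accepted, but does nothing
        writer.close();
        System.out.println("documents stored: " + writer.storedCount());
    }
}
```

A no-op delete is enough to get past the compiler, but it also means delete requests are silently ignored, so documents that should be removed would stay in the store.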


 

lewis john mcgibbney wrote
> https://issues.apache.org/jira/browse/NUTCH-1528
> 
> This is the mongodb indexer patch ported to trunk.
> 
> Can I mention that there is usually no timeline on these things, e.g.
> feature requests.
> I'm sure you can appreciate that we are all extremely busy at work with an
> array of other things, so if it takes a bit of time, then that's OK. The
> world goes on and keeps spinning. Even if we are getting bombarded by
> meteorites in Russia!!!
> 
> Please check out the patch and comment accordingly.
> 
> Regarding your issue with the full page content, I am not sure whether
> this is currently available in Nutch trunk without you writing some code.
> Full HTML markup is certainly stored in 2.x... but I don't know whether you
> are prepared to move to 2.x for your operations?
> 
> hth
> Lewis
> 
> On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@> wrote:
> 
>> Hi Lewis,
>>
>> Is this patch done??
>>
>>
>> lewis john mcgibbney wrote
>> > Hi,
>> > Once I get access to my office I am going to build the patches from
>> > trunk.
>> > Is it trunk that you are using?
>> > Thanks
>> > Lewis
>> >
>> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@> wrote:
>> >
>> >> Hi Lewis,
>> >>
>> >> I managed to get the code working by adding the function below to
>> >> MongodbWriter.java, in the class MongodbWriter (which implements
>> >> NutchIndexWriter):
>> >>
>> >>     public void delete(String key) throws IOException {
>> >>         return;
>> >>     }
>> >>
>> >> With that, the crawled data was stored in MongoDB.
>> >> The only issue was that it stored only the text of the page and not the
>> >> full HTML content of the page.
>> >> How do I store the full HTML content of the page as well?
>> >> Hope to see the patches soon.
>> >> Thanks
>> >>
>> >>
>> >>
>> >> lewis john mcgibbney wrote
>> >> > Certainly.
>> >> > I am currently reviewing the code and will hopefully have patches for
>> >> > Nutch trunk cooked up for tomorrow.
>> >> > I'll update this thread likewise.
>> >> > Thanks
>> >> > Lewis
>> >> >
>> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote:
>> >> >> Hi Lewis,
>> >> >>
>> >> >> I am new to Java and I don't know how to inherit all public methods
>> >> >> from NutchIndexWriter.
>> >> >> Can you help me with that? Then I can rebuild and check if it works.
>> >> >>
>> >> >>
>> >> >> lewis john mcgibbney wrote
>> >> >>> As you will see the code has not been amended in a year or so.
>> >> >>> The positive side is that you only seem to be getting one issue with
>> >> >>> javac
>> >> >>>
>> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:
>> >> >>>
>> >> >>>>
>> >> >>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> >> >>>> error: MongodbWriter is not abstract and does not override abstract
>> >> >>>> method delete(String) in NutchIndexWriter
>> >> >>>>     [javac] public class MongodbWriter  implements NutchIndexWriter{
>> >> >>>>
>> >> >>> Sort this error out by inheriting all public methods from
>> >> >>> NutchIndexWriter for starts. I take it you are not developing from
>> >> >>> within Eclipse? As this would have been flagged up immediately. This
>> >> >>> should at least enable you to compile the code.
>> >> >>>
>> >> >>>
>> >> >>>>
>> >> >>>> I have already crawled some URLs and I need to move those to
>> >> >>>> MongoDB. Is there easy-to-use code to do that?
>> >> >>>
>> >> >>>
>> >> >>> Not apart from hacking the code as you are already doing. The code you
>> >> >>> are pulling is not part of the official Nutch codebase and, to be
>> >> >>> honest, a few of us didn't even know about it until you brought it to
>> >> >>> our attention :0)
>> >> >>>
>> >> >>> There is no silver bullet here, just take your time and we will get it
>> >> >>> working.
>> >> >>> Lewis
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
>> >> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Lewis
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >>
>> >
>> >
>> >
>> > --
>> > *Lewis*
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040596.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> *Lewis*





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html
Sent from the Nutch - User mailing list archive at Nabble.com.
