Hi, Kumar,

Designing a crawler is not an easy job, and how hard it gets depends on your
goals. The most complicated case is crawling the entire Web.

http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669

This book might give you a hand.
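
For a small, bounded crawl, the pages can simply stay in ordinary Java
collections. Below is a rough sketch of that idea, not Nutch code:
InMemoryCrawler, crawl, and the fetcher parameter are names made up for
illustration, the URLs are placeholders, and the regex link extraction is
deliberately naive. The fetcher is passed in as a function so that the HTTP
client (and politeness rules such as robots.txt) can be chosen separately.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of Nutch. */
class InMemoryCrawler {
    // Naive pattern: absolute http(s) links in href="..." without fragments.
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"#]+)\"");

    /**
     * Breadth-first crawl starting at seed, storing at most maxPages pages.
     * The fetcher maps a URL to its body and returns null when a fetch fails.
     */
    static Map<String, String> crawl(String seed, int maxPages,
                                     Function<String, String> fetcher) {
        Map<String, String> pages = new LinkedHashMap<>(); // URL -> body, in fetch order
        Deque<String> frontier = new ArrayDeque<>();       // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();                // URLs already queued
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && pages.size() < maxPages) {
            String url = frontier.poll();
            String body = fetcher.apply(url);
            if (body == null) {
                continue; // skip pages that could not be fetched
            }
            pages.put(url, body);
            Matcher m = HREF.matcher(body);
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) { // true only the first time a link is seen
                    frontier.add(link);
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // A fake two-page "site" stands in for real HTTP fetching.
        Map<String, String> site = new HashMap<>();
        site.put("http://example.test/", "<a href=\"http://example.test/a\">a</a>");
        site.put("http://example.test/a", "<a href=\"http://example.test/\">home</a>");
        Map<String, String> pages = crawl("http://example.test/", 10, site::get);
        System.out.println(pages.keySet());
    }
}
```

The returned map is a plain collection, so it can be filtered, sorted, or
handed to other code directly. This only works while everything fits in memory.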

Thanks,
LB

On Thu, Dec 16, 2010 at 12:28 AM, Anurag <[email protected]> wrote:

>
> Can you tell me how you designed your crawler? Was it by writing code like
> CrawlDb.java
> <http://www.docjar.com/html/api/org/apache/nutch/crawl/CrawlDb.java.html>?
>
> Actually, writing your own crawler is important stuff, so I want to know.
>
> Thanks
>
> On Wed, Dec 15, 2010 at 9:56 AM, Bing Li [via Lucene] <[email protected]> wrote:
>
> > Hi, all,
> >
> > I am a new Nutch user. Before learning about Nutch, I designed a crawler
> > myself. However, its quality was not good, so I decided to try Nutch.
> >
> > However, after reading some material about Nutch, I noticed that Nutch puts
> > all of the crawled pages into persistent Lucene indexes. In my project, I
> > hope to get the crawled data in memory so that I can manipulate it in Java
> > or C# collections. I don't want to retrieve the data from the indexes built
> > by Nutch.
> >
> > Could you give me a solution to that? Thanks so much!
> >
> > Best regards,
> > Li Bing
> >
> >
>
>
>
> --
> Kumar Anurag
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Get-Crawled-Data-in-Java-or-C-Collections-tp2089972p2092990.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
