Hi, Bing,

Nutch puts all the crawled pages in HDFS or the local FS, under the
"segments" directory. It provides APIs to retrieve the page content;
you can find them in the web app part of Nutch. The "cached" view of
search results is read through those APIs.
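If you want to poke at the raw data yourself, each segment's "content"
subdirectory is a Hadoop MapFile, so you can read its "data" files with
the plain SequenceFile API. Something like this untested sketch (names
are from the Nutch 1.x / Hadoop 0.20 APIs; the part-00000 path is just
an example, adjust it for your own crawl):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class DumpSegmentContent {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // e.g. crawl/segments/20101215.../content/part-00000/data
        Path data = new Path(args[0]);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();        // key: the page URL
        Content content = new Content(); // value: raw bytes + metadata
        while (reader.next(url, content)) {
          // the raw page bytes are now in memory, do whatever you like
          System.out.println(url + " : " + content.getContentType()
              + " (" + content.getContent().length + " bytes)");
        }
        reader.close();
      }
    }

This is roughly what Nutch's own SegmentReader does, so you can also
look at that class for reference.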
To process the content while crawling, you can try writing a Nutch
plug-in. You can find a tutorial on the official Nutch site.
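A parse filter is probably the easiest kind of plug-in for this: it
gets called for every page as it is parsed. Just as a rough idea, a
sketch against the Nutch 1.x HtmlParseFilter interface (untested; you
also need a plugin.xml descriptor and an entry in plugin.includes,
which the tutorial explains):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class MyParseFilter implements HtmlParseFilter {
      private Configuration conf;

      // Called once per page during parsing; you get the raw Content
      // plus the parsed DOM, so you can process pages on the fly here.
      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        System.out.println("parsed " + content.getUrl());
        return parseResult;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }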

Or you can try Nutch 2.0. It is still under development; you can check
it out from SVN. It puts the crawled data into HBase or other database
systems, which is easier for you to manipulate.
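Once the data is in HBase you can read it with the normal HBase client,
independently of Nutch. A rough sketch (I am assuming the default
"webpage" table name from the gora-hbase mapping; verify it against
gora-hbase-mapping.xml in your checkout, and note that newer HBase
clients use HBaseConfiguration.create()):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanWebpageTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "webpage" is the table Nutch 2.0 uses by default with gora-hbase
        HTable table = new HTable(conf, "webpage");
        ResultScanner scanner = table.getScanner(new Scan());
        for (Result row : scanner) {
          // row keys are the (reversed) page URLs; fetch the columns
          // you need from the Result to get at the page content
          System.out.println(Bytes.toString(row.getRow()));
        }
        scanner.close();
        table.close();
      }
    }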

Thanks!
Xiao

On Wed, Dec 15, 2010 at 12:25 PM, Bing Li <[email protected]> wrote:
> Hi, all,
>
> I am a new Nutch user. Before learning about Nutch, I designed a crawler
> myself. However, its quality was not good, so I decided to try Nutch.
>
> However, after reading some materials about Nutch, I noticed that Nutch puts
> all of the crawled pages into persistent Lucene indexes. In my project, I hope
> to get the crawled data in memory, so that I can manipulate it in Java or C#
> collections. I don't want to retrieve the indexes created by Nutch.
>
> Could you give me a solution to that? Thanks so much!
>
> Best regards,
> Li Bing
>
