Re: Renaming Nutch

2009-10-18 Thread Nutch Newbie
Sure, you can restructure/rename the files. All the source is there, and with time and effort you will be able to compile it, etc. I don't see any value in doing this at all. Nutch is a HUGE project; you can't simply search/replace/rebrand/compile. Even if you are successful, all the future

Re: Tried to run Crawl with depth of only 2 and getting IOException

2010-01-20 Thread Nutch Newbie
On Wed, Jan 20, 2010 at 7:10 PM, kraman kirthi.ra...@gmail.com wrote:
kirth...@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
crawl started in: tinycrawl
rootUrlDir = url
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: tinycrawl/crawldb
Injector:

Re: Injecting urls and define Inlink

2010-01-20 Thread Nutch Newbie
On Wed, Jan 20, 2010 at 8:04 AM, MyD myd.ro...@googlemail.com wrote: Because I would like to associate two separately running crawls. The only way is to associate them through the URL. Is there something like CrawlDBReader, but for writing into it instead of reading from it? An example would be great.

Re: Alt text of images as anchor text

2010-01-20 Thread Nutch Newbie
On Wed, Jan 20, 2010 at 4:16 PM, axi axi...@gmail.com wrote: After several tests, I have noticed that Nutch ignores the alt text of images inside a href= tags. So this feature isn't implemented yet, right?
What exactly do you want Nutch to do with the alt text: index it? Tokenize it? Make this field
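
To make the question concrete, here is a minimal Python sketch of treating an image's alt text as anchor text for the enclosing link. It is not Nutch code and says nothing about how Nutch's own HTML parser works; the class name AnchorAltCollector and the helper anchor_texts are hypothetical names chosen for illustration, and only the standard-library html.parser module is used.

```python
from html.parser import HTMLParser


class AnchorAltCollector(HTMLParser):
    """Collect (href, anchor_text) pairs, counting img alt text as anchor text."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor_text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._parts = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # Start collecting text for a new link.
            self._href = attrs["href"]
            self._parts = []
        elif tag == "img" and self._href is not None:
            # An image inside a link: use its alt text, if present.
            alt = attrs.get("alt")
            if alt:
                self._parts.append(alt)

    def handle_data(self, data):
        # Ordinary text inside the link also counts as anchor text.
        if self._href is not None and data.strip():
            self._parts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._parts)))
            self._href = None


def anchor_texts(html):
    """Return a list of (href, anchor_text) pairs found in the HTML string."""
    parser = AnchorAltCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, anchor_texts('<a href="/home"><img src="logo.png" alt="Home page"></a>') yields one pair whose anchor text is the alt text. Whether such text should then be indexed, tokenized, or stored in a separate field is exactly the open question in the reply.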

Re: Alt text of images as anchor text

2010-01-20 Thread Nutch Newbie
is counted. Are you crawling for images? or
http://svn.apache.org/repos/asf/lucene/nutch/trunk/conf/crawl-urlfilter.txt.template
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
Nutch
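
To see what the quoted filter rule does to a given URL, here is a small Python sketch. It copies the regular expression from the rule above (without the leading "-", which marks the rule as an exclusion) and assumes the pattern is searched for anywhere in the URL; the function name is_skipped and the example.com URLs are hypothetical, and this is not Nutch's own filter implementation.

```python
import re

# Pattern copied from the quoted crawl-urlfilter.txt.template rule.
# The leading "-" in the file means "exclude URLs matching this pattern".
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
)


def is_skipped(url):
    """Return True if the URL ends in one of the excluded suffixes."""
    return SKIP_SUFFIXES.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/logo.png",         # image suffix: excluded
        "http://example.com/page.html",        # not listed: passes this rule
        "http://example.com/logo.png?size=2",  # suffix is not at the end: passes
    ):
        print(url, "->", "skipped" if is_skipped(url) else "kept by this rule")
```

Under this rule, image URLs such as .png or .jpg never reach the fetcher, so anyone crawling for images would need to remove those suffixes from the pattern. Note that the pattern is case-sensitive and only lists all-lowercase and all-uppercase variants for some suffixes.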