Sure, you can restructure/rename the files. All the source is there, and
with time and effort you will be able to compile it, etc. I don't see any value
in doing this at all, though. Nutch is a HUGE project; you can't simply
search/replace/rebrand/compile. Even if you are successful, all the
future
On Wed, Jan 20, 2010 at 7:10 PM, kraman kirthi.ra...@gmail.com wrote:
kirth...@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
crawl started in: tinycrawl
rootUrlDir = url
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: tinycrawl/crawldb
Injector:
On Wed, Jan 20, 2010 at 8:04 AM, MyD myd.ro...@googlemail.com wrote:
Because I would like to associate two separately running crawls. The only way to
associate them is through the URL.
Is there something like CrawlDBReader, but for writing into it instead of
reading from it? An example would be great.
On Wed, Jan 20, 2010 at 4:16 PM, axi axi...@gmail.com wrote:
After several tests, I have noticed that Nutch ignores the alt text of images
inside <a href=...> tags.
So this feature isn't implemented yet, right?
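To make the question concrete, here is a minimal sketch of what "using the alt text of images inside links" could mean. It is plain Python using only the standard library, not Nutch code; the class AnchorAltCollector and the function anchor_texts are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class AnchorAltCollector(HTMLParser):
    """Collect the text of each link, including <img alt> found inside it.

    Illustration only: this is not how Nutch parses pages.
    """

    def __init__(self):
        super().__init__()
        self.links = []     # list of (href, text) tuples
        self._href = None   # href of the <a> currently open, if any
        self._parts = []    # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._parts = []
        elif tag == "img" and self._href is not None:
            # Treat the image's alt text as part of the anchor text.
            alt = attrs.get("alt")
            if alt:
                self._parts.append(alt)

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self._parts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._parts)))
            self._href = None


def anchor_texts(html):
    """Return (href, anchor text) pairs for every link in the HTML string."""
    parser = AnchorAltCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

With this sketch, a link that contains only an image still yields usable anchor text, taken from the image's alt attribute.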
What exactly do you want Nutch to do with the alt text? Index it?
Tokenize it? Make this field
is counted.
are you crawling for images? or
http://svn.apache.org/repos/asf/lucene/nutch/trunk/conf/crawl-urlfilter.txt.template
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
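As a rough illustration of what that suffix rule matches, the same pattern can be tried out in Python. This is only a sketch: it assumes the filter tests the pattern anywhere in the URL (search semantics) and that the leading "-" marks the rule as an exclusion; the name is_skipped is hypothetical.

```python
import re

# The pattern from the rule above, without its leading "-".
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
)


def is_skipped(url):
    """Return True if the URL ends in one of the listed suffixes."""
    return SKIP_SUFFIXES.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/logo.png",
        "http://example.com/index.html",
        "http://example.com/photo.Png",      # mixed case is not in the list
        "http://example.com/logo.png?v=2",   # "$" anchors to the end of the URL
    ):
        print(url, is_skipped(url))
```

Note that the pattern lists only all-lowercase and all-uppercase variants of some suffixes, and that a query string after the suffix keeps the rule from matching.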