Saurabh Suman wrote:
> Hi
> I have some confusion regarding Fetcher.java. Does Fetcher fetch the HTML
> page, store it first, and then parse it?
> Can I just store the HTML if I don't want to parse it?
>
It can. It has a -noParsing option:
bin/nutch fetch
Usage: Fetcher [-threads n] [-noParsing]
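For example, a fetch round without parsing might look like this (the segment path and thread count are just placeholders; parsing can then run later as a separate step):

```
bin/nutch fetch crawl/segments/20090731123456 -threads 10 -noParsing
bin/nutch parse crawl/segments/20090731123456
```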
I'm trying to follow the example in the Wiki, but it's corrupt. It
has a bunch of garbage in the part you're supposed to paste into
solrconfig.xml - I don't know if something got interpreted as wiki
markup when it shouldn't have, or what, but I doubt superscripts are a
normal part of the configuration.
Hi
In Fetcher.java, if the protocol status of a URL is
ProtocolStatus.ACCESS_DENIED, will Nutch try to crawl it again after a
certain time interval? If yes, how can I prevent Nutch from recrawling
it if its protocol status is ProtocolStatus.ACCESS_DENIED?
Hi Paul,
yeah, there is a dump command:
bin/nutch readlinkdb crawl/linkdb/ -dump dumpdir
You can also dump the CrawlDB, but I don't know whether the complete data
can be dumped, or whether this is useful for you...
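There is an analogous reader for the CrawlDB; a sketch, assuming the readdb command of your Nutch version (run bin/nutch readdb without arguments to check the exact options):

```
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb_dump
```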
HTH
Mario
Paul Tomblin wrote:
> The nutch data files are pretty opaque, and even "strings" can'
I found how to dump the crawldb/linkdb into a txt file.
Are there other formats to dump into? Maybe XML?
I am also thinking about reading up on a CSV format.
thanks in advance
Mario
--
Mario Schröder | http://www.finanz-checks.de
Phone: +49 34464 62301 Cell: +49 163 27 09 807
http://www.xing.com/go/invite/6035
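Since the readlinkdb dump is plain text, a CSV view can be produced by post-processing the dump. Here is a sketch in Java; the assumed line layout (a URL line containing "Inlinks:", followed by indented "fromUrl: ... anchor: ..." lines) should be checked against an actual dump first:

```java
import java.util.*;

// Sketch: turn "url<TAB>Inlinks:" header lines and " fromUrl: X anchor: Y"
// detail lines from a readlinkdb dump into CSV rows "toUrl,fromUrl,anchor".
// The exact dump layout assumed here should be verified against a real dump.
public class LinkDumpToCsv {
    public static List<String> toCsv(List<String> dumpLines) {
        List<String> rows = new ArrayList<>();
        String current = null;                      // URL of the current record
        for (String line : dumpLines) {
            if (line.contains("Inlinks:")) {
                // header line: "<url>\tInlinks:"
                current = line.split("\t")[0].trim();
            } else if (current != null && line.trim().startsWith("fromUrl:")) {
                String rest = line.trim().substring("fromUrl:".length()).trim();
                String from = rest, anchor = "";
                int i = rest.indexOf(" anchor:");
                if (i >= 0) {
                    from = rest.substring(0, i).trim();
                    anchor = rest.substring(i + " anchor:".length()).trim();
                }
                rows.add(current + "," + from + "," + anchor);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "http://example.com/\tInlinks:",
            " fromUrl: http://other.com/page anchor: Example");
        for (String row : toCsv(sample)) System.out.println(row);
    }
}
```

Proper CSV quoting (commas or quotes inside anchors) is left out of the sketch and would need handling for real data.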
Hi,
As I understand it, only the indexing part of Nutch is available in C++, as
CLucene. I want to port Nutch to C++, but only if it is worth doing. I
wonder whether it is worth porting the remaining parts of Nutch to C++,
say the crawler. Can someone give me directions on where to start?
Thanks
Alex.
Hi,
I would like to know how I can modify the Nutch code to exclude external links with
certain extensions. For example, if I have mydomain.com in urls and mydomain.com
has a lot of links like mydomain.com/mylink.shtml, then I want Nutch not to
fetch (crawl) those kinds of URLs at all.
Thanks
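One way to get this without modifying Nutch code is the regex URL filter. A sketch of entries for conf/regex-urlfilter.txt, assuming the urlfilter-regex plugin is enabled via plugin.includes (the .shtml pattern mirrors the example above):

```
# skip any URL ending in .shtml
-\.shtml$
# accept anything else
+.
```

Rules are applied top to bottom; the first matching `+` or `-` pattern decides whether a URL is fetched.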
On Thu, Jul 30, 2009 at 9:15 PM, wrote:
> I would like to know how can I modify nutch code to exclude external links
> with certain extensions. For example, if have in urls mydomain.com and my
> domain.com has a lot of links like mydomain.com/mylink.shtml, then I want
> nutch not to fetch(craw
How do I develop a plugin that isn't in the Nutch source tree? I want
to keep all my project's source code together, and not put the
project-specific plugin in with the Nutch code. Do I just have my
plugin's build.xml include $NUTCH_HOME/src/home/build-plugin.xml?
(I'm a little shaky on ant syntax.)
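One approach is a standalone build.xml that imports Nutch's shared plugin build logic; a sketch, where the nutch.root property and the path to build-plugin.xml (src/plugin/build-plugin.xml in the Nutch tree) are assumptions to adapt to your checkout:

```
<?xml version="1.0"?>
<!-- Sketch: standalone plugin build importing Nutch's shared plugin targets.
     Assumes NUTCH_HOME points at a checked-out Nutch source tree. -->
<project name="my-plugin" default="jar">
  <property environment="env"/>
  <property name="nutch.root" location="${env.NUTCH_HOME}"/>
  <import file="${nutch.root}/src/plugin/build-plugin.xml"/>
</project>
```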
Hi
If a URL is denied once by robots.txt rules, is it crawled again by
Nutch?
--
View this message in context:
http://www.nabble.com/denied-by-robots.txt-rules-tp24750517p24750517.html
Sent from the Nutch - User mailing list archive at Nabble.com.
This is a simple HowTo
http://wiki.apache.org/nutch/WritingPluginExample-0.9
Best Regards
Alexander Aristov
2009/7/31 Paul Tomblin
> How do I develop a plugin that isn't in the nutch source tree? I want
> to keep all my project's source code together, and not put the
> project specific plugi