It is not allowed for robots:
http://search.yahoo.com/robots.txt
User-agent: *
Disallow: /search
Disallow: /bin
Disallow: /myweb
Disallow: /myresults
Disallow: /language
Kim Theng Chong wrote:
Hi all,
Can Nutch crawl Yahoo search result pages? E.g.:
You are trying to refetch an already fetched segment.
If, in your loop,
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
does not generate a new segment, this can happen.
You have to check whether a new segment was generated by this command;
check the exit status.
There are some scripts in
I guess it is filtered out by your URL filter configuration.
DOMContentUtils.java in the parse-html plugin extracts the links.
Stefano Cherchi wrote:
As in the subject: when I try to fetch a page whose link should open in a new
window (with the tag target=_new or _blank), Nutch fails. No errors or
QueroVc wrote:
But does crawl-urlfilter.txt accept only characters rather than strings?
If strings are accepted, how do I write the rule?
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
Could it be something like this?
# skip URLs containing certain characters as probable queries, etc.
-[menu]
You can edit regex-urlfilter.txt to exclude those URLs if you use the fetch
command,
or crawl-urlfilter.txt if you use the crawl command.
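For example (untested; adjust to your setup), a rule like this placed before the final accept rule should skip every URL containing the string menu, because the regex URL filter matches anywhere in the URL:

# skip URLs containing the string 'menu'
-menu

Note that -[menu] would not do what you want: [menu] is a regex character class matching any single one of the characters m, e, n, u.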
QueroVc wrote:
Could someone please tell me how to keep the crawl from fetching URLs that
contain the word menu.
Thanks
Andrzej Bialecki wrote:
On 2010-02-21 12:36, reinhard schwab wrote:
Andrzej Bialecki wrote:
On 2010-02-20 23:32, reinhard schwab wrote:
Andrzej Bialecki wrote:
On 2010-02-20 22:45, reinhard schwab wrote:
the content of one page is stored as many as 7 times.
http://www.cinema-paradiso.at
reinhard schwab wrote:
I have now implemented this tool by forking SegmentMerger.
I only added an additional filter in the map method and
kept the segment name.
I was then surprised that the reduce method logs the content
of a crawl datum 4 times.
Why is that?
I then logged the content:
segment.SegmentFilter -
org.apache.nutch.crawl.CrawlDatum
2010-02-19 13:25:54,794 INFO segment.SegmentFilter - reduce 348
regards
reinhard schwab wrote:
I would like to have a segment filter which filters out unneeded content.
I only want to keep the content of pages which are still indexed in Solr
After adding a synchronized modifier to the addFetchItem method,
I have not seen the fetcher hang.
reinhard schwab wrote:
After studying the code and the analysis done by Steven Denny in JIRA,
I think he is right.
Note that the queue is created and then immediately reaped, and after
I would like to have a segment filter which filters out unneeded content.
I only want to keep the content of pages which are still indexed in Solr
and which belong to this segment
when I query Solr by this segment name.
Is there any existing tool available?
SegmentMerger is a no-go for me. It
Nutch expects urls to be a directory.
Create a directory urls and create in this directory a file named however
you want, and
edit this file, adding the URLs you want to crawl.
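For example (the names here are just an illustration), a file urls/seed.txt with one URL per line:

http://www.example.org/
http://www.example.net/docs/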
Injector: urlDir: urls
Input path doesnt exist : C:/cygwin/home/MouadSibel/nutch-0.9/urls
Mouad wrote:
Hello,
i
Paul Tomblin has posted a diff for handling Last-Modified.
I don't know whether an issue has been opened in JIRA.
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15056.html
Rupesh Mankar wrote:
Hi,
I am using Nutch 1.0. I have successfully crawled our intranet site. But when
I
/browse/NUTCH-719. A solution has
been proposed but I am not sure that it really fixes the problem.
J.
2010/1/26 reinhard schwab reinhard.sch...@aon.at:
Sometimes I see
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
Sorry, I overlooked a method with the same name in FetchItemQueues.
It is at line number 394 in my code version after expanding the import
statements.
I will test it.
reinhard schwab wrote:
I have now had the opportunity to test fetching again.
It had looked good so far, until now.
Again the same:
is not synchronized.
If getFetchItem is called before addFetchItem has finished, then the
queue is reaped and addFetchItem later increments the counter.
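For anyone following along, here is a minimal, self-contained sketch of that race (illustrative names, not the actual Fetcher code): without synchronization, a reader can reap the freshly created, still-empty queue between its creation and the insertion of the first item, while the writer still increments the counter.

import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch of the FetchItemQueues race, not the actual Nutch code.
class QueueRegistry {
    private final Map<String, Queue<String>> queues = new HashMap<String, Queue<String>>();
    private int totalSize = 0;

    // Without 'synchronized' here, a concurrent getFetchItem() can observe
    // the freshly created, still-empty queue, reap it, and the add below
    // then increments totalSize for an item nobody can ever fetch.
    public synchronized void addFetchItem(String queueId, String url) {
        Queue<String> q = queues.get(queueId);
        if (q == null) {
            q = new LinkedList<String>();
            queues.put(queueId, q);
        }
        q.add(url);
        totalSize++;
    }

    public synchronized String getFetchItem() {
        Iterator<Map.Entry<String, Queue<String>>> it = queues.entrySet().iterator();
        while (it.hasNext()) {
            Queue<String> q = it.next().getValue();
            String url = q.poll();
            if (url != null) {
                totalSize--;
                return url;
            }
            it.remove(); // reap empty queues, as the Fetcher does
        }
        return null;
    }
}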
reinhard schwab wrote:
Sorry, I overlooked a method with the same name in FetchItemQueues.
It is at line number 394 in my code version after
You have to install some additional jars.
Read the Nutch README.
It says:
Apache Nutch README
Important note: Due to licensing issues we cannot provide two libraries that
are normally provided with PDFBox (jai_core.jar, jai_codec.jar), the parser
library we use for parsing PDF files. If you
/loocia/nutch-1.0/build.xml:62: Specify at least one
source--a file or resource collection.
Total time: 0 seconds
ant version:
Apache Ant version 1.7.0 compiled on April 29 2008
Any idea why it doesn't build?
reinhard schwab wrote:
You have to install some additional jars.
Read
Sometimes I see
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
Aborting with 10 hung threads.
If I connect with jconsole, all fetcher threads are sleeping.
Something wrong with the fetchQueues totalSize?
Before that it has logged
The easiest way I can think of would be to modify CrawlDbReducer.
It has a reduce method where it writes the crawl datums to the crawldb
when updating
the crawl db.
You can filter out the crawl datums with a low score there and
return before
output.collect(key, result);
Then they are not in the crawl db.
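A sketch of what that could look like at the end of CrawlDbReducer.reduce(), guarding the existing collect call (the threshold is made up for illustration; this is not stock Nutch code):

// Inside CrawlDbReducer.reduce(), just before the existing collect:
float minScore = 0.5f;               // hypothetical threshold
if (result.getScore() < minScore) {
    return;                          // dropped entries vanish from the crawldb
}
output.collect(key, result);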
Using nutch readdb you can dump the entry for the page.
I believe that the fetch interval of this page is zero.
Sunnyvale Fl wrote:
Hi,
I am using Nutch 0.9.1 and I am having this weird problem - it will
repeatedly fetch the same page without error. So if I let it run to 10
levels deep, the
:00 PST 1969
Retries since fetch: 0
Retry interval: 7.0 days
Score: 0.0
Signature: 5ec8dc313a9ae4d61c6e8c9d9c18ea26
Metadata: _pst_:success(1), lastModified=0
On Thu, Jan 21, 2010 at 5:00 PM, reinhard schwab
reinhard.sch...@aon.atwrote:
Using nutch readdb you can dump the entry
fetch: 0
Retry interval: 0.0 days
Score: 0.0
Signature: 09854146546e5e7fe5def1e1add23037
Metadata: _pst_:success(1), lastModified=0
On Thu, Jan 21, 2010 at 5:50 PM, reinhard schwab
reinhard.sch...@aon.atwrote:
Yes, I mean that.
In the Java classes it is called fetch interval; see
Check the class DOMContentUtils.java
in the plugin parse-html.
You can modify it to meet your requirements.
In general, option values do not contain links.
You may apply a heuristic.
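One possible heuristic (my sketch, not part of parse-html): only treat an option value as an outlink when it already looks like an absolute URL or a site-relative path.

import java.util.regex.Pattern;

// Heuristic sketch: decide whether an option value looks like a link.
class OptionLinkHeuristic {
    private static final Pattern URL_LIKE =
        Pattern.compile("^(https?://|/)[^\\s]+$", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeLink(String optionValue) {
        return optionValue != null && URL_LIKE.matcher(optionValue.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeLink("/products/page1.html")); // true
        System.out.println(looksLikeLink("http://example.org/a")); // true
        System.out.println(looksLikeLink("3"));                    // false
    }
}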
Joshua J Pavel wrote:
So, with HTML like this (from a dropdown box):
option
Ken Ken wrote:
/nutch-1.0/conf/regex-urlfilter.txt
Hello,
I just want to fetch/crawl all .com domain names, so what should I put in the
/nutch-1.0/conf/regex-urlfilter.txt file
e.g.
+^http://([a-z0-9]*\.)*apache.org/
Correct me if I am wrong; I think the above only crawls/fetches
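That rule indeed only accepts apache.org hosts, because the literal apache.org is part of the pattern. For any .com host, something like this (untested) placed before the final rule in regex-urlfilter.txt might work:

+^http://([a-z0-9-]+\.)+com/

You may also want to allow https and non-standard ports, depending on your needs.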
http://www.fileformat.info/info/unicode/char/2029/index.htm
I have experienced that this Unicode character breaks JSON deserialization
when using Solr and AJAX.
It comes from a PDF text.
Where should this character be filtered out or replaced? In the PDF
parser/text extractor? In the Solr indexer?
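One option (a sketch, not an established Nutch hook): sanitize the extracted text before it reaches the index. U+2028 and U+2029 are legal in JSON but illegal inside JavaScript string literals, which is why eval()-based AJAX deserializers choke on them.

public class JsonTextSanitizer {
    // Replace U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR),
    // which break eval()-based JSON parsing in browsers, with spaces.
    public static String sanitize(String text) {
        if (text == null) return null;
        return text.replace('\u2028', ' ').replace('\u2029', ' ');
    }
}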
regards
reinhard
If you don't want to refetch already fetched pages,
I can think of 3 possibilities:
a/ set a very high fetch interval
b/ use a customized fetch schedule class instead of DefaultFetchSchedule;
implement there a method
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
which returns false for pages that have already been fetched.
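A sketch of option b/ (assuming the Nutch 1.0 FetchSchedule API; the class name is made up):

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

// Sketch: never refetch pages that have already been fetched once.
public class FetchOnceSchedule extends DefaultFetchSchedule {
    @Override
    public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
        int status = datum.getStatus();
        if (status == CrawlDatum.STATUS_DB_FETCHED
            || status == CrawlDatum.STATUS_DB_NOTMODIFIED) {
            return false; // already fetched once; never again
        }
        return super.shouldFetch(url, datum, curTime);
    }
}

Point db.fetch.schedule.class at the class in nutch-site.xml to activate it.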
Peters, Vijaya wrote:
I am using Nutch 1.0. I want to perform a 'clean' crawl.
I see the force option in this patch: NUTCH-601v1.0.patch
https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0
.patch
Do I have to make those code changes, or does Nutch 1.0 have
datums have a 60 days retry interval.
This crawl datum will be fetched again and again with a 0 days retry
interval.
I will open an issue in JIRA and attach a patch.
regards
reinhard
reinhard schwab wrote:
I'm observing crawl datums which have a fetch interval of 0.
When I dump the segment
I'm observing crawl datums which have a fetch interval of 0.
When I dump the segment, I see:
Recno:: 33
URL::
http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Tue Dec 01 23:41:15 CET 2009
Modified time: Thu
Andrzej Bialecki wrote:
BELLINI ADAM wrote:
hi,
my two URLs point to the same page!
Please, no need to shout ...
If the MD5 signatures are different, then the binary content of these
pages is different, period.
Use the readseg -dump utility to retrieve the page content from the
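To illustrate the point (my example, not from the thread): the signature is an MD5 over the fetched bytes, so even an invisible difference such as a timestamp in an HTML comment or a session ID in a link yields a different signature.

import java.math.BigInteger;
import java.security.MessageDigest;

// Any byte-level difference changes the MD5, even if the rendered pages look identical.
public class Md5Demo {
    static String md5(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
        return String.format("%032x", new BigInteger(1, d));
    }
    public static void main(String[] args) throws Exception {
        System.out.println(md5("<html>same page</html>"));
        System.out.println(md5("<html>same page </html>")); // one extra space
    }
}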
Andrzej Bialecki wrote:
reinhard schwab wrote:
There is some piece of code I don't understand:
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
// pages are never truly GONE - we have to check them from time
to time.
// pages with too long fetchInterval
There is some piece of code I don't understand:
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
// pages are never truly GONE - we have to check them from time to time.
// pages with too long fetchInterval are adjusted so that they fit
within
// maximum
reinhard schwab wrote:
opsec wrote:
I've added this to my conf/crawl-urlfilter.txt and
conf/regex-urlfilter.txt
yet when I start a crawl, this domain is still heavily spidered. I would like to
remove it from my search results entirely and prevent it from being
crawled
in the future
If you want to recrawl URLs, you have to generate a new segment, fetch
this segment,
and update the crawl db.
Example script:
bin/nutch generate crawl/crawldb crawl/segments -topN $topN -adddays $adddays
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment
is an MD5 hash of the content.
Another reason may be that you have some indexing filters.
I don't believe that's the reason here.
regards
kevin chen wrote:
I have had a similar experience.
Reinhard Schwab responded with a possible fix. See the mail in this group
from Reinhard Schwab at
Sun, 25 Oct 2009 10
(db_fetched)
So it was successfully fetched. But according to the indexing log, it
still was not sent to the indexer!
reinhard schwab wrote:
What is the db status of this URL in your crawl db?
If it is STATUS_DB_NOTMODIFIED,
then that may be the reason.
(You can check it if you dump your
Hmm, I have no idea now.
Check the reduce method in IndexerMapReduce and add some debug
statements there.
Recompile Nutch and try it again.
caezar wrote:
Thanks, checked, it was parsed. Still no answer why it was not indexed
reinhard schwab wrote:
Yes, it's permanently redirected.
You
Is your problem solved now?
This can be OK.
Newly discovered URLs will be added to a segment when fetched documents
are parsed, if these URLs pass the filters.
They will not have a Generate crawl datum, because they are unknown until
they are extracted.
regards
caezar wrote:
I've compared
if (fetchDatum == null || dbDatum == null
    || parseText == null || parseData == null) {
  return;                     // only have inlinks
}
in the IndexerMapReduce code. For this page dbDatum is null, so it is not
indexed!
reinhard schwab wrote:
Is your problem solved now?
This can be OK.
Newly discovered URLs will be added to a segment when
Paul Tomblin sent a patch on 14.10.2009.
Filtering out not-modified pages makes sense to me if the index is
built incrementally and
these pages are already in the index which is being updated.
Lucene offers the option to update an index,
but in my case I always build a new one.
You may
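For reference, the Lucene update mechanism mentioned above is IndexWriter.updateDocument, a delete-by-term followed by an add in one call (a sketch against the Lucene 2.x API of that era; field names are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Sketch: replace the indexed document for a URL instead of rebuilding the index.
public class IncrementalUpdate {
    static void replace(IndexWriter writer, String url, String text) throws Exception {
        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.updateDocument(new Term("url", url), doc); // delete old, add new
    }
}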
If you try
bin/nutch
without any arguments or options, it will show you:
Usage: nutch [-core] COMMAND
where COMMAND is one of:
...
parse parse a segment's pages
invertlinks create a linkdb from parsed segments
index run the indexer on parsed segments and linkdb
nutchcase wrote:
Here is the output from that:
TOTAL urls: 297
retry 0: 297
min score:0.0
avg score:0.023377104
max score:2.009
status 2 (db_fetched):295
status 5 (db_redir_perm): 2
reinhard schwab wrote:
try
bin/nutch readdb crawl
try
bin/nutch readdb crawl/crawldb -stats
Are there any unfetched pages?
nutchcase wrote:
My crawl always stops at depth=3. It gets documents but does not continue any
further.
Here is my nutch-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
it in the LinkDB
in both cases, but if it has a URL like /img/img.jpg for an image, it's
missing from the LinkDB in the case of execution using separate commands.)
Any thoughts?
TIA,
--Hrishi
-----Original Message-----
From: reinhard schwab [mailto:reinhard.sch...@aon.at]
Sent: Tuesday, September 01, 2009 3
There is a config option in nutch-default.xml:
<property>
<name>db.ignore.internal.links</name>
<value>true</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality links.
</description>
</property>
Either you have no seed URLs or your filter is too restrictive.
Also note that nutch crawl will use conf/crawl-urlfilter.txt by
default, and
not conf/regex-urlfilter.txt!
Aditya Sakhuja wrote:
I am having issues getting the data injected into the crawldb. I have set
the filter in the
Iain Downs wrote:
I think there is probably a subtext here (I'm putting words in Otis'
mouth, for which my apologies):
'Yes, you could rewrite Nutch in C++ and have that use CLucene.' But you'd
be mad to do so!
I'm a bit out of date with Nutch, but it's large. And Java to C++ is not
Saurabh Suman wrote:
Hi,
I have some confusion regarding Fetcher.java. Does Fetcher fetch the HTML
page, store it first, and then parse it?
Can I just store the HTML if I don't want to parse it?
It can. It has a -noParsing option:
bin/nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]
I would suggest that you implement a URL filter plugin that does that,
mapping hosts to regexp rules.
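A sketch of such a plugin (assuming the org.apache.nutch.net.URLFilter extension point; in a real plugin the host-to-pattern map would be loaded from your own config file):

import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Sketch of a per-host URL filter: each host gets its own accept pattern,
// URLs for hosts without a rule pass through unchanged.
public class PerHostURLFilter implements URLFilter {
    private Configuration conf;
    private final Map<String, Pattern> rules = new HashMap<String, Pattern>();

    public PerHostURLFilter() {
        // Hard-coded here for illustration; load from a config file in practice.
        rules.put("www.example.org", Pattern.compile("^http://www\\.example\\.org/docs/"));
    }

    public String filter(String urlString) {
        try {
            Pattern p = rules.get(new URL(urlString).getHost());
            if (p == null) return urlString;                       // no rule: accept
            return p.matcher(urlString).find() ? urlString : null; // null rejects
        } catch (Exception e) {
            return null;
        }
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
}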
Paul Tomblin wrote:
Is there any way other than the config files to specify the URL filter
parameters? I have a few dozen sites to crawl, and for each site I
want to specify
Doğacan Güney wrote:
On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak tpol...@gmail.com wrote:
Hi,
thanks for your answers, I've configured compression:
mapred.output.compress = true
mapred.compress.map.output = true
mapred.output.compression.type= BLOCK
( in xml format in
Doğacan Güney wrote:
On Wed, Jul 29, 2009 at 13:11, reinhard schwab reinhard.sch...@aon.at wrote:
Doğacan Güney wrote:
On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak tpol...@gmail.com wrote:
Hi,
thanks for your answers, I've configured compression:
Yes, there are tools which you can use to dump the content of the crawl
db, link db, and segments.
dump=./crawl/dump
bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
bin/nutch readseg -dump $1 $dump/segments/$1
You will get more info if you
URL in the domain apache.org.
* Until someone can explain this... when I use the file
crawl-urlfilter.txt, the filter doesn't work; instead of it, use the file
conf/regex-urlfilter.txt and change the last line from +. to -.
reinhard schwab wrote:
I have tried the recrawl script of Susam Pal
I believe it can.
Check your configuration files, nutch-site.xml and nutch-default.xml.
You will find something like
<property>
<name>plugin.includes</name>
I have tried the recrawl script of Susam Pal and wondered why
URL filtering no longer works.
http://wiki.apache.org/nutch/Crawl
The mystery is:
only Crawl.java adds crawl-tool.xml to the NutchConfiguration.
Configuration conf = NutchConfiguration.create();
conf.addResource("crawl-tool.xml");
Why?
You mean URLs which contain a query part?
They can be crawled.
The default Nutch configuration excludes them by this filter rule in
conf/crawl-urlfilter.txt:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
Zaihan wrote:
Hi All,
I'm sure I've read
You can dump segment info to a directory, let's say tmps:
$NUTCH_HOME/bin/nutch readseg -dump $segment tmps -nocontent
Then go to the directory; you should see a file named dump.
grep outlink: dump | cut -f5 -d outlinks
On Fri, 2009-07-17 at 18:43 +0200, reinhard schwab wrote:
is any tool
http://www.google.se/robots.txt
Google disallows it:
User-agent: *
Allow: /searchhistory/
Disallow: /search
Larsson85 wrote:
Why isn't Nutch able to handle links from Google?
I tried to start a crawl from the following URL:
http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
And all
://www.google.com/terms_of_service.html
If you set the user agent properties to a client such as Firefox,
Google will serve your request.
reinhard schwab wrote:
http://www.google.se/robots.txt
Google disallows it:
User-agent: *
Allow: /searchhistory/
Disallow: /search
Larsson85 wrote:
You can check the response of Google by dumping the segment:
bin/nutch readseg -dump crawl/segments/... somedirectory
reinhard schwab wrote:
It seems that Google is blocking the user agent.
I get this reply with lwp-request:
Your client does not have permission to get URL
/search?q
Identify Nutch as a popular user agent such as Firefox.
Larsson85 wrote:
Any workaround for this? Making Nutch identify as something else or
something similar?
reinhard schwab wrote:
http://www.google.se/robots.txt
Google disallows it:
User-agent: *
Allow: /searchhistory/
Disallow
directives.
Dennis
reinhard schwab wrote:
Identify Nutch as a popular user agent such as Firefox.
Larsson85 wrote:
Any workaround for this? Making Nutch identify as something else or
something similar?
reinhard schwab wrote:
http://www.google.se/robots.txt
Google disallows it:
User
, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
If I don't have the star at the end I get the same as earlier, "No URLs to
fetch". And if I do, I get "0 records selected for fetching, exiting".
reinhard schwab wrote:
Identify Nutch as a popular
Doğacan Güney wrote:
On Fri, Jul 17, 2009 at 22:48, reinhard schwab reinhard.sch...@aon.at wrote:
When I crawl a domain such as
http://www.weissenkirchen.at/
Nutch extracts these outlinks.
Do they come from some heuristics?
These are probably coming from the parse-js plugin.
reinhard schwab wrote:
Doğacan Güney wrote:
On Fri, Jul 17, 2009 at 22:48, reinhard schwab reinhard.sch...@aon.at wrote:
When I crawl a domain such as
http://www.weissenkirchen.at/
Nutch extracts these outlinks.
Do they come from some heuristics?