Hello,
I am trying to crawl http://www.msnbc.com/ but cannot get anything other than
the original seed URL. The INJECT/GENERATE/FETCH steps complete without
problems, but after executing PARSE I see only one outlink, pointing to the
original seed URL:
"outlinks" : {
"http://www*msnbc*com/" : ""
}
Executing "bin/nutch parsechecker http://www.msnbc.com" finds ~130 outlinks. I
removed the crawl results and repeated the steps, but this time ran PARSE in a
debugger. Here are my observations:
1. The seed URL page contains <meta http-equiv="refresh" content="1200;
URL=http://www.msnbc.com/" />
2. During HtmlParser.getParse(), the meta tag attributes are extracted and an
instance of HTMLMetaTags is created.
3. HtmlParser.getParse() sets the major code of ParseStatus to
ParseStatusCodes.SUCCESS.
4. HtmlParser.getParse() sets the minor code of ParseStatus to
ParseStatusCodes.SUCCESS_REDIRECT based on the presence of "refresh" in the
HTMLMetaTags object.
5. After successful parsing, ParseUtil.process() generates a single new
"http://www*msnbc*com/" outlink and ignores the ~130 discovered outlinks because
pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT.
Based on the above, it looks like a loop is created: the page refreshes to
itself, so every re-fetch will return the same "refresh" again, and the other
outlinks are never stored.
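
To make the suspected loop concrete, here is a minimal standalone sketch (plain
Java, not Nutch code) that splits the content value of the meta refresh tag into
its delay and target, and checks whether the target is the page itself. The
class and method names (MetaRefreshCheck, refreshSeconds, refreshTarget,
isSelfRefresh) are hypothetical and only illustrate the condition; Nutch does
its own extraction in HTMLMetaTags.

```java
import java.net.URI;
import java.util.Locale;

public class MetaRefreshCheck {

    /** Returns the delay in seconds from a value such as "1200; URL=http://host/". */
    static int refreshSeconds(String content) {
        int sep = content.indexOf(';');
        String num = (sep < 0 ? content : content.substring(0, sep)).trim();
        return Integer.parseInt(num);
    }

    /** Returns the refresh target, or null when the value holds only a delay. */
    static String refreshTarget(String content) {
        int sep = content.indexOf(';');
        if (sep < 0) {
            return null;
        }
        String rest = content.substring(sep + 1).trim();
        if (rest.toLowerCase(Locale.ROOT).startsWith("url=")) {
            rest = rest.substring(4).trim();
        }
        return rest.isEmpty() ? null : rest;
    }

    /** True when the refresh points back at the page that contains it. */
    static boolean isSelfRefresh(String pageUrl, String content) {
        String target = refreshTarget(content);
        if (target == null) {
            return true; // a refresh without a URL reloads the current page
        }
        URI base = URI.create(pageUrl).normalize();
        URI resolved = URI.create(pageUrl).resolve(target).normalize();
        return resolved.equals(base);
    }

    public static void main(String[] args) {
        String content = "1200; URL=http://www.msnbc.com/";
        System.out.println(refreshSeconds(content));
        System.out.println(refreshTarget(content));
        System.out.println(isSelfRefresh("http://www.msnbc.com/", content));
    }
}
```

For the seed page above, the check reports a self-refresh, which is the case
where treating the page as a redirect yields nothing new to fetch. Note that
this sketch compares URIs literally after normalization, so
"http://www.msnbc.com" (no trailing slash) would not match; Nutch's own URL
normalizers would be the right tool inside the crawler.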
Here are the snippets of Nutch code mentioned above:
HtmlParser.getParse()
...
    ParseStatus status = ParseStatus.newBuilder().build();
    status.setMajorCode((int) ParseStatusCodes.SUCCESS);
    if (metaTags.getRefresh()) {
      status.setMinorCode((int) ParseStatusCodes.SUCCESS_REDIRECT);
      status.getArgs().add(new Utf8(metaTags.getRefreshHref().toString()));
      status.getArgs().add(
          new Utf8(Integer.toString(metaTags.getRefreshTime())));
    }
...
ParseUtil.process()
...
    if (ParseStatusUtils.isSuccess(pstatus)) {
      if (pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
        String newUrl = ParseStatusUtils.getMessage(pstatus);
        int refreshTime = Integer.parseInt(ParseStatusUtils.getArg(pstatus, 1));
        try {
          newUrl = normalizers.normalize(newUrl, URLNormalizers.SCOPE_FETCHER);
          if (newUrl == null) {
            LOG.warn("redirect normalized to null " + url);
            return;
          }
          try {
            newUrl = filters.filter(newUrl);
          } catch (URLFilterException e) {
            return;
          }
          if (newUrl == null) {
            LOG.warn("redirect filtered to null " + url);
            return;
          }
        } catch (MalformedURLException e) {
          LOG.warn("malformed url exception parsing redirect " + url);
          return;
        }
        page.getOutlinks().put(new Utf8(newUrl), new Utf8());
        page.getMetadata().put(FetcherJob.REDIRECT_DISCOVERED,
            TableUtil.YES_VAL);
        if (newUrl == null || newUrl.equals(url)) {
          String reprUrl = URLUtil.chooseRepr(url, newUrl,
              refreshTime < FetcherJob.PERM_REFRESH_TIME);
          if (reprUrl == null) {
            LOG.warn("reprUrl==null for " + url);
            return;
          } else {
            page.setReprUrl(new Utf8(reprUrl));
          }
        }
      } else {
        page.setText(new Utf8(parse.getText()));
        page.setTitle(new Utf8(parse.getTitle()));
        ByteBuffer prevSig = page.getSignature();
        if (prevSig != null) {
          page.setPrevSignature(prevSig);
        }
        final byte[] signature = sig.calculate(page);
        page.setSignature(ByteBuffer.wrap(signature));
        if (page.getOutlinks() != null) {
          page.getOutlinks().clear();
        }
        final Outlink[] outlinks = parse.getOutlinks();
        int outlinksToStore = Math.min(maxOutlinks, outlinks.length);
        String fromHost;
        if (ignoreExternalLinks) {
          try {
            fromHost = new URL(url).getHost().toLowerCase();
          } catch (final MalformedURLException e) {
            fromHost = null;
          }
        } else {
          fromHost = null;
        }
        int validCount = 0;
        for (int i = 0; validCount < outlinksToStore && i < outlinks.length;
            i++) {
          String toUrl = outlinks[i].getToUrl();
          try {
            toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
            toUrl = filters.filter(toUrl);
          } catch (MalformedURLException e2) {
            continue;
          } catch (URLFilterException e) {
            continue;
          }
          if (toUrl == null) {
            continue;
          }
          Utf8 utf8ToUrl = new Utf8(toUrl);
          if (page.getOutlinks().get(utf8ToUrl) != null) {
            // skip duplicate outlinks
            continue;
          }
          String toHost;
          if (ignoreExternalLinks) {
            try {
              toHost = new URL(toUrl).getHost().toLowerCase();
            } catch (final MalformedURLException e) {
              toHost = null;
            }
            if (toHost == null || !toHost.equals(fromHost)) { // external links
              continue; // skip it
            }
          }
          validCount++;
          page.getOutlinks().put(utf8ToUrl, new Utf8(outlinks[i].getAnchor()));
        }
        Utf8 fetchMark = Mark.FETCH_MARK.checkMark(page);
        if (fetchMark != null) {
          Mark.PARSE_MARK.putMark(page, fetchMark);
        }
      }
    }
...
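
The two branches of the snippet above can be reduced to a small standalone
model to show why the parsed outlinks disappear. This is only a sketch of my
reading of the code, not the Nutch implementation: RedirectBranchModel and
storedOutlinks are hypothetical names, the two example outlink URLs are made
up, and normalization, filtering and the outlink limit are left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RedirectBranchModel {

    /**
     * Simplified model of which outlinks end up stored on the page.
     * redirectStatus stands for minor code == SUCCESS_REDIRECT.
     */
    static Map<String, String> storedOutlinks(boolean redirectStatus,
                                              String redirectTarget,
                                              List<String> parsedOutlinks) {
        Map<String, String> stored = new LinkedHashMap<>();
        if (redirectStatus) {
            // Redirect branch: only the refresh target is recorded,
            // the outlinks found by the parser are never looked at.
            stored.put(redirectTarget, "");
        } else {
            // Normal branch: every distinct parsed outlink is recorded.
            for (String link : parsedOutlinks) {
                stored.putIfAbsent(link, "");
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        // Made-up outlinks standing in for the ~130 found by parsechecker.
        List<String> parsed = Arrays.asList(
            "http://www.msnbc.com/a", "http://www.msnbc.com/b");
        System.out.println(
            storedOutlinks(true, "http://www.msnbc.com/", parsed).keySet());
        System.out.println(storedOutlinks(false, null, parsed).keySet());
    }
}
```

With the redirect flag set, the result contains only the seed URL, which
matches the single outlink I see after PARSE; without it, the parsed outlinks
are kept.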
Regards,
Vyacheslav Pascarel