I've opened NUTCH-1567 to track and address this. https://issues.apache.org/jira/browse/NUTCH-1567
On Tue, Apr 30, 2013 at 9:39 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
> There is a pretty difficult aspect to this problem which makes it
> difficult for others/me to address.
> There are a number of variables which may (depending on your task
> execution between crawls) change the possibility/probability of some MARK
> not being present.
> The core problem here, within the ParserJob at least, is that
> Mark.FETCH_MARK.checkMark(page); is null.
> The explanation I was given for this is documented on the wiki
> (unfortunately the wiki is under maintenance just now).
> I do not think that the DEBUG logging currently in 2.x branch HEAD is
> useful at all. It should display the batchId as opposed to the Mark. My
> justification for this is that the batchId is always null, so showing this
> is pointless. We would be better showing the batchId, which will enable the
> user to refetch the batchId in an attempt to ensure that a MARK is assigned
> to the page.
> Does this make sense?
> Lewis
>
>
> On Sun, Apr 28, 2013 at 8:33 AM, cervenkovab <[email protected]> wrote:
>
>> Hello,
>> I have the same problem with *"Skipping some.relevant.page.com; different
>> batch id (null)"* for a lot of pages. My configuration is almost the same
>> as below (only the OS is different and the storage is HBase).
>>
>> I run the steps (inject), generate, fetch, and the skipping appears in the
>> parse phase. But I want those pages to be parsed; the URLs are relevant
>> for me. The problem is that I want to crawl a lot of websites. *When a lot
>> of pages are skipped, I have very few collected pages and many empty
>> pages, and that is bad for me*.
>> I also don't know why, for example, the page
>> http://videos.arte.tv/de/videos/arte-reportage--7471210.html
>> is fetched and parsed, while the page
>> http://videos.arte.tv/de/videos/real-humans-echte-menschen-7-10-achtung-schockierende-bilder--7455402.html
>> is skipped, and most of the other pages of the domain arte.tv are
>> skipped too. It is the same domain name.
>>
>> *What causes this error? How can I resolve this problem?*
>> Thanks for help
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Nutch-2-1-different-batch-id-null-tp4040592p4059636.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
>
> --
> *Lewis*
>

--
*Lewis*
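
To make the skip discussed in this thread concrete, here is a minimal, self-contained sketch of the kind of check that could produce "different batch id (null)". It is not Nutch's actual code: the class `BatchSkipSketch`, the methods `shouldProcess` and `skipMessage`, the plain `String` marks, and the example URLs and batch ids are all hypothetical. It assumes the only rule is that a page is parsed when its fetch mark equals the batch id being parsed; the real job may handle further cases. The message format follows the suggestion above to report the batchId, and also prints the mark for comparison.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the parse-phase check; these are not Nutch classes.
class BatchSkipSketch {

    /** Returns true when a page carrying fetchMark should be parsed for batchId. */
    static boolean shouldProcess(String fetchMark, String batchId) {
        if (fetchMark == null) {
            // The page has no fetch mark at all, so it can never match a batch.
            return false;
        }
        return fetchMark.equals(batchId);
    }

    /** Builds a debug line that names the batch id being parsed, not only the mark. */
    static String skipMessage(String url, String fetchMark, String batchId) {
        return "Skipping " + url + "; fetch mark (" + fetchMark
                + ") does not match batch id (" + batchId + ")";
    }

    public static void main(String[] args) {
        String batchId = "batch-001"; // hypothetical id of the batch being parsed

        // Hypothetical pages: URL -> fetch mark stored with the page (may be null).
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "batch-001"); // fetched in this batch
        pages.put("http://example.com/b", null);        // no mark was ever set
        pages.put("http://example.com/c", "batch-000"); // fetched in an older batch

        for (Map.Entry<String, String> page : pages.entrySet()) {
            if (shouldProcess(page.getValue(), batchId)) {
                System.out.println("Parsing " + page.getKey());
            } else {
                System.out.println(skipMessage(page.getKey(), page.getValue(), batchId));
            }
        }
    }
}
```

With a message of this shape, a user who sees a skipped page at least learns which batch id the parse step was looking for, which is the information needed to re-run fetch for that batch.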

