I've opened NUTCH-1567 to track and address this.
https://issues.apache.org/jira/browse/NUTCH-1567
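
For anyone following along, here is a rough sketch of the skip check described in the quoted message below, together with the logging change it proposes (report the batchId instead of the Mark). This is not the actual Nutch code: the `MarkCheckSketch` class, the `Page` stand-in, the "fetch" key and the exact message format are all made up for illustration, and the real ParserJob may handle more cases than this.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for the parse-phase skip check; not the real Nutch classes. */
final class MarkCheckSketch {

    /** Hypothetical stand-in for a stored page: a URL plus its marks by name. */
    static final class Page {
        final String url;
        final Map<String, String> marks = new HashMap<>();

        Page(String url) {
            this.url = url;
        }
    }

    /** Returns the fetch mark, or null if the page carries no fetch mark. */
    static String fetchMark(Page page) {
        return page.marks.get("fetch");
    }

    /** A page is parsed only if its fetch mark equals the batch being parsed. */
    static boolean shouldParse(Page page, String batchId) {
        String mark = fetchMark(page);
        return mark != null && mark.equals(batchId);
    }

    /** Skip message that reports the job's batchId rather than the (null) mark. */
    static String skipMessage(Page page, String batchId) {
        return "Skipping " + page.url + "; different batch id (" + batchId + ")";
    }
}
```

With a message like this, a page that has no fetch mark would be logged with the batchId the parse job was run for, which is the value a user would need in order to refetch that batch.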


On Tue, Apr 30, 2013 at 9:39 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
> One aspect of this problem is hard to pin down, which makes it difficult
> for others (and me) to address.
> A number of variables (depending on how your tasks execute between
> crawls) may change the probability of a given Mark not being present.
> The core problem here, within the ParserJob at least, is that
> Mark.FETCH_MARK.checkMark(page) returns null.
> The explanation I was given for this is documented on the wiki
> (unfortunately the wiki is under maintenance just now).
> I do not think that the DEBUG logging currently in the 2.x branch HEAD is
> useful at all. It should display the batchId as opposed to the Mark. My
> justification for this is that the Mark is always null, so showing this
> is pointless. We would be better showing the batchId which will enable the
> user to refetch the batchId in an attempt to ensure that a MARK is assigned
> to the page.
> Does this make sense?
> Lewis
>
>
> On Sun, Apr 28, 2013 at 8:33 AM, cervenkovab <[email protected]> wrote:
>
>> Hello,
>> I have the same problem with *"Skipping some.relevant.page.com; different
>> batch id (null)"* for a lot of pages. My configuration is almost the same
>> as below (only the OS is different, and the storage is HBase).
>>
>> I run the steps (inject), generate, fetch, and the skipping appears in the
>> parse phase. But I want those pages to be parsed; the URLs are relevant to
>> me. The problem is that I want to crawl a lot of websites. *When a lot of
>> pages are skipped, I have very few collected pages and many empty pages,
>> which is bad for me*. I also don't know why, for example, the page
>> http://videos.arte.tv/de/videos/arte-reportage--7471210.html
>> is fetched and parsed, while the page
>> http://videos.arte.tv/de/videos/real-humans-echte-menschen-7-10-achtung-schockierende-bilder--7455402.html
>> is skipped, along with most of the other pages of the domain arte.tv.
>> It is the same domain name.
>>
>> *What causes this error? How can I resolve this problem?*
>> Thanks for help
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Nutch-2-1-different-batch-id-null-tp4040592p4059636.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> *Lewis*
>



-- 
*Lewis*
