Hi,
One aspect of this problem makes it hard for me or anyone else to address.
A number of variables (depending on how your tasks execute between crawls)
can change the probability of a MARK not being present on a page.
The core problem, within the ParserJob at least, is that
Mark.FETCH_MARK.checkMark(page) returns null.
The explanation I was given for this is documented on the wiki
(unfortunately the wiki is under maintenance just now).
I do not think that the DEBUG logging currently in 2.x branch HEAD is
useful at all. It should display the batchId as opposed to the Mark. My
justification for this is that the Mark is always null in this case, so
showing it is pointless. We would be better off showing the batchId, which
would let the user refetch that batch in an attempt to ensure that a MARK
is assigned to the page.
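
To make the suggestion concrete, here is a rough sketch of the kind of
check and log message I mean. It is not the actual ParserJob code: the
class BatchSkipSketch, the "fetch_mark" key, the ALL value and every method
name are hypothetical stand-ins, and a page is modelled as a plain Map. It
only assumes that a page with no fetch mark gets skipped.

```java
import java.util.HashMap;
import java.util.Map;

public class BatchSkipSketch {
    // Hypothetical "process every batch" value; not a real Nutch constant.
    static final String ALL = "-all";

    // Stand-in for the mark lookup: null when the page carries no fetch mark.
    static String checkFetchMark(Map<String, String> pageMarks) {
        return pageMarks.get("fetch_mark");
    }

    // A page is processed only if it has a mark that matches the batch id
    // (or if every batch was requested).
    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            return false;
        }
        return ALL.equals(batchId) || mark.equals(batchId);
    }

    // Proposed message: report the batch id being parsed next to the mark,
    // so a null mark still leaves the user something to act on.
    static String skipMessage(String url, String mark, String batchId) {
        return "Skipping " + url + "; fetch mark (" + mark
            + ") does not match batch id (" + batchId + ")";
    }

    public static void main(String[] args) {
        String batchId = "example-batch-1"; // made-up batch id
        Map<String, String> unmarkedPage = new HashMap<>(); // no fetch mark set
        String mark = checkFetchMark(unmarkedPage);
        if (!shouldProcess(mark, batchId)) {
            System.out.println(
                skipMessage("http://example.com/", mark, batchId));
        }
    }
}
```

With a message along these lines the user can at least see which batch the
parse was run against, even though the mark itself is null.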
Does this make sense?
Lewis


On Sun, Apr 28, 2013 at 8:33 AM, cervenkovab <[email protected]> wrote:

> Hallo,
> I have the same problem with *"Skipping some.relevant.page.com; different
> batch id (null)"* for a lot of pages. My configuration is almost the same
> as below (only the OS is different and the storage is HBase).
>
> I run the steps (inject), generate and fetch, and the skipping appears in
> the parse phase. But I want those pages to be parsed; the URLs are relevant
> to me. The problem is that I want to crawl a lot of websites. *When a lot
> of pages are skipped, I end up with very few collected pages and many empty
> pages, which is bad for me*. I also don't know why, for example, the page
> http://videos.arte.tv/de/videos/arte-reportage--7471210.html
> is fetched and parsed while the page
> http://videos.arte.tv/de/videos/real-humans-echte-menschen-7-10-achtung-schockierende-bilder--7455402.html
> is skipped, along with most of the other pages of the arte.tv domain.
> It is the same domain name.
>
> *What causes this error? How can I resolve this problem?*
> Thanks for help
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-2-1-different-batch-id-null-tp4040592p4059636.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*
