Hi Lewis,
Gora has no filters (nothing like a "where" clause in SQL), so every job has
to load all backend entries, and you will see a lot of entries with the wrong
fetch mark. Null values are possible too. Think about these steps:
inject -> generate -> inject -> fetch
The second inject leaves entries in the db that have no fetch mark, and the
fetcher will see those entries later.
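
The sequence above can be played through with a toy model. This is only a sketch, not Nutch code: the table is a plain map from URL to a single mark, the class and method names are made up for illustration, and shouldProcess here is an assumed stand-in for the rule discussed in this thread (skip when the mark is null or differs from the batch id).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the crawl table: URL -> mark (null means "no mark"). Not Nutch code. */
final class FetchMarkSketch {

    /** Assumed rule: process a row only if its mark is set and equals the batch id. */
    static boolean shouldProcess(String mark, String batchId) {
        return mark != null && mark.equals(batchId);
    }

    /** inject: adds a row without any mark; existing rows are left alone. */
    static void inject(Map<String, String> table, String url) {
        if (!table.containsKey(url)) {
            table.put(url, null);
        }
    }

    /** generate: stamps every row currently in the table with the batch id. */
    static void generate(Map<String, String> table, String batchId) {
        table.replaceAll((url, mark) -> batchId);
    }

    /** A later job scans ALL rows (no filter available) and collects the skip messages. */
    static List<String> scan(Map<String, String> table, String batchId) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, String> row : table.entrySet()) {
            if (!shouldProcess(row.getValue(), batchId)) {
                skipped.add("Skipping " + row.getKey()
                    + "; different batch id (" + row.getValue() + ")");
            }
        }
        return skipped;
    }

    /** inject -> generate -> inject -> scan, as in the steps described above. */
    static List<String> demo() {
        Map<String, String> table = new LinkedHashMap<>();
        inject(table, "http://a.example/");
        generate(table, "batch-1");
        inject(table, "http://b.example/"); // arrives after generate, so it has no mark
        return scan(table, "batch-1");
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this model only the URL from the second inject is reported, and it is reported with a null mark, because the later job cannot ask the store for "rows of this batch only".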
--Roland
On Fri, Apr 26, 2013 at 12:30 AM, Lewis John Mcgibbney <
[email protected]> wrote:
> Additionally, why do we log at DEBUG "different batch id (" + mark + ")"?
> Should we not log what the different batch id is, as opposed to the
> FETCH_MARK mark? In this case the mark is null, which is useless to us.
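
One possible shape for a more informative message, as a sketch only. The helper name skipMessage and the wording of the messages are hypothetical and not part of Nutch. The idea is to report the job's batch id alongside the page's mark and to call out the null case explicitly.

```java
/** Hypothetical helper for building the skip message; not part of Nutch. */
final class SkipMessageSketch {

    /**
     * Builds a message that names both sides of the comparison:
     * the mark found on the page and the batch id the job is running with.
     */
    static String skipMessage(String url, String mark, String batchId) {
        if (mark == null) {
            // No mark at all, e.g. the row was injected after the last generate.
            return "Skipping " + url + "; no fetch mark set (job batch id: " + batchId + ")";
        }
        return "Skipping " + url + "; fetch mark " + mark
            + " differs from job batch id " + batchId;
    }

    public static void main(String[] args) {
        System.out.println(skipMessage("http://a.example/", null, "batch-2"));
        System.out.println(skipMessage("http://b.example/", "batch-1", "batch-2"));
    }
}
```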
>
> This DEBUG logging is also present in the following classes and until I
> understand it, I am not really happy with it being present.
> IndexerJob
> ParserJob
> FetcherJob
>
>
> On Thu, Apr 25, 2013 at 3:20 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > Hi,
> > Within ParserJob#map, I am keen to see how the situation arises where
> > !NutchJob.shouldProcess(mark, batchId) evaluates to true because
> > Mark.FETCH_MARK.checkMark(page) returns null.
> >
> > In what scenarios is it possible to have a page which we attempt to fetch
> > and which has a null value for FETCH_MARK?
> >
> > @Override
> > public void map(String key, WebPage page, Context context)
> >     throws IOException, InterruptedException {
> >   Utf8 mark = Mark.FETCH_MARK.checkMark(page);
> >   String unreverseKey = TableUtil.unreverseUrl(key);
> >   if (batchId.equals(REPARSE)) {
> >     LOG.debug("Reparsing " + unreverseKey);
> >   } else {
> >     if (!NutchJob.shouldProcess(mark, batchId)) {
> >       if (LOG.isDebugEnabled()) {
> >         LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
> >             + "; different batch id (" + mark + ")");
> >       }
> >       return;
> >
> > Any ideas? Is this a bug?
> >
> >
> >
> > On Thu, Apr 25, 2013 at 7:31 AM, Carmine Paternoster <[email protected]>
> > wrote:
> >
> >> Hi Lewis, thank you very much for your answer. I do not know how, but I
> >> solved it. The message "different batch id (null)" no longer appears. In
> >> any case, I'm using Nutch 2.1.
> >> Good day, Carmine
> >>
> >>
> >> 2013/4/24 Lewis John Mcgibbney <[email protected]>
> >>
> >>>
> >>> Hi Carmine,
> >>>
> >>> CC: [email protected]
> >>>
> >>> On Wed, Apr 24, 2013 at 3:13 AM, Carmine Paternoster <
> >>> [email protected]> wrote:
> >>>
> >>>> I configured Nutch and MySQL following this guide
> >>>> (http://nlp.solutions.asia/?p=180). Everything worked fine, but at some
> >>>> point I found that all elements in the database have baseUrl=null and
> >>>> content=null. Nutch is not parsing many URLs. I receive this message in
> >>>> the Nutch console:
> >>>> Skipping http://myurlForParsing.it; different batch id (null)
> >>>>
> >>>> How can I fix this?
> >>>>
> >>>>
> >>>>
> >>> This is actually something which I've wondered about for a while and it
> >>> was on my TODO list of things to address!!!
> >>> I want to know how to reproduce different batch id (null).
> >>> Which version of 2.x are you on? 2.1?
> >>> Thanks
> >>> Lewis
> >>>
> >>
> >>
> >
> >
> > --
> > *Lewis*
> >
>
>
>
> --
> *Lewis*
>