Hi Barnabas,

> The reduce function of IndexerMapReduce only receives CrawlDatums (line 198);
> the parseData/parseText is always null, thus the function returns in line 261.

Parse data and text are stored in the segments, while the CrawlDatum may come
from the CrawlDb.
Does the index job get the segment with the fetched and parsed pages passed as
input?
If "parseData/parseText is always null", then either no segment is read or the
segment is empty.

Best,
Sebastian

On 08/09/2017 07:49 PM, Barnabás Balázs wrote:
> Small followup tidbit:
> 
> The reduce function of IndexerMapReduce only receives CrawlDatums (line 198);
> the parseData/parseText is always null, thus the function returns in line 261.
> 
> So the main question now:
> Why does the Indexer receive only CrawlDatums when the Parse job that runs 
> before it creates the ParseData just fine?
> On 2017. 08. 09. 19:00:27, Barnabás Balázs <barnabas.bal...@impresign.com> 
> wrote:
> Dear community!
> 
> I'm relatively new to Nutch 1.x and got stumped on an indexing issue.
> I have a local Java application that sends Nutch jobs to a remote Hadoop 
> deployment for execution. The jobs are sent in the following order:
> Inject -> Generate -> Fetch -> Parse -> Index -> Update -> Invertlinks
> Once a round is finished, it starts over. The commands are of course 
> configured based on the previous one's results (when necessary).
> 
> This setup seems to work; I can see, for example, that fetch gathers the 
> correct URLs. The problem is the Index stage. I implemented a custom 
> IndexWriter that should send data to Couchbase buckets and Kafka producers. 
> However, even though the plugin seems to be constructed correctly (I can see 
> Kafka producer setup records in the reduce log), the open/write/update 
> functions are never called. I put logs in each and also used remote debugging 
> to make sure that they are really never called.
> I also used a debugger inside the IndexerMapReduce class; to be honest, I'm 
> not sure where the IndexWriter is used, but the job definitely receives data 
> (I saw the fetched URLs).
> 
> I should mention that I also created an HTMLParseFilter plugin and that one 
> works perfectly, so plugin deployment shouldn't be the issue. Also, in the 
> logs I can see the following:
> Registered Plugins: ... Couchbase indexer (indexer-couchbase) ... 
> org.apache.nutch.indexer.IndexWriters: Adding correct.package.Indexer
> I've been stuck on this issue for a few days now; any help/ideas on why my 
> IndexWriter is never called when running an Indexer job would be appreciated.
> 
> Best,
> Barnabas
> 
