I have updated NUTCH-1559 with the info I found and a patch. I would like
someone to review the approach from an architectural perspective. I am too
unfamiliar with the underpinnings of Nutch to know whether the proposed
solution will have any downstream issues. I have been testing this on my test
system, but my testing only covers the plugins and scope of my project (i.e.,
it lacks regression testing).
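
As a rough illustration of the duplication described in the quoted analysis
below, here is a minimal sketch using plain Java collections. It is not Nutch
code: the class name, the helper methods, and the map-based metadata store are
all hypothetical stand-ins, and it assumes the same description value reaches
the indexing step once via the metadata copied from Tika and once via the
general meta tags.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetatagDuplicationSketch {

    // Hypothetical multi-valued metadata store; NOT Nutch's Metadata class.
    static void add(Map<String, List<String>> md, String name, String value) {
        md.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Copies every value from one source into the indexed map under "metatag.<name>".
    static void indexAll(Map<String, List<String>> source,
                         Map<String, List<String>> indexed) {
        for (Map.Entry<String, List<String>> entry : source.entrySet()) {
            for (String value : entry.getValue()) {
                add(indexed, "metatag." + entry.getKey(), value);
            }
        }
    }

    // Mimics the two passes: the parse metadata (assumed to already hold the
    // value copied from Tika) is read first, then the general meta tags.
    static List<String> indexBothSources(String description) {
        Map<String, List<String>> parseMetadata = new LinkedHashMap<>();
        Map<String, List<String>> generalMetaTags = new LinkedHashMap<>();
        add(parseMetadata, "description", description);   // assumed: copied from Tika
        add(generalMetaTags, "description", description); // assumed: read from HTML meta tags

        Map<String, List<String>> indexed = new LinkedHashMap<>();
        indexAll(parseMetadata, indexed);
        indexAll(generalMetaTags, indexed);
        return indexed.get("metatag.description");
    }

    public static void main(String[] args) {
        System.out.println(indexBothSources("A sample page"));
    }
}
```

Under these assumptions the indexed field ends up holding the value twice,
because nothing in the second pass checks whether the first pass already
stored it.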

jeff

On Thu, Apr 30, 2015 at 12:35 PM, Jorge Luis Betancourt González <
[email protected]> wrote:

> This was reported in NUTCH-1559 and is planned to be fixed in Nutch 1.11.
> Can you provide a patch? The root of the problem appears to be using the
> plugin tika-parser in combination with parse-metatags.
>
> Regards,
>
> ----- Original Message -----
> From: "Jeff Cocking" <[email protected]>
> To: "Nutch User MailList" <[email protected]>
> Sent: Thursday, April 30, 2015 1:05:33 PM
> Subject: [MASSMAIL]Re: Duplicate Metatag.Description Values
>
> OK, after further investigation, I believe I have found the culprit. It
> appears we have two conflicting activities occurring:
>
> I was able to remove the error by running only the parse-html plugin. When
> the Tika plugin is activated, the duplicate value occurs. The Tika plugin
> copies all the Tika metadata into the Nutch metadata (see code below). The
> MetaTagsParser is set up to parse both the Tika metadata and the Nutch
> metadata. This is the reason for the duplicate values.
>
> I do not know how these values are used elsewhere within the system. It
> would appear we could remove the MetaTagsParser pass over the Tika
> metadata. Thoughts?
>
> Source Code:
> TikaParser.java (around line 184):
>         // populate Nutch metadata with Tika metadata
>         String[] TikaMDNames = tikamd.names();
>         for (String tikaMDName : TikaMDNames) {
>             if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
>                 continue;
>             // TODO what if multivalued?
>             nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
>         }
>
> MetaTagsParser.java (around line 104)
>     // check in the metadata first : the tika-parser
>     // might have stored the values there already
>     for (String mdName : metadata.names()) {
>       addIndexedMetatags(metadata, mdName, metadata.getValues(mdName));
>     }
>
>     Metadata generalMetaTags = metaTags.getGeneralTags();
>     for (String tagName : generalMetaTags.names()) {
>       addIndexedMetatags(metadata, tagName,
>           generalMetaTags.getValues(tagName));
>     }
>
> I hope this makes sense....
>
> jeff
>
>
>
> On Thu, Apr 30, 2015 at 11:22 AM, Jeff Cocking <[email protected]>
> wrote:
>
> > I am getting duplicate metatag.description values in my indexed results.
> > When running a parse checker, I am picking up both the meta
> > name=description and the meta property=og:description values.
> >
> > Has anyone else run into this issue? If so, how have you fixed it?
> >
> > If not, do you have any clues on how to resolve it?
> >
> > Thank you in advance,
> > jeff
> >
> >
> > configuration: Nutch 1.9
> > nutch-site.xml(partial):
> > <!-- Plugin Control statement -->
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-(prefix|suffix|regex)|feed|headings|parse-(tika|html|metatags)|urlmeta|index-(basic|anchor|metadata|img)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description></description>
> > </property>
> >
> > <!-- Parse Meta Tag parameters -->
> > <property>
> >   <name>metatags.names</name>
> >   <value>description</value>
> > </property>
> >
> > <!-- Parse - Tika Controls -->
> > <property>
> >   <name>tika.boilerpipe</name>
> >   <value>true</value>
> > </property>
> >
> > <property>
> >   <name>tika.boilerpipe.extractor</name>
> >   <value>JeffExtractor</value>
> > </property>
> >
> > <!-- Index-Metadata Plugin -->
> > <property>
> >   <name>index.parse.md</name>
> >   <value>metatag.description</value>
> > </property>
> > <property>
> >   <name>index.content.md</name>
> >   <value>description</value>
> > </property>
> >
>
