Re: Indexing specific metadata tags with urlmeta

Vijith Kumar V Thu, 12 Jan 2012 05:15:42 -0800

Hi all,

I am having a few problems using the urlmeta plugin as described on the
plugin wiki.


What I have tried so far:

- Followed the wiki page to use the urlmeta plugin with a <meta> tag in one
of my web pages, but the tag was not indexed as I expected. It did not even
show up in the readseg dumps.

- Then I tried giving the metatags along with the URLs in the seed file
(tab-separated). The meta tags showed up in the dump, but querying
Solr gave no results.

- Then I applied the NUTCH-809 patch, rebuilt Nutch, etc., to see whether
that works. Same result as the first case: the tag is not in the metadata
field (readseg dump) and not in the Solr results.

- Checking the index with Luke showed NO field named "keywords" (my metatag
name)

So what could be the issue here?
Also, I would like to know the difference between the urlmeta and
index-metatags plugins and their exact uses.
I am a bit confused: the urlmeta wiki talks about adding a <meta> tag to
your HTML page, yet the plugin still does not index it,
and there is another plugin, index-metatags, for the same purpose.

I am a newbie to this, so please help.
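
For reference, the seed-file approach from my second attempt can be sketched
roughly as below. This is only a sketch under assumptions: the tab-separated
"url<TAB>key=value" layout is my understanding of what the injector accepts
(NUTCH-655), the helper names seed_line and parse_seed_line are hypothetical,
and example.com is a placeholder URL.

```python
# Hypothetical helpers, not part of Nutch: they only illustrate the assumed
# seed-file layout "url<TAB>key=value<TAB>key=value".

def seed_line(url, metadata):
    """Build one seed-file line from a URL and a dict of metadata tags."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


def parse_seed_line(line):
    """Split a seed-file line back into (url, metadata dict)."""
    url, *pairs = line.rstrip("\n").split("\t")
    # Split on the first "=" only, so values may themselves contain "=".
    metadata = dict(pair.split("=", 1) for pair in pairs)
    return url, metadata


if __name__ == "__main__":
    line = seed_line("http://example.com/", {"keywords": "banking"})
    print(repr(line))
    print(parse_seed_line(line))
```

Metadata supplied this way lives in the crawldb rather than in the page's
HTML, which would be consistent with it showing up in the dump.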


On Thu, Jan 12, 2012 at 5:20 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Elisabeth please see my comments on issue.
>
> Thanks again
>
> Lewis
>
> On Thu, Jan 12, 2012 at 9:15 AM, Elisabeth Adler
> <[email protected]> wrote:
>
> > Hi Dean,
> > I added my documentation and bundled plugin to jira (
> > https://issues.apache.org/jira/browse/NUTCH-809),
> > hope this helps.
> >
> >
> > On 11.01.2012 22:44, Dean Del Ponte wrote:
> >
> >> Thank-you for your response.
> >>
> >> My goal is to get Nutch to index meta tags.  It's been quite an adventure
> >> so far!
> >>
> >> On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney<
> >> [email protected]>  wrote:
> >>
> >>  Hi Dean,
> >>>
> >>> Unfortunately nothing official. If you look you will see that this plugin
> >>> (if eventually integrated), will combine with two other issues which all
> >>> revolve roughly around the same area.
> >>>
> >>> I have never used this patch or any of the others.
> >>>
> >>> Anyone else?
> >>>
> >>> On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte <[email protected]> wrote:
> >>>
> >>>> Any documentation on how to use the patch at
> >>>> https://issues.apache.org/jira/browse/NUTCH-809
> >>>> ?
> >>>>
> >>>> My apologies for the newbie question.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Dean Del Ponte
> >>>>
> >>>> On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche<
> >>>> [email protected]>  wrote:
> >>>>
> >>>>  Hi Matt,
> >>>>>
> >>>>> The plugin urlmeta does NOT extract the metadata from HTML pages. The
> >>>>> 'meta' in its name means 'crawldb metadata'
> >>>>>
> >>>>> You need to use the patch in
> >>>>> https://issues.apache.org/jira/browse/NUTCH-809
> >>>>>
> >>>>> HTH
> >>>>>
> >>>>> Julien
> >>>>>
> >>>>>
> >>>>> On 26 September 2011 21:18, Wilson, Matt <[email protected]> wrote:
> >>>>>
> >>>>>> Also,
> >>>>>>
> >>>>>> In case this helps.  I removed the Keywords field from the solr schema to
> >>>>>> see if it would generate an error when the SolrIndexer runs and it does not.
> >>>>>> This has led me to believe that nutch is either not indexing the meta
> >>>>>> content or it is not sending the update to solr when SolrIndexer runs.
> >>>>>>
> >>>>>> Matt Wilson
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: lewis john mcgibbney [mailto:[email protected]]
> >>>>>> Sent: Monday, September 26, 2011 3:04 PM
> >>>>>> To: [email protected]
> >>>>>> Subject: Re: Indexing specific metadata tags with urlmeta
> >>>>>>
> >>>>>> Hi Matt,
> >>>>>>
> >>>>>> Try changing
> >>>>>>
> >>>>>> <field name="keywords" type="string" stored="true" indexed="true"
> >>>>>> multiValued="true"/>
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> <field name="Keywords" type="string" stored="true" indexed="true"
> >>>>>> multiValued="true"/>  as per your metadata tags.
> >>>>>>
> >>>>>> We also have a configuration option in nutch-site.xml which you could
> >>>>>> check out.
> >>>>>>
> >>>>>> <property>
> >>>>>>  <name>urlmeta.tags</name>
> >>>>>>  <value></value>
> >>>>>>  <description>
> >>>>>>    To be used in conjunction with features introduced in NUTCH-655, which
> >>>>>>    allows for custom metatags to be injected alongside your crawl URLs.
> >>>>>>    Specifying those custom tags here will allow for their propagation into
> >>>>>>    a pages outlinks, as well as allow for them to be included as part of
> >>>>>>    an index. Values should be comma-delimited. ("tag1,tag2,tag3") Do not
> >>>>>>    pad the tags with white-space at their boundaries, if you are using
> >>>>>>    anything earlier than Hadoop-0.21.
> >>>>>>  </description>
> >>>>>> </property>
> >>>>>>
> >>>>>> On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> I am attempting to crawl a corporate intranet site and allow it to be
> >>>>>>> searched in solr.  As part of the requirements I have to be able to index
> >>>>>>> certain metadata tags as their own field in solr (for faceted search).  For
> >>>>>>> example, the pages being crawled contain the following meta tag:
> >>>>>>>
> >>>>>>> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> >>>>>>> Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"
> >>>>>>> />
> >>>>>>>
> >>>>>>> I have updated the nutch-site.xml with the following:
> >>>>>>>
> >>>>>>> <property>
> >>>>>>>    <name>plugin.includes</name>
> >>>>>>>    <value>urlmeta|protocol-httpclient|...</value>
> >>>>>>> </property>
> >>>>>>> <property>
> >>>>>>>    <name>urlmeta.tags</name>
> >>>>>>>    <value>keywords</value>
> >>>>>>> </property>
> >>>>>>>
> >>>>>>> I have updated the solr schema.xml with the following addition:
> >>>>>>>
> >>>>>>> <field name="keywords" type="string" stored="true" indexed="true"
> >>>>>>> multiValued="true"/>
> >>>>>>>
> >>>>>>> I can see that the field has been created in Solr via the admin interface.
> >>>>>>> I also see that nutch is loading the urlmeta plugin and adding the
> >>>>>>> indexfilters etc in the hadoop.log.  The problem is that nutch does not
> >>>>>>> appear to be indexing the keywords field.  All of the pages crawled have the
> >>>>>>> tag present and I am receiving no errors in the nutch log.  I am unsure as
> >>>>>>> to what I am missing.  This seems to be pretty straightforward; however, I
> >>>>>>> must be misunderstanding either the urlmeta plugin or missing something in
> >>>>>>> the configuration.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> *Lewis*
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> *
> >>>>> *Open Source Solutions for Text Engineering
> >>>>>
> >>>>> http://digitalpebble.blogspot.com/
> >>>>> http://www.digitalpebble.com
> >>>>>
> >>>>>
> >>>
> >>> --
> >>> *Lewis*
> >>>
> >>>
>
>
> --
> *Lewis*
>



-- 
*Thanks & Regards*
*
*
*Vijith V*
