Re: Indexing specific metadata tags with urlmeta

Dean Del Ponte Wed, 11 Jan 2012 13:45:17 -0800

Thank you for your response.

My goal is to get Nutch to index meta tags.  It's been quite an adventure
so far!


On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Dean,
>
> Unfortunately nothing official. If you look you will see that this plugin
> (if eventually integrated), will combine with two other issues which all
> revolve roughly around the same area.
>
> I have never used this patch or any of the others.
>
> Anyone else?
>
> On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte <[email protected]
> >wrote:
>
> > Any documentation on how to use the patch at
> > https://issues.apache.org/jira/browse/NUTCH-809?
> >
> > My apologies for the newbie question.
> >
> > Thanks,
> >
> > Dean Del Ponte
> >
> > On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche <
> > [email protected]> wrote:
> >
> > > Hi Matt,
> > >
> > > The plugin urlmeta does NOT extract the metadata from HTML pages. The
> > > 'meta' in its name means 'crawldb metadata'.
> > >
> > > You need to use the patch in
> > > https://issues.apache.org/jira/browse/NUTCH-809
> > >
> > > HTH
> > >
> > > Julien
> > >
> > >
> > > On 26 September 2011 21:18, Wilson, Matt <[email protected]
> > > >wrote:
> > >
> > > > Also,
> > > >
> > > > > In case this helps.  I removed the Keywords field from the solr schema
> > > > > to see if it would generate an error when the SolrIndexer runs and it
> > > > > does not.  This has led me to believe that nutch is either not indexing
> > > > > the meta content or it is not sending the update to solr when
> > > > > SolrIndexer runs.
> > > > Matt Wilson
> > > >
> > > > -----Original Message-----
> > > > From: lewis john mcgibbney [mailto:[email protected]]
> > > > Sent: Monday, September 26, 2011 3:04 PM
> > > > To: [email protected]
> > > > Subject: Re: Indexing specific metadata tags with urlmeta
> > > >
> > > > Hi Matt,
> > > >
> > > > Try changing
> > > >
> > > > <field name="keywords" type="string" stored="true" indexed="true"
> > > > multiValued="true"/>
> > > >
> > > > to
> > > >
> > > > <field name="Keywords" type="string" stored="true" indexed="true"
> > > > multiValued="true"/> as per your metadata tags.
> > > >
> > > > > We also have a configuration option in nutch-site.xml which you could
> > > > > check out.
> > > >
> > > > > <property>
> > > > >  <name>urlmeta.tags</name>
> > > > >  <value></value>
> > > > >  <description>
> > > > >    To be used in conjunction with features introduced in NUTCH-655, which
> > > > >    allows for custom metatags to be injected alongside your crawl URLs.
> > > > >    Specifying those custom tags here will allow for their propagation into
> > > > >    a page's outlinks, as well as allow for them to be included as part of
> > > > >    an index. Values should be comma-delimited. ("tag1,tag2,tag3") Do not
> > > > >    pad the tags with white-space at their boundaries, if you are using
> > > > >    anything earlier than Hadoop-0.21.
> > > > >  </description>
> > > > > </property>
> > > >
> > > > On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt
> > > > <[email protected]>wrote:
> > > >
> > > > > > I am attempting to crawl a corporate intranet site and allow it to be
> > > > > > searched in solr.  As part of the requirements I have to be able to
> > > > > > index certain metadata tags as their own field in solr (for faceted
> > > > > > search).  For example, the pages being crawled contain the following
> > > > > > meta tag:
> > > > >
> > > > > > <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> > > > > > Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"
> > > > > > />
> > > > >
> > > > > I have updated the nutch-site.xml with the following:
> > > > >
> > > > > <property>
> > > > >    <name>plugin.includes</name>
> > > > >    <value>urlmeta|protocol-httpclient|... </value>
> > > > > </property>
> > > > > <property>
> > > > >    <name>urlmeta.tags</name>
> > > > >    <value>keywords</value>
> > > > > </property>
> > > > >
> > > > > I have updated the solr schema.xml with the following addition:
> > > > >
> > > > > <field name="keywords" type="string" stored="true" indexed="true"
> > > > > multiValued="true"/>
> > > > >
> > > > > > I can see that the field has been created in Solr via the admin
> > > > > > interface.  I also see that nutch is loading the urlmeta plugin and
> > > > > > adding the indexfilters etc in the hadoop.log.  The problem is that
> > > > > > nutch does not appear to be indexing the keywords field.  All of the
> > > > > > pages crawled have the tag present and I am receiving no errors in the
> > > > > > nutch log.  I am unsure as to what I am missing.  This seems to be
> > > > > > pretty straightforward; however, I must be misunderstanding either the
> > > > > > urlmeta plugin or missing something in the configuration.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *Lewis*
> > > >
> > > >
> > > > This E-Mail has been scanned for viruses.
> > > >
> > >
> > >
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > >
> >
>
>
>
> --
> *Lewis*
>
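
The distinction Julien draws in the thread (urlmeta carries crawldb metadata and does not read HTML meta tags) can be illustrated outside Nutch. What follows is a minimal Python sketch, not Nutch code and not the NUTCH-809 patch, of what pulling the Keywords tag from Matt's example page into a multi-valued field would roughly involve. The names MetaTagExtractor and split_keywords are hypothetical, and lower-casing the tag name is an assumption made here so that "Keywords" and "keywords" land on the same key.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs (hypothetical helper, not part of Nutch)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            # Assumption: normalise the name so "Keywords" and "keywords" match.
            self.meta[name.lower()] = content


def split_keywords(content):
    """Split a comma-separated content attribute into trimmed, non-empty values."""
    return [part.strip() for part in content.split(",") if part.strip()]


# The meta tag quoted in Matt's original message.
html = ('<meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings, '
        'Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans" />')

extractor = MetaTagExtractor()
extractor.feed(html)
keywords = split_keywords(extractor.meta["keywords"])
print(keywords)
```

Each element of the resulting list would correspond to one value of a multiValued Solr field such as the one declared in the schema snippet quoted above. Nothing in this sketch reflects how the patch itself is implemented.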
