Thank you for your response. My goal is to get Nutch to index meta tags. It's been quite an adventure so far!
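
To make the goal concrete, here is a minimal sketch of the extraction I am after, using the Keywords tag quoted further down this thread. It is plain Python with the standard library only; it is not Nutch code and not what the NUTCH-809 patch does internally. The names `MetaTagExtractor` and `extract_meta` are made up for illustration, and splitting the content on commas is my assumption about how the values should land in a multiValued Solr field.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect the content of <meta name="..."> tags whose name is wanted."""

    def __init__(self, wanted):
        super().__init__()
        # Compare names case-insensitively, since the page uses "Keywords".
        self.wanted = {name.lower() for name in wanted}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name in self.wanted and content:
            # Assumption: one value per comma-separated entry.
            values = [v.strip() for v in content.split(",") if v.strip()]
            self.found.setdefault(name, []).extend(values)


def extract_meta(html, wanted):
    """Return {lowercased tag name: [values]} for the wanted meta tags."""
    parser = MetaTagExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    page = ('<meta id="ctl00_hdrKeywords" name="Keywords" '
            'content="Banking, Savings, Student Loans" />')
    print(extract_meta(page, ["keywords"]))
```

Note that the keys come back lowercased, so the case of the Solr field name ("keywords" vs. "Keywords") would still have to be settled separately.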

On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Dean,
>
> Unfortunately nothing official. If you look you will see that this plugin
> (if eventually integrated), will combine with two other issues which all
> revolve roughly around the same area.
>
> I have never used this patch or any of the others.
>
> Anyone else?
>
> On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte <[email protected]> wrote:
>
> > Any documentation on how to use the patch at
> > https://issues.apache.org/jira/browse/NUTCH-809?
> >
> > My apologies for the newbie question.
> >
> > Thanks,
> >
> > Dean Del Ponte
> >
> > On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche <[email protected]> wrote:
> >
> > > Hi Matt,
> > >
> > > The plugin urlmeta does NOT extract the metadata from HTML pages. The
> > > 'meta' in its name means 'crawldb metadata'
> > >
> > > You need to use the patch in
> > > https://issues.apache.org/jira/browse/NUTCH-809
> > >
> > > HTH
> > >
> > > Julien
> > >
> > > On 26 September 2011 21:18, Wilson, Matt <[email protected]> wrote:
> > >
> > > > Also,
> > > >
> > > > In case this helps: I removed the Keywords field from the Solr schema
> > > > to see if it would generate an error when the SolrIndexer runs, and it
> > > > does not. This has led me to believe that Nutch is either not indexing
> > > > the meta content or it is not sending the update to Solr when
> > > > SolrIndexer runs.
> > > >
> > > > Matt Wilson
> > > >
> > > > -----Original Message-----
> > > > From: lewis john mcgibbney [mailto:[email protected]]
> > > > Sent: Monday, September 26, 2011 3:04 PM
> > > > To: [email protected]
> > > > Subject: Re: Indexing specific metadata tags with urlmeta
> > > >
> > > > Hi Matt,
> > > >
> > > > Try changing
> > > >
> > > > <field name="keywords" type="string" stored="true" indexed="true"
> > > > multiValued="true"/>
> > > >
> > > > to
> > > >
> > > > <field name="Keywords" type="string" stored="true" indexed="true"
> > > > multiValued="true"/>
> > > >
> > > > as per your metadata tags.
> > > >
> > > > We also have a configuration option in nutch-site.xml which you could
> > > > check out.
> > > >
> > > > <property>
> > > >   <name>urlmeta.tags</name>
> > > >   <value></value>
> > > >   <description>
> > > >     To be used in conjunction with features introduced in NUTCH-655,
> > > >     which allows for custom metatags to be injected alongside your
> > > >     crawl URLs. Specifying those custom tags here will allow for their
> > > >     propagation into a pages outlinks, as well as allow for them to be
> > > >     included as part of an index.
> > > >     Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad
> > > >     the tags with white-space at their boundaries, if you are using
> > > >     anything earlier than Hadoop-0.21.
> > > >   </description>
> > > > </property>
> > > >
> > > > On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt <[email protected]> wrote:
> > > >
> > > > > I am attempting to crawl a corporate intranet site and allow it to
> > > > > be searched in Solr. As part of the requirements I have to be able
> > > > > to index certain metadata tags as their own field in Solr (for
> > > > > faceted search). For example, the pages being crawled contain the
> > > > > following meta tag:
> > > > >
> > > > > <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking,
> > > > > Savings, Student Loans, CDs, Certificates of Deposit, Smart Option
> > > > > Loan, 529 Plans" />
> > > > >
> > > > > I have updated the nutch-site.xml with the following:
> > > > >
> > > > > <property>
> > > > >   <name>plugin.includes</name>
> > > > >   <value>urlmeta|protocol-httpclient|... </value>
> > > > > </property>
> > > > > <property>
> > > > >   <name>urlmeta.tags</name>
> > > > >   <value>keywords</value>
> > > > > </property>
> > > > >
> > > > > I have updated the Solr schema.xml with the following addition:
> > > > >
> > > > > <field name="keywords" type="string" stored="true" indexed="true"
> > > > > multiValued="true"/>
> > > > >
> > > > > I can see that the field has been created in Solr via the admin
> > > > > interface. I also see in hadoop.log that Nutch is loading the
> > > > > urlmeta plugin and adding the index filters. The problem is that
> > > > > Nutch does not appear to be indexing the keywords field. All of the
> > > > > pages crawled have the tag present and I am receiving no errors in
> > > > > the Nutch log. I am unsure as to what I am missing. This seems to be
> > > > > pretty straightforward; however, I must be misunderstanding either
> > > > > the urlmeta plugin or missing something in the configuration.
> > > >
> > > > --
> > > > *Lewis*
> > >
> > > --
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
>
> --
> *Lewis*

