Re: Indexing specific metadata tags with urlmeta

Lewis John Mcgibbney Thu, 12 Jan 2012 03:51:42 -0800

Hi Elisabeth, please see my comments on the issue.

Thanks again


Lewis

On Thu, Jan 12, 2012 at 9:15 AM, Elisabeth Adler
<[email protected]> wrote:

> Hi Dean,
> I added my documentation and bundled plugin to JIRA
> (https://issues.apache.org/jira/browse/NUTCH-809), hope this helps.
>
>
> On 11.01.2012 22:44, Dean Del Ponte wrote:
>
>> Thank-you for your response.
>>
>> My goal is to get Nutch to index meta tags.  It's been quite an adventure
>> so far!
>>
>> On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney<
>> [email protected]>  wrote:
>>
>>> Hi Dean,
>>>
>>> Unfortunately nothing official. If you look you will see that this plugin
>>> (if eventually integrated), will combine with two other issues which all
>>> revolve roughly around the same area.
>>>
>>> I have never used this patch or any of the others.
>>>
>>> Anyone else?
>>>
>>> On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte <[email protected]> wrote:
>>>
>>>> Any documentation on how to use the patch at
>>>> https://issues.apache.org/jira/browse/NUTCH-809 ?
>>>>
>>>> My apologies for the newbie question.
>>>>
>>>> Thanks,
>>>>
>>>> Dean Del Ponte
>>>>
>>>> On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche<
>>>> [email protected]>  wrote:
>>>>
>>>>> Hi Matt,
>>>>>
>>>>> The plugin urlmeta does NOT extract the metadata from HTML pages. The
>>>>> 'meta' in its name means 'crawldb metadata'.
>>>>>
>>>>> You need to use the patch in
>>>>> https://issues.apache.org/jira/browse/NUTCH-809
>>>>>
>>>>> HTH
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 26 September 2011 21:18, Wilson, Matt <[email protected]> wrote:
>>>>>
>>>>>> Also,
>>>>>>
>>>>>> In case this helps.  I removed the Keywords field from the solr schema
>>>>>> to see if it would generate an error when the SolrIndexer runs and it
>>>>>> does not.  This has led me to believe that nutch is either not indexing
>>>>>> the meta content or it is not sending the update to solr when
>>>>>> SolrIndexer runs.
>>>>>>
>>>>>> Matt Wilson
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: lewis john mcgibbney [mailto:[email protected]]
>>>>>> Sent: Monday, September 26, 2011 3:04 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: Indexing specific metadata tags with urlmeta
>>>>>>
>>>>>> Hi Matt,
>>>>>>
>>>>>> Try changing
>>>>>>
>>>>>> <field name="keywords" type="string" stored="true" indexed="true"
>>>>>> multiValued="true"/>
>>>>>>
>>>>>> to
>>>>>>
>>>>>> <field name="Keywords" type="string" stored="true" indexed="true"
>>>>>> multiValued="true"/>
>>>>>>
>>>>>> as per your metadata tags.
>>>>>>
>>>>>> We also have a configuration option in nutch-site.xml which you could
>>>>>> check out.
>>>>>>
>>>>>> <property>
>>>>>>  <name>urlmeta.tags</name>
>>>>>>  <value></value>
>>>>>>  <description>
>>>>>>    To be used in conjunction with features introduced in NUTCH-655,
>>>>>>    which allows for custom metatags to be injected alongside your
>>>>>>    crawl URLs. Specifying those custom tags here will allow for their
>>>>>>    propagation into a pages outlinks, as well as allow for them to be
>>>>>>    included as part of an index.
>>>>>>    Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the
>>>>>>    tags with white-space at their boundaries, if you are using
>>>>>>    anything earlier than Hadoop-0.21.
>>>>>>  </description>
>>>>>> </property>
>>>>>>
>>>>>> On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> I am attempting to crawl a corporate intranet site and allow it to be
>>>>>>> searched in solr.  As part of the requirements I have to be able to
>>>>>>> index certain metadata tags as their own field in solr (for faceted
>>>>>>> search).  For example, the pages being crawled contain the following
>>>>>>> meta tag:
>>>>>>>
>>>>>>> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking,
>>>>>>> Savings, Student Loans, CDs, Certificates of Deposit, Smart Option
>>>>>>> Loan, 529 Plans" />
>>>>>>>
>>>>>>> I have updated the nutch-site.xml with the following:
>>>>>>>
>>>>>>> <property>
>>>>>>>    <name>plugin.includes</name>
>>>>>>>    <value>urlmeta|protocol-httpclient|...</value>
>>>>>>> </property>
>>>>>>> <property>
>>>>>>>    <name>urlmeta.tags</name>
>>>>>>>    <value>keywords</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> I have updated the solr schema.xml with the following addition:
>>>>>>>
>>>>>>> <field name="keywords" type="string" stored="true" indexed="true"
>>>>>>> multiValued="true"/>
>>>>>>>
>>>>>>> I can see that the field has been created in Solr via the admin
>>>>>>> interface.  I also see that nutch is loading the urlmeta plugin and
>>>>>>> adding the indexfilters etc in the hadoop.log.  The problem is that
>>>>>>> nutch does not appear to be indexing the keywords field.  All of the
>>>>>>> pages crawled have the tag present and I am receiving no errors in
>>>>>>> the nutch log.  I am unsure as to what I am missing.  This seems to
>>>>>>> be pretty straightforward; however, I must be misunderstanding either
>>>>>>> the urlmeta plugin or missing something in the configuration.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>>
>>>>>
>>>>> --
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>>>
>>>>>
>>>
>>> --
>>> *Lewis*
>>>
>>>


-- 
*Lewis*
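
For anyone reading the thread later, here is a minimal sketch of what
urlmeta is for, going by the urlmeta.tags description and Julien's reply
quoted above. It covers metadata attached to seed URLs at injection time
(NUTCH-655), not <meta> tags found in the HTML. The URL, the tag name
"section" and its value are made up for illustration, and the seed-line
syntax (tab-separated name=value pairs after the URL) is my understanding
of the injector, so please check it against your Nutch version.

```xml
<!-- Hypothetical seed-file line, fields separated by TAB characters:
       http://intranet.example.com/    section=banking
     "section" and the URL are illustration values only. -->

<!-- nutch-site.xml: urlmeta must be in plugin.includes, and the tag name
     is listed in urlmeta.tags (comma-delimited, no padding). -->
<property>
  <name>urlmeta.tags</name>
  <value>section</value>
</property>

<!-- Solr schema.xml: assumption, a field whose name matches the tag
     exactly, including case. -->
<field name="section" type="string" stored="true" indexed="true"/>
```

With this setup the value comes from the crawldb entry for the seed URL
and is propagated to its outlinks. Indexing the "Keywords" meta tag from
the page content itself still needs the NUTCH-809 patch.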
