Trying to understand and use URLmeta

John R. Brinkema Wed, 24 Aug 2011 13:37:00 -0700

Hi all,

I am trying to use URLmeta to inject metadata into documents that I crawl, and I am having some problems.


First the context:  Nutch 1.3 with Solr 3.2

My seed URL file looks like:

http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker";


I put JBmarker there so I could see where the metadata got put.
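
To make the seed line format concrete, here is a minimal sketch of how such a line could be split into a URL and its key=value metadata. It assumes the \t above stands for a real tab character; parse_seed_line is a hypothetical helper written for illustration, not part of Nutch, and it deliberately keeps the raw values so that any quotes or trailing ';' remain visible.

```python
def parse_seed_line(line):
    """Split a tab-separated seed line into (url, metadata dict).

    Illustrative only: this is not Nutch's injector code.
    """
    fields = line.rstrip("\n").split("\t")
    url = fields[0].strip()
    metadata = {}
    for field in fields[1:]:
        # Ignore fields that do not have a key=value shape.
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        # The raw value is kept as-is, including quotes and any trailing ';'.
        metadata[key.strip()] = value
    return url, metadata


# Assumption: the fields are separated by real tab characters.
seed = ('http://mySite.com/Guide/index.html'
        '\trecommended="Guide"'
        '\tkeywords="Guide,Policy,JBmarker";')
url, meta = parse_seed_line(seed)
print(url)
print(meta)
```

With a split like this, the quotes and the trailing semicolon end up inside the values unless something removes them, which may be worth keeping in mind when reading the dump.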

Index.html itself is a table of contents of a guide; that is, it is mostly a list of outlinks to parts of the overall guide.


My nutch-site.xml includes the following properties:

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>urlmeta.tags</name>
<value>recommended,keywords</value>
</property>

I fire up Nutch to crawl and all goes well. To see what Nutch did, I ran 'readseg -dump' and looked at the results. What I found was the following:


... other Recno's above ...

Recno:: 56
URL:: http://mySite.com/Guide/index.html

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Tue Aug 23 10:08:18 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 5c182af41027766eccf1ea60d112772c
Metadata:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Aug 23 10:08:04 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null

Metadata: recommended: Guide_ngt_: 1314108489210keywords:"Guide,Policy,JBmarker"


Content::
Version: -1
url: http://mySite.com/Guide/index.html
base: http://mySite.com/Guide/index.html
... lots more content ...

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Tue Aug 23 10:08:15 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null

Metadata: recommended: Guide_ngt_: 1314108489210keywords:"Guide,Policy,JBmarker"_pst_: success(1), lastModified=0


ParseData::
Version: 5
Status: success(1,0)
Title: Guide
Outlinks: 60
  outlink: toUrl: http://mySite.com/Home/About.html anchor: About Me

  outlink: toUrl: http://mySite.com/Guide/Contact_The_Guide.html anchor: Contact Me

... many more outlinks ...

Content Metadata: nutch.content.digest=5c182af41027766eccf1ea60d112772c Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT Content-Length=28798 Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110823100811 Content-Type=text/html Connection=close Server=Netscape-Enterprise/6.0
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252


ParseText::
... lots of parsed text ...

Recno::  57

... and so forth.

JBmarker does not appear anywhere else, in this segment or any of the others.
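
A check like this can be scripted rather than done by eye. The following is a minimal sketch that splits dump text on the "Recno::" lines shown above and lists the records containing a marker string. The function name records_containing and the abbreviated sample text are made up for illustration; a real dump would be read from wherever readseg wrote it.

```python
import re


def records_containing(dump_text, marker):
    """Return the Recno numbers of the records whose text contains marker."""
    # Each record starts with a line such as "Recno:: 56".
    parts = re.split(r"^Recno::[ \t]*(\d+)[ \t]*$", dump_text,
                     flags=re.MULTILINE)
    # parts is [preamble, recno_1, body_1, recno_2, body_2, ...].
    hits = []
    for i in range(1, len(parts) - 1, 2):
        if marker in parts[i + 1]:
            hits.append(int(parts[i]))
    return hits


# Abbreviated, made-up sample in the same layout as the dump above.
sample = """Recno:: 56
URL:: http://mySite.com/Guide/index.html
Metadata: recommended: Guide_ngt_: 1314108489210keywords:"Guide,Policy,JBmarker"

Recno::  57
URL:: http://mySite.com/Home/About.html
Metadata:
"""
print(records_containing(sample, "JBmarker"))
```

Running the same function over the dump of each segment would show which records, if any, carry the marker.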


When I run solrindex, JBmarker does not appear anywhere in the index.

*What I expected*

As I understand URLmeta (as defined by the two Nutch patches), the metadata that is included with the URL is injected into the seed URL; that is to say, it is as if the lines:


<META NAME="recommended" CONTENT="Guide">
<META NAME="keywords" CONTENT="Guide,Policy,JBmarker">

were in the seed URL content. Furthermore, it is as if those two lines were in all the outlink content of the seed URL. So I expected that when I looked at the CrawlDatum and ParseData of the outlinks from the seed URL, I would see the same metadata as in the seed CrawlDatum and ParseData. That is clearly not the case.
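
To state that expectation precisely, here is a toy model of the propagation described above. It is only a sketch of the expected behaviour, not Nutch code: propagate is a hypothetical function, and the tags list simply mirrors the urlmeta.tags value from the configuration.

```python
def propagate(parent_meta, outlinks, tags):
    """Copy only the configured tags from a parent page to each outlink.

    A model of the expected behaviour, not of how Nutch implements it.
    """
    inherited = {k: v for k, v in parent_meta.items() if k in tags}
    # Each outlink gets its own copy of the inherited metadata.
    return {url: dict(inherited) for url in outlinks}


tags = ["recommended", "keywords"]  # mirrors urlmeta.tags
seed_meta = {
    "recommended": "Guide",
    "keywords": "Guide,Policy,JBmarker",
    "_ngt_": "1314108489210",  # not listed in tags, so not inherited
}
outlinks = [
    "http://mySite.com/Home/About.html",
    "http://mySite.com/Guide/Contact_The_Guide.html",
]
result = propagate(seed_meta, outlinks, tags)
for url, meta in result.items():
    print(url, meta)
```

Under this model, every outlink record would show both recommended and keywords, which is what I was looking for in the dumps.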

As for solrindex, I assume that I have some work to do to get any special metadata moved over to Solr, perhaps with a special plugin of some sort. That is, urlmeta does not help get the collected metadata from Nutch to Solr.

So what is happening? Where did I go astray? Am I analyzing the Nutch dumps incorrectly?

One other side note: I assume that Luke will no longer help me debug Nutch, since it works with Lucene indexes and Nutch no longer creates such beasts. Are there any tools that help with viewing Nutch databases? It seems that Nutch takes some liberties with the data it is dumping (e.g., the meta tags are all concatenated together without delimiters; I assume that internally the meta tags are separated somehow).


Thanks, as always.
