Re: Trying to understand and use URLmeta

John R. Brinkema Mon, 29 Aug 2011 15:28:22 -0700

Lewis,

After shaking off the annoyance of your "RTFM Luke" answer (I had read the tutorial several times), I listened to your suggestion (I do respect my elders ... especially my 'application-elders') and I spent the weekend reading code, scanning the javadocs files and adding logging statements. Considering how poorly Nutch is documented (sparse, and what is documented probably refers to an old version), it was challenging, but worthwhile. What I found:

That I deserve a kick in the head since I was only looking in the Nutch databases for the results of urlmeta. The Nutch databases, of course, no longer contain indexing information; the name URLMetaIndexingFilter (indexing !!!) should have told me.

That still did not help when I looked in the Solr index. A lot of analysis and some logging statements later, I discovered that 'urlmeta' was not being loaded. The plugin.includes statement in the tutorial is incorrect. It is (fragment)


...|index-(basic|anchor|'''urlmeta''')| ...


and should be

...|index-(basic|anchor)| urlmeta | ...

The name of the plugin is 'urlmeta' not index-urlmeta.
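
A minimal sketch of the corrected property, assuming the other plugins are the same ones listed in the original nutch-site.xml quoted further down; only the placement of urlmeta changes:

```xml
<!-- Sketch only: plugin list taken from the original message,
     with urlmeta moved out of the index-(...) group -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```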

Once I got urlmeta loaded, the indexing almost ran correctly. I got a Solr error complaining that a field was undefined ... the metadata fields that I was injecting. I solved that problem by adding the two fields I was injecting to the Solr schema.xml. With that, the indexing completed with no errors.
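
A sketch of what those two schema.xml additions might look like. The field names come from the urlmeta.tags setting in the original message; the type and the indexed/stored attributes are assumptions, since the exact definitions used are not stated here:

```xml
<!-- Assumed definitions: the "string" type and the indexed/stored
     values are illustrative, not the ones actually used -->
<field name="recommended" type="string" indexed="true" stored="true"/>
<field name="keywords" type="string" indexed="true" stored="true"/>
```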

I now (I think) understand how urlmeta works. I do have two questions, however.

1) Now that Solr is the official indexer for Nutch, are we still supposed to copy the Nutch schema over to Solr? The Solr schema has gotten very complicated recently and I am concerned about losing some Solr functionality.

2) What is the role of solrindex-mapping.xml? I only added my field names to the Solr schema.xml; I made no changes to the Nutch schema.xml, nor did I make any changes to solrindex-mapping.xml.


All in all, an interesting and educational weekend.

/jb


On 8/25/2011 5:11 AM, lewis john mcgibbney wrote:

Hi JB,

We have recently finished a complete plugin tutorial which fully explains
the functionality of the urlmeta plugin on the wiki. It can be found here
[1], could I ask you to have a thorough look at it, and the code and if you
still have questions then please reinforce them.

[1] http://wiki.apache.org/nutch/WritingPluginExample

Thank you

On Wed, Aug 24, 2011 at 9:36 PM, John R. Brinkema wrote:

Hi all,

I am trying to use URLmeta to inject meta data into documents that I crawl and
I am having some problems.

First the context:  Nutch 1.3 with Solr 3.2

My seed url file looks like:
http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"

I put JBmarker there so I could see where the metadata got put.

Index.html itself is a table of contents of a guide; that is, it is mostly
a list of outlinks to parts of the overall guide.

My nutch-site.xml includes the following properties:

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>urlmeta.tags</name>
<value>recommended,keywords</value>
</property>

I fire up nutch to crawl and all goes well.   To see what nutch did, I ran
'readseg -dump' and looked at the results.  What I found was the following:

... other Recno's above ...

Recno:: 56
URL:: http://mySite.com/Guide/index.html

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Tue Aug 23 10:08:18 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 5c182af41027766eccf1ea60d112772c
Metadata:

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Aug 23 10:08:04 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: recommended: Guide_ngt_: 1314108489210keywords:
"Guide,Policy,JBmarker"

Content::
Version: -1
url: http://mySite.com/Guide/index.html
base: http://mySite.com/Guide/index.html
... lots more content ...

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Tue Aug 23 10:08:15 EDT 2011
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: recommended: Guide_ngt_: 1314108489210keywords:
"Guide,Policy,JBmarker"_pst_: success(1), lastModified=0

ParseData::
Version: 5
Status: success(1,0)
Title: Guide
Outlinks: 60
  outlink: toUrl: http://mySite.com/Home/About.html anchor: About Me
  outlink: toUrl: http://mySite.com/Guide/Contact_The_Guide.html anchor: Contact Me
... many more outlinks ...
Content Metadata: nutch.content.digest=5c182af41027766eccf1ea60d112772c
Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT Content-Length=28798
Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT nutch.crawl.score=1.0 _fst_=33
nutch.segment.name=20110823100811 Content-Type=text/html
Connection=close Server=Netscape-Enterprise/6.0
Parse Metadata: CharEncodingForConversion=windows-1252
OriginalCharEncoding=windows-1252

ParseText::
... lots of parsed text ...

Recno::  57

... and so forth.

JBmarker does not appear anywhere else, in this segment or any of the
others.

When I do a solrindex, JBmarker does not appear to be anywhere.  ??

*What I expected*

As I understand URLmeta (as defined by the two nutch patches), the meta
data that is included with the url  is injected into the seed url; that is
to say, it is as if the lines:

<META NAME="recommended" CONTENT="Guide">
<META NAME="keywords" CONTENT="Guide,Policy,JBmarker">

were in the seed url content.  Furthermore,  it is as if those two lines
were in all the outlink content of the seed url.  So, I expected that when I
looked at all the CrawlDatum and ParseData of the outlinks from the seed
url, I would see the same meta data as in the seed CrawlDatum and ParseData.
  Which is clearly not the case.

As for solrindex, I assume that I have some work to do to get any special
metadata actions moved over to solr; a special plugin of some sort.  That
is, urlmeta does not help get the collected metadata from Nutch to Solr.

So what is happening?  Where did I go astray?  Am I analyzing the Nutch
dumps incorrectly?

One other side note:  I assume that Luke no longer will help me debug Nutch
since it works with Lucene indexes and Nutch no longer creates such beasts.
  Are there any tools that help with viewing Nutch databases?  It seems that
Nutch takes some liberties with the data it is dumping (e.g., the meta tags
all concatenated together without delimiters; I assume that internally, the
meta tags are separated somehow).

Thanks, as always.
