In the current Nutch 1.4-dev you can check the output of the indexer with
the indexchecker command. It takes a URL and displays the values of the
fields it is going to add.
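
On the plugin.includes fix in the quoted message below: the value is a regular expression that is matched against plugin names, which is why index-(basic|anchor|urlmeta) can only ever select a plugin called index-urlmeta. Here is a minimal Python sketch of that behaviour, assuming the whole name has to match. The helper name is_included and the shortened pattern strings are made up for illustration; this is not Nutch code.

```python
import re

# Shortened plugin.includes values, based on the fragments quoted below.
# Both strings are illustrative, not complete Nutch configurations.
TUTORIAL_VALUE = r"protocol-http|index-(basic|anchor|urlmeta)|scoring-opic"
CORRECTED_VALUE = r"protocol-http|index-(basic|anchor)|urlmeta|scoring-opic"


def is_included(plugin_id: str, includes: str) -> bool:
    """Return True if the whole plugin id matches the includes expression."""
    # Assumption: the entire id must match, not just a substring of it.
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for value in (TUTORIAL_VALUE, CORRECTED_VALUE):
        print(value, "-> urlmeta included:", is_included("urlmeta", value))
```

Under that assumption, the tutorial value never selects a plugin named urlmeta, while the corrected value does.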
 
> Lewis,
> 
> After shaking off the annoyance of your "RTFM Luke" answer (I had read
> the tutorial several times), I listened to your suggestion (I do respect
> my elders ... especially my 'application-elders') and I spent the
> weekend reading code,  scanning the javadocs files and adding logging
> statements.  Considering how poorly Nutch is documented (sparse, and
> what is documented probably refers to an old version), it was
> challenging, but worthwhile.  What I found:
> 
> That I deserve a kick in the head since I was only looking in the Nutch
> databases for the results of urlmeta.  The Nutch databases, of course,
> no longer contain indexing information; the name URLMetaIndexingFilter
> (indexing !!!) should have told me.
> 
> That still did not help when I looked in the Solr index.  After a lot of
> analysis and some logging statements, I discovered that 'urlmeta'
> was not being loaded.  The plugin.includes statement in the tutorial is
> incorrect.  It is (fragment)
> 
> ...|index-(basic|anchor|'''urlmeta''')| ...
> 
> 
> and should be
> 
> ...|index-(basic|anchor)|urlmeta| ...
> 
> The name of the plugin is 'urlmeta', not 'index-urlmeta'.
> 
> Once I got urlmeta loaded, the indexing almost ran correctly.  I got a
> Solr error complaining that a field was undefined ... the metadata
> fields that I was injecting.  I solved that problem by adding the two
> fields I was injecting to the Solr schema.xml.  With that, the indexing
> completed with no errors.
> 
> I now (I think) understand how urlmeta works.  I do have two questions,
> however.
> 
> 1) Now that Solr is the official indexer for Nutch, are we still
> supposed to copy the Nutch schema over to Solr?  The Solr schema has
> gotten very complicated recently and I am concerned about losing some
> Solr functionality.
> 
> 2) What is the role of solrindex-mapping.xml?  I only added my field
> names to the Solr schema.xml; I made no changes to the Nutch schema.xml
> nor made any changes to solrindex-mapping.xml.
> 
> All in all, an interesting and educational weekend.
> 
> /jb
> 
> On 8/25/2011 5:11 AM, lewis john mcgibbney wrote:
> > Hi JB,
> > 
> > We have recently finished a complete plugin tutorial on the wiki which
> > fully explains the functionality of the urlmeta plugin. It can be found
> > here [1]. Could I ask you to have a thorough look at it and at the code?
> > If you still have questions, please post them again.
> > 
> > [1] http://wiki.apache.org/nutch/WritingPluginExample
> > 
> > Thank you
> > 
> > On Wed, Aug 24, 2011 at 9:36 PM, John R. Brinkema <[email protected]> wrote:
> >> Hi all,
> >> 
> >> I am trying to use URLmeta to inject metadata into documents that I
> >> crawl and I am having some problems.
> >> 
> >> First the context:  Nutch 1.3 with Solr 3.2
> >> 
> >> My seed url file looks like:
> >> http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"
> >> 
> >> I put JBmarker there so I could see where the metadata got put.
> >> 
> >> Index.html itself is a table of contents of a guide; that is, it is
> >> mostly a list of outlinks to parts of the overall guide.
> >> 
> >> My nutch-site.xml includes the following properties:
> >> 
> >> <property>
> >> <name>plugin.includes</name>
> >> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >> </property>
> >> <property>
> >> <name>urlmeta.tags</name>
> >> <value>recommended,keywords</value>
> >> </property>
> >> 
> >> I fire up nutch to crawl and all goes well.   To see what nutch did, I
> >> ran 'readseg -dump' and looked at the results.  What I found was the
> >> following:
> >> 
> >> ... other Recno's above ...
> >> 
> >> Recno:: 56
> >> URL:: http://mySite.com/Guide/index.html
> >> 
> >> CrawlDatum::
> >> Version: 7
> >> Status: 65 (signature)
> >> Fetch time: Tue Aug 23 10:08:18 EDT 2011
> >> Modified time: Wed Dec 31 19:00:00 EST 1969
> >> Retries since fetch: 0
> >> Retry interval: 0 seconds (0 days)
> >> Score: 1.0
> >> Signature: 5c182af41027766eccf1ea60d112772c
> >> Metadata:
> >> 
> >> CrawlDatum::
> >> Version: 7
> >> Status: 1 (db_unfetched)
> >> Fetch time: Tue Aug 23 10:08:04 EDT 2011
> >> Modified time: Wed Dec 31 19:00:00 EST 1969
> >> Retries since fetch: 0
> >> Retry interval: 2592000 seconds (30 days)
> >> Score: 1.0
> >> Signature: null
> >> Metadata: recommended: Guide_ngt_: 1314108489210keywords:
> >> "Guide,Policy,JBmarker"
> >> 
> >> Content::
> >> Version: -1
> >> url: http://mySite.com/Guide/index.html
> >> base: http://mySite.com/Guide/index.html
> >> ... lots more content ...
> >> 
> >> CrawlDatum::
> >> Version: 7
> >> Status: 33 (fetch_success)
> >> Fetch time: Tue Aug 23 10:08:15 EDT 2011
> >> Modified time: Wed Dec 31 19:00:00 EST 1969
> >> Retries since fetch: 0
> >> Retry interval: 2592000 seconds (30 days)
> >> Score: 1.0
> >> Signature: null
> >> Metadata: recommended: Guide_ngt_: 1314108489210keywords:
> >> "Guide,Policy,JBmarker"_pst_: success(1), lastModified=0
> >> 
> >> ParseData::
> >> Version: 5
> >> Status: success(1,0)
> >> Title: Guide
> >> Outlinks: 60
> >> 
> >>   outlink: toUrl: http://mySite.com/Home/About.html anchor: About Me
> >>   outlink: toUrl: http://mySite.com/Guide/Contact_The_Guide.html anchor: Contact Me
> >> 
> >> ... many more outlinks ...
> >> Content Metadata:
> >> nutch.content.digest=5c182af41027766eccf1ea60d112772c
> >> Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT
> >> Content-Length=28798 Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT
> >> nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110823100811
> >> Content-Type=text/html
> >> Connection=close Server=Netscape-Enterprise/6.0
> >> Parse Metadata: CharEncodingForConversion=windows-1252
> >> OriginalCharEncoding=windows-1252
> >> 
> >> ParseText::
> >> ... lots of parsed text ...
> >> 
> >> Recno::  57
> >> 
> >> ... and so forth.
> >> 
> >> JBmarker does not appear anywhere else, in this segment or any of the
> >> others.
> >> 
> >> When I do a solrindex, JBmarker does not appear to be anywhere.  ??
> >> 
> >> *What I expected*
> >> 
> >> As I understand URLmeta (as defined by the two nutch patches), the meta
> >> data that is included with the url  is injected into the seed url; that
> >> is to say, it is as if the lines:
> >> 
> >> <META NAME="recommended" CONTENT="Guide">
> >> <META NAME="keywords" CONTENT="Guide,Policy,JBmarker">
> >> 
> >> were in the seed url content.  Furthermore,  it is as if those two lines
> >> were in all the outlink content of the seed url.  So, I expected that
> >> when I looked at all the CrawlDatum and ParseData of the outlinks from
> >> the seed url, I would see the same meta data as in the seed CrawlDatum
> >> and ParseData.
> >> 
> >>   Which is clearly not the case.
> >> 
> >> As for solrindex, I assume that I have some work to do to get any
> >> special metadata actions moved over to solr; a special plugin of some
> >> sort.  That is, urlmeta does not help get the collected metadata from
> >> Nutch to Solr.
> >> 
> >> So what is happening?  Where did I go astray?  Am I analyzing the Nutch
> >> dumps incorrectly?
> >> 
> >> One other side note:  I assume that Luke will no longer help me debug
> >> Nutch, since it works with Lucene indexes and Nutch no longer creates
> >> such beasts.
> >> 
> >> Are there any tools that help with viewing Nutch databases?  It seems
> >> that Nutch takes some liberties with the data it is dumping (e.g., the
> >> meta tags all concatenated together without delimiters; I assume that
> >> internally, the meta tags are separated somehow).
> >> 
> >> Thanks, as always.
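
For reference, the seed line in the original message above is a URL followed by tab-separated name=value pairs. A rough Python sketch of reading that layout follows, assuming the \t in the message stands for a literal tab and that values are kept exactly as typed, quotes included. The function name parse_seed_line is hypothetical, and this only illustrates the file format; it is not how Nutch itself is implemented.

```python
from typing import Dict, Tuple


def parse_seed_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split a seed line into its URL and its tab-separated name=value pairs."""
    url, *pairs = line.rstrip("\n").split("\t")
    metadata: Dict[str, str] = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        # Skip tokens that are not of the form name=value.
        if sep and name.strip():
            metadata[name.strip()] = value.strip()
    return url.strip(), metadata


if __name__ == "__main__":
    seed = ('http://mySite.com/Guide/index.html'
            '\trecommended="Guide"'
            '\tkeywords="Guide,Policy,JBmarker"')
    url, metadata = parse_seed_line(seed)
    print(url)
    for name, value in metadata.items():
        print(name, "=", value)
```

Keeping the pairs in a dictionary also avoids the delimiter problem seen in the dump, where the metadata entries run together.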
