Re: Trying to understand and use URLmeta

Markus Jelsma Tue, 30 Aug 2011 15:03:08 -0700

Hi,

It's a fine tool indeed. In my experience (not opinion) 1.4-dev is as stable 
as 1.3, with quality bug fixes and improvements. We use it (always the latest 
revision) in production and I would certainly vote +1 if we were to release 
the current 1.4-dev as a new stable release.


If you're hesitant, which is a good quality, you can always use the tool in a 
local/development environment. There are no invasive changes in how plugins 
work, so using the tool in a dev environment would help you on your way.

Cheers

> Markus,
> 
> Yes, I drooled over indexchecker enough that I briefly considered trying
> the development release, but I (for now) need to focus on a production
> quality product.
> 
> In the meantime, LOG.info's scattered about the code will suffice for my
> needs.
> 
> On 8/29/2011 7:00 PM, Markus Jelsma wrote:
> > In the current Nutch 1.4-dev you can check the output of the indexer by
> > using the indexchecker command. It takes a URL and displays the
> > values of the fields it's going to add.
> > 
> >> Lewis,
> >> 
> >> After shaking off the annoyance of your "RTFM Luke" answer (I had read
> >> the tutorial several times), I listened to your suggestion (I do respect
> >> my elders ... especially my 'application-elders') and I spent the
> >> weekend reading code,  scanning the javadocs files and adding logging
> >> statements.  Considering how poorly Nutch is documented (sparse, and
> >> what is documented probably refers to an old version), it was
> >> challenging, but worthwhile.  What I found:
> >> 
> >> That I deserve a kick in the head since I was only looking in the Nutch
> >> databases for the results of urlmeta. The Nutch databases, of course,
> >> no longer contain indexing information; the name URLMetaIndexingFilter
> >> (indexing !!!) should have told me.
> >> 
> >> That still did not help when I looked in the Solr Index.  After a lot of
> >> analysis and some logging statements later, I discovered that 'urlmeta'
> >> was not being loaded.  The plugin.includes statement in the tutorial is
> >> incorrect.  It is (fragment)
> >> 
> >> ...|index-(basic|anchor|'''urlmeta''')| ...
> >> 
> >> 
> >> and should be
> >> 
> >> ...|index-(basic|anchor)| urlmeta | ...
> >> 
> >> The name of the plugin is 'urlmeta' not index-urlmeta.
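
A minimal sketch of why the tutorial's fragment never loads the plugin. It assumes that plugin.includes is a regular expression matched against the whole plugin id, and it uses only the fragments quoted above (with the spaces around urlmeta dropped). The helper name is_included is hypothetical and not part of Nutch.

```python
import re

# Fragments quoted in this thread, not the full plugin.includes value.
TUTORIAL_PATTERN = r"index-(basic|anchor|urlmeta)"
CORRECTED_PATTERN = r"index-(basic|anchor)|urlmeta"


def is_included(plugin_id: str, includes: str) -> bool:
    """Return True if the entire plugin id matches the includes regex."""
    return re.fullmatch(includes, plugin_id) is not None


# Print, for each plugin id, whether each pattern would include it.
for plugin_id in ("index-basic", "index-anchor", "urlmeta"):
    print(
        plugin_id,
        is_included(plugin_id, TUTORIAL_PATTERN),
        is_included(plugin_id, CORRECTED_PATTERN),
    )
```

Under that assumption the tutorial pattern only ever matches an id spelled index-urlmeta, so a plugin whose id is plain urlmeta is skipped without any error.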
> >> 
> >> Once I got urlmeta loaded, the indexing almost ran correctly.  I got a
> >> Solr error complaining that a field was undefined ... the metadata
> >> fields that I was injecting.  I solved that problem by adding the two
> >> fields I was injecting to the Solr schema.xml.  With that, the indexing
> >> completed with no errors.
> >> 
> >> I now (I think) understand how urlmeta works.  I do have two questions,
> >> however.
> >> 
> >> 1) Now that Solr is the official indexer for Nutch, are we still
> >> supposed to copy the Nutch schema over to Solr?  The Solr schema has
> >> gotten very complicated recently and I am concerned about losing some
> >> Solr functionality.
> >> 
> >> 2) What is the role of solrindex-mapping.xml? I only added my field
> >> names to the Solr schema.xml; I made no changes to the Nutch schema.xml
> >> nor made any changes to solrindex-mapping.xml.
> >> 
> >> All in all, an interesting and educational weekend.
> >> 
> >> /jb
> >> 
> >> On 8/25/2011 5:11 AM, lewis john mcgibbney wrote:
> >>> Hi JB,
> >>> 
> >>> We have recently finished a complete plugin tutorial on the wiki which
> >>> fully explains the functionality of the urlmeta plugin. It can be
> >>> found here [1]. Could I ask you to have a thorough look at it and at
> >>> the code, and if you still have questions, please raise them again.
> >>> 
> >>> [1] http://wiki.apache.org/nutch/WritingPluginExample
> >>> 
> >>> Thank you
> >>> 
> >>> On Wed, Aug 24, 2011 at 9:36 PM, John R. Brinkema <[email protected]>
> >>> wrote:
> >>>> Hi all,
> >>>> 
> >>>> I am trying to use URLmeta to inject meta data into documents that I
> >>>> crawl and I am having some problems.
> >>>> 
> >>>> First the context:  Nutch 1.3 with Solr 3.2
> >>>> 
> >>>> My seed URL file looks like:
> >>>> http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"
> >>>> 
> >>>> I put JBmarker there so I could see where the metadata got put.
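
A rough sketch of how such a seed line breaks down, assuming the injector format is a URL followed by tab-separated key=value pairs. The function name parse_seed_line is hypothetical and this is not Nutch's own parsing code. Quotes are kept verbatim as part of the value.

```python
from typing import Dict, Tuple


def parse_seed_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split a seed line into its URL and its tab-separated key=value metadata."""
    url, *pairs = line.rstrip("\n").split("\t")
    metadata: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if sep:  # skip any field that has no '=' in it
            metadata[key.strip()] = value.strip()
    return url.strip(), metadata


# The seed line from this message, with real tab characters as separators.
seed = 'http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"'
url, meta = parse_seed_line(seed)
print(url)
print(meta)
```

The keys that come out of this split are the names that urlmeta.tags would have to list for the values to be carried along.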
> >>>> 
> >>>> Index.html itself is a table of contents of a guide; that is, it is
> >>>> mostly a list of outlinks to parts of the overall guide.
> >>>> 
> >>>> My nutch-site.xml includes the following properties:
> >>>> 
> >>>> <property>
> >>>> <name>plugin.includes</name>
> >>>> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>>> </property>
> >>>> <property>
> >>>> <name>urlmeta.tags</name>
> >>>> <value>recommended,keywords</value>
> >>>> </property>
> >>>> 
> >>>> I fire up nutch to crawl and all goes well.   To see what nutch did, I
> >>>> ran 'readseg -dump' and looked at the results.  What I found was the
> >>>> following:
> >>>> 
> >>>> ... other Recno's above ...
> >>>> 
> >>>> Recno:: 56
> >>>> URL:: http://mySite.com/Guide/index.html
> >>>> 
> >>>> CrawlDatum::
> >>>> Version: 7
> >>>> Status: 65 (signature)
> >>>> Fetch time: Tue Aug 23 10:08:18 EDT 2011
> >>>> Modified time: Wed Dec 31 19:00:00 EST 1969
> >>>> Retries since fetch: 0
> >>>> Retry interval: 0 seconds (0 days)
> >>>> Score: 1.0
> >>>> Signature: 5c182af41027766eccf1ea60d112772c
> >>>> Metadata:
> >>>> 
> >>>> CrawlDatum::
> >>>> Version: 7
> >>>> Status: 1 (db_unfetched)
> >>>> Fetch time: Tue Aug 23 10:08:04 EDT 2011
> >>>> Modified time: Wed Dec 31 19:00:00 EST 1969
> >>>> Retries since fetch: 0
> >>>> Retry interval: 2592000 seconds (30 days)
> >>>> Score: 1.0
> >>>> Signature: null
> >>>> Metadata: recommended: Guide_ngt_: 1314108489210keywords:
> >>>> "Guide,Policy,JBmarker"
> >>>> 
> >>>> Content::
> >>>> Version: -1
> >>>> url: http://mySite.com/Guide/index.html
> >>>> base: http://mySite.com/Guide/index.html
> >>>> ... lots more content ...
> >>>> 
> >>>> CrawlDatum::
> >>>> Version: 7
> >>>> Status: 33 (fetch_success)
> >>>> Fetch time: Tue Aug 23 10:08:15 EDT 2011
> >>>> Modified time: Wed Dec 31 19:00:00 EST 1969
> >>>> Retries since fetch: 0
> >>>> Retry interval: 2592000 seconds (30 days)
> >>>> Score: 1.0
> >>>> Signature: null
> >>>> Metadata: recommended: Guide_ngt_: 1314108489210keywords:
> >>>> "Guide,Policy,JBmarker"_pst_: success(1), lastModified=0
> >>>> 
> >>>> ParseData::
> >>>> Version: 5
> >>>> Status: success(1,0)
> >>>> Title: Guide
> >>>> Outlinks: 60
> >>>> 
> >>>>    outlink: toUrl: http://mySite.com/Home/About.html anchor: About Me
> >>>>    outlink: toUrl: http://mySite.com/Guide/Contact_The_Guide.html anchor: Contact Me
> >>>> 
> >>>> ... many more outlinks ...
> >>>> Content Metadata:
> >>>> nutch.content.digest=5c182af41027766eccf1ea60d112772c
> >>>> Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT
> >>>> Content-Length=28798 Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT
> >>>> nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110823100811
> >>>> Content-Type=text/html
> >>>> Connection=close Server=Netscape-Enterprise/6.0
> >>>> Parse Metadata: CharEncodingForConversion=windows-1252
> >>>> OriginalCharEncoding=windows-1252
> >>>> 
> >>>> ParseText::
> >>>> ... lots of parsed text ...
> >>>> 
> >>>> Recno::  57
> >>>> 
> >>>> ... and so forth.
> >>>> 
> >>>> JBmarker does not appear anywhere else, in this segment or any of the
> >>>> others.
> >>>> 
> >>>> When I do a solrindex, JBmarker does not appear to be anywhere.  ??
> >>>> 
> >>>> *What I expected*
> >>>> 
> >>>> As I understand URLmeta (as defined by the two nutch patches), the
> >>>> meta data that is included with the url  is injected into the seed
> >>>> url; that is to say, it is as if the lines:
> >>>> 
> >>>> <META NAME="recommended" CONTENT="Guide">
> >>>> <META NAME="keywords" CONTENT="Guide,Policy,JBmarker">
> >>>> 
> >>>> were in the seed url content.  Furthermore,  it is as if those two
> >>>> lines were in all the outlink content of the seed url.  So, I
> >>>> expected that when I looked at all the CrawlDatum and ParseData of
> >>>> the outlinks from the seed url, I would see the same meta data as in
> >>>> the seed CrawlDatum and ParseData.
> >>>> 
> >>>>    Which is clearly not the case.
> >>>> 
> >>>> As for solrindex, I assume that I have some work to do to get any
> >>>> special metadata actions moved over to solr; a special plugin of some
> >>>> sort.  That is, urlmeta does not help get the collected metadata from
> >>>> Nutch to Solr.
> >>>> 
> >>>> So what is happening?  Where did I go astray?  Am I analyzing the
> >>>> Nutch dumps incorrectly?
> >>>> 
> >>>> One other side note:  I assume that Luke no longer will help me debug
> >>>> Nutch since it works with Lucene indexes and Nutch no longer creates
> >>>> such beasts.
> >>>> 
> >>>> Are there any tools that help with viewing Nutch databases?  It
> >>>> seems that Nutch takes some liberties with the data it is dumping
> >>>> (e.g., the meta tags all concatenated together without delimiters; I
> >>>> assume that internally, the meta tags are separated somehow).
> >>>> 
> >>>> Thanks, as always.
