Hi,

It's a fine tool indeed. In my experience (not opinion), 1.4-dev is as stable as 1.3, with quality bug fixes and improvements. We use it (always the latest revision) in production, and I would certainly vote +1 if we were to release the current 1.4-dev as a new stable release.

If you're hesitant, which is a good quality, you can always use the tool in a local/development environment. There are no invasive changes in how plugins work, so using the tool in a dev environment would help you on your way.

Cheers

> Markus,
>
> Yes, I drooled over indexchecker enough that I briefly considered trying
> the development release, but I (for now) need to focus on a production
> quality product.
>
> In the meantime, LOG.info's scattered about the code will suffice for my
> needs.
>
> On 8/29/2011 7:00 PM, Markus Jelsma wrote:
> > In the current Nutch 1.4-dev you can check the output of the indexer by
> > using the indexchecker command. It takes a URL and displays the
> > values of the fields it's going to add.
> >
> >> Lewis,
> >>
> >> After shaking off the annoyance of your "RTFM Luke" answer (I had read
> >> the tutorial several times), I listened to your suggestion (I do respect
> >> my elders ... especially my 'application-elders') and I spent the
> >> weekend reading code, scanning the javadocs files and adding logging
> >> statements. Considering how poorly Nutch is documented (sparse, and
> >> what is documented probably refers to an old version), it was
> >> challenging, but worthwhile. What I found:
> >>
> >> That I deserve a kick in the head since I was only looking in the Nutch
> >> databases for the results of urlmeta. The Nutch databases, of course,
> >> no longer contain indexing information; the name URLMetaIndexingFilter
> >> (indexing !!!) should have told me.
> >>
> >> That still did not help when I looked in the Solr Index. After a lot of
> >> analysis and some logging statements later, I discovered that 'urlmeta'
> >> was not being loaded. The plugin.includes statement in the tutorial is
> >> incorrect. It is (fragment)
> >>
> >> ...|index-(basic|anchor|urlmeta)| ...
> >>
> >>
> >> and should be
> >>
> >> ...|index-(basic|anchor)|urlmeta| ...
> >>
> >> The name of the plugin is 'urlmeta' not index-urlmeta.
> >>
> >> Once I got urlmeta loaded, the indexing almost ran correctly. I got a
> >> Solr error complaining that a field was undefined ... the metadata
> >> fields that I was injecting. I solved that problem by adding the two
> >> fields I was injecting to the Solr schema.xml. With that, the indexing
> >> completed with no errors.
> >>
> >> I now (I think) understand how urlmeta works. I do have two questions,
> >> however.
> >>
> >> 1) Now that Solr is the official indexer for Nutch, are we still
> >> supposed to copy the Nutch schema over to Solr? The Solr schema has
> >> gotten very complicated recently and I am concerned about losing some
> >> Solr functionality.
> >>
> >> 2) What is the role of solrindex-mapping.xml? I only added my field
> >> names to the Solr schema.xml; I made no changes to the Nutch schema.xml
> >> nor made any changes to solrindex-mapping.xml.
> >>
> >> All in all, an interesting and educational weekend.
> >>
> >> /jb
> >>
> >> On 8/25/2011 5:11 AM, lewis john mcgibbney wrote:
> >>> Hi JB,
> >>>
> >>> We have recently finished a complete plugin tutorial which fully
> >>> explains the functionality of the urlmeta plugin on the wiki. It can
> >>> be found here [1]. Could I ask you to have a thorough look at it and
> >>> the code, and if you still have questions then please reinforce them.
> >>>
> >>> [1] http://wiki.apache.org/nutch/WritingPluginExample
> >>>
> >>> Thank you
> >>>
> >>> On Wed, Aug 24, 2011 at 9:36 PM, John R. Brinkema <[email protected]>
> >>> wrote:
> >>>> Hi all,
> >>>>
> >>>> I am trying to use URLmeta to inject meta data into documents that I
> >>>> crawl and I am having some problems.
> >>>>
> >>>> First the context: Nutch 1.3 with Solr 3.2
> >>>>
> >>>> My seed url file looks like:
> >>>>
> >>>> http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"
> >>>>
> >>>> I put JBmarker there so I could see where the metadata got put.
> >>>>
> >>>> Index.html itself is a table of contents of a guide; that is, it is
> >>>> mostly a list of outlinks to parts of the overall guide.
> >>>>
> >>>> My nutch-site.xml includes the following properties:
> >>>>
> >>>> <property>
> >>>> <name>plugin.includes</name>
> >>>> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>>> </property>
> >>>> <property>
> >>>> <name>urlmeta.tags</name>
> >>>> <value>recommended,keywords</value>
> >>>> </property>
> >>>>
> >>>> I fire up nutch to crawl and all goes well. To see what nutch did, I
> >>>> ran 'readseg -dump' and looked at the results. What I found was the
> >>>> following:
> >>>>
> >>>> ... other Recno's above ...
> >>>>
> >>>> Recno:: 56
> >>>> URL:: http://mySite.com/Guide/index.html
> >>>>
> >>>> CrawlDatum::
> >>>> Version: 7
> >>>> Status: 65 (signature)
> >>>> Fetch time: Tue Aug 23 10:08:18 EDT 2011
> >>>> Modified time: Wed Dec 31 19:00:00 EST 1969
> >>>> Retries since fetch: 0
> >>>> Retry interval: 0 seconds (0 days)
> >>>> Score: 1.0
> >>>> Signature: 5c182af41027766eccf1ea60d112772c
> >>>> Metadata:
> >>>>
> >>>> CrawlDatum::
> >>>> Version: 7
> >>>> Status: 1 (db_unfetched)
> >>>> Fetch time: Tue Aug 23 10:08:04 EDT 2011
> >>>> Modified time: Wed Dec 31 19:00:00 EST 1969
> >>>> Retries since fetch: 0
> >>>> Retry interval: 2592000 seconds (30 days)
> >>>> Score: 1.0
> >>>> Signature: null
> >>>> Metadata: recommended: Guide_ngt_: 1314108489210keywords:
> >>>> "Guide,Policy,JBmarker"
> >>>>
> >>>> Content::
> >>>> Version: -1
> >>>> url: http://mySite.com/Guide/index.html
> >>>> base: http://mySite.com/Guide/index.html
> >>>> ... lots more content ...
> >>>>
> >>>> CrawlDatum::
> >>>> Version: 7
> >>>> Status: 33 (fetch_success)
> >>>> Fetch time: Tue Aug 23 10:08:15 EDT 2011
> >>>> Modified time: Wed Dec 31 19:00:00 EST 1969
> >>>> Retries since fetch: 0
> >>>> Retry interval: 2592000 seconds (30 days)
> >>>> Score: 1.0
> >>>> Signature: null
> >>>> Metadata: recommended: Guide_ngt_: 1314108489210keywords:
> >>>> "Guide,Policy,JBmarker"_pst_: success(1), lastModified=0
> >>>>
> >>>> ParseData::
> >>>> Version: 5
> >>>> Status: success(1,0)
> >>>> Title: Guide
> >>>> Outlinks: 60
> >>>>
> >>>> outlink: toUrl: http://mySite.com/Home/About.html anchor: About Me
> >>>> outlink: toUrl: http://mySite.com/Guide/Contact_The_Guide.html anchor: Contact Me
> >>>>
> >>>> ... many more outlinks ...
> >>>> Content Metadata:
> >>>> nutch.content.digest=5c182af41027766eccf1ea60d112772c
> >>>> Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT
> >>>> Content-Length=28798 Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT
> >>>> nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20110823100811
> >>>> Content-Type=text/html
> >>>> Connection=close Server=Netscape-Enterprise/6.0
> >>>> Parse Metadata: CharEncodingForConversion=windows-1252
> >>>> OriginalCharEncoding=windows-1252
> >>>>
> >>>> ParseText::
> >>>> ... lots of parsed text ...
> >>>>
> >>>> Recno:: 57
> >>>>
> >>>> ... and so forth.
> >>>>
> >>>> JBmarker does not appear anywhere else, in this segment or any of the
> >>>> others.
> >>>>
> >>>> When I do a solrindex, JBmarker does not appear to be anywhere. ??
> >>>>
> >>>> *What I expected*
> >>>>
> >>>> As I understand URLmeta (as defined by the two nutch patches), the
> >>>> meta data that is included with the url is injected into the seed
> >>>> url; that is to say, it is as if the lines:
> >>>>
> >>>> <META NAME="recommended" CONTENT="Guide">
> >>>> <META NAME="keywords" CONTENT="Guide,Policy,JBmarker">
> >>>>
> >>>> were in the seed url content. Furthermore, it is as if those two
> >>>> lines were in all the outlink content of the seed url. So, I
> >>>> expected that when I looked at all the CrawlDatum and ParseData of
> >>>> the outlinks from the seed url, I would see the same meta data as in
> >>>> the seed CrawlDatum and ParseData.
> >>>>
> >>>> Which is clearly not the case.
> >>>>
> >>>> As for solrindex, I assume that I have some work to do to get any
> >>>> special metadata actions moved over to solr; a special plugin of some
> >>>> sort. That is, urlmeta does not help get the collected metadata from
> >>>> Nutch to Solr.
> >>>>
> >>>> So what is happening? Where did I go astray? Am I analyzing the
> >>>> Nutch dumps incorrectly?
> >>>>
> >>>> One other side note: I assume that Luke no longer will help me debug
> >>>> Nutch since it works with Lucene indexes and Nutch no longer creates
> >>>> such beasts.
> >>>>
> >>>> Are there any tools that help with viewing Nutch databases? It
> >>>> seems that Nutch takes some liberties with the data it is dumping
> >>>> (e.g., the meta tags all concatenated together without delimiters;
> >>>> I assume that internally, the meta tags are separated somehow).
> >>>>
> >>>> Thanks, as always.
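
For reference, here is a minimal sketch of the two nutch-site.xml properties from the original message with the plugin.includes correction described earlier in the thread applied: urlmeta is listed as its own alternative instead of inside the index-(...) group. All other plugin names are copied unchanged from the original message and are only an assumption about what a given setup needs.

```xml
<!-- nutch-site.xml (sketch). The plugin is named 'urlmeta', so it must be
     listed on its own; inside index-(...) the pattern would only match a
     plugin called 'index-urlmeta', which is why it was never loaded. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- Metadata keys taken from the tab-separated seed URL line. -->
  <name>urlmeta.tags</name>
  <value>recommended,keywords</value>
</property>
```

As reported in the thread, each tag named in urlmeta.tags also had to be defined as a field in the Solr schema.xml before solrindex completed without an undefined-field error.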

