Re: Storing Metadata with Crawled Sites

Scott Gonyea Mon, 12 Jul 2010 22:08:26 -0700

I believe I figured this out; I'll be working some more on it tomorrow.  I 
don't want you to waste your time telling me stuff that I (may) already know.  
If you can, just let me know if I'm on the right track.


I created my own Nutch plugin, which I called "index-meta".

Inside the nutch-site.xml, I made two changes...  The first is to the 
'plugin.includes' property, which now lists "index-(basic|meta)":

<property>
  <name>plugin.includes</name>
  <value>
    nutch-extensionpoints|index-(basic|meta)|protocol-http|parse-(text|html)|query-(basic|site|url)
  </value>
  <description>...</description>
</property>

Second, I created a new property:

<property>
  <name>index.meta_tags</name>
  <value>blawg_corp some_other_meta i_like_cake</value>
  <description>...</description>
</property>

I then parse this out, inside my plugin (Configuration conf nonsense) so I know 
what meta tags to look for.  In my IndexingFilter, I was able to enumerate all 
of the meta tags, through each Indexing pass.  In it were the tags I had been 
looking for.  I grabbed them out of the CrawlDatum.getMetaData().entrySet():

public class MetaIndexingFilter implements IndexingFilter {
  ...
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    ...
    for (String metatag : metatags) {
      doc.add(my_internet_pollution, meta_dreck);
    }
    ...
  }
  ...
}
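
As a rough illustration of the loop above, here is a minimal, self-contained 
sketch of reading the whitespace-separated index.meta_tags value and copying 
only the configured tags.  Everything named here is an assumption made for 
illustration: MetaTagCopier, parseTagNames and copyTags are hypothetical, and 
plain String maps stand in for the crawl metadata and the NutchDocument.  
Inside the plugin, the raw value would presumably come from 
conf.get("index.meta_tags") and each match would go through doc.add(...).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: plain collections stand in for
// the Hadoop Configuration, the CrawlDatum metadata and the NutchDocument.
class MetaTagCopier {

  /** Splits a whitespace-separated property value into tag names. */
  static List<String> parseTagNames(String raw) {
    List<String> names = new ArrayList<>();
    if (raw == null) {
      return names;
    }
    for (String name : raw.trim().split("\\s+")) {
      if (!name.isEmpty()) { // an all-blank value yields one empty token
        names.add(name);
      }
    }
    return names;
  }

  /** Returns the configured tags that are actually present in the metadata. */
  static Map<String, String> copyTags(List<String> tagNames,
      Map<String, String> metadata) {
    Map<String, String> fields = new LinkedHashMap<>();
    for (String tag : tagNames) {
      String value = metadata.get(tag);
      if (value != null) { // skip tags this URL was never given
        fields.put(tag, value);
      }
    }
    return fields;
  }
}
```

With the property value shown earlier and metadata that only carries 
blawg_corp, copyTags would return that single field and ignore the other two 
configured names.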

The only minor thing I seem to be dealing with is pulling out a specific meta 
tag, from the getMetaData(), as it returns a Writable object--which was 
cleverly designed to fill me with a bottom-less, impotent rage.  I like that I 
can typecast it to a Text object, but not String.  Nor is there a toString() 
method, as I can't imagine such a thing having any use.
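
For the Writable-to-String step, one approach that should work is to look the 
tag up with a Text key and call toString() on whatever comes back, since 
Hadoop's Text overrides toString() to return the decoded string.  The sketch 
below is self-contained and built on assumptions: the nested Writable and Text 
types are simplified stand-ins so it compiles without Hadoop on the classpath 
(in the plugin they would be org.apache.hadoop.io.Writable and 
org.apache.hadoop.io.Text), the map stands in for what getMetaData() returns, 
and WritableToString and metaValue are hypothetical names.  It also assumes 
the injected keys and values were stored as Text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the nested types only mimic the Hadoop classes.
class WritableToString {

  /** Stand-in for org.apache.hadoop.io.Writable. */
  interface Writable {}

  /** Stand-in for org.apache.hadoop.io.Text, which also overrides toString(). */
  static final class Text implements Writable {
    private final String value;

    Text(String value) {
      this.value = value;
    }

    @Override
    public String toString() {
      return value;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof Text && ((Text) other).value.equals(value);
    }

    @Override
    public int hashCode() {
      return value.hashCode();
    }
  }

  /** Looks up one tag by name; returns its value as a String, or null. */
  static String metaValue(Map<Writable, Writable> metadata, String tag) {
    // The key must be wrapped in Text: a bare String would never match.
    Writable raw = metadata.get(new Text(tag));
    return raw == null ? null : raw.toString();
  }
}
```

Looking up a single key this way avoids walking the whole entrySet() for every 
configured tag.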

I'm sure there's an intelligible reason for it all.  It's far more enjoyable to 
rage against the International Business Machine, until I can peaceably return 
to my former state of Ruby Zen...

Question, if you've read this far: If I clean up my code, is it something 
worthwhile enough to be submitted into the nebulous depths of Apache Nutch?

Thank you,
Scott Gonyea

On Jul 12, 2010, at 10:50 AM, Scott Gonyea wrote:

> More questions are below your answers. (Thank you!)
> 
> On Mon, Jul 12, 2010 at 1:34 AM, Julien Nioche 
> <[email protected]> wrote:
> Hi
> 
> > I want to crawl URLs and Index them (I'm using Nutch+Solr+Ruby/Rails) and,
> > when a search term is matched, I'd like to have arbitrary metadata be
> > stored/associated with those results.  IE, suppose I crawl blogs and want to
> > search for occurrences of "Android."  When I search the index that was
> > collected, I'd like to have the parent company's name (for example) be
> > returned with the URL whose index matched that query.
> >
> 
> Ok, so it would be a matter of having a field for storing this in SOLR.
> 
> I imagine this would be the easy part--I threw it into the schema.xml.  The 
> rtfm'ing would be from the Nutch side of things, given everything I've 
> rtf&m-'d (in no particular order).
>  
> > __What I've Found/Done So Far__:
> >
> > NUTCH-655 Injecting Crawl metadata (jnioche)
> > NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
> > NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call
> > scfilters.initialScore on newly created URL (jnioche)
> >
> > Apparently Julien Nioche is the god of all things metadata and I'd love to
> > get a few minutes of his (or anyone else's) time, so that I can better
> > understand how to fully take advantage of the above changes.
> >
> 
> not really but I can't resist a bit of flattery. Here is my 2 minutes answer
> 
> 
> >
> > FILE>> nutch/urls/seed.txt:
> > http://slashdot.org/    blawg_corp=Geeknet
> > http://geek.com/        blawg_corp=Geeknet
> > http://engadget.com/    blawg_corp=Weblogs
> > http://gizmodo.com/     blawg_corp=Gawker
> >
> 
> I suppose that you want to propagate this feature to the subpages of the
> sites above?
> 
> Yes, please.  Basically, anything gathered within the given crawl should have 
> the "blawg_corp" stapled to it, that was originally provided with the crawl 
> URLs.
> 
> >
> > FILE>> nutch/conf/nutch-site.xml (Snippet)
> > <configuration>
> >  <property>
> >   <name>db.parsemeta.to.crawldb</name>
> >   <value> blawg_corp </value>
> >   <description>Comma-separated list of parse metadata keys to transfer to
> > the crawldb (NUTCH-779).
> >    Assuming for instance that the languageidentifier plugin is enabled,
> > setting the value to 'lang'
> >    will copy both the key 'lang' and its value to the corresponding entry
> > in the crawldb.
> >   </description>
> >  </property>
> > </configuration>
> >
> 
> This one is about sending metadata back from the parsing to the crawldb.
> Since you've injected the metadata it is already in the crawldb. Can't see
> why you'd need that unless you do something special during the parsing ?
> 
> Gotcha, then no- I don't need it.  That snuck in there as my "why isn't this 
> working?!" turned into desperation+googling.
>  
> >
> > So, clearly it's pulling in the nonsense I'm feeding it, but when querying
> > inside Nutch (or anywhere else), it just does not get exposed
> 
> 
> If you want it in SOLR you need to (in reverse chronological order) :
> a) define the field in the solr schema
> b) create an indexingfilter that will populate this field (e.g from the
> parse or crawl metadata )
> c) if necessary - propagate the tag to all the pages of a given host
> 
> 
> a) That means just sticking a field in Solr's schema.xml, correct?  IE,
>     <field name="blawg_corp" type="string" stored="true" indexed="true"/>
> 
> b) To create an IndexingFilter, is that along the following lines:
> 
> http://wiki.apache.org/nutch/WritingPluginExample
> http://wiki.apache.org/nutch/HowToMakeCustomSearch
> http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/
> 
> c) So, the data would not already propagate to the pages that were crawled, 
> with the metadata?
> 
> I really appreciate any help or references you can give me on this. I've been 
> dealing with Nutch for about 4 days, so I apologize for my ignorance.  
> There's seemingly a lot of depth to Nutch that the documentation hasn't 
> exactly kept pace with.
> 
> The end result is that when I run a query in Solr (or even Nutch), I'd like 
> to have the "blawg_corp" be returned with the given set of query results.  
> Any guidelines/references you can point me to, to make that happen, are very 
> much appreciated.
> 
> Thank you,
> Scott Gonyea
