Storing Metadata with Crawled Sites

Scott Gonyea Sat, 10 Jul 2010 17:32:06 -0700

I've googled this relentlessly and am just coming up short.  If anyone can give 
me some help on this, I'd sincerely appreciate it and have every intent to give 
back, by helping to improve documentation on this topic.  Thank you, in 
advance, for any help you might offer.


Also, if I've done anything wrong or anything that is looked down upon (i.e., 
a breach of list etiquette), please let me know in a private e-mail so I don't 
do it again :)


__Problem Description__:

Preface:  I've oversimplified the problem I'm addressing, so I'm not looking
          for other ways to parse parent company names out of websites.
          Please assume that the arbitrary data cannot be divined from any
          external source, and that I must (therefore) provide this metadata
          myself.

I want to crawl URLs and index them (I'm using Nutch+Solr+Ruby/Rails) and, when 
a search term is matched, I'd like to have arbitrary metadata 
stored/associated with those results.  For example, suppose I crawl blogs and 
want to search for occurrences of "Android."  When I search the index that was 
collected, I'd like to have the parent company's name (for example) returned 
with each URL whose index entry matched that query.


__What I've Found/Done So Far__:

NUTCH-655 Injecting Crawl metadata (jnioche)
NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call
          scfilters.initialScore on newly created URL (jnioche)

Apparently Julien Nioche is the god of all things metadata and I'd love to get 
a few minutes of his (or anyone else's) time, so that I can better understand 
how to fully take advantage of the above changes.

FILE>> nutch/urls/seed.txt:
http://slashdot.org/    blawg_corp=Geeknet
http://geek.com/        blawg_corp=Geeknet
http://engadget.com/    blawg_corp=Weblogs
http://gizmodo.com/     blawg_corp=Gawker
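
To make the intent of that seed file concrete, here is a minimal Python sketch 
of how I read each line: a URL followed by key=value pairs.  This is only an 
illustration, not Nutch's own parsing code.  The names parse_seed_line and 
seed_metadata are made up for this example, and splitting on any whitespace is 
an assumption on my part; as far as I understand NUTCH-655, the Injector itself 
expects the fields to be separated by tabs.

```python
def parse_seed_line(line):
    """Split one seed line into (url, metadata dict), or None if it is blank."""
    # Assumption: fields are separated by tabs or runs of spaces, and every
    # field after the URL has the form key=value.
    fields = line.split()
    if not fields:
        return None
    url, metadata = fields[0], {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:  # skip fields that contain no '='
            metadata[key] = value
    return url, metadata


SEEDS = """\
http://slashdot.org/    blawg_corp=Geeknet
http://geek.com/        blawg_corp=Geeknet
http://engadget.com/    blawg_corp=Weblogs
http://gizmodo.com/     blawg_corp=Gawker
"""

# Map each seed URL to the metadata I want returned alongside its search hits.
seed_metadata = dict(
    parsed for parsed in map(parse_seed_line, SEEDS.splitlines()) if parsed
)
```

The resulting url-to-metadata mapping is what I would like to see attached to 
each matching search result.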

FILE>> nutch/conf/nutch-site.xml (Snippet)
<configuration>
  <property>
   <name>db.parsemeta.to.crawldb</name>
   <value> blawg_corp </value>
   <description>Comma-separated list of parse metadata keys to transfer to the
    crawldb (NUTCH-779). Assuming for instance that the languageidentifier
    plugin is enabled, setting the value to 'lang' will copy both the key
    'lang' and its value to the corresponding entry in the crawldb.
   </description>
  </property>
</configuration>


__I should write smaller e-mails__:

When grep'ing, the keywords I'm looking for all appear inside the files:
crawl/crawldb/current/part-00000/data
crawl/segments/20100710143125/crawl_generate/part-00000
crawl/segments/20100710143125/crawl_parse/part-00000
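
For anyone who wants to repeat that check without grep, the following is a 
rough Python equivalent.  It is only a sketch: files_containing is a 
hypothetical helper, and it simply looks for the raw bytes of the keyword.  
These are binary Hadoop data files, so a plain byte search like this only works 
if the values are stored uncompressed, which I am assuming here.

```python
from pathlib import Path


def files_containing(keyword, paths):
    """Return the paths whose raw bytes contain the UTF-8 encoded keyword."""
    needle = keyword.encode("utf-8")
    hits = []
    for path in map(Path, paths):
        # Missing files are skipped rather than treated as errors.
        if path.is_file() and needle in path.read_bytes():
            hits.append(str(path))
    return hits


# The same three files listed above.
CANDIDATES = [
    "crawl/crawldb/current/part-00000/data",
    "crawl/segments/20100710143125/crawl_generate/part-00000",
    "crawl/segments/20100710143125/crawl_parse/part-00000",
]

if __name__ == "__main__":
    print(files_containing("blawg_corp", CANDIDATES))
```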

So, clearly it's pulling in the nonsense I'm feeding it, but when querying 
inside Nutch (or anywhere else), it just does not get exposed.  My delusional 
state has me convinced that I am very close to solving this stupid problem, and 
that I just can't figure out the accessor in which it is kept.  My delusional 
state also has me convinced that I am a very large Mango, and I now live in 
fear of being eaten.

I really, really appreciate any help I get and will document this problem, so 
others don't have to go through the trouble that I have.

Thank you,
Scott Gonyea
