I've googled this relentlessly and keep coming up short. If anyone can give me some help on this, I'd sincerely appreciate it, and I fully intend to give back by helping to improve the documentation on this topic. Thank you in advance for any help you might offer.
Also, if I've done anything wrong or frowned upon (i.e., a breach of etiquette),
please let me know in a private e-mail so I don't do it again :)
__Problem Description__:
Preface: I've oversimplified the problem I'm addressing, so I'm not looking
for other ways to parse parent company names out of websites. Please assume
that the arbitrary data cannot be divined from any external source, and that
I must (therefore) provide this metadata myself.
I want to crawl URLs and index them (I'm using Nutch+Solr+Ruby/Rails) and, when
a search term is matched, I'd like arbitrary metadata that I have associated
with each URL to be returned with those results. For example, suppose I crawl
blogs and want to search for occurrences of "Android." When I search the index
that was collected, I'd like the parent company's name (for example) to be
returned with each URL whose indexed content matched that query.
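To make the desired end result concrete, here is a minimal sketch of the kind of query the search side would issue once the metadata is available. It is an illustration only: the Solr base URL is a placeholder, and it assumes that a field named blawg_corp has been added to the Solr schema and is actually populated at indexing time, which is exactly the part that is not working yet.

```python
from urllib.parse import urlencode


def build_solr_query(term,
                     base_url="http://localhost:8983/solr/select",  # placeholder address
                     meta_field="blawg_corp"):  # assumed to exist in the Solr schema
    """Build a Solr select URL that returns each hit's URL plus one metadata field."""
    params = {
        "q": term,                    # the search term, e.g. "Android"
        "fl": f"url,{meta_field}",    # fields to return for each matching document
        "wt": "json",                 # ask Solr for a JSON response
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_solr_query("Android"))
```

Each matching document would then ideally carry both its url and its blawg_corp value in the response.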
__What I've Found/Done So Far__:
NUTCH-655 Injecting Crawl metadata (jnioche)
NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call
scfilters.initialScore on newly created URL (jnioche)
Apparently Julien Nioche is the god of all things metadata and I'd love to get
a few minutes of his (or anyone else's) time, so that I can better understand
how to fully take advantage of the above changes.
FILE>> nutch/urls/seed.txt:
http://slashdot.org/ blawg_corp=Geeknet
http://geek.com/ blawg_corp=Geeknet
http://engadget.com/ blawg_corp=Weblogs
http://gizmodo.com/ blawg_corp=Gawker
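One detail worth checking in the seed file: as far as the injector's documentation describes it, metadata added via NUTCH-655 is expected as key=value pairs separated from the URL (and from each other) by tabs. Whether the listing above contains tabs or spaces cannot be told from this e-mail. The sketch below generates seed lines with explicit tab separators; the helper names are made up for illustration.

```python
def seed_line(url, metadata):
    """Build one seed-file line: the URL, then tab-separated key=value pairs."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


def build_seed_file(seeds):
    """Join one line per URL into the full seed-file text."""
    return "\n".join(seed_line(url, meta) for url, meta in seeds.items()) + "\n"


# Same URLs and metadata as in the seed file above.
SEEDS = {
    "http://slashdot.org/": {"blawg_corp": "Geeknet"},
    "http://geek.com/": {"blawg_corp": "Geeknet"},
    "http://engadget.com/": {"blawg_corp": "Weblogs"},
    "http://gizmodo.com/": {"blawg_corp": "Gawker"},
}

if __name__ == "__main__":
    # Write the result to nutch/urls/seed.txt (or redirect stdout there).
    print(build_seed_file(SEEDS), end="")
```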
FILE>> nutch/conf/nutch-site.xml (Snippet)
<configuration>
  <property>
    <name>db.parsemeta.to.crawldb</name>
    <value> blawg_corp </value>
    <description>Comma-separated list of parse metadata keys to transfer to the
    crawldb (NUTCH-779).
    Assuming for instance that the languageidentifier plugin is enabled,
    setting the value to 'lang'
    will copy both the key 'lang' and its value to the corresponding entry in
    the crawldb.
    </description>
  </property>
</configuration>
__I should write smaller e-mails__:
When grep'ing, the keywords I'm looking for all appear inside the files:
crawl/crawldb/current/part-00000/data
crawl/segments/20100710143125/crawl_generate/part-00000
crawl/segments/20100710143125/crawl_parse/part-00000
So, clearly it's pulling in the nonsense I'm feeding it, but when querying
inside Nutch (or anywhere else), it just does not get exposed. My delusional
state has me convinced that I am very close to solving this stupid problem, and
that I just can't figure out which accessor exposes it. My delusional
state also has me convinced that I am a very large Mango, and I now live in
fear of being eaten.
I really, really appreciate any help I get and will document this problem, so
others don't have to go through the trouble that I have.
Thank you,
Scott Gonyea

