Re: Storing Metadata with Crawled Sites

Julien Nioche Mon, 12 Jul 2010 01:36:17 -0700

Hi

> I want to crawl URLs and index them (I'm using Nutch+Solr+Ruby/Rails) and,
> when a search term is matched, I'd like to have arbitrary metadata be
> stored/associated with those results.  IE, suppose I crawl blogs and want to
> search for occurrences of "Android."  When I search the index that was
> collected, I'd like to have the parent company's name (for example) be
> returned with the URL whose index matched that query.
>


OK, so it is a matter of having a field in Solr to store this.


> __What I've Found/Done So Far__:
>
> NUTCH-655 Injecting Crawl metadata (jnioche)
> NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
> NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call
> scfilters.initialScore on newly created URL (jnioche)
>
> Apparently Julien Nioche is the god of all things metadata and I'd love to
> get a few minutes of his (or anyone else's) time, so that I can better
> understand how to fully take advantage of the above changes.
>

Not really, but I can't resist a bit of flattery. Here is my two-minute answer.


>
> FILE>> nutch/urls/seed.txt:
> http://slashdot.org/    blawg_corp=Geeknet
> http://geek.com/        blawg_corp=Geeknet
> http://engadget.com/    blawg_corp=Weblogs
> http://gizmodo.com/     blawg_corp=Gawker
>

I suppose that you want to propagate this metadata to the subpages of the
sites above?


>
> FILE>> nutch/conf/nutch-site.xml (Snippet)
> <configuration>
>  <property>
>   <name>db.parsemeta.to.crawldb</name>
>   <value> blawg_corp </value>
>   <description>Comma-separated list of parse metadata keys to transfer to
> the crawldb (NUTCH-779).
>    Assuming for instance that the languageidentifier plugin is enabled,
> setting the value to 'lang'
>    will copy both the key 'lang' and its value to the corresponding entry
> in the crawldb.
>   </description>
>  </property>
> </configuration>
>

This one is about sending metadata back from parsing to the crawldb.
Since you've injected the metadata, it is already in the crawldb. I can't see
why you'd need that unless you do something special during parsing.
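
For reference, the seed format from NUTCH-655 is one URL per line, optionally
followed by tab-separated key=value pairs. Here is a minimal sketch in plain
Java of how such a line breaks down into a URL plus metadata. The SeedLine
class is made up for illustration and is not the Injector's own code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: mimics the tab-separated seed format, not Nutch's Injector. */
final class SeedLine {
    final String url;
    final Map<String, String> metadata;

    private SeedLine(String url, Map<String, String> metadata) {
        this.url = url;
        this.metadata = metadata;
    }

    /** Splits "url<TAB>key=value<TAB>key=value" into a URL and its metadata. */
    static SeedLine parse(String line) {
        String[] parts = line.split("\t");
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq <= 0) {
                continue; // skip fragments with no '=' or with an empty key
            }
            meta.put(parts[i].substring(0, eq).trim(),
                     parts[i].substring(eq + 1).trim());
        }
        return new SeedLine(parts[0].trim(), meta);
    }

    public static void main(String[] args) {
        SeedLine seed = SeedLine.parse("http://slashdot.org/\tblawg_corp=Geeknet");
        System.out.println(seed.url + " -> " + seed.metadata);
    }
}
```

Each key/value pair parsed this way is what ends up attached to the URL's
entry in the crawldb after injection.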


>
> So, clearly it's pulling in the nonsense I'm feeding it, but when querying
> inside Nutch (or anywhere else), it just does not get exposed


If you want it in Solr you need to (in reverse chronological order):
a) define the field in the Solr schema
b) create an indexing filter that will populate this field (e.g. from the
parse or crawl metadata)
c) if necessary, propagate the tag to all the pages of a given host
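
To make steps (b) and (c) concrete, here is a rough sketch of the logic in
plain Java. It is not written against the Nutch plugin API: in a real setup
this would live inside an indexing filter, and the HostTagger class, its
method names and the "url" field are assumptions made up for illustration.
The field name blawg_corp would have to match the field declared in the Solr
schema in step (a).

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: plain Java standing in for the logic of an indexing filter. */
final class HostTagger {
    private final Map<String, String> tagByHost = new HashMap<>();

    /** Records the tag injected for a seed URL under that URL's host. */
    void addSeed(String seedUrl, String tag) {
        String host = hostOf(seedUrl);
        if (host != null) {
            tagByHost.put(host, tag);
        }
    }

    /** Step (c) then (b): look up the host's tag and add it to the document fields. */
    Map<String, String> toFields(String pageUrl) {
        Map<String, String> fields = new HashMap<>();
        fields.put("url", pageUrl);
        String host = hostOf(pageUrl);
        String tag = (host == null) ? null : tagByHost.get(host);
        if (tag != null) {
            fields.put("blawg_corp", tag);
        }
        return fields;
    }

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    private static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return (host == null) ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        HostTagger tagger = new HostTagger();
        tagger.addSeed("http://slashdot.org/", "Geeknet");
        System.out.println(tagger.toFields("http://slashdot.org/story/1"));
    }
}
```

Keying the tag on the host is only one way of doing (c). It keeps every
subpage of a seed site tagged without having to carry the metadata along each
outlink.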



> .  My delusional state has me convinced that I am very close to solving
> this stupid problem, and that I just can't figure out the accessor in which
> it is kept.  My delusional state also has me convinced that I am a very
> large Mango, and I now live in fear of being eaten.
>

This must be quite stressful.

J.

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
