More questions are below your answers. (Thank you!)

On Mon, Jul 12, 2010 at 1:34 AM, Julien Nioche <
[email protected]> wrote:

> Hi
>
> > I want to crawl URLs and index them (I'm using Nutch+Solr+Ruby/Rails) and,
> > when a search term is matched, I'd like to have arbitrary metadata be
> > stored/associated with those results.  I.e., suppose I crawl blogs and want
> > to search for occurrences of "Android."  When I search the index that was
> > collected, I'd like to have the parent company's name (for example) be
> > returned with the URL whose index entry matched that query.
> >
>
> Ok, so it would be a matter of having a field for storing this in SOLR.


I imagine this would be the easy part; I already added the field to
schema.xml.  The part I still need to read up on is the Nutch side of things,
given everything I've read so far (listed below, in no particular order).


> > __What I've Found/Done So Far__:
> >
> > NUTCH-655 Injecting Crawl metadata (jnioche)
> > NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
> > NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher +
> > call scfilters.initialScore on newly created URL (jnioche)
> >
> > Apparently Julien Nioche is the god of all things metadata and I'd love
> > to get a few minutes of his (or anyone else's) time, so that I can better
> > understand how to fully take advantage of the above changes.
> >
>
> Not really, but I can't resist a bit of flattery. Here is my 2-minute
> answer:
>
>
> >
> > FILE>> nutch/urls/seed.txt:
> > http://slashdot.org/    blawg_corp=Geeknet
> > http://geek.com/        blawg_corp=Geeknet
> > http://engadget.com/    blawg_corp=Weblogs
> > http://gizmodo.com/     blawg_corp=Gawker
> >
>
> I suppose that you want to propagate this feature to the subpages of the
> sites above?


Yes, please.  Basically, every page gathered within a given crawl should
carry the "blawg_corp" value that was originally provided with its seed URL.

>
> > FILE>> nutch/conf/nutch-site.xml (Snippet)
> > <configuration>
> >  <property>
> >   <name>db.parsemeta.to.crawldb</name>
> >   <value> blawg_corp </value>
> >   <description>Comma-separated list of parse metadata keys to transfer to
> > the crawldb (NUTCH-779).
> >    Assuming for instance that the languageidentifier plugin is enabled,
> > setting the value to 'lang'
> >    will copy both the key 'lang' and its value to the corresponding entry
> > in the crawldb.
> >   </description>
> >  </property>
> > </configuration>
> >
>
> This one is about sending metadata back from the parsing to the crawldb.
> Since you've injected the metadata it is already in the crawldb. Can't see
> why you'd need that unless you do something special during the parsing ?


Gotcha, then no- I don't need it.  That snuck in there as my "why isn't this
working?!" turned into desperation+googling.


> >
> > So, clearly it's pulling in the nonsense I'm feeding it, but when
> > querying inside Nutch (or anywhere else), it just does not get exposed
>
>
> If you want it in SOLR you need to (in reverse chronological order) :
> a) define the field in the solr schema
> b) create an indexingfilter that will populate this field (e.g from the
> parse or crawl metadata )
> c) if necessary - propagate the tag to all the pages of a given host
>
>
a) That means just sticking a field in Solr's schema.xml, correct?  I.e.,
    <field name="blawg_corp" type="string" stored="true" indexed="true"/>

b) To create an IndexingFilter, would I follow something along these lines?

http://wiki.apache.org/nutch/WritingPluginExample
http://wiki.apache.org/nutch/HowToMakeCustomSearch
http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/

c) So the injected metadata would not automatically propagate from the seed
URLs to the other pages that get crawled from them?
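
To make (c) concrete, here is a rough sketch of how I picture the propagation: remember the metadata given for each seed host, then look it up for any page on that host. This is plain Java and does not use any Nutch API. The class name HostMetadata and its methods are made up for illustration, and it assumes each seed line is a URL followed by whitespace-separated key=value pairs, as in my seed.txt above.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: maps a seed host to its metadata.
class HostMetadata {
    // host -> (metadata key -> value)
    private final Map<String, Map<String, String>> byHost = new HashMap<>();

    // Parses one seed line, e.g. "http://slashdot.org/    blawg_corp=Geeknet".
    // Assumption: fields are separated by whitespace (tabs or spaces).
    void addSeedLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            return; // blank line
        }
        String host = hostOf(parts[0]);
        if (host == null) {
            return; // not a usable URL
        }
        Map<String, String> meta = byHost.computeIfAbsent(host, h -> new HashMap<>());
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                meta.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
        }
    }

    // Returns the metadata of the seed whose host exactly matches the URL's
    // host, or an empty map. "www.slashdot.org" would NOT match "slashdot.org".
    Map<String, String> lookup(String url) {
        String host = hostOf(url);
        Map<String, String> meta = (host == null) ? null : byHost.get(host);
        return (meta == null) ? Collections.<String, String>emptyMap() : meta;
    }

    // Extracts the lower-cased host, or null if the URL cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return (host == null) ? null : host.toLowerCase();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

With the four seed lines loaded, looking up any page under http://slashdot.org/ would give back blawg_corp=Geeknet. Whether this lookup belongs in a scoring filter, an indexing filter, or somewhere else in Nutch is exactly the part I am unsure about.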

I really appreciate any help or references you can give me on this. I've
been dealing with Nutch for about 4 days, so I apologize for my ignorance.
 There's seemingly a lot of depth to Nutch that the documentation hasn't
exactly kept pace with.

The end result is that when I run a query in Solr (or even Nutch), I'd like
to have the "blawg_corp" be returned with the given set of query results.
Any guidelines or references you can point me to, to make that happen, would
be very much appreciated.
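
In case it helps to show what I think step (b) boils down to, here is a minimal sketch of the copying logic alone. It is not a real Nutch plugin: in Nutch this would presumably live inside a class implementing the IndexingFilter extension point, whereas here plain Java maps stand in for the crawl metadata and for the document sent to Solr. The class name MetadataFieldCopier is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the body of an indexing filter; NOT Nutch API.
class MetadataFieldCopier {
    // Metadata keys that should become index fields, e.g. "blawg_corp".
    private final List<String> keys;

    MetadataFieldCopier(List<String> keys) {
        this.keys = new ArrayList<>(keys);
    }

    // Copies each configured key found in crawlMeta into doc as a field of
    // the same name. Keys that are missing or empty are skipped, and any
    // metadata not listed in 'keys' is left out of the document.
    Map<String, List<String>> copyInto(Map<String, List<String>> doc,
                                       Map<String, String> crawlMeta) {
        for (String key : keys) {
            String value = crawlMeta.get(key);
            if (value != null && !value.isEmpty()) {
                doc.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            }
        }
        return doc;
    }
}
```

The field name added to the document would have to match the name declared in Solr's schema.xml ("blawg_corp" in my case) for the value to come back with query results.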

Thank you,
Scott Gonyea
