More questions are below your answers. (Thank you!)

On Mon, Jul 12, 2010 at 1:34 AM, Julien Nioche wrote:
> Hi
>
> > I want to crawl URLs and Index them (I'm using Nutch+Solr+Ruby/Rails) and,
> > when a search term is matched, I'd like to have arbitrary metadata be
> > stored/associated with those results. IE, suppose I crawl blogs and want to
> > search for occurrences of "Android." When I search the index that was
> > collected, I'd like to have the parent company's name (for example) be
> > returned with the URL who's index matched that query.
>
> Ok, so it would be a matter of having a field for storing this in SOLR.

I imagine this would be the easy part--I threw it into the schema.xml. The rtfm'ing would be from the Nutch side of things, given everything I've rtf&m-'d (in no particular order).

> > __What I've Found/Done So Far__:
> >
> > NUTCH-655 Injecting Crawl metadata (jnioche)
> > NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
> > NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call
> > scfilters.initialScore on newly created URL (jnioche)
> >
> > Apparently Julien Nioche is the god of all things metadata and I'd love to
> > get a few minutes of his (or anyone else's) time, so that I can better
> > understand how to fully take advantage of the above changes.
>
> not really but I can't resist a bit of flattery. Here is my 2 minutes answer
>
> > FILE>> nutch/urls/seed.txt:
> > http://slashdot.org/ blawg_corp=Geeknet
> > http://geek.com/ blawg_corp=Geeknet
> > http://engadget.com/ blawg_corp=Weblogs
> > http://gizmodo.com/ blawg_corp=Gawker
>
> I suppose that you want to propagate this feature to the subpages of the
> sites above?

Yes, please. Basically, anything gathered within the given crawl should have the "blawg_corp" stapled to it, that was originally provided with the crawl URLs.
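
To make the intended behaviour concrete, here is a minimal Python sketch of "staple the seed's blawg_corp onto everything crawled under that site". It is only a model of the idea, not Nutch code: the function names (parse_seed_line, host_metadata, metadata_for) are hypothetical, the seed lines are assumed to be whitespace-separated "URL key=value" entries as in the seed.txt quoted above, and matching is done on the exact hostname, so a page on www.engadget.com would not match a seed for engadget.com.

```python
from urllib.parse import urlparse

# Seed entries as quoted in the thread; the separator is assumed to be whitespace.
SEED_TEXT = """\
http://slashdot.org/ blawg_corp=Geeknet
http://geek.com/ blawg_corp=Geeknet
http://engadget.com/ blawg_corp=Weblogs
http://gizmodo.com/ blawg_corp=Gawker
"""


def parse_seed_line(line):
    """Split one seed line into (url, {key: value}) (hypothetical helper)."""
    parts = line.split()
    url, pairs = parts[0], parts[1:]
    meta = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    return url, meta


def host_metadata(seed_text):
    """Build a hostname -> metadata table from the seed list."""
    table = {}
    for line in seed_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        url, meta = parse_seed_line(line)
        table[urlparse(url).hostname] = meta
    return table


def metadata_for(url, table):
    """Return the seed metadata for the host of a crawled URL, or {} if unknown."""
    return table.get(urlparse(url).hostname, {})


HOSTS = host_metadata(SEED_TEXT)
print(metadata_for("http://slashdot.org/story/10/07/12/android", HOSTS))
```

A subpage of a seeded site picks up the seed's metadata; a URL from a host that was never in the seed list gets nothing.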
> > FILE>> nutch/conf/nutch-site.xml (Snippet)
> > <configuration>
> > <property>
> > <name>db.parsemeta.to.crawldb</name>
> > <value> blawg_corp </value>
> > <description>Comma-separated list of parse metadata keys to transfer to
> > the crawldb (NUTCH-779).
> > Assuming for instance that the languageidentifier plugin is enabled,
> > setting the value to 'lang'
> > will copy both the key 'lang' and its value to the corresponding entry
> > in the crawldb.
> > </description>
> > </property>
> > </configuration>
>
> This one is about sending metadata back from the parsing to the crawldb.
> Since you've injected the metadata it is already in the crawldb. Can't see
> why you'd need that unless you do something special during the parsing ?

Gotcha, then no- I don't need it. That snuck in there as my "why isn't this working?!" turned into desperation+googling.

> > So, clearly it's pulling in the nonsense I'm feeding it, but when querying
> > inside Nutch (or anywhere else), it just does not get exposed
>
> If you want it in SOLR you need to (in reverse chronological order) :
> a) define the field in the solr schema
> b) create an indexingfilter that will populate this field (e.g from the
> parse or crawl metadata )
> c) if necessary - propagate the tag to all the pages of a given host

a) That means just sticking a field in Solr's schema.xml, correct? IE,
<field name="blawg_corp" type="string" stored="true" indexed="true"/>

b) To create an IndexingFilter, is that along the following lines:
http://wiki.apache.org/nutch/WritingPluginExample
http://wiki.apache.org/nutch/HowToMakeCustomSearch
http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/

c) So, the data would not already propagate to the pages that were crawled, with the metadata?

I really appreciate any help or references you can give me on this. I've been dealing with Nutch for about 4 days, so I apologize for my ignorance.
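
As a rough picture of what step (b) amounts to: for each page being indexed, the filter looks up the injected key in that page's crawl metadata and, if present, copies it into a document field of the same name. In Nutch itself this would be a Java plugin implementing the IndexingFilter extension point; the Python below only models that logic with plain dictionaries, and add_metadata_fields and its arguments are hypothetical names, not part of any Nutch or Solr API.

```python
def add_metadata_fields(doc, crawl_meta, keys=("blawg_corp",)):
    """Return a copy of doc with the selected crawl-metadata keys added as fields.

    doc        -- dict standing in for the document sent to the index
    crawl_meta -- dict standing in for the page's crawl metadata
    keys       -- metadata keys to expose as index fields (assumed list)
    """
    enriched = dict(doc)  # leave the caller's document untouched
    for key in keys:
        value = crawl_meta.get(key)
        if value:  # skip pages that carry no such metadata
            enriched[key] = value
    return enriched


page = {"url": "http://gizmodo.com/", "title": "Gizmodo"}
indexed = add_metadata_fields(page, {"blawg_corp": "Gawker"})
print(indexed)
```

With the field also declared in schema.xml as in (a), a query result for that page would then carry blawg_corp alongside the URL.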
There's seemingly a lot of depth to Nutch that the documentation hasn't exactly kept pace with.

The end result is that when I run a query in Solr (or even Nutch), I'd like to have the "blawg_corp" value returned with the given set of query results. Any guidelines/references you can point me to, to make that happen, are very much appreciated.

Thank you,
Scott Gonyea

