A simpler option would be to use a modified version of https://issues.apache.org/jira/browse/NUTCH-830 to transfer the features you're interested in to the outlinks. That's also a good way of keeping the crawl within a limited number of hosts / domains.
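
To illustrate the outlink idea, here is a minimal standalone sketch. It is not the NUTCH-830 patch and does not use any Nutch classes; OutlinkMetadataSketch, Link, hostOf and propagate are hypothetical names made up for the example. It assumes a page's metadata is a plain string map, copies the chosen keys (e.g. blawg_corp) onto each outlink, and drops outlinks that leave the parent's host, which is one simple way of keeping the crawl within the seeded hosts.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: plain Java, no Nutch or Hadoop types.
public class OutlinkMetadataSketch {

    /** Stand-in for an outlink: a target URL plus the metadata attached to it. */
    public static final class Link {
        public final String url;
        public final Map<String, String> metadata = new LinkedHashMap<>();

        public Link(String url) {
            this.url = url;
        }
    }

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /**
     * Copies the selected metadata keys from the parent page to every outlink
     * on the same host; outlinks pointing to other hosts are dropped.
     */
    public static List<Link> propagate(String parentUrl,
                                       Map<String, String> parentMeta,
                                       List<String> outlinkUrls,
                                       Set<String> keysToCopy) {
        List<Link> kept = new ArrayList<>();
        String parentHost = hostOf(parentUrl);
        if (parentHost == null) {
            return kept; // nothing sensible to compare against
        }
        for (String target : outlinkUrls) {
            if (!parentHost.equals(hostOf(target))) {
                continue; // off-host link: skip it to keep the crawl contained
            }
            Link link = new Link(target);
            for (String key : keysToCopy) {
                String value = parentMeta.get(key);
                if (value != null) {
                    link.metadata.put(key, value);
                }
            }
            kept.add(link);
        }
        return kept;
    }
}
```

In a real plugin the same logic would sit wherever outlinks are generated, and the host comparison could be relaxed to a domain comparison if subdomains should inherit the tag.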

On 13 July 2010 17:33, Scott Gonyea <[email protected]> wrote:

> Awesome, thank you. I saw what you meant; I had cast it to a Text and
> gotten that far... But I then had to "new Text (a_string)" when getting it
> out of the getMetaData().get( ... ) crap.
>
> Do you have a place where you suggest I look, to implement that kind of a
> feature? I'm perfectly happy to do it, but any help that I can be given--on
> where to focus my efforts--would be a greatly appreciated time saver.
>
> I've parsed out the meta tags for the base URL--and saw what you meant about
> propagation. It'd be great on my end to have this feature, and be great for
> my long lost twin who's due to run into Nutch any minute now.
>
> Scott
>
> On Tue, Jul 13, 2010 at 2:19 AM, Julien Nioche <[email protected]> wrote:
>
> > > public class MetaIndexingFilter implements IndexingFilter {
> > >   ...
> > >   public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
> > >       CrawlDatum datum, Inlinks inlinks) throws IndexingException {
> > >     ...
> > >     for(String metatag : metatags) {
> > >       doc.add(my_internet_pollution, meta_dreck);
> > >     }
> > >     ...
> > >   }
> > >   ...
> > > }
> > >
> > > The only minor thing I seem to be dealing with is pulling out a specific
> > > meta tag, from the getMetaData(), as it returns a Writable object--which was
> > > cleverly designed to fill me with a bottom-less, impotent rage. I like that
> > > I can typecast it to a Text object, but not String. Nor is there a
> > > toString() method, as I can't imagine such a thing having any use.
> > >
> >
> > cast to a Text then call toString on the Text instance
> >
> > > Question, if you've read this far: If I de-tard my code, is it something
> > > worthwhile-enough to be submitted into the nebulous depths of Apache Nutch?
> > >
> >
> > The interesting part of your problem is how to propagate the metadata to all
> > the pages of a host.
> > The best way would be to keep a separate list of hosts
> > and metadata then apply them to the whole crawldb just before indexing.
> > Let's call that a Domain or HostFeatureApplier. That would be a nice
> > contribution to Nutch and could be reused in 2.0 when we have a separate
> > table for storing host or domain info
> >
> > HTH
> >
> > J.
> >
> > > Thank you,
> > > Scott Gonyea
> > >
> > > On Jul 12, 2010, at 10:50 AM, Scott Gonyea wrote:
> > >
> > > > More questions are below your answers. (Thank you!)
> > > >
> > > > On Mon, Jul 12, 2010 at 1:34 AM, Julien Nioche <[email protected]> wrote:
> > > > Hi
> > > >
> > > > > I want to crawl URLs and Index them (I'm using Nutch+Solr+Ruby/Rails) and,
> > > > > when a search term is matched, I'd like to have arbitrary metadata be
> > > > > stored/associated with those results. IE, suppose I crawl blogs and want to
> > > > > search for occurrences of "Android." When I search the index that was
> > > > > collected, I'd like to have the parent company's name (for example) be
> > > > > returned with the URL who's index matched that query.
> > > > >
> > > >
> > > > Ok, so it would be a matter of having a field for storing this in SOLR.
> > > >
> > > > I imagine this would be the easy part--I threw it into the schema.xml.
> > > > The rtfm'ing would be from the Nutch side of things, given everything I've
> > > > rtf&m-'d (in no particular order).
> > > >
> > > > > __What I've Found/Done So Far__:
> > > > >
> > > > > NUTCH-655 Injecting Crawl metadata (jnioche)
> > > > > NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
> > > > > NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call
> > > > > scfilters.initialScore on newly created URL (jnioche)
> > > > >
> > > > > Apparently Julien Nioche is the god of all things metadata and I'd love to
> > > > > get a few minutes of his (or anyone else's) time, so that I can better
> > > > > understand how to fully take advantage of the above changes.
> > > > >
> > > >
> > > > not really but I can't resist a bit of flattery. Here is my 2 minutes answer
> > > >
> > > > > FILE>> nutch/urls/seed.txt:
> > > > > http://slashdot.org/ blawg_corp=Geeknet
> > > > > http://geek.com/ blawg_corp=Geeknet
> > > > > http://engadget.com/ blawg_corp=Weblogs
> > > > > http://gizmodo.com/ blawg_corp=Gawker
> > > > >
> > > >
> > > > I suppose that you want to propagate this feature to the subpages of the
> > > > sites above?
> > > >
> > > > Yes, please. Basically, anything gathered within the given crawl should
> > > > have the "blawg_corp" stapled to it, that was originally provided with the
> > > > crawl URLs.
> > > >
> > > > > FILE>> nutch/conf/nutch-site.xml (Snippet)
> > > > > <configuration>
> > > > > <property>
> > > > > <name>db.parsemeta.to.crawldb</name>
> > > > > <value> blawg_corp </value>
> > > > > <description>Comma-separated list of parse metadata keys to transfer to
> > > > > the crawldb (NUTCH-779).
> > > > > Assuming for instance that the languageidentifier plugin is enabled,
> > > > > setting the value to 'lang'
> > > > > will copy both the key 'lang' and its value to the corresponding entry
> > > > > in the crawldb.
> > > > > </description>
> > > > > </property>
> > > > > </configuration>
> > > > >
> > > >
> > > > This one is about sending metadata back from the parsing to the crawldb.
> > > > Since you've injected the metadata it is already in the crawldb. Can't see
> > > > why you'd need that unless you do something special during the parsing ?
> > > >
> > > > Gotcha, then no- I don't need it. That snuck in there as my "why isn't
> > > > this working?!" turned into desperation+googling.
> > > >
> > > > > So, clearly it's pulling in the nonsense I'm feeding it, but when querying
> > > > > inside Nutch (or anywhere else), it just does not get exposed
> > > >
> > > > If you want it in SOLR you need to (in reverse chronological order) :
> > > > a) define the field in the solr schema
> > > > b) create an indexingfilter that will populate this field (e.g from the
> > > > parse or crawl metadata )
> > > > c) if necessary - propagate the tag to all the pages of a given host
> > > >
> > > > a) That means just sticking a field in Solr's schema.xml, correct? IE,
> > > > <field name="blawg_corp" type="string" stored="true" indexed="true"/>
> > > >
> > > > b) To create an IndexingFilter, is that along the following lines:
> > > >
> > > > http://wiki.apache.org/nutch/WritingPluginExample
> > > > http://wiki.apache.org/nutch/HowToMakeCustomSearch
> > > > http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/
> > > >
> > > > c) So, the data would not already propagate to the pages that were
> > > > crawled, with the metadata?
> > > >
> > > > I really appreciate any help or references you can give me on this. I've
> > > > been dealing with Nutch for about 4 days, so I apologize for my ignorance.
> > > > There's seemingly a lot of depth to Nutch that the documentation hasn't
> > > > exactly kept pace with.
> > > >
> > > > The end result is that when I run a query in Solr (or even Nutch), I'd
> > > > like to have the "blawg_corp" be returned with the given set of query
> > > > results. Any guidelines/references you can point me to, to make that
> > > > happen, is very much appreciated.
> > > >
> > > > Thank you,
> > > > Scott Gonyea
> >
> > --
> > DigitalPebble Ltd
> >
> > Open Source Solutions for Text Engineering
> > http://www.digitalpebble.com
>

--
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

