>
> public class MetaIndexingFilter implements IndexingFilter {
>  ...
>  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>        ...
>        for(String metatag : metatags) {
>          doc.add(my_internet_pollution, meta_dreck);
>        }
>        ...
>      }
>  ...
> }
>
> The only minor thing I seem to be dealing with is pulling out a specific
> meta tag, from the getMetaData(), as it returns a Writable object--which was
> cleverly designed to fill me with a bottom-less, impotent rage.  I like that
> I can typecast it to a Text object, but not String.  Nor is there a
> toString() method, as I can't imagine such a thing having any use.
>

Cast it to a Text, then call toString() on the Text instance.
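
A minimal sketch of that cast, for illustration only. So that it compiles
without Hadoop on the classpath, Writable and Text below are simplified
stand-ins for org.apache.hadoop.io.Writable and org.apache.hadoop.io.Text,
and MetaLookup / metaAsString are hypothetical names, not Nutch API. In a
real filter the map would be whatever getMetaData() returns:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins (assumption): simplified versions of Hadoop's Writable and Text,
// only so this sketch is self-contained. The real classes do much more.
interface Writable {}

final class Text implements Writable {
    private final String value;
    Text(String value) { this.value = value; }
    @Override public String toString() { return value; }
    @Override public boolean equals(Object o) {
        return o instanceof Text && ((Text) o).value.equals(value);
    }
    @Override public int hashCode() { return value.hashCode(); }
}

// Hypothetical helper: not part of Nutch.
final class MetaLookup {
    /** Returns the value stored under key as a String, or null if it is absent or not a Text. */
    static String metaAsString(Map<Writable, Writable> meta, String key) {
        Writable raw = meta.get(new Text(key));   // keys are Writables too, so wrap the name
        return (raw instanceof Text) ? ((Text) raw).toString() : null;
    }
}
```

The instanceof check guards the cast, so a missing key or a value of another
Writable type yields null rather than a ClassCastException.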

> Question, if you've read this far: If I de-tard my code, is it something
> worthwhile-enough to be submitted into the nebulous depths of Apache Nutch?
>

The interesting part of your problem is how to propagate the metadata to all
the pages of a host. The best way would be to keep a separate list of hosts
and metadata then apply them to the whole crawldb just before indexing.
Let's call that a Domain or HostFeatureApplier. That would be a nice
contribution to Nutch and could be reused in 2.0 when we have a separate
table for storing host or domain info.
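
To make the idea a bit more concrete, here is a rough sketch of the lookup
half only: a host-to-metadata table consulted per URL. Nothing like this
exists in Nutch today; the class and method names are made up, the "www."
stripping is an assumption about how hosts should be matched, and the step
that would walk the crawldb before indexing is left out:

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: holds per-host metadata and looks it up by page URL.
final class HostFeatureApplier {
    private final Map<String, Map<String, String>> featuresByHost = new HashMap<>();

    /** Registers one key/value pair for a host, e.g. ("slashdot.org", "blawg_corp", "Geeknet"). */
    void addHostFeature(String host, String key, String value) {
        featuresByHost.computeIfAbsent(normalize(host), h -> new HashMap<>()).put(key, value);
    }

    /** Returns the metadata registered for the URL's host, or an empty map if there is none. */
    Map<String, String> featuresFor(String url) {
        String host = URI.create(url).getHost();  // throws IllegalArgumentException on malformed URLs
        if (host == null) {
            return Collections.<String, String>emptyMap();
        }
        Map<String, String> features = featuresByHost.get(normalize(host));
        return features == null ? Collections.<String, String>emptyMap() : features;
    }

    // Assumption: hosts match case-insensitively and a leading "www." is ignored.
    private static String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }
}
```

With the seed list from your mail, every page under slashdot.org would pick
up blawg_corp=Geeknet from this table. Other subdomains are not matched by
this sketch and would need their own entries or a domain-level rule.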

HTH

J.



>
> Thank you,
> Scott Gonyea
>
> On Jul 12, 2010, at 10:50 AM, Scott Gonyea wrote:
>
> > More questions are below your answers. (Thank you!)
> >
> > On Mon, Jul 12, 2010 at 1:34 AM, Julien Nioche <[email protected]> wrote:
> > Hi
> >
> > > I want to crawl URLs and Index them (I'm using Nutch+Solr+Ruby/Rails) and,
> > > when a search term is matched, I'd like to have arbitrary metadata be
> > > stored/associated with those results.  IE, suppose I crawl blogs and want to
> > > search for occurrences of "Android."  When I search the index that was
> > > collected, I'd like to have the parent company's name (for example) be
> > > returned with the URL whose index matched that query.
> > >
> >
> > Ok, so it would be a matter of having a field for storing this in SOLR.
> >
> > I imagine this would be the easy part--I threw it into the schema.xml.
> > The rtfm'ing would be from the Nutch side of things, given everything I've
> > rtf&m-'d (in no particular order).
> >
> > > __What I've Found/Done So Far__:
> > >
> > > NUTCH-655 Injecting Crawl metadata (jnioche)
> > > NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
> > > NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call
> > > scfilters.initialScore on newly created URL (jnioche)
> > >
> > > Apparently Julien Nioche is the god of all things metadata and I'd love to
> > > get a few minutes of his (or anyone else's) time, so that I can better
> > > understand how to fully take advantage of the above changes.
> > >
> >
> > Not really, but I can't resist a bit of flattery. Here is my 2-minute
> > answer:
> >
> >
> > >
> > > FILE>> nutch/urls/seed.txt:
> > > http://slashdot.org/    blawg_corp=Geeknet
> > > http://geek.com/        blawg_corp=Geeknet
> > > http://engadget.com/    blawg_corp=Weblogs
> > > http://gizmodo.com/     blawg_corp=Gawker
> > >
> >
> > I suppose that you want to propagate this feature to the subpages of the
> > sites above?
> >
> > Yes, please.  Basically, anything gathered within the given crawl should
> > have the "blawg_corp" stapled to it, that was originally provided with the
> > crawl URLs.
> >
> > >
> > > FILE>> nutch/conf/nutch-site.xml (Snippet)
> > > <configuration>
> > >  <property>
> > >   <name>db.parsemeta.to.crawldb</name>
> > >   <value> blawg_corp </value>
> > >   <description>Comma-separated list of parse metadata keys to transfer to
> > >    the crawldb (NUTCH-779).
> > >    Assuming for instance that the languageidentifier plugin is enabled,
> > >    setting the value to 'lang'
> > >    will copy both the key 'lang' and its value to the corresponding entry
> > >    in the crawldb.
> > >   </description>
> > >  </property>
> > > </configuration>
> > >
> >
> > This one is about sending metadata back from the parsing to the crawldb.
> > Since you've injected the metadata it is already in the crawldb. Can't see
> > why you'd need that unless you do something special during the parsing?
> >
> > Gotcha, then no- I don't need it.  That snuck in there as my "why isn't
> > this working?!" turned into desperation+googling.
> >
> > >
> > > So, clearly it's pulling in the nonsense I'm feeding it, but when querying
> > > inside Nutch (or anywhere else), it just does not get exposed
> >
> >
> > If you want it in SOLR you need to (in reverse chronological order) :
> > a) define the field in the solr schema
> > b) create an indexingfilter that will populate this field (e.g from the
> > parse or crawl metadata )
> > c) if necessary - propagate the tag to all the pages of a given host
> >
> >
> > a) That means just sticking a field in Solr's schema.xml, correct?  IE,
> >     <field name="blawg_corp" type="string" stored="true" indexed="true"/>
> >
> > b) To create an IndexingFilter, is that along the following lines:
> >
> > http://wiki.apache.org/nutch/WritingPluginExample
> > http://wiki.apache.org/nutch/HowToMakeCustomSearch
> >
> > http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/
> >
> > c) So, the data would not already propagate to the pages that were
> > crawled, with the metadata?
> >
> > I really appreciate any help or references you can give me on this. I've
> > been dealing with Nutch for about 4 days, so I apologize for my ignorance.
> > There's seemingly a lot of depth to Nutch that the documentation hasn't
> > exactly kept pace with.
> >
> > The end result is that when I run a query in Solr (or even Nutch), I'd
> > like to have the "blawg_corp" be returned with the given set of query
> > results.  Any guidelines/references you can point me to, to make that
> > happen, are very much appreciated.
> >
> > Thank you,
> > Scott Gonyea
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
