Re: Storing Metadata with Crawled Sites

Scott Gonyea Tue, 13 Jul 2010 09:35:16 -0700

Awesome, thank you.  I saw what you meant; I had cast it to a Text and
gotten that far... But I then had to "new Text (a_string)" when getting it
out of the getMetaData().get( ... ) crap.
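
For anyone hitting the same wall, here is a minimal sketch of the lookup described above: the key has to be wrapped in a Text, and the value comes back as a Writable that is cast to Text before calling toString(). The Writable and Text types below are tiny stand-ins written for this example so that it compiles without Hadoop on the classpath; in Nutch they would be org.apache.hadoop.io.Writable and org.apache.hadoop.io.Text, and the map would come from datum.getMetaData(). MetaLookup and getString are hypothetical names, not part of Nutch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.io.Writable (assumption: illustration only).
interface Writable {}

// Stand-in for org.apache.hadoop.io.Text (assumption: illustration only).
final class Text implements Writable {
    private final String value;

    Text(String value) { this.value = value; }

    @Override public String toString() { return value; }

    @Override public boolean equals(Object o) {
        return o instanceof Text && ((Text) o).value.equals(value);
    }

    @Override public int hashCode() { return value.hashCode(); }
}

// Hypothetical helper, not a Nutch class.
final class MetaLookup {
    /** Returns the metadata value stored under key as a String, or null if absent. */
    static String getString(Map<Writable, Writable> metadata, String key) {
        // Keys are Text instances, so a plain String key would never match.
        Writable raw = metadata.get(new Text(key));
        if (raw == null) {
            return null;
        }
        // Cast to Text, then call toString() on the Text instance.
        return ((Text) raw).toString();
    }

    public static void main(String[] args) {
        Map<Writable, Writable> metadata = new HashMap<>();
        metadata.put(new Text("blawg_corp"), new Text("Geeknet"));
        System.out.println(getString(metadata, "blawg_corp"));
    }
}
```

The cast throws a ClassCastException if the stored value is not a Text, so an instanceof check may be safer when other Writable types can end up in the metadata.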


Do you have a place where you suggest I look, to implement that kind of a
feature?  I'm perfectly happy to do it, but any help that I can be given--on
where to focus my efforts--would be a greatly appreciated time saver.

I've parsed out the meta tags for the base URL--and saw what you meant about
propagation.  It'd be great on my end to have this feature, and be great for
my long lost twin who's due to run into Nutch any minute now.
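
As a starting point, the idea quoted below (keep a separate list of hosts and metadata, then apply it just before indexing) could be sketched roughly as follows. This is plain Java with no Nutch dependencies, and it only shows the host-to-metadata lookup; wiring it into the crawldb or an IndexingFilter is left out. The class name borrows the "HostFeatureApplier" label from the quoted reply, but every method here, and the choice to ignore a leading "www.", is an assumption made for illustration.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not an existing Nutch class.
final class HostFeatureApplier {
    // host -> (metadata key -> value), e.g. "slashdot.org" -> {blawg_corp=Geeknet}
    private final Map<String, Map<String, String>> featuresByHost = new HashMap<>();

    /** Registers one metadata key/value pair for a host. */
    void addHostFeature(String host, String key, String value) {
        featuresByHost
            .computeIfAbsent(normalize(host), h -> new HashMap<>())
            .put(key, value);
    }

    /** Returns the metadata registered for the URL's host, or an empty map. */
    Map<String, String> featuresFor(String url) {
        // URI.create throws IllegalArgumentException on malformed URLs.
        String host = URI.create(url).getHost();
        if (host == null) {
            return Collections.emptyMap();
        }
        Map<String, String> features = featuresByHost.get(normalize(host));
        return features == null ? Collections.emptyMap() : features;
    }

    // Assumption: hosts compare case-insensitively and "www." is ignored.
    private static String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    public static void main(String[] args) {
        HostFeatureApplier applier = new HostFeatureApplier();
        applier.addHostFeature("slashdot.org", "blawg_corp", "Geeknet");
        applier.addHostFeature("gizmodo.com", "blawg_corp", "Gawker");
        // A subpage picks up the metadata registered for its host.
        System.out.println(applier.featuresFor("http://www.slashdot.org/story/1"));
    }
}
```

An indexing filter could then call featuresFor(url.toString()) and add each returned pair to the document, so the seed-level metadata never has to be copied onto every crawldb entry by hand.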

Scott

On Tue, Jul 13, 2010 at 2:19 AM, Julien Nioche <
[email protected]> wrote:

> >
> > public class MetaIndexingFilter implements IndexingFilter {
> >  ...
> >  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
> >      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
> >        ...
> >        for(String metatag : metatags) {
> >          doc.add(my_internet_pollution, meta_dreck);
> >        }
> >        ...
> >      }
> >  ...
> > }
> >
> > The only minor thing I seem to be dealing with is pulling out a specific
> > meta tag, from the getMetaData(), as it returns a Writable object--which was
> > cleverly designed to fill me with a bottomless, impotent rage.  I like that
> > I can typecast it to a Text object, but not String.  Nor is there a
> > toString() method, as I can't imagine such a thing having any use.
> >
>
> cast to a Text then call toString on the Text instance
>
> > Question, if you've read this far: If I de-tard my code, is it something
> > worthwhile-enough to be submitted into the nebulous depths of Apache Nutch?
> >
>
> The interesting part of your problem is how to propagate the metadata to all
> the pages of a host. The best way would be to keep a separate list of hosts
> and metadata, then apply them to the whole crawldb just before indexing.
> Let's call that a Domain or HostFeatureApplier. That would be a nice
> contribution to Nutch and could be reused in 2.0, when we have a separate
> table for storing host or domain info.
>
> HTH
>
> J.
>
>
>
> >
> > Thank you,
> > Scott Gonyea
> >
> > On Jul 12, 2010, at 10:50 AM, Scott Gonyea wrote:
> >
> > > More questions are below your answers. (Thank you!)
> > >
> > > On Mon, Jul 12, 2010 at 1:34 AM, Julien Nioche <
> > > [email protected]> wrote:
> > > Hi
> > >
> > > > I want to crawl URLs and Index them (I'm using Nutch+Solr+Ruby/Rails) and,
> > > > when a search term is matched, I'd like to have arbitrary metadata be
> > > > stored/associated with those results.  IE, suppose I crawl blogs and want to
> > > > search for occurrences of "Android."  When I search the index that was
> > > > collected, I'd like to have the parent company's name (for example) be
> > > > returned with the URL whose index matched that query.
> > > >
> > >
> > > Ok, so it would be a matter of having a field for storing this in SOLR.
> > >
> > > I imagine this would be the easy part--I threw it into the schema.xml.
> > > The rtfm'ing would be from the Nutch side of things, given everything I've
> > > rtf&m-'d (in no particular order).
> > >
> > > > __What I've Found/Done So Far__:
> > > >
> > > > NUTCH-655 Injecting Crawl metadata (jnioche)
> > > > NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
> > > > NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call
> > > > scfilters.initialScore on newly created URL (jnioche)
> > > >
> > > > Apparently Julien Nioche is the god of all things metadata and I'd love to
> > > > get a few minutes of his (or anyone else's) time, so that I can better
> > > > understand how to fully take advantage of the above changes.
> > > >
> > >
> > > not really but I can't resist a bit of flattery. Here is my 2-minute answer
> > >
> > >
> > > >
> > > > FILE>> nutch/urls/seed.txt:
> > > > http://slashdot.org/    blawg_corp=Geeknet
> > > > http://geek.com/        blawg_corp=Geeknet
> > > > http://engadget.com/    blawg_corp=Weblogs
> > > > http://gizmodo.com/     blawg_corp=Gawker
> > > >
> > >
> > > I suppose that you want to propagate this feature to the subpages of the
> > > sites above?
> > >
> > > Yes, please.  Basically, anything gathered within the given crawl should
> > > have the "blawg_corp" stapled to it, that was originally provided with the
> > > crawl URLs.
> > >
> > > >
> > > > FILE>> nutch/conf/nutch-site.xml (Snippet)
> > > > <configuration>
> > > >  <property>
> > > >   <name>db.parsemeta.to.crawldb</name>
> > > >   <value> blawg_corp </value>
> > > >   <description>Comma-separated list of parse metadata keys to transfer to
> > > > the crawldb (NUTCH-779).
> > > >    Assuming for instance that the languageidentifier plugin is enabled,
> > > > setting the value to 'lang'
> > > >    will copy both the key 'lang' and its value to the corresponding entry
> > > > in the crawldb.
> > > >   </description>
> > > >  </property>
> > > > </configuration>
> > > >
> > >
> > > This one is about sending metadata back from the parsing to the crawldb.
> > > Since you've injected the metadata it is already in the crawldb. Can't see
> > > why you'd need that unless you do something special during the parsing?
> > >
> > > Gotcha, then no--I don't need it.  That snuck in there as my "why isn't
> > > this working?!" turned into desperation+googling.
> > >
> > > >
> > > > So, clearly it's pulling in the nonsense I'm feeding it, but when querying
> > > > inside Nutch (or anywhere else), it just does not get exposed
> > >
> > >
> > > If you want it in SOLR you need to (in reverse chronological order) :
> > > a) define the field in the solr schema
> > > b) create an indexingfilter that will populate this field (e.g. from the
> > > parse or crawl metadata )
> > > c) if necessary - propagate the tag to all the pages of a given host
> > >
> > >
> > > a) That means just sticking a field in Solr's schema.xml, correct?  IE,
> > >     <field name="blawg_corp" type="string" stored="true" indexed="true"/>
> > >
> > > b) To create an IndexingFilter, is that along the following lines:
> > >
> > > http://wiki.apache.org/nutch/WritingPluginExample
> > > http://wiki.apache.org/nutch/HowToMakeCustomSearch
> > >
> > > http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/
> > >
> > > c) So, the data would not already propagate to the pages that were
> > > crawled, with the metadata?
> > >
> > > I really appreciate any help or references you can give me on this.  I've
> > > been dealing with Nutch for about 4 days, so I apologize for my ignorance.
> > > There's seemingly a lot of depth to Nutch that the documentation hasn't
> > > exactly kept pace with.
> > >
> > > The end result is that when I run a query in Solr (or even Nutch), I'd
> > > like to have the "blawg_corp" be returned with the given set of query
> > > results.  Any guidelines/references you can point me to, to make that
> > > happen, are very much appreciated.
> > >
> > > Thank you,
> > > Scott Gonyea
> >
> >
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
