Re: Storing Metadata with Crawled Sites

Scott Gonyea Wed, 14 Jul 2010 18:57:27 -0700

Ok, I've created a patch/plugin:

https://issues.apache.org/jira/browse/NUTCH-855


Now I really need a beer. Thanks for your assistance, Julien. I appreciate
it.

Scott

On Wed, Jul 14, 2010 at 1:28 AM, Julien Nioche <
[email protected]> wrote:

> A simpler option would be to use a modified version of
> https://issues.apache.org/jira/browse/NUTCH-830 to transfer the features
> you're interested in to the outlinks. That's also a good way of keeping the
> crawl within a limited number of hosts / domains
>
> On 13 July 2010 17:33, Scott Gonyea <[email protected]> wrote:
>
> > Awesome, thank you.  I saw what you meant; I had cast it to a Text and
> > gotten that far... But I then had to "new Text (a_string)" when getting it
> > out of the getMetaData().get( ... ) crap.
> >
> > Do you have a place where you suggest I look, to implement that kind of a
> > feature?  I'm perfectly happy to do it, but any help that I can be given--on
> > where to focus my efforts--would be a greatly appreciated time saver.
> >
> > I've parsed out the meta tags for the base URL--and saw what you meant about
> > propagation.  It'd be great on my end to have this feature, and be great for
> > my long lost twin who's due to run into Nutch any minute now.
> >
> > Scott
> >
> > On Tue, Jul 13, 2010 at 2:19 AM, Julien Nioche <
> > [email protected]> wrote:
> >
> > > >
> > > > public class MetaIndexingFilter implements IndexingFilter {
> > > >  ...
> > > >  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
> > > >      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
> > > >        ...
> > > >        for(String metatag : metatags) {
> > > >          doc.add(my_internet_pollution, meta_dreck);
> > > >        }
> > > >        ...
> > > >      }
> > > >  ...
> > > > }
> > > >
> > > > The only minor thing I seem to be dealing with is pulling out a specific
> > > > meta tag, from the getMetaData(), as it returns a Writable object--which was
> > > > cleverly designed to fill me with a bottom-less, impotent rage.  I like that
> > > > I can typecast it to a Text object, but not String.  Nor is there a
> > > > toString() method, as I can't imagine such a thing having any use.
> > > >
> > >
> > > cast to a Text then call toString on the Text instance
> > >
> > > > Question, if you've read this far: If I clean up my code, is it something
> > > > worthwhile-enough to be submitted into the nebulous depths of Apache Nutch?
> > > >
> > >
> > > The interesting part of your problem is how to propagate the metadata to all
> > > the pages of a host. The best way would be to keep a separate list of hosts
> > > and metadata then apply them to the whole crawldb just before indexing.
> > > Let's call that a Domain or HostFeatureApplier. That would be a nice
> > > contribution to Nutch and could be reused in 2.0 when we have a separate
> > > table for storing host or domain info
> > >
> > > HTH
> > >
> > > J.
> > >
> > >
> > >
> > > >
> > > > Thank you,
> > > > Scott Gonyea
> > > >
> > > > On Jul 12, 2010, at 10:50 AM, Scott Gonyea wrote:
> > > >
> > > > > More questions are below your answers. (Thank you!)
> > > > >
> > > > > On Mon, Jul 12, 2010 at 1:34 AM, Julien Nioche <
> > > > [email protected]> wrote:
> > > > > Hi
> > > > >
> > > > > > I want to crawl URLs and Index them (I'm using Nutch+Solr+Ruby/Rails) and,
> > > > > > when a search term is matched, I'd like to have arbitrary metadata be
> > > > > > stored/associated with those results.  IE, suppose I crawl blogs and want to
> > > > > > search for occurrences of "Android."  When I search the index that was
> > > > > > collected, I'd like to have the parent company's name (for example) be
> > > > > > returned with the URL whose index matched that query.
> > > > > >
> > > > >
> > > > > Ok, so it would be a matter of having a field for storing this in SOLR.
> > > > >
> > > > > I imagine this would be the easy part--I threw it into the schema.xml.
> > > > > The rtfm'ing would be from the Nutch side of things, given everything
> > > > > I've rtf&m-'d (in no particular order).
> > > > >
> > > > > > __What I've Found/Done So Far__:
> > > > > >
> > > > > > NUTCH-655 Injecting Crawl metadata (jnioche)
> > > > > > NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
> > > > > > NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher +
> > > > > > call scfilters.initialScore on newly created URL (jnioche)
> > > > > >
> > > > > > Apparently Julien Nioche is the god of all things metadata and I'd love to
> > > > > > get a few minutes of his (or anyone else's) time, so that I can better
> > > > > > understand how to fully take advantage of the above changes.
> > > > > >
> > > > >
> > > > > Not really, but I can't resist a bit of flattery. Here is my 2-minute
> > > > > answer.
> > > > >
> > > > >
> > > > > >
> > > > > > FILE>> nutch/urls/seed.txt:
> > > > > > http://slashdot.org/    blawg_corp=Geeknet
> > > > > > http://geek.com/        blawg_corp=Geeknet
> > > > > > http://engadget.com/    blawg_corp=Weblogs
> > > > > > http://gizmodo.com/     blawg_corp=Gawker
> > > > > >
> > > > >
> > > > > I suppose that you want to propagate this feature to the subpages of the
> > > > > sites above?
> > > > >
> > > > > Yes, please.  Basically, anything gathered within the given crawl should
> > > > > have the "blawg_corp" stapled to it, that was originally provided with the
> > > > > crawl URLs.
> > > > >
> > > > > >
> > > > > > FILE>> nutch/conf/nutch-site.xml (Snippet)
> > > > > > <configuration>
> > > > > >  <property>
> > > > > >   <name>db.parsemeta.to.crawldb</name>
> > > > > >   <value> blawg_corp </value>
> > > > > >   <description>Comma-separated list of parse metadata keys to transfer to
> > > > > >    the crawldb (NUTCH-779).
> > > > > >    Assuming for instance that the languageidentifier plugin is enabled,
> > > > > >    setting the value to 'lang'
> > > > > >    will copy both the key 'lang' and its value to the corresponding entry
> > > > > >    in the crawldb.
> > > > > >   </description>
> > > > > >  </property>
> > > > > > </configuration>
> > > > > >
> > > > >
> > > > > This one is about sending metadata back from the parsing to the crawldb.
> > > > > Since you've injected the metadata it is already in the crawldb. Can't see
> > > > > why you'd need that unless you do something special during the parsing?
> > > > >
> > > > > Gotcha, then no- I don't need it.  That snuck in there as my "why isn't
> > > > > this working?!" turned into desperation+googling.
> > > > >
> > > > > >
> > > > > > So, clearly it's pulling in the nonsense I'm feeding it, but when querying
> > > > > > inside Nutch (or anywhere else), it just does not get exposed
> > > > >
> > > > >
> > > > > If you want it in SOLR you need to (in reverse chronological order):
> > > > > a) define the field in the solr schema
> > > > > b) create an indexingfilter that will populate this field (e.g. from the
> > > > > parse or crawl metadata)
> > > > > c) if necessary - propagate the tag to all the pages of a given host
> > > > >
> > > > >
> > > > > a) That means just sticking a field in Solr's schema.xml, correct?  IE,
> > > > >     <field name="blawg_corp" type="string" stored="true" indexed="true"/>
> > > > >
> > > > > b) To create an IndexingFilter, is that along the following lines:
> > > > >
> > > > > http://wiki.apache.org/nutch/WritingPluginExample
> > > > > http://wiki.apache.org/nutch/HowToMakeCustomSearch
> > > > > http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/
> > > > >
> > > > > c) So, the data would not already propagate to the pages that were
> > > > > crawled, with the metadata?
> > > > >
> > > > > I really appreciate any help or references you can give me on this.  I've
> > > > > been dealing with Nutch for about 4 days, so I apologize for my ignorance.
> > > > > There's seemingly a lot of depth to Nutch that the documentation hasn't
> > > > > exactly kept pace with.
> > > > >
> > > > > The end result is that when I run a query in Solr (or even Nutch), I'd
> > > > > like to have the "blawg_corp" be returned with the given set of query
> > > > > results.  Any guidelines/references you can point me to, to make that
> > > > > happen, are very much appreciated.
> > > > >
> > > > > Thank you,
> > > > > Scott Gonyea
> > > >
> > > >
> > >
> > >
> > > --
> > > DigitalPebble Ltd
> > >
> > > Open Source Solutions for Text Engineering
> > > http://www.digitalpebble.com
> > >
> >
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
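
The "HostFeatureApplier" idea from the thread (keep a separate list of hosts
and their metadata, then apply it to every page of a host just before
indexing) might look roughly like the sketch below. It is plain Java and does
not use any Nutch API: the class HostFeatureSketch and the methods parseSeeds,
hostOf and featuresFor are hypothetical names made up for illustration. It
assumes seed lines of the "URL key=value" form quoted in the thread, splits on
any whitespace (so values containing spaces are not handled), and matches
hosts exactly, so "www.slashdot.org" would not inherit from "slashdot.org".

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch.
public class HostFeatureSketch {

    // Builds a host -> metadata map from seed lines like
    // "http://slashdot.org/    blawg_corp=Geeknet".
    public static Map<String, Map<String, String>> parseSeeds(List<String> seedLines) {
        Map<String, Map<String, String>> byHost = new LinkedHashMap<>();
        for (String line : seedLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts[0].isEmpty()) {
                continue; // blank line
            }
            String host = hostOf(parts[0]);
            if (host == null) {
                continue; // not a usable URL
            }
            Map<String, String> meta = byHost.computeIfAbsent(host, h -> new LinkedHashMap<>());
            for (int i = 1; i < parts.length; i++) {
                int eq = parts[i].indexOf('=');
                if (eq > 0) {
                    meta.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
                }
            }
        }
        return byHost;
    }

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    public static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Looks up the metadata that a crawled page inherits from its host.
    public static Map<String, String> featuresFor(String pageUrl,
            Map<String, Map<String, String>> byHost) {
        String host = hostOf(pageUrl);
        Map<String, String> meta = (host == null) ? null : byHost.get(host);
        return meta == null ? Collections.<String, String>emptyMap() : meta;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> hosts = parseSeeds(Arrays.asList(
                "http://slashdot.org/    blawg_corp=Geeknet",
                "http://gizmodo.com/     blawg_corp=Gawker"));
        // A subpage picks up the value attached to its host's seed URL.
        System.out.println(featuresFor("http://slashdot.org/story/123", hosts));
        // A host that was never seeded gets no metadata.
        System.out.println(featuresFor("http://example.org/", hosts));
    }
}
```

In a real plugin the lookup would presumably happen inside an IndexingFilter,
with the inherited values added to the NutchDocument; that wiring is left out
here because it depends on the Nutch version in use.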
