Re: Storing Metadata with Crawled Sites

Scott Gonyea Tue, 20 Jul 2010 13:30:54 -0700

By the way, thanks for the kudos, Alex.

I've taken the time to revise my patch, once more:


https://issues.apache.org/jira/browse/NUTCH-855

I renamed it to, simply, "urlmeta", which tucks away an IndexingFilter and a
ScoringFilter.  The ScoringFilter will propagate meta tags, injected alongside
URLs (see NUTCH-655), to a URL's (out?-)links.  The IndexingFilter will then
inject these meta tags into the NutchDocument, which is then passed along to
the Indexer.
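
A rough sketch of that flow, in plain Java with no Nutch dependencies, may help
picture it. It is not the code from the patch: the class name UrlMetaSketch and
the methods parseSeedMeta, selectTags and propagateToOutlinks are hypothetical,
and it assumes seed lines of the form "url<whitespace>key=value" with no spaces
inside values. The real plugin works on CrawlDatum and NutchDocument objects
instead of maps.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the urlmeta idea; none of these names are Nutch APIs. */
public class UrlMetaSketch {

    /** Parses the key=value pairs that follow the URL on one seed line. */
    public static Map<String, String> parseSeedMeta(String seedLine) {
        Map<String, String> meta = new LinkedHashMap<>();
        String[] parts = seedLine.trim().split("\\s+");
        // parts[0] is the URL itself; everything after it is metadata.
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                meta.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
        }
        return meta;
    }

    /** Keeps only the configured tags (the "indexing filter" step in this model). */
    public static Map<String, String> selectTags(Map<String, String> meta, List<String> tags) {
        Map<String, String> selected = new LinkedHashMap<>();
        for (String tag : tags) {
            String value = meta.get(tag);
            if (value != null) {
                selected.put(tag, value);
            }
        }
        return selected;
    }

    /** Copies the configured tags from a page to each outlink (the "scoring filter" step). */
    public static Map<String, Map<String, String>> propagateToOutlinks(
            Map<String, String> parentMeta, List<String> outlinks, List<String> tags) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String link : outlinks) {
            result.put(link, selectTags(parentMeta, tags));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("blawg_corp");
        Map<String, String> seedMeta = parseSeedMeta("http://slashdot.org/\tblawg_corp=Geeknet");
        // Hypothetical outlinks discovered while parsing the seed page.
        List<String> outlinks = Arrays.asList(
                "http://slashdot.org/story/1", "http://slashdot.org/story/2");
        Map<String, Map<String, String>> outlinkMeta = propagateToOutlinks(seedMeta, outlinks, tags);
        for (Map.Entry<String, Map<String, String>> e : outlinkMeta.entrySet()) {
            // Fields that would be added to the indexed document for this page.
            System.out.println(e.getKey() + " -> " + selectTags(e.getValue(), tags));
        }
    }
}
```

The point of the split is that propagation happens while the crawl discovers
outlinks, and the indexing step only reads whatever metadata already sits on
each page.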

See the patch notes for guidance on how to use this. In the near future,
I'll allocate some time away from "googling my own name" to "writing Nutch
wiki articles with my name on it," thereby completing the circle and
polluting Google with my ego.

Scott

On Wed, Jul 14, 2010 at 6:56 PM, Scott Gonyea <[email protected]> wrote:

> Ok, I've created a patch/plugin:
>
> https://issues.apache.org/jira/browse/NUTCH-855
>
> Now I really need a beer. Thanks for your assistance, Julien. I appreciate
> it.
>
> Scott
>
> On Wed, Jul 14, 2010 at 1:28 AM, Julien Nioche <
> [email protected]> wrote:
>
>> A simpler option would be to use a modified version of
>> https://issues.apache.org/jira/browse/NUTCH-830 to transfer the features
>> you're interested in to the outlinks. That's also a good way of keeping the
>> crawl within a limited number of hosts / domains
>>
>> On 13 July 2010 17:33, Scott Gonyea <[email protected]> wrote:
>>
>> > Awesome, thank you.  I saw what you meant; I had cast it to a Text and
>> > gotten that far... But I then had to "new Text (a_string)" when getting it
>> > out of the getMetaData().get( ... ) crap.
>> >
>> > Do you have a place where you suggest I look, to implement that kind of a
>> > feature?  I'm perfectly happy to do it, but any help that I can be given--on
>> > where to focus my efforts--would be a greatly appreciated time saver.
>> >
>> > I've parsed out the meta tags for the base URL--and saw what you meant about
>> > propagation.  It'd be great on my end to have this feature, and be great for
>> > my long lost twin who's due to run into Nutch any minute now.
>> >
>> > Scott
>> >
>> > On Tue, Jul 13, 2010 at 2:19 AM, Julien Nioche <
>> > [email protected]> wrote:
>> >
>> > > >
>> > > > public class MetaIndexingFilter implements IndexingFilter {
>> > > >  ...
>> > > >  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>> > > >      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>> > > >        ...
>> > > >        for(String metatag : metatags) {
>> > > >          doc.add(my_internet_pollution, meta_dreck);
>> > > >        }
>> > > >        ...
>> > > >      }
>> > > >  ...
>> > > > }
>> > > >
>> > > > The only minor thing I seem to be dealing with is pulling out a specific
>> > > > meta tag, from the getMetaData(), as it returns a Writable object--which was
>> > > > cleverly designed to fill me with a bottomless, impotent rage.  I like that
>> > > > I can typecast it to a Text object, but not String.  Nor is there a
>> > > > toString() method, as I can't imagine such a thing having any use.
>> > > >
>> > >
>> > > cast to a Text then call toString on the Text instance
>> > >
>> > > > Question, if you've read this far: If I clean up my code, is it something
>> > > > worthwhile-enough to be submitted into the nebulous depths of Apache Nutch?
>> > > >
>> > >
>> > > The interesting part of your problem is how to propagate the metadata to all
>> > > the pages of a host. The best way would be to keep a separate list of hosts
>> > > and metadata then apply them to the whole crawldb just before indexing.
>> > > Let's call that a Domain or HostFeatureApplier. That would be a nice
>> > > contribution to Nutch and could be reused in 2.0 when we have a separate
>> > > table for storing host or domain info
>> > >
>> > > HTH
>> > >
>> > > J.
>> > >
>> > >
>> > >
>> > > >
>> > > > Thank you,
>> > > > Scott Gonyea
>> > > >
>> > > > On Jul 12, 2010, at 10:50 AM, Scott Gonyea wrote:
>> > > >
>> > > > > More questions are below your answers. (Thank you!)
>> > > > >
>> > > > > On Mon, Jul 12, 2010 at 1:34 AM, Julien Nioche <
>> > > > [email protected]> wrote:
>> > > > > Hi
>> > > > >
>> > > > > > I want to crawl URLs and index them (I'm using Nutch+Solr+Ruby/Rails) and,
>> > > > > > when a search term is matched, I'd like to have arbitrary metadata be
>> > > > > > stored/associated with those results.  E.g., suppose I crawl blogs and
>> > > > > > want to search for occurrences of "Android."  When I search the index that
>> > > > > > was collected, I'd like to have the parent company's name (for example) be
>> > > > > > returned with the URL whose index matched that query.
>> > > > > >
>> > > > >
>> > > > > Ok, so it would be a matter of having a field for storing this in SOLR.
>> > > > >
>> > > > > I imagine this would be the easy part--I threw it into the schema.xml.
>> > > > > The rtfm'ing would be from the Nutch side of things, given everything I've
>> > > > > rtf&m-'d (in no particular order).
>> > > > >
>> > > > > > __What I've Found/Done So Far__:
>> > > > > >
>> > > > > > NUTCH-655 Injecting Crawl metadata (jnioche)
>> > > > > > NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
>> > > > > > NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher +
>> > > > > > call scfilters.initialScore on newly created URL (jnioche)
>> > > > > >
>> > > > > > Apparently Julien Nioche is the god of all things metadata and I'd love to
>> > > > > > get a few minutes of his (or anyone else's) time, so that I can better
>> > > > > > understand how to fully take advantage of the above changes.
>> > > > > >
>> > > > >
>> > > > > Not really, but I can't resist a bit of flattery. Here is my 2-minute answer.
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > FILE>> nutch/urls/seed.txt:
>> > > > > > http://slashdot.org/    blawg_corp=Geeknet
>> > > > > > http://geek.com/        blawg_corp=Geeknet
>> > > > > > http://engadget.com/    blawg_corp=Weblogs
>> > > > > > http://gizmodo.com/     blawg_corp=Gawker
>> > > > > >
>> > > > >
>> > > > > I suppose that you want to propagate this feature to the subpages of the
>> > > > > sites above?
>> > > > >
>> > > > > Yes, please.  Basically, anything gathered within the given crawl should
>> > > > > have the "blawg_corp" stapled to it, that was originally provided with the
>> > > > > crawl URLs.
>> > > > >
>> > > > > >
>> > > > > > FILE>> nutch/conf/nutch-site.xml (Snippet)
>> > > > > > <configuration>
>> > > > > >  <property>
>> > > > > >   <name>db.parsemeta.to.crawldb</name>
>> > > > > >   <value> blawg_corp </value>
>> > > > > >   <description>Comma-separated list of parse metadata keys to transfer to
>> > > > > >    the crawldb (NUTCH-779).
>> > > > > >    Assuming for instance that the languageidentifier plugin is enabled,
>> > > > > >    setting the value to 'lang'
>> > > > > >    will copy both the key 'lang' and its value to the corresponding entry
>> > > > > >    in the crawldb.
>> > > > > >   </description>
>> > > > > >  </property>
>> > > > > > </configuration>
>> > > > > >
>> > > > >
>> > > > > This one is about sending metadata back from the parsing to the crawldb.
>> > > > > Since you've injected the metadata it is already in the crawldb. Can't see
>> > > > > why you'd need that unless you do something special during the parsing?
>> > > > >
>> > > > > Gotcha, then no, I don't need it.  That snuck in there as my "why isn't
>> > > > > this working?!" turned into desperation+googling.
>> > > > >
>> > > > > >
>> > > > > > So, clearly it's pulling in the nonsense I'm feeding it, but when querying
>> > > > > > inside Nutch (or anywhere else), it just does not get exposed
>> > > > >
>> > > > >
>> > > > > If you want it in SOLR you need to (in reverse chronological order):
>> > > > > a) define the field in the Solr schema
>> > > > > b) create an IndexingFilter that will populate this field (e.g. from
>> > > > >    the parse or crawl metadata)
>> > > > > c) if necessary, propagate the tag to all the pages of a given host
>> > > > >
>> > > > >
>> > > > > a) That means just sticking a field in Solr's schema.xml, correct?  IE,
>> > > > >     <field name="blawg_corp" type="string" stored="true" indexed="true"/>
>> > > > >
>> > > > > b) To create an IndexingFilter, is that along the following lines:
>> > > > >
>> > > > > http://wiki.apache.org/nutch/WritingPluginExample
>> > > > > http://wiki.apache.org/nutch/HowToMakeCustomSearch
>> > > > > http://puretech.paawak.com/2009/04/29/how-to-make-custom-search-with-nutchv-10/
>> > > > >
>> > > > > c) So, the data would not already propagate to the pages that were
>> > > > > crawled, with the metadata?
>> > > > >
>> > > > > I really appreciate any help or references you can give me on this.  I've
>> > > > > been dealing with Nutch for about 4 days, so I apologize for my ignorance.
>> > > > > There's seemingly a lot of depth to Nutch that the documentation hasn't
>> > > > > exactly kept pace with.
>> > > > >
>> > > > > The end result is that when I run a query in Solr (or even Nutch), I'd
>> > > > > like to have the "blawg_corp" be returned with the given set of query
>> > > > > results.  Any guidelines/references you can point me to, to make that
>> > > > > happen, are very much appreciated.
>> > > > >
>> > > > > Thank you,
>> > > > > Scott Gonyea
>> > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > DigitalPebble Ltd
>> > >
>> > > Open Source Solutions for Text Engineering
>> > > http://www.digitalpebble.com
>> > >
>> >
>>
>>
>>
>> --
>> DigitalPebble Ltd
>>
>> Open Source Solutions for Text Engineering
>> http://www.digitalpebble.com
>>
>
>
