I thought it seemed too good to be true. I understood the part about this 
picking up metadata from tags within the actual documents; that seems like a 
feature a lot of people would need. But I thought the whole point of the 
tab-delimited tags in my URLs file was that I could also inject tags that 
aren't in the source documents. That doesn't seem like it would be a standard 
feature, but it's what I need. Most of the pages I need to index aren't owned 
by us, and I won't always be able to get other sites to add an extra meta tag 
to their pages.
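
As a minimal sketch of the seed-line format described above (a URL followed by tab-separated key=value pairs), the split into URL and per-URL metadata could look like the Python below. This is not Nutch's injector code; the function name parse_seed_line is hypothetical, and the sketch assumes the separator is a real tab character rather than the two characters "\t".

```python
def parse_seed_line(line):
    """Split one seed-list line into (url, metadata dict).

    Assumption: fields are separated by real tab characters and every
    field after the URL has the form key=value.
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    url = fields[0]
    metadata = {}
    for field in fields[1:]:
        if not field:
            continue  # tolerate repeated tabs
        key, sep, value = field.partition("=")  # split at the first '='
        if sep:  # skip fields that contain no '='
            metadata[key] = value
    return url, metadata


seed = (
    "http://www.aip.org/history/ead/20110369.xml"
    "\thumanURL=http://www.aip.org/history/ead/20110369.html"
)
url, meta = parse_seed_line(seed)
print(url)
print(meta["humanURL"])
```

The metadata ends up attached to one specific URL, independent of anything in the fetched page, which is the behaviour needed for pages whose markup cannot be changed.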

It looks like I might need to write my own plugin, which is a little daunting 
for me. Can anyone think of an existing plugin that injects metadata into 
indexed documents after the fact? It would be nice to have some existing code I 
could examine and learn from.

Thanks,
Chip

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com] 
Sent: Monday, September 19, 2011 4:56 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

In addition, it looks like you are misinterpreting how the urlmeta plugin
works, Chip. It is designed to pick up additional meta tags, each with a name
and a content value, e.g.

<meta name="humanURL" content="blahblahblah">

The plugin then gets this data, as well as any additional values added in the
urlmeta.tags property within nutch-site.xml, and adds it to the index, which
can then be queried.
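
To illustrate the idea of picking up name/content pairs from meta tags, here is a minimal Python sketch. It is not the urlmeta plugin's code (the plugin is part of Nutch and written in Java); the class name MetaTagCollector and the wanted_names parameter are hypothetical and only stand in for a configured list of tag names.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs for a chosen set of names."""

    def __init__(self, wanted_names):
        super().__init__()
        # Match tag names case-insensitively (an assumption of this sketch).
        self.wanted = {n.lower() for n in wanted_names}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None and name.lower() in self.wanted:
            self.found[name] = content


collector = MetaTagCollector(["humanURL"])
collector.feed(
    '<html><head><meta name="humanURL" content="blahblahblah"></head></html>'
)
print(collector.found)
```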

Does this make sense?

On Mon, Sep 19, 2011 at 9:10 PM, Julien Nioche < lists.digitalpeb...@gmail.com> 
wrote:

> Hi
>
> Since the info is available thanks to the injection, you can use the
> url-meta plugin as-is and won't need a custom version. See
> https://issues.apache.org/jira/browse/NUTCH-855
>
> Apart from that, do not modify the contents of \runtime\local\conf\
> before recompiling with Ant, as they will be overwritten. Either
> modify $NUTCH/conf/nutch-site.xml or recompile THEN modify.
>
> As Lewis suggested, check the logs to see whether the plugin is activated, etc.
>
> J.
>
>
> On 19 September 2011 21:03, Chip Calhoun <ccalh...@aip.org> wrote:
>
> > Hi Lewis,
> >
> > My probably wrong understanding was that I'm supposed to add the
> > tags for my new field to my list of seed URLs. So if I have a seed
> > URL followed by
> > "        \t humanURL=http://www.aip.org/history/ead/20110369.html",
> > I get a new field called "humanURL" which is populated with the string
> > I've specified for that specific URL. I may just be greatly
> > misunderstanding how this plugin works.
> >
> > I've checked my Nutch logs now and it looks like nothing happened.
> > The new field does at least show up in the Solr admin UI's schema, but
> > clearly my problem is on the Nutch end of things.
> >
> > -----Original Message-----
> > From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> > Sent: Monday, September 19, 2011 3:34 PM
> > To: user@nutch.apache.org
> > Subject: Re: Machine readable vs. human readable URLs.
> >
> > Hi Chip,
> >
> > There is no need to run "ant war"; there is no war target in the
> > build.xml file of Nutch >= 1.3.
> >
> > Can you explain more about adding 'the tags' to %NUTCH_HOME% etc.? Do
> > you mean you've added your seed URLs?
> >
> > Have you looked at your log output to see whether the urlmeta
> > plugin is loaded and used when fetching?
> >
> > You should be able to get info on your schema, fields, etc. within the
> > Solr admin UI.
> >
> > On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <ccalh...@aip.org> wrote:
> >
> > > Hi Julien,
> > >
> > > Thanks, that's encouraging. I'm trying to make this work, and I'm 
> > > definitely missing something. I hope I'm not too far off the mark.
> > > I've started with the instructions at 
> > > http://wiki.apache.org/nutch/WritingPluginExample . If I 
> > > understand this properly, the changes I needed to make were the following:
> > >
> > > In Nutch:
> > > Paste the prescribed block of code into 
> > > %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch 
> > > to look for and run the urlmeta plugin.
> > > In %NUTCH_HOME%, run "ant war".
> > > Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in
> > > this file now looks like:
> > > "http://www.aip.org/history/ead/20110369.xml        \t humanURL=http://www.aip.org/history/ead/20110369.html"
> > >
> > > In Solr:
> > > Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . The 
> > > new line consists of: " <field name="humanURL" type="string" stored="true"
> > > indexed="false"/>"
> > >
> > > I've redone the indexing, and my new field still doesn't show up 
> > > in the search results. Can you tell where I'm going wrong?
> > >
> > > Thanks,
> > > Chip
> > >
> > > -----Original Message-----
> > > From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> > > Sent: Friday, September 16, 2011 4:37 AM
> > > To: user@nutch.apache.org
> > > Subject: Re: Machine readable vs. human readable URLs.
> > >
> > > Hi Chip,
> > >
> > > It should simply be a matter of creating a custom field with an
> > > IndexingFilter; you can then use it in any way you want on the
> > > Solr side.
> > >
> > > Julien
> > >
> > > On 15 September 2011 21:50, Chip Calhoun <ccalh...@aip.org> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > We'd like to use Nutch and Solr to replace an existing Verity 
> > > > search that's become a bit long in the tooth. In our Verity 
> > > > search, we have a hack which allows each document to have a 
> > > > machine-readable URL which is indexed (generally an xml 
> > > > document), and a human-readable URL which we actually send users 
> > > > to. Has anyone done the same with
> > Nutch and Solr?
> > > >
> > > > Thanks,
> > > > Chip
> > > >
> > >
> > >
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > >
> >
> >
> >
> > --
> > *Lewis*
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



--
*Lewis*
