Hi Julien,

Thanks for clarifying this! I've got it working now. Instead of seeding with a 
proper tab-delimited file created in Excel, I had been wrong-headedly seeding 
it with a text file that just had tabs in it. They look the same, but it makes 
a difference. Thanks!
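
For anyone else who hits this, here is a minimal Python sketch of one way to generate the seed file so that the separator is guaranteed to be a real tab character. It assumes the seed line format discussed in this thread (the URL, a tab, then key=value metadata). The function names and the "seed.txt" file name are hypothetical, and this is only an illustration, not how I actually produced my file (I used Excel).

```python
def build_seed_line(url, metadata):
    """Join a URL and its key=value metadata with real tab characters."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


def write_seed_file(path, entries):
    """Write one seed line per (url, metadata) pair to a plain-text file."""
    # newline="\n" keeps line endings the same on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for url, metadata in entries:
            handle.write(build_seed_line(url, metadata) + "\n")


# Example entry taken from this thread: the machine-readable URL is the seed,
# and humanURL carries the human-readable address as metadata.
example_line = build_seed_line(
    "http://www.aip.org/history/ead/20110369.xml",
    {"humanURL": "http://www.aip.org/history/ead/20110369.html"},
)
```

Calling write_seed_file("seed.txt", [(url, metadata)]) with the same values writes that single line to disk, and print(repr(example_line)) makes the separator visible so it can be checked by eye.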

Chip

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] 
Sent: Monday, September 19, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

> In addition, it looks like you are misinterpreting how the urlmeta
> plugin works, Chip. It is designed to pick up additional meta tags with
> name and content values respectively, e.g.
>
> <meta name="humanURL" content="blahblahblah">
>

Sorry Lewis, but it does not do that at all. See the link I gave earlier for a
description of urlmeta. I agree that the name is misleading: it does not extract
the content from the page but simply uses the crawldb metadata.


>
> The plugin then gets this data as well as any additional values added
> in the urlmeta.tags property within nutch-site.xml and adds this to the
> index, which can then be queried.
>
> Does this make sense?
>
> On Mon, Sep 19, 2011 at 9:10 PM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
>
> > Hi
> >
> > Since the info is available thanks to the injection you can use the 
> > url-meta plugin as-is and won't need to have a custom version.  See
> > https://issues.apache.org/jira/browse/NUTCH-855
> >
> > Apart from that, do not modify the content of \runtime\local\conf\
> > before recompiling with Ant, as this will be overwritten. Either
> > modify $NUTCH/conf/nutch-site.xml or recompile THEN modify.
> >
> > As Lewis suggested, check the logs and see if the plugin is activated, etc.
> >
> > J.
> >
> >
> > On 19 September 2011 21:03, Chip Calhoun <ccalh...@aip.org> wrote:
> >
> > > Hi Lewis,
> > >
> > > My probably wrong understanding was that I'm supposed to add the tags
> > > for my new field to my list of seed URLs. So if I have a seed URL
> > > followed by
> > > "        \t humanURL=http://www.aip.org/history/ead/20110369.html",
> > > I get a new field called "humanURL" which is populated with the string
> > > I've specified for that specific URL. I may just be greatly
> > > misunderstanding how this plugin works.
> > >
> > > I've checked my Nutch logs now and it looks like nothing happened.
> > > The new field does at least show up in the Solr admin UI's schema, but
> > > clearly my problem is on the Nutch end of things.
> > >
> > > -----Original Message-----
> > > From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> > > Sent: Monday, September 19, 2011 3:34 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Machine readable vs. human readable URLs.
> > >
> > > Hi Chip,
> > >
> > > There is no need to run "ant war"; there is no war target in the
> > > build.xml file for Nutch >= 1.3.
> > >
> > > Can you explain more about adding 'the tags to %NUTCH_HOME%' etc.? Do you
> > > mean you've added your seed URLs?
> > >
> > > Have you had a look at any of your log output as to whether the 
> > > urlmeta plugin is loaded and used when fetching?
> > >
> > > You should be able to get info on your schema, fields, etc. within the
> > > Solr admin UI.
> > >
> > > On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <ccalh...@aip.org> wrote:
> > >
> > > > Hi Julien,
> > > >
> > > > Thanks, that's encouraging. I'm trying to make this work, and 
> > > > I'm definitely missing something. I hope I'm not too far off the mark.
> > > > I've started with the instructions at 
> > > > http://wiki.apache.org/nutch/WritingPluginExample . If I 
> > > > understand this properly, the changes I needed to make were the 
> > > > following:
> > > >
> > > > In Nutch:
> > > > Paste the prescribed block of code into 
> > > > %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch 
> > > > to look for and run the urlmeta plugin.
> > > > In %NUTCH_HOME%, run "ant war".
> > > > Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this
> > > > file now looks like:
> > > > "http://www.aip.org/history/ead/20110369.xml \t
> > > > humanURL=http://www.aip.org/history/ead/20110369.html"
> > > >
> > > > In Solr:
> > > > Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml. The new
> > > > line consists of:
> > > > "<field name="humanURL" type="string" stored="true" indexed="false"/>"
> > > >
> > > > I've redone the indexing, and my new field still doesn't show up 
> > > > in the search results. Can you tell where I'm going wrong?
> > > >
> > > > Thanks,
> > > > Chip
> > > >
> > > > -----Original Message-----
> > > > From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> > > > Sent: Friday, September 16, 2011 4:37 AM
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Machine readable vs. human readable URLs.
> > > >
> > > > Hi Chip,
> > > >
> > > > It should simply be a matter of creating a custom field with an
> > > > IndexingFilter; you can then use it in any way you want on the
> > > > Solr side.
> > > >
> > > > Julien
> > > >
> > > > On 15 September 2011 21:50, Chip Calhoun <ccalh...@aip.org> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We'd like to use Nutch and Solr to replace an existing Verity search
> > > > > that's become a bit long in the tooth. In our Verity search, we have
> > > > > a hack which allows each document to have a machine-readable URL
> > > > > which is indexed (generally an xml document), and a human-readable
> > > > > URL which we actually send users to. Has anyone done the same with
> > > > > Nutch and Solr?
> > > > >
> > > > > Thanks,
> > > > > Chip
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *
> > > > *Open Source Solutions for Text Engineering
> > > >
> > > > http://digitalpebble.blogspot.com/
> > > > http://www.digitalpebble.com
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>
>
>
> --
> *Lewis*
>



--
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
