Hi Julien,

Thanks for clarifying this! I've got it working now. Instead of seeding with a proper tab-delimited file created in Excel, I had been wrong-headedly seeding it with a text file that just had tabs in it. They look the same, but it makes a difference. Thanks!
Chip

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Monday, September 19, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

> In addition, it looks like you are misinterpreting how the urlmeta
> plugin works Chip. It is designed to pick up additional meta tags with
> name and content values respectively. e.g.
>
> <meta name="humanURL" content="blahblahblah">
>

Sorry Lewis but it does not do that at all. See link I gave earlier for a description of urlmeta. I agree that the name is misleading, it does not extract the content from the page but simply uses the crawldb metadata

>
> The plugin then gets this data as well as any additional values added
> in the urlmeta.tags property within nutch-site.xml and adds this to the
> index which can then be queried.
>
> Does this make sense?
>
> On Mon, Sep 19, 2011 at 9:10 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>
> > Hi
> >
> > Since the info is available thanks to the injection you can use the
> > url-meta plugin as-is and won't need to have a custom version. See
> > https://issues.apache.org/jira/browse/NUTCH-855
> >
> > Apart from that do not modify the content of \runtime\local\conf\
> > before re-compiling with ANT as this will be overwritten. Either
> > modify $NUTCH/conf/nutch-site.xml or recompile THEN modify.
> >
> > As Lewis suggested check the logs and see if the plugin is activated
> > etc...
> >
> > J.
> >
> > On 19 September 2011 21:03, Chip Calhoun <ccalh...@aip.org> wrote:
> >
> > > Hi Lewis,
> > >
> > > My probably wrong understanding was that I'm supposed to add the
> > > tags for my new field to my list of seed URLs. So if I have a seed
> > > URL followed by
> > > " \t humanURL=http://www.aip.org/history/ead/20110369.html",
> > > I get a new field called "humanURL" which is populated with the
> > > string I've specified for that specific URL.
> > > I may just be greatly misunderstanding how this plugin works.
> > >
> > > I've checked my Nutch logs now and it looks like nothing happened.
> > > The new field does at least show up in the Solr admin UI's schema,
> > > but clearly my problem is on the Nutch end of things.
> > >
> > > -----Original Message-----
> > > From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> > > Sent: Monday, September 19, 2011 3:34 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Machine readable vs. human readable URLs.
> > >
> > > Hi Chip,
> > >
> > > There is no need to run ant war, there is no war target in the
> > > Nutch >= 1.3 build.xml file.
> > >
> > > Can you explain more about adding 'the tags to %NUTCH_HOME% etc
> > > etc. Do you mean you've added your seed URLs?
> > >
> > > Have you had a look at any of your log output as to whether the
> > > urlmeta plugin is loaded and used when fetching?
> > >
> > > You should be able to get info on your schema, fields etc within
> > > the Solr admin UI
> > >
> > > On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <ccalh...@aip.org> wrote:
> > >
> > > > Hi Julien,
> > > >
> > > > Thanks, that's encouraging. I'm trying to make this work, and
> > > > I'm definitely missing something. I hope I'm not too far off the
> > > > mark. I've started with the instructions at
> > > > http://wiki.apache.org/nutch/WritingPluginExample . If I
> > > > understand this properly, the changes I needed to make were the
> > > > following:
> > > >
> > > > In Nutch:
> > > > Paste the prescribed block of code into
> > > > %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch
> > > > to look for and run the urlmeta plugin.
> > > > In %NUTCH_HOME%, run "ant war".
> > > > Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch.
> > > > A line in this file now looks like:
> > > > "http://www.aip.org/history/ead/20110369.xml \t humanURL=http://www.aip.org/history/ead/20110369.html"
> > > >
> > > > In Solr:
> > > > Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml .
> > > > The new line consists of:
> > > > " <field name="humanURL" type="string" stored="true" indexed="false"/>"
> > > >
> > > > I've redone the indexing, and my new field still doesn't show up
> > > > in the search results. Can you tell where I'm going wrong?
> > > >
> > > > Thanks,
> > > > Chip
> > > >
> > > > -----Original Message-----
> > > > From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
> > > > Sent: Friday, September 16, 2011 4:37 AM
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Machine readable vs. human readable URLs.
> > > >
> > > > Hi Chip,
> > > >
> > > > Should simply be a matter of creating a custom field with an
> > > > IndexingFilter, you can then use it in any way you want on the
> > > > SOLR side
> > > >
> > > > Julien
> > > >
> > > > On 15 September 2011 21:50, Chip Calhoun <ccalh...@aip.org> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We'd like to use Nutch and Solr to replace an existing Verity
> > > > > search that's become a bit long in the tooth. In our Verity
> > > > > search, we have a hack which allows each document to have a
> > > > > machine-readable URL which is indexed (generally an xml
> > > > > document), and a human-readable URL which we actually send
> > > > > users to. Has anyone done the same with Nutch and Solr?
> > > > >
> > > > > Thanks,
> > > > > Chip
> > > >
> > > > --
> > > > *
> > > > *Open Source Solutions for Text Engineering
> > > >
> > > > http://digitalpebble.blogspot.com/
> > > > http://www.digitalpebble.com
> > >
> > > --
> > > *Lewis*
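
For anyone finding this thread later: the seed-line format discussed above is a URL, a real tab character, then key=value metadata. One way to rule out editor or spreadsheet differences is to generate the file from a script. The following is only a minimal Python sketch, assuming the injector splits each line on literal tab characters as the thread suggests. The helper names (seed_line, parse_seed_line, write_seed_file) are made up for illustration and are not part of Nutch.

```python
def seed_line(url, metadata):
    """Build one seed line: the URL, then tab-separated key=value pairs.

    Assumption: fields are separated by a single literal tab character.
    """
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    for field in fields:
        # A stray tab or newline inside a field would shift the columns.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


def parse_seed_line(line):
    """Split a seed line back into (url, metadata) to check what was written."""
    url, *pairs = line.rstrip("\n").split("\t")
    metadata = dict(pair.split("=", 1) for pair in pairs)
    return url, metadata


def write_seed_file(path, entries):
    """Write (url, metadata) entries to a seed file, one per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for url, metadata in entries:
            handle.write(seed_line(url, metadata) + "\n")
```

Calling write_seed_file("seed.txt", [("http://www.aip.org/history/ead/20110369.xml", {"humanURL": "http://www.aip.org/history/ead/20110369.html"})]) would produce the example line from Chip's message with a real tab between the two parts. The file name seed.txt is just a placeholder.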