Re: Nutch 1.3 + Cygwin + hadoop + paths

2011-09-19 Thread webdev1977
I was afraid of this :-( I can't believe that no one has tried this configuration yet?

Re: Nutch 1.3 + Cygwin + hadoop + paths

2011-09-19 Thread lewis john mcgibbney
Hi, As you probably know, there are not very many active Windows + Nutch users on this list. This leaves you in a bit of a catch-22. When I first started using Nutch it was on a Windows desktop and I found it pretty painful at times. Most of the relevant documentation available caters for *nix

nutch 1.3 solrindex empty content field

2011-09-19 Thread Jann Forrer
Hi, I tried to run nutch-1.3 together with Solr 3.x according to http://wiki.apache.org/nutch/NutchTutorial. That worked as described, but if I try to search the index using the Solr admin interface I always get an empty result. http://localhost:8983/solr/admin/schema.jsp Using the Schema

Re: nutch 1.3 solrindex empty content field

2011-09-19 Thread Markus Jelsma
Check line 79 of your Solr schema: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/conf/schema.xml?view=markup Maybe we should configure the field to be stored in 1.4. I can imagine this causes a lot of headaches for new users. Also highlighting will never work with unstored fields. On
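
A minimal sketch of how one might list the fields that a Solr schema explicitly marks as unstored, which is the situation described above (an unstored field can be searched but is never returned in results or highlighted). The schema fragment and the function name are made up for illustration; the fragment is not copied from the Nutch 1.3 schema.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment, for illustration only.
SCHEMA = """
<schema name="example" version="1.3">
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
  </fields>
</schema>
"""

def explicitly_unstored_fields(schema_xml):
    """Return the names of <field> elements that carry stored="false"."""
    root = ET.fromstring(schema_xml)
    return [
        field.get("name")
        for field in root.iter("field")
        if field.get("stored") == "false"
    ]

print(explicitly_unstored_fields(SCHEMA))
```

If a field such as content shows up in that list, its stored attribute would need to be set to true and the data re-indexed before the field can appear in search results.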

Re: nutch 1.3 solrindex empty content field

2011-09-19 Thread Markus Jelsma
On Monday 19 September 2011 15:58:35 lewis john mcgibbney wrote: Yes, what Markus has pointed out is the problem I think Jann. This means you need to re-index your data and change the stored and index value to true. Markus, out of interest do you know the pros/cons if we were to make

Re: nutch 1.3 solrindex empty content field

2011-09-19 Thread Markus Jelsma
*previous sent by accident On Monday 19 September 2011 15:58:35 lewis john mcgibbney wrote: Yes, what Markus has pointed out is the problem I think Jann. This means you need to re-index your data and change the stored and index value to true. Markus, out of interest do you know the

Re: nutch 1.3 solrindex empty content field

2011-09-19 Thread lewis john mcgibbney
Does this solve your problem Jann? Is this worth filing an issue for, as it is rather trivial to address but could help more users unfamiliar with specifics of Nutch (or Solr) Schema(s)? On Mon, Sep 19, 2011 at 3:06 PM, Markus Jelsma markus.jel...@openindex.io wrote: *previous sent by accident

RE: Machine readable vs. human readable URLs.

2011-09-19 Thread Chip Calhoun
Hi Julien, Thanks, that's encouraging. I'm trying to make this work, and I'm definitely missing something. I hope I'm not too far off the mark. I've started with the instructions at http://wiki.apache.org/nutch/WritingPluginExample . If I understand this properly, the changes I needed to make

Re: Machine readable vs. human readable URLs.

2011-09-19 Thread lewis john mcgibbney
Hi Chip, There is no need to run ant war, there is no war target in the Nutch 1.3 build.xml file. Can you explain more about adding 'the tags to %NUTCH_HOME% etc etc. Do you mean you've added your seed URLs? Have you had a look at any of your log output as to whether the urlmeta plugin is

RE: Machine readable vs. human readable URLs.

2011-09-19 Thread Chip Calhoun
Hi Lewis, My probably wrong understanding was that I'm supposed to add the tags for my new field to my list of seed URLs. So if I have a seed URL followed by \t humanURL=http://www.aip.org/history/ead/20110369.html;, I get a new field called humanURL which is populated with the string
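
A minimal sketch of the seed-line layout described above: a URL followed by tab-separated key=value pairs. This only mimics the format as described in the message; it is not the Nutch injector code, and the function name and the example seed URL are hypothetical.

```python
def parse_seed_line(line):
    """Split a seed line into (url, metadata dict).

    Assumes the layout: URL, then zero or more tab-separated key=value pairs.
    """
    parts = line.rstrip("\n").split("\t")
    url = parts[0].strip()
    metadata = {}
    for item in parts[1:]:
        item = item.strip()
        if not item:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        if sep:
            metadata[key.strip()] = value.strip()
    return url, metadata

# Hypothetical seed URL; the humanURL value is the one quoted in the message.
seed = "http://www.example.org/ead/20110369.xml\thumanURL=http://www.aip.org/history/ead/20110369.html"
url, metadata = parse_seed_line(seed)
print(url)
print(metadata)
```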

Re: Machine readable vs. human readable URLs.

2011-09-19 Thread Julien Nioche
Hi, Since the info is available thanks to the injection, you can use the url-meta plugin as-is and won't need to have a custom version. See https://issues.apache.org/jira/browse/NUTCH-855 Apart from that, do not modify the content of \runtime\local\conf\ before re-compiling with Ant as this will

Consider relative outlinks conditionally as absolute URL

2011-09-19 Thread Markus Jelsma
Hi, I sometimes come across relative outlinks in the source that are intended as absolute but where the webmaster or CMS omits the protocol scheme. This results in repeating URI segments and crap URLs. Would an option that treats such URLs as absolute be a good idea? This problem is similar
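
A minimal sketch of the kind of option being proposed, assuming a simple heuristic: if a scheme-less link's first path segment looks like a host name starting with "www.", treat the link as absolute and borrow the scheme from the base URL. The heuristic, the function name and the example URLs are all hypothetical; this is not how Nutch resolves outlinks.

```python
import re
from urllib.parse import urljoin, urlsplit

# Hypothetical heuristic: "www." followed by at least two dot-separated labels.
HOST_LIKE = re.compile(r"^www\.[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$")

def resolve_outlink(base, href, assume_absolute=True):
    """Resolve href against base, optionally treating host-like links as absolute."""
    first_segment = href.split("/", 1)[0]
    if assume_absolute and not urlsplit(href).scheme and HOST_LIKE.match(first_segment):
        # The scheme was omitted in the source; reuse the base URL's scheme.
        return urlsplit(base).scheme + "://" + href
    # Standard relative resolution.
    return urljoin(base, href)

base = "http://site.example.com/docs/index.html"
print(resolve_outlink(base, "www.example.org/page.html"))
print(resolve_outlink(base, "www.example.org/page.html", assume_absolute=False))
```

With the option off, the second call shows the repeating-segment result that standard resolution produces for such links.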

Re: Machine readable vs. human readable URLs.

2011-09-19 Thread lewis john mcgibbney
In addition, it looks like you are misinterpreting how the urlmeta plugin works Chip. It is designed to pick up additional meta tags with name and content values respectively. e.g. meta name=humanURL content=blahblahblah The plugin then gets this data as well as any additional values added in

Re: Consider relative outlinks conditionally as absolute URL

2011-09-19 Thread Markus Jelsma
On Sep 19, 2011, at 1:52pm, Markus Jelsma wrote: Hi, I sometimes come across relative outlinks in the source that are intended as absolute but where the webmaster or CMS omits the protocol scheme. This results in repeating URI segments and crap URLs. Would an option that treat

RE: Machine readable vs. human readable URLs.

2011-09-19 Thread Chip Calhoun
I thought it seemed too good to be true. I understood the part about this picking up metadata from tags within the actual documents; that seems like a feature a lot of people would need. But I thought the whole point of the tab-delimited tags in my URLs file was that I could also inject tags

Re: Machine readable vs. human readable URLs.

2011-09-19 Thread Julien Nioche
In addition, it looks like you are misinterpreting how the urlmeta plugin works Chip. It is designed to pick up additional meta tags with name and content values respectively. e.g. meta name=humanURL content=blahblahblah Sorry Lewis, but it does not do that at all. See link I gave earlier