Re: Nutch 2.1 fields

2012-10-03 Thread Lewis John Mcgibbney
[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java On Thu, Oct 4, 2012 at 7:36 AM, Lewis John Mcgibbney wrote: > Hi James, > > On Thu, Oct 4, 2012 at 2:59 AM, wrote: >> Lewis and Chris, >> >> Agree that "

Re: Nutch 2.1 fields

2012-10-03 Thread Lewis John Mcgibbney
Hi James, On Thu, Oct 4, 2012 at 2:59 AM, wrote: > Lewis and Chris, > > Agree that "The Index Structure" page is very useful documentation. I went > through the fields/plugins listed in your link using Nutch 2.1 rc and most > work. I was able to get positive results for everything except the f

RE: Nutch 2.1 fields

2012-10-03 Thread j.sullivan
subcollection does work with 2.x and the problem was the configuration on my side (the subcollections.xml file in the conf folder). So the list of fields in the "The Index Structure" page I can't confirm working with Nutch 2.x yet are: segment primaryType subtype urlmeta -Original Messag

RE: Nutch 2.1 fields

2012-10-03 Thread j.sullivan
Lewis and Chris, Agree that "The Index Structure" page is very useful documentation. I went through the fields/plugins listed in your link using Nutch 2.1 rc and most work. I was able to get positive results for everything except the following segment -- I am guessing this is not relevant to Nu

Re: Parse HTML Page with link generated by javascript

2012-10-03 Thread Sebastian Nagel
Hi Alexandre, > I try to crawl a website with a menu generated with some javascript code. > For example on this website: > http://www.beautycenter-riebenbauer.at/ Nutch does not interpret JavaScript, but it has a link extractor for JavaScript based on regular expressions; see plugin parse-js. I

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

2012-10-03 Thread Lewis John Mcgibbney
Hi Matt, I know there is a pile of stuff to add to this, but for the time being (until I dive into your response in detail) please see below. On Tue, Oct 2, 2012 at 11:17 PM, Matt MacDonald wrote: > Hi, ... > > 5) What value should I set for gora.buffer.read.limit? Currently it's > set to the def

Re: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-03 Thread Bai Shen
Gotcha. I wasn't sure if that was the case or not. Just wanted to make sure y'all were aware. On Wed, Oct 3, 2012 at 9:37 AM, Julien Nioche wrote: > Only the Apache distribution of Hadoop version 1.0.3 is officially > supported by Nutch. Obviously if we can get it to work on other > distributi

Re: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-03 Thread Julien Nioche
Only the Apache distribution of Hadoop version 1.0.3 is officially supported by Nutch. Obviously, if we can get it to work on other distributions, so much the better, but this can't be considered a bug or a blocker for the release. On 3 October 2012 14:10, Bai Shen wrote: > I just tried to run it

Re: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-03 Thread Bai Shen
I just tried to run it and I'm getting the following bug on CDH4. https://issues.apache.org/jira/browse/NUTCH-1447 On Mon, Oct 1, 2012 at 8:17 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi All, > > Anyone else for this VOTE? > > Sorry to be a pest! > > Thanks > > Lewis > > On

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

2012-10-03 Thread Matt MacDonald
An answer to one of my own questions. I'd still love help with the others. > Some questions: > - > 1) After 12 iterations I'm still seeing more than 4,500 documents out > of 45,000 that are unfetched. How might I go about determining why the > unfetched urls are not being

Parse HTML Page with link generated by javascript

2012-10-03 Thread Alexandre
Hi everyone, I'm using Nutch 1.5.1, and I configured my parse plugins like this: I am trying to crawl a website with a menu generated with some javascript code. For example