Re: Very long time just before fetching and just after parsing

2013-02-14 Thread kemical
Hi, I didn't manage to run the invertlinks and solrindex commands for only some segments, since it seems those commands work only on the segments' parent dir. So I've made a small change to my fetch/parse/update/index loop. *In short:* I generate new segments in an empty "current_segments" dir. When th

Re: Nutch Incremental Crawl

2013-02-14 Thread kemical
Hi David, You can also consider setting a shorter fetch interval with nutch inject. This way you'll set a higher score (so the URL is always taken in priority when you generate a segment) and a fetch.interval of 1 day. If your case is similar to mine, you'll often want some homepage fetched each

RE: Nutch Incremental Crawl

2013-02-14 Thread Markus Jelsma
If you want records to be fetched at a fixed interval, it's easier to inject them with a fixed fetch interval. nutch.fixedFetchInterval=86400 -Original message- > From:kemical > Sent: Thu 14-Feb-2013 10:15 > To: user@nutch.apache.org > Subject: Re: Nutch Incremental Crawl > > Hi Davi

Nutch 2.1 over Hadoop 1.0.3 and HBase 0.94.2

2013-02-14 Thread Amit Sela
Hi everyone, I'm new to Nutch and I would appreciate some advice... I want to use Nutch to crawl URLs and categorize them. I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase 0.94.2, and I saw that Nutch 2.1 with Gora supports HBase as a backend. I would like to start by runn

Re: Nutch 2.1 over Hadoop 1.0.3 and HBase 0.94.2

2013-02-14 Thread Lewis John Mcgibbney
Hi Amit, On Thu, Feb 14, 2013 at 6:24 AM, Amit Sela wrote: > > I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase 0.94.2, > and I saw that Nutch 2.1 with Gora supports HBase as backend. > First things first. We cannot guarantee that Gora and subsequently Nutch will work with t

fields in solrindex-mapping.xml

2013-02-14 Thread alxsss
Hello, I see that there are fields in addition to title, host and content in Nutch 2.x's solrindex-mapping.xml. I thought tstamp may be needed for sorting documents. What about the other fields: segment, boost and digest? Can

Re: fields in solrindex-mapping.xml

2013-02-14 Thread Lewis John Mcgibbney
Hi Alex, Tstamp represents fetch time, used for deduplication. Boost is for scoring-opic and link. This is required in 2.x as well. I don't have the code right now, but you can try removing digest and segment. To me they both look legacy. There is a wiki page on index structure which you can consul

Nutch 2.1 different batch id (null)

2013-02-14 Thread Dragan Menoski
Hi, I am trying to set up Nutch 2.1 and Solr 4.0 with a MySQL database, following the instructions at this link: http://nlp.solutions.asia/?p=180. I made same changes in conf/nutch-site.xml (set threads to 50). When I start the crawl (path: ~/Desktop/apache-nutch-2.1/runtime/local, command: bin/nutch crawl u