Hi,
I didn't manage to run the invertlinks and solrindex commands on only some
segments, since it seems those commands work only on the segments parent dir.
So I've made a small change to my fetch/parse/update/index loop.
*In short:*
I generate new segments in an empty "current_segments" dir. When th
Hi David,
You can also consider setting a shorter fetch interval with nutch inject.
This way you'll set a higher score (so the URL is always taken in priority
when you generate a segment) and a fetch.interval of 1 day.
If you have a case similar to mine, you'll often want some homepage fetched each
If you want records to be fetched at a fixed interval, it's easier to inject them
with a fixed fetch interval.
nutch.fixedFetchInterval=86400
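
As a rough illustration of the injection approach discussed above, here is a minimal sketch that builds seed-list lines carrying per-URL metadata. It assumes the seed file format is a URL followed by tab-separated `key=value` pairs, and that `nutch.score` and `nutch.fetchInterval` are the metadata keys your Nutch version reads; check both against your Injector documentation. The helper `seed_line` and the example URLs are hypothetical.

```python
# Sketch only: builds seed-list lines for `nutch inject`.
# Assumption: the Injector accepts tab-separated key=value metadata after the URL.
# The key names below are assumptions; verify them for your Nutch version.

ONE_DAY = 86400  # seconds, matching the 1-day interval mentioned in the thread


def seed_line(url, score=None, fetch_interval=None,
              score_key="nutch.score", interval_key="nutch.fetchInterval"):
    """Return one seed-file line: the URL plus optional metadata fields."""
    fields = [url]
    if score is not None:
        fields.append(f"{score_key}={score}")
    if fetch_interval is not None:
        fields.append(f"{interval_key}={fetch_interval}")
    return "\t".join(fields)


if __name__ == "__main__":
    lines = [
        # A homepage that should be refetched daily with a higher score.
        seed_line("http://example.com/", score=10, fetch_interval=ONE_DAY),
        # An ordinary URL that keeps the default score and interval.
        seed_line("http://example.com/archive/"),
    ]
    print("\n".join(lines))
```

To try a different metadata key, such as the one quoted in the message above, pass it as `interval_key` instead of editing the helper.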
-Original message-
> From:kemical
> Sent: Thu 14-Feb-2013 10:15
> To: user@nutch.apache.org
> Subject: Re: Nutch Incremental Crawl
>
> Hi Davi
Hi everyone,
I'm new to Nutch and I would appreciate some advice...
I want to use Nutch to crawl URLs and categorize them.
I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase 0.94.2,
and I saw that Nutch 2.1 with Gora supports HBase as backend.
I would like to start by runn
Hi Amit,
On Thu, Feb 14, 2013 at 6:24 AM, Amit Sela wrote:
>
> I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase 0.94.2,
> and I saw that Nutch 2.1 with Gora supports HBase as backend.
>
First things first. We cannot guarantee that Gora and subsequently Nutch
will work with t
Hello,
I see that there are fields in addition to the title, host and content ones
in nutch-2.x's solr-mapping.xml. I thought tstamp may be needed for sorting
documents. What about the other fields, segment, boost and digest? Can
Hi Alex,
Tstamp represents fetch time, used for deduplication.
Boost is for scoring-opic and link. This is required in 2.x as well.
I don't have the code right now, but you can try removing digest and
segment. To me they both look legacy.
There is a wiki page on index structure which you can consul
Hi,
I am trying to set up Nutch 2.1 and Solr 4.0 with a MySQL database, following
the instructions at this link: http://nlp.solutions.asia/?p=180.
I made the same changes in conf/nutch-site.xml (set threads to 50).
When I start crawl (path: ~/Desktop/apache-nutch-2.1/runtime/local,
command: bin/nutch crawl u