Hi,
thanks for your answers, I've configured compression:
mapred.output.compress = true
mapred.compress.map.output = true
mapred.output.compression.type= BLOCK
(in XML format in hadoop-site.xml; sketched below)
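For reference, a sketch of how those three properties would look as XML entries in hadoop-site.xml (names and values taken from the list above):

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>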
and it works (and uses less disk space; no more out-of-disk-space exceptions), but merging now
Hi,
I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages, contained in 10 segments, on one machine, using:
bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
but there is not enough space on the 500G disk to complete this merge task (getting java.io.IOException: No space left on
Hi,
what you actually do when you create a profile is train the identifier (classifier) on sample text (so it learns the language's most popular n-grams and their statistics). These n-gram language statistics are then written to a file langCode.ngp (this is the profile name; it is an output file from this
Hi,
is there a way to safely run some of these operations in parallel: generate, fetch, parse, and updatedb (and if so, how)?
thanks,
Tomislav
Hi,
in my experience (with nutch-1.0-dev from trunk) you can use readseg to get anything (including content) from a segment, but it depends on the flags you use; try this:
bin/nutch readseg -get crawl-20080311124208/segments/20080311124212/
http://test.dipiti.com/health/addiction -nofetch -nogenerate -noparse
Hi,
a correction on the Java code: to get only content you should use this constructor:
// the booleans select which segment parts to read (the order appears to be:
// content, fetch, generate, parse, parseData, parseText); only content is enabled
SegmentReader reader = new SegmentReader(conf, true, false, false, false, false, false);
Tomislav
of this.
One more question: if I start a search server in the background, can I use it to receive direct queries from another webpage?
Thank you
. so is there no way other than using the webapp for query processing, or calling the searcher from the command line?
Thank you
Tomislav Poljak wrote:
Hi,
I'm not sure if I understand the question, but you can start the server in the background (bin/nutch server 4321 crawl/) and use it from
Hi,
this is used for Distributed Search, so if you want to use it, start the server(s):
bin/nutch server port crawl dir
on the machine(s) where you have the index(es) (you can use any free port, and crawl dir should point to your crawl folder). Then you should configure the Nutch search web app to use this
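For context, the web-app side is, as far as I remember, wired up through the searcher.dir property in nutch-site.xml: it should point at a directory containing a search-servers.txt file listing one host/port pair per line. A sketch, with the path given here as an assumption:

<property>
  <name>searcher.dir</name>
  <!-- assumed path; this directory would hold a search-servers.txt file
       with one "host port" line per search server, e.g. "localhost 4321" -->
  <value>/path/to/crawl</value>
</property>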
the job as well.
Cheers
On Wed, Mar 5, 2008 at 6:11 PM, Tomislav Poljak [EMAIL PROTECTED] wrote:
Hi,
try this:
bin/nutch merge crawl/index crawl/indexes crawl/indexes1
where crawl/index (not indexes) should be created by merge and
crawl/indexes and crawl/indexes1 are existing indexes for merging. The Nutch search web application will use the merged index from crawl/index and you should see this in
Hi,
I think the simplest way to get parsed text from a segment (Nutch stores parse text in the segment, for example:
crawl/segments/20080107120936/parse_text) to a text file is the dump option of the segment reader:
bin/nutch readseg -dump crawl/segments/20080107120936 dump -nocontent
-nofetch -nogenerate
it in parseData
metadata)?
Tomislav
On Mon, 2008-01-14 at 20:41 +0100, Andrzej Bialecki wrote:
Tomislav Poljak wrote:
Hi, I have been reading data from Nutch segments and came across pages/records with empty parse text. So I looked more into this and manually fetched data for these urls. Lots
Hi,
I am trying to debug the following exception:
ERROR http.Http - at java.util.regex.Pattern$Curly.match0(Pattern.java:3773)
This exception occurs a lot while fetching, so my question is: why does Nutch use regex in the fetching phase, is it for url filtering? Shouldn't the fetchlist already be filtered
Hi,
I have the same problem (same exception) after the select phase of generate; sometimes it works fine and sometimes this exception occurs.
Why is that and how can I fix it?
Thanks,
Tomislav
On Tue, 2007-11-06 at 15:02 +0100, Karol Rybak wrote:
ced that partitioning job generates a
Hi,
I have a few questions about fetching 1MM pages. I am trying to fetch 1MM pages on a cluster of 2 machines (EC2 systems), using 4 map and 4 reduce tasks, each using 200 threads. The fetchlist is generated with generate.max.per.host=5 and I get a fetchlist of about 5000 urls (so it should be at least
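For reference, the generate.max.per.host setting mentioned above would normally live in nutch-site.xml; a sketch using the value from this run:

<property>
  <name>generate.max.per.host</name>
  <!-- caps the number of urls per host in a generated fetchlist -->
  <value>5</value>
</property>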
Hi,
I have the same problem; can this be the reason for slow fetching?
Thanks,
Tomislav
On Tue, 2007-11-20 at 12:57 +0100, Andrzej Bialecki wrote:
施兴 wrote:
Hi,
2007-11-20 11:07:28,712 WARN dfs.DataNode - Failed to transfer
blk_-3387595792800455675 to 192.168.140.244:50010 got
Hi,
I had the same problem using the re-crawl scripts from the wiki. They all work fine with nutch versions up to 0.9 (0.9 included), but when using nutch-1.0-dev (from trunk) they break at the merge of indexes. The reason is that the merge in nutch-0.9 (from the re-crawl scripts):
bin/nutch merge crawl/indexes
: Java heap space
2007-09-09 01:07:43,045 FATAL fetcher.Fetcher - fetcher
caught:java.lang.OutOfMemoryError: Java heap space
Any ideas why?
Thanks,
Tomislav
On Mon, 2007-09-10 at 21:30 +0200, Andrzej Bialecki wrote:
Tomislav Poljak wrote:
Hi Andrzej,
I am running the fetcher in non-parsing mode; I have this in nutch-site.xml:
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content.</description>
</property>
Maybe I didn't post the question correctly. I
Hi,
I think -numFetchers is deprecated, read this:
http://www.nabble.com/mapred--numFetchers-gone--tf362358.html#a1003373
Tomislav
On Tue, 2007-09-11 at 09:37 -0700, Jenny LIU wrote:
When I do:
nutch generate crawl/db crawl/segments -numFetchers 30 -topN 5000
I was trying to get 30
Hi,
so I have dedicated 1000 MB (-Xmx1000m) to the Nutch java process when fetching (default settings). When using 10 threads I can fetch 25000 urls, but when using 20 threads the fetcher fails with
java.lang.OutOfMemoryError: Java heap space
even when fetching a 15000-url fetchlist. Are 20 threads too much
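For context: when fetching runs as a Hadoop job, the heap of the spawned task JVMs is controlled by mapred.child.java.opts in hadoop-site.xml rather than the -Xmx of the parent process; a sketch, reusing the 1000m figure from above as an assumed value:

<property>
  <name>mapred.child.java.opts</name>
  <!-- heap for child task JVMs; the Hadoop default at the time was -Xmx200m -->
  <value>-Xmx1000m</value>
</property>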
Hi,
I have noticed that Nutch, while parsing segment data, doing updates, indexing, or other CPU-demanding operations, uses only one CPU (core).
Actually it uses both, but alternately: when one CPU goes to 100% the other CPU is at 1%, and then they switch (never using both CPUs at 100%). For example, when
Hi,
what would be a recommended hardware specification for a machine running the searcher web application with 15K users per day, which uses an index of 100K urls (crawling is done by another machine)? What is a good practice for getting the index from the crawl machine to the search machine (if using separate machines
Hi Renaud,
thank you for your reply. This is valuable information, but can you elaborate a little bit more? For example,
you say: Nutch is always using Hadoop.
I assume it does not use the Hadoop Distributed File System (HDFS) when running on a single machine by default?
The hadoop homepage says: Hadoop
Would it be recommended to use hadoop for crawling (100 sites with 1000
pages each) on a single machine? What would be the benefit?
Something like what is described at:
http://wiki.apache.org/nutch/NutchHadoopTutorial but on a single
machine.
Or is the simple crawl/recrawl (without hadoop, like
I need help determining hardware specs for crawling 100 sites with 1000
pages each. Regular re-crawl is needed probably every day (maybe even
more often). So will one server meet these crawling requirements (only
crawling; searching will be handled by another machine)? If so, what
hardware
Hi,
if it helps:
you don't need to restart Tomcat to load index changes; it is enough to restart the individual web application (without restarting the Tomcat service) by touching the application's web.xml file. This is faster than restarting Tomcat. Add:
touch $tomcat_dir/WEB-INF/web.xml
to the