Re: mergesegs disk space

2009-07-21 Thread Tomislav Poljak
Hi, thanks for your answers. I've configured compression: mapred.output.compress = true, mapred.compress.map.output = true, mapred.output.compression.type = BLOCK (in XML format in hadoop-site.xml) and it works (and uses less disk space, no more out-of-disk-space exceptions), but merging now
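
For reference, a minimal sketch of what those three properties look like as XML (the here-document just prints the fragment; where it goes, i.e. inside the <configuration> element of conf/hadoop-site.xml, is an assumption about a typical Nutch 0.x setup):

# hedged sketch: prints the compression properties from the message,
# ready to be pasted inside <configuration> in conf/hadoop-site.xml
cat <<'EOF'
<property><name>mapred.output.compress</name><value>true</value></property>
<property><name>mapred.compress.map.output</name><value>true</value></property>
<property><name>mapred.output.compression.type</name><value>BLOCK</value></property>
EOF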

mergesegs disk space

2009-07-15 Thread Tomislav Poljak
Hi, I'm trying to merge (using nutch-1.0 mergesegs) about 1.2MM pages in 10 segments on one machine, using: bin/nutch mergesegs crawl/merge_seg -dir crawl/segments, but there is not enough space on a 500G disk to complete this merge task (getting java.io.IOException: No space left on
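
In case it helps others hitting the same limit, a hedged sketch of the merge-and-swap sequence around that command (the mergesegs call is the one from the message; the swap and cleanup steps are an assumption about one way to reclaim the space held by the old segments):

# merge the segments under crawl/segments into a single segment in crawl/merge_seg
bin/nutch mergesegs crawl/merge_seg -dir crawl/segments
# only after the merge finishes successfully: swap in the merged segment
# and delete the originals to free the disk space they occupy
mv crawl/segments crawl/segments.old
mv crawl/merge_seg crawl/segments
rm -r crawl/segments.old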

Re: how to create Generating a new language profile in Nutch

2008-08-21 Thread Tomislav Poljak
Hi, what you actually do when you create a profile is train the identifier (classifier) on sample text (so it learns the language's most popular n-grams and their statistics). These n-gram language statistics are then written to a file langCode.ngp (this is the profile name - it is an output file from this

Parallel operations in fetch

2008-04-10 Thread Tomislav Poljak
Hi, is there a way to do some of these operations in parallel safely: generate, fetch, parse and updatedb (and if so, how)? thanks, Tomislav

Re: using readseg to get full contents?

2008-03-12 Thread Tomislav Poljak
Hi, in my experience (with nutch-1.0-dev from trunk) you can use readseg to get anything (content also) from a segment, but it depends on the flags you use; try this: bin/nutch readseg -get crawl-20080311124208/segments/20080311124212/ http://test.dipiti.com/health/addiction -nofetch -nogenerate -noparse
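
For completeness, a hedged sketch of the same -get call with the remaining no-* switches spelled out (the extra -noparsedata and -noparsetext flags are assumptions based on SegmentReader's usual options and would limit the output to the raw content only; check bin/nutch readseg usage for your build):

# print only the stored content for one URL from the segment;
# crawl_fetch, crawl_generate, crawl_parse, parse_data and parse_text are all skipped
bin/nutch readseg -get crawl-20080311124208/segments/20080311124212/ \
  http://test.dipiti.com/health/addiction \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext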

Re: using readseg to get full contents?

2008-03-12 Thread Tomislav Poljak
Hi, a correction on the Java code: to get only content you should use this constructor: SegmentReader reader = new SegmentReader(conf, true, false, false, false, false, false); Tomislav On Wed, 2008-03-12 at 09:19 +0100, Tomislav Poljak wrote: Hi, in my experience (with nutch-1.0-dev from

Re: Search server bin/nutch server?

2008-03-12 Thread Tomislav Poljak
of this. One more question: if I start a search server in the background, can I use it to receive direct queries from another webpage? Thank you Tomislav Poljak wrote: Hi, this is used for Distributed Search, so if you want to use it start the server(s): bin/nutch server <port> <crawl dir>

Re: Search server bin/nutch server?

2008-03-12 Thread Tomislav Poljak
. so there is no way other than using the webapp for query processing, or calling the searcher from the command line? Thank you Tomislav Poljak wrote: Hi, I'm not sure if I understand the question, but you can start the server in the background (bin/nutch server 4321 crawl/ ) and use it from

Re: Search server bin/nutch server?

2008-03-11 Thread Tomislav Poljak
Hi, this is used for Distributed Search, so if you want to use it start the server(s): bin/nutch server <port> <crawl dir> on the machine(s) where you have the index(es) (you can use any free port, and the crawl dir should point to your crawl folder). Then you should configure the Nutch search web app to use this
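
A hedged sketch of both halves of that setup (host names, port and paths are placeholders; the search-servers.txt / searcher.dir convention is an assumption about how the Nutch 0.9/1.0 distributed search client is usually configured, so verify it against your version):

# on each index machine: serve its crawl dir on a free port
bin/nutch server 4321 /data/crawl

# on the search webapp machine: list the servers in search-servers.txt,
# inside the directory that the searcher.dir property in nutch-site.xml points to
mkdir -p /data/search
echo "indexhost1 4321" >  /data/search/search-servers.txt
echo "indexhost2 4321" >> /data/search/search-servers.txt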

RE: merging indexes with nutch

2008-03-08 Thread Tomislav Poljak
the job as well. Cheers On Wed, Mar 5, 2008 at 6:11 PM, Tomislav Poljak [EMAIL PROTECTED] wrote: Hi, try this: bin/nutch merge crawl/index crawl/indexes crawl/indexes1 where crawl/index (not indexes) should be created by merge and crawl/indexes and crawl/indexes1

Re: merging indexes with nutch

2008-03-05 Thread Tomislav Poljak
Hi, try this: bin/nutch merge crawl/index crawl/indexes crawl/indexes1 where crawl/index (not indexes) will be created by the merge, and crawl/indexes and crawl/indexes1 are the existing indexes to be merged. The Nutch search web application will use the merged index from crawl/index and you should see this in
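
A hedged sketch of that merge, with the expectations around each path spelled out (paths are the ones from the message; the note about the webapp is an assumption about a standard crawl-dir layout):

# merge the existing indexes into a single index at crawl/index;
# crawl/index must not exist beforehand - it is created by the merge
bin/nutch merge crawl/index crawl/indexes crawl/indexes1
# the search webapp will then use crawl/index when its search directory
# points at the crawl/ folder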

Re: How to use Nutch to parse Web-pages!

2008-01-16 Thread Tomislav Poljak
Hi, I think the simplest way to get parsed text from a segment (Nutch stores parse text in the segment, for example: crawl/segments/20080107120936/parse_text) into a text file is the dump option of the segment reader: bin/nutch readseg -dump crawl/segments/20080107120936 dump -nocontent -nofetch -nogenerate
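
A hedged sketch of the full dump command (the first three flags are the ones quoted above; -noparse and -noparsedata are assumptions that leave only the parse_text part in the dump, so confirm against bin/nutch readseg usage):

# dump only the parsed text of the segment into the output directory "dump";
# the result is a plain-text file named dump/dump
bin/nutch readseg -dump crawl/segments/20080107120936 dump \
  -nocontent -nofetch -nogenerate -noparse -noparsedata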

Re: Redirect pages in segment

2008-01-15 Thread Tomislav Poljak
it in parseData metadata)? Tomislav On Mon, 2008-01-14 at 20:41 +0100, Andrzej Bialecki wrote: Tomislav Poljak wrote: Hi, I have been reading data from Nutch segments and came across pages/records with empty parse text. So I looked more into this and manually fetched data for these urls. Lots

Regex while fetching

2007-12-12 Thread Tomislav Poljak
Hi, I am trying to debug the following exception: ERROR http.Http - at java.util.regex.Pattern$Curly.match0(Pattern.java:3773) This exception occurs a lot while fetching, so my question is: why does Nutch use regex in the fetching phase, is it for URL filtering? Shouldn't the fetchlist already be filtered

Re: Problem with partititioning

2007-12-11 Thread Tomislav Poljak
Hi, I have the same problem (same exception) after the select phase of generate; sometimes it works fine and sometimes this exception occurs. Why is that and how can I fix it? Thanks, Tomislav On Tue, 2007-11-06 at 15:02 +0100, Karol Rybak wrote: ced that partitioning job generates a

fetching 1MM pages

2007-12-10 Thread Tomislav Poljak
Hi, I have a few questions about fetching 1MM pages. I am trying to fetch 1MM pages on a cluster of 2 machines (EC2 systems), using 4 map and 4 reduce tasks, each using 200 threads. The fetchlist is generated with generate.max.per.host=5 and I get a fetchlist of about 5000 urls (so it should be at least
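
For context, a hedged sketch of the generate step behind such a fetchlist (only generate.max.per.host=5 comes from the message; the crawldb/segments paths and the -topN value are placeholders for illustration):

# cap urls per host at generate time, e.g. in conf/nutch-site.xml:
#   <property><name>generate.max.per.host</name><value>5</value></property>
# then generate a fetchlist of up to 1MM urls from the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000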

Re: dfs.DataNode - Failed to transfer blk_xxxx to 192.168.140.244:50010 got java.net.SocketException: Connection reset

2007-11-21 Thread Tomislav Poljak
Hi, I have the same problem, can this be the reason for slow fetching? Thanks, Tomislav On Tue, 2007-11-20 at 12:57 +0100, Andrzej Bialecki wrote: 施兴 wrote: HI 2007-11-20 11:07:28,712 WARN dfs.DataNode - Failed to transfer blk_-3387595792800455675 to 192.168.140.244:50010 got

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-20 Thread Tomislav Poljak
Hi, I had the same problem using the re-crawl scripts from the wiki. They all work fine with Nutch versions up to and including 0.9, but when using nutch-1.0-dev (from trunk) they break at the merge of indexes. The reason is that the merge in nutch-0.9 (from the re-crawl scripts): bin/nutch merge crawl/indexes

Re: OutOfMemoryError while fetching

2007-09-11 Thread Tomislav Poljak
: Java heap space 2007-09-09 01:07:43,045 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space Any ideas why? Thanks, Tomislav On Mon, 2007-09-10 at 21:30 +0200, Andrzej Bialecki wrote: Tomislav Poljak wrote: Hi, so I have dedicated 1000 Mb (-Xmx1000m

Re: OutOfMemoryError while fetching

2007-09-11 Thread Tomislav Poljak
: Tomislav Poljak wrote: Hi Andrzej, I am running the fetcher in non-parsing mode; I have this in nutch-site.xml: <property> <name>fetcher.parse</name> <value>false</value> <description>If true, fetcher will parse content.</description> </property> Maybe I didn't post the question correctly. I
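
A hedged sketch of the non-parsing workflow around that property (the segment path is a placeholder; the -noParsing flag and the separate parse step are assumptions about the usual Nutch 0.9 fetch-then-parse sequence, so check bin/nutch fetch usage for your version):

# fetch without parsing (fetcher.parse=false in nutch-site.xml, or the
# command-line switch below), then parse the fetched content as a separate job
bin/nutch fetch crawl/segments/20070910120000 -threads 10 -noParsing
bin/nutch parse crawl/segments/20070910120000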

Re: Why 'nutch generate' is ignoring my argument of -numFetchers

2007-09-11 Thread Tomislav Poljak
Hi, I think -numFetchers is deprecated, read this: http://www.nabble.com/mapred--numFetchers-gone--tf362358.html#a1003373 Tomislav On Tue, 2007-09-11 at 09:37 -0700, Jenny LIU wrote: When I do: nutch generate crawl/db crawl/segments -numFetchers 30 -topN 5000 I was trying to get 30

OutOfMemoryError while fetching

2007-09-10 Thread Tomislav Poljak
Hi, so I have dedicated 1000 MB (-Xmx1000m) to the Nutch java process when fetching (default settings). When using 10 threads I can fetch 25000 urls, but when using 20 threads the fetcher fails with: java.lang.OutOfMemoryError: Java heap space, even when fetching a 15000-url fetchlist. Is 20 threads too much
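
A hedged sketch of giving the fetcher a larger heap instead of (or in addition to) dropping back to 10 threads (NUTCH_HEAPSIZE is the environment variable the bin/nutch script uses to set -Xmx in megabytes; the 2000 value and the segment path are arbitrary examples, not recommendations from the thread):

# raise the heap of the local Nutch JVM above the 1000 MB default, then fetch
export NUTCH_HEAPSIZE=2000    # megabytes, i.e. -Xmx2000m
bin/nutch fetch crawl/segments/20070910120000 -threads 20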

dual-core cpu usage while parsing and indexing

2007-09-08 Thread Tomislav Poljak
Hi, I have noticed that Nutch, while parsing segment data, doing updates, indexing or other CPU-demanding operations, uses only one CPU (core). Actually it uses both, but alternately: when one CPU goes to 100% the other CPU is at 1%, and then they switch (never using both CPUs at 100%). For example, when

Re: help with hardware requirements

2007-09-07 Thread Tomislav Poljak
Hi, what would be a recommended hardware specification for a machine running the searcher web application with 15K users per day which uses an index of 100K urls (crawling is done by another machine)? What is a good practice for getting the index from the crawl machine to the search machine (if using separate machines

Re: hadoop on single machine

2007-08-31 Thread Tomislav Poljak
Hi Renaud, thank you for your reply. This is valuable information, but can you elaborate a little bit more? You say: Nutch is always using Hadoop. I assume it does not use the Hadoop Distributed File System (HDFS) when running on a single machine by default? The hadoop homepage says: Hadoop

hadoop on single machine

2007-08-30 Thread Tomislav Poljak
Would it be recommended to use hadoop for crawling (100 sites with 1000 pages each) on a single machine? What would be the benefit? Something like what is described at http://wiki.apache.org/nutch/NutchHadoopTutorial, but on a single machine. Or is the simple crawl/recrawl (without hadoop, like

help with hardware requirements

2007-08-27 Thread Tomislav Poljak
I need help determining hardware specs for crawling 100 sites with 1000 pages each. A regular re-crawl is needed, probably every day (maybe even more often). So will one server meet these crawling requirements (only crawling; searching will be handled by another machine)? If so, what hardware

Re: how to update CrawlDB instead of Recrawling???

2007-08-11 Thread Tomislav Poljak
Hi, if it helps: you don't need to restart Tomcat to load index changes; it is enough to restart the individual web application (without restarting the Tomcat service) by touching the application's web.xml file. This is faster than restarting Tomcat. Add: touch $tomcat_dir/WEB-INF/web.xml to the
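
A hedged sketch of that reload step at the end of a re-crawl script ($tomcat_dir stands for the deployed Nutch webapp directory; the example path is an assumption, so adjust it to your Tomcat layout):

# make Tomcat reload only the Nutch webapp, so it reopens the new index
# without restarting the whole Tomcat service
tomcat_dir=/usr/local/tomcat/webapps/ROOT
touch "$tomcat_dir/WEB-INF/web.xml"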