How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread King Kong
I have fetched about 3 GB of pages with Nutch-0.7.2. Now I want to move the data to Nutch-0.8. How can I do that? Any suggestion is appreciated.

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread Andrzej Bialecki
King Kong wrote: I have fetched about 3 GB of pages with Nutch-0.7.2. Now I want to move the data to Nutch-0.8. How can I do that? Unfortunately, the data is not portable between these versions. The only thing you could do to preserve your webdb is to dump it into a text file and then inject it into a 0.8 crawldb. As for the segments, you will have to refetch them.

How long to get 100 million page

2006-08-23 Thread Bui Quang Hung
Hi, I am planning to create an index of 100 million pages using a back-end machine: a single-processor box with 1 gigabyte of RAM and a 1 terabyte hard disk. Can you tell me how long it will take? Thank you in advance. Regards, B.Q. Hung

Re: index/search filtering by category

2006-08-23 Thread Lourival Júnior
Hi Ernesto! Meta tags are custom tags that you add to your web page, more precisely inside the <head></head> element, to identify the contents of the page to search engine indexes. For example, you can add a meta tag to describe the author of the page, keywords, cache, and so on. What you can
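
For illustration, such tags might look like this in a page's head section (the names and values below are hypothetical):

    <head>
      <meta name="author" content="Jane Doe">
      <meta name="keywords" content="nutch, crawling, search">
      <meta name="description" content="Notes on indexing with Nutch">
    </head>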

RE: How long to get 100 million page

2006-08-23 Thread Dan Morrill
Hi, I found that with a 3 Mbit DSL line I was averaging 8 pages per second with a similar setup; to reach 100 million pages would take about 144 days: 100,000,000 pages / 8 pages per second / 60 seconds per minute / 60 minutes per hour / 24 hours in a day. Just an FYI rule of thumb on a Qwest DSL
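
Spelled out, that back-of-the-envelope estimate is:

    100,000,000 pages / 8 pages per second = 12,500,000 seconds
    12,500,000 seconds / 86,400 seconds per day ≈ 144.7 days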

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread King Kong
It's really sad news for me. I will have to spend a lot of time fetching it all again. However... Andrzej, thanks for your help! Andrzej Bialecki wrote: King Kong wrote: I have fetched about 3 GB of pages with Nutch-0.7.2. Now I want to move the data to Nutch-0.8. How can I do that?

Re: index/search filtering by category

2006-08-23 Thread Ernesto De Santis
Hi Lourival, Thanks, I see, I understand it now. I know meta tags in HTML, but I can't use them, because I want to crawl pages from other sites. I think I will categorize the pages by URL, with regular expressions. Thank you very much! And see you later... ;) Ernesto. Lourival Júnior [EMAIL PROTECTED] wrote:
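
A rough sketch of what such a URL-to-category mapping could look like (the site names, patterns, and category labels are made up for illustration):

    ^http://www\.example\.com/news/.*   ->  category "news"
    ^http://blog\.example\.org/.*       ->  category "blog"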

Re: index/search filtering by category

2006-08-23 Thread Chris Stephens
You probably still want to write a plugin. You can use whatever algorithm you like to identify a site's category, then add that as a field in the index. Ernesto De Santis wrote: Hi Lourival, Thanks, I see, I understand it now. I know meta tags in HTML, but I can't use them, because I want to

File paths with symbolic links in crawl-urlfilter.txt do not work

2006-08-23 Thread Renaud Richardet
Hello List, I am trying to index local files. It seems like Nutch will not accept file paths in crawl-urlfilter.txt and in the root urls file that have symlinks in them. So for me, +^file:/home/ren/src/me/svn/testdata/standardSentence/(.*) works
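
One possible workaround sketch, assuming the trouble is that the fetcher sees the resolved (symlink-free) path while the filter only matches the symlinked one: resolve the link and whitelist the physical path as well. The resolved path below is only a placeholder.

    # resolve the symlinked directory to its physical path
    readlink -f /home/ren/src/me/svn/testdata/standardSentence
    # then also allow the resolved path in crawl-urlfilter.txt, e.g.
    # +^file:/real/physical/path/to/standardSentence/(.*)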

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread Howie Wang
Unfortunately, the data is not portable between these versions. The only thing you could do to preserve your webdb is to dump it into a text file, and then inject into a 0.8 crawldb. As for the segments, you will have to refetch them. Is it just that no migration utility has been written? Is there something about the structures in 0.8 that makes migrating the data impossible, or extremely difficult?

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread Andrzej Bialecki
Howie Wang wrote: Unfortunately, the data is not portable between these versions. The only thing you could do to preserve your webdb is to dump it into a text file, and then inject into a 0.8 crawldb. As for the segments, you will have to refetch them. Is it just that no migration utility has been written? Is there something about the structures in 0.8 that makes migrating the data impossible, or extremely difficult? Hey, these are just bits and bytes on the disk, so nothing is impossible ;)

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread Howie Wang
Is it just that no migration utility has been written? Is there something about the structures in 0.8 that makes migrating the data impossible, or extremely difficult? Hey, these are just bits and bytes on the disk, so nothing is impossible ;) Thanks, Andrzej, it sounds non-trivial :-( For

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread Ken Krugler
It's really sad news for me. I will have to spend a lot of time fetching it all again. If it's just HTML, then you could do a quick hack in 0.8 to fetch the pages from your 0.7 crawl, using a modified fetcher. You wouldn't have all of the header info, but if everything is text/html then you

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread King Kong
you could do a quick hack in 0.8 to fetch the pages from your 0.7 crawl, using a modified fetcher. What do you mean? Do I have to modify the fetcher code myself? Ken Krugler wrote: It's really sad news for me. I will have to spend a lot of time fetching it all again. If it's just

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread King Kong
Andrzej, how can I dump a 0.7 webdb into a text file that can be injected into the 0.8 crawldb? Andrzej Bialecki wrote: King Kong wrote: I have fetched about 3 GB of pages with Nutch-0.7.2. Now I want to move the data to Nutch-0.8. How can I do that? Unfortunately, the data is not

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread Andrzej Bialecki
King Kong wrote: Andrzej, how can I dump a 0.7 webdb into a text file that can be injected into the 0.8 crawldb? bin/nutch readdb webdb -dumppageurl | awk '$1 ~ /^URL:/ {print $2}' > urls.txt -- Best regards, Andrzej Bialecki
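
To finish the migration on the 0.8 side, the dumped URL list would then be injected into a new crawldb. A minimal sketch, assuming the dump is placed in a local directory named urls:

    mkdir urls
    mv urls.txt urls/
    # Nutch 0.8 injector usage: bin/nutch inject <crawldb> <url_dir>
    bin/nutch inject crawldb urls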

Re: How does Nutch-0.7.2 data upgrade to 0.8?

2006-08-23 Thread Ken Krugler
you could do a quick hack in 0.8 to fetch the pages from your 0.7 crawl, using a modified fetcher. What do you mean? Do I have to modify the fetcher code myself? Yes, you'd have to modify the 0.8 fetcher code (or rather create your own plugin) that uses a Nutch 0.7 search setup to

Re: Problem with logging of Fetcher output in 0.8-dev

2006-08-23 Thread e w
Hi Doug, There was a discussion under the subject log4j.properties bug (?) a couple of weeks back. Please check it out. My (temporary) solution was to hardwire the log4j.appender.DRFA.File variable in log4j.properties to hadoop.log, and then all the fetcher output from all tasks gets written there
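
A sketch of that temporary workaround in conf/log4j.properties; the absolute path below is only a placeholder:

    # hardwire the appender's file instead of relying on ${hadoop.log.dir}
    log4j.appender.DRFA.File=/path/to/nutch/logs/hadoop.log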

Re: Problem with logging of Fetcher output in 0.8-dev

2006-08-23 Thread Doug Cook
Hi, Ed, Funny you should choose just now to reply. I just solved the problem on my own system and was about to post what I found. This appears to be related to HADOOP-406: https://issues.apache.org/jira/browse/HADOOP-406 That appears to be why the child JVM fails to inherit hadoop.log.dir,

Re: Problem with logging of Fetcher output in 0.8-dev

2006-08-23 Thread Stefan Groschupf
I don't know if Chris Schneider's patch for HADOOP-406 will prove to be the long-term solution, but it certainly works for me. If you like, please vote for this issue! I also use it in several projects and wonder why it is not yet part of Hadoop. Thanks, Stefan