I have fetched about 3 GB of pages with Nutch-0.7.2.
Now I want to move them to Nutch-0.8. How can I do that?
Any suggestions are appreciated.
--
View this message in context:
http://www.nabble.com/How-does-Nutch-0.7.2-data-upgrade-to-0.8--tf2151013.html#a5940027
Sent from the Nutch - User forum at
King Kong wrote:
I have fetched about 3 GB of pages with Nutch-0.7.2.
Now I want to move them to Nutch-0.8. How can I do that?
Unfortunately, the data is not portable between these versions. The only
thing you could do to preserve your webdb is to dump it into a text
file, and then inject
Hi,
I am planning to create an index of 100 million pages using a single back-end
machine: a single-processor box with 1 GB of RAM and a 1 TB hard disk. Can you
tell me how long it will take?
Thank you in advance.
Regards,
B.Q. Hung
Hi Ernesto!
Meta tags are custom tags that you add to your web page, to be more
exact, inside the head tag, to identify the contents of the
web page to search engine indexes. For example, you can add a meta tag to
describe the author of the page, keywords, cache, and so on. What you can
Hi,
I found that with a 3 Mb DSL line I was averaging 8 pages per second with a
similar setup, so reaching 100 million pages would take about 144 days:
100,000,000 pages / 8 pages per second / 60 seconds per minute / 60 minutes
per hour / 24 hours per day.
Just an FYI rule of thumb on a Qwest DSL line.
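As a sanity check, the arithmetic above can be run directly (same figures as in the post):

```shell
# Estimated fetch time: 100 million pages at 8 pages/second,
# 86,400 seconds per day.
awk 'BEGIN { printf "%.1f\n", 100000000 / 8 / 86400 }'   # prints 144.7 (days)
```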
That's really sad news for me. I'll have to spend a lot of time fetching it
again.
However...
Andrzej, thanks for your help!
Andrzej Bialecki wrote:
King Kong wrote:
I have fetched about 3 GB of pages with Nutch-0.7.2.
Now I want to move them to Nutch-0.8. How can I do that?
Hi Lourival,
Thanks, I see, I understand it now. I know about meta tags in HTML, but I can't
use them, because I want to crawl pages from other sites. I think I'll
categorize the pages by URL, with regular expressions.
Thank you very much! See you later...
;)
Ernesto.
Lourival Júnior [EMAIL PROTECTED] wrote:
You probably still want to write a plugin. You can use whatever
algorithms you like to identify a site's category, then add that as a
field in the index.
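A minimal sketch of the URL-based categorization Ernesto describes — the patterns and category names here are invented for illustration, and a real Nutch plugin would implement the same matching logic in Java as an indexing filter:

```shell
# Map a URL to a category by pattern matching; patterns are hypothetical.
categorize() {
    case "$1" in
        */news/*|*/noticias/*) echo "news"  ;;
        */blog/*)              echo "blog"  ;;
        */forum/*|*/board/*)   echo "forum" ;;
        *)                     echo "other" ;;
    esac
}

categorize "http://example.com/blog/2006/08/post.html"   # prints: blog
categorize "http://example.com/index.html"               # prints: other
```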
Ernesto De Santis wrote:
Hi Lourival
Thanks, I see, I understand it now. I know about meta tags in HTML, but I can't
use them, because I want to
Hello List,
I am trying to index local files. It seems like Nutch will not accept
file paths in crawl-urlfilter.txt and in the root URLs file that have
symlinks in them. So for me,
+^file:/home/ren/src/me/svn/testdata/standardSentence/(.*)
-- works
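One workaround, assuming a Linux `readlink` that supports `-f`, is to resolve the symlinks before writing the filter entry — the paths below are throwaway examples, not the poster's real directories:

```shell
# Nutch compares the literal path, so resolve any symlinks first.
mkdir -p /tmp/crawltest/real
ln -sfn /tmp/crawltest/real /tmp/crawltest/link

real=$(readlink -f /tmp/crawltest/link)
echo "+^file:${real}/(.*)"     # the entry to put in crawl-urlfilter.txt
```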
Unfortunately, the data is not portable between these versions. The only
thing you could do to preserve your webdb is to dump it into a text file,
and then inject into a 0.8 crawldb. As for the segments, you will have to
refetch them.
Is it just that no migration utility has been written? Is
Howie Wang wrote:
Unfortunately, the data is not portable between these versions. The
only thing you could do to preserve your webdb is to dump it into a
text file, and then inject into a 0.8 crawldb. As for the segments,
you will have to refetch them.
Is it just that no migration utility
Is it just that no migration utility has been written? Is there something
about the structures in 0.8 that make migrating the data impossible,
or extremely difficult?
Hey, these are just bits and bytes on the disk, so nothing is impossible ;)
Thanks, Andrzej, it sounds non-trivial :-( For
That's really sad news for me. I'll have to spend a lot of time fetching it
again.
If it's only just HTML, then you could do a quick hack in 0.8 to
fetch the pages from your 0.7 crawl, using a modified fetcher. You
wouldn't have all of the header info, but if everything is text/html
then you
you could do a quick hack in 0.8 to
fetch the pages from your 0.7 crawl, using a modified fetcher.
what do you mean? Do I have to modify the fetcher code by myself ?
Ken Krugler wrote:
That's really sad news for me. I'll have to spend a lot of time fetching it
again.
If it's only just
Andrzej, how can I dump a 0.7 webdb into a text file that I could
inject into the 0.8 crawldb?
Andrzej Bialecki wrote:
King Kong wrote:
I have fetched about 3 GB of pages with Nutch-0.7.2.
Now I want to move them to Nutch-0.8. How can I do that?
Unfortunately, the data is not
King Kong wrote:
Andrzej, how can I dump a 0.7 webdb into a text file that I could
inject into the 0.8 crawldb?
bin/nutch readdb webdb -dumppageurl | awk '$1 ~ /^URL:/ {print $2}' > urls.txt
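Putting the two steps together — dump the 0.7 webdb URLs to a text file, then inject them into a fresh 0.8 crawldb. The paths are examples, and `bin/nutch` is assumed to be the 0.7 binary for the dump and the 0.8 binary for the inject:

```shell
# Step 1 (Nutch 0.7): dump the webdb and keep only the URLs.
mkdir -p urls
bin/nutch readdb webdb -dumppageurl | awk '$1 ~ /^URL:/ {print $2}' > urls/urls.txt

# Step 2 (Nutch 0.8): inject the URL list into a new crawldb.
bin/nutch inject crawl/crawldb urls
```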
--
Best regards,
Andrzej Bialecki
you could do a quick hack in 0.8 to
fetch the pages from your 0.7 crawl, using a modified fetcher.
what do you mean? Do I have to modify the fetcher code by myself ?
Yes, you'd have to modify the 0.8 fetcher code (or rather create your
own plug-in) that uses a Nutch 0.7 search setup to
Hi Doug,
There was a discussion under the subject "log4j.properties bug (?)" a couple
of weeks back. Please check it out. My (temporary) solution was to hardwire
the log4j.appender.DRFA.File variable in log4j.properties to hadoop.log,
and then all the fetcher output from all tasks gets written there.
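For reference, the workaround amounts to replacing the `${hadoop.log.dir}`-based setting in conf/log4j.properties with a fixed path — the exact value is an example, not a quote from the original patch:

```properties
# Hardwire the log file instead of using ${hadoop.log.dir}, so child
# task JVMs that fail to inherit hadoop.log.dir still log to one place.
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=logs/hadoop.log
```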
Hi, Ed,
Funny you should choose just now to reply. I just solved the problem on my
own system and was about to post what I found.
This appears to be related to HADOOP-406:
https://issues.apache.org/jira/browse/HADOOP-406
That appears to be why the child JVM fails to inherit hadoop.log.dir.
I don't know if Chris Schneider's patch for HADOOP-406 will prove
to be the
long-term solution, but it certainly works for me.
If you like, please vote for this issue! I also use it in several
projects and wonder why it is not yet part of Hadoop.
Thanks.
Stefan