Re: newbie questions
Hello Brian,

You're getting a response from another newbie here, so I could be wrong (do excuse me if I am). If you are attempting to run a search index from the filesystem, you need to have the following in your nutch-site.xml:

  <property>
    <name>fs.default.name</name>
    <value>file:</value>
  </property>

The fs.default.name property is required in nutch-site.xml when you build your .war file for deployment to Tomcat. It should be accompanied by the config below, which should point to the directory your index has been copied to; in my case it looks something like this:

  <property>
    <name>searcher.dir</name>
    <value>/home/nutch/nutch/service/crawl/</value>
    <description>
      Path to root of crawl. This directory is searched (in order) for
      either the file search-servers.txt, containing a list of distributed
      search servers, or the directory "index" containing merged indexes,
      or the directory "segments" containing segment indexes.
    </description>
  </property>

Regarding your second question:

  bin/nutch readdb yourcrawldir/crawldb -dump <output_dir> -format csv

gives you a nice flat-file serialisation of your crawl database (note that -dump expects an output directory).

I hope this helps,

Mischa

On 1 Dec 2009, at 08:44, brian wrote:
> also, I would like to know how to extract flat text files of the crawl data.

___
Mischa Tuffield
Email: mischa.tuffi...@garlik.com
Homepage: http://mmt.me.uk/
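P.S. A fuller example of that dump step, with an explicit output directory (the paths here are just from my setup, adjust to yours):

  bin/nutch readdb /home/nutch/nutch/service/crawl/crawldb -dump crawldb_dump -format csv
  head crawldb_dump/part-00000

The dump is written as Hadoop part files inside the output directory.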
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi there,

On Jan 29, 2008 5:23 PM, Vinci [EMAIL PROTECTED] wrote:
> Hi, thank you :) One more question about reading the fetched pages: I'd prefer to be able to dump each fetched page into a single html file.

You could modify the Fetcher class (org.apache.nutch.fetcher.Fetcher) to create a separate file for each downloaded page, or you could modify the SegmentReader class (org.apache.nutch.segment.SegmentReader) if you want to do that after fetching.

> No other way besides inverting the inverted file?

The index is not inverted if you use the readseg command. The fetched content (e.g. html pages) is stored in the crawl/segments folder. The Lucene index is stored in crawl/indexes, and is only created after all crawling has finished. The readseg command (SegmentReader class) only accesses crawl/segments, so the Lucene index is not touched. (Lucene index == the inverted index.)

Best Regards,

Martin
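A concrete example of that readseg dump (the segment timestamp below is made up; pick a real one from your crawl/segments directory):

  bin/nutch readseg -dump crawl/segments/20080129101112 segdump
  less segdump/dump

Add flags such as -nofetch, -nogenerate, -noparse, -noparsedata or -noparsetext if you only want part of the data written out.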
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi,

Thank you :) One more question about reading the fetched pages: I'd prefer to be able to dump each fetched page into a single html file. Is there no other way besides inverting the inverted file?

Martin Kuen wrote:
> If you are looking for a way to programmatically read the fetched content (e.g. html pages), have a look at the IndexReader class. If you are looking for a way to dump the whole downloaded content to a text file, or want to see some statistical information about it, try the readseg command. Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi,

On Jan 29, 2008 11:11 AM, Vinci [EMAIL PROTECTED] wrote:
> Hi, I am new to nutch and I am trying to run nutch to fetch something from specific websites. Currently I am running 0.9. As I have limited resources, I don't want nutch to be too aggressive, so I want to set some delay, but I am confused by the value of http.max.delays: does it use milliseconds instead of seconds? (Some people said it is 3 seconds by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9.)

http.max.delays doesn't specify a timespan - read the description more carefully. I think fetcher.server.delay is what you are looking for. It is the minimum amount of time the fetcher will wait before fetching another url from the same host. Keep in mind that the fetcher obeys robots.txt files (by default) - so if a robots.txt file is present, crawling will be polite enough.

> Also, I need to read the fetched pages so that I can do some modification on the html structure for future parsing. Where are the files located? Are they stored as pure html, or are they broken down into multiple files? If they are not html files, how can I read the fetched pages?

If you are looking for a way to programmatically read the fetched content (e.g. html pages), have a look at the IndexReader class. If you are looking for a way to dump the whole downloaded content to a text file, or want to see some statistical information about it, try the readseg command. Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions

> And will the cached page lose all the original html attributes when it is viewed as a cached page?

The page will be stored character by character, including html tags.

> Also, how can I read the links that nutch found, and how can I control the crawling sequence? (Change it to breadth-first search at the top level, then depth-first one by one.)

Crawling always occurs breadth-first. If you want fine-grained control over the crawling sequence, you should follow the procedure in the nutch tutorial for whole-internet crawling. Nevertheless the crawling occurs breadth-first.

> Sorry for many questions.

HTH,

Martin

PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . . (nice semester abroad . . . hehe ;)
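For reference, the delay Martin mentions is set in nutch-site.xml like this (the value is a number of seconds; 5.0 is the usual default, shown here only as an example):

  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>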
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi,

Thank you :) It seems I need to write a Java program to write out the files and do the transformation.

Another question about the dumped linkdb: I find escaped html at the end of the links. Is that the fault of the parser? (The html is most likely not valid, but I really don't need that chunk of invalid code.) If I want to change the link parser, what do I need to do? (I'd prefer to change it via a plugin.)

Martin Kuen wrote:
> You could modify the Fetcher class (org.apache.nutch.fetcher.Fetcher) to create a separate file for each downloaded page, or you could modify the SegmentReader class (org.apache.nutch.segment.SegmentReader) if you want to do that after fetching.
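For context, the linkdb dump referred to above can be produced with the readlinkdb command, e.g. (paths are examples):

  bin/nutch readlinkdb crawl/linkdb -dump linkdump
  less linkdump/part-00000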
Re: Newbie questions about followed links
Sir:

On 08/03/07, Jeroen Verhagen [EMAIL PROTECTED] wrote:
> Surely these links look ordinary enough to be seen and followed by nutch? Could someone please tell me what could be causing these links not to be followed?

conf/urlfilter.txt.template contains the line:

  -[?*!@=]

Remove the '?' and the links will be followed.

--
Cheers,
Hasan Diwan
[EMAIL PROTECTED]
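For anyone finding this later, the edited line would then look like this (assuming the stock filter file; the comment is the one that precedes it in the default config):

  # skip URLs containing certain characters as probable queries, etc.
  -[*!@=]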
Re: Newbie questions about followed links
exactly what I was going to say!

Cheers,
Paul

On 3/8/07, Hasan Diwan [EMAIL PROTECTED] wrote:
> conf/urlfilter.txt.template contains the line:
>
>   -[?*!@=]
>
> Remove the '?' and the links will be followed.
Re: Newbie questions about followed links
Hi Hasan,

On 3/8/07, Hasan Diwan [EMAIL PROTECTED] wrote:
> conf/urlfilter.txt.template contains the line:
>
>   -[?*!@=]
>
> Remove the '?' and the links will be followed.

Thanks, that made it work. I had to comment out the whole line '-[?*!@=]' to make it work, though. Even though there doesn't seem to be an @ character in the links, for example?

--
regards,
Jeroen
Re: Newbie questions
Hi Vacuum,

I hope the nutch wiki will help you much :) http://wiki.apache.org/nutch/

Regards
/Jack

On 7/6/05, Vacuum Joe [EMAIL PROTECTED] wrote:
> Hello Nutch-gurus,
>
> I have some very straightforward and yet totally newbie questions which I hope some kind person would answer.
>
> First of all, what is a db? It seems like I have to inject links into the db to get the process started. So the links are in the db, and then I run fetch on them. That brings me to the next question: what's a segment? I notice that it creates timestamped segment directories. What's in them? Does the running Nutch web application automatically pick up new segment files when they are added, or do I have to restart it?
>
> I'm trying to figure this out because I want to get started with automated crawling, so I'll have one or two machines crawling all the time, and then have a cluster of web server machines. I assume that the web server front-end machines need the segments and the crawlers need the db, but I'm not sure exactly what the functions of these are.
>
> Thanks for your help and thanks for the awesome piece of software. Hopefully as we do some work on it, we'll have some code to return to the source.
Re: Newbie questions
> I hope nutch wiki will help you much :) http://wiki.apache.org/nutch/

Hello Jack,

Yes, I have been reading it. The db file contains a database of all the link structure and pages of the web. But what is a segment in this case? I assume a segment contains page content? And then there is the updatedb command, which takes the newly-discovered links in a segment and puts them back in the db, so the new links can be followed in the next segment the next time there is a crawl?

I am more confused about segments than I am about dbs, I guess. Do I need to keep old segments after generating a new one?
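For what it's worth, a rough sketch of the cycle that produces the db and the segments (the commands below follow the later 0.8+ whole-web tutorial; the 0.7-era WebDB commands differ slightly):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s1=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s1
  bin/nutch updatedb crawl/crawldb $s1

Repeating generate / fetch / updatedb follows the newly discovered links in the next round.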