Re: newbie questions

2009-12-01 Thread Mischa Tuffield
Hello Brian, 

Getting a response from another newbie here, so I could be wrong (do excuse
me if I am).

If you are attempting to run a search index from the filesystem, you need to
have the following in your nutch-site.xml:

  <property>
    <name>fs.default.name</name>
    <value>file:</value>
  </property>

The fs.default.name property is required in nutch-site.xml when you build your .war 
file for deployment to Tomcat. It should be accompanied by the config below, 
which should point to the directory your index has been copied to; in my 
case it looks something like this:

  <property>
    <name>searcher.dir</name>
    <value>/home/nutch/nutch/service/crawl</value>
    <description>
    Path to root of crawl.  This directory is searched (in
    order) for either the file search-servers.txt, containing a list of
    distributed search servers, or the directory index containing
    merged indexes, or the directory segments containing segment
    indexes.
    </description>
  </property>
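
After editing nutch-site.xml you would rebuild and redeploy the webapp; going
from memory, something along these lines (the war file name and the Tomcat
path are just examples):

  ant war
  cp build/nutch-*.war /usr/local/tomcat/webapps/ROOT.war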

Regarding your second question:

bin/nutch readdb yourcrawldir/crawldb -dump yourdumpdir -format csv

gives you a nice flat-file serialisation of your crawl database.
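
If I remember correctly, the dump ends up as Hadoop part files under the
output directory you name (e.g. yourdumpdir/part-00000), so you can simply
cat those files to get the csv. The directory names above are only examples.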

I hope this helps, 

Mischa
On 1 Dec 2009, at 08:44, brian wrote:

 also, I would like to know how to extract flat text files of the crawl data.

___
Mischa Tuffield
Email: mischa.tuffi...@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD



Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Martin Kuen
Hi there,

On Jan 29, 2008 5:23 PM, Vinci [EMAIL PROTECTED] wrote:


 Hi,

 Thank you :)
 One more question about reading the fetched pages: I would prefer to be able
 to dump each fetched page out as a plain html file.

You could modify the Fetcher class (org.apache.nutch.fetcher.Fetcher) to
create a separate file for each downloaded page, or you could modify the
SegmentReader class (org.apache.nutch.segment.SegmentReader) to do the same
from the already-fetched segments.

 Is there no other way than to invert the inverted file?

You do not have to touch the inverted index if you use the readseg command.
The fetched content (e.g. html pages) is stored in the crawl/segments folder,
while the lucene index - the inverted index - is stored in crawl/indexes and
is only created after all crawling has finished. The readseg command
(SegmentReader class) only accesses crawl/segments, so the lucene index is
never read.
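
Just to make this concrete, here is a rough, untested sketch of a small
standalone class that reads the content part of one segment and writes every
fetched page out as its own .html file. Class names are from Nutch 0.9 as far
as I remember, and the content/part-00000 path assumes a local, single-reducer
crawl, so treat the details as assumptions rather than a finished tool:

import java.io.File;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

/** Dumps every fetched page in one segment to its own .html file. */
public class SegmentHtmlDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // the content part of a segment is a MapFile; its data file is a SequenceFile
    Path data = new Path(args[0], "content/part-00000/data");
    File outDir = new File(args[1]);
    outDir.mkdirs();

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    int i = 0;
    while (reader.next(url, content)) {
      // build a (very naive) file name from the url
      String name = i++ + "_" + url.toString().replaceAll("[^A-Za-z0-9.]", "_") + ".html";
      FileOutputStream out = new FileOutputStream(new File(outDir, name));
      out.write(content.getContent());   // the raw bytes as fetched, html tags included
      out.close();
    }
    reader.close();
  }
}

You would run it with the segment directory and an output directory as
arguments (with the nutch, hadoop and lucene jars on the classpath), e.g.
java SegmentHtmlDump crawl/segments/20080129123456 dumped-pages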

Best Regards,

Martin







Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Vinci

Hi,

Thank you :)
One more question about reading the fetched pages: I would prefer to be able
to dump each fetched page out as a plain html file. Is there no other way
than to invert the inverted file?





Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Martin Kuen
Hi,

On Jan 29, 2008 11:11 AM, Vinci [EMAIL PROTECTED] wrote:


 Hi,

 I am new to nutch and I am trying to run nutch to fetch something from
 specific websites. Currently I am running 0.9.

 As I have limited resources, I don't want nutch to be too aggressive, so I
 want to set some delay, but I am confused by the value of http.max.delays:
 does it use milliseconds instead of seconds? (Some people said it is 3
 seconds by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9)


http.max.delays doesn't specify a timespan - read the description more
carefully. I think fetcher.server.delay is what you are looking for. It is
the minimum amount of time the fetcher will wait before it fetches another
url from the same host. Keep in mind that the fetcher obeys robots.txt files
(by default) - so if a robots.txt file is present the crawling will be
polite enough.
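
For example, putting something like this in nutch-site.xml should slow the
fetcher down to roughly ten seconds between requests to the same host (the
value is in seconds; 10.0 is only an example, and if I remember correctly
the default is 5.0):

  <property>
    <name>fetcher.server.delay</name>
    <value>10.0</value>
  </property>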


 Also, I need to read the fetched pages so that I can do some modification
 of the html structure for future parsing. Where are the files located? Are
 they stored as pure html or are they broken down into multiple files? If
 they are not html files, how can I read the fetched pages?


If you are looking for a way to programmatically read the fetched content
(e.g. html pages) have a look at the IndexReader class.
If you are looking for a way to dump the whole downloaded content to a text
file, or want to see some statistical information about it, try the readseg
command.
Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions
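
For example (the segment name and output directory are just examples), a
command like

bin/nutch readseg -dump crawl/segments/20080129123456 segdump

should leave a plain text file called dump inside the segdump directory,
with one record per fetched url.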


 And will the cached page lose all the original html attributes when it is
 viewed as a cached page?

The page will be stored character by character, including html tags.
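
If you only want to look at a single stored page, something like

bin/nutch readseg -get crawl/segments/20080129123456 http://www.example.com/

should print the stored record for that url, raw html included (again, the
segment name and the url are only examples).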


 Also, how can I read the links that nutch found, and how can I control the
 crawling sequence? (change it to breadth-first search at the top level, then
 depth-first one by one)

Crawling always occurs breadth-first. If you want fine-grained control over
the crawling sequence you should follow the procedure in the nutch tutorial
for whole internet crawling, sketched roughly below. Nevertheless the
crawling still occurs breadth-first.
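
One round of that cycle looks roughly like this (directory names and -topN
are only examples, check the tutorial for the exact options):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s

You then repeat the generate/fetch/updatedb steps for each further round
(inject is only needed once, to seed the crawldb).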


 Sorry for many questions.


HTH,

Martin

PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . .
(nice semester abroad . . . hehe ;)






Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Vinci

Hi,

Thank you. :)
It seems I need to write a Java program to write out the files and do the
transformation.
Another question about the dumped linkdb: I find escaped html appearing at
the end of the links. Is this the fault of the parser? (The html is most
likely not valid, but I really don't need that chunk of invalid code.)
If I want to change the link parser, what do I need to do (I would especially
prefer to change it via a plugin)?





Re: Newbie questions about followed links

2007-03-08 Thread Hasan Diwan

Sir:
On 08/03/07, Jeroen Verhagen [EMAIL PROTECTED] wrote:

Surely these links look ordinary enough to be seen and followed by
nutch? Could someone please tell me what could be causing these links
not to be followed?


conf/urlfilter.txt.template contains the line:
-[?*!@=]

Remove the '?' and the links will be followed.
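
Assuming the stock filter line, the edit would look something like:

# before
-[?*!@=]
# after (allow urls containing '?')
-[*!@=]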

--
Cheers,
Hasan Diwan [EMAIL PROTECTED]


Re: Newbie questions about followed links

2007-03-08 Thread Paul Liddelow

exactly what I was going to say!

Cheers
Paul

On 3/8/07, Hasan Diwan [EMAIL PROTECTED] wrote:


Sir:
On 08/03/07, Jeroen Verhagen [EMAIL PROTECTED] wrote:
 Surely these links look ordinary enough to be seen and followed by
 nutch? Could someone please tell me what could be causing these links
 not to be followed?

conf/urlfilter.txt.template contains the line:
-[?*!@=]

Remove the '?' and the links will be followed.

--
Cheers,
Hasan Diwan [EMAIL PROTECTED]



Re: Newbie questions about followed links

2007-03-08 Thread Jeroen Verhagen

Hi Hasan,

On 3/8/07, Hasan Diwan [EMAIL PROTECTED] wrote:


conf/urlfilter.txt.template contains the line:
-[?*!@=]

Remove the '?' and the links will be followed.


Thanks, that made it work.

I had to comment out the whole line '-[?*!@=]' to make it work though? Even
though there does not seem to be an @ character in the links, for example?

--

regards,

Jeroen


Re: Newbie questions

2005-07-05 Thread Jack Tang
Hi Vacuum

I hope the nutch wiki will help you :)
http://wiki.apache.org/nutch/


Regards
/Jack

On 7/6/05, Vacuum Joe [EMAIL PROTECTED] wrote:
 Hello Nutch-gurus,
 
 I have some very straightforward and yet totally
 newbie questions which I hope some kind person would
 answer.
 
 First of all, what is a db?  It seems like I have to
 inject links into the db to get the process started.
 So the links are in the db, and then I run fetch on
 them.  That brings me to the next question: what's a
 segment?  I notice that it creates timestamped segment
 directories.  What's in them?  Does the running Nutch
 web application automatically pick up new segment
 files when they are added, or do I have to restart it?
 
 I'm trying to figure this out because I want to get
 started with automated crawling, so I'll have one or
 two machines crawling all the time, and then have a
 cluster of web server machines.  I assume that the web
 server front-end machines need the segments and the
 crawlers need the db, but I'm not sure exactly what
 the functions of these are.
 
 Thanks for your help and thanks for the awesome piece
 of software.  Hopefully as we do some work on it,
 we'll have some code to return to the source.
 
 



Re: Newbie questions

2005-07-05 Thread Vacuum Joe
 I hope the nutch wiki will help you :)
 http://wiki.apache.org/nutch/

Hello Jack,

Yes, I have been reading it.  The db file contains a
database of all the link structure and pages of the
web.  But what is a segment in this case?  I assume a
segment contains page content?  And then there is the
updatedb command which takes the newly-discovered
links in a segment and puts them back in the db, so
the new links can be followed in the next segment the
next time there is a crawl?

I am more confused about segments than I am about dbs,
I guess.  Do I need to keep old segments after
generating a new one?

