Re: crawl time for depth param 50 and topN not passed

2013-04-06 Thread David Philip
Hi Sebastian,

   Yes, it's taking 2-3 days. OK, I will consider increasing the depth
incrementally and checking the stats at every step. Thanks.
Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
and have removed +.
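
As a sketch, the tail of regex-urlfilter.txt would then look like this (with
the catch-all removed, URLs that match no rule at all are dropped by the
filter):

  # accept only the target host (replaces the old catch-all "+.")
  +^http://([a-z0-9]*\.)*spicemobiles.co.in/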

What should the depth be for the next recrawl? I mean this: say I built the
crawldb with a depth param of 5 and topN of 10. Now I find that 3-4 URLs were
deleted and 4 were modified, but I don't know which URLs those are. So what I
am doing is re-initiating the crawl. What depth param should I give at this
point?

Thanks - David



On Sat, Apr 6, 2013 at 12:54 AM, Sebastian Nagel wastl.na...@googlemail.com wrote:

 Hi David,

   What can the crawl time be for a very big site, given a depth param of 50,
   topN left at its default (not passed), and the fetch interval as 2 mins?
 AFAIK, the default of topN is Long.MAX_VALUE, which is very large.
 So the size of the crawl is mainly limited by the number of links you get.
 Anyway, a depth of 50 is a high value, and with a delay of 2 min. (which is
 very defensive) your crawl will take a long time.

 Try to start with small values for depth and topN, e.g. 3 and 50.
 Then look at your crawlDb statistics (bin/nutch readdb ... -stats)
 and check how the numbers of fetched/unfetched/gone/etc. URLs increase
 to get a feeling for which values make sense for your crawl.

  Case: Crawling website spicemobilephones.co.in, and in the
  regex-urlfilter.txt – added +^ http://(a-z 0-9)spicemobilephones.co.in.
 This doesn't look like a valid Java regex.
 Did you remove these lines:
   # accept anything else
   +.

 Sebastian
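
A minimal sketch of the incremental approach Sebastian suggests, assuming the
seed list lives in urls/ and the crawl output goes to crawl/ (both placeholder
paths):

  # one shallow round with small limits (Nutch 1.x one-shot crawl command)
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50
  # inspect the crawldb counters (db_fetched, db_unfetched, db_gone, ...)
  bin/nutch readdb crawl/crawldb -stats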



Re: crawl time for depth param 50 and topN not passed

2013-04-06 Thread Tejas Patil
On Sat, Apr 6, 2013 at 3:31 AM, David Philip davidphilipshe...@gmail.com wrote:

 Hi Sebastian,

    Yes, it's taking 2-3 days. OK, I will consider increasing the depth
 incrementally and checking the stats at every step. Thanks.
 Yes, I have given it like this: +^http://([a-z0-9]*\.)*spicemobiles.co.in/
 and have removed +.

 What should the depth be for the next recrawl? I mean this: say I built the
 crawldb with a depth param of 5 and topN of 10. Now I find that 3-4 URLs were
 deleted and 4 were modified, but I don't know which URLs those are. So what I
 am doing is re-initiating the crawl. What depth param should I give at this
 point?

Once those urls enter the crawldb, the crawler won't need to reach them from
their parent page again; it has already stored those urls in its crawldb /
webtable. A re-crawl interval is maintained for each url (by default set to
30 days), and the crawler won't pick a url for fetching if its fetch interval
hasn't elapsed since the last time the url was fetched. The crawl interval
can be configured using the db.fetch.interval.default property in
nutch-site.xml.
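
As a sketch, that property goes into nutch-site.xml like this (the value is
in seconds; 2592000 is the 30-day default, so use a smaller number to
re-fetch sooner):

  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
  </property>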


 Thanks - David






Setting up nutch 1.6 with Solr 4.2

2013-04-06 Thread Amit Sela
Hi all,

I have nutch 1.6 set up and running with Solr 3.6.2, and I'm trying to
upgrade to Solr 4.2, but I'm missing something...

I re-built nutch with schema-solr4.xml as schema.xml and copied
schema-solr4.xml to Solr's example/collection1/conf/schema.xml.

The index phase keeps failing, throwing errors about the unknown fields
host and metatag.description (metatags worked just fine with 3.6.2).

What else am I missing?

Thanks.
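
For reference, the fields the indexer complains about have to be declared in
the Solr 4 schema; a minimal sketch, assuming plain string types (the stock
schema-solr4.xml shipped with Nutch may declare them differently, and
metatag.* fields are only needed when the metatags plugin is enabled):

  <field name="host" type="string" stored="false" indexed="true"/>
  <field name="metatag.description" type="string" stored="true" indexed="true"/>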


Nutch

2013-04-06 Thread Parin Jogani
Hi,
Is there any way to apply one urlfilter for levels 1-5 and a different one
from level 5 onwards? I need to extract PDF files, which appear only after a
given level (just to experiment).
After that, I believe the PDF files will be stored in a compressed binary
format in the crawl/segments folder. I would like to extract these PDF files
and store them all in one folder. (I guess that, since Nutch uses MapReduce
and segments the data, I will need to use the Hadoop API present by default
in the lib folder. I cannot find more tutorials on this except allenday
http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html
).

PJ


Re: Nutch

2013-04-06 Thread Tejas Patil
On Sat, Apr 6, 2013 at 9:58 AM, Parin Jogani ppjog...@usc.edu wrote:

 Hi,
 Is there any way to apply one urlfilter for levels 1-5 and a different one
 from level 5 onwards? I need to extract PDF files, which appear only after a
 given level (just to experiment).

You can run two crawls over the same crawldb using different urlfilter files.
The first one would reject PDF files and be executed up to the depth just
before you discover PDFs. For the later crawl, modify the regex rule to
accept PDF files.
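
A sketch of how the two filter variants could differ (everything else in
regex-urlfilter.txt stays the same between the runs):

  # first crawl: reject PDF links until the desired depth is reached
  -(?i)\.pdf$
  # later crawl: replace the rule above with an accept rule
  +(?i)\.pdf$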


 After that, I believe the PDF files will be stored in a compressed binary
 format in the crawl/segments folder. I would like to extract these PDF files
 and store them all in one folder. (I guess that, since Nutch uses MapReduce
 and segments the data, I will need to use the Hadoop API present by default
 in the lib folder. I cannot find more tutorials on this except allenday
 http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html
 ).

I had a peek at the link that you gave, and it seems like that code snippet
should work. It's an old article (from 2010), so it might happen that some
classes have been replaced with new ones. If you face any issues, please
feel free to shoot an email to us!
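
For reference, a minimal sketch of that idea against the Nutch 1.x segment
layout, using the old Hadoop SequenceFile API on a local (non-HDFS) crawl and
assuming org.apache.nutch.protocol.Content is the value class of the
segment's content directory; the segment path and output directory are
placeholders, and error handling is omitted:

  import java.io.FileOutputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;

  public class PdfDumper {
    public static void main(String[] args) throws Exception {
      // args[0]: a segment content data file, e.g.
      //          crawl/segments/20130406.../content/part-00000/data
      // args[1]: existing directory where the extracted PDFs are written
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path(args[0]), conf);
      Text url = new Text();           // key: the fetched URL
      Content content = new Content(); // value: raw fetched bytes + headers
      int n = 0;
      while (reader.next(url, content)) {
        if ("application/pdf".equals(content.getContentType())) {
          // dump the raw bytes of each PDF record to its own file
          FileOutputStream out =
              new FileOutputStream(args[1] + "/" + (n++) + ".pdf");
          out.write(content.getContent());
          out.close();
        }
      }
      reader.close();
    }
  }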


 PJ



encode special characters in url

2013-04-06 Thread Jun Zhou
Hi all,

I'm using nutch 1.6 to crawl a web site which has lots of special
characters in the URLs, like ?, =, @ etc.  For each character, I can add a
regex in regex-normalize.xml to change it into its percent encoding.
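
A sketch of one such per-character rule, using the <regex>/<pattern>/
<substitution> entries that regex-normalize.xml is made of (here the @
character, encoded as %40):

  <regex>
    <pattern>@</pattern>
    <substitution>%40</substitution>
  </regex>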

My question is: is there an easier way to do this? Like a URL-encode method
to encode all the special characters rather than adding regexes one by one?

Thanks!