Change of analyzer for specific language

2008-03-15 Thread Vinci

Hi all,

How can I change the analyzer that the indexer uses for a specific
language? Also, can I use all of the analyzers that I see in Luke?

Thank you.
-- 
View this message in context: 
http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16065385.html
Sent from the Nutch - User mailing list archive at Nabble.com.



FW: Problem in running Nutch where proxy authentication is required.

2008-03-15 Thread naveen.goswami

 Hi Susam,

I have mailed the list twice, but the mails bounced back with the
following message:

ezmlm-reject: fatal: Sorry, I don't accept messages larger than 10
bytes (#5.2.3)


Thanks & Regards,
Naveen Goswami

-Original Message-
From: Naveen Goswami (WT01 - E-ENABLING)
Sent: Saturday, March 15, 2008 5:01 PM
To: '[EMAIL PROTECTED]'
Cc: 'nutch-user@lucene.apache.org'
Subject: RE: Problem in running Nutch where proxy authentication is
required.

Hi Susam,


Thanks for the help. Yeah I have got your earlier mail.
I have followed all the steps given by you.
I am attaching the hadoop.log and crawl.log for your reference.

I used the command below to run the crawl:
 bin/nutch crawl urls -dir crawl -depth 1 -threads 1 > crawl.log

Please tell me what the problem is.

Thanks & Regards,
Naveen Goswami
91 9899547886

-Original Message-
From: Susam Pal [mailto:[EMAIL PROTECTED]
Sent: Friday, March 14, 2008 11:12 PM
To: [EMAIL PROTECTED]
Subject: Re: Problem in running Nutch where proxy authentication is
required.

I still can't see any DEBUG logs in your log file. Did you go through my
earlier mail?

Regards,
Susam Pal

On Wed, Mar 12, 2008 at 9:39 PM,  [EMAIL PROTECTED] wrote:

 Hi All,

 I am facing a problem in running Nutch where proxy authentication is
 required to crawl the site (e.g. google.com, yahoo.com). I am able to
 crawl the sites which do not require proxy authentication from our
 domain (e.g. abc.com); it successfully creates a crawl folder and 5
 subfolders.
 I have put all the values in conf/nutch-site.xml &
 conf/nutch-default.xml as given.
 I have listed below all the entries which I modified to run Nutch
 (e.g. settings in urls/urls.txt, conf/crawl-urlfilter.txt,
 conf/nutch-site.xml, conf/nutch-default.xml). I have also given the
 crawl.log text for your reference.

 While crawling through Cygwin, it gives an exception. (Please help me
 out: what do I have to do to run Nutch successfully? Where do I have
 to put an entry to pass through proxy authentication?)

 Dedup: starting
 Dedup: adding indexes in: crawl/indexes
 Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)


  ===crawl.log===

  crawl started in: crawl
  rootUrlDir = urls
  threads = 10
  depth = 3
  topN = 50
  Injector: starting
  Injector: crawlDb: crawl/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Injector: Merging injected urls into crawl db.
  Injector: done
  Generator: Selecting best-scoring urls due for fetch.
  Generator: starting
  Generator: segment: crawl/segments/20080109122052
  Generator: filtering: false
  Generator: topN: 50
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: Partitioning selected urls by host, for politeness.
  Generator: done.
  Fetcher: starting
  Fetcher: segment: crawl/segments/20080109122052
  Fetcher: threads: 10
  fetching http://www.yahoo.com/
  fetch of http://www.yahoo.com/ failed with:
  Http code=407, url=http://www.yahoo.com/
  Fetcher: done
  CrawlDb update: starting
  CrawlDb update: db: crawl/crawldb
  CrawlDb update: segments: [crawl/segments/20080109122052]
  CrawlDb update: additions allowed: true
  CrawlDb update: URL normalizing: true
  CrawlDb update: URL filtering: true
  CrawlDb update: Merging segment data into db.
  CrawlDb update: done
  Generator: Selecting best-scoring urls due for fetch.
  Generator: starting
  Generator: segment: crawl/segments/20080109122101
  Generator: filtering: false
  Generator: topN: 50
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: Partitioning selected urls by host, for politeness.
  Generator: done.
  Fetcher: starting
  Fetcher: segment: crawl/segments/20080109122101
  Fetcher: threads: 10
  fetching http://www.yahoo.com/
  fetch of http://www.yahoo.com/ failed with:
  Http code=407, url=http://www.yahoo.com/
  Fetcher: done
  CrawlDb update: starting
  CrawlDb update: db: crawl/crawldb
  CrawlDb update: segments: [crawl/segments/20080109122101]
  CrawlDb update: additions allowed: true
  CrawlDb update: URL normalizing: true
  CrawlDb update: URL filtering: true
  CrawlDb update: Merging segment data into db.
  CrawlDb update: done
  Generator: Selecting best-scoring urls due for fetch.
  Generator: starting
  Generator: segment: crawl/segments/20080109122110
  Generator: filtering: false
  Generator: topN: 50
  Generator: jobtracker is 'local', generating exactly one partition.
  Generator: Partitioning selected urls by host, for politeness.
  Generator: done.
  Fetcher: starting
  Fetcher: segment: crawl/segments/20080109122110
  Fetcher: 
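[Editor's note] The HTTP 407 responses above mean the fetcher never presented
proxy details. In a stock Nutch 0.9 install, the proxy host and port go into
conf/nutch-site.xml; a minimal sketch with placeholder values (proxy
authentication itself additionally needs the protocol-httpclient setup that
Susam describes in his earlier mail, which these two properties alone do not
cover):

```xml
<!-- conf/nutch-site.xml: example.proxy.com:8080 is a placeholder -->
<property>
  <name>http.proxy.host</name>
  <value>example.proxy.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
```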

Thread behaviour in Nutch Crawl

2008-03-15 Thread naveen.goswami

Hi All,

Could anyone please tell me how the threads behave in Nutch?

I have run the same test under similar conditions with different
numbers of threads.
Below is the output:


No. of Threads  Time Taken in ms

1   235407
2   244569
3   235594
4   226555
5   229323
6   231400
7   219391
8   216384
9   215756
10  221586



See the behaviour:

One thread takes 235407 ms to crawl, whereas two threads take more time
on the same set under similar test conditions. How can that be possible?
Again, with 3 and 4 threads the time decreases, then it increases with
5 & 6 threads, decreases again with 7, 8 & 9 threads, and increases
again with 10 threads. Could anyone please explain why the threads
behave like this? As far as I know, increasing the number of threads
should decrease the response time.










Re: Change of analyzer for specific language

2008-03-15 Thread Vinci

Hi all,
[Follow-up post]
I found the method by myself.
1. Write a plugin for your own language. For the method, you can refer to
analysis-de and analysis-fr to see how to wrap the Lucene analyzer in your
plugin.

2. Then you need to add it to the plugin.includes list in nutch-site.xml.
You also need to add the language-identifier plugin.

3. [For those whose language is not supported by the language identifier, or
who think the language identifier is too slow]
OK, there is a 50% chance you will fail if you are writing for a European
language, and a 100% chance if you are writing for an East Asian language.

The reason is that when the language identifier fails (your language is not
supported), you will see the default indexer do the indexing task for you.

There are two methods:

A. Hack the language-identifier plugin.
i. Hack all the classes except LanguageIdentifier.java. I will not detail
every step here, because there are too many steps and I am writing in a rush,
but the two principles are:
a. Remove all references to the LanguageIdentifier object, including the
declaration and every call made through that reference. This is much easier
if you have an IDE like NetBeans or Eclipse.
b. Remember to change the language variable in the inner class of
HTMLLanguageParser, or change the default language returned when all the
cases fail.
ii. Change langmappings.properties to the actual encoding of your language.
Include all possible combinations, in lower case, e.g.
za = za, zah, utf, utf8
For the full list you can refer to the list of encodings iconv supports; most
systems will support everything, and you will see your language's variants
(well, utf-8 can be utf-8 or utf_8 or utf8!). Also, you may need to include
the first part if the target encoding contains - or _, like utf-8 being
written as utf and utf8 in the example.

Then build language-identifier again.

*For XML you need to create your own parser based on HTMLLanguageParser, but
you will fall into the default case quite quickly if the XML is written badly
enough to use UTF-8 as the encoding but have no lang element.

B. Hack Indexer.java, as mentioned in this post:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05952.html
*For CJK, the default CJKAnalyzer can handle most cases (especially if you
convert the documents to Unicode...); just let zh/ja/kr fall through to the
default case.

Vinci wrote:
 
 Hi all,
 
 How can I change the analyzer which is used by the indexer for specific
 language? Also, can I use all the analyzer that I see in luke?
 
 Thank you.
 

-- 
View this message in context: 
http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16067807.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Confusion of -depth parameter

2008-03-15 Thread Vinci

Hi all,
[This is a follow up post]

I found this was my fault, so I need to crawl one more level than I expected.

Thank you


Vinci wrote:
 
 Hi all,
 
 I am confused about the depth keyword...
 
 seed.txt: url1 -> link1
                   link2
                   link3
                   link4
           url2 -> link5
 ...etc
 
 However, I found the second-level links (beginning with link) cannot be
 crawled unless I set the depth to 3 rather than 2. Why? Is depth 1 the
 seed url file?
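[Editor's note] Vinci's finding, that depth counts the round which fetches the
seeds themselves (so links two hops below a seed need depth 3), can be sketched
as a toy round-based crawl. DepthDemo and its link map are hypothetical
illustrations, not Nutch code:

```java
import java.util.*;

public class DepthDemo {
    // Hypothetical link graph: seed -> url1 -> link1 (two hops from the seed).
    static final Map<String, List<String>> LINKS = Map.of(
        "seed", List.of("url1"),
        "url1", List.of("link1"));

    // Each -depth round fetches the current frontier and queues its outlinks,
    // so a page two hops below the seed is only fetched in round 3.
    static Set<String> crawl(int depth) {
        Set<String> fetched = new LinkedHashSet<>();
        List<String> frontier = List.of("seed");
        for (int round = 0; round < depth; round++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                if (fetched.add(url))
                    next.addAll(LINKS.getOrDefault(url, List.of()));
            }
            frontier = next;
        }
        return fetched;
    }

    public static void main(String[] args) {
        System.out.println(crawl(2)); // [seed, url1]        (link1 not reached)
        System.out.println(crawl(3)); // [seed, url1, link1]
    }
}
```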
 

-- 
View this message in context: 
http://www.nabble.com/Confusion-of--depth-parameter-tp16047305p16067808.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Missing zh.ngp for zh locale support for language identifier

2008-03-15 Thread Vinci

Hi all,

I found that the zh.ngp file for the zh locale is missing. I have seen this
file in a screenshot, but googling the filename returned nothing for me... Can
anyone provide this file?

Thank you
-- 
View this message in context: 
http://www.nabble.com/Missing-zh.ngp-for-zh-locate-support-for-language-Identifier-tp16068532p16068532.html
Sent from the Nutch - User mailing list archive at Nabble.com.



incorrect Query tokenization

2008-03-15 Thread Vinci

Hi all,

I have changed the NutchAnalyzer in the indexing phase via a plugin (the
plugin is based on analysis-de or analysis-fr), but I found the query is
tokenized in the old way: it looks like the query was not parsed with the
same tokenizer that indexed the documents...
I checked the index; the documents are indexed as I want. I also checked the
hadoop log; all plugins loaded (including the one that changes the Indexer).
However, both from the NutchBean and the webapp, the tokenization is not
correct.

How can I fix it?
(*The fastest solution looks like assigning a language [via the
language-identifier plugin] to the query, but I don't know where to start...)
-- 
View this message in context: 
http://www.nabble.com/incorrect-Query-tokenization-tp16070144p16070144.html
Sent from the Nutch - User mailing list archive at Nabble.com.



nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error

2008-03-15 Thread John Mendenhall
I am running nutch 0.9, with tomcat 6.0.14.
When I use the NutchBean to search the index,
it works fine.  I get back results, no errors.
I have used tomcat before and it has worked
fine.

Now I am getting an error searching through
tomcat.  This is the tomcat error I am seeing
in the catalina.out log file:

-
2008-03-15 15:38:38,715 INFO  NutchBean - query request from 192.168.245.58
2008-03-15 15:38:38,717 INFO  NutchBean - query: penasquitos
2008-03-15 15:38:38,717 INFO  NutchBean - lang: en
Mar 15, 2008 3:38:41 PM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.NullPointerException
at 
org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159)
at 
org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)
-

When I run a search using the NutchBean, I
see debug log entries in the hadoop.log.
When I run the search using Tomcat, I never
see any hadoop.log entries.

We have 1.4 million indexed pages, taking
up 31gb for the nutch/crawl directory.

The search term doesn't matter.

My guess is it may be a memory error,
but I am not seeing it anywhere.
Is there a place where I can set the memory
footprint for tomcat to use more memory?

Or, is there another place I should be looking?

Thanks in advance for any pointers or assistance.

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services


Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error

2008-03-15 Thread Vinci

Hi,

please check the path of the search.dir property in the file located in
webapps/nutch_deploy_directory/WEB-INF/classes, and check whether it is
accessible or not.

If you are using an absolute path, then the problem will be something else.

Hope it helps



John Mendenhall wrote:
 
 I am running nutch 0.9, with tomcat 6.0.14.
 When I use the NutchBean to search the index,
 it works fine.  I get back results, no errors.
 I have used tomcat before and it has worked
 fine.
 
 Now I am getting an error searching through
 tomcat.  This is the tomcat error I am seeing
 in the catalina.out log file:
 
 -
 2008-03-15 15:38:38,715 INFO  NutchBean - query request from
 192.168.245.58
 2008-03-15 15:38:38,717 INFO  NutchBean - query: penasquitos
 2008-03-15 15:38:38,717 INFO  NutchBean - lang: en
 Mar 15, 2008 3:38:41 PM org.apache.catalina.core.StandardWrapperValve
 invoke
 SEVERE: Servlet.service() for servlet jsp threw exception
 java.lang.NullPointerException
 at
 org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159)
 at
 org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)
 -
 
 When I run a search using the NutchBean, I
 see debug log entries in the hadoop.log.
 When I run the search using Tomcat, I never
 see any hadoop.log entires.
 
 We have 1.4 million indexed pages, taking
 up 31gb for the nutch/crawl directory.
 
 The search term doesn't matter.
 
 My guess is it may be a memory error,
 but I am not seeing it anywhere.
 Is there a place where I can set the memory
 footprint for tomcat to use more memory?
 
 Or, is there another place I should be looking?
 
 Thanks in advance for any pointers or assistance.
 
 JohnM
 
 -- 
 john mendenhall
 [EMAIL PROTECTED]
 surf utopia
 internet services
 
 

-- 
View this message in context: 
http://www.nabble.com/nutch-0.9%2C-tomcat-6.0.14%2C-nutchbean-okay%2C-tomcat-search-error-tp16073740p16075186.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error

2008-03-15 Thread John Mendenhall
 please check the path of the search.dir in property file (nutch-site.xml)
 located in webapps/nutch_depoly_directory/WEB-INF/classes, check it is
 accessable or not.
 
 if you use absolute path then this will be another problem

Super!  Thanks a bunch!  That was it.
The property is actually searcher.dir.
We always use absolute paths, since it helps tremendously
not having to worry about where one is when the process is
started.

We had moved it from one machine to another and had
forgotten to make sure the tomcat process owner 'tomcat'
was in the nutch group 'nutch'.  Fixed that, and it works
like a charm.

Thanks again!

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
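[Editor's note] For anyone hitting the same NullPointerException: the webapp
reads the crawl location from the searcher.dir property in the nutch-site.xml
under WEB-INF/classes. A minimal sketch, with /data/nutch/crawl as a
placeholder path:

```xml
<!-- WEB-INF/classes/nutch-site.xml: /data/nutch/crawl is a placeholder -->
<property>
  <name>searcher.dir</name>
  <value>/data/nutch/crawl</value>
</property>
```

As this thread shows, the path must also be readable by the user running
Tomcat, not just exist.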


Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error

2008-03-15 Thread Vinci

Hi,

Congrats :)

By the way, unless you set permissions other than 755, there is not much
about permissions you need to worry about if you use Tomcat.

One question: did you change the plugin list? Which plugins are you using? I
wonder how you can get the language of your query...



John Mendenhall wrote:
 
 please check the path of the search.dir in property file (nutch-site.xml)
 located in webapps/nutch_depoly_directory/WEB-INF/classes, check it is
 accessable or not.
 
 if you use absolute path then this will be another problem
 
 Super!  Thanks a bunch!  That was it.
 The property is actually searcher.dir.
 We always use absolute paths since it helps tremendously
 not having to worry about where one is when the process is
 started.
 
 We had moved it from one machine to another and had
 forgotten to make sure the tomcat process owner 'tomcat'
 was in the nutch group 'nutch'.  Fixed that and it works
 like a charm.
 
 Thanks again!
 
 JohnM
 
 -- 
 john mendenhall
 [EMAIL PROTECTED]
 surf utopia
 internet services
 
 

-- 
View this message in context: 
http://www.nabble.com/nutch-0.9%2C-tomcat-6.0.14%2C-nutchbean-okay%2C-tomcat-search-error-tp16073740p16075816.html
Sent from the Nutch - User mailing list archive at Nabble.com.