Re: Proxy Authentication

2010-03-15 Thread Graziano Aliberti

On 13/03/2010 22.55, Susam Pal wrote:

On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal susam@gmail.com  wrote:
   

On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
graziano.alibe...@eng.it  wrote:
 

On 11/03/2010 16.20, Susam Pal wrote:
   

On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:

 

Hi everyone,

I'm trying to use Nutch 1.0 on a system behind a Squid proxy. When I
try to fetch my website list, the log file shows that authentication
failed.

I've configured my nutch-site.xml file with all the properties needed
for proxy auth, but my error is: httpclient.HttpMethodDirector - No
credentials available for BASIC 'Squid proxy-caching web
server'@proxy.my.host:my.port


   

Did you replace 'protocol-http' with 'protocol-httpclient' in the
value of the 'plugin.includes' property in 'conf/nutch-site.xml'?

Regards,
Susam Pal



 

Hi Susam,

Yes, of course! :) Let me post you the configuration file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>my.agent.name</value>
  <description></description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description></description>
</property>

<property>
  <name>http.auth.file</name>
  <value>my_file.xml</value>
  <description>Authentication configuration file for
  'protocol-httpclient' plugin.
  </description>
</property>

<property>
  <name>http.proxy.host</name>
  <value>ip.my.proxy</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>my.port</value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.proxy.username</name>
  <value>my.user</value>
  <description></description>
</property>

<property>
  <name>http.proxy.password</name>
  <value>my.pwd</value>
  <description></description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value>my_realm</value>
  <description></description>
</property>

<property>
  <name>http.agent.host</name>
  <value>my.local.pc</value>
  <description>The agent host.</description>
</property>

<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description></description>
</property>

</configuration>

One other question: where must I put the user authentication parameters
(user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I use
for authentication?

Thank you for your attention,


--
---

Graziano Aliberti

Engineering Ingegneria Informatica S.p.A

Via S. Martino della Battaglia, 56 - 00185 ROMA

*Tel.:* 06.49.201.387

*E-Mail:* graziano.alibe...@eng.it



   

The configuration looks okay to me. Yes, the proxy authentication
details are set in 'conf/nutch-site.xml'. The file named in the
'http.auth.file' property is used for configuring credentials for
authenticating to a web server.

Unfortunately, there aren't any log statements in the part of the code
that reads the proxy authentication details, so I can't suggest
turning on debug logs to get some clues about the issue. However, in
case you want to troubleshoot it yourself by building Nutch from
source, I can point you to the code that deals with this.

The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup

The line number is: 200.

If I get time this weekend, I will try to insert some log statements
into this code and send a modified JAR file to you which might help
you to troubleshoot what is going on. But I can't promise this since
it depends on my weekend plans.

Two questions before I end this mail. Did you set the value of the
'http.proxy.realm' property to: Squid proxy-caching web server ?

Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
file? I'm not sure whether this line should appear for proxy
authentication but it does appear for web server authentication.

Regards,
Susam Pal

 

I managed to find some time to insert more logs into
protocol-httpclient and create a JAR. I have attached it with this
email.

Please replace your
'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
one that I have attached. Also, edit your 'conf/log4j.properties' file
to add these two lines:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

When you run a crawl now, you should see more logs in
'logs/hadoop.log' than before. I hope it helps you in providing some
clues. In case you want to compare the logs with how the control flows
from the source code, I have attached the JAVA file as well.

Regards,
Susam Pal
   


Hi Susam,

first of all I want to thank you for your support :). 

Re: Content of redirected urls empty

2010-03-15 Thread Julien Nioche
Adam,

Could you please tell us what the http and https entries look like in the
crawlDB (using readdb -url)?

J.
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 13 March 2010 04:29, BELLINI ADAM mbel...@msn.com wrote:


 no one has an answer!?





  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org; mille...@gmail.com
  Subject: RE: Content of redirected urls empty
  Date: Wed, 10 Mar 2010 21:01:54 +
 
 
  i read a lot of posts regarding redirected URLs but didn't find a solution!
 
 
 
 
 
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org; mille...@gmail.com
   Subject: RE: Content of redirected urls empty
   Date: Tue, 9 Mar 2010 16:59:05 +
  
  
  
   hi,

   I don't know if you found a few minutes to look at my problem :)

   but I want to explain it again; maybe it wasn't clear:


   I have HTTP pages redirected to HTTPS (but it's the same URL):

   HTTP://page1.com redirected to HTTPS://page1.com

   the content of my HTTP page is empty.
   the content of my HTTPS page is not empty.

   in my segment I found both URLs (HTTP and HTTPS); the content
 of the HTTPS page is not empty,

   but in my index I found the HTTP one with the empty content.

   is there a way to tell Nutch to index the URL with the non-empty
 content? Or why doesn't Nutch index the target URL rather than indexing the
 empty (origin) one??

   thx a lot
  
  
  
  
  
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:08:06 +
   
   
i'm sorry... I just checked twice... and in my index I have the
 original URL, which is the HTTP one with the empty content... but it doesn't
 index the HTTPS one... and I am using the Solr index.
thx
   
   
   
 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: RE: Content of redirected urls empty
 Date: Mon, 8 Mar 2010 17:01:34 +




 Hi, I've just dumped my segments and found that I have both URLs:
 the original one (HTTP) with an empty content, and the REDIRECTED-TO or
 DESTINATION URL (HTTPS) with NON-EMPTY content!

 but in my search I found only the HTTPS URL with an empty content!!
 Logically the content of the HTTPS URL is not empty!
 it's just mixing the HTTPS URL with the content of the HTTP one.


 our redirect is done by Java code (response.sendRedirect(…)), so it
 seems to be an HTTP redirect, right??

 thx for helping me :)


  Date: Mon, 8 Mar 2010 15:51:34 +0100
  From: a...@getopt.org
  To: nutch-user@lucene.apache.org
  Subject: Re: Content of redirected urls empty
 
  On 2010-03-08 14:55, BELLINI ADAM wrote:
  
  
   is there any idea guys ??
  
  
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org
   Subject: Content of redirected urls empty
   Date: Fri, 5 Mar 2010 22:01:05 +
  
  
  
   hi,
   the content of my redirected URLs is empty... but they still have
 the other metadata...
   I have an HTTP URL that is redirected to HTTPS.
   in my index I find the HTTP URL but with an empty content...
   could you explain it, please?
 
  There are two ways to redirect - one is with protocol, and the
 other is
  with content (either meta refresh, or javascript).
 
  When you dump the segment, is there really no content for the
 redirected
  url?
 
 
  --
  Best regards,
  Andrzej Bialecki 
___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
 

 _
 Live connected with Messenger on your phone
 http://go.microsoft.com/?linkid=9712958
   
_
IM on the go with Messenger on your phone
http://go.microsoft.com/?linkid=9712960
  
   _
   Stay in touch.
   http://go.microsoft.com/?linkid=9712959
 
  _
  Take your contacts everywhere
  http://go.microsoft.com/?linkid=9712959

 _
 Stay in touch.
 http://go.microsoft.com/?linkid=9712959



RE: Content of redirected urls empty

2010-03-15 Thread BELLINI ADAM

Hi
thx for your help,

this is a fresh crawl from today:


1- HTTP:
bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html

URL: http://myDNS/index.html
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Mon Mar 15 12:15:52 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.018119827
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html




2- HTTPS: 
bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html

URL: https://myDNS/index.html
Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Mar 15 12:32:34 EDT 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 36000 seconds (0 days)
Score: 0.00511379
Signature: 5f84dcec905c24e3e2af902ad9ad7398
Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html






and as i said the last day, on my segment the https has an empty content.

thx



Problem with ANT in building new Plugin for Nutch 1.0 ----- error in finding classes in packages

2010-03-15 Thread Arnaud Garcia
Hello everyone

I'm trying to add a new plugin to Nutch/Solr to add new fields and
finally search on them in the terminal interface.

For that, I have read the WritingPluginExample (0.9) howto from the Apache
wiki and I'm trying to follow it.

I have a problem building the new plugin with ant: the problem seems to
come from packages, and ant can't import classes from other packages.

Can anyone please tell me what I must do to resolve this problem?

Here is a copy of the ANT build output:

[javac]
/home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedParser.java:34:
cannot find symbol
[javac] symbol  : class Parse
[javac] location: class
org.apache.nutch.parse.recommended.RecommendedParser
[javac]   public Parse filter(Content content, Parse parse,
[javac]^
[javac]
/home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedParser.java:35:
cannot find symbol
[javac] symbol  : class HTMLMetaTags
[javac] location: class
org.apache.nutch.parse.recommended.RecommendedParser
[javac] HTMLMetaTags metaTags, DocumentFragment doc) {
[javac] ^
[javac]
/home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedParser.java:34:
cannot find symbol
[javac] symbol  : class Parse
[javac] location: class
org.apache.nutch.parse.recommended.RecommendedParser
[javac]   public Parse filter(Content content, Parse parse,
[javac]  ^
[javac]
/home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedQueryFilter.java:3:
package org.apache.nutch.searcher does not exist
[javac] import org.apache.nutch.searcher.FieldQueryFilter;
[javac] ^
[javac]
/home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedQueryFilter.java:12:
cannot find symbol
[javac] symbol: class FieldQueryFilter
[javac] public class RecommendedQueryFilter extends FieldQueryFilter {
[javac] ^
[javac] Note:
/home/Arnaud/nutch/src/plugin/recommended/src/java/org/apache/nutch/parse/recommended/RecommendedIndexer.java
uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] 23 errors
BUILD FAILED
/home/Arnaud/nutch/src/plugin/build-plugin.xml:111: Compile failed; see the
compiler error output for details.

Total time: 3 seconds


Note: I'm using Nutch 1.0, but the WritingPluginExample howto on the wiki
is for version 0.9. Could the problem come from that? Maybe yes, maybe not?

Can any of the experts please give me some help?
Thanks


Re: Content of redirected urls empty

2010-03-15 Thread Julien Nioche

 and as i said the last day, on my segment the https has an empty content.


Hmm, that's not what you said in your previous message. Also, I can see it
has a signature in the crawlDB, so it must have content.

I expect that the content would be indexed under the http://  URL thanks to
*_repr_: **http://myDNS/index.html*

See BasicIndexingFilter for details.

it's just mixing the HTTPS url with the content of the HTTP one.


it should be the other way round: the HTTPS content *with* the HTTP URL.
Actually the http:// document is not sent to the index at all (see around
line 86 in IndexerMapReduce), so what you are seeing in the index must be
the https doc with _repr_ used as the URL.

can you please confirm that :
1/ the segment has a content for the https:// doc
2/ you can find the http:// URL in the index and it has no content

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Problem with ANT in building new Plugin for Nutch 1.0 ----- error in finding classes in packages

2010-03-15 Thread Arnaud Garcia
Hello,

Isn't there anyone who knows where this problem comes from?

Please, I need some help.




Re: Problem with ANT in building new Plugin for Nutch 1.0 ----- error in finding classes in packages

2010-03-15 Thread Alexander Aristov
Hi

Obviously You didn't include necessary references to other JARs or source
directories. It's configured in plugin.xml
Check the file and add there all necessary references. Remember that each
plugin is compiled separately and it doesn't know about other plugins.
Compare your ANT files with files from other plugins. Maybe it will give you
a hint.
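As a point of comparison, a plugin descriptor for this kind of parse/index plugin might look like the sketch below. The ids, jar name, and class names are hypothetical, modelled on the 'recommended' example from the wiki; the exact extension points to declare depend on your Nutch version:

```xml
<plugin id="recommended" name="Recommended Plugin"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <!-- The jar produced by this plugin's own build. -->
    <library name="recommended.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- Each plugin is loaded in isolation; dependencies on other
       plugins must be declared explicitly here. -->
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.parse.recommended"
             name="Recommended Parser"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="RecommendedParser"
                    class="org.apache.nutch.parse.recommended.RecommendedParser"/>
  </extension>
</plugin>
```

Note that plugin.xml only governs runtime loading; compile-time classpath problems like the 'cannot find symbol' errors above usually mean the plugin's build.xml is not picking up the core Nutch classes or the other plugins it compiles against.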


Best Regards
Alexander Aristov





RE: Content of redirected urls empty

2010-03-15 Thread BELLINI ADAM



Oh sorry, I was mistaken again, and yes, you are completely right:
1- The HTTPS has content in my segment.
2- The HTTP has an empty content.

In my index I have the HTTPS URL with the empty content (it's exactly
what you said: it's just mixing the HTTPS URL with the content of the
HTTP one), and I expected the other way round: the HTTPS content *with*
the HTTP URL.

I don't know if I have the HTTP URL in my index; I don't know how to see
all the indexed URLs in Solr. But I'm sure that when I perform a search
using RMS I obtain only the HTTPS URL with an empty content (I guess it's
the empty content of the HTTP one).
But again, in the segment the content of the HTTPS is not empty.
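To see which URLs actually made it into the index, Solr's select handler can be queried directly; a sketch assuming a default local single-core Solr install (host, port, and field names are assumptions that depend on your schema):

```
# List the url field of the first 100 indexed documents
curl "http://localhost:8983/solr/select?q=*:*&fl=url&rows=100"
# Inspect one document, including its content field
curl "http://localhost:8983/solr/select?q=url:%22https://myDNS/index.html%22&fl=url,content"
```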




RE: Content of redirected urls empty

2010-03-15 Thread BELLINI ADAM


hi again,

i forgot to ask: what does _repr_ mean?



 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: RE: Content of redirected urls empty
 Date: Mon, 15 Mar 2010 15:29:48 +
 
 
 
 
 Oh sorry, my mistake again, and yes you are completely right:
 1- The HTTPS has a content in my segment.
 2- the HTTP has an empty content.
 
 in
 my index i have the HTTPS url with the empty content (...it's exactly
 what you said: it's just mixing the HTTPS url with
 the content of the HTTP one), and i expected the other way round: the
 HTTPS content *with* the HTTP URL.
 
 
 i don't know if i have the HTTP url in my index; i don't know how to see all
 the indexed URLs in SOLR. but i'm sure that when I perform a search using RMS
 i obtain only the HTTPS url with an empty content (i guess it's the empty
 content of the HTTP one).
 but again, in the segment the content of the https is not empty.
 
 
 
  Date: Mon, 15 Mar 2010 13:44:33 +
  Subject: Re: Content of redirected urls empty
  From: lists.digitalpeb...@gmail.com
  To: nutch-user@lucene.apache.org
  
  
   and as i said the last day, on my segment the https has an empty content.
  
  
  hmm it's not what you said in your previous message + I can see it has a
  signature in the crawlDB so it must have content.
  
  I expect that the content would be indexed under the http://  URL thanks to
  *_repr_: **http://myDNS/index.html*
  
  See BasicIndexingFilter for details.
  
  it's just mixing the HTTPS url with the content of the HTTP one.
  
  
  it should be the other way round : the HTTPS content *with* the HTTP URL.
  Actually the http:// document is not sent to the index at all (see around
  line 86 in IndexerMapReduce 86) so what you are seeing in the index must be
  the https doc with _repr_ used as a URL.
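  In other words, the selection described above can be sketched roughly as
  follows (a simplified Python illustration of the behaviour, not Nutch's
  actual code; the statuses and field names are placeholders):

```python
# Sketch of which URL/content pair reaches the index after a redirect.
# Redirected entries are dropped by the indexer (cf. IndexerMapReduce),
# and a fetched entry is indexed under its _repr_ URL when one is set
# (cf. BasicIndexingFilter). Illustrative only.

DB_FETCHED = "db_fetched"
DB_REDIR_TEMP = "db_redir_temp"

def to_index(crawldb_entries):
    docs = []
    for entry in crawldb_entries:
        # Redirect entries never reach the index.
        if entry["status"] != DB_FETCHED:
            continue
        # The representative (_repr_) URL, if present, replaces the
        # fetched URL in the indexed document.
        url = entry.get("repr", entry["url"])
        docs.append({"url": url, "content": entry["content"]})
    return docs

entries = [
    {"url": "http://myDNS/index.html", "status": DB_REDIR_TEMP,
     "content": ""},
    {"url": "https://myDNS/index.html", "status": DB_FETCHED,
     "repr": "http://myDNS/index.html", "content": "<html>...</html>"},
]
# One document only: the https content, indexed under the http:// URL.
print(to_index(entries))
```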
  
  can you please confirm that :
  1/ the segment has a content for the https:// doc
  2/ you can find the http:// URL in the index and it has no content
  
  HTH
  
  Julien
  
  -- 
  DigitalPebble Ltd
  http://www.digitalpebble.com
  On 15 March 2010 13:00, BELLINI ADAM mbel...@msn.com wrote:
  
  
   Hi
   thx for your help,
  
 this is a fresh crawl from today:
  
  
   1- HTTP:
   bin/nutch readdb crawl_portal/crawldb/ -url http://myDNS/index.html
  
   URL: http://myDNS/index.html
   Version: 7
   Status: 4 (db_redir_temp)
   Fetch time: Mon Mar 15 12:15:52 EDT 2010
   Modified time: Wed Dec 31 19:00:00 EST 1969
   Retries since fetch: 0
   Retry interval: 36000 seconds (0 days)
   Score: 0.018119827
   Signature: null
   Metadata: _pst_: temp_moved(13), lastModified=0: https://myDNS/index.html
  
  
  
  
   2- HTTPS:
   bin/nutch readdb crawl_portal/crawldb/ -url https://myDNS/index.html
  
   URL: https://myDNS/index.html
   Version: 7
   Status: 2 (db_fetched)
   Fetch time: Mon Mar 15 12:32:34 EDT 2010
   Modified time: Wed Dec 31 19:00:00 EST 1969
   Retries since fetch: 0
   Retry interval: 36000 seconds (0 days)
   Score: 0.00511379
   Signature: 5f84dcec905c24e3e2af902ad9ad7398
   Metadata: _pst_: success(1), lastModified=0_repr_: http://myDNS/index.html
  
  
  
  
  
  
   and as i said the last day, on my segment the https has an empty content.
  
   thx
  
  

Re: Content of redirected urls empty

2010-03-15 Thread Julien Nioche
 my index i have the HTTPS  url with the empty content (...it's exactely
 what you said : it's just mixing the HTTPS url with
 the content of the HTTP one,) and i expected the other way round : the
 HTTPS content *with* the HTTP URL.


strange



 i dont know if i have the HTTP url in my index, i dont know how to see all
 the indexed URLS in SOLR.


well you could query on the hostname or the whole URL, I suppose.

You could also index with Lucene and use Luke to debug the content of the
index


 but i'm sure that when a perform a search using RMS i obtain only the HTTPS
 url with an empty content (i guess it's the empty content of the HTTP one).
 but again in the segment the content of the https is not empty.


_repr_  : representative - see class ReprUrlFixer
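To expand on that, the rule of thumb seems to be: for a temporary redirect the source URL stays representative, for a permanent redirect the destination takes over. A rough Python sketch of that rule (an approximation for illustration, not the actual ReprUrlFixer code):

```python
# Rough sketch of how a representative URL could be chosen for a redirect.
# Approximation for illustration, not Nutch's actual ReprUrlFixer code.
def choose_repr(src, dst, temp_redirect):
    if temp_redirect:
        # Temporary redirect: the source is still the canonical address,
        # so it remains the representative URL (_repr_).
        return src
    # Permanent redirect: the destination becomes the new canonical URL.
    return dst

# The case in this thread is a temp_moved (temporary) redirect, which is
# why the https entry carries _repr_ = the http:// source URL.
print(choose_repr("http://myDNS/index.html", "https://myDNS/index.html", True))
```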










problem crawling entire internal website

2010-03-15 Thread ksee

Hi,

I'm a new nutch user. My company wants me to look into using this technology
to index our internal wiki website as well as sharepoint docs (using tika).

Right now, I just want nutch to index the entire wiki site but I'm having
problems. I've read other people's problems with this but I haven't found a
solution that worked for me.

I have nutch 1.0 installed.
The wiki site is MoinMoin if that helps. The pages don't have extensions
like .html. They're in the form of http://wiki:8000/Engineering as an
example. So all pages only have 1-level depth paths.

I'm running nutch with the following command:
bin/nutch crawl urls -dir crawl -depth 100 -topN 100 > crawl.log

I have a urls folder with a file called wiki that points to the top-level
page of the site.

I set the crawl-urlfilter.txt to accept everything except the default
exclusions:
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
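For what it's worth, the rules above are applied top to bottom and the first matching rule wins: a '-' rule rejects the URL, a '+' rule accepts it. A small Python sketch of that evaluation (a simplification of what urlfilter-regex does, not its actual code):

```python
import re

# Simplified first-match-wins evaluation of a crawl-urlfilter.txt style
# rule list ('-' = reject, '+' = accept). Not Nutch's actual code.
rules = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),  # repeated path segments
    ("+", r"."),                            # accept everything else
]

def accept(url):
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject

print(accept("http://wiki:8000/Engineering"))  # accepted by the final '+.'
print(accept("http://wiki:8000/logo.png"))     # rejected by the extension rule
```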

And I set the db.ignore.external.links property in nutch-default.xml to true
so it doesn't go outside of the site. (db.ignore.internal.links is set to
false)

After the crawl command completes, the search returns some pages, but there
are still some pages that are maybe 2 or 3 levels from the starting page
that don't show up on search.

Any help would be appreciated.

Thanks,
Kane
-- 
View this message in context: 
http://old.nabble.com/problem-crawling-entire-internal-website-tp27908943p27908943.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Proxy Authentication

2010-03-15 Thread Susam Pal
On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
 Il 13/03/2010 22.55, Susam Pal ha scritto:

 On Fri, Mar 12, 2010 at 3:17 PM, Susam Palsusam@gmail.com  wrote:


 On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
 graziano.alibe...@eng.it  wrote:


 Il 11/03/2010 16.20, Susam Pal ha scritto:


 On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
 graziano.alibe...@eng.it    wrote:



 Hi everyone,

 I'm trying to use nutch ver. 1.0 on a system under squid proxy
 control.
 When
 I try to fetch my website list, in the log file I see that the
 authentication failed...

 I've configured my nutch-site.xml file with all that properties needed
 for
 proxy auth, but my error is httpclient.HttpMethodDirector - No
 credentials
 available for BASIC 'Squid proxy-caching web
 server'@proxy.my.host:my.port




 Did you replace 'protocol-http' with 'protocol-httpclient' in the
 value for 'plugins.include' property in 'conf/nutch-site.xml'?

 Regards,
 Susam Pal





 Hi Susam,

 yes of course!! :) Maybe I can post you the configuration file:

 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>

 <property>
 <name>http.agent.name</name>
 <value>my.agent.name</value>
 <description>
 </description>
 </property>

 <property>
 <name>plugin.includes</name>
 <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 <description>
 </description>
 </property>

 <property>
 <name>http.auth.file</name>
 <value>my_file.xml</value>
 <description>Authentication configuration file for
  'protocol-httpclient' plugin.
 </description>
 </property>

 <property>
 <name>http.proxy.host</name>
 <value>ip.my.proxy</value>
 <description>The proxy hostname. If empty, no proxy is used.</description>
 </property>

 <property>
 <name>http.proxy.port</name>
 <value>my.port</value>
 <description>The proxy port.</description>
 </property>

 <property>
 <name>http.proxy.username</name>
 <value>my.user</value>
 <description>
 </description>
 </property>

 <property>
 <name>http.proxy.password</name>
 <value>my.pwd</value>
 <description>
 </description>
 </property>

 <property>
 <name>http.proxy.realm</name>
 <value>my_realm</value>
 <description>
 </description>
 </property>

 <property>
 <name>http.agent.host</name>
 <value>my.local.pc</value>
 <description>The agent host.</description>
 </property>

 <property>
 <name>http.useHttp11</name>
 <value>true</value>
 <description>
 </description>
 </property>

 </configuration>

 One more question: where must I put the user authentication parameters
 (user, pwd)? In the nutch-site.xml file, or in the my_file.xml that I use
 for authentication?

 Thank you for your attention,


 --
 ---

 Graziano Aliberti

 Engineering Ingegneria Informatica S.p.A

 Via S. Martino della Battaglia, 56 - 00185 ROMA

 *Tel.:* 06.49.201.387

 *E-Mail:* graziano.alibe...@eng.it





 The configuration looks okay to me. Yes, the proxy authentication
 details are set in 'conf/nutch-site.xml'. The file mentioned in
 'http.auth.file' property is used for configuring authentication
 details for authenticating to a web server.

 Unfortunately, there aren't any log statements in the part of the code
 that reads the proxy authentication details. So I can't suggest turning
 on debug logs to get clues about the issue. However, in
 case you want to troubleshoot it yourself by building Nutch from
 source, I can tell you the code that deals with this.

 The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :

 http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup

 The line number is: 200.

 If I get time this weekend, I will try to insert some log statements
 into this code and send a modified JAR file to you which might help
 you to troubleshoot what is going on. But I can't promise this since
 it depends on my weekend plans.

 Two questions before I end this mail. Did you set the value of
 'http.proxy.realm' property as: Squid proxy-caching web server ?

 Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
 file? I'm not sure whether this line should appear for proxy
 authentication but it does appear for web server authentication.

 Regards,
 Susam Pal



 I managed to find some time to insert more logs into
 protocol-httpclient and create a JAR. I have attached it with this
 email.

 Please replace your
 'plugins/protocol-httpclient/protocol-httpclient.jar' file with the
 one that I have attached. Also, edit your 'conf/log4j.properties' file
 to add these two lines:

 log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
 log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

 When you run a crawl now, you should see more logs in
 'logs/hadoop.log' than before. I hope it helps you in providing some
 clues. In case you want to compare the logs with how the 

RE: Content of redirected urls empty

2010-03-15 Thread BELLINI ADAM

Hi,

finally i learned how to display only the indexed URLs in the solr index.

the url is  http://localhost:8080/solr/select/?q=*:*&fl=url,content

q=*:*  selects all entries in the index
fl=url,content  returns only the urls and their content.
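In case it helps anyone, the same select URL can be built programmatically; a small Python sketch (the rows parameter is an addition, since Solr returns only 10 rows by default):

```python
from urllib.parse import urlencode

# Build the Solr select URL from this thread: all documents, returning
# only the url and content fields. Host/port as used in this thread.
base = "http://localhost:8080/solr/select/"
params = {
    "q": "*:*",          # match every document in the index
    "fl": "url,content", # return only these stored fields
    "rows": 100,         # Solr defaults to 10 rows; ask for more explicitly
}
query_url = base + "?" + urlencode(params)
print(query_url)
# → http://localhost:8080/solr/select/?q=%2A%3A%2A&fl=url%2Ccontent&rows=100
```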


Now i'm 100% sure that i don't have the source HTTP urls in my index; i have
only the target ones (HTTPS), with an empty content.



i don't know if someone could explain why nutch is missing the content of
redirected urls when indexing!



 Date: Mon, 15 Mar 2010 16:28:03 +
 Subject: Re: Content of redirected urls empty
 From: lists.digitalpeb...@gmail.com
 To: nutch-user@lucene.apache.org
 
  my index i have the HTTPS  url with the empty content (...it's exactely
  what you said : it's just mixing the HTTPS url with
  the content of the HTTP one,) and i expected the other way round : the
  HTTPS content *with* the HTTP URL.
 
 
 strange
 
 
 
  i dont know if i have the HTTP url in my index, i dont know how to see all
  the indexed URLS in SOLR.
 
 
 well you could query on the hostname or the whole URL is suppose.
 
 You could also index with Lucene and use Luke to debug the content of the
 index
 
 
  but i'm sure that when a perform a search using RMS i obtain only the HTTPS
  url with an empty content (i guess it's the empty content of the HTTP one).
  but again in the segment the content of the https is not empty.
 
 
 _repr_  : representative - see class ReprUrlFixer
 
 
 
 
 
 
 
 
 

Re: Proxy Authentication

2010-03-15 Thread Susam Pal
On Tue, Mar 16, 2010 at 12:55 AM, Susam Pal susam@gmail.com wrote: