Re: Proxy Authentication

2010-03-12 Thread Graziano Aliberti

On 11/03/2010 16.20, Susam Pal wrote:

On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
graziano.alibe...@eng.it  wrote:
   

Hi everyone,

I'm trying to use Nutch ver. 1.0 on a system under Squid proxy control.
When I try to fetch my website list, I see in the log file that
authentication failed...

I've configured my nutch-site.xml file with all the properties needed
for proxy auth, but my error is: httpclient.HttpMethodDirector - No
credentials available for BASIC 'Squid proxy-caching web
server'@proxy.my.host:my.port

 

Did you replace 'protocol-http' with 'protocol-httpclient' in the
value of the 'plugin.includes' property in 'conf/nutch-site.xml'?

Regards,
Susam Pal


   

Hi Susam,

yes, of course! :) Maybe I can post the configuration file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>my.agent.name</value>
  <description></description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description></description>
</property>

<property>
  <name>http.auth.file</name>
  <value>my_file.xml</value>
  <description>Authentication configuration file for
  'protocol-httpclient' plugin.
  </description>
</property>

<property>
  <name>http.proxy.host</name>
  <value>ip.my.proxy</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>my.port</value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.proxy.username</name>
  <value>my.user</value>
  <description></description>
</property>

<property>
  <name>http.proxy.password</name>
  <value>my.pwd</value>
  <description></description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value>my_realm</value>
  <description></description>
</property>

<property>
  <name>http.agent.host</name>
  <value>my.local.pc</value>
  <description>The agent host.</description>
</property>

<property>
  <name>http.useHttp11</name>
  <value>true</value>
  <description></description>
</property>

</configuration>

Just one more question: where should I put the user authentication
parameters (user, pwd)? In the nutch-site.xml file, or in my_file.xml
that I use for authentication?


Thank you for your attention,


--
---

Graziano Aliberti

Engineering Ingegneria Informatica S.p.A

Via S. Martino della Battaglia, 56 - 00185 ROMA

*Tel.:* 06.49.201.387

*E-Mail:* graziano.alibe...@eng.it




Re: Proxy Authentication

2010-03-12 Thread Susam Pal
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
 On 11/03/2010 16.20, Susam Pal wrote:

 [full quote of the previous message and configuration file trimmed]

 Just one more question: where should I put the user authentication
 parameters (user, pwd)? In the nutch-site.xml file, or in my_file.xml
 that I use for authentication?

The configuration looks okay to me. Yes, the proxy authentication
details are set in 'conf/nutch-site.xml'. The file named in the
'http.auth.file' property is used to configure credentials for
authenticating to web servers, not to the proxy.
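
(For reference, a minimal sketch of that file, following the
conf/httpclient-auth.xml template that ships with Nutch 1.0 — the host,
port, realm and credentials below are all placeholders:)

<auth-configuration>
  <credentials username="web.user" password="web.pwd">
    <authscope host="server.example.com" port="80" realm="login"/>
  </credentials>
</auth-configuration>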

Unfortunately, there aren't any log statements in the part of the code
that reads the proxy authentication details, so I can't suggest turning
on debug logs to get clues about the issue. However, in case you want
to troubleshoot it yourself by building Nutch from source, I can point
you to the code that deals with this.

The file is: src/java/org/apache/nutch/protocol/httpclient/Http.java :
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java?view=markup

The line number is: 200.
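
(Roughly speaking, the proxy credentials end up in commons-httpclient
3.x along these lines — a paraphrased sketch with placeholder values,
not the actual Nutch source:)

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;

public class ProxyAuthSketch {
  public static void main(String[] args) {
    String proxyHost = "proxy.my.host";              // http.proxy.host
    int proxyPort = 3128;                            // http.proxy.port (placeholder)
    String realm = "Squid proxy-caching web server"; // http.proxy.realm

    HttpClient client = new HttpClient();
    client.getHostConfiguration().setProxy(proxyHost, proxyPort);
    // Credentials are looked up by AuthScope; if the configured realm
    // doesn't match the proxy's challenge, HttpClient reports
    // "No credentials available for BASIC ..." as seen in the log.
    client.getState().setProxyCredentials(
        new AuthScope(proxyHost, proxyPort, realm),
        new UsernamePasswordCredentials("my.user", "my.pwd"));
  }
}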

If I get time this weekend, I will try to insert some log statements
into this code and send you a modified JAR file, which might help you
troubleshoot what is going on. But I can't promise this, since it
depends on my weekend plans.

Two questions before I end this mail. Did you set the value of the
'http.proxy.realm' property to: Squid proxy-caching web server ?
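
(That is, something like this in nutch-site.xml — the realm string has
to match the one in the error message exactly:)

<property>
  <name>http.proxy.realm</name>
  <value>Squid proxy-caching web server</value>
</property>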

Also, do you see any 'auth.AuthChallengeProcessor' lines in the log
file? I'm not sure whether this line should appear for proxy
authentication, but it does appear for web server authentication.

Regards,
Susam Pal


Avoid indexing common html to all pages, promoting page titles.

2010-03-12 Thread Pedro Bezunartea López
Hi,

I'm developing a site that has shows the dynamic content in a div
id=content, the rest of the page doesn't really change. I'd like to store
and index only the contents of this div, to basically avoid re-indexing
over and over the same content (header, footer, menu).

I've checked the WritingPluginExample-0.9 howto, but I couldn't figure
out a couple of things:

1.- Should I extend the parse-html plugin, or should I just replace it?
2.- The example talks about finding a meta tag, extracting some
information from it, and adding a field to the index. I think I just
need to get rid of all HTML except the <div id="content"> tag and index
its content. Can someone point me in the right direction?

And just one more thing: I'd like to give a higher score to pages in
which the search terms appear in the title. Right now, pages that
contain the terms in the body rank higher than those that contain them
in the title; how could I modify this behaviour?

Thanks for your help,

Pedro.


Can nutch index file-exchanger such as depositfiles.com

2010-03-12 Thread michaelnazaruk

Is it possible to do this with Nutch?



Re: Avoid indexing common html to all pages, promoting page titles.

2010-03-12 Thread Andrzej Bialecki

On 2010-03-12 12:52, Pedro Bezunartea López wrote:

Hi,

I'm developing a site that shows the dynamic content in a
<div id="content">; the rest of the page doesn't really change. I'd like
to store and index only the contents of this <div>, basically to avoid
re-indexing the same content (header, footer, menu) over and over.

I've checked the WritingPluginExample-0.9 howto, but I couldn't figure out a
couple of things:

1.- Should I extend the parse-html plugin, or should I just replace it?


You should write an HtmlParseFilter and extract only the portions that 
you care about, and then replace the output parseText with your 
extracted text.
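
(A rough, untested sketch of such a filter, assuming the Nutch 1.0
HtmlParseFilter interface — the class name is made up, and the
plugin.xml wiring that registers the extension is not shown:)

package org.example.nutch.parse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ContentDivFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    Node div = findDivById(doc, "content");
    if (div != null) {
      String url = content.getUrl();
      Parse parse = parseResult.get(url);
      if (parse != null) {
        // Replace the parse text with the text of the div only;
        // outlinks and other parse data are kept as-is.
        parseResult.put(new Text(url),
                        new ParseText(div.getTextContent().trim()),
                        parse.getData());
      }
    }
    return parseResult;
  }

  // Depth-first search for a <div> element with the given id attribute.
  private Node findDivById(Node node, String id) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "div".equalsIgnoreCase(node.getNodeName())
        && id.equals(((Element) node).getAttribute("id"))) {
      return node;
    }
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      Node found = findDivById(child, id);
      if (found != null) return found;
    }
    return null;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}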



2.- The example talks about finding a meta tag, extracting some information
from it, and adding a field in the index. I think I just need to get rid of
all HTML except the <div id="content"> tag, and index its content. Can
someone point me in the right direction?


See above.


And just one more thing, I'd like to give a higher score to pages in which
the search terms appear in the title. Right now pages that contain the terms
in the body rank higher than those that contain the search terms in the
title; how could I modify this behaviour?


You can define these weights in the configuration; look for the query
boost properties.
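
(For example, something like this in nutch-site.xml — 'query.title.boost'
is the property I'd expect from the query-basic plugin, but check
conf/nutch-default.xml for the exact names and default values:)

<property>
  <name>query.title.boost</name>
  <value>5.0</value>
  <description>Boost for matches in the title field.</description>
</property>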


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Abt: Detect slow and timeout servers and drop their URLs

2010-03-12 Thread Yves Petinot
Thanks, Julien ~ deployed it yesterday and it worked like a charm! I was
still using the official 1.0 release, and obviously I've been missing out
on quite a few nice improvements ;-)


-yves

Julien Nioche wrote:

Hello Yves,

Did you see https://issues.apache.org/jira/browse/NUTCH-770? It was
committed to trunk back in December.

HTH

Julien
  




setting search dir for nutch web app

2010-03-12 Thread Mark Lim
Just sharing my experience with setting the search directory for the
Nutch webapp. This is a leading cause of the disappointing "Hits 0-0
(out of about 0 total matching pages)" message.

I had a situation like Noah Silverman:

 On Thu, 2009-12-17 at 16:32 -0800, Noah Silverman wrote:
   
 Hello,

 Just to summarize.

 1) Nutch crawl completes without error.

 2) I can search from the command line and see results. (Assume this
 means that the index is created.)
 bin/nutch org.apache.nutch.searcher.NutchBean foobar

 3) Tomcat configured through nutch-site file to point to nutch/crawl
 directory

 4) catalina.out logfile indicates that tomcat is opening nutch/crawl
 2009-12-16 22:00:39,740 INFO SearchBean - opening indexes in
 /home/noah/Documents/nutch/crawl/indexes

 5) No results when searching in web front end

 6) No errors in the logs

 Is there some way to debug this?  Perhaps more verbose logging?

 Thanks!!!

 -N

The log message in (4) is only somewhat helpful, since if anything goes
wrong, nothing will be said. Noah's problem was that he needed to point
to the top-level directory. My case was that I needed to set the
permissions correctly.

I had crawled as root, so the crawl directory was root:root with
permissions 544 (at least readable). I moved it to $TOMCAT/work and
gave it ownership $TOMCAT_USER:$TOMCAT_GROUP with permissions 755. Now
it works.

In any case, the nutch web app will simply log at INFO that it's
opening indexes at $DIR. If permissions are wrong, or the directory
doesn't exist, it will say nothing, not even at debug level. No
exceptions will be thrown.
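
(For reference, the webapp reads the index location from the
'searcher.dir' property; a minimal nutch-site.xml sketch, with the path
as a placeholder — note it should point at the top-level crawl
directory, not at crawl/indexes:)

<property>
  <name>searcher.dir</name>
  <value>/path/to/crawl</value>
  <description>Top-level directory of the crawl (contains crawldb,
  segments, indexes).</description>
</property>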




Recrawl and crawl-urlfilter.txt

2010-03-12 Thread Joshua J Pavel


I'm having multiple problems recrawling with Nutch 0.9. Here are 2
questions. :-)

Right now, using the script I found here (
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
), I think I'm close to a workable solution, but the recrawl doesn't
respect crawl-urlfilter.txt. Is there a way to specify this
configuration for the recrawl?

Our final implementation will be a single-site crawl with
close-to-realtime search results (ideally, we'll crawl about every 30
minutes or 1 hour). In that regard, is there any way to have Nutch
respect cache-validation response codes (304 Not Modified) instead of
the fetch time in the configuration file?

Thanks!
-Josh Pavel

Nutch Fetch Stuck

2010-03-12 Thread Abhi Yerra
Hi,

We did a fetch and the maps are 100% done, but the reducers have crashed. We 
did a large fetch so is there a way to restart the reducers without restarting 
the fetch?

-Abhi


Re: Nutch Fetch Stuck

2010-03-12 Thread Andrzej Bialecki

On 2010-03-12 23:39, Abhi Yerra wrote:

Hi,

We did a fetch and the maps are 100% done, but the reducers have crashed. We 
did a large fetch so is there a way to restart the reducers without restarting 
the fetch?


Unfortunately, no. Was the fetcher in parsing mode? If so, I strongly
recommend that you first fetch, and then run the parsing as a separate
step.
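
(The separate steps look roughly like this with the 1.0 command-line
tools; the segment and crawldb paths here are placeholders:)

bin/nutch fetch crawl/segments/20100312120000 -noParsing
bin/nutch parse crawl/segments/20100312120000
bin/nutch updatedb crawl/crawldb crawl/segments/20100312120000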


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Fetch Stuck

2010-03-12 Thread Abhi Yerra
So I had -noParsing set; parsing was not part of the fetch. The pages
have been crawled, but the reducers have crashed. So if I restart the
fetch, will it try to crawl all those pages again?

-Abhi 

 
- Original Message -
From: Andrzej Bialecki a...@getopt.org
To: nutch-user@lucene.apache.org
Sent: Friday, March 12, 2010 3:05:00 PM
Subject: Re: Nutch Fetch Stuck

On 2010-03-12 23:39, Abhi Yerra wrote:
 Hi,

 We did a fetch and the maps are 100% done, but the reducers have crashed. We 
 did a large fetch so is there a way to restart the reducers without 
 restarting the fetch?

Unfortunately no. Was the fetcher in the parsing mode? If so, I 
strongly recommend that you first fetch, and then run the parsing as a 
separate step.

-- 
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Content of redirected urls empty

2010-03-12 Thread BELLINI ADAM

does no one have an answer!?





 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org; mille...@gmail.com
 Subject: RE: Content of redirected urls empty
 Date: Wed, 10 Mar 2010 21:01:54 +
 
 
 I read lots of posts regarding redirected URLs but didn't find a solution!
 
 
 
 
 
  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org; mille...@gmail.com
  Subject: RE: Content of redirected urls empty
  Date: Tue, 9 Mar 2010 16:59:05 +
  
  
  
  hi,
  
  I don't know if you found a few minutes to look at my problem :)
  
  but I want to explain it again, maybe it wasn't clear:
  
  I have HTTP pages redirected to HTTPS (but it's the same URL):
  
  HTTP://page1.com redirected to HTTPS://page1.com
  
  the content of my HTTP page is empty.
  the content of my HTTPS page is not empty.
  
  in my segment I found both URLs (HTTP and HTTPS); the content of the
  HTTPS page is not empty,
  
  but in my index I found the HTTP one with the empty content.
  
  is there a way to tell Nutch to index the URL with the non-empty
  content? or why doesn't Nutch index the target URL rather than the
  empty (origin) one??
  
  thx a lot
  
  
  
  
  
   From: mbel...@msn.com
   To: nutch-user@lucene.apache.org
   Subject: RE: Content of redirected urls empty
   Date: Mon, 8 Mar 2010 17:08:06 +
   
   
   I'm sorry... I just checked twice... and in my index I have the
   original URL, which is the HTTP one with the empty content... but it
   doesn't index the HTTPS one... and I'm using the Solr index.
   thx
   
   
   
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: Content of redirected urls empty
Date: Mon, 8 Mar 2010 17:01:34 +




Hi, I've just dumped my segments and found that I have both URLs: the
original one (HTTP) with an empty content, and the REDIRECTED-TO or
DESTINATION URL (HTTPS) with NON-EMPTY content!

but in my search I found only the HTTPS URL with an empty content!!
logically the content of the HTTPS URL is not empty!
it's just mixing the HTTPS URL with the content of the HTTP one.


our redirect is done by Java code (response.sendRedirect(…)), so it
seems to be an HTTP redirect, right??

thx for helping me :)


 Date: Mon, 8 Mar 2010 15:51:34 +0100
 From: a...@getopt.org
 To: nutch-user@lucene.apache.org
 Subject: Re: Content of redirected urls empty
 
 On 2010-03-08 14:55, BELLINI ADAM wrote:
 
 
  does anyone have any idea, guys??
 
 
  From: mbel...@msn.com
  To: nutch-user@lucene.apache.org
  Subject: Content of redirected urls empty
  Date: Fri, 5 Mar 2010 22:01:05 +
 
 
 
  hi,
  the content of my redirected URLs is empty... but they still have the
  other metadata...
  I have an HTTP URL that is redirected to HTTPS.
  in my index I find the HTTP URL, but with an empty content...
  could you explain it, plz?
 
 There are two ways to redirect - one is with protocol, and the other
 is with content (either meta refresh, or javascript).
 
 When you dump the segment, is there really no content for the
 redirected url?
 
 
 -- 
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
  