[
https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michela Becchi resolved NUTCH-824.
--
Fix Version/s: 1.0.0
Resolution: Fixed
Hi,
I fixed (or, at least, circumvented) this by modifying the
org.apache.nutch.protocol.file.FileResponse class belonging to the
protocol-file plugin.
In particular, at line 120, I added:
120    String path = "".equals(url.getPath()) ? "/" : url.getPath();
121  + String decoded_path = path; // @Michela
122
123  + try {
124  +   decoded_path = java.net.URLDecoder.decode(path, "UTF-8");
125  + } catch (Exception ex) {
126  + }
Then, rather than
- java.io.File f = new java.io.File(path);
I have
+ java.io.File f = new java.io.File(decoded_path);
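For reference, the decode call behaves like this on the file name from the report below (a standalone sketch, not part of the patch itself):

```java
// Minimal demo of the decode step, run outside Nutch.
import java.net.URLDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // File name exactly as it appears in the failing crawl URL.
        String encoded = "A.M._%28album%29_8a09.html";
        // URLDecoder turns the percent escapes back into the literal
        // characters stored on disk: %28 -> "(" and %29 -> ")".
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        System.out.println(decoded); // A.M._(album)_8a09.html
    }
}
```

One caveat of this approach: URLDecoder.decode also turns a literal "+" into a space, so file names containing "+" would still be mangled; catching and ignoring the exception, as the patch does, simply falls back to the undecoded path.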
Thanks,
Michela
Crawling - File Error 404 when fetching a file with a hexadecimal character in
the file name.
Key: NUTCH-824
URL: https://issues.apache.org/jira/browse/NUTCH-824
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64
GNU/Linux
Reporter: Michela Becchi
Fix For: 1.0.0
Hello,
I am performing a local file system crawling.
My problem is the following: all files whose names contain percent-encoded
(hexadecimal) characters do not get crawled.
For example, I will see the following error:
fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
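The 404 can be reproduced outside Nutch with a plain java.io.File lookup (a hypothetical standalone sketch; a file in the temp directory stands in for the wiki dump):

```java
// Standalone repro of the 404: a file whose on-disk name contains
// parentheses is not found when looked up by its percent-encoded URL path.
import java.io.File;
import java.net.URLDecoder;

public class EncodedLookupRepro {
    public static void main(String[] args) throws Exception {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        // Create the file under its real (decoded) name, as the wiki dump does.
        File real = new File(dir, "A.M._(album)_8a09.html");
        real.createNewFile();

        // The fetcher hands FileResponse the percent-encoded path instead.
        String encoded = "A.M._%28album%29_8a09.html";

        // Looking up the encoded name literally fails, so FileResponse answers 404.
        System.out.println(new File(dir, encoded).exists()); // false -> 404
        // Decoding first finds the file.
        System.out.println(new File(dir, URLDecoder.decode(encoded, "UTF-8")).exists()); // true

        real.delete();
    }
}
```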
I am using nutch-1.0.
Among other standard settings, I configured nutch-site.xml as follows:
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
Moreover, crawl-urlfilter.txt looks like:
# skip http:, ftp:, mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[...@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# accept everything else
+.*
---
Thanks,
Michela
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.