[ 
https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846932#action_12846932
 ] 

Andrzej Bialecki  commented on NUTCH-802:
-----------------------------------------

We already have a general way to control this and other aspects of URLs, 
namely URLFilters. I agree that this functionality could be useful, but in 
the form of a URLFilter (or by adding this control to e.g. urlfilter-basic or 
urlfilter-validator).
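
To illustrate the idea, here is a minimal sketch of a length check written in
the style of a URL filter. It is not the attached patch and not an existing
Nutch plugin. SimpleUrlFilter is a local stand-in that mirrors the
"return the URL to keep it, null to reject it" contract of
org.apache.nutch.net.URLFilter, and the class name MaxLengthUrlFilter and its
constructor argument are hypothetical. A real plugin would also need the usual
plugin descriptor and configuration wiring, which is omitted here.

```java
// Local stand-in for the filter contract (an assumption for this sketch):
// return the URL unchanged to accept it, or null to reject it.
interface SimpleUrlFilter {
    String filter(String urlString);
}

// Hypothetical filter that rejects URLs longer than a configured limit.
final class MaxLengthUrlFilter implements SimpleUrlFilter {
    private final int maxLength;

    MaxLengthUrlFilter(int maxLength) {
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        this.maxLength = maxLength;
    }

    @Override
    public String filter(String urlString) {
        // Length is counted in Java chars, not encoded bytes; for plain
        // ASCII URLs the two are the same.
        if (urlString == null || urlString.length() > maxLength) {
            return null; // reject: missing or too long
        }
        return urlString; // accept unchanged
    }
}
```

With a limit of 4000, new MaxLengthUrlFilter(4000).filter(url) would return
the URL for ordinary outlinks and null for oversized ones, so the caller can
skip them before they are stored.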

> Problems managing outlinks with large url length
> ------------------------------------------------
>
>                 Key: NUTCH-802
>                 URL: https://issues.apache.org/jira/browse/NUTCH-802
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Pablo Aragón
>            Assignee: Andrzej Bialecki 
>         Attachments: ParseOutputFormat.patch
>
>
> Nutch can hang while collecting outlinks if the URL of an outlink is too 
> long.
> The maximum URL sizes for the main web servers are:
>     * Apache: 4,000 bytes
>     * Microsoft Internet Information Server (IIS): 16,384 bytes
>     * Perl HTTP::Daemon: 8,000 bytes
> URLs longer than 4,000 bytes are problematic, so the limit should be set in 
> the nutch-default.xml configuration file.
> I have attached a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
