[jira] Created: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread JIRA
Problems managing outlinks with large url length Key: NUTCH-802 URL: https://issues.apache.org/jira/browse/NUTCH-802 Project: Nutch Issue Type: Bug Components: parser

[jira] Updated: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Aragón updated NUTCH-802: --- Attachment: ParseOutputFormat.patch Problems managing outlinks with large url length

[jira] Closed: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Aragón closed NUTCH-802. -- Resolution: Fixed Problems managing outlinks with large url length

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-18 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846865#action_12846865 ] Jukka Zitting commented on NUTCH-797: - I guess we need to apply the same logic also to

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846910#action_12846910 ] Julien Nioche commented on NUTCH-762: - OK, there was indeed an assumption that the

[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846923#action_12846923 ] Andrzej Bialecki commented on NUTCH-797: - That's one option, at least until the

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846927#action_12846927 ] Andrzej Bialecki commented on NUTCH-762: - In my experience the IP-based fetching

[jira] Reopened: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-802: - Assignee: Andrzej Bialecki Submitting a patch is not fixing, it's fixed when the patch

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846930#action_12846930 ] Julien Nioche commented on NUTCH-762: - Yes, I came across that situation too on a large

[jira] Commented: (NUTCH-802) Problems managing outlinks with large url length

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846932#action_12846932 ] Andrzej Bialecki commented on NUTCH-802: - We already have a general way to control

Crawling authenticated websites !

2010-03-18 Thread Ranganath Cuddapah
Hello, Is there a way to configure Nutch to crawl form-authenticated websites? What I mean is the kind of websites which look up a database for authentication/authorization and do not allow you to view secure pages unless authenticated. This need not be specifically on https, but on http

Re: Crawling authenticated websites !

2010-03-18 Thread Susam Pal
On Thu, Mar 18, 2010 at 7:27 PM, Ranganath Cuddapah ranganat...@gmail.com wrote: Hello, Is there a way to configure Nutch to crawl form-authenticated websites? What I mean is the kind of websites which look up a database for authentication/authorization and do not allow you to view secure

[jira] Closed: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-796. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Patch

[jira] Commented: (NUTCH-800) Generator builds a URL list that is not encoded

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847071#action_12847071 ] Andrzej Bialecki commented on NUTCH-800: - I'm puzzled by your problem description.

[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847074#action_12847074 ] Andrzej Bialecki commented on NUTCH-693: - This patch is controversial in the sense

[jira] Commented: (NUTCH-795) Add ability to maintain nofollow attribute in linkdb

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847075#action_12847075 ] Andrzej Bialecki commented on NUTCH-795: - Please see my comment to that issue. Or

[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files

2010-03-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847094#action_12847094 ] Andrzej Bialecki commented on NUTCH-780: - Is the purpose of this issue to make

[jira] Commented: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging

2010-03-18 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847219#action_12847219 ] Hudson commented on NUTCH-796: -- Integrated in Nutch-trunk #1100 (See