From: Oscar <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: subscribe
This is how you are trying to subscribe. This is incorrect. You should
send a mail to the following email address to subscribe to the mailing
list.
[EMAIL PROTECTED]
Regards,
Susam Pal
http://susam.in/
On 6/30/07,
Is it that the interface 'org.apache.nutch.net.URLFilter' was compiled
with JDK 1.5 earlier? I have seen this problem happening with a beta
version of JDK 1.6.
Are you using the latest version, JDK 1.6 Update 2?
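
One way to check which JDK level a class such as the interface above was compiled for is to read the version field in the class file header. The following is only a minimal sketch: the class name `ClassVersionSketch` and the method `majorVersion` are made up for illustration and are not part of Nutch. It relies on the standard class file layout, in which a 4-byte magic number is followed by the minor and major version; major version 49 corresponds to JDK 1.5 and 50 to JDK 1.6.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Nutch.
class ClassVersionSketch {

    // Returns the major version stored in the header of a compiled class file.
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short to be a class file");
        }
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("missing class file magic number");
        }
        buf.getShort();                 // skip the minor version
        return buf.getShort() & 0xFFFF; // major version as an unsigned value
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a .class file extracted from the jar in question.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("major version: " + majorVersion(bytes));
    }
}
```

If the reported value is 49, the class was compiled for JDK 1.5; a value of 50 means JDK 1.6.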
Regards,
Susam Pal
http://susam.in/
On 9/11/07, Doğacan Güney <[EM
where. Any suggestions?
Regards,
Susam Pal
http://susam.in/
-Jim
Jim,
Have you tried parse-pdf?
Regards,
Susam Pal
http://susam.in/
mailing list.
Regards,
Susam Pal
http://susam.in/
On 10/10/07, Christopher Bader <[EMAIL PROTECTED]> wrote:
> I ran Nutch on a subset of Wikipedia, and it works. But for each search it
> always gives exactly two choices.
>
>
>
> How do I configure it so that it gives (a) N
t is not a bug in Nutch 0.9. This
looks like a configuration problem at your end. Please discuss this
properly in [EMAIL PROTECTED] instead of submitting it as a
bug in Nutch.
Regards,
Susam Pal
On Jan 8, 2008 7:16 AM, sudarat (JIRA) <[EMAIL PROTECTED]> wrote:
> nutch craw
I wanted to send this as a private reply but sent it to the list
instead. Sorry for the inconvenience.
On Jan 8, 2008 10:21 AM, Susam Pal <[EMAIL PROTECTED]> wrote:
> I replied to this query of yours yesterday in
> [EMAIL PROTECTED] If you haven't received the reply,
> p
r in Linux. I am not well
acquainted with the Hadoop code yet. Could someone throw light on what
might be going wrong?
Regards,
Susam Pal
On 2/7/08, DS jha <[EMAIL PROTECTED]> wrote:
Hi -
>
> Looks like latest trunk version of nutch is failing with the following
> exception when
this failed with the same error.
Right now I don't have a Windows system with me. I will try setting it
as /cygdrive/d/tmp/ tomorrow when I again have access to a Windows
system and then I'll update the mailing list with the observations.
Thanks for the suggestion.
Regards,
Susam Pal
O
org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:426)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:165)
Regards,
Susam Pal
On Thu, Feb 14, 2008 at 10:07 PM, Susam Pal <[EMAIL PROTECTED]> wrote:
> What I did try was setting hadoop.tmp.dir to /opt/tmp. I found the
>
I still can't see any DEBUG logs in your log file. Did you go through
my earlier mail?
Regards,
Susam Pal
On Wed, Mar 12, 2008 at 9:39 PM, <[EMAIL PROTECTED]> wrote:
>
> Hi All,
>
> I am facing a problem in running nutch where the proxy authentication is
> requ
some valuable
work can be done. What do you say?
Regards,
Susam Pal
interact with the community and his assigned mentor
through the mailing list and since the whole community is there to
guide him, there is not much of a burden on the mentor.
Regards,
Susam Pal
On Sun, Mar 30, 2008 at 8:55 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> How much of a time commitm
I agree with John too. You probably meant $0.02, since 0.02 cents is too
little. It is usually 2 cents. :-P
Regards,
Susam Pal
On Tue, Dec 2, 2008 at 6:09 PM, John Martyniak <[EMAIL PROTECTED]> wrote:
> Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr
> integ
es 39 to 40) :
conf.addResource("nutch-default.xml");
conf.addResource("nutch-site.xml");
So, shouldn't that XML comment be removed from 'conf/crawl-tool.xml' ?
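
The ordering concern can be illustrated with a small stand-alone sketch. It does not use the Hadoop or Nutch API: the class `ConfigLayeringSketch`, the method `load` and the property name `example.property` are all hypothetical. It assumes that a resource added later overrides the same property from a resource added earlier, and that 'crawl-tool.xml' is added after 'nutch-site.xml', as the quoted message below suggests.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a layered configuration; not the Hadoop API.
class ConfigLayeringSketch {

    // Merges resources in the order given; a later resource overrides an earlier one.
    static Map<String, String> load(List<Map<String, String>> resourcesInOrder) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> resource : resourcesInOrder) {
            merged.putAll(resource);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Illustrative property name and values only.
        Map<String, String> nutchDefault = Map.of("example.property", "from nutch-default.xml");
        Map<String, String> nutchSite = Map.of("example.property", "from nutch-site.xml");
        Map<String, String> crawlTool = Map.of("example.property", "from crawl-tool.xml");

        // crawl-tool.xml added last: nutch-site.xml cannot override it.
        System.out.println(load(List.of(nutchDefault, nutchSite, crawlTool)).get("example.property"));
        // prints "from crawl-tool.xml"

        // crawl-tool.xml added before nutch-site.xml: the site file wins.
        System.out.println(load(List.of(nutchDefault, crawlTool, nutchSite)).get("example.property"));
        // prints "from nutch-site.xml"
    }
}
```

Under these assumptions, the comment in 'conf/crawl-tool.xml' would hold only if the load order were the second one.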
Regards,
Susam Pal
On Tue, Apr 7, 2009 at 1:07 AM, Susam Pal wrote:
> The inline documentation of 'conf/crawl-tool.xml' mentions:
>
>
>
>
>
> However, I don't see any way of overriding the properties defined in
> 'conf/crawl-tool.xml' as 'conf/nutch-site.x
a 6.
The subversion details are available at:
http://lucene.apache.org/nutch/version_control.html
Regards,
Susam Pal
e right place ask this. I've
included it in CC.
This feature is not present in Nutch. We have recorded the summary of
some old discussions regarding this here:
http://wiki.apache.org/nutch/HttpPostAuthentication But this was never
implemented.
Regards,
Susam Pal
[
https://issues.apache.org/jira/browse/NUTCH-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-44:
---
Attachment: NUTCH-44.patch
Attached a patch.
To apply:-
patch -p0 < NUTCH-44.patch
ant war
cp bu
[
https://issues.apache.org/jira/browse/NUTCH-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-44:
---
Attachment: (was: NUTCH-44.patch)
> too many search resu
[
https://issues.apache.org/jira/browse/NUTCH-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-44:
---
Attachment: NUTCH-44.patch
Updated my previous patch to fix the issue in opensearch too.
To apply:-
patch
[
https://issues.apache.org/jira/browse/NUTCH-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-281:
Attachment: NUTCH-281.patch
Uploading a patch.
Put the tag outside comments and now the relative links in
Type: Improvement
Components: fetcher
Affects Versions: 1.0.0
Reporter: Susam Pal
'protocol-http11' is a protocol plugin which supports retrieving documents via
the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and
NTLM authentication s
[
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-557:
Attachment: protocol-http11v0.1.patch
I have generated this patch against Nutch trunk.
To apply:-
patch
[
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-557:
Priority: Minor (was: Major)
> protocol-http11 for HTTP 1.1, HTTPS, NTLM, Basic and Digest Authenticat
[
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528854
]
Susam Pal commented on NUTCH-557:
-
No, there isn't any significant difference in performance. Here's a l
[
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529528
]
Susam Pal commented on NUTCH-557:
-
Thank you, Doğacan and Andrzej for your comments. I started developing it in a
[
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529530
]
Susam Pal commented on NUTCH-557:
-
Point no. 2 of my previous comment is incorrect. The SSL related files are
being
Components: fetcher
Affects Versions: 1.0.0
Reporter: Susam Pal
Priority: Minor
Added basic, digest and NTLM authentication schemes to protocol-httpclient. The
authentication schemes can be configured for proxy server as well as web
servers of a domain
[
https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-559:
Attachment: NUTCH-559v0.1.patch
I have generated this patch against Nutch trunk. It will add support for
[
https://issues.apache.org/jira/browse/NUTCH-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal closed NUTCH-557.
---
Resolution: Won't Fix
As per the discussion, 'protocol-http11' has been turned into a patch
[
https://issues.apache.org/jira/browse/NUTCH-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530175
]
susam edited comment on NUTCH-539 at 9/25/07 10:54 AM:
---
1. There is a bug in the patch. The
[
https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-559:
Priority: Major (was: Minor)
Apart from adding the authentication features, this patch would fix three
[
https://issues.apache.org/jira/browse/NUTCH-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530519
]
Susam Pal commented on NUTCH-560:
-
I analysed 'protocol-http' and it behaves almost in the same man
[
https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-559:
Attachment: NUTCH-559v0.2.patch
Uploading a revised (v0.2) patch which accommodates most of the suggestions
[
https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-559:
Attachment: NUTCH-559v0.3.patch
Uploading a revised (v0.3) patch that allows flexible authentication
[
https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-559:
Attachment: NUTCH-559v0.4.patch
Uploading a revised (v0.4) patch that has all authentication configuration
[
https://issues.apache.org/jira/browse/NUTCH-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-559:
Attachment: NUTCH-559v0.5.patch
Uploading a revised (v0.5) patch with some test cases. Added a 's
[
https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-601:
Attachment: NUTCH-601v0.2.patch
Attached a revised patch (NUTCH-601v0.2.patch), which removes the old
[
https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-601:
Attachment: NUTCH-601v0.1.patch
Patch attached.
> Recrawling on existing crawl directory using fo
Versions: 1.0.0
Reporter: Susam Pal
Priority: Minor
Added a '-force' option to the 'bin/nutch crawl' command line. With this
option, one can crawl and recrawl in the following manner:
{code}
bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
bin/n
[
https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565848#action_12565848
]
Susam Pal commented on NUTCH-601:
-
The 'if (newIndex != index)' condition i
[
https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-601:
Attachment: NUTCH-601v1.0.patch
Attached another patch (NUTCH-601v1.0.patch) that always deletes the old
[
https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-601:
Attachment: NUTCH-601v0.3.patch
Attached a revised patch (NUTCH-601v0.3.patch) that makes the code simpler
[
https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-612:
Attachment: NUTCH-612v0.1.patch
Attached patch to fix the bug. This modifies Crawl.java and Generator.java
Components: generator
Affects Versions: 1.0.0
Reporter: Susam Pal
Fix For: 1.0.0
When a crawl is done using the 'bin/nutch crawl' command, no filtering is done
in Generator even if 'crawl.generate.filter' is set to true in the
configuration f
[
https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573790#action_12573790
]
Susam Pal commented on NUTCH-601:
-
It continues the recrawl using the existing c
Issue Type: Bug
Components: web gui
Affects Versions: 1.0.0
Reporter: Susam Pal
Priority: Minor
The inline documentation of 'conf/crawl-tool.xml' mentions:
{code:xml}
{code}
However, I don't see any way of overriding the proper
[
https://issues.apache.org/jira/browse/NUTCH-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Susam Pal updated NUTCH-735:
Attachment: NUTCH-735v0.1.patch
Attached patch.
> crawl-tool.xml must be read before nutch-site.xml w