[jira] [Commented] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-04 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143984#comment-13143984 ] Radim Kolar commented on NUTCH-1070: I closed it because I removed my patches, I

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: patch-with-utf8-encoding.diff) better url-normalizer basic

[jira] [Resolved] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar resolved NUTCH-1098. Resolution: Invalid Attached patch was in improper format. better url-normalizer

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144020#comment-13144020 ] Radim Kolar commented on NUTCH-1098: By removing my patch I also withdraw permission

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144183#comment-13144183 ] Radim Kolar commented on NUTCH-1098: Remove my patch from this ticket. I hold

[jira] [Updated] (NUTCH-1194) CrawlDB lock should be released earlier

2011-11-03 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1194: --- Comment: was deleted (was: locking should be done in setup/cleanup task. Currently if you kill

[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1070: --- Attachment: (was: nutch.bat) Run nutch under native windows (no cygwin

[jira] [Resolved] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar resolved NUTCH-1070. Resolution: Won't Fix Run nutch under native windows (no cygwin

[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1070: --- Attachment: (was: bash.c) Run nutch under native windows (no cygwin

[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1070: --- Attachment: (was: chmod.c) Run nutch under native windows (no cygwin

[jira] [Commented] (NUTCH-1194) CrawlDB lock should be released earlier

2011-11-02 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142544#comment-13142544 ] Radim Kolar commented on NUTCH-1194: Locking should be done in setup/cleanup task

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-02 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142699#comment-13142699 ] Radim Kolar commented on NUTCH-1098: a/ Please direct your complaints about quality

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-02 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: patch-urlnormalizer.diff) better url-normalizer basic

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-02 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: patch-with-utf8-encoding.diff Added support for encoding string to UTF-8 and then URL

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-10-24 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: patch-urlnormalizer.diff) better url-normalizer basic

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-15 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128264#comment-13128264 ] Radim Kolar commented on NUTCH-1098: Browsers seem to send spaces in URLs encoded like

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-14 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127440#comment-13127440 ] Radim Kolar commented on NUTCH-1098: I did, but due to lack of time to test what

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-13 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126594#comment-13126594 ] Radim Kolar commented on NUTCH-1098: Patch is good. I will add replace high bit chars

injector in nutch-1.4

2011-10-13 Thread Radim Kolar
I have problems running the injector in nutch-1.4 on Hadoop; the same command works fine with nutch-1.3. As you can see, the list of URLs is loaded from HDFS correctly (Map input records=66906), but there are no records in the map output. Could it be a problem with broken filtering?

Re: injector in nutch-1.4

2011-10-13 Thread Radim Kolar
Let me know if anybody has got the injector to work in the 1.4 branch. I have Hadoop 0.20.204.0 and can't make it insert a single URL

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-05 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120912#comment-13120912 ] Radim Kolar commented on NUTCH-1098: 1. Some servers send spaces in URLs 2. Based

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-05 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120966#comment-13120966 ] Radim Kolar commented on NUTCH-1098: Actually it might be even better to add

Re: Prepare for 1.4 release?

2011-09-27 Thread Radim Kolar
Can you add NUTCH-1098 to 1.4?

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-09-25 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: nutch.diff Updated patch. It also normalizes unprintable % sequences to upper case. Like

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-09-25 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: urlnormalizer.patch) better url-normalizer basic

Re: [VOTE] Move 2.0 out of trunk

2011-09-19 Thread Radim Kolar
I'm glad to hear that there are at least 2 people in the community that do business in their field and proudly use a Nutch-based crawler together with Cassandra to store the data through Gora. That would not have been possible with the Nutch 1.x version. What about dropping Gora, because it is

Re: [VOTE] Move 2.0 out of trunk

2011-09-18 Thread Radim Kolar
-1. I don't want to mark release 2.0 as unmaintained. The Cassandra backend works really well for us and fixed performance problems with the Hadoop database. Instead of moving it out of trunk, recruit more people to come and fix the open problems. Don't give up.

[jira] [Commented] (NUTCH-937) When nutch is run on hadoop 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)

2011-08-26 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091740#comment-13091740 ] Radim Kolar commented on NUTCH-937: --- We should stick with Hadoop 0.20.203.0, not CDH

[jira] [Created] (NUTCH-1098) better url-normalizer basic

2011-08-25 Thread Radim Kolar (JIRA)
Environment: Any. Reporter: Radim Kolar. Fix For: 1.4. Attachments: urlnormalizer.patch. The basic URL normalizer lacks 2 important features: (1) encoding a space in a URL as %20, to unbreak httpclient and possibly other clients that do not expect a space inside a URL; (2) the ability to decode %33 encoding
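
The two features requested in this ticket can be illustrated with a minimal Java sketch. This is not the code from urlnormalizer.patch, which is not shown in this archive. The class name `UrlEscapeSketch` and its methods are hypothetical, and the reading of "decode %33 encoding" as "decode percent-escapes of RFC 3986 unreserved characters, such as %33 for the digit 3" is an assumption.

```java
/** Hypothetical helper for illustration only; not part of Nutch or the patch. */
final class UrlEscapeSketch {

    /** RFC 3986 unreserved characters, which never need percent-escaping. */
    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    /** Value of one ASCII hex digit, or -1 if the character is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Value of a two-digit hex pair, or -1 if either digit is invalid. */
    private static int decodeHexPair(char high, char low) {
        int h = hexValue(high);
        int l = hexValue(low);
        return (h < 0 || l < 0) ? -1 : h * 16 + l;
    }

    /**
     * Encodes literal spaces as %20 and decodes escapes of unreserved
     * characters (assumed meaning of "decode %33"). Everything else,
     * including escapes of reserved characters and malformed escapes,
     * is copied through unchanged.
     */
    static String normalize(String url) {
        StringBuilder out = new StringBuilder(url.length() + 8);
        int i = 0;
        while (i < url.length()) {
            char ch = url.charAt(i);
            if (ch == ' ') {
                out.append("%20");
                i++;
                continue;
            }
            int value = -1;
            if (ch == '%' && i + 2 < url.length()) {
                value = decodeHexPair(url.charAt(i + 1), url.charAt(i + 2));
            }
            if (value >= 0 && isUnreserved(value)) {
                out.append((char) value); // e.g. %33 becomes 3
                i += 3;
            } else {
                out.append(ch);
                i++;
            }
        }
        return out.toString();
    }
}
```

Escapes of reserved characters such as %2F are deliberately left alone in this sketch, because decoding them would change the meaning of the URL.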

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-08-25 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: urlnormalizer.patch Patch against branch-1.4 better url-normalizer basic

Re: The crawl command, keep or get rid of

2011-08-23 Thread Radim Kolar
I agree. Nuke crawl command

[jira] [Commented] (NUTCH-990) protocol-httpclient fails with short pages

2011-08-21 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088443#comment-13088443 ] Radim Kolar commented on NUTCH-990: --- I have this problem too protocol-httpclient fails

Re: Unreleased Gora dependencies in Nutch Trunk build

2011-08-19 Thread Radim Kolar
In a nutshell you can't use Ivy or Maven for the Gora dependency, which is why we are currently stuck with the trunk and can't compile it without first downloading and compiling GORA locally. I compiled gora-*-0.2-incubating.jars locally. Where should I put them to get Nutch trunk compiled?