[
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143984#comment-13143984
]
Radim Kolar commented on NUTCH-1070:
i closed it because i removed my patches, i
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1098:
---
Attachment: (was: patch-with-utf8-encoding.diff)
better url-normalizer basic
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar resolved NUTCH-1098.
Resolution: Invalid
Attached patch was in improper format.
better url-normalizer
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144020#comment-13144020
]
Radim Kolar commented on NUTCH-1098:
By removing my patch i also withdraw permission
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144183#comment-13144183
]
Radim Kolar commented on NUTCH-1098:
Remove my patch from this ticket. I hold
[
https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1194:
---
Comment: was deleted
(was: locking should be done in setup/cleanup task. Currently if you kill
[
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1070:
---
Attachment: (was: nutch.bat)
Run nutch under native windows (no cygwin
[
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar resolved NUTCH-1070.
Resolution: Won't Fix
Run nutch under native windows (no cygwin
[
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1070:
---
Attachment: (was: bash.c)
Run nutch under native windows (no cygwin
[
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1070:
---
Attachment: (was: chmod.c)
Run nutch under native windows (no cygwin
[
https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142544#comment-13142544
]
Radim Kolar commented on NUTCH-1194:
locking should be done in setup/cleanup task
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142699#comment-13142699
]
Radim Kolar commented on NUTCH-1098:
a/ Please direct your complaints about quality
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1098:
---
Attachment: (was: patch-urlnormalizer.diff)
better url-normalizer basic
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1098:
---
Attachment: patch-with-utf8-encoding.diff
Added support for encoding string to UTF-8 and then URL
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1098:
---
Attachment: (was: patch-urlnormalizer.diff)
better url-normalizer basic
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128264#comment-13128264
]
Radim Kolar commented on NUTCH-1098:
Browsers seem to send spaces in URL encoded like
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13127440#comment-13127440
]
Radim Kolar commented on NUTCH-1098:
I did, but due to lack of time to test what
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13126594#comment-13126594
]
Radim Kolar commented on NUTCH-1098:
Patch is good. i will add replace high bit chars
I have problems running the injector in nutch-1.4 on Hadoop; the same
command works fine with nutch-1.3. As you can see, the list of URLs is
loaded from HDFS correctly (Map input records=66906), but there are no
records in the map output. Could it be a problem with broken filtering?
Let me know if anybody has got the injector to work in the 1.4 branch. I have
Hadoop 0.20.204.0 and can't get it to insert a single URL.
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120912#comment-13120912
]
Radim Kolar commented on NUTCH-1098:
1. Some servers send spaces in URLs
2. Based
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120966#comment-13120966
]
Radim Kolar commented on NUTCH-1098:
Actually it might be even better to add
Can you add NUTCH-1098 to 1.4?
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1098:
---
Attachment: nutch.diff
Updated patch. It also normalizes unprintable % sequences to upper case. Like
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1098:
---
Attachment: (was: urlnormalizer.patch)
better url-normalizer basic
I'm glad to hear that there are at least 2 people in the community that
do business in their field and proudly use a Nutch-based crawler
together with
Cassandra to store the data through Gora. That would not have been
possible with Nutch 1.x version.
What about dropping Gora, because it is
-1
I don't want to mark release 2.0 as unmaintained. The Cassandra backend
works really well for us and fixed performance problems with the Hadoop
database. Instead of moving it out of trunk, more people should be recruited
to come and fix the open problems. Don't give up.
[
https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13091740#comment-13091740
]
Radim Kolar commented on NUTCH-937:
---
We should stick with Hadoop 0.20.203.0, not CDH
Environment: Any
Reporter: Radim Kolar
Fix For: 1.4
Attachments: urlnormalizer.patch
The basic URL normalizer lacks 2 important features:
1. Encoding a space in a URL as %20, to unbreak httpclient and possibly others
that do not expect a space inside a URL
2. The ability to decode %33 encoding
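
The two features described above can be sketched roughly as follows. This is only an illustrative sketch, not the attached patch and not Nutch's actual normalizer code: the class name UrlNormalizeSketch and the method normalize are hypothetical, and the choice to decode only RFC 3986 "unreserved" characters (letters, digits, '-', '.', '_', '~') is an assumption. Upper-casing the escapes that stay encoded follows the later patch note on this ticket.

```java
import java.util.Locale;

// Hypothetical sketch; not the code from the NUTCH-1098 patch.
final class UrlNormalizeSketch {

    // Assumption: only RFC 3986 "unreserved" characters are safe to decode.
    private static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // ASCII-only hex digit check.
    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    static String normalize(String url) {
        StringBuilder out = new StringBuilder(url.length() + 8);
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == ' ') {
                // Feature 1: a literal space becomes %20.
                out.append("%20");
            } else if (c == '%' && i + 2 < url.length()
                    && isHex(url.charAt(i + 1)) && isHex(url.charAt(i + 2))) {
                String hex = url.substring(i + 1, i + 3);
                int value = Integer.parseInt(hex, 16);
                if (isUnreserved(value)) {
                    // Feature 2: %33 decodes to '3'.
                    out.append((char) value);
                } else {
                    // Escapes that must stay encoded get upper-case hex digits.
                    out.append('%').append(hex.toUpperCase(Locale.ROOT));
                }
                i += 2; // skip the two hex digits just consumed
            } else {
                // Everything else, including a malformed '%', is copied as-is.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints http://example.com/a%20b/3?q=%2F
        System.out.println(normalize("http://example.com/a b/%33?q=%2f"));
    }
}
```

Reserved characters such as %2F are deliberately left encoded here, since decoding them could change how the URL is parsed.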
[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Radim Kolar updated NUTCH-1098:
---
Attachment: urlnormalizer.patch
Patch against branch-1.4
better url-normalizer basic
I agree. Nuke crawl command
[
https://issues.apache.org/jira/browse/NUTCH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13088443#comment-13088443
]
Radim Kolar commented on NUTCH-990:
---
I have this problem too protocol-httpclient fails
in a nutshell you can't use Ivy or Maven for the Gora dependency,
which is why we are currently stuck with the trunk and can't compile it
without first downloading and compiling GORA locally.
I compiled gora-*-0.2-incubating.jars locally. Where should I put them
to get Nutch trunk compiled?