[
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860114#comment-13860114
]
Markus Jelsma commented on NUTCH-1325:
--
Hi Tejas - I think most seems fine now and I
[
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860121#comment-13860121
]
Markus Jelsma commented on NUTCH-1080:
--
+1!
Type safe members, arguments for
[
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860124#comment-13860124
]
Markus Jelsma commented on NUTCH-1670:
--
+1
set same crawldb directory in mergedb
[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860140#comment-13860140
]
Markus Jelsma commented on NUTCH-1360:
--
Almost all unit tests fail due to improper
[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma reopened NUTCH-1360:
--
Suport the storing of IP address connected to when web crawling
[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860142#comment-13860142
]
Markus Jelsma commented on NUTCH-1360:
--
{code}
--- conf/nutch-default.xml
[
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-356:
Attachment: NUTCH-356-trunk.patch
Updated patch for trunk. All tests pass.
According to
[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860153#comment-13860153
]
Markus Jelsma commented on NUTCH-1360:
--
Committed revision 1554791.
Suport the
[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma resolved NUTCH-1360.
--
Resolution: Fixed
This issue is not in 2.x, just trunk. All tests pass again.
Suport the
See https://builds.apache.org/job/Nutch-trunk/2472/changes
Changes:
[markus] NUTCH-1360 fix entity in configuration
--
[...truncated 6752 lines...]
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file =
[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860222#comment-13860222
]
Hudson commented on NUTCH-1360:
---
FAILURE: Integrated in Nutch-trunk #2472 (See
Markus Jelsma created NUTCH-1691:
Summary: DomainBlacklist url filter does not allow -D filter file
override
Key: NUTCH-1691
URL: https://issues.apache.org/jira/browse/NUTCH-1691
Project: Nutch
[
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860281#comment-13860281
]
Markus Jelsma commented on NUTCH-1691:
--
This means existing behaviour is unchanged,
[
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1691:
-
Attachment: NUTCH-1691-trunk.patch
Patch for trunk. This fixes the issue by defaulting it in
[
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860288#comment-13860288
]
Markus Jelsma commented on NUTCH-1691:
--
Well, there is a small issue now:
{code}
WARN
Markus Jelsma created NUTCH-1692:
Summary: SegmentReader broken in distributed mode
Key: NUTCH-1692
URL: https://issues.apache.org/jira/browse/NUTCH-1692
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1692:
-
Attachment: NUTCH-1692-trunk.patch
Patch for trunk. Fix works, issue is gone.
SegmentReader
[
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860643#comment-13860643
]
Tejas Patil commented on NUTCH-1080:
Committed to trunk (rev 1554881). Will port the
[
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860678#comment-13860678
]
Tejas Patil commented on NUTCH-1691:
Hi [~markus17],
It's a good solution. +1 from me.
[
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860739#comment-13860739
]
Hudson commented on NUTCH-1080:
---
FAILURE: Integrated in Nutch-trunk #2473 (See
[
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860740#comment-13860740
]
Hudson commented on NUTCH-1670:
---
FAILURE: Integrated in Nutch-trunk #2473 (See
[
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860803#comment-13860803
]
Tejas Patil commented on NUTCH-1454:
TIKA-1122 is fixed and I have verified that
Thanks for all the responses; they are very inspiring, and diving into the
log level is very beneficial for learning Nutch.
The fact is that I use Python BeautifulSoup to parse the sitemap of my
target website, which yields those 150K URLs. However, it turned
out that there are many many
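
For readers following along, the sitemap step mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the post uses BeautifulSoup, whereas this sketch swaps in the standard-library XML parser so that it stays self-contained, and the function name `extract_urls` and the sample sitemap are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset>, <url>, <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element, in document order.

    Hypothetical helper; the original post used BeautifulSoup instead.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical two-entry sitemap, used only to exercise the helper.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc> http://example.com/b </loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_urls(SAMPLE):
        print(url)
```

The resulting list could then be written one URL per line into a seed file for injection.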
Hi,
I have a robot that scrapes a website daily and so far stores the HTML locally
(in Nutch binary format in the segment/content folder).
The scrape is fairly big: a million pages per day.
One thing about the HTML pages themselves is that they follow exactly the
same format, so I can
[
https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142
]
Tien Nguyen Manh commented on NUTCH-1686:
-
In this patch I also fixed a bug with
See https://builds.apache.org/job/Nutch-trunk/2474/
--
[...truncated 6749 lines...]
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file =
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml
compile:
[
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tien Nguyen Manh updated NUTCH-1693:
Issue Type: New Feature (was: Bug)
TextMD5Signatue compute on textual content
[
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tien Nguyen Manh updated NUTCH-1693:
Fix Version/s: 2.3
TextMD5Signatue compute on textual content
[
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861195#comment-13861195
]
Tien Nguyen Manh commented on NUTCH-1693:
-
This patch only works with a minor
Here is what I would do:
If you are running a crawl, let it run with the default parser. Write a Nutch
plugin with your customized parse implementation to evaluate your parse
logic. Now get some real segments (with a subset of those million pages)
and run only the 'bin/nutch parse' command and see how
The config 'fs.default.name' of core-site.xml is what makes this happen.
Its default value is file:///, which corresponds to the local mode of Hadoop.
In local mode, Hadoop looks for paths on the local file system. In the
distributed mode of Hadoop, 'fs.default.name' would be
hdfs://IP_OF_NAMENODE/ and it
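
To make the effect of 'fs.default.name' concrete, here is a small sketch of how a scheme-less path ends up on a different file system depending on that setting. It is an illustration only, not Hadoop's actual resolution code: the function name `qualify`, the host name `namenode` and the port 8020 are assumptions, and only absolute paths are handled.

```python
from urllib.parse import urlparse


def qualify(path, fs_default_name="file:///"):
    """Resolve an absolute path against the default file system URI.

    Hypothetical helper that mimics the idea behind 'fs.default.name';
    it is not part of Hadoop or Nutch.
    """
    # A path that already carries a scheme (hdfs://, file://) is left alone.
    if urlparse(path).scheme:
        return path
    default = urlparse(fs_default_name)
    # Combine the default scheme and authority with the scheme-less path.
    return "{}://{}/{}".format(default.scheme, default.netloc, path.lstrip("/"))


if __name__ == "__main__":
    # Local mode: the default value file:/// points at the local disk.
    print(qualify("/user/nutch/crawl/segments"))
    # Distributed mode: the same path now refers to HDFS (assumed host/port).
    print(qualify("/user/nutch/crawl/segments", "hdfs://namenode:8020/"))
```

The same scheme-less segment path therefore points at the local disk in one configuration and at HDFS in the other, which is why a path that works in local mode may not be found in distributed mode.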
[
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861217#comment-13861217
]
Tejas Patil commented on NUTCH-356:
---
+1 for commit.
Plugin repository cache can lead to