[jira] [Commented] (NUTCH-1944) Add raw content to indexes

2015-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332417#comment-14332417 ] Sebastian Nagel commented on NUTCH-1944: This issue duplicates NUTCH-1785 but this

[jira] [Updated] (NUTCH-1870) Generic xsl parser plugin

2015-02-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1870: --- Attachment: NUTCH-1870-trunk-v4.patch New patch including: * load all configuration files from

[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338198#comment-14338198 ] Sebastian Nagel commented on NUTCH-1950: Is it really a good idea to take the syst

[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338684#comment-14338684 ] Sebastian Nagel commented on NUTCH-1950: Great! For an MD5 calculation, see o.a.had

[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions

2015-03-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357691#comment-14357691 ] Sebastian Nagel commented on NUTCH-1957: Just a few thoughts to finally solve this

[jira] [Commented] (NUTCH-1956) Members to be public in URLCrawlDatum

2015-03-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358349#comment-14358349 ] Sebastian Nagel commented on NUTCH-1956: +1 > Members to be public in URLCrawlDat

[jira] [Commented] (NUTCH-1967) Possible SIooBE in MimeAdaptiveFetchSchedule

2015-03-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366044#comment-14366044 ] Sebastian Nagel commented on NUTCH-1967: +1 MimeUtil.cleanMimeType() could be an a

[jira] [Commented] (NUTCH-1971) The crawldb.url.filters property is not present in any configuration file

2015-03-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371018#comment-14371018 ] Sebastian Nagel commented on NUTCH-1971: +1 Since NUTCH-1786 crawldb.url.filters a

[jira] [Commented] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375693#comment-14375693 ] Sebastian Nagel commented on NUTCH-1941: Hi [~asitangm], thanks! The patch needs s

[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376227#comment-14376227 ] Sebastian Nagel commented on NUTCH-1958: Scoring-opic is not that bad, scores are

[jira] [Commented] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-24 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378821#comment-14378821 ] Sebastian Nagel commented on NUTCH-1941: Great, that's a step forward. Before goin

[jira] [Commented] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-25 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379895#comment-14379895 ] Sebastian Nagel commented on NUTCH-1941: It's not about concurrent write accesses

[jira] [Commented] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381679#comment-14381679 ] Sebastian Nagel commented on NUTCH-1941: Solution 2 is simpler because it does not

[jira] [Updated] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1941: --- Attachment: NUTCH-1941-v5.patch Attached new patch v5 - including descriptions in nutch-defaul

[jira] [Updated] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1941: --- Patch Info: Patch Available Affects Version/s: 2.3 1.9

[jira] [Assigned] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1941: -- Assignee: Sebastian Nagel > Optional rolling http.agent.name's > --

[jira] [Commented] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384636#comment-14384636 ] Sebastian Nagel commented on NUTCH-1941: Great! Also protocol-httpclient is now ro

[jira] [Work started] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1941 started by Sebastian Nagel. -- > Optional rolling http.agent.name's > -- > >

[jira] [Updated] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1941: --- Attachment: NUTCH-1941-2x-v6.patch Patch for 2.x > Optional rolling http.agent.name's > -

[jira] [Resolved] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1941. Resolution: Fixed Committed to trunk and 2.x, r1669692. Thanks, [~asitang]! > Optional roll

[jira] [Commented] (NUTCH-1979) CrawlDbReader to implement Tool

2015-03-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388715#comment-14388715 ] Sebastian Nagel commented on NUTCH-1979: +1 > CrawlDbReader to implement Tool > -

[jira] [Commented] (NUTCH-1979) CrawlDbReader to implement Tool

2015-03-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389221#comment-14389221 ] Sebastian Nagel commented on NUTCH-1979: Needs a trivial fix in TestCrawlDbMerger:

[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-03-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389351#comment-14389351 ] Sebastian Nagel commented on NUTCH-1771: From [~chongli] in NUTCH-1978: {quote} S

[jira] [Resolved] (NUTCH-1978) solrindex will fail when indexing corrupted segments

2015-03-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1978. Resolution: Duplicate Hi [~chongli], this is clearly a duplicate of NUTCH-1771. It's better

[jira] [Updated] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-03-31 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1771: --- Affects Version/s: 1.10 > Solrindex fails if a segment is corrupted or incomplete > --

[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391647#comment-14391647 ] Sebastian Nagel commented on NUTCH-1771: Hi [~chongli], the patch looks clean and

[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394230#comment-14394230 ] Sebastian Nagel commented on NUTCH-1771: Again: nice patch. * SegmentChecker holds

[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483527#comment-14483527 ] Sebastian Nagel commented on NUTCH-1771: +1 : will commit soon. Thanks, [~chongli]

[jira] [Updated] (NUTCH-1981) Upgrade icu4j to version 51.1

2015-04-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1981: --- Fix Version/s: 1.11 2.4 > Upgrade icu4j to version 51.1 > -

[jira] [Commented] (NUTCH-1981) Upgrade icu4j to version 51.1

2015-04-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485157#comment-14485157 ] Sebastian Nagel commented on NUTCH-1981: There should be no problem to upgrade the

[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2015-04-08 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485297#comment-14485297 ] Sebastian Nagel commented on NUTCH-1247: Close this issue? With NUTCH-578 and NUTC

[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-09 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487035#comment-14487035 ] Sebastian Nagel commented on NUTCH-1854: Definitely: fetcher.store.content=false a

[jira] [Resolved] (NUTCH-1981) Upgrade icu4j

2015-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1981. Resolution: Fixed Fix Version/s: (was: 1.11) 1.10 Committed to

[jira] [Commented] (NUTCH-1984) Eliminate unnecessary dependencies

2015-04-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491240#comment-14491240 ] Sebastian Nagel commented on NUTCH-1984: Thanks, [~aspa]! That's 3 problems which

[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492286#comment-14492286 ] Sebastian Nagel commented on NUTCH-1854: Thanks, [~asitang]! * NUTCH-1771 is commi

[jira] [Comment Edited] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492286#comment-14492286 ] Sebastian Nagel edited comment on NUTCH-1854 at 4/13/15 11:24 AM: --

[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492419#comment-14492419 ] Sebastian Nagel commented on NUTCH-1927: * http.robot.rules.whitelist should be em

[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493655#comment-14493655 ] Sebastian Nagel commented on NUTCH-1854: +1 Great! Needs formatting. Will commit s

[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496986#comment-14496986 ] Sebastian Nagel commented on NUTCH-1987: Agreed: it's time to skip the Solr-URL be

[jira] [Commented] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

2015-04-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497000#comment-14497000 ] Sebastian Nagel commented on NUTCH-1986: +1 that's the default values you have to

[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497149#comment-14497149 ] Sebastian Nagel commented on NUTCH-1988: +1 Could be alternatively {{-dirlevels n}

[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497251#comment-14497251 ] Sebastian Nagel commented on NUTCH-1927: Hi Chris, the class WhiteListRobotRules s

[jira] [Updated] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-16 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1927: --- Attachment: NUTCH-1927.2015-04-16.patch Hi Chris, bq. Can you please reply with code? yep, att

[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500544#comment-14500544 ] Sebastian Nagel commented on NUTCH-1927: Hi, Chris: agreed to log more verbosely.

[jira] [Updated] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1927: --- Attachment: test_NUTCH-1927.2015-04-17.txt NUTCH-1927.2015-04-17.patch Patch t

[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500652#comment-14500652 ] Sebastian Nagel commented on NUTCH-1927: Committed to trunk r1674399. Should be ea

[jira] [Resolved] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1854. Resolution: Fixed Committed to trunk, r1674581. Thanks! > ./bin/crawl fails with a parsing

[jira] [Updated] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1990: --- Attachment: NUTCH-1990-trial1.patch Sounds reasonable and would "en passant" resolve NUTCH-106

[jira] [Updated] (NUTCH-1697) SegmentMerger to implement Tool

2015-04-19 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1697: --- Attachment: NUTCH-1697-trunk-v2.patch Patch which applies to recent trunk. Both variants to pa

[jira] [Updated] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1990: --- Attachment: NUTCH-1990-v1.patch Uuuh, a lot of garbage :( I've also run the test after spendi

[jira] [Updated] (NUTCH-1991) Tika mime detection not using Nutch supplied tika-mimetypes.xml for content based detection

2015-04-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1991: --- Attachment: NUTCH-1991-trunk.v2.patch Thanks, [~ilopata1]! Updated patch to apply against trun

[jira] [Commented] (NUTCH-1993) Nutch does not use backup parsers

2015-04-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507865#comment-14507865 ] Sebastian Nagel commented on NUTCH-1993: +1 > Nutch does not use backup parsers >

[jira] [Commented] (NUTCH-1990) Use URI.normalise() in BasicURLNormalizer

2015-04-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507895#comment-14507895 ] Sebastian Nagel commented on NUTCH-1990: Applied also to 2.x, r1675499 to finally

[jira] [Resolved] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

2015-04-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1062. Resolution: Fixed Fix Version/s: (was: 1.11) 1.10

[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-23 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509678#comment-14509678 ] Sebastian Nagel commented on NUTCH-1994: +1 > Upgrade to Apache Tika 1.8 > --

[jira] [Reopened] (NUTCH-1998) Add support for user-defined file extension to CommonCrawlDataDumper

2015-05-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-1998: Commit r1678520 breaks the Jenkins build: TestCommonCrawlDataDumper needs to be adapted to the

[jira] [Resolved] (NUTCH-1998) Add support for user-defined file extension to CommonCrawlDataDumper

2015-05-11 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1998. Resolution: Fixed Fixed unit test and issue ID in change log, r1678824. > Add support for u

[jira] [Created] (NUTCH-2007) add test libs to classpath of bin/nutch junit

2015-05-12 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2007: -- Summary: add test libs to classpath of bin/nutch junit Key: NUTCH-2007 URL: https://issues.apache.org/jira/browse/NUTCH-2007 Project: Nutch Issue Type: B

[jira] [Updated] (NUTCH-2007) add test libs to classpath of bin/nutch junit

2015-05-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2007: --- Attachment: NUTCH-2007-trunk-v1.patch > add test libs to classpath of bin/nutch junit > --

[jira] [Created] (NUTCH-2008) IndexerMapReduce to use single instance of NutchIndexAction for deletions

2015-05-13 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2008: -- Summary: IndexerMapReduce to use single instance of NutchIndexAction for deletions Key: NUTCH-2008 URL: https://issues.apache.org/jira/browse/NUTCH-2008 Project:

[jira] [Updated] (NUTCH-2008) IndexerMapReduce to use single instance of NutchIndexAction for deletions

2015-05-13 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2008: --- Attachment: NUTCH-2008-trunk-v1.patch > IndexerMapReduce to use single instance of NutchIndexA

[jira] [Updated] (NUTCH-2008) IndexerMapReduce to use single instance of NutchIndexAction for deletions

2015-05-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2008: --- Attachment: NUTCH-2008-trunk-v2.patch Right, could be static. Thanks! > IndexerMapReduce to u

[jira] [Resolved] (NUTCH-2008) IndexerMapReduce to use single instance of NutchIndexAction for deletions

2015-05-14 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2008. Resolution: Fixed Assignee: Sebastian Nagel Committed to trunk, r1679335. > IndexerMa

[jira] [Commented] (NUTCH-2002) ParserChecker to check robots.txt

2015-05-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545473#comment-14545473 ] Sebastian Nagel commented on NUTCH-2002: +1 makes ParserChecker a more powerful de

[jira] [Created] (NUTCH-2012) Merge parsechecker and indexchecker

2015-05-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2012: -- Summary: Merge parsechecker and indexchecker Key: NUTCH-2012 URL: https://issues.apache.org/jira/browse/NUTCH-2012 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545527#comment-14545527 ] Sebastian Nagel commented on NUTCH-2006: +1 to complete indexchecker (opened NUTCH

[jira] [Commented] (NUTCH-2002) ParserChecker to check robots.txt

2015-05-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545532#comment-14545532 ] Sebastian Nagel commented on NUTCH-2002: one point: also redirects should be check

[jira] [Created] (NUTCH-2013) Fetcher: missing logs "fetching ..." on stdout

2015-05-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2013: -- Summary: Fetcher: missing logs "fetching ..." on stdout Key: NUTCH-2013 URL: https://issues.apache.org/jira/browse/NUTCH-2013 Project: Nutch Issue Type:

[jira] [Created] (NUTCH-2014) Fetcher hang-up on completion

2015-05-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2014: -- Summary: Fetcher hang-up on completion Key: NUTCH-2014 URL: https://issues.apache.org/jira/browse/NUTCH-2014 Project: Nutch Issue Type: Bug R

[jira] [Updated] (NUTCH-2014) Fetcher hang-up on completion

2015-05-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2014: --- Attachment: NUTCH-2014-v1.patch The reason is a mix-up of the counters for active threads and

[jira] [Updated] (NUTCH-2014) Fetcher hang-up on completion

2015-05-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2014: --- Component/s: fetcher Patch Info: Patch Available Affects Version/s: 1.11

[jira] [Reopened] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-2011: Sorry, but this needs some rework: - after 35.000+ fetched pages and the default max. heap size

[jira] [Commented] (NUTCH-2015) Make FetchNodeDb optional (off by default) if NutchServer is not used

2015-05-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547213#comment-14547213 ] Sebastian Nagel commented on NUTCH-2015: Ok. Possibly this could be changed to make it

[jira] [Commented] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-17 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547337#comment-14547337 ] Sebastian Nagel commented on NUTCH-1995: Hi Guiseppe, wild cards are ok if it is a

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547676#comment-14547676 ] Sebastian Nagel commented on NUTCH-2011: Yes, that's because of the nodeDB feature

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547760#comment-14547760 ] Sebastian Nagel commented on NUTCH-2011: Hi [~sujenshah], first a few questions to

[jira] [Created] (NUTCH-2016) Remove OldFetcher from trunk

2015-05-18 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2016: -- Summary: Remove OldFetcher from trunk Key: NUTCH-2016 URL: https://issues.apache.org/jira/browse/NUTCH-2016 Project: Nutch Issue Type: Wish Com

[jira] [Created] (NUTCH-2017) Remove debug log from MimeUtil

2015-05-18 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2017: -- Summary: Remove debug log from MimeUtil Key: NUTCH-2017 URL: https://issues.apache.org/jira/browse/NUTCH-2017 Project: Nutch Issue Type: Bug Affects

[jira] [Updated] (NUTCH-2017) Remove debug log from MimeUtil

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2017: --- Attachment: NUTCH-2017.patch > Remove debug log from MimeUtil > --

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547845#comment-14547845 ] Sebastian Nagel commented on NUTCH-2011: ??about modifying the CrawlDb to hold one

[jira] [Updated] (NUTCH-2013) Fetcher: missing logs "fetching ..." on stdout

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2013: --- Attachment: NUTCH-2013-v1.patch Patch to make all classes in the fetcher package pulled out fr

[jira] [Updated] (NUTCH-2013) Fetcher: missing logs "fetching ..." on stdout

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2013: --- Patch Info: Patch Available > Fetcher: missing logs "fetching ..." on stdout > ---

[jira] [Resolved] (NUTCH-2014) Fetcher hang-up on completion

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2014. Resolution: Fixed Committed to trunk/1.x, r1680109. Thanks for the review, [~lewismc]! > Fe

[jira] [Commented] (NUTCH-2013) Fetcher: missing logs "fetching ..." on stdout

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549280#comment-14549280 ] Sebastian Nagel commented on NUTCH-2013: Thanks! Committed to trunk/1.x, r1680110.

[jira] [Resolved] (NUTCH-2013) Fetcher: missing logs "fetching ..." on stdout

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2013. Resolution: Fixed Assignee: Sebastian Nagel > Fetcher: missing logs "fetching ..." on

[jira] [Assigned] (NUTCH-2014) Fetcher hang-up on completion

2015-05-18 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2014: -- Assignee: Sebastian Nagel > Fetcher hang-up on completion > ---

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552524#comment-14552524 ] Sebastian Nagel commented on NUTCH-2011: Yes, relying on CrawlDb should be the rig

[jira] [Commented] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553219#comment-14553219 ] Sebastian Nagel commented on NUTCH-1995: Hi Chris, it's not about Guiseppe's use c

[jira] [Commented] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-21 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554954#comment-14554954 ] Sebastian Nagel commented on NUTCH-1995: Agreed. If you know how to modify the cod

[jira] [Commented] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-26 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559883#comment-14559883 ] Sebastian Nagel commented on NUTCH-1995: +1, yes * there is already an output / lo

[jira] [Commented] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560659#comment-14560659 ] Sebastian Nagel commented on NUTCH-1995: The result of {{conf.getStrings("http.rob

[jira] [Comment Edited] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560659#comment-14560659 ] Sebastian Nagel edited comment on NUTCH-1995 at 5/27/15 9:04 AM: ---

[jira] [Assigned] (NUTCH-2007) add test libs to classpath of bin/nutch junit

2015-05-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2007: -- Assignee: Sebastian Nagel > add test libs to classpath of bin/nutch junit > ---

[jira] [Commented] (NUTCH-1995) Add support for wildcard to http.robot.rules.whitelist

2015-05-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561567#comment-14561567 ] Sebastian Nagel commented on NUTCH-1995: Great work, [~gostep]! Please, resolve!

[jira] [Resolved] (NUTCH-2007) add test libs to classpath of bin/nutch junit

2015-05-27 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2007. Resolution: Fixed Committed to trunk, r1682103. > add test libs to classpath of bin/nutch j

[jira] [Resolved] (NUTCH-1247) CrawlDatum.retries should be int

2015-05-29 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1247. Resolution: Not A Problem Resolving. This is hardly necessary and would make CrawlDb incompa

[jira] [Commented] (NUTCH-2015) Make FetchNodeDb optional (off by default) if NutchServer is not used

2015-06-01 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567037#comment-14567037 ] Sebastian Nagel commented on NUTCH-2015: +1 to commit [~sujenshah]'s latest patch

[jira] [Commented] (NUTCH-2035) Regex filter using case sensitive rules.

2015-06-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571680#comment-14571680 ] Sebastian Nagel commented on NUTCH-2035: Thanks, [~betolink]! * is it possible to

[jira] [Commented] (NUTCH-2032) Plugin to index the raw content of a readable document.

2015-06-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571694#comment-14571694 ] Sebastian Nagel commented on NUTCH-2032: Hi [~betolink], your solution/patch alrea

[jira] [Commented] (NUTCH-2034) CrawlDB filtered documents counter.

2015-06-03 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571698#comment-14571698 ] Sebastian Nagel commented on NUTCH-2034: Thanks, good idea! But strictly speaking
