[jira] [Commented] (NUTCH-2795) CrawlDbReader: compress CrawlDb dumps if configured

2022-08-21 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582539#comment-17582539
 ] 

Hudson commented on NUTCH-2795:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #85 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/85/])
NUTCH-2795 CrawlDbReader: compress CrawlDb dumps if configured (snagel: 
[https://github.com/apache/nutch/commit/bca5fc0d0e25a213c704d9ac486ebf9d88b3cf7a])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java


> CrawlDbReader: compress CrawlDb dumps if configured
> ---
>
>              Key: NUTCH-2795
>              URL: https://issues.apache.org/jira/browse/NUTCH-2795
>          Project: Nutch
>       Issue Type: Improvement
>       Components: crawldb
> Affects Versions: 1.17
>         Reporter: Sebastian Nagel
>         Assignee: Sebastian Nagel
>         Priority: Minor
>           Labels: help-wanted
>          Fix For: 1.19
>
>
> The dumps of CrawlDbReader (text, CSV, JSON) are not compressed given the 
> configured file output compression. E.g., if running
> {noformat}
> $> bin/nutch readdb \
>    -Dmapreduce.output.fileoutputformat.compress=true \
>    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
>    crawldb/ -dump crawldb.dump -format json
> {noformat}
> the output should be compressed using bzip2.
> See the Hadoop class 
> [FileOutputFormat|https://hadoop.apache.org/docs/r3.1.3/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html]
>  and the [implementation in 
> TextOutputFormat|https://github.com/apache/hadoop/blob/639acb6d8921127cde3174a302f2e3d71b44f052/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2795) CrawlDbReader: compress CrawlDb dumps if configured

2022-08-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582526#comment-17582526
 ] 

ASF GitHub Bot commented on NUTCH-2795:
---

sebastian-nagel merged PR #746:
URL: https://github.com/apache/nutch/pull/746









[jira] [Commented] (NUTCH-2795) CrawlDbReader: compress CrawlDb dumps if configured

2022-08-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581751#comment-17581751
 ] 

ASF GitHub Bot commented on NUTCH-2795:
---

sebastian-nagel opened a new pull request, #746:
URL: https://github.com/apache/nutch/pull/746

   - configure CSV and JSON LineRecordWriters to compress the output files 
according to the configuration
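
The idea behind this change can be sketched with the JDK alone. This is only an
illustration under stated assumptions, not the code from the pull request: the
class and method names (CompressedDumpSketch, writeDump, outputName) and the
file name part-r-00000 are hypothetical, a java.util.Properties object stands in
for the Hadoop job configuration, and gzip stands in for bzip2 because the JDK
ships no bzip2 codec. In Hadoop itself, TextOutputFormat makes the same decision
via FileOutputFormat.getCompressOutput and
FileOutputFormat.getOutputCompressorClass, and appends the codec's default
extension to the output file name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: a line-oriented dump writer that honors the
// configured output compression. Not the actual CrawlDbReader code.
final class CompressedDumpSketch {

    // Property name as given in the issue description.
    static final String COMPRESS_KEY = "mapreduce.output.fileoutputformat.compress";

    /** True if the configuration asks for compressed output (default: false). */
    static boolean isCompressed(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(COMPRESS_KEY, "false"));
    }

    /** The codec extension is appended only when compression is enabled. */
    static String outputName(Properties conf, String baseName) {
        // ".gz" is an assumption of this sketch (gzip stand-in for bzip2).
        return isCompressed(conf) ? baseName + ".gz" : baseName;
    }

    /** Writes one record per line, wrapping the raw stream in a compressor if configured. */
    static byte[] writeDump(Properties conf, List<String> records) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try {
            OutputStream out = isCompressed(conf) ? new GZIPOutputStream(raw) : raw;
            // Closing the writer also finishes the compressed stream.
            try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                for (String record : records) {
                    writer.write(record);
                    writer.write('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.toByteArray();
    }

    /** Decompresses gzip bytes back to text; used only to check the round trip. */
    static String gunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(COMPRESS_KEY, "true");
        byte[] dump = writeDump(conf, List.of("{\"url\":\"http://example.org/\"}"));
        System.out.println(outputName(conf, "part-r-00000") + ": " + dump.length + " bytes");
    }
}
```

The point of the sketch is that the writer never decides on compression itself:
it wraps whatever stream it is given, so the text, CSV and JSON writers can all
share the same configuration check.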
   




> CrawlDbReader: compress CrawlDb dumps if configured
> ---
>
>              Key: NUTCH-2795
>              URL: https://issues.apache.org/jira/browse/NUTCH-2795
>          Project: Nutch
>       Issue Type: Improvement
>       Components: crawldb
> Affects Versions: 1.17
>         Reporter: Sebastian Nagel
>         Priority: Minor
>           Labels: help-wanted
>          Fix For: 1.19
>
>
> The dumps of CrawlDbReader (text, CSV, JSON) are not compressed given the 
> configured file output compression. E.g., if running
> {noformat}
> $> bin/nutch readdb \
>    -Dmapreduce.output.fileoutputformat.compress=true \
>    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
>    crawldb/ -dump crawldb.dump -format json
> {noformat}
> the output should be compressed using bzip2.
> See the Hadoop class 
> [FileOutputFormat|https://hadoop.apache.org/docs/r3.1.3/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html]
>  and the [implementation in 
> TextOutputFormat|https://github.com/apache/hadoop/blob/639acb6d8921127cde3174a302f2e3d71b44f052/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java].


