Hi,
We are currently facing a problem when using the Nutch REST API. We run the Nutch API through Postman, and it works perfectly fine as long as we don't define the segment path ourselves. These are the request bodies we send from Postman:
Inject
{
  "type": "INJECT",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "url_dir": "/opt/apache-nutch-1.18/runtime/local/urls/seed.txt",
    "crawldb": "/tmp/crawl/crawldb"
  }
}
Generate
{
  "type": "GENERATE",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "crawldb": "/tmp/crawl/crawldb",
    "segment_dir": "/tmp/crawl/segments"
  }
}
Fetch
{
  "type": "FETCH",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "segment": "/tmp/crawl/segments"
  }
}
We are trying to define the path where the crawled data is stored. However, when it comes to the fetch step, the Fetcher cannot read the data from the timestamped folder (the folder named after the current date and time) that GENERATE creates under the segments directory. We have tried passing /tmp/crawl/segments/* and the fetch then succeeds, but it also creates a new folder literally named *.
Therefore, may we know whether there is a way to define the folder name inside the segments directory, or is there another way to point FETCH at the right output directory?
Attached is our log for your reference. Kindly advise. Thanks in advance.
Best Regards,
Shi Wei
2021-12-24 17:27:01,852 INFO crawl.Injector - Injector: starting at 2021-12-24
17:27:01
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: crawlDb:
/tmp/crawl/crawldb
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: urlDir:
/opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2021-12-24 17:27:01,865 INFO crawl.Injector - Injecting seed URL file
file:/opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,866 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:01,871 WARN mapreduce.JobResourceUploader - Hadoop
command-line option parsing not performed. Implement the Tool interface and
execute your application with ToolRunner to remedy this.
2021-12-24 17:27:01,971 INFO mapreduce.Job - The url to track the job:
http://localhost:8080/
2021-12-24 17:27:01,971 INFO mapreduce.Job - Running job:
job_local463605357_0260
2021-12-24 17:27:02,002 INFO regex.RegexURLNormalizer - can't find rules for
scope 'inject', using default
2021-12-24 17:27:02,014 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:02,034 INFO crawl.Injector - Injector: overwrite: false
2021-12-24 17:27:02,034 INFO crawl.Injector - Injector: update: false
2021-12-24 17:27:02,972 INFO mapreduce.Job - Job job_local463605357_0260
running in uber mode : false
2021-12-24 17:27:02,972 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:02,972 INFO mapreduce.Job - Job job_local463605357_0260
completed successfully
2021-12-24 17:27:02,973 INFO mapreduce.Job - Counters: 31
File System Counters
FILE: Number of bytes read=503885294
FILE: Number of bytes written=747488148
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=260
Map output materialized bytes=272
Input split bytes=282
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=272
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=1995440128
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
injector
urls_injected=3
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=622
2021-12-24 17:27:02,974 INFO crawl.Injector - Injector: Total urls rejected by
filters: 0
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total urls injected
after normalization and filtering: 3
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total urls injected
but already in CrawlDb: 0
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total new urls
injected: 3
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: finished at 2021-12-24
17:27:02, elapsed: 00:00:01
2021-12-24 17:27:04,912 INFO crawl.Generator - Generator: starting at
2021-12-24 17:27:04
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: filtering: true
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: normalizing: true
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: running in local
mode, generating exactly one partition.
2021-12-24 17:27:04,914 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:04,918 WARN mapreduce.JobResourceUploader - Hadoop
command-line option parsing not performed. Implement the Tool interface and
execute your application with ToolRunner to remedy this.
2021-12-24 17:27:05,025 INFO mapreduce.Job - The url to track the job:
http://localhost:8080/
2021-12-24 17:27:05,025 INFO mapreduce.Job - Running job:
job_local1719362067_0261
2021-12-24 17:27:05,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2021-12-24 17:27:05,062 INFO crawl.AbstractFetchSchedule - defaultInterval=0
2021-12-24 17:27:05,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2021-12-24 17:27:05,074 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:05,094 INFO regex.RegexURLNormalizer - can't find rules for
scope 'generate_host_count', using default
2021-12-24 17:27:06,026 INFO mapreduce.Job - Job job_local1719362067_0261
running in uber mode : false
2021-12-24 17:27:06,026 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:06,026 INFO mapreduce.Job - Job job_local1719362067_0261
completed successfully
2021-12-24 17:27:06,027 INFO mapreduce.Job - Counters: 30
File System Counters
FILE: Number of bytes read=505778868
FILE: Number of bytes written=750607198
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=347
Map output materialized bytes=359
Input split bytes=114
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=359
Reduce input records=3
Reduce output records=0
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=6
Total committed heap usage (bytes)=1900019712
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=382
File Output Format Counters
Bytes Written=8
2021-12-24 17:27:06,027 INFO crawl.Generator - Generator: number of items
rejected during selection:
2021-12-24 17:27:06,028 INFO crawl.Generator - Generator: Partitioning
selected urls for politeness.
2021-12-24 17:27:07,029 INFO crawl.Generator - Generator: segment:
/tmp/crawl/segments/20211224172707
2021-12-24 17:27:07,030 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:07,037 WARN mapreduce.JobResourceUploader - Hadoop
command-line option parsing not performed. Implement the Tool interface and
execute your application with ToolRunner to remedy this.
2021-12-24 17:27:07,153 INFO mapreduce.Job - The url to track the job:
http://localhost:8080/
2021-12-24 17:27:07,154 INFO mapreduce.Job - Running job:
job_local209587332_0262
2021-12-24 17:27:07,194 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:08,154 INFO mapreduce.Job - Job job_local209587332_0262
running in uber mode : false
2021-12-24 17:27:08,154 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:08,154 INFO mapreduce.Job - Job job_local209587332_0262
completed successfully
2021-12-24 17:27:08,156 INFO mapreduce.Job - Counters: 30
File System Counters
FILE: Number of bytes read=507673248
FILE: Number of bytes written=753716105
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=508
Map output materialized bytes=520
Input split bytes=160
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=520
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=1900019712
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=491
File Output Format Counters
Bytes Written=445
2021-12-24 17:27:08,156 INFO crawl.Generator - Generator: finished at
2021-12-24 17:27:08, elapsed: 00:00:03
2021-12-24 17:27:09,043 INFO fetcher.Fetcher - Fetcher: starting at 2021-12-24
17:27:09
2021-12-24 17:27:09,044 INFO fetcher.Fetcher - Fetcher: segment:
/tmp/crawl/segments
2021-12-24 17:27:09,045 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:09,051 WARN mapreduce.JobResourceUploader - Hadoop
command-line option parsing not performed. Implement the Tool interface and
execute your application with ToolRunner to remedy this.
2021-12-24 17:27:09,066 ERROR fetcher.Fetcher - Fetcher:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/crawl/segments/crawl_generate
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
    at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:115)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:498)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:613)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)