Hi,
We are currently facing a problem when using the Nutch REST API. We run the
Nutch API through Postman, and it works perfectly fine if we don't define the
segment path. These are the requests we send from Postman.
Inject
{
  "type": "INJECT",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "url_dir": "/opt/apache-nutch-1.18/runtime/local/urls/seed.txt",
    "crawldb": "/tmp/crawl/crawldb"
  }
}
Generate
{
  "type": "GENERATE",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "crawldb": "/tmp/crawl/crawldb",
    "segment_dir": "/tmp/crawl/segments"
  }
}
Fetch
{
  "type": "FETCH",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "segment": "/tmp/crawl/segments"
  }
}
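For anyone who wants to reproduce this without Postman, the same request bodies can be sent from a short script. This is only a minimal sketch: it assumes the Nutch server was started with bin/nutch startserver, listens on localhost:8081, and accepts jobs as a POST to /job/create. The helper names build_job, make_request and submit_job are made up for this illustration and are not part of Nutch.

```python
import json
import urllib.request

# Assumption: server started with `bin/nutch startserver` on this host
# and port; adjust to match your own setup.
NUTCH_SERVER = "http://localhost:8081"


def build_job(job_type, args, crawl_id="crawl01", conf_id="default"):
    """Return the JSON body for one job, mirroring the Postman requests."""
    return {"type": job_type, "confId": conf_id, "crawlId": crawl_id, "args": args}


def make_request(payload, base_url=NUTCH_SERVER):
    """Wrap a job payload in a POST request to the job-creation endpoint."""
    return urllib.request.Request(
        url=base_url + "/job/create",  # assumed endpoint, see note above
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(payload, base_url=NUTCH_SERVER):
    """Send the job to the server and return the raw response text."""
    with urllib.request.urlopen(make_request(payload, base_url)) as response:
        return response.read().decode("utf-8")


inject_job = build_job(
    "INJECT",
    {
        "url_dir": "/opt/apache-nutch-1.18/runtime/local/urls/seed.txt",
        "crawldb": "/tmp/crawl/crawldb",
    },
)
generate_job = build_job(
    "GENERATE",
    {"crawldb": "/tmp/crawl/crawldb", "segment_dir": "/tmp/crawl/segments"},
)

# Nothing is sent until submit_job is called, e.g.:
#   print(submit_job(inject_job))
```

Each job should be allowed to finish before the next one is submitted, since GENERATE reads the crawldb that INJECT writes.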
We want to define the path so that the crawled data is stored in a specific
directory. However, when it comes to the fetch step, it cannot retrieve data
from the specific folder (its name is generated from the current date and
time) under the segments folder. We have tried /tmp/crawl/segments/* and it
can successfully retrieve the data, but it also creates a new folder
called *.
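One possible workaround, shown here only as a sketch, is to look up the newest timestamp-named folder ourselves and pass its full path in the FETCH request. This assumes that the script runs on the same machine as the Nutch server, that segment folder names consist only of digits (date and time), and that the FETCH job accepts the path of a single segment in its "segment" argument. The function names latest_segment and fetch_job are hypothetical helpers, not Nutch APIs.

```python
import os


def latest_segment(segments_dir):
    """Return the path of the newest segment under segments_dir, or None."""
    names = [
        name
        for name in os.listdir(segments_dir)
        # Keep only all-digit directory names; this skips strays such as "*".
        if name.isdigit() and os.path.isdir(os.path.join(segments_dir, name))
    ]
    if not names:
        return None
    # Timestamp names of equal length sort chronologically as plain strings.
    return os.path.join(segments_dir, max(names))


def fetch_job(segments_dir, crawl_id="crawl01", conf_id="default"):
    """Build a FETCH request body that points at the newest segment."""
    segment = latest_segment(segments_dir)
    if segment is None:
        raise FileNotFoundError("no segment found under " + segments_dir)
    return {
        "type": "FETCH",
        "confId": conf_id,
        "crawlId": crawl_id,
        # Assumption: "segment" may hold the full path of one segment.
        "args": {"segment": segment},
    }
```

Calling fetch_job("/tmp/crawl/segments") after the GENERATE job has finished would then give a request body that names the freshly generated folder instead of the parent directory.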
Therefore, may we know if there is any way to define the folder name inside
the segments folder, or is there another way to change the output directory?
Attached is our log for your reference. Kindly advise. Thanks in advance.
Best Regards,
Shi Wei
2021-12-24 17:27:01,852 INFO crawl.Injector - Injector: starting at 2021-12-24 17:27:01
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: crawlDb: /tmp/crawl/crawldb
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: urlDir: /opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2021-12-24 17:27:01,865 INFO crawl.Injector - Injecting seed URL file file:/opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,866 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:01,871 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-12-24 17:27:01,971 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2021-12-24 17:27:01,971 INFO mapreduce.Job - Running job: job_local463605357_0260
2021-12-24 17:27:02,002 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2021-12-24 17:27:02,014 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:02,034 INFO crawl.Injector - Injector: overwrite: false
2021-12-24 17:27:02,034 INFO crawl.Injector - Injector: update: false
2021-12-24 17:27:02,972 INFO mapreduce.Job - Job job_local463605357_0260 running in uber mode : false
2021-12-24 17:27:02,972 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:02,972 INFO mapreduce.Job - Job job_local463605357_0260 completed successfully
2021-12-24 17:27:02,973 INFO mapreduce.Job - Counters: 31
    File System Counters
        FILE: Number of bytes read=503885294
        FILE: Number of bytes written=747488148
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=3
        Map output records=3
        Map output bytes=260
        Map output materialized bytes=272
        Input split bytes=282
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=272
        Reduce input records=3
        Reduce output records=3
        Spilled Records=6
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=1995440128
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    injector
        urls_injected=3
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=622
2021-12-24 17:27:02,974 INFO crawl.Injector - Injector: Total urls rejected by filters: 0
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total urls injected after normalization and filtering: 3
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total new urls injected: 3
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: finished at 2021-12-24 17:27:02,