Hi,
We are currently facing a problem when using the Nutch REST API. We run the Nutch API through Postman, and it works perfectly fine as long as we don't define the segment path ourselves. These are the request bodies we send from Postman:
Inject
{
  "type": "INJECT",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "url_dir": "/opt/apache-nutch-1.18/runtime/local/urls/seed.txt",
    "crawldb": "/tmp/crawl/crawldb"
  }
}
Generate
{
  "type": "GENERATE",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "crawldb": "/tmp/crawl/crawldb",
    "segment_dir": "/tmp/crawl/segments"
  }
}
Fetch
{
  "type": "FETCH",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "segment": "/tmp/crawl/segments"
  }
}
We are trying to define the path where the crawled data is stored. However, when it comes to the fetch step, the Fetcher cannot read the data from the timestamped folder (the folder named after the current date and time) that GENERATE creates under the segments directory. We have tried passing /tmp/crawl/segments/* and the fetch then succeeds, but it also creates a new folder literally named *.
Therefore, may we know whether there is a way to define the folder name inside the segments directory, or is there another way to point FETCH at the right output directory?
Attached is our log for your reference. Kindly advise. Thanks in advance.
Best Regards,
Shi Wei
2021-12-24 17:27:01,852 INFO crawl.Injector - Injector: starting at 2021-12-24
17:27:01
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: crawlDb:
/tmp/crawl/crawldb
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: urlDir:
/opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2021-12-24 17:27:01,865 INFO crawl.Injector - Injecting seed URL file
file:/opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,866 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:01,871 WARN mapreduce.JobResourceUploader - Hadoop
command-line option parsing not performed. Implement the Tool interface and
execute your application with ToolRunner to remedy this.
2021-12-24 17:27:01,971 INFO mapreduce.Job - The url to track the job:
http://localhost:8080/
2021-12-24 17:27:01,971 INFO mapreduce.Job - Running job:
job_local463605357_0260
2021-12-24 17:27:02,002 INFO regex.RegexURLNormalizer - can't find rules for
scope 'inject', using default
2021-12-24 17:27:02,014 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:02,034 INFO crawl.Injector - Injector: overwrite: false
2021-12-24 17:27:02,034 INFO crawl.Injector - Injector: update: false
2021-12-24 17:27:02,972 INFO mapreduce.Job - Job job_local463605357_0260
running in uber mode : false
2021-12-24 17:27:02,972 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:02,972 INFO mapreduce.Job - Job job_local463605357_0260
completed successfully
2021-12-24 17:27:02,973 INFO mapreduce.Job - Counters: 31
File System Counters
FILE: Number of bytes read=503885294
FILE: Number of bytes written=747488148
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=260
Map output materialized bytes=272
Input split bytes=282
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=272
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=1995440128
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
injector
urls_injected=3
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=622
2021-12-24 17:27:02,974 INFO crawl.Injector - Injector: Total urls rejected by
filters: 0
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total urls injected
after normalization and filtering: 3
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total urls injected
but already in CrawlDb: 0
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total new urls
injected: 3
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: finished at 2021-12-24
17:27:02, elapsed: 00:00:01
2021-12-24 17:27:04,912 INFO crawl.Generator - Generator: starting at
2021-12-24 17:27:04
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: filtering: true
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: normalizing: true
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: running in local
mode, generating exactly one partition.
2021-12-24 17:27:04,914 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:04,918 WARN mapreduce.JobResourceUploader - Hadoop
command-line option parsing not performed. Implement the Tool interface and
execute your application with ToolRunner to remedy this.
2021-12-24 17:27:05,025 INFO mapreduce.Job - The url to track the job:
http://localhost:8080/
2021-12-24 17:27:05,025 INFO mapreduce.Job - Running job:
job_local1719362067_0261
2021-12-24 17:27:05,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2021-12-24 17:27:05,062 INFO crawl.AbstractFetchSchedule - defaultInterval=0
2021-12-24 17:27:05,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2021-12-24 17:27:05,074 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:05,094 INFO regex.RegexURLNormalizer - can't find rules for
scope 'generate_host_count', using default
2021-12-24 17:27:06,026 INFO mapreduce.Job - Job job_local1719362067_0261
running in uber mode : false
2021-12-24 17:27:06,026 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:06,026 INFO mapreduce.Job - Job job_local1719362067_0261
completed successfully
2021-12-24 17:27:06,027 INFO mapreduce.Job - Counters: 30
File System Counters
FILE: Number of bytes read=505778868
FILE: Number of bytes written=750607198
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=347
Map output materialized bytes=359
Input split bytes=114
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=359
Reduce input records=3
Reduce output records=0
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=6
Total committed heap usage (bytes)=1900019712
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=382
File Output Format Counters
Bytes Written=8
2021-12-24 17:27:06,027 INFO crawl.Generator - Generator: number of items
rejected during selection:
2021-12-24 17:27:06,028 INFO crawl.Generator - Generator: Partitioning
selected urls for politeness.
2021-12-24 17:27:07,029 INFO crawl.Generator - Generator: segment:
/tmp/crawl/segments/20211224172707
2021-12-24 17:27:07,030 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:07,037 WARN mapreduce.JobResourceUploader - Hadoop
command-line option parsing not performed. Implement the Tool interface and
execute your application with ToolRunner to remedy this.
2021-12-24 17:27:07,153 INFO mapreduce.Job - The url to track the job:
http://localhost:8080/
2021-12-24 17:27:07,154 INFO mapreduce.Job - Running job:
job_local209587332_0262
2021-12-24 17:27:07,194 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:08,154 INFO mapreduce.Job - Job job_local209587332_0262
running in uber mode : false
2021-12-24 17:27:08,154 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:08,154 INFO mapreduce.Job - Job job_local209587332_0262
completed successfully
2021-12-24 17:27:08,156 INFO mapreduce.Job - Counters: 30
File System Counters
FILE: Number of bytes read=507673248
FILE: Number of bytes written=753716105
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=508
Map output materialized bytes=520
Input split bytes=160
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=520
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=1900019712
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=491
File Output Format Counters
Bytes Written=445
2021-12-24 17:27:08,156 INFO crawl.Generator - Generator: finished at
2021-12-24 17:27:08, elapsed: 00:00:03
2021-12-24 17:27:09,043 INFO fetcher.Fetcher - Fetcher: starting at 2021-12-24
17:27:09
2021-12-24 17:27:09,044 INFO fetcher.Fetcher - Fetcher: segment:
/tmp/crawl/segments
2021-12-24 17:27:09,045 WARN impl.MetricsSystemImpl - JobTracker metrics
system already initialized!
2021-12-24 17:27:09,051 WARN mapreduce.JobResourceUploader - Hadoop
command-line option parsing not performed. Implement the Tool interface and
execute your application with ToolRunner to remedy this.
2021-12-24 17:27:09,066 ERROR fetcher.Fetcher - Fetcher:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/crawl/segments/crawl_generate
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
    at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:115)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:498)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:613)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)