Re: Unable to fetch data from segment folder

2022-01-11 Thread Lewis John McGibbney
I created https://issues.apache.org/jira/browse/NUTCH-2931 to track all of 
this work.
If you are interested in working on any of this it would be great to 
collaborate.
There is much more we can do over and above the few tickets I created.
lewismc



Re: Unable to fetch data from segment folder

2022-01-11 Thread Lewis John McGibbney
Hi Shi Wei,
I missed this thread over the holidays!
Which version of Nutch are you using?
The REST API needs quite a bit of attention. It is not a particularly mature 
aspect of the Nutch codebase and there is a catalog of issues which need to 
be addressed.
If you are interested in learning about these issues then we can create an EPIC 
issue in JIRA and then begin fleshing out all of the things that are wrong.
lewismc



Unable to fetch data from segment folder

2021-12-24 Thread sw.ling
Hi,

We are currently facing a problem when using the Nutch REST API. We run the
Nutch API through Postman, and it works perfectly fine if we don't define the
segment path. These are the requests we send from Postman.

Inject

{
  "type": "INJECT",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "url_dir": "/opt/apache-nutch-1.18/runtime/local/urls/seed.txt",
    "crawldb": "/tmp/crawl/crawldb"
  }
}

Generate

{
  "type": "GENERATE",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {
    "crawldb": "/tmp/crawl/crawldb",
    "segment_dir": "/tmp/crawl/segments"
  }
}

Fetch

{
  "type": "FETCH",
  "confId": "default",
  "crawlId": "crawl01",
  "args": {"segment": "/tmp/crawl/segments"}
}

We are trying to define the path so that the crawled data is stored in a
specific directory. However, when it comes to the fetch step, it cannot
retrieve data from the specific folder (whose name is generated from the
current date and time) under the segments folder. We have tried
/tmp/crawl/segments/* and it can successfully retrieve the data, but it also
generates a new folder called *.

Is there any way to define the folder name in the segments folder, or is
there another way to change the output directory?
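
One possible workaround, sketched here in Python and not confirmed anywhere in
this thread: look up the timestamped folder that the generate step created and
pass that full path as "segment" instead of the parent directory. The sketch
assumes that the fetch step expects a single segment directory and that
segment folders have all-digit timestamp names (for example 20211224172701).
The helper names latest_segment and fetch_job are hypothetical and are not
part of Nutch.

```python
import json
import os
import tempfile


def latest_segment(segments_dir):
    """Return the path of the newest segment folder, or None if there is none."""
    # Assumption: segment folders have all-digit timestamp names of equal
    # length, so the lexicographically largest name is the most recent one.
    # The isdigit() filter also skips stray folders such as one named "*".
    names = [
        name
        for name in os.listdir(segments_dir)
        if name.isdigit() and os.path.isdir(os.path.join(segments_dir, name))
    ]
    if not names:
        return None
    return os.path.join(segments_dir, max(names))


def fetch_job(segments_dir, crawl_id="crawl01", conf_id="default"):
    """Build a FETCH payload that points at one concrete segment folder."""
    segment = latest_segment(segments_dir)
    if segment is None:
        raise FileNotFoundError("no segment folder found in " + segments_dir)
    return {
        "type": "FETCH",
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": {"segment": segment},
    }


if __name__ == "__main__":
    # Demo against a throwaway directory that stands in for /tmp/crawl/segments.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("20211224172701", "20211224173015"):
            os.mkdir(os.path.join(demo_dir, name))
        print(json.dumps(fetch_job(demo_dir), indent=2))
```

The printed JSON has the same shape as the Fetch request above and could be
pasted into Postman as the request body. Whether this resolves the problem
depends on how the installed Nutch version handles the "segment" argument.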

Attached is our log for your reference. Kindly advise. Thanks in advance.

Best Regards,
Shi Wei

2021-12-24 17:27:01,852 INFO  crawl.Injector - Injector: starting at 2021-12-24 17:27:01
2021-12-24 17:27:01,853 INFO  crawl.Injector - Injector: crawlDb: /tmp/crawl/crawldb
2021-12-24 17:27:01,853 INFO  crawl.Injector - Injector: urlDir: /opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,853 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2021-12-24 17:27:01,865 INFO  crawl.Injector - Injecting seed URL file file:/opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,866 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:01,871 WARN  mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-12-24 17:27:01,971 INFO  mapreduce.Job - The url to track the job: http://localhost:8080/
2021-12-24 17:27:01,971 INFO  mapreduce.Job - Running job: job_local463605357_0260
2021-12-24 17:27:02,002 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2021-12-24 17:27:02,014 WARN  impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:02,034 INFO  crawl.Injector - Injector: overwrite: false
2021-12-24 17:27:02,034 INFO  crawl.Injector - Injector: update: false
2021-12-24 17:27:02,972 INFO  mapreduce.Job - Job job_local463605357_0260 running in uber mode : false
2021-12-24 17:27:02,972 INFO  mapreduce.Job -  map 100% reduce 100%
2021-12-24 17:27:02,972 INFO  mapreduce.Job - Job job_local463605357_0260 completed successfully
2021-12-24 17:27:02,973 INFO  mapreduce.Job - Counters: 31
    File System Counters
        FILE: Number of bytes read=503885294
        FILE: Number of bytes written=747488148
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=3
        Map output records=3
        Map output bytes=260
        Map output materialized bytes=272
        Input split bytes=282
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=272
        Reduce input records=3
        Reduce output records=3
        Spilled Records=6
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=1995440128
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    injector
        urls_injected=3
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=622
2021-12-24 17:27:02,974 INFO  crawl.Injector - Injector: Total urls rejected by filters: 0
2021-12-24 17:27:02,975 INFO  crawl.Injector - Injector: Total urls injected after normalization and filtering: 3
2021-12-24 17:27:02,975 INFO  crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2021-12-24 17:27:02,975 INFO  crawl.Injector - Injector: Total new urls injected: 3
2021-12-24 17:27:02,975 INFO  crawl.Injector - Injector: finished at 2021-12-24 17:27:02,