Vyacheslav Pascarel created NUTCH-2407:
------------------------------------------

             Summary: Memory leak causing Nutch Server to run out of memory
                 Key: NUTCH-2407
                 URL: https://issues.apache.org/jira/browse/NUTCH-2407
             Project: Nutch
          Issue Type: Bug
          Components: nutch server
    Affects Versions: 2.3.1
         Environment: Ubuntu 16.04 64-bit
Oracle Java 8 64-bit
Nutch 2.3.1 (standalone deployment)
MongoDB 3.4
            Reporter: Vyacheslav Pascarel


My application performs continuous crawling through the Nutch REST 
services. It injects a seed URL and then repeats the 
GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times (each step 
in the sequence is executed upon successful completion of the previous step, 
and then the whole sequence is repeated). Here is a brief description of the job:
* Number of GENERATE/FETCH/PARSE/UPDATEDB cycles per run: 50
* 'topN' parameter value of GENERATE step in each cycle: 10
* Seed URL: http://www.cnn.com
* Regex URL filters for all jobs: 
** *"-^.{1000,}$"* - exclude very long URLs
** *"+."* - include the rest
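
The job described above can be sketched roughly as follows. This is only an illustration of the control flow, not the reporter's actual client code: `submit` is a hypothetical callable that would wrap the REST request for one job and block until that job finishes, returning True on success. The `seed_url` argument key is likewise an assumption; only the step names and the 'topN' parameter come from the description.

```python
from typing import Callable, Dict, List

# The four steps repeated in every cycle, in the order given in the report.
CYCLE_STEPS = ("GENERATE", "FETCH", "PARSE", "UPDATEDB")


def build_job_plan(cycles: int) -> List[str]:
    """Return INJECT followed by `cycles` repetitions of the four-step cycle."""
    plan = ["INJECT"]
    for _ in range(cycles):
        plan.extend(CYCLE_STEPS)
    return plan


def run_crawl(
    submit: Callable[[str, Dict[str, str]], bool],
    cycles: int = 50,
    top_n: int = 10,
    seed_url: str = "http://www.cnn.com",
) -> int:
    """Run the steps strictly in order and stop at the first failure.

    `submit` is a hypothetical wrapper around the REST call for one job.
    Returns the number of steps that completed successfully.
    """
    completed = 0
    for step in build_job_plan(cycles):
        args: Dict[str, str] = {}
        if step == "INJECT":
            args = {"seed_url": seed_url}  # hypothetical argument name
        elif step == "GENERATE":
            args = {"topN": str(top_n)}
        if not submit(step, args):
            break  # the next step only starts after a successful one
        completed += 1
    return completed
```

With the values from the report (50 cycles), one run amounts to 1 + 50 * 4 = 201 jobs submitted to the server.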

To monitor the Nutch server I use Java VisualVM, which ships with the JDK. 
After each run (50 cycles of GENERATE/FETCH/PARSE/UPDATEDB) I trigger garbage 
collection from VisualVM and check memory usage. My observation is that the 
Nutch server leaks ~25 MB per run.
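
To put the ~25 MB figure in perspective, the arithmetic below estimates how many runs fit into a given amount of free heap. It is a back-of-the-envelope sketch that assumes the leak stays constant per run; `free_heap_mb` is a hypothetical input, since the report does not state the server's heap size.

```python
def runs_until_exhaustion(free_heap_mb: float, leak_per_run_mb: float = 25.0) -> int:
    """Estimate how many full runs fit before the free heap is used up.

    Assumes a constant leak per run (about 25 MB in the report).
    `free_heap_mb` is a hypothetical value supplied by the caller.
    """
    if leak_per_run_mb <= 0:
        raise ValueError("leak_per_run_mb must be positive")
    return int(free_heap_mb // leak_per_run_mb)
```

For example, with a hypothetical 1000 MB of free heap, `runs_until_exhaustion(1000)` gives 40 runs.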

NOTES: I added custom HTTP DELETE services that clear the job history in 
NutchServerPoolExecutor and remove all custom configurations from 
RAMConfManager after each run, so the observed ~25 MB leak remains even after 
job history and configuration cleanup.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)