[jira] [Updated] (NUTCH-1774) Crawling from REST API giving NullPointerException

2014-05-12 Thread sreemanth pulagam (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sreemanth pulagam updated NUTCH-1774: - Attachment: NUTCH-1774.patch Patch file to fix this issue. Resolution: 1. Generate the b

[jira] [Created] (NUTCH-1774) Crawling from REST API giving NullPointerException

2014-05-12 Thread sreemanth pulagam (JIRA)
sreemanth pulagam created NUTCH-1774: Summary: Crawling from REST API giving NullPointerException Key: NUTCH-1774 URL: https://issues.apache.org/jira/browse/NUTCH-1774 Project: Nutch Issu

Re: Clean up in case of error is not handled

2014-05-12 Thread Markus Jelsma
Hi Diaa, Yes, you can open an issue for these fixes and attach patches if you can. Cheers, Markus Diaa Abdallah wrote: Hi, I noticed that Nutch doesn't handle cleaning up (removing temp folders) in case of error. In the following classes temp directories are created but not removed when th

Clean up in case of error is not handled

2014-05-12 Thread Diaa Abdallah
Hi, I noticed that Nutch doesn't handle cleaning up (removing temp folders) in case of error. In the following classes temp directories are created but not removed when there is an error: 1. Injector 2. CrawlDBReader 3. Deduplication 4. SegmentReader For example, in Injector you find: RunningJob ma

[jira] [Updated] (NUTCH-1772) Injector does not need merging if no pre-existing crawldb

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1772: - Attachment: NUTCH-1772.patch > Injector does not need merging if no pre-existing crawldb > --

[jira] [Created] (NUTCH-1772) Injector does not need merging if no pre-existing crawldb

2014-05-12 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1772: Summary: Injector does not need merging if no pre-existing crawldb Key: NUTCH-1772 URL: https://issues.apache.org/jira/browse/NUTCH-1772 Project: Nutch Issue

[jira] [Commented] (NUTCH-1752) cache robots.txt rules per protocol:host:port

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995055#comment-13995055 ] Julien Nioche commented on NUTCH-1752: -- Looks good! +1 > cache robots.txt rules per

[jira] [Closed] (NUTCH-1766) Generator to unlock crawldb and remove tempdir if generate job fails

2014-05-12 Thread Diaa (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diaa closed NUTCH-1766. --- Fixed. Thanks > Generator to unlock crawldb and remove tempdir if generate job fails > -

[jira] [Commented] (NUTCH-1770) Nutch is failing to parse all PDFs

2014-05-12 Thread Ralf (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993766#comment-13993766 ] Ralf commented on NUTCH-1770: - I just compiled the 2.x branch; no problems parsing PDFs here.

[jira] [Commented] (NUTCH-1669) FTP crawl does not use FTP's server root folder

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995059#comment-13995059 ] Julien Nioche commented on NUTCH-1669: -- Hi Rafael Looks like this issue went unnotic

[jira] [Updated] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1613: - Fix Version/s: (was: 2.4) 1.9 2.3 > Timeouts in protoco

[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-05-12 Thread Ralf (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992728#comment-13992728 ] Ralf commented on NUTCH-1679: - Hi, I would love to participate. How can I check out the 2.3 c

[jira] [Assigned] (NUTCH-1766) Generator to unlock crawldb and remove tempdir if generate job fails

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-1766: Assignee: Julien Nioche > Generator to unlock crawldb and remove tempdir if generate job fa

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994921#comment-13994921 ] Julien Nioche commented on NUTCH-1714: -- [~shekoufa] bq. After applying NUTCH-1714 a

[jira] [Created] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2014-05-12 Thread Diaa (JIRA)
Diaa created NUTCH-1771: --- Summary: Solrindex fails if a segment is corrupted or incomplete Key: NUTCH-1771 URL: https://issues.apache.org/jira/browse/NUTCH-1771 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-1770) Nutch is failing to parse all PDFs

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994908#comment-13994908 ] Julien Nioche commented on NUTCH-1770: -- [~tilman] There are warnings in the logs + a

[jira] [Commented] (NUTCH-1622) Create Outlinks with metadata

2014-05-12 Thread Daniel Kugel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994191#comment-13994191 ] Daniel Kugel commented on NUTCH-1622: - I might have done something wrong but reading t

[jira] [Resolved] (NUTCH-1766) Generator to unlock crawldb and remove tempdir if generate job fails

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1766. -- Resolution: Fixed Committed revision 1593901. Thanks! > Generator to unlock crawldb and remov

[jira] [Updated] (NUTCH-1766) Generator to unlock crawldb and remove tempdir if generate job fails

2014-05-12 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1766: - Priority: Minor (was: Major) > Generator to unlock crawldb and remove tempdir if generate job fa