Why is fetcher one big class?

2014-05-22 Thread Diaa Abdallah
Currently the Fetcher class is a 1,500-line piece of code. I'd like to suggest splitting it up into multiple files to improve the readability and maintainability of the code, instead of this one big class with many nested classes. The classes are grouped anyway by the fetcher namespace so having them

Re: Creating Windows bash files for nutch

2014-05-18 Thread Diaa Abdallah
I meant writing batch/cmd scripts for Windows that don't require Cygwin. I was thinking of writing those scripts but wanted to check if people think it's a good idea. On Sunday, May 18, 2014, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Currently Nutch isn't very friendly to

Creating Windows bash files for nutch

2014-05-17 Thread Diaa Abdallah
Hi, Currently Nutch isn't very friendly to Windows users, as it requires Cygwin to run and there are a lot of issues with the Hadoop 1.x branch, which Nutch bundles, due to the tmp permission-setting issue. What do you think about doing two things: 1. Move to Hadoop 2.4 to support Windows/Linux

Re: Clean up in case of error is not handled

2014-05-16 Thread Diaa Abdallah
Thanks! Created a JIRA issue with the patch https://issues.apache.org/jira/browse/NUTCH-1783 On Tue, May 13, 2014 at 12:19 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Diaa, Yes, you can open an issue for these fixes and attach patches if you can. Cheers, Markus Diaa

Inject auto generated urls

2014-05-16 Thread Diaa Abdallah
Hi, In some cases when you crawl a website you already know many page URLs that share a similar structure. For example, on IMDb, entertainment artists have the following link structure: http://www.imdb.com/name/nm1/ http://www.imdb.com/name/nm2/ http://www.imdb.com/name/nm6499112/ How about
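
The idea of generating seed URLs from a known pattern can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `SeedUrlGenerator` and its `generate` method are hypothetical names, not part of Nutch, and the prefix and numeric range are taken from the IMDb example above. The output (one URL per line) could be saved as a plain-text seed file and passed to the injector as usual.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Nutch class): expands a URL pattern of the form
// prefix + id + suffix into a list of seed URLs.
final class SeedUrlGenerator {

    // Builds one URL per id in the inclusive range [first, last].
    static List<String> generate(String prefix, String suffix, long first, long last) {
        if (first > last) {
            throw new IllegalArgumentException("first must not be greater than last");
        }
        List<String> urls = new ArrayList<>();
        for (long id = first; id <= last; id++) {
            urls.add(prefix + id + suffix);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints one URL per line, the layout a plain-text seed list uses.
        for (String url : generate("http://www.imdb.com/name/nm", "/", 1, 3)) {
            System.out.println(url);
        }
    }
}
```

For very large ranges it would probably be better to stream the URLs straight to a file rather than hold them in a list; the list is used here only to keep the sketch short.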

Clean up in case of error is not handled

2014-05-12 Thread Diaa Abdallah
Hi, I noticed that Nutch doesn't clean up (remove temp folders) when an error occurs. In the following classes temp directories are created but not removed when there is an error: 1. Injector 2. CrawlDBReader 3. Deduplication 4. SegmentReader For example in Injector you find: RunningJob
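
The general fix for this kind of leak is to remove the temp directory in a `finally` block so that it runs whether or not the job fails. The sketch below shows that pattern only; it is not the NUTCH-1783 patch. It assumes a local directory and uses `java.nio.file` from the standard library instead of the Hadoop `FileSystem` API that Nutch actually works with, and `TempDirCleanup`, `Job` and `runWithTempDir` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration (not Nutch code): run a job against a temp
// directory and always remove that directory afterwards.
final class TempDirCleanup {

    // Stand-in for the work that writes into the temp directory.
    interface Job {
        void run(Path tempDir) throws IOException;
    }

    // Creates a temp directory, runs the job, and deletes the directory in
    // a finally block so cleanup also happens when the job throws.
    static Path runWithTempDir(Job job) throws IOException {
        Path tempDir = Files.createTempDirectory("inject-temp-");
        try {
            job.run(tempDir);
        } finally {
            deleteRecursively(tempDir);
        }
        return tempDir;
    }

    // Deletes children before parents by visiting paths in reverse order.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> ordered;
        try (Stream<Path> paths = Files.walk(root)) {
            ordered = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : ordered) {
            Files.delete(path);
        }
    }

    public static void main(String[] args) {
        try {
            runWithTempDir(dir -> {
                Files.createFile(dir.resolve("part-00000"));
                throw new IOException("simulated job failure");
            });
        } catch (IOException e) {
            System.out.println("Job failed, temp dir already removed: " + e.getMessage());
        }
    }
}
```

The exception from the job still propagates to the caller, so error reporting is unchanged; only the leftover directory goes away.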

Re: [jira] [Commented] (NUTCH-1766) Generator to unlock crawldb and remove tempdir if generate job fails

2014-05-11 Thread Diaa Abdallah
Anyone wanna commit this? On Mon, Apr 28, 2014 at 12:04 AM, Diaa (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/NUTCH-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982485#comment-13982485 ] Diaa commented on

Re: Contributing Improvements to Classes documentation

2014-04-24 Thread Diaa Abdallah
/display/solr/Apache+Solr+Reference+Guide What do you think? On Thu, Apr 24, 2014 at 5:01 PM, Diaa Abdallah diaa.abdelmon...@gmail.com wrote: Hi, I am trying to improve the documentation of Nutch while I'm going through its classes. I do that by creating tasks on JIRA. Is that the correct

Debugging Nutch from Windows

2014-04-23 Thread Diaa Abdallah
Hi, Is there a way to debug Nutch from Windows? I followed the steps on https://wiki.apache.org/nutch/RunNutchInEclipse and reached step 6; however, when I run the application it says: Cannot run program chmod: CreateProcess error=2, The system cannot find the file specified How would I go about