Currently the fetcher class is a 1,500-line piece of code.
I'd like to suggest splitting it up into multiple files to improve
the readability and maintainability of the code, instead of keeping this one
big class with many nested classes.
The classes are grouped anyway by the fetcher namespace, so having them in
separate files would preserve that grouping.
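As a sketch of how the split could look (the class shape below is illustrative, not the actual Nutch code), a static nested class would simply move into its own file in the same fetcher package, so callers only drop the outer-class prefix:

```java
// Hypothetical sketch: a nested class extracted into its own file.
// Before: Fetcher.FetchItem, a static nested class inside Fetcher.java.
// After: FetchItem.java (this file) in the same org.apache.nutch.fetcher
// package, so the namespace grouping is unchanged; call sites only
// change "Fetcher.FetchItem" to "FetchItem".
public class FetchItem {
    private final String url;
    private final String queueId;

    public FetchItem(String url, String queueId) {
        this.url = url;
        this.queueId = queueId;
    }

    public String getUrl() { return url; }
    public String getQueueId() { return queueId; }
}
```

Since the class was already static (no hidden reference to the outer instance), the extraction is purely mechanical and behavior-preserving.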
I meant writing batch/cmd scripts for Windows that don't require Cygwin.
I was thinking of writing those scripts but wanted to check if people think
it's a good idea.
On Sunday, May 18, 2014, Julien Nioche
wrote:
> Hi
>
>
>> Currently nutch isn't very friendly to windows users as it requires
>>
Hi,
Currently Nutch isn't very friendly to Windows users, as it requires Cygwin
to run, and there are a lot of issues with the Hadoop 1.x branch that Nutch
bundles, due to the "set tmp permission" issue.
What do you think about doing two things:
1. Move to Hadoop 2.4 to support both Windows and Linux
Hi,
In some cases when you crawl a website you already know many page URLs that
share a similar structure.
For example, on IMDb, entertainment artists have the following link structure:
http://www.imdb.com/name/nm1/
http://www.imdb.com/name/nm2/
http://www.imdb.com/name/nm6499112/
How about allowing Nutch to inject a URL pattern with a numeric range
instead of listing every URL explicitly?
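One possible shape for a feature like this (purely a sketch, not an existing Nutch option; the class name and placeholder syntax are invented here) is a seed-pattern expander that turns a template plus a numeric range into concrete URLs before injection:

```java
import java.util.ArrayList;
import java.util.List;

public class SeedPatternExpander {
    // Expand a template like "http://www.imdb.com/name/nm{}/" over
    // [start, end], replacing the "{}" placeholder with each number
    // in the range. Returns the concrete URLs in order.
    public static List<String> expand(String template, long start, long end) {
        List<String> urls = new ArrayList<>();
        for (long i = start; i <= end; i++) {
            urls.add(template.replace("{}", Long.toString(i)));
        }
        return urls;
    }
}
```

For example, `expand("http://www.imdb.com/name/nm{}/", 1, 6499112)` would cover the whole IMDb name range shown above without a multi-million-line seed file.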
Thanks!
Created a JIRA issue with the patch
https://issues.apache.org/jira/browse/NUTCH-1783
On Tue, May 13, 2014 at 12:19 AM, Markus Jelsma
wrote:
> Hi Diaa,
>
> Yes, you can open an issue for these fixes and attach patches if you can.
>
> Cheers,
> Markus
>
>
Hi,
I noticed that Nutch doesn't handle cleaning up (removing temp folders) in
case of error.
In the following classes temp directories are created but not removed when
there is an error:
1. Injector
2. CrawlDBReader
3. Deduplication
4. SegmentReader
For example in injector you find:
RunningJob ma
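A try/finally cleanup would address this in each of the four classes. The sketch below uses `java.nio` rather than the Hadoop `FileSystem` API the real classes use, and the interface name is invented for illustration, but the shape of the fix is the same:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class TempDirCleanup {
    public interface TempAction {
        void run(Path tmpDir) throws IOException;
    }

    // Run a job-like action against a fresh temp directory and delete the
    // directory in a finally block, so it disappears on both the success
    // and the failure path. (A production version would also take care not
    // to let a cleanup failure mask the original exception.)
    public static void runWithTempDir(String prefix, TempAction action)
            throws IOException {
        Path tmp = Files.createTempDirectory(prefix);
        try {
            action.run(tmp);
        } finally {
            try (Stream<Path> walk = Files.walk(tmp)) {
                walk.sorted(Comparator.reverseOrder())     // children first,
                    .forEach(p -> p.toFile().delete());    // then parents
            }
        }
    }
}
```

In the Hadoop-based classes the finally block would call `fs.delete(tempDir, true)` instead, but the try/finally structure is identical.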
Anyone wanna commit this?
On Mon, Apr 28, 2014 at 12:04 AM, Diaa (JIRA) wrote:
>
> [
> https://issues.apache.org/jira/browse/NUTCH-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982485#comment-13982485]
>
> Diaa commented on NUTCH-1766:
> -
Hi,
I tried injecting www.google.com into my crawldb without prepending
http:// to it.
It injected it fine; however, when I ran generate on it, it gave the
following warning:
"Malformed URL: 'www.google.com', skipping (java.net.MalformedURLException:
no protocol: www.google.com"
Why doesn't Nutch assume a default protocol of http:// in that case?
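A normalization step that would avoid the warning (a sketch of the idea, not what Nutch currently does; the class name is invented here) is to prepend a default scheme before parsing:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SchemeDefaulter {
    // If the seed already carries a scheme (http://, https://, ftp://, ...),
    // leave it alone; otherwise assume http://.
    public static String withDefaultScheme(String seed) {
        if (seed.matches("^[a-zA-Z][a-zA-Z0-9+.-]*://.*")) {
            return seed;
        }
        return "http://" + seed;
    }

    // Parse after defaulting; "www.google.com" no longer triggers
    // MalformedURLException ("no protocol").
    public static URL parse(String seed) throws MalformedURLException {
        return new URL(withDefaultScheme(seed));
    }
}
```

The trade-off is that a typo like `htp://example.com` would no longer fail loudly, which may be why inject currently leaves seeds untouched.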
> https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
>
> What do you think?
>
>
>
> On Thu, Apr 24, 2014 at 5:01 PM, Diaa Abdallah wrote:
>
>> Hi,
>> I am trying to improve the documentation of Nutch while I'm going through
> http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
>
> Sebastian
>
> On 04/23/2014 04:57 PM, Diaa Abdallah wrote:
> > Hi,
> > Is there a way to debug nutch from Windows?
> > I followed the steps on https://wiki.apache.org/nutch/RunNutchInEclipse
> > and reached step 6; however, when I run the application it says:
Hi,
I am trying to improve the documentation of Nutch while I'm going through
its classes.
I do that by creating tasks on JIRA.
Is that the correct way to go?
Thanks,
Diaa
Hi,
Is there a way to debug Nutch from Windows?
I followed the steps on https://wiki.apache.org/nutch/RunNutchInEclipse
and reached step 6; however, when I run the application it says:
"Cannot run program "chmod": CreateProcess error=2, The system cannot find
the file specified"
How would I go about fixing this?