Currently the fetcher class is a 1,500-line piece of code.
I'd like to suggest splitting it up into multiple files to improve
the readability and maintainability of the code, instead of keeping this one
big class with many nested classes.
The classes are grouped anyway by the fetcher namespace, so having them in
separate files would preserve that grouping.
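As a sketch of how the split could look (the class shape below is illustrative, not the actual Nutch code), a static nested class would simply move into its own file in the same fetcher package, so callers only drop the outer-class prefix:

```java
// Hypothetical sketch: a nested class extracted into its own file.
// Before: Fetcher.FetchItem, a static nested class inside Fetcher.java.
// After: FetchItem.java (this file) in the same org.apache.nutch.fetcher
// package, so the namespace grouping is unchanged; call sites only
// change "Fetcher.FetchItem" to "FetchItem".
public class FetchItem {
    private final String url;
    private final String queueId;

    public FetchItem(String url, String queueId) {
        this.url = url;
        this.queueId = queueId;
    }

    public String getUrl() { return url; }
    public String getQueueId() { return queueId; }
}
```

Since the class was already static (no hidden reference to the outer instance), the extraction is purely mechanical and behavior-preserving.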
I meant writing batch/cmd scripts for Windows that don't require Cygwin.
I was thinking of writing those scripts but wanted to check if people think
it's a good idea.
On Sunday, May 18, 2014, Julien Nioche
wrote:
> Hi
>
>
>> Currently nutch isn't very friendly to windows users as it requires
>>
Hi,
Currently Nutch isn't very friendly to Windows users, as it requires Cygwin
to run, and there are a lot of issues with the Hadoop 1.x branch that Nutch
bundles, due to the "set tmp permission" issue.
What do you think about doing two things:
1. Move to Hadoop 2.4 to support both Windows and Linux
Hi,
In some cases when you crawl a website you already know many page URLs that
share a similar structure.
For example, on IMDb, entertainment artists have the following link structure:
http://www.imdb.com/name/nm1/
http://www.imdb.com/name/nm2/
http://www.imdb.com/name/nm6499112/
How about allowing Nutch to inject a URL pattern with a numeric range
instead of listing every URL explicitly?
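One possible shape for a feature like this (purely a sketch, not an existing Nutch option; the class name and placeholder syntax are invented here) is a seed-pattern expander that turns a template plus a numeric range into concrete URLs before injection:

```java
import java.util.ArrayList;
import java.util.List;

public class SeedPatternExpander {
    // Expand a template like "http://www.imdb.com/name/nm{}/" over
    // [start, end], replacing the "{}" placeholder with each number
    // in the range. Returns the concrete URLs in order.
    public static List<String> expand(String template, long start, long end) {
        List<String> urls = new ArrayList<>();
        for (long i = start; i <= end; i++) {
            urls.add(template.replace("{}", Long.toString(i)));
        }
        return urls;
    }
}
```

For example, `expand("http://www.imdb.com/name/nm{}/", 1, 6499112)` would cover the whole IMDb name range shown above without a multi-million-line seed file.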
Thanks!
Created a JIRA issue with the patch
https://issues.apache.org/jira/browse/NUTCH-1783
On Tue, May 13, 2014 at 12:19 AM, Markus Jelsma
wrote:
> Hi Diaa,
>
> Yes, you can open an issue for these fixes and attach patches if you can.
>
> Cheers,
> Markus
>
>
Hi,
I noticed that Nutch doesn't handle cleaning up (removing temp folders) in
case of error.
In the following classes temp directories are created but not removed when
there is an error:
1. Injector
2. CrawlDBReader
3. Deduplication
4. SegmentReader
For example in injector you find:
RunningJob ma
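A try/finally cleanup would address this in each of the four classes. The sketch below uses `java.nio` rather than the Hadoop `FileSystem` API the real classes use, and the interface name is invented for illustration, but the shape of the fix is the same:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class TempDirCleanup {
    public interface TempAction {
        void run(Path tmpDir) throws IOException;
    }

    // Run a job-like action against a fresh temp directory and delete the
    // directory in a finally block, so it disappears on both the success
    // and the failure path. (A production version would also take care not
    // to let a cleanup failure mask the original exception.)
    public static void runWithTempDir(String prefix, TempAction action)
            throws IOException {
        Path tmp = Files.createTempDirectory(prefix);
        try {
            action.run(tmp);
        } finally {
            try (Stream<Path> walk = Files.walk(tmp)) {
                walk.sorted(Comparator.reverseOrder())     // children first,
                    .forEach(p -> p.toFile().delete());    // then parents
            }
        }
    }
}
```

In the Hadoop-based classes the finally block would call `fs.delete(tempDir, true)` instead, but the try/finally structure is identical.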
Anyone wanna commit this?
On Mon, Apr 28, 2014 at 12:04 AM, Diaa (JIRA) wrote:
>
> [
> https://issues.apache.org/jira/browse/NUTCH-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982485#comment-13982485]
>
> Diaa commented on NUTCH-1766:
> -
Hi,
I tried injecting www.google.com into my crawldb without prepending
http:// to it.
It injected it fine; however, when I ran generate on it, it gave the
following warning:
"Malformed URL: 'www.google.com', skipping (java.net.MalformedURLException:
no protocol: www.google.com"
Why doesn't Nutch assume a default protocol of http:// in that case?
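A normalization step that would avoid the warning (a sketch of the idea, not what Nutch currently does; the class name is invented here) is to prepend a default scheme before parsing:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SchemeDefaulter {
    // If the seed already carries a scheme (http://, https://, ftp://, ...),
    // leave it alone; otherwise assume http://.
    public static String withDefaultScheme(String seed) {
        if (seed.matches("^[a-zA-Z][a-zA-Z0-9+.-]*://.*")) {
            return seed;
        }
        return "http://" + seed;
    }

    // Parse after defaulting; "www.google.com" no longer triggers
    // MalformedURLException ("no protocol").
    public static URL parse(String seed) throws MalformedURLException {
        return new URL(withDefaultScheme(seed));
    }
}
```

The trade-off is that a typo like `htp://example.com` would no longer fail loudly, which may be why inject currently leaves seeds untouched.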
> https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
>
> What do you think?
>
>
>
> On Thu, Apr 24, 2014 at 5:01 PM, Diaa Abdallah wrote:
>
>> Hi,
>> I am trying to improve the documentation of Nutch while I'm going through
> http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
>
> Sebastian
>
> On 04/23/2014 04:57 PM, Diaa Abdallah wrote:
> > Hi,
> > Is there a way to debug nutch from Windows?
> > I followed the steps on https://wiki.apache.org/nutch/RunNutchInEclipse
> > and reached step 6; however, when I run the application it says:
Hi,
I am trying to improve the documentation of Nutch while I'm going through
its classes.
I do that by creating tasks on JIRA.
Is that the correct way to go?
Thanks,
Diaa
Hi,
Is there a way to debug Nutch from Windows?
I followed the steps on https://wiki.apache.org/nutch/RunNutchInEclipse
and reached step 6; however, when I run the application it says:
"Cannot run program "chmod": CreateProcess error=2, The system cannot find
the file specified"
How would I go about fixing this?