[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic resolved NUTCH-570.
Resolution: Won't Fix
Improvement of URL Ordering in Generator.java
[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854665#action_12854665
]
Otis Gospodnetic commented on NUTCH-570:
I'm tempted to close this issue as Won't
[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851461#action_12851461
]
Otis Gospodnetic commented on NUTCH-570:
Serykh, what does your version of the patch
Personally, I don't see the advantage of Nutch going for a TLP. It's not like
new committers are having a hard time getting in today, it's not like they are
being proposed and rejected. I also don't feel like Nutch lacks
exposure/visibility -- lots of people know about it. It's just that
[
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-740:
---
Assignee: (was: Otis Gospodnetic)
Configuration option to override default language
Hello,
Has anyone seen this:
http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck ?
Is this something that needs to be addressed in Nutch (and thus in Bixo, and
thus in the common crawler project)?
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
[
https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-746:
---
Patch Info: [Patch Available]
Fix Version/s: 1.1
NutchBeanConstructor does not close
[
https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-731:
---
Fix Version/s: 1.1
Assignee: Otis Gospodnetic
Redirection of robots.txt
[
https://issues.apache.org/jira/browse/NUTCH-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic resolved NUTCH-742.
Resolution: Incomplete
Could you please post more detailed information to nutch-user
[
https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic resolved NUTCH-101.
Resolution: Fixed
Thank you Ken.
RobotRulesParser
Key
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714536#action_12714536
]
Otis Gospodnetic commented on NUTCH-739:
Yeah, sounds right. That Tool should make
Hi Kirby,
Do you think you could add this to Nutch's JIRA?
Please see http://wiki.apache.org/nutch/HowToContribute
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Kirby Bohling kirby.bohl...@gmail.com
To: nutch-dev@lucene.apache.org
[
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-740:
---
Priority: Minor (was: Major)
Affects Version/s: (was: 0.9.0)
Fix
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714286#action_12714286
]
Otis Gospodnetic commented on NUTCH-739:
Yes, external optimize calls will work, I
[
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic reassigned NUTCH-693:
--
Assignee: Otis Gospodnetic
Add configurable option for treating nofollow behaviour
[
https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712489#action_12712489
]
Otis Gospodnetic commented on NUTCH-731:
People have redirects on their robots.txt
[
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712492#action_12712492
]
Otis Gospodnetic commented on NUTCH-721:
Questions:
Has anyone tried profiling
[
https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712494#action_12712494
]
Otis Gospodnetic commented on NUTCH-721:
Ken's thoughts:
http://ken
Hello,
(I saw the first copy of this email went to nutch-user, but I assume nutch-dev
was a resend and the right list to follow up on)
I agree with the list of core competencies. For example, and I don't know
where I said/wrote this, but I know I said it a few times before -- I think
Solr is
I absolutely agree. Duplicating the work and focusing on non-core when the
same functionality can be gotten by using Tika is not wise for Nutch.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Andrzej Bialecki a...@getopt.org
To:
Hi,
This has been bugging me for a while now. For some reason Nutch MLs get the
most junk emails - both rude/rudeish emails, as well as clear spam (with
SPAM in the subject - something must be detecting it).
I just looked at the headers of the clearly labeled spam messages and found
that
If you use the Nutch-Solr functionality, you can rely on Solr's MoreLikeThis
and Solr's SpellCheckComponent (both are described on Solr's wiki)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: dealmaker vin...@gmail.com
To:
Absolutely! I see you are at home with JIRA, so I don't have to ask. :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Frank McCown fmcc...@harding.edu
To: nutch-dev@lucene.apache.org
Sent: Tuesday, March 3, 2009 9:39:24 AM
Subject:
[
https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-707:
---
Fix Version/s: (was: 0.9.0)
1.1
Generation of multiple segments
I believe Lucene has (in contrib/analyzers) a class called WordLoader or
something like that. Perhaps you can use that to load stopwords from a file
(like Solr does) and submit that as a patch?
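As a rough illustration of loading stopwords from a file, here is a minimal sketch that uses only the JDK rather than the Lucene loader class mentioned above. The class name `StopwordLoader`, the one-word-per-line format, the `#` comment convention, and the lowercasing are all assumptions made for this example, not part of Lucene, Solr, or Nutch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: reads a stopword list, one word per line. */
class StopwordLoader {

    /**
     * Reads stopwords from the given reader and closes it.
     * Blank lines and lines starting with '#' are ignored (assumed convention).
     */
    static Set<String> load(Reader in) {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (word.isEmpty() || word.startsWith("#")) {
                    continue; // skip blanks and comments
                }
                words.add(word);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

A caller would pass a `java.io.FileReader` for a real file; the resulting set could then be handed to whatever analyzer needs it.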
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
[
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668135#action_12668135
]
Otis Gospodnetic commented on NUTCH-628:
Thanks for the update. Sorry, I don't
[
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666763#action_12666763
]
Otis Gospodnetic commented on NUTCH-666:
Dennis, could you please describe how
[
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764
]
Otis Gospodnetic commented on NUTCH-628:
Could you take it if you have time, please
[
https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666283#action_12666283
]
Otis Gospodnetic commented on NUTCH-655:
1.1 sounds good to me.
Injecting Crawl
[
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666290#action_12666290
]
Otis Gospodnetic commented on NUTCH-628:
I'm +1 on getting Domain Stats into 1.0
[
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665482#action_12665482
]
Otis Gospodnetic commented on NUTCH-679:
I'm not sure, but committing this may mess
Lucene doesn't use anything.
Hadoop uses PMD, integrated in Hudson.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 10:49:44 AM
Subject: Re:
...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 1:13:20 PM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
versions
On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
wrote:
Lucene doesn't use anything.
Hadoop uses PMD, integrated in Hudson
: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 3:40:20 PM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
versions
On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
wrote:
That I don't know...
I don't see
[
https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic resolved NUTCH-627.
Resolution: Fixed
Thanks Otis.
Sending CHANGES.txt
Sending src/java/org
denied
[o...@minotaur /www/lucene.apache.org/nutch]$ chmod g+w skin/translations/.svn
chmod: skin/translations/.svn: Operation not permitted
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Otis Gospodnetic ogjunk-nu...@yahoo.com
To: nutch-dev
Hello,
Quick heads up - I'm about to regenerate the files (HTML + PDF) for the site
and update it tomorrow according to the instructions on
http://wiki.apache.org/nutch/Website_Update_HOWTO . I have Forrest 0.8, and
the site files were last generated with Forrest 0.7, so there will be some
# *.failonerror=(true|false) - stop when an XML file is invalid
#forrest.validate.failonerror=true
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Otis Gospodnetic otis_gospodne...@yahoo.com
To: Nutch Developer List nutch-dev@lucene.apache.org
To: nutch-dev@lucene.apache.org
Sent: Monday, January 5, 2009 5:28:45 PM
Subject: Re: Site update
Otis Gospodnetic wrote:
One more thing. Forrest 0.8 wouldn't generate site files without me
making the following change (so I'll commit this, too, unless
somebody thinks this is bad):
Does
[
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660397#action_12660397
]
Otis Gospodnetic commented on NUTCH-669:
Todd, and when you say sustained rate of 25
[
https://issues.apache.org/jira/browse/NUTCH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659610#action_12659610
]
Otis Gospodnetic commented on NUTCH-171:
But does generate.update.crawldb=true
[
https://issues.apache.org/jira/browse/NUTCH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659639#action_12659639
]
Otis Gospodnetic commented on NUTCH-171:
Hm, yes, it's nice to be able
[
https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic resolved NUTCH-675.
Resolution: Won't Fix
According to Dennis Kubes's response on the mailing list, this has
[
https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658610#action_12658610
]
Otis Gospodnetic commented on NUTCH-675:
Sha Feng, could you please bring this up
[
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-669:
---
Priority: Major (was: Minor)
Fix Version/s: 1.0.0
+1 -- people, vote
Hi,
Just found this email in my Nutch folder, and as I was reading it I was
thinking "Got to ask Dennis if he/they will do the Nutch-Droids integration"
when I saw Dennis' name below. So, Dennis, is Droids on the roadmap for you?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene -
[
https://issues.apache.org/jira/browse/NUTCH-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic resolved NUTCH-660.
Resolution: Invalid
I see you already asked on the list. That's the right place to ask
Hi,
Yes, Nutch has the ability to build N indices and query those N indices,
merging the results.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Mohammad Monirul Hoque [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Sunday, June
[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-570:
---
Assignee: Otis Gospodnetic
Another nudge for feedback from Ned or anyone else who tried
[
https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic reassigned NUTCH-627:
--
Assignee: Otis Gospodnetic
Minimize host address lookup
[
https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic reassigned NUTCH-629:
--
Assignee: Otis Gospodnetic
Detect slow and timeout servers and drop their URLs
[
https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590486#action_12590486
]
Otis Gospodnetic commented on NUTCH-596:
This looks beautifully simple to me! +1
: fetcher, generator
Reporter: Otis Gospodnetic
Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch
Nutch would benefit from having a DB with per-host/domain/TLD information.
For instance, Nutch could detect hosts that are timing out, store information
[
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-628:
---
Attachment: NUTCH-628-HostDb.patch
HostDatum.java
- really just holds a MapWritable
Reporter: Otis Gospodnetic
Fetch jobs will finish faster if we find a way to prevent servers that are
either slow or time out from slowing down the whole process.
I'll attach a patch that counts per-server exceptions and timeouts and tracks
download speed per server.
Queues
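The per-server bookkeeping described in this issue could look roughly like the sketch below. It is not the attached patch: the class name `HostFetchStats`, its methods, and the `maxFailures` threshold are hypothetical, chosen only to illustrate counting exceptions and timeouts and tracking download speed per host.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host fetch statistics; not the classes from the patch. */
class HostFetchStats {

    /** Counters kept for a single host. */
    static final class Stats {
        int exceptions;
        int timeouts;
        long bytes;   // total bytes downloaded
        long millis;  // total time spent downloading
    }

    private final Map<String, Stats> byHost = new HashMap<>();
    private final int maxFailures; // assumed cut-off for skipping a host

    HostFetchStats(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    private Stats statsFor(String host) {
        return byHost.computeIfAbsent(host, h -> new Stats());
    }

    void recordSuccess(String host, long bytes, long millis) {
        Stats s = statsFor(host);
        s.bytes += bytes;
        s.millis += millis;
    }

    void recordException(String host) {
        statsFor(host).exceptions++;
    }

    void recordTimeout(String host) {
        statsFor(host).timeouts++;
    }

    /** Average speed in bytes per second; 0 if nothing has been timed yet. */
    double bytesPerSecond(String host) {
        Stats s = byHost.get(host);
        return (s == null || s.millis == 0) ? 0.0 : s.bytes * 1000.0 / s.millis;
    }

    /** True once exceptions plus timeouts reach the assumed threshold. */
    boolean shouldSkip(String host) {
        Stats s = byHost.get(host);
        return s != null && s.exceptions + s.timeouts >= maxFailures;
    }
}
```

A fetcher could consult `shouldSkip(host)` before queueing more URLs for that host. A real implementation would also need thread safety, which this sketch leaves out.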
: fetcher, generator
Reporter: Otis Gospodnetic
Nutch would benefit from having a DB with per-host/domain/TLD information. For
instance, Nutch could detect hosts that are timing out, store information about
that in this DB. Segment/fetchlist Generator could then skip such hosts, so
[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587786#action_12587786
]
Otis Gospodnetic commented on NUTCH-570:
Ned - are you still using this? Still
Minimize host address lookup
Key: NUTCH-627
URL: https://issues.apache.org/jira/browse/NUTCH-627
Project: Nutch
Issue Type: Improvement
Components: generator
Reporter: Otis Gospodnetic
[
https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Otis Gospodnetic updated NUTCH-627:
---
Attachment: NUTCH-627.patch
Minimize host address lookup
[
https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577669#action_12577669
]
Otis Gospodnetic commented on NUTCH-296:
Steve:
I was going to say Great to see you
[
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547642
]
Otis Gospodnetic commented on NUTCH-585:
A more general solution is needed. This solution should not rely
[
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547672
]
Otis Gospodnetic commented on NUTCH-442:
Doğacan -- can you please explain what you mean by blog up your
[
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547741
]
Otis Gospodnetic commented on NUTCH-442:
Doğacan - ah, good!
The Nutch side of the functionality included
[
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543427
]
Otis Gospodnetic commented on NUTCH-442:
Doğacan - your comments sound good and I'd guess bean piece should
Those using IntelliJ or Eclipse may want to grab code styles for Lucene (and
Solr, Nutch, and Hadoop) that Grant and I put in
https://issues.apache.org/jira/browse/SOLR-245 . I hope they are helpful. The
plan is to stick them on the Wiki (and link from HowToContribute pages?).
Otis
[
https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474663
]
Otis Gospodnetic commented on NUTCH-447:
The idea being to limit crawling only to links under a certain
[
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472078
]
Otis Gospodnetic commented on NUTCH-444:
The ASF FeedParser you are talking about has, I believe, continued
Hi,
I've been meaning to write this message for a while, and Andrzej's
StrategicGoals made me compose it, finally.
Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes,
it will be even more valuable than it is today. However, I think there is
still a need for
Stephane,
Nutch uses Lucene for indexing, and Lucene has a class called IndexWriter that
is used for indexing Lucene Documents. Here is a quick grep in Nutch's *java
files:
$ ffjg -l IndexWriter
./src/test/org/apache/nutch/indexer/TestDeleteDuplicates.java
[
http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12444510 ]
Otis Gospodnetic commented on NUTCH-389:
Enis:
Can you give us some examples of how URLs were tokenized before, and how they
are tokenized with your patch
[
http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12444514 ]
Otis Gospodnetic commented on NUTCH-61:
---
Has anyone been using the code with this patch applied? Just wondering if/how
well it works.
Adaptive re-fetch
[
http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ]
Otis Gospodnetic commented on NUTCH-387:
This indeed looks wrong.
My guess is that the new URL() line just needs to be removed, but I'm not
sure, so
[
http://issues.apache.org/jira/browse/NUTCH-377?page=comments#action_12439016 ]
Otis Gospodnetic commented on NUTCH-377:
You'd need to modify ./src/java/org/apache/nutch/analysis/NutchAnalysis.jj and
regenerate the .java files
[
http://issues.apache.org/jira/browse/NUTCH-359?page=comments#action_12433315 ]
Otis Gospodnetic commented on NUTCH-359:
Looks fine and simple (and has a small typo in the last comment). Sami is
doing 0.8.1 soon, so I won't mess
Build: plugins' Jars not found
--
Key: NUTCH-347
URL: http://issues.apache.org/jira/browse/NUTCH-347
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Otis Gospodnetic
While
[
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12427677 ]
Otis Gospodnetic commented on NUTCH-233:
I haven't noticed this regexp being a problem so far either, but maybe I've
just been lucky not to have run
[
http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329473 ]
Otis Gospodnetic commented on NUTCH-92:
---
I recall a discussion on lucene-dev list several (6+?) months back about this
or very similar issue. Lucene's MultiSearcher has
[
http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329476 ]
Otis Gospodnetic commented on NUTCH-92:
---
Ah, you are right, I remember this getting in the core. As a matter of fact,
it might have been me who committed it in the end