[jira] Resolved: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-12 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-570. Resolution: Won't Fix Improvement of URL Ordering in Generator.java

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-07 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854665#action_12854665 ] Otis Gospodnetic commented on NUTCH-570: I'm tempted to close this issue as Won't

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851461#action_12851461 ] Otis Gospodnetic commented on NUTCH-570: Serykh, what does your version of the patch

Re: [DISCUSS] Nutch as a top level project (TLP)?

2010-03-20 Thread Otis Gospodnetic
Personally, I don't see the advantage of Nutch going for a TLP. It's not like new committers are having a hard time getting in today, it's not like they are being proposed and rejected. I also don't feel like Nutch lacks exposure/visibility -- lots of people know about it. It's just that

[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-16 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-740: --- Assignee: (was: Otis Gospodnetic) Configuration option to override default language

java.net.URL synchronization

2009-12-09 Thread Otis Gospodnetic
Hello, Has anyone seen this: http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck ? Is this something that needs to be addressed in Nutch (and thus in Bixo, and thus in the common crawler project)? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
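
The linked post is not reproduced here, so the following is only a sketch of one commonly suggested mitigation, assuming the concern is that `java.net.URL.equals()` and `hashCode()` can trigger host name resolution. It keys a set on `java.net.URI`, which compares URLs as parsed text. The class and method names are hypothetical and are not taken from Nutch or Bixo.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

public class UrlKeySketch {
    /**
     * Counts distinct URLs using java.net.URI keys.
     * URI.equals()/hashCode() work on the parsed components and
     * never resolve host names, unlike java.net.URL.
     */
    public static int countDistinct(String... urls) {
        Set<URI> seen = new HashSet<>();
        for (String u : urls) {
            // normalize() collapses "." and ".." path segments
            seen.add(URI.create(u).normalize());
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(
            "http://example.com/a/../b",
            "http://example.com/b",
            "http://example.org/b"));
    }
}
```

Whether this addresses the specific bottleneck described in the post would need to be checked against the post itself.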

[jira] Updated: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.

2009-08-04 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-746: --- Patch Info: [Patch Available] Fix Version/s: 1.1 NutchBeanConstructor does not close

[jira] Updated: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-731: --- Fix Version/s: 1.1 Assignee: Otis Gospodnetic Redirection of robots.txt

[jira] Resolved: (NUTCH-742) Checksum Error

2009-06-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-742. Resolution: Incomplete Could you please post more detailed information to nutch-user

[jira] Resolved: (NUTCH-101) RobotRulesParser

2009-06-19 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-101. Resolution: Fixed Thank you Ken. RobotRulesParser Key

[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-29 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714536#action_12714536 ] Otis Gospodnetic commented on NUTCH-739: Yeah, sounds right. That Tool should make

Re: Remove duplicate nutch conf files from .job file

2009-05-28 Thread Otis Gospodnetic
Hi Kirby, Do you think you could add this to Nutch's JIRA? Please see http://wiki.apache.org/nutch/HowToContribute Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Kirby Bohling kirby.bohl...@gmail.com To: nutch-dev@lucene.apache.org

[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2009-05-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-740: --- Priority: Minor (was: Major) Affects Version/s: (was: 0.9.0) Fix

[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714286#action_12714286 ] Otis Gospodnetic commented on NUTCH-739: Yes, external optimize calls will work, I

[jira] Assigned: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2009-05-27 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned NUTCH-693: -- Assignee: Otis Gospodnetic Add configurable option for treating nofollow behaviour

[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-05-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712489#action_12712489 ] Otis Gospodnetic commented on NUTCH-731: People have redirects on their robots.txt

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-05-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712492#action_12712492 ] Otis Gospodnetic commented on NUTCH-721: Questions: Has anyone tried profiling

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-05-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712494#action_12712494 ] Otis Gospodnetic commented on NUTCH-721: Ken's thoughts: http://ken

Re: The Future of Nutch, reactivated

2009-05-23 Thread Otis Gospodnetic
Hello, (I saw the first copy of this email went to nutch-user, but I assume nutch-dev was a resend and the right list to follow-up on) I agree with the list of core competencies. For example, and I don't know where I said/wrote this, but I know I said it a few times before -- I think Solr is

Re: Moving Nutch parsers to Tika

2009-03-10 Thread Otis Gospodnetic
I absolutely agree. Duplicating the work and focusing on non-core when the same functionality can be gotten by using Tika is not wise for Nutch. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Andrzej Bialecki a...@getopt.org To:

Nutch ML cleanup

2009-03-09 Thread Otis Gospodnetic
Hi, This has been bugging me for a while now. For some reason Nutch MLs get the most junk emails - both rude/rudeish emails, as well as clear spam (with SPAM in the subject - something must be detecting it). I just looked at the headers of the clearly labeled spam messages and found that

Re: Is there the functions of More Like This and Spell Checking?

2009-03-03 Thread Otis Gospodnetic
If you use the Nutch-Solr functionality, you can rely on Solr's MoreLikeThis and Solr's SpellCheckComponent (both are described on Solr's wiki) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: dealmaker vin...@gmail.com To:

Re: site: operator with no query term

2009-03-03 Thread Otis Gospodnetic
Absolutely! I see you are at home with JIRA, so I don't have to ask. :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Frank McCown fmcc...@harding.edu To: nutch-dev@lucene.apache.org Sent: Tuesday, March 3, 2009 9:39:24 AM Subject:

[jira] Updated: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

2009-02-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-707: --- Fix Version/s: (was: 0.9.0) 1.1 Generation of multiple segments

Re: NutchAnalysis.java STOP_WORDS not configurable?

2009-02-27 Thread Otis Gospodnetic
I believe Lucene has (in contrib/analyzers) a class called WordLoader or something like that. Perhaps you can use that to load stopwords from a file (like Solr does) and submit that as a patch? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message
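
A minimal sketch of the idea in this reply, assuming a plain-text stopword file with one word per line and `#` for comments. It uses only the JDK and does not call the Lucene class mentioned above; the class name `StopWordFileSketch` and the file format are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordFileSketch {
    /** Reads one stopword per line, skipping blank lines and '#' comments. */
    public static Set<String> load(Reader source) {
        Set<String> words = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // A StringReader stands in for a FileReader on a real stopword file.
        Set<String> stop = load(new StringReader("# sample list\nthe\nAnd\n\nof\n"));
        System.out.println(stop);
    }
}
```

A patch along these lines would still have to replace the hard-coded STOP_WORDS in the generated analysis code, which this sketch does not attempt.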

[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668135#action_12668135 ] Otis Gospodnetic commented on NUTCH-628: Thanks for the update. Sorry, I don't

[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666763#action_12666763 ] Otis Gospodnetic commented on NUTCH-666: Dennis, could you please describe how

[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764 ] Otis Gospodnetic commented on NUTCH-628: Could you take it if you have time, please

[jira] Commented: (NUTCH-655) Injecting Crawl metadata

2009-01-22 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666283#action_12666283 ] Otis Gospodnetic commented on NUTCH-655: 1.1 sounds good to me. Injecting Crawl

[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-22 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666290#action_12666290 ] Otis Gospodnetic commented on NUTCH-628: I'm +1 on getting Domain Stats into 1.0

[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool

2009-01-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665482#action_12665482 ] Otis Gospodnetic commented on NUTCH-679: I'm not sure, but committing this may mess

Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
Lucene doesn't use anything. Hadoop uses PMD integrated in Hudson. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney doga...@gmail.com To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 10:49:44 AM Subject: Re:

Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
...@gmail.com To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 1:13:20 PM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic wrote: Lucene doesn't use anything. Hadoop uses PMD integrated in Hudson

Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
: Doğacan Güney doga...@gmail.com To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 3:40:20 PM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic wrote: That I don't know... I don't see

[jira] Resolved: (NUTCH-627) Minimize host address lookup

2009-01-13 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-627. Resolution: Fixed Thanks Otis. Sending CHANGES.txt Sending src/java/org

Re: Site update

2009-01-06 Thread Otis Gospodnetic
denied [o...@minotaur /www/lucene.apache.org/nutch]$ chmod g+w skin/translations/.svn chmod: skin/translations/.svn: Operation not permitted Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Otis Gospodnetic ogjunk-nu...@yahoo.com To: nutch-dev

Site update

2009-01-05 Thread Otis Gospodnetic
Hello, Quick heads up - I'm about to regenerate the files (HTML + PDF) for the site and update it tomorrow according to the instructions on http://wiki.apache.org/nutch/Website_Update_HOWTO . I have Forrest 0.8, and the site files were last generated with Forrest 0.7, so there will be some

Re: Site update

2009-01-05 Thread Otis Gospodnetic
# *.failonerror=(true|false) - stop when an XML file is invalid #forrest.validate.failonerror=true Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Otis Gospodnetic otis_gospodne...@yahoo.com To: Nutch Developer List nutch-dev@lucene.apache.org

Re: Site update

2009-01-05 Thread Otis Gospodnetic
To: nutch-dev@lucene.apache.org Sent: Monday, January 5, 2009 5:28:45 PM Subject: Re: Site update Otis Gospodnetic wrote: One more thing. Forrest 0.8 wouldn't generate site files without me making the following change (so I'll commit this, too, unless somebody thinks this is bad): Does

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-02 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660397#action_12660397 ] Otis Gospodnetic commented on NUTCH-669: Todd, and when you say sustained rate of 25

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2008-12-29 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659610#action_12659610 ] Otis Gospodnetic commented on NUTCH-171: But does generate.update.crawldb=true

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2008-12-29 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659639#action_12659639 ] Otis Gospodnetic commented on NUTCH-171: Hm, yes, it's nice to be able

[jira] Resolved: (NUTCH-675) Reduce tasks do not report their status and are killed by jobtracker

2008-12-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-675. Resolution: Won't Fix According to Dennis Kubes's response on the mailing list, this has

[jira] Commented: (NUTCH-675) Reduce tasks do not report their status and are killed by jobtracker

2008-12-22 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658610#action_12658610 ] Otis Gospodnetic commented on NUTCH-675: Sha Feng, could you please bring this up

[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-10 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-669: --- Priority: Major (was: Minor) Fix Version/s: 1.0.0 +1 -- people, vote

Re: Droids crawler

2008-11-13 Thread Otis Gospodnetic
Hi, Just found this email in my Nutch folder, and as I was reading it I was thinking "Got to ask Dennis if he/they will do the Nutch-Droids integration" when I saw Dennis' name below. So, Dennis, is Droids on the roadmap for you? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene -

[jira] Resolved: (NUTCH-660) Does anybody know how to let nutch crawl this kind of website?

2008-11-11 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-660. Resolution: Invalid I see you already asked on the list. That's the right place to ask

Re: need some help about distribution

2008-06-22 Thread Otis Gospodnetic
Hi, Yes, Nutch has the ability to build N indices and query those N indices, merging the results. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mohammad Monirul Hoque [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Sunday, June
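
As a rough illustration of querying N indices and merging the results, the sketch below combines per-index hit lists and keeps the top k by score. It is not Nutch's distributed search code: the `Hit` type and `mergeTopK` method are hypothetical, and it assumes scores from different indices are directly comparable, which may not hold in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MergeHitsSketch {
    /** Hypothetical hit type: a URL plus the score reported by one index. */
    public static final class Hit {
        public final String url;
        public final float score;

        public Hit(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Merges per-index hit lists and returns the k highest-scoring hits. */
    public static List<Hit> mergeTopK(List<List<Hit>> perIndexHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        // Highest score first
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> first = Arrays.asList(
            new Hit("http://a.example/1", 0.9f), new Hit("http://a.example/2", 0.4f));
        List<Hit> second = Arrays.asList(new Hit("http://b.example/1", 0.7f));
        for (Hit h : mergeTopK(Arrays.asList(first, second), 2)) {
            System.out.println(h.url + " " + h.score);
        }
    }
}
```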

[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

2008-05-21 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-570: --- Assignee: Otis Gospodnetic Another nudge for feedback from Ned or anyone else who tried

[jira] Assigned: (NUTCH-627) Minimize host address lookup

2008-05-21 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned NUTCH-627: -- Assignee: Otis Gospodnetic Minimize host address lookup

[jira] Assigned: (NUTCH-629) Detect slow and timeout servers and drop their URLs

2008-05-21 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned NUTCH-629: -- Assignee: Otis Gospodnetic Detect slow and timeout servers and drop their URLs

[jira] Commented: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS

2008-04-18 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590486#action_12590486 ] Otis Gospodnetic commented on NUTCH-596: This looks beautifully simple to me! +1

[jira] Issue Comment Edited: (NUTCH-628) Host database to keep track of host-level information

2008-04-17 Thread Otis Gospodnetic (JIRA)
: fetcher, generator Reporter: Otis Gospodnetic Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch Nutch would benefit from having a DB with per-host/domain/TLD information. For instance, Nutch could detect hosts that are timing out, store information

[jira] Updated: (NUTCH-628) Host database to keep track of host-level information

2008-04-16 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-628: --- Attachment: NUTCH-628-HostDb.patch HostDatum.java - really just holds a MapWritable

[jira] Created: (NUTCH-629) Detect slow and timeout servers and drop their URLs

2008-04-12 Thread Otis Gospodnetic (JIRA)
Reporter: Otis Gospodnetic Fetch jobs will finish faster if we find a way to prevent servers that are either slow or time out from slowing down the whole process. I'll attach a patch that counts per-server exceptions and timeouts and tracks download speed per server. Queues
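
The patch itself is not shown in this listing, so the following is only a sketch of the per-server counting idea described above. It counts failures per host and reports when a host has crossed a threshold; the class name, the threshold parameter, and the decision to skip rather than slow down are all assumptions, and download-speed tracking is left out.

```java
import java.util.HashMap;
import java.util.Map;

public class HostFailureTrackerSketch {
    private final int maxFailures;
    private final Map<String, Integer> failures = new HashMap<>();

    /** @param maxFailures hypothetical threshold after which a host is skipped */
    public HostFailureTrackerSketch(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Records one exception or timeout for the given host. */
    public void recordFailure(String host) {
        failures.merge(host, 1, Integer::sum);
    }

    /** Returns true once the host has reached the failure threshold. */
    public boolean shouldSkip(String host) {
        return failures.getOrDefault(host, 0) >= maxFailures;
    }

    public static void main(String[] args) {
        HostFailureTrackerSketch tracker = new HostFailureTrackerSketch(2);
        tracker.recordFailure("slow.example.com");
        tracker.recordFailure("slow.example.com");
        System.out.println(tracker.shouldSkip("slow.example.com"));
        System.out.println(tracker.shouldSkip("fast.example.com"));
    }
}
```

This version is not thread-safe; a fetcher with multiple threads would need a concurrent map or external synchronization.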

[jira] Created: (NUTCH-628) Host database to keep track of host-level information

2008-04-11 Thread Otis Gospodnetic (JIRA)
: fetcher, generator Reporter: Otis Gospodnetic Nutch would benefit from having a DB with per-host/domain/TLD information. For instance, Nutch could detect hosts that are timing out, store information about that in this DB. Segment/fetchlist Generator could then skip such hosts, so

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2008-04-10 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587786#action_12587786 ] Otis Gospodnetic commented on NUTCH-570: Ned - are you still using this? Still

[jira] Created: (NUTCH-627) Minimize host address lookup

2008-04-09 Thread Otis Gospodnetic (JIRA)
Minimize host address lookup Key: NUTCH-627 URL: https://issues.apache.org/jira/browse/NUTCH-627 Project: Nutch Issue Type: Improvement Components: generator Reporter: Otis Gospodnetic

[jira] Updated: (NUTCH-627) Minimize host address lookup

2008-04-09 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-627: --- Attachment: NUTCH-627.patch Minimize host address lookup

[jira] Commented: (NUTCH-296) Image Search

2008-03-11 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577669#action_12577669 ] Otis Gospodnetic commented on NUTCH-296: Steve: I was going to say Great to see you

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2007-12-02 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547642 ] Otis Gospodnetic commented on NUTCH-585: A more general solution is needed. This solution should not rely

[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-12-02 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547672 ] Otis Gospodnetic commented on NUTCH-442: Doğacan -- can you please explain what you mean by blog up your

[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-12-02 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547741 ] Otis Gospodnetic commented on NUTCH-442: Doğacan - ah, good! The Nutch side of the functionality included

[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-11-18 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543427 ] Otis Gospodnetic commented on NUTCH-442: Doğacan - your comments sound good and I'd guess bean piece should

IntelliJ Eclipse Lucene code styles available

2007-05-23 Thread Otis Gospodnetic
Those using IntelliJ or Eclipse may want to grab code styles for Lucene (and Solr, Nutch, and Hadoop) that Grant and I put in https://issues.apache.org/jira/browse/SOLR-245 . I hope they are helpful. The plan is to stick them on the Wiki (and link from HowToContribute pages?). Otis . .

[jira] Commented: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-21 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474663 ] Otis Gospodnetic commented on NUTCH-447: The idea being to limit crawling only to links under a certain

[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-11 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472078 ] Otis Gospodnetic commented on NUTCH-444: The ASF FeedParser you are talking about has, I believe, continued

Reviving Nutch 0.7

2007-01-21 Thread Otis Gospodnetic
Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for

Re: Nutch Indexing

2006-10-26 Thread Otis Gospodnetic
Stephane, Nutch uses Lucene for indexing, and Lucene has a class called IndexWriter that is used for indexing Lucene Documents. Here is a quick grep in Nutch's *java files: $ ffjg -l IndexWriter ./src/test/org/apache/nutch/indexer/TestDeleteDuplicates.java

[jira] Commented: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-24 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12444510 ] Otis Gospodnetic commented on NUTCH-389: Enis: Can you give us some examples of how URLs were tokenized before, and how they are tokenized with your patch

[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2006-10-24 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12444514 ] Otis Gospodnetic commented on NUTCH-61: --- Has anyone been using the code with this patch applied? Just wondering if/how well it works. Adaptive re-fetch

[jira] Commented: (NUTCH-387) host normalization in Generator$Selector

2006-10-20 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ] Otis Gospodnetic commented on NUTCH-387: This indeed looks wrong. My guess is that the new URL() line just needs to be removed, but I'm not sure, so

[jira] Commented: (NUTCH-377) Add possibility to search for multiple values

2006-10-01 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-377?page=comments#action_12439016 ] Otis Gospodnetic commented on NUTCH-377: You'd need to modify ./src/java/org/apache/nutch/analysis/NutchAnalysis.jj and regenerate the .java files

[jira] Commented: (NUTCH-359) extraction of links will fail for whole page if one single link cannot be parsed

2006-09-07 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-359?page=comments#action_12433315 ] Otis Gospodnetic commented on NUTCH-359: Looks fine and simple (and has a small typo in the last comment). Sami is doing 0.8.1 soon, so I won't mess

[jira] Created: (NUTCH-347) Build: plugins' Jars not found

2006-08-11 Thread Otis Gospodnetic (JIRA)
Build: plugins' Jars not found -- Key: NUTCH-347 URL: http://issues.apache.org/jira/browse/NUTCH-347 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Otis Gospodnetic While

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-11 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12427677 ] Otis Gospodnetic commented on NUTCH-233: I haven't noticed this regexp being a problem so far either, but maybe I've just been lucky not to have run

[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2005-09-15 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329473 ] Otis Gospodnetic commented on NUTCH-92: --- I recall a discussion on lucene-dev list several (6+?) months back about this or a very similar issue. Lucene's MultiSearcher has

[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2005-09-15 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329476 ] Otis Gospodnetic commented on NUTCH-92: --- Ah, you are right, I remember this getting in the core. As a matter of fact, it might have been me who committed it in the end