Re: Nutch Indexing

2006-10-26 Thread Otis Gospodnetic
Stephane, Nutch uses Lucene for indexing, and Lucene has a class called IndexWriter that is used for indexing Lucene Documents. Here is a quick grep in Nutch's *java files: $ ffjg -l IndexWriter ./src/test/org/apache/nutch/indexer/TestDeleteDuplicates.java ./src/java/org/apache/nutch/indexer/I

Reviving Nutch 0.7

2007-01-21 Thread Otis Gospodnetic
Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something

IntelliJ & Eclipse Lucene code styles available

2007-05-22 Thread Otis Gospodnetic
Those using IntelliJ or Eclipse may want to grab code styles for Lucene (and Solr, Nutch, and Hadoop) that Grant and I put in https://issues.apache.org/jira/browse/SOLR-245 . I hope they are helpful. The plan is to stick them on the Wiki (and link from HowToContribute pages?). Otis . . .

Adding Otis to JIRA

2008-05-21 Thread Otis Gospodnetic
Hi, I was about to go assign some JIRA issues to myself and get the commits going when I noticed that I'm not in Nutch JIRA yet. Could somebody please add me there? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: need some help about distribution

2008-06-22 Thread Otis Gospodnetic
Hi, Yes, Nutch has the ability to build N indices and query those N indices, merging the results. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mohammad Monirul Hoque <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Sunday, June 2
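A minimal sketch of the idea in the reply above: query N indices separately, then merge the per-index hit lists into one ranked list. This is not Nutch's actual code. The `merge_hits` name and the `(score, doc_id)` tuple shape are hypothetical, and the sketch assumes each index returns hits already sorted by descending score and that scores are comparable across indices.

```python
import heapq
import itertools
from typing import Iterable, List, Tuple

# Hypothetical hit representation: (score, document id).
Hit = Tuple[float, str]


def merge_hits(per_index_hits: Iterable[List[Hit]], top_n: int) -> List[Hit]:
    """Merge hit lists from several indices into a single top-N list.

    Each input list must already be sorted by descending score.
    """
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*per_index_hits, key=lambda hit: -hit[0])
    # Only the first top_n merged hits are consumed.
    return list(itertools.islice(merged, top_n))


if __name__ == "__main__":
    index_a = [(0.9, "a1"), (0.4, "a2")]
    index_b = [(0.7, "b1"), (0.1, "b2")]
    print(merge_hits([index_a, index_b], top_n=3))
```

Because the merge is lazy, only as many hits as requested are pulled from each list, which keeps the cost low when N is large and only the first page of results is needed.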

Re: Timeline for 1.0 release?

2008-06-22 Thread Otis Gospodnetic
Hi Dave, It's really mostly about closing out some of the open bugs and going through the release process. My guess is we'll have 1.0 this Fall. Otis - Original Message > From: David Grandinetti <[EMAIL PROTECTED]> > To: nutch-dev@lucene.apache.org > Sent: Monday, June 23, 2008 5:16

New algo: Near duplicate detection

2008-08-07 Thread Otis Gospodnetic
This sounds simple and apparently it's effective...should anyone want to give it a try: http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: Droids crawler

2008-11-13 Thread Otis Gospodnetic
Hi, Just found this email in my Nutch folder and as I was reading it was thinking "Got to ask Dennis if he/they will do the Nutch-Droids integration" when I saw Dennis' name below. So, Dennis, is Droids on the roadmap for you? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - S

Site update

2009-01-05 Thread Otis Gospodnetic
Hello, Quick heads up - I'm about to regenerate the files (HTML + PDF) for the site and update it tomorrow according to the instructions on http://wiki.apache.org/nutch/Website_Update_HOWTO . I have Forrest 0.8, and the site files were last generated with Forrest 0.7, so there will be some ch

Re: Site update

2009-01-05 Thread Otis Gospodnetic
emap=false # *.failonerror=(true|false) - stop when an XML file is invalid #forrest.validate.failonerror=true Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message ---- > From: Otis Gospodnetic > To: Nutch Developer List > Sent: Monday, January

Re: Site update

2009-01-05 Thread Otis Gospodnetic
zed /home/otis/apache-forrest/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2107:29: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized BUILD FAILED /home/otis/apache-forrest/main/targets/validate.xml:158: Validation failed, messages should hav

Re: Site update

2009-01-05 Thread Otis Gospodnetic
Site update > > http://www.mail-archive.com/d...@forrest.apache.org/msg15136.html > > This might help. > > Dennis > > Andrzej Bialecki wrote: > > Otis Gospodnetic wrote: > >> Below is what it spits out. I'm not sure what the cause is. I did

Re: Site update

2009-01-06 Thread Otis Gospodnetic
lations/.svn/foo: Permission denied [o...@minotaur /www/lucene.apache.org/nutch]$ chmod g+w skin/translations/.svn chmod: skin/translations/.svn: Operation not permitted Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message ---- > From: Otis Gospodnetic >

Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
Lucene doesn't use anything. Hadoop uses PMD integrated in Hudson. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Doğacan Güney > To: nutch-dev@lucene.apache.org > Sent: Tuesday, January 20, 2009 10:49:44 AM > Subject: Re: [jira] Created: (

Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
oğacan Güney > To: nutch-dev@lucene.apache.org > Sent: Tuesday, January 20, 2009 1:13:20 PM > Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest > versions > > On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic > wrote: > > Lucene doesn't use

Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
--- > From: Doğacan Güney > To: nutch-dev@lucene.apache.org > Sent: Tuesday, January 20, 2009 3:40:20 PM > Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest > versions > > On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic > wrote: > > That I do

Re: NutchAnalysis.java STOP_WORDS not configurable?

2009-02-27 Thread Otis Gospodnetic
I believe Lucene has (in contrib/analyzers) a class called WordLoader or something like that. Perhaps you can use that to load stopwords from a file (like Solr does) and submit that as a patch? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message
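A minimal sketch of what loading stopwords from a file could look like, in the spirit of the suggestion above. It does not reproduce Lucene's loader class or Solr's behaviour. The function names, the one-word-per-line format, the lowercasing, and the `#` comment convention are all assumptions made for illustration.

```python
from pathlib import Path
from typing import FrozenSet, Iterable


def parse_stopwords(lines: Iterable[str], comment_prefix: str = "#") -> FrozenSet[str]:
    """Build a stopword set from text lines, one word per line.

    Blank lines and lines starting with comment_prefix are skipped
    (an assumed file convention, not taken from the original message).
    """
    words = set()
    for raw_line in lines:
        word = raw_line.strip().lower()
        if word and not word.startswith(comment_prefix):
            words.add(word)
    return frozenset(words)


def load_stopwords(path: str) -> FrozenSet[str]:
    """Read a UTF-8 stopword file from disk and parse it."""
    return parse_stopwords(Path(path).read_text(encoding="utf-8").splitlines())


if __name__ == "__main__":
    sample = ["# English stopwords", "the", "And", "", "  of  "]
    print(sorted(parse_stopwords(sample)))
```

Keeping the parsing separate from the file access makes the list configurable without recompiling, which is the point of the request in the thread subject.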

Re: Is there the functions of "More Like This" and "Spell Checking"?

2009-03-03 Thread Otis Gospodnetic
If you use the Nutch->Solr functionality, you can rely on Solr's MoreLikeThis and Solr's SpellCheckComponent (both are described on Solr's wiki) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: dealmaker > To: nutch-dev@lucene.apache.org

Re: Is there the functions of "More Like This" and "Spell Checking"?

2009-03-03 Thread Otis Gospodnetic
Subject: Re: Is there the functions of "More Like This" and "Spell Checking"? > > > I am not using solr. I am using nutch to search for related urls to a url > that user type. Can I still use solr's morelikethis in this case? > > > Otis Gospodnetic-

Re: site: operator with no query term

2009-03-03 Thread Otis Gospodnetic
Absolutely! I see you are at home with JIRA, so I don't have to ask. :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Frank McCown > To: nutch-dev@lucene.apache.org > Sent: Tuesday, March 3, 2009 9:39:24 AM > Subject: site: operator wi

Nutch ML cleanup

2009-03-09 Thread Otis Gospodnetic
Hi, This has been bugging me for a while now. For some reason Nutch MLs get the most "junk" emails - both rude/rudeish emails, as well as clear spam (with "SPAM" in the subject - something must be detecting it). I just looked at the headers of the clearly labeled spam messages and found th

Re: Moving Nutch parsers to Tika

2009-03-10 Thread Otis Gospodnetic
I absolutely agree. Duplicating the work and focusing on non-core when the same functionality can be gotten by using Tika is not wise for Nutch. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Andrzej Bialecki > To: nutch-dev@lucene.apa

Re: Nutch ML cleanup

2009-03-10 Thread Otis Gospodnetic
ny email. Please check the message > headers > to see how this message is routed to you. If it is indeed routed through > Apache > servers then please send the headers to me. > > Doug > > Andrzej Bialecki wrote: > > Otis Gospodnetic wrote: > >> Hi, > >

Re: The Future of Nutch, reactivated

2009-05-23 Thread Otis Gospodnetic
Hello, (I saw the first copy of this email went to nutch-user, but I assume nutch-dev was a resend and the right list to follow-up on) I agree with the list of core competencies. For example, and I don't know where I said/wrote this, but I know I said it a few times before -- I think Solr is

Re: Remove duplicate nutch conf files from .job file

2009-05-28 Thread Otis Gospodnetic
Hi Kirby, Do you think you could add this to Nutch's JIRA? Please see http://wiki.apache.org/nutch/HowToContribute Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Kirby Bohling > To: nutch-dev@lucene.apache.org > Sent: Thursday, May 28,

java.net.URL synchronization

2009-12-09 Thread Otis Gospodnetic
Hello, Has anyone seen this: http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck ? Is this something that needs to be addressed in Nutch (and thus in Bixo, and thus in the common crawler project)? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

Re: [DISCUSS] Nutch as a top level project (TLP)?

2010-03-20 Thread Otis Gospodnetic
Personally, I don't see the advantage of Nutch going for a TLP. It's not like new committers are having a hard time getting in today, it's not like they are being proposed and rejected. I also don't feel like Nutch lacks exposure/visibility -- lots of people know about it. It's just that very

[jira] Created: (NUTCH-347) Build: plugins' Jars not found

2006-08-11 Thread Otis Gospodnetic (JIRA)
Build: plugins' Jars not found -- Key: NUTCH-347 URL: http://issues.apache.org/jira/browse/NUTCH-347 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Otis Gospodnetic

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-11 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12427677 ] Otis Gospodnetic commented on NUTCH-233: I haven't noticed this regexp being a problem so far either, but maybe I've just been lucky not to hav

[jira] Commented: (NUTCH-359) extraction of links will fail for whole page if one single link cannot be parsed

2006-09-07 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-359?page=comments#action_12433315 ] Otis Gospodnetic commented on NUTCH-359: Looks fine and simple (and has a small typo in the last comment). Sami is doing 0.8.1 soon, so I won't mess

[jira] Commented: (NUTCH-377) Add possibility to search for multiple values

2006-10-01 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-377?page=comments#action_12439016 ] Otis Gospodnetic commented on NUTCH-377: You'd need to modify ./src/java/org/apache/nutch/analysis/NutchAnalysis.jj and regenerate the .java files

[jira] Commented: (NUTCH-387) host normalization in Generator$Selector

2006-10-20 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ] Otis Gospodnetic commented on NUTCH-387: This indeed looks wrong. My guess is that the new URL() line just needs to be removed, but I'm not sur

[jira] Commented: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-24 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12444510 ] Otis Gospodnetic commented on NUTCH-389: Enis: Can you give us some examples of how URLs were tokenized before, and how they are tokenized with your patch

[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2006-10-24 Thread Otis Gospodnetic (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12444514 ] Otis Gospodnetic commented on NUTCH-61: --- Has anyone been using the code with this patch applied? Just wondering if/how well it works. > Adaptive re-fe

[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-11 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472078 ] Otis Gospodnetic commented on NUTCH-444: The ASF FeedParser you are talking about has, I believe, continued

[jira] Commented: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-21 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474663 ] Otis Gospodnetic commented on NUTCH-447: The idea being to limit crawling only to links under a certain

[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-11-18 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543427 ] Otis Gospodnetic commented on NUTCH-442: Doğacan - your comments sound good and I'd guess "bean"

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2007-12-02 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547642 ] Otis Gospodnetic commented on NUTCH-585: A more general solution is needed. This solution should not rely on

[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-12-02 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547672 ] Otis Gospodnetic commented on NUTCH-442: Doğacan -- can you please explain what you mean by "blog up

[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2007-12-02 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547741 ] Otis Gospodnetic commented on NUTCH-442: Doğacan - ah, good! The Nutch side of the functionality included in

[jira] Commented: (NUTCH-296) Image Search

2008-03-11 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577669#action_12577669 ] Otis Gospodnetic commented on NUTCH-296: Steve: I was going to say "Gre

[jira] Created: (NUTCH-627) Minimize host address lookup

2008-04-09 Thread Otis Gospodnetic (JIRA)
Minimize host address lookup Key: NUTCH-627 URL: https://issues.apache.org/jira/browse/NUTCH-627 Project: Nutch Issue Type: Improvement Components: generator Reporter: Otis Gospodnetic

[jira] Updated: (NUTCH-627) Minimize host address lookup

2008-04-09 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-627: --- Attachment: NUTCH-627.patch > Minimize host address loo

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2008-04-10 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587786#action_12587786 ] Otis Gospodnetic commented on NUTCH-570: Ned - are you still using this? S

[jira] Created: (NUTCH-628) Host database to keep track of host-level information

2008-04-11 Thread Otis Gospodnetic (JIRA)
: fetcher, generator Reporter: Otis Gospodnetic Nutch would benefit from having a DB with per-host/domain/TLD information. For instance, Nutch could detect hosts that are timing out, store information about that in this DB. Segment/fetchlist Generator could then skip such hosts, so

[jira] Created: (NUTCH-629) Detect slow and timeout servers and drop their URLs

2008-04-12 Thread Otis Gospodnetic (JIRA)
Reporter: Otis Gospodnetic Fetch jobs will finish faster if we find a way to prevent servers that are either slow or time out from slowing down the whole process. I'll attach a patch that counts per-server exceptions and timeouts and tracks download speed per server. Q
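The bookkeeping described in the issue above (per-server exception and timeout counts plus per-server download speed) can be sketched roughly as follows. This is not the NUTCH-629 patch. The class names, the thresholds, and the rule for dropping a host are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HostStats:
    """Running totals for one server."""
    errors: int = 0            # exceptions and timeouts combined
    bytes_fetched: int = 0
    seconds_spent: float = 0.0

    def bytes_per_second(self) -> Optional[float]:
        """Average download speed, or None if nothing was fetched yet."""
        if self.seconds_spent <= 0:
            return None
        return self.bytes_fetched / self.seconds_spent


class HostTracker:
    """Tracks per-host failures and speed; thresholds are illustrative."""

    def __init__(self, max_errors: int = 3, min_bytes_per_second: float = 1024.0):
        self.max_errors = max_errors
        self.min_bytes_per_second = min_bytes_per_second
        self._stats: Dict[str, HostStats] = {}

    def _get(self, host: str) -> HostStats:
        return self._stats.setdefault(host, HostStats())

    def record_success(self, host: str, num_bytes: int, seconds: float) -> None:
        stats = self._get(host)
        stats.bytes_fetched += num_bytes
        stats.seconds_spent += seconds

    def record_failure(self, host: str) -> None:
        self._get(host).errors += 1

    def should_drop(self, host: str) -> bool:
        """True if the host failed too often or is slower than the minimum."""
        stats = self._stats.get(host)
        if stats is None:
            return False  # never seen: no evidence against it
        if stats.errors >= self.max_errors:
            return True
        speed = stats.bytes_per_second()
        return speed is not None and speed < self.min_bytes_per_second
```

A fetcher loop could call `should_drop(host)` before dequeuing the next URL for that host and discard the host's remaining URLs when it returns true.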

[jira] Updated: (NUTCH-629) Detect slow and timeout servers and drop their URLs

2008-04-12 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-629: --- Attachment: NUTCH-629.patch > Detect slow and timeout servers and drop their U

[jira] Commented: (NUTCH-629) Detect slow and timeout servers and drop their URLs

2008-04-14 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588746#action_12588746 ] Otis Gospodnetic commented on NUTCH-629: While the patch improves fetch speed

[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2008-04-14 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588772#action_12588772 ] Otis Gospodnetic commented on NUTCH-442: This issue has a lot of votes and a lo

[jira] Updated: (NUTCH-628) Host database to keep track of host-level information

2008-04-15 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-628: --- Attachment: NUTCH-628-DomainStatistics.patch Enis' DomainStatistics tool from NUTCH-439.

[jira] Updated: (NUTCH-628) Host database to keep track of host-level information

2008-04-16 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-628: --- Attachment: NUTCH-628-HostDb.patch HostDatum.java - really just a holds MapWritable

[jira] Issue Comment Edited: (NUTCH-628) Host database to keep track of host-level information

2008-04-16 Thread Otis Gospodnetic (JIRA)
Project: Nutch > Issue Type: New Feature > Components: fetcher, generator > Reporter: Otis Gospodnetic > Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch > > > Nutch would benefit from having a DB with per-host/domai

[jira] Issue Comment Edited: (NUTCH-628) Host database to keep track of host-level information

2008-04-16 Thread Otis Gospodnetic (JIRA)
Key: NUTCH-628 > URL: https://issues.apache.org/jira/browse/NUTCH-628 > Project: Nutch > Issue Type: New Feature > Components: fetcher, generator >Reporter: Otis Gospodnetic > Attachments: NUTCH-628-DomainStatistics.pat

[jira] Issue Comment Edited: (NUTCH-628) Host database to keep track of host-level information

2008-04-17 Thread Otis Gospodnetic (JIRA)
> Issue Type: New Feature > Components: fetcher, generator >Reporter: Otis Gospodnetic > Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch > > > Nutch would benefit from having a DB with per-host/domain/TLD information. > For insta

[jira] Commented: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS

2008-04-18 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590486#action_12590486 ] Otis Gospodnetic commented on NUTCH-596: This looks beautifully simple to me

[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

2008-05-21 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-570: --- Assignee: Otis Gospodnetic Another nudge for feedback from Ned or anyone else who tried this

[jira] Assigned: (NUTCH-627) Minimize host address lookup

2008-05-21 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned NUTCH-627: -- Assignee: Otis Gospodnetic > Minimize host address loo

[jira] Assigned: (NUTCH-629) Detect slow and timeout servers and drop their URLs

2008-05-21 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned NUTCH-629: -- Assignee: Otis Gospodnetic > Detect slow and timeout servers and drop their U

[jira] Updated: (NUTCH-626) fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects

2008-10-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-626: --- Fix Version/s: 1.0.0 > fetcher2 breaks out the domain with db.ignore.external.links set

[jira] Commented: (NUTCH-655) Injecting Crawl metadata

2008-10-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641047#action_12641047 ] Otis Gospodnetic commented on NUTCH-655: I think we need a generic way for kee

[jira] Commented: (NUTCH-650) Hbase Integration

2008-10-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641050#action_12641050 ] Otis Gospodnetic commented on NUTCH-650: This sounds great, Doğacan! Simplifica

[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-10-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641059#action_12641059 ] Otis Gospodnetic commented on NUTCH-628: After seeing NUTCH-650 I have a fee

[jira] Issue Comment Edited: (NUTCH-655) Injecting Crawl metadata

2008-10-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641047#action_12641047 ] otis edited comment on NUTCH-655 at 10/20/08 9:29 AM: -- I th

[jira] Resolved: (NUTCH-660) Does anybody know how to let nutch crawl this kind of website?

2008-11-11 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-660. Resolution: Invalid I see you already asked on the list. That's the right place t

[jira] Resolved: (NUTCH-659) Help! No urls fetched for internal repository website

2008-11-11 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-659. Resolution: Invalid Please ask questions on the mailing list. > Help! No urls fetched

[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-10 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-669: --- Priority: Major (was: Minor) Fix Version/s: 1.0.0 +1 -- people, vote for it. This

[jira] Updated: (NUTCH-563) Include custom fields in BasicQueryFilter

2008-12-10 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-563: --- Fix Version/s: (was: 0.9.0) 1.0.0 > Include custom fields

[jira] Commented: (NUTCH-675) Reduce tasks do not report their status and are killed by jobtracker

2008-12-22 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658610#action_12658610 ] Otis Gospodnetic commented on NUTCH-675: Sha Feng, could you please bring thi

[jira] Resolved: (NUTCH-675) Reduce tasks do not report their status and are killed by jobtracker

2008-12-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-675. Resolution: Won't Fix According to Dennis Kubes's response on the mailing list

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2008-12-29 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659610#action_12659610 ] Otis Gospodnetic commented on NUTCH-171: But does generate.update.crawldb=

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2008-12-29 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659639#action_12659639 ] Otis Gospodnetic commented on NUTCH-171: Hm, yes, it's nice to b

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2008-12-29 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659644#action_12659644 ] Otis Gospodnetic commented on NUTCH-669: I, too, am very anxious to see how the

[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-02 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660397#action_12660397 ] Otis Gospodnetic commented on NUTCH-669: Todd, and when you say "sustaine

[jira] Resolved: (NUTCH-627) Minimize host address lookup

2009-01-13 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-627. Resolution: Fixed Thanks Otis. Sending CHANGES.txt Sending src/java/org

[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool

2009-01-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665482#action_12665482 ] Otis Gospodnetic commented on NUTCH-679: I'm not sure, but committing this

[jira] Commented: (NUTCH-655) Injecting Crawl metadata

2009-01-22 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666283#action_12666283 ] Otis Gospodnetic commented on NUTCH-655: 1.1 sounds good to me. > Injectin

[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-22 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666290#action_12666290 ] Otis Gospodnetic commented on NUTCH-628: I'm +1 on getting Domain Stats

[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666763#action_12666763 ] Otis Gospodnetic commented on NUTCH-666: Dennis, could you please describe how

[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764 ] Otis Gospodnetic commented on NUTCH-628: Could you take it if you have time, pl

[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668135#action_12668135 ] Otis Gospodnetic commented on NUTCH-628: Thanks for the update. Sorry, I d

[jira] Updated: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

2009-02-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-707: --- Fix Version/s: (was: 0.9.0) 1.1 > Generation of multiple segments

[jira] Resolved: (NUTCH-736) how long it takes nutch 1.0 to fetch

2009-05-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-736. Resolution: Invalid Assignee: Otis Gospodnetic Please ask questions on nutch-user

[jira] Commented: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-05-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712489#action_12712489 ] Otis Gospodnetic commented on NUTCH-731: People have redirects on their robots

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-05-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712492#action_12712492 ] Otis Gospodnetic commented on NUTCH-721: Questions: Has anyone tried profiling

[jira] Commented: (NUTCH-721) Fetcher2 Slow

2009-05-23 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712494#action_12712494 ] Otis Gospodnetic commented on NUTCH-721: Ken's thoughts: h

[jira] Assigned: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2009-05-27 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned NUTCH-693: -- Assignee: Otis Gospodnetic > Add configurable option for treating nofollow behavi

[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

2009-05-27 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713862#action_12713862 ] Otis Gospodnetic commented on NUTCH-693: I think I see some formatting that

[jira] Commented: (NUTCH-650) Hbase Integration

2009-05-27 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713867#action_12713867 ] Otis Gospodnetic commented on NUTCH-650: Doğacan, I think http://github.com/dog

[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714086#action_12714086 ] Otis Gospodnetic commented on NUTCH-739: I think there are a few issues

[jira] Commented: (NUTCH-677) Segment merge filering based on segment content

2009-05-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714091#action_12714091 ] Otis Gospodnetic commented on NUTCH-677: Marcin - could you please include

[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2009-05-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-740: --- Priority: Minor (was: Major) Affects Version/s: (was: 0.9.0) Fix

[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-28 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714286#action_12714286 ] Otis Gospodnetic commented on NUTCH-739: Yes, external optimize calls will wor

[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop

2009-05-29 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714536#action_12714536 ] Otis Gospodnetic commented on NUTCH-739: Yeah, sounds right. That Tool should

[jira] Resolved: (NUTCH-101) RobotRulesParser

2009-06-19 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-101. Resolution: Fixed Thank you Ken. > RobotRulesPar

[jira] Updated: (NUTCH-731) Redirection of robots.txt in RobotRulesParser

2009-06-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-731: --- Fix Version/s: 1.1 Assignee: Otis Gospodnetic > Redirection of robots.txt

[jira] Resolved: (NUTCH-742) Checksum Error

2009-06-20 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved NUTCH-742. Resolution: Incomplete Could you please post more detailed information to nutch-user

[jira] Updated: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.

2009-08-04 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-746: --- Patch Info: [Patch Available] Fix Version/s: 1.1 > NutchBeanConstructor does not cl

[jira] Updated: (NUTCH-738) Close SegmentUpdater when FetchedSegments is closed

2009-08-04 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-738: --- Affects Version/s: (was: 1.1) 1.0.0 Fix Version/s: 1.1

[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-16 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated NUTCH-740: --- Assignee: (was: Otis Gospodnetic) > Configuration option to override default language

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851461#action_12851461 ] Otis Gospodnetic commented on NUTCH-570: Serykh, what does your version of
