Re: SegmentMerger "no input paths" problem and "special files/directories"

2008-06-11 Thread ogjunk-nutch
I have not looked into this deeply, but this change would make me nervous, too. The main reason for that is that I have never seen this error, and the error makes me think that something is simply giving the SegmentMerger wrong/bad input. Otis -- Sematext -- http://sematext.com/ -- Lucene - So

Re: nutch-0.9 and hadoop-0.15.0

2008-06-09 Thread ogjunk-nutch
Use nutch-user mailing list, please. I'll reply there. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: m.harig <[EMAIL PROTECTED]> > To: nutch-dev@lucene.apache.org > Sent: Monday, June 9, 2008 12:17:07 PM > Subject: nutch-0.9 and hadoop-0.

Re: svn nutch with hadoop .17

2008-06-06 Thread ogjunk-nutch
Michael & Lincoln, It's great to see you two working on this. In general, this is best done via JIRA, really. The "process" is roughly as follows: * Open a new JIRA issue, describing what it's about, what you are trying to solve * Prepare a patch locally, off of nutch trunk/head, naming it aft

Re: nutch file content limit

2008-06-06 Thread ogjunk-nutch
rg > Sent: Friday, June 6, 2008 3:56:30 AM > Subject: Re: nutch file content limit > > > Is there any way to index partial content of doc/xls/rtf? If it's not > possible, let me know. > > > ogjunk-nutch wrote: > > > > I *think* you have to fetch the *full* co

Re: nutch file content limit

2008-06-05 Thread ogjunk-nutch
I *think* you have to fetch the *full* content of MS Word docs (and PDFs and RTFs and ...) if you want parsers that handle those documents to be able to parse them. A partial MS Word/PDF/RTF/... document is considered invalid/broken. Try opening it with MS Word, for example -- it will not work

Re: nutch file content limit

2008-06-04 Thread ogjunk-nutch
How's this: $ grep -n content nutch/conf/nutch-default.xml 28: file.content.limit 30: The length limit for downloaded content, in bytes. 31: If this value is nonnegative (>=0), content longer than it will be truncated; 37: file.content.ignored 39: If true, no file content will be saved duri
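A minimal sketch of the truncation rule quoted from nutch-default.xml above. It assumes a negative limit means "no truncation", since the quoted description is cut off before it says so. The class and method names are hypothetical and are not Nutch APIs.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating the file.content.limit rule; not part of Nutch. */
class ContentLimit {
    /**
     * Returns the content cut to at most `limit` bytes when `limit` is
     * non-negative. A negative limit is treated as "no limit" (assumption).
     */
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit >= 0 && content.length > limit) {
            return Arrays.copyOf(content, limit); // keep only the first `limit` bytes
        }
        return content; // short enough already, or limit disabled
    }

    public static void main(String[] args) {
        byte[] doc = new byte[100];
        System.out.println(applyLimit(doc, 10).length);
        System.out.println(applyLimit(doc, -1).length);
    }
}
```

Note that a document cut this way is exactly the "partial" file that binary-format parsers tend to reject, which is the concern raised in the later replies in this thread.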

Re: Adding Otis to JIRA

2008-05-21 Thread ogjunk-nutch
Works, thanks Andrzej! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Andrzej Bialecki <[EMAIL PROTECTED]> > To: nutch-dev@lucene.apache.org > Sent: Wednesday, May 21, 2008 5:43:11 PM > Subject: Re: Adding Otis to JIRA > > Otis Gospodnetic

Re: Bug in NutchAnalysis.java

2008-05-15 Thread ogjunk-nutch
Hi, But shouldn't this be the expected behaviour? In each of the examples the query really is bad/invalid, uses incorrect syntax. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: ivrokv <[EMAIL PROTECTED]> > To: nutch-dev@lucene.apache.org

Re: Writing a plugin

2008-05-11 Thread ogjunk-nutch
Hi, Yes, you have to add your plugin to nutch-site.xml, along with other plugins you probably already have defined there. If you don't have them in nutch-site.xml, look at nutch-default.xml. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From:

Re: Problem compiling plugins

2008-05-09 Thread ogjunk-nutch
Hi, You are missing some ant jars. I'm not sure which ones, but it looks like the class that cannot be found is TraXLiaison, so once you google it you'll find which optional ant jar it is in. Get that jar, put it in your ant home's lib dir and try again. Otis -- Sematext -- http://sematext.c

Re: Welcome Otis Gospodnetic as Nutch committer

2008-05-08 Thread ogjunk-nutch
Thank you! I'll do my best to help out. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Andrzej Bialecki <[EMAIL PROTECTED]> > To: nutch-dev@lucene.apache.org > Sent: Thursday, May 8, 2008 1:55:36 PM > Subject: Welcome Otis Gospodnetic as Nu

Re: Internet crawl: CrawlDb getting big!

2008-05-07 Thread ogjunk-nutch
You don't have to update CrawlDb after every fetch cycle, so keeping the generated CrawlDatums from one generate run might be useful. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: wuqi <[EMAIL PROTECTED]> > To: nutch-dev@lucene.apache.org

Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-21 Thread ogjunk-nutch
Thanks Andrzej. So the disconnect was only about measuring (download speed, in my mind) per-URL vs. per-host. In that case, I think we are talking about a small change (to Fetcher2) that might look like this: + // time the request + long fetchStart = System.currentTimeMillis(
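The patch fragment above is cut off, so the following is only a sketch of the idea it describes: time each request and aggregate the result per host rather than per URL. HostFetchTimes and its methods are hypothetical names, not Fetcher2 code, and the sleep stands in for the real protocol request.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host accumulator for fetch times; not Nutch code. */
class HostFetchTimes {
    // host -> { total elapsed millis, number of fetches }
    private final Map<String, long[]> stats = new HashMap<>();

    /** Adds one timed fetch to the running totals for the host. */
    void record(String host, long elapsedMillis) {
        long[] s = stats.computeIfAbsent(host, h -> new long[2]);
        s[0] += elapsedMillis;
        s[1]++;
    }

    /** Average fetch time for the host, or 0.0 if nothing was recorded. */
    double averageMillis(String host) {
        long[] s = stats.get(host);
        return (s == null || s[1] == 0) ? 0.0 : (double) s[0] / s[1];
    }

    public static void main(String[] args) throws InterruptedException {
        HostFetchTimes times = new HostFetchTimes();
        long fetchStart = System.currentTimeMillis(); // time the request
        Thread.sleep(5);                              // stand-in for the actual fetch
        times.record("example.com", System.currentTimeMillis() - fetchStart);
        System.out.println(times.averageMillis("example.com"));
    }
}
```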

Re: Fetching inefficiency

2008-04-21 Thread ogjunk-nutch
Adding some comments to the email below, but here on nutch-dev. Basically, it is my feeling that whenever fetchlists (and their parts) are not "well balanced", this inefficiency will be seen. Concretely, whichever task is "stuck fetching from the slow server with a lot of its pages in the fetchlis

Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-19 Thread ogjunk-nutch
Hi, (Andrzej - sorry about line length - I don't see an option for that in Y! Mail now/any more, BCCing my non-Y account to see what's going on) - Original Message > From: Andrzej Bialecki <[EMAIL PROTECTED]> > To: nutch-dev@lucene.apache.org > Sent: Saturday, April 19, 2008 6:07:17 PM >

Fw: [jira] Closed: (INFRA-1583) Wiki => email not working for Nutch wiki

2008-04-19 Thread ogjunk-nutch
We've got mail. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Forwarded Message From: Martin Cooper (JIRA) <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Saturday, April 19, 2008 2:44:21 PM Subject: [jira] Closed: (INFRA-1583) Wiki => email not working for Nutch

Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-18 Thread ogjunk-nutch
You are both in agreement, but I don't fully follow as I'm not intimately familiar with all the files and structures yet. - Fetcher-s putting info about hosts into crawl_fetch for each fetched segment makes sense. I see Fetcher(2) uses FetcherOutputFormat, which has its own RecordWriter, which

Re: Wiki -> email -> nutch-dev?

2008-04-14 Thread ogjunk-nutch
OK: https://issues.apache.org/jira/browse/INFRA-1583 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Dennis Kubes <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Monday, April 14, 2008 1:04:25 AM Subject: Re: Wiki -> email -> nutch-dev

Re: Wiki -> email -> nutch-dev?

2008-04-13 Thread ogjunk-nutch
Is this a thing for infrastructure@ or infrastructure JIRA? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Dennis Kubes <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Sunday, April 13, 2008 4:52:51 PM Subject: Re: Wiki -> email -> nu

Wiki -> email -> nutch-dev?

2008-04-12 Thread ogjunk-nutch
Hi, It looks like Nutch's Wiki is not configured to send email to nutch-dev when its pages are changed. Is this on purpose? Not that I need more email in my life, but it does help (me) passively keep track of new knowledge posted on the Wiki. I see there are recent changes listed on http://

Re: Keywords in documents

2008-04-11 Thread ogjunk-nutch
Hi Amit, There is no semantic summarizer (What exactly would it do? Can you provide an example?). There is a more or less "standard" snippet/highlighter - a lot like what you see on Google's search results, for example. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - O

Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

2008-04-10 Thread ogjunk-nutch
Hi Andrzej, Sure, that sounds good - thanks! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Andrzej Bialecki <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Thursday, April 10, 2008 4:45:08 AM Subject: Re: [jira] Updated: (NUTCH-627)

Re: what is the difference between nutch and some other opensource search engines

2008-04-09 Thread ogjunk-nutch
Broad question, broad answer: free, scalable, extensible, open-source are a few characteristics that come to mind. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: minskv <[EMAIL PROTECTED]> To: nutch-dev Sent: Wednesday, April 9, 2008 2:44:51

Re: Is there any LSI implementation?

2008-04-06 Thread ogjunk-nutch
Hi Ed, The answer is no, though I'm not sure if you really meant to ask on the Nutch mailing list. Lucene mailing list ([EMAIL PROTECTED]) would be a better place to ask, if you haven't already. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message Fro

Re: Why is Nutch not involved in Google Summer of Code - 2008?

2008-03-30 Thread ogjunk-nutch
Hi Dennis, Not too late, I think, just add Nutch + Solr idea to http://wiki.apache.org/general/SummerOfCode2008 on Monday. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Dennis Kubes <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: S

Re: [jira] Created: (NUTCH-624) Better parsed text

2008-03-30 Thread ogjunk-nutch
Vinci, Please use the mailing list to ask questions and discuss first, not JIRA. Also, please include an example of what you are describing, if you can. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Vinci (JIRA) <[EMAIL PROTECTED]>

Re: Why is Nutch not involved in Google Summer of Code - 2008?

2008-03-29 Thread ogjunk-nutch
Hi Susam, Good question, and I'm afraid we may be a little late: http://wiki.apache.org/general/SummerOfCodeMentor I think the main problem is that nobody has time to be the mentor. As for ideas, I think Solr integration would be very nice to have. Solr, with its recent support for distrib

Re: Confine nutch to one NIC?

2008-03-11 Thread ogjunk-nutch
I don't think there is anything you can do about this on the Nutch end. I do know that Java now has the ability to differentiate between different NICs, but Nutch doesn't have support for that. There may be something you can do on the OS level, though I don't have any concrete advice there, un

Re: Labeling URLs a-la Google

2007-09-07 Thread ogjunk-nutch
Hi Jeff, Nice. Could you submit this to JIRA as a patch? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Jeff Maki <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Thursday,

Re: [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread ogjunk-nutch
+1 Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Chris Mattmann <[EMAIL PROTECTED]> To: "nutch-dev@lucene.apache.org" Sent: Tuesday, March 27, 2007 1:43:17 AM Subject: [VOTE] Releas

Re: lib-http crawl-delay problem

2007-02-15 Thread ogjunk-nutch
Hi, I think the robots.txt example you used was invalid (no path for that last Disallow rule). Small patch indeed, but sticking it in JIRA would still make sense because: - it leaves a good record of the bug + fix - it could be used for release notes/changelog Not trying to be picky, just pointi

Re: Reviving Nutch 0.7

2007-01-22 Thread ogjunk-nutch
All good arguments, and as nobody else voiced the desire to have this other branch of Nutch I was rambling about, I'll consider this thread done. Thanks for the explanations, Doug. Otis - Original Message From: Doug Cutting <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Monda

Re: Reviving Nutch 0.7

2007-01-22 Thread ogjunk-nutch
Yes, certainly, anything that can be shared and decoupled from pieces that make each branch (not SVN/CVS branch) different, should be decoupled. But I was really curious about whether people think this is a valid idea/direction, not necessarily immediately how things should be implemented. In

Re: implement thai lanaguage analyzer in nutch

2006-11-08 Thread ogjunk-nutch
Regarding Thai, there is a Thai Analyzer in Lucene already: $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/ total 24 drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/ -rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java -rw-rw-r-- 1 otis otis 2437 Jun 5 14:27 ThaiWordFilter.j

Re: Any plans to move to build Nutchusing Maven?

2006-09-13 Thread ogjunk-nutch
Old issue. I don't think there were any conclusions. Not sure if Maven2 would be a good thing, because I haven't used Maven in 2+ years, and I understand it's changed a lot since then. The best way to move anything forward is to contribute the solution/fix/patch and then persuade others to gi

Re: Patch Available status?

2006-09-13 Thread ogjunk-nutch
Sorry if I missed followups to this (catching up on emails after vacation). This sounds like a good idea (because JIRA is often full of bug reports, enhancement requests, and only some issues have patches, and those can get stale, so reviewing and applying them quickly is important). I took a qu

Re: Ontology compile bug

2006-09-07 Thread ogjunk-nutch
I might be just tired, but I don't see the difference between those two lines. Otis - Original Message From: Michael Wechner <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Tuesday, August 22, 2006 9:07:12 AM Subject: Ontology compile bug Hi It seems to me that refine-query-i

Re: HTTP/1.1 problem

2006-09-07 Thread ogjunk-nutch
That looks right - committed, thanks. Otis - Original Message From: Doğacan Güney <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Thursday, August 24, 2006 4:15:53 AM Subject: HTTP/1.1 problem Hello everyone, There is a small bug in lib-http plugin code that prevents protoco

Re: Error in 0.8 regex-urlfilter.txt

2006-08-10 Thread ogjunk-nutch
Thanks, committed. Otis - Original Message From: Matthew Holt <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org Sent: Wednesday, August 9, 2006 9:51:19 AM Subject: Error in 0.8 regex-urlfilter.txt I was doing a search and noticed that a 'png' file was inde

Re: Patch: deflate encoding

2006-08-07 Thread ogjunk-nutch
Pascal, Forgot to say - attachments get stripped. Please put them in JIRA. Thanks, Otis - Original Message From: Pascal Beis <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Monday, August 7, 2006 4:17:33 AM Subject: Patch: deflate encoding Hi all, I've added support for d

Re: Patch: deflate encoding

2006-08-07 Thread ogjunk-nutch
Yes, yes! Otis - Original Message From: Pascal Beis To: nutch-dev@lucene.apache.org Sent: Monday, August 7, 2006 4:17:33 AM Subject: Patch: deflate encoding Hi all, I've added support for deflate encoding (next to gzip) to nutch. Is there interest to include this into th

Re: Basic character-cleanups easily possible?

2006-07-12 Thread ogjunk-nutch
Have a look at Lucene's contrib/: $ ff \*ISO\*java ./src/test/org/apache/lucene/analysis/TestISOLatin1AccentFilter.java ./src/java/org/apache/lucene/analysis/ISOLatin1AccentFilter.java Otis - Original Message From: Stefan Neufeind <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent

Re: org.farng and com.etranslate

2006-06-27 Thread ogjunk-nutch
Thanks, that was it. Otis - Original Message From: Sami Siren <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Tuesday, June 27, 2006 2:45:13 PM Subject: Re: org.farng and com.etranslate [EMAIL PROTECTED] wrote: >Hi, > >I have 2 unresolved imports in my IDE's Nutch project: >

org.farng and com.etranslate

2006-06-27 Thread ogjunk-nutch
Hi, I have 2 unresolved imports in my IDE's Nutch project: org.farng.* (used in parse-mp3) com.etranslate.* (used in parse-rtf) Where are these classes from? I searched in the trunk, and couldn't find their jars. I searched online (Krugle, too!), and couldn't find where these th

Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread ogjunk-nutch
I was going to suggest the same approach. Seems simple enough and would force the person to edit the config. What is entered in place of EDITME is another story, but maybe some code can enforce some rules on that, too. Otis - Original Message From: Teruhiko Kurosaka <[EMAIL PROTECTED

Re: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

2006-06-04 Thread ogjunk-nutch
Hi, What exactly does this plugin do? I haven't seen it mentioned and the README.txt doesn't really describe it. Thanks, Otis - Original Message From: [EMAIL PROTECTED] To: nutch-commits@lucene.apache.org Sent: Sunday, June 4, 2006 3:44:23 PM Subject: [Nutch-cvs] svn commit: r411594 -

Re: [Nutch-cvs] svn commit: r409869 - in /lucene/nutch/trunk/contrib/web2/plugins/caching-oscache/src/java/org: ./ apache/ apache/nutch/ apache/nutch/webapp/ apache/nutch/webapp/controller/

2006-05-28 Thread ogjunk-nutch
Spotted a reference to "NutchReferehPolicy();" : EntryRefreshPolicy policy=new NutchReferehPolicy(); Typo? Otis - Original Message From: [EMAIL PROTECTED] To: nutch-commits@lucene.apache.org Sent: Saturday, May 27, 2006 4:36:56 PM Subject: [Nutch-cvs] svn commit: r409869 - in /lucene

Re: [Proposal] New Lucene sub-project

2006-04-24 Thread ogjunk-nutch
This thread seems to have gotten very little attention. Jérôme - I'm all for extracting sub-libraries that can really live on their own and are substantial enough to warrant "their own identity". Personally, I'm most interested in the Language Identifier plugin becoming a standalone, Nutch-indepen

Re: Nutch 0.7.2

2006-03-09 Thread ogjunk-nutch
I'm still on 0.7*, and would welcome a new release. Otis - Original Message From: Piotr Kosiorowski <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Thursday, March 9, 2006 3:31:09 PM Subject: Nutch 0.7.2 Hello, I would like to release nutch 0.7.2 in a week or two. Some serious

Re: ignore eclipse .project and .classpath

2006-02-09 Thread ogjunk-nutch
Done. - Original Message From: Stefan Groschupf <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Wed 08 Feb 2006 03:15:15 PM EST Subject: Re: ignore eclipse .project and .classpath +1 On 08.02.2006 at 06:16, Chris Mattmann wrote: > Hi Folks, > > > > Just wondering if someon

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread ogjunk-nutch
I would like to decouple Lang Id from Nutch and move it into Lucene contrib/ in the near future. Does that sound ok? Otis - Original Message From: Stefan Groschupf <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Mon 23 Jan 2006 02:55:46 PM EST Subject: Re: lang identifier and

Re: tool to mount nutch filesystem

2006-01-20 Thread ogjunk-nutch
Hi John, NDFS + MapReduce will soon become a separate Lucene sub-project. Otis - Original Message From: John X <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Cc: [EMAIL PROTECTED] Sent: Fri 20 Jan 2006 07:55:17 PM EST Subject: tool to mount nutch filesystem I have created a simple

Re: wiki:commandline options classpaths

2006-01-09 Thread ogjunk-nutch
Yes, everything is in org.apache now, I believe. Thanks for helping out. Otis - Original Message From: Jerry Russell <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Mon 09 Jan 2006 02:20:02 PM EST Subject: wiki:commandline options classpaths I noticed that the command line

Re: Normalizing URLs with anchors

2006-01-05 Thread ogjunk-nutch
I think it's safe to strip anchors, as they simply point to a different portion of the same page for browser rendering. I do that for Simpy while normalizing URLs, in order not to have duplicates like this. Otis - Original Message From: Ken Krugler <[EMAIL PROTECTED]> To: nutch-dev@lu
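A small sketch of the anchor stripping described above, assuming plain string handling is enough: in a URL, the first '#' starts the fragment, so everything from there on is dropped. AnchorStripper is a hypothetical name and this is not the Simpy or Nutch normalizer code.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical URL normalizer step that removes the #anchor part. */
class AnchorStripper {
    /** Returns the URL without its fragment; URLs without '#' are unchanged. */
    static String stripAnchor(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Normalizes every URL and removes the duplicates that result. */
    static Set<String> dedupe(List<String> urls) {
        Set<String> unique = new LinkedHashSet<>(); // keeps first-seen order
        for (String url : urls) {
            unique.add(stripAnchor(url));
        }
        return unique;
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of(
                "http://example.com/page.html#top",
                "http://example.com/page.html#section2",
                "http://example.com/page.html")));
    }
}
```

All three example URLs collapse to one entry, which is the duplicate case the message is concerned with.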

Re: [VOTE] Commiter access for Stefan Groschupf

2005-12-21 Thread ogjunk-nutch
I'm late, but better late than never: +1 (I thought Stefan was already a committer, actually). Stefan: will you be putting some of those media-style Nutch tutorials in Nutch's own Wiki? Otis - Original Message From: Andrzej Bialecki <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org

RE: DNS

2005-10-15 Thread ogjunk-nutch
Fuad, Please stick this in JIRA - it will get lost in a pile of incoming Nutch email. Otis --- Fuad Efendi <[EMAIL PROTECTED]> wrote: > Thanks Paul, > > > It would be nice to have this piece of code in > org.apache.nutch.fetcher.Fetcher > > > private static final int DNS_CACHE_TTL_MINUTES =
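The quoted code is cut off right after DNS_CACHE_TTL_MINUTES, so its value and the rest of the change are unknown. As a sketch only, one standard way to set a JVM-wide DNS cache lifetime is the `networkaddress.cache.ttl` security property, which takes seconds. DnsCacheConfig and the 5-minute value below are hypothetical, and this may differ from what the original patch did.

```java
import java.security.Security;

/** Hypothetical helper for setting the JVM's positive DNS cache TTL. */
class DnsCacheConfig {
    /**
     * Sets networkaddress.cache.ttl from a value in minutes. This should run
     * early, before the first host lookup, so the setting is picked up.
     */
    static void setDnsCacheTtlMinutes(int minutes) {
        int seconds = minutes * 60; // the property is expressed in seconds
        Security.setProperty("networkaddress.cache.ttl", String.valueOf(seconds));
    }

    public static void main(String[] args) {
        setDnsCacheTtlMinutes(5); // illustrative value, not taken from the thread
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```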

Re: [jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-12 Thread ogjunk-nutch
Hi, I find it a bit hard to follow your various ideas here, but I'll add my comments to some parts below. --- "Fuad Efendi (JIRA)" <[EMAIL PROTECTED]> wrote: > [ > http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331892 > ] > > Fuad Efendi commented on NUTCH-109: > ---

RE: Re[2]: what contibute to fetch slowing down

2005-10-11 Thread ogjunk-nutch
Fuad, I think you are constantly comparing apples and oranges here. It looks like your new code simply hammers the server sending multiple requests to a single server in parallel. That's a big no-no in a web crawling/spidering/fetching world, as bad as not obeying robots.txt. The fact that the

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread ogjunk-nutch
Hi Gal, I'm curious about the memory consumption of the cache and the speed of retrieval of an item from the cache, when the cache has 100k domains in it. Thanks, Otis --- Gal Nitzan <[EMAIL PROTECTED]> wrote: > Hi Michael, > > At the moment I have about 3000 domains in my db. I didn't time t

Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

2005-09-15 Thread ogjunk-nutch
Sounds good to me. Otis --- Chris Mattmann <[EMAIL PROTECTED]> wrote: > Hi Otis, > > Point taken. In actuality since both convey the same information I > think > that it's okay to support both, but by default say we could code the > initial > plugins specified in parse-plugins.xml without the

Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

2005-09-15 Thread ogjunk-nutch
Well, you have to tell users about order="N" somewhere in the docs. Instead of telling them about order="N", tell them that the order in XML matters. Either case requires education, and the latter one requires less typing and avoids the case described in the proposal. Otis --- Sébastien LE CALL

Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

2005-09-15 Thread ogjunk-nutch
Quick comment about order="N" and the paragraph that describes how to deal with cases where people mess things up and enter multiple plugins for the same content type and the same order: - Why is the order attribute even needed? It looks like a redundant piece of information - why not derive orde

Re: merge mapred to trunk

2005-08-31 Thread ogjunk-nutch
--- Doug Cutting <[EMAIL PROTECTED]> wrote: > [EMAIL PROTECTED] wrote: > > I, too, am looking forward to this, but I am wondering what that > will > > do to Kelvin Tan's recent contribution, especially since I saw that > > both MapReduce and Kelvin's code change how FetchListEntry works. > If > >

Re: merge mapred to trunk

2005-08-31 Thread ogjunk-nutch
> Currently we have three versions of nutch: trunk, 0.7 and mapred. > This > increases the chances for conflicts. I would thus like to merge the > mapred branch into trunk soon. The soonest I could actually start > this is next week. Are there any objections? I, too, am looking forward to th

Re: [Nutch Wiki] Update of "Committer's Rules" by AndrzejBialecki

2005-08-31 Thread ogjunk-nutch
> Glancing at other Apache projects in subversion, I see that httpd > uses > branch names like "2.2.x" and tag names like "2.2.4". That's a > little > cryptic. I propose that we use branch names like "branch-2.4" and > tag > names like "release-2.4.1". What do folks think? I agree. That is

Re: [Nutch-cvs] svn commit: r240359 - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/indexer/ plugin/nutch-extensionpoints/

2005-08-27 Thread ogjunk-nutch
I see several instances of 'analySer' in comments/javadoc and some variables. That should probably be changed to the American spelling - analyzer, for consistency's sake. Otis --- [EMAIL PROTECTED] wrote: > Author: jerome > Date: Fri Aug 26 15:47:04 2005 > New Revision: 240359 > > URL: http://svn.

Re: 0.7 branch

2005-08-23 Thread ogjunk-nutch
15.9. is almost a month away. I think it would be good to take a look at Kelvin's modifications and include them either in 0.7.1, or maybe 0.8 (without map-reduce, which would then be in 0.9). Kelvin, you should put your code in Jira. Otis --- Piotr Kosiorowski <[EMAIL PROTECTED]> wrote: > Hell

Re: Fetcher for constrained crawls

2005-08-23 Thread ogjunk-nutch
From what I heard from Kelvin, the Spring part could be thrown out and replaced with classes with main(). I think there is a need for having the Fetcher component more separated from the rest of Nutch. The Fetcher alone is well done and quite powerful on its own - it has host-based queues, doesn