Re: nutch file content limit

2008-06-06 Thread ogjunk-nutch
, 2008 3:56:30 AM Subject: Re: nutch file content limit Is there any way to index partial content of doc/xls/rtf? If it's not possible, let me know. ogjunk-nutch wrote: I *think* you have to fetch the *full* content of MS Word docs (and PDFs and RTFs and ...) if you want parsers

Re: nutch file content limit

2008-06-05 Thread ogjunk-nutch
I *think* you have to fetch the *full* content of MS Word docs (and PDFs and RTFs and ...) if you want parsers that handle those documents to be able to parse them. A partial MS Word/PDF/RTF/... document is considered invalid/broken. Try opening it with MS Word, for example -- it will not

Re: nutch file content limit

2008-06-04 Thread ogjunk-nutch
How's this:
$ grep -n content nutch/conf/nutch-default.xml
28: <name>file.content.limit</name>
30: <description>The length limit for downloaded content, in bytes.
31: If this value is nonnegative (>=0), content longer than it will be truncated;
37: <name>file.content.ignored</name>
39: <description>If
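The truncation rule quoted in that description can be sketched in a few lines. This is only an illustration, not Nutch's actual code: the function name apply_content_limit is hypothetical, and the behaviour for a negative limit (no truncation) is an assumption, since the snippet is cut off before the description reaches that case.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Hypothetical helper mirroring the file.content.limit description.

    A nonnegative limit truncates content longer than `limit` bytes.
    A negative limit is ASSUMED to mean "no truncation"; the quoted
    description is cut off before it covers that case.
    """
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


if __name__ == "__main__":
    print(apply_content_limit(b"abcdef", 4))
    print(apply_content_limit(b"abcdef", -1))
```

As the replies above point out, a Word, PDF or RTF document truncated this way will usually no longer parse, so for those formats the limit has to be large enough to fetch the whole file.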

Re: Adding Otis to JIRA

2008-05-21 Thread ogjunk-nutch
Works, thanks Andrzej! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Andrzej Bialecki [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Wednesday, May 21, 2008 5:43:11 PM Subject: Re: Adding Otis to JIRA Otis Gospodnetic wrote:

Re: Welcome Otis Gospodnetic as Nutch committer

2008-05-08 Thread ogjunk-nutch
Thank you! I'll do my best to help out. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Andrzej Bialecki [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, May 8, 2008 1:55:36 PM Subject: Welcome Otis Gospodnetic as Nutch

Re: Fetching inefficiency

2008-04-21 Thread ogjunk-nutch
Adding some comments to the email below, but here on nutch-dev. Basically, it is my feeling that whenever fetchlists (and its parts) are not well balanced, this inefficiency will be seen. Concretely, whichever task is stuck fetching from the slow server with a lot of its pages in the fetchlist,

Fw: [jira] Closed: (INFRA-1583) Wiki = email not working for Nutch wiki

2008-04-19 Thread ogjunk-nutch
We've got mail. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Forwarded Message From: Martin Cooper (JIRA) [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, April 19, 2008 2:44:21 PM Subject: [jira] Closed: (INFRA-1583) Wiki = email not working for Nutch

Re: [jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2008-04-18 Thread ogjunk-nutch
You are both in agreement, but I don't fully follow as I'm not intimately familiar with all the files and structures yet. - Fetcher-s putting info about hosts into crawl_fetch for each fetched segment makes sense. I see Fetcher(2) uses FetcherOutputFormat, which has its own RecordWriter,

Re: Wiki - email - nutch-dev?

2008-04-14 Thread ogjunk-nutch
OK: https://issues.apache.org/jira/browse/INFRA-1583 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Dennis Kubes [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Monday, April 14, 2008 1:04:25 AM Subject: Re: Wiki - email - nutch-dev?

Wiki - email - nutch-dev?

2008-04-12 Thread ogjunk-nutch
Hi, It looks like Nutch's Wiki is not configured to send email to nutch-dev when its pages are changed. Is this on purpose? Not that I need more email in my life, but it does help (me) passively keep track of new knowledge posted on the Wiki. I see there are recent changes listed on

Re: Keywords in documents

2008-04-11 Thread ogjunk-nutch
Hi Amit, There is no semantic summarizer (What exactly would it do? Can you provide an example?). There is a more or less standard snippet/highlighter - a lot like what you see on Google's search results, for example. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch -

Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

2008-04-10 Thread ogjunk-nutch
Hi Andrzej, Sure, that sounds good - thanks! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Andrzej Bialecki [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, April 10, 2008 4:45:08 AM Subject: Re: [jira] Updated: (NUTCH-627)

Re: what is the difference between nutch and some other opensource search engines

2008-04-09 Thread ogjunk-nutch
Broad question, broad answer: free, scalable, extensible, open-source are a few characteristics that come to mind. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: minskv [EMAIL PROTECTED] To: nutch-dev nutch-dev@lucene.apache.org Sent:

Re: [jira] Created: (NUTCH-624) Better parsed text

2008-03-30 Thread ogjunk-nutch
Vinci, Please use the mailing list to ask questions and discuss first, not JIRA. Also, please include an example of what you are describing, if you can. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Vinci (JIRA) [EMAIL PROTECTED]

Re: Why is Nutch not involved in Google Summer of Code - 2008?

2008-03-30 Thread ogjunk-nutch
Hi Dennis, Not too late, I think, just add Nutch + Solr idea to http://wiki.apache.org/general/SummerOfCode2008 on Monday. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Dennis Kubes [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent:

Re: Why is Nutch not involved in Google Summer of Code - 2008?

2008-03-29 Thread ogjunk-nutch
Hi Susam, Good question, and I'm afraid we may be a little late: http://wiki.apache.org/general/SummerOfCodeMentor I think the main problem is that nobody has time to be the mentor. As for ideas, I think Solr integration would be very nice to have. Solr, with its recent support for

Re: Confine nutch to one NIC?

2008-03-11 Thread ogjunk-nutch
I don't think there is anything you can do about this on the Nutch end. I do know that Java now has the ability to differentiate between different NICs, but Nutch doesn't have support for that. There may be something you can do on the OS level, though I don't have any concrete advice there,

Re: Labeling URLs a-la Google

2007-09-07 Thread ogjunk-nutch
Hi Jeff, Nice. Could you submit this to JIRA as a patch? Otis -- Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Jeff Maki [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday,

Re: implement thai lanaguage analyzer in nutch

2006-11-08 Thread ogjunk-nutch
Regarding Thai, there is a Thai Analyzer in Lucene already:
$ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
total 24
drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
-rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
-rw-rw-r-- 1 otis otis 2437 Jun 5 14:27

Re: Patch Available status?

2006-09-13 Thread ogjunk-nutch
Sorry if I missed followups to this (catching up on emails after vacation). This sounds like a good idea (because JIRA is often full of bug reports, enhancement requests, and only some issues have patches, and those can get stale, so reviewing and applying them quickly is important). I took a

Re: HTTP/1.1 problem

2006-09-07 Thread ogjunk-nutch
That looks right - committed, thanks. Otis - Original Message From: Doğacan Güney [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, August 24, 2006 4:15:53 AM Subject: HTTP/1.1 problem Hello everyone, There is a small bug in lib-http plugin code that prevents

Re: Ontology compile bug

2006-09-07 Thread ogjunk-nutch
I might be just tired, but I don't see the difference between those two lines. Otis - Original Message From: Michael Wechner [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Tuesday, August 22, 2006 9:07:12 AM Subject: Ontology compile bug Hi It seems to me that

Re: Error in 0.8 regex-urlfilter.txt

2006-08-10 Thread ogjunk-nutch
Thanks, committed. Otis - Original Message From: Matthew Holt [EMAIL PROTECTED] To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org Sent: Wednesday, August 9, 2006 9:51:19 AM Subject: Error in 0.8 regex-urlfilter.txt I was doing a search and noticed that a 'png' file was

Re: Patch: deflate encoding

2006-08-07 Thread ogjunk-nutch
Yes, yes! Otis - Original Message From: Pascal Beis To: nutch-dev@lucene.apache.org Sent: Monday, August 7, 2006 4:17:33 AM Subject: Patch: deflate encoding Hi all, I've added support for deflate encoding (next to gzip) to nutch. Is there interest to include this into

Re: Patch: deflate encoding

2006-08-07 Thread ogjunk-nutch
Pascal, Forgot to say - attachments get stripped. Please put them in JIRA. Thanks, Otis - Original Message From: Pascal Beis [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Monday, August 7, 2006 4:17:33 AM Subject: Patch: deflate encoding Hi all, I've added support for

org.farng and com.etranslate

2006-06-27 Thread ogjunk-nutch
Hi, I have 2 unresolved imports in my IDE's Nutch project: org.farng.* (used in parse-mp3) and com.etranslate.* (used in parse-rtf). Where are these classes from? I searched in the trunk, and couldn't find their jars. I searched online (Krugle, too!), and couldn't find where these

Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread ogjunk-nutch
I was going to suggest the same approach. Seems simple enough and would force the person to edit the config. What is entered in place of EDITME is another story, but maybe some code can enforce some rules on that, too. Otis - Original Message From: Teruhiko Kurosaka [EMAIL

Re: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

2006-06-04 Thread ogjunk-nutch
Hi, What exactly does this plugin do? I haven't seen it mentioned and the README.txt doesn't really describe it. Thanks, Otis - Original Message From: [EMAIL PROTECTED] To: nutch-commits@lucene.apache.org Sent: Sunday, June 4, 2006 3:44:23 PM Subject: [Nutch-cvs] svn commit: r411594

Re: [Nutch-cvs] svn commit: r409869 - in /lucene/nutch/trunk/contrib/web2/plugins/caching-oscache/src/java/org: ./ apache/ apache/nutch/ apache/nutch/webapp/ apache/nutch/webapp/controller/

2006-05-28 Thread ogjunk-nutch
Spotted a reference to NutchReferehPolicy(); : EntryRefreshPolicy policy=new NutchReferehPolicy(); Typo? Otis - Original Message From: [EMAIL PROTECTED] To: nutch-commits@lucene.apache.org Sent: Saturday, May 27, 2006 4:36:56 PM Subject: [Nutch-cvs] svn commit: r409869 - in

Re: Nutch 0.7.2

2006-03-09 Thread ogjunk-nutch
I'm still on 0.7*, and would welcome a new release. Otis - Original Message From: Piotr Kosiorowski [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, March 9, 2006 3:31:09 PM Subject: Nutch 0.7.2 Hello, I would like to release nutch 0.7.2 in a week or two. Some serious

Re: ignore eclipse .project and .classpath

2006-02-09 Thread ogjunk-nutch
Done. - Original Message From: Stefan Groschupf [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Wed 08 Feb 2006 03:15:15 PM EST Subject: Re: ignore eclipse .project and .classpath +1 On 08.02.2006 at 06:16, Chris Mattmann wrote: Hi Folks, Just wondering if someone

Re: tool to mount nutch filesystem

2006-01-20 Thread ogjunk-nutch
Hi John, NDFS + MapReduce will soon become a separate Lucene sub-project. Otis - Original Message From: John X [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Cc: [EMAIL PROTECTED] Sent: Fri 20 Jan 2006 07:55:17 PM EST Subject: tool to mount nutch filesystem I have created a simple

Re: wiki:commandline options classpaths

2006-01-09 Thread ogjunk-nutch
Yes, everything is in org.apache now, I believe. Thanks for helping out. Otis - Original Message From: Jerry Russell [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Mon 09 Jan 2006 02:20:02 PM EST Subject: wiki:commandline options classpaths I noticed that the command line

Re: Normalizing URLs with anchors

2006-01-05 Thread ogjunk-nutch
I think it's safe to strip anchors, as they simply point to a different portion of the same page for browser rendering. I do that for Simpy while normalizing URLs, in order not to have duplicates like this. Otis - Original Message From: Ken Krugler [EMAIL PROTECTED] To:
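Stripping anchors during normalization could look roughly like the sketch below. It uses urldefrag from Python's standard library; the helper names strip_anchor and dedupe are made up for illustration, and this is neither Simpy's nor Nutch's actual normalizer.

```python
from urllib.parse import urldefrag


def strip_anchor(url: str) -> str:
    """Drop the #fragment part, which only selects a position within the page."""
    return urldefrag(url).url


def dedupe(urls):
    """Return URLs in first-seen order, treating anchor variants as duplicates."""
    seen = set()
    unique = []
    for url in urls:
        normalized = strip_anchor(url)
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique


if __name__ == "__main__":
    print(dedupe([
        "http://example.com/page.html#intro",
        "http://example.com/page.html#details",
        "http://example.com/other.html",
    ]))
```

Only the fragment is removed here. Query strings are left alone, since they can select different content on the server side.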

RE: Re[2]: what contibute to fetch slowing down

2005-10-11 Thread ogjunk-nutch
Fuad, I think you are constantly comparing apples and oranges here. It looks like your new code simply hammers the server sending multiple requests to a single server in parallel. That's a big no-no in a web crawling/spidering/fetching world, as bad as not obeying robots.txt. The fact that the
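To make the politeness point concrete, here is a small sketch that spaces out requests to the same host instead of sending them in parallel, while letting different hosts proceed independently. It is not Nutch's fetcher: the function name schedule and the delay_seconds parameter are hypothetical, and the delay value is arbitrary.

```python
from urllib.parse import urlsplit


def schedule(urls, delay_seconds=1.0):
    """Assign each URL an earliest start time (seconds from now).

    Requests to the same host are spaced `delay_seconds` apart, so no host
    ever sees two of our requests at once. Hosts are independent of each other.
    """
    next_free = {}  # host -> earliest time the next request may start
    plan = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + delay_seconds
    return plan


if __name__ == "__main__":
    for start, url in schedule(
        ["http://a.example/1", "http://a.example/2", "http://b.example/1"],
        delay_seconds=2.0,
    ):
        print(start, url)
```

A crawler built this way is slower against any single server than one that opens many parallel connections to it, which is exactly the trade-off the message describes.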

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-10 Thread ogjunk-nutch
Hi Gal, I'm curious about the memory consumption of the cache and the speed of retrieval of an item from the cache, when the cache has 100k domains in it. Thanks, Otis --- Gal Nitzan [EMAIL PROTECTED] wrote: Hi Michael, At the moment I have about 3000 domains in my db. I didn't time the

Re: [Nutch-cvs] [Nutch Wiki] Update of ParserFactoryImprovementProposal by ChrisMattmann

2005-09-15 Thread ogjunk-nutch
Quick comment about order=N and the paragraph that describes how to deal with cases where people mess things up and enter multiple plugins for the same content type and the same order: - Why is the order attribute even needed? It looks like a redundant piece of information - why not derive order

Re: [Nutch-cvs] [Nutch Wiki] Update of ParserFactoryImprovementProposal by ChrisMattmann

2005-09-15 Thread ogjunk-nutch
Sounds good to me. Otis --- Chris Mattmann [EMAIL PROTECTED] wrote: Hi Otis, Point taken. In actuality since both convey the same information I think that it's okay to support both, but by default say we could code the initial plugins specified in parse-plugins.xml without the order=

Re: merge mapred to trunk

2005-08-31 Thread ogjunk-nutch
Currently we have three versions of nutch: trunk, 0.7 and mapred. This increases the chances for conflicts. I would thus like to merge the mapred branch into trunk soon. The soonest I could actually start this is next week. Are there any objections? I, too, am looking forward to this,

Re: merge mapred to trunk

2005-08-31 Thread ogjunk-nutch
--- Doug Cutting [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: I, too, am looking forward to this, but I am wondering what that will do to Kelvin Tan's recent contribution, especially since I saw that both MapReduce and Kelvin's code change how FetchListEntry works. If merging