I have not looked into this deeply, but this change would make me nervous, too.
The main reason for that is that I have never seen this error, and the error
makes me think that something is simply giving the SegmentMerger wrong/bad
input.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - So
Use nutch-user mailing list, please. I'll reply there.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: m.harig <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Monday, June 9, 2008 12:17:07 PM
> Subject: nutch-0.9 and hadoop-0.
Michael & Lincoln,
It's great to see you two working on this. In general, this is best done via
JIRA, really. The "process" is roughly as follows:
* Open a new JIRA issue, describing what it's about, what you are trying to
solve
* Prepare a patch locally, off of nutch trunk/head, naming it aft
rg
> Sent: Friday, June 6, 2008 3:56:30 AM
> Subject: Re: nutch file content limit
>
>
> is there any way to index partial content of doc/xls/rtf . if its not
> possible let me know.
>
>
> ogjunk-nutch wrote:
> >
> > I *think* you have to fetch the *full* co
I *think* you have to fetch the *full* content of MS Word docs (and PDFs and
RTFs and ...) if you want parsers that handle those documents to be able to
parse them. A partial MS Word/PDF/RTF/... document is considered
invalid/broken. Try opening it with MS Word, for example -- it will not work
How's this:
$ grep -n content nutch/conf/nutch-default.xml
28: file.content.limit
30: The length limit for downloaded content, in bytes.
31: If this value is nonnegative (>=0), content longer than it will be
truncated;
37: file.content.ignored
39: If true, no file content will be saved duri
Works, thanks Andrzej!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Andrzej Bialecki <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Wednesday, May 21, 2008 5:43:11 PM
> Subject: Re: Adding Otis to JIRA
>
> Otis Gospodnetic
Hi,
But shouldn't this be the expected behaviour? In each of the examples the
query really is bad/invalid, uses incorrect syntax.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: ivrokv <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
Hi,
Yes, you have to add your plugin to nutch-site.xml, along with other plugins
you probably already have defined there. If you don't have them in
nutch-site.xml, look at nutch-default.xml
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From:
Hi,
You are missing some ant jars. I'm not sure which ones, but it looks like the
class that cannot be found is TraXLiaison, so once you google it you'll find
which optional ant jar this is in. Get that jar, put it in your ant home's lib
dir and try again.
Otis
--
Sematext -- http://sematext.c
Thank you! I'll do my best to help out.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Andrzej Bialecki <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, May 8, 2008 1:55:36 PM
> Subject: Welcome Otis Gospodnetic as Nu
You don't have to update CrawlDb after every fetch cycle, so keeping the
generated CrawlDatums from one generate run might be useful.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: wuqi <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
Thanks, Andrzej. So the disconnect was only about measuring (download speed,
in my mind) per URL vs. per host.
In that case, I think we are talking about a small change (to Fetcher2) that
might
look like this:
+ // time the request
+ long fetchStart = System.currentTimeMillis(
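
A minimal, standalone sketch of the per-URL timing idea discussed above. It is
not the actual Fetcher2 patch (which is cut off here); the class name
FetchTimer and both method names are hypothetical and chosen only for
illustration.

```java
// Hypothetical helper, not part of Nutch: times one fetch and derives a
// per-URL download speed from the byte count and the elapsed time.
public class FetchTimer {

    /** Runs the given fetch action and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(Runnable fetch) {
        long fetchStart = System.currentTimeMillis();
        fetch.run();
        return System.currentTimeMillis() - fetchStart;
    }

    /** Download speed in bytes per second; returns 0 when no time was measured. */
    public static double bytesPerSecond(long bytes, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return bytes * 1000.0 / elapsedMillis;
    }
}
```

A per-host figure could then be kept by summing bytes and elapsed time for all
URLs of one host before calling bytesPerSecond.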
Adding some comments to the email below, but here on nutch-dev.
Basically, it is my feeling that whenever fetchlists (and its parts) are not
"well balanced", this inefficiency will be seen.
Concretely, whichever task is "stuck fetching from the slow server with a lot
of its pages in the fetchlis
Hi,
(Andrzej - sorry about line length - I don't see an option for that in Y! Mail
now/any more, BCCing my non-Y account to see what's going on)
- Original Message
> From: Andrzej Bialecki <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Saturday, April 19, 2008 6:07:17 PM
>
We've got mail.
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Forwarded Message
From: Martin Cooper (JIRA) <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Saturday, April 19, 2008 2:44:21 PM
Subject: [jira] Closed: (INFRA-1583) Wiki => email not working for Nutch
You are both in agreement, but I don't fully follow as I'm not intimately
familiar with all the files and structures yet.
- Fetcher-s putting info about hosts into crawl_fetch for each fetched segment
makes sense. I see Fetcher(2) uses FetcherOutputFormat, which has its own
RecordWriter, which
OK: https://issues.apache.org/jira/browse/INFRA-1583
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Dennis Kubes <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Monday, April 14, 2008 1:04:25 AM
Subject: Re: Wiki -> email -> nutch-dev
Is this a thing for infrastructure@ or infrastructure JIRA?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Dennis Kubes <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Sunday, April 13, 2008 4:52:51 PM
Subject: Re: Wiki -> email -> nu
Hi,
It looks like Nutch's Wiki is not configured to send email to nutch-dev when
its pages are changed. Is this on purpose? Not that I need more email in my
life, but it does help (me) passively keep track of new knowledge posted on the
Wiki.
I see there are recent changes listed on
http://
Hi Amit,
There is no semantic summarizer (What exactly would it do? Can you provide an
example?).
There is a more or less "standard" snippet/highlighter - a lot like what you
see on Google's search results, for example.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- O
Hi Andrzej,
Sure, that sounds good - thanks!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thursday, April 10, 2008 4:45:08 AM
Subject: Re: [jira] Updated: (NUTCH-627)
Broad question, broad answer: free, scalable, extensible, open-source are a few
characteristics that come to mind.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: minskv <[EMAIL PROTECTED]>
To: nutch-dev
Sent: Wednesday, April 9, 2008 2:44:51
Hi Ed,
The answer is no, though I'm not sure if you really meant to ask on the Nutch
mailing list. Lucene mailing list ([EMAIL PROTECTED]) would be a better place
to ask, if you haven't already.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
Fro
Hi Dennis,
Not too late, I think; just add the Nutch + Solr idea to
http://wiki.apache.org/general/SummerOfCode2008 on Monday.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Dennis Kubes <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: S
Vinci,
Please use the mailing list to ask questions and discuss first, not JIRA.
Also, please include an example of what you are describing, if you can.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Vinci (JIRA) <[EMAIL PROTECTED]>
Hi Susam,
Good question, and I'm afraid we may be a little late:
http://wiki.apache.org/general/SummerOfCodeMentor
I think the main problem is that nobody has time to be the mentor.
As for ideas, I think Solr integration would be very nice to have. Solr, with
its recent support for distrib
I don't think there is anything you can do about this on the Nutch end. I do
know that Java now has the ability to differentiate between different NICs, but
Nutch doesn't have support for that. There may be something you can do on the
OS level, though I don't have any concrete advice there, un
Hi Jeff,
Nice. Could you submit this to JIRA as a patch?
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message
From: Jeff Maki <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thursday,
+1
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message
From: Chris Mattmann <[EMAIL PROTECTED]>
To: "nutch-dev@lucene.apache.org"
Sent: Tuesday, March 27, 2007 1:43:17 AM
Subject: [VOTE] Releas
Hi,
I think the robots.txt example you used was invalid (no path for that last
Disallow rule).
Small patch indeed, but sticking it in JIRA would still make sense because:
- it leaves a good record of the bug + fix
- it could be used for release notes/changelog
Not trying to be picky, just pointi
All good arguments, and as nobody else voiced the desire to have this other
branch of Nutch I was rambling about, I'll consider this thread done.
Thanks for the explanations, Doug.
Otis
- Original Message
From: Doug Cutting <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Monda
Yes, certainly, anything that can be shared and decoupled from pieces that make
each branch (not SVN/CVS branch) different, should be decoupled. But I was
really curious about whether people think this is a valid idea/direction, not
necessarily immediately how things should be implemented. In
Regarding Thai, there is a Thai Analyzer in Lucene already:
$ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
total 24
drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
-rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
-rw-rw-r-- 1 otis otis 2437 Jun 5 14:27 ThaiWordFilter.j
Old issue. I don't think there were any conclusions. Not sure if Maven2 would
be a good thing, because I haven't used Maven in 2+ years, and I understand
it's changed a lot since then.
The best way to move anything forward is to contribute the solution/fix/patch
and then persuade others to gi
Sorry if I missed followups to this (catching up on emails after vacation).
This sounds like a good idea (because JIRA is often full of bug reports,
enhancement requests, and only some issues have patches, and those can get
stale, so reviewing and applying them quickly is important).
I took a qu
I might be just tired, but I don't see the difference between those two lines.
Otis
- Original Message
From: Michael Wechner <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Tuesday, August 22, 2006 9:07:12 AM
Subject: Ontology compile bug
Hi
It seems to me that refine-query-i
That looks right - committed, thanks.
Otis
- Original Message
From: Doğacan Güney <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thursday, August 24, 2006 4:15:53 AM
Subject: HTTP/1.1 problem
Hello everyone,
There is a small bug in lib-http plugin code that prevents
protoco
Thanks, committed.
Otis
- Original Message
From: Matthew Holt <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Sent: Wednesday, August 9, 2006 9:51:19 AM
Subject: Error in 0.8 regex-urlfilter.txt
I was doing a search and noticed that a 'png' file was inde
Pascal,
Forgot to say - attachments get stripped. Please put them in JIRA.
Thanks,
Otis
- Original Message
From: Pascal Beis <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Monday, August 7, 2006 4:17:33 AM
Subject: Patch: deflate encoding
Hi all,
I've added support for d
Ja, ja!
Otis
- Original Message
From: Pascal Beis
To: nutch-dev@lucene.apache.org
Sent: Monday, August 7, 2006 4:17:33 AM
Subject: Patch: deflate encoding
Hi all,
I've added support for deflate encoding (next to gzip) to Nutch. Is there
interest to include this into th
Have a look at Lucene's contrib/:
$ ff \*ISO\*java
./src/test/org/apache/lucene/analysis/TestISOLatin1AccentFilter.java
./src/java/org/apache/lucene/analysis/ISOLatin1AccentFilter.java
Otis
- Original Message
From: Stefan Neufeind <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent
Thanks, that was it.
Otis
- Original Message
From: Sami Siren <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 27, 2006 2:45:13 PM
Subject: Re: org.farng and com.etranslate
[EMAIL PROTECTED] wrote:
>Hi,
>
>I have 2 unresolved imports in my IDE's Nutch project:
>
Hi,
I have 2 unresolved imports in my IDE's Nutch project:
org.farng.*(used in parse-mp3)
com.etranslate.* (used in parse-rtf)
Where are these classes from? I searched in the trunk, and couldn't find their
jars. I searched online (Krugle, too!), and couldn't find where these th
I was going to suggest the same approach. Seems simple enough and would force
the person to edit the config. What is entered in place of EDITME is another
story, but maybe some code can enforce some rules on that, too.
Otis
- Original Message
From: Teruhiko Kurosaka <[EMAIL PROTECTED
Hi,
What exactly does this plugin do? I haven't seen it mentioned and the
README.txt doesn't really describe it.
Thanks,
Otis
- Original Message
From: [EMAIL PROTECTED]
To: nutch-commits@lucene.apache.org
Sent: Sunday, June 4, 2006 3:44:23 PM
Subject: [Nutch-cvs] svn commit: r411594 -
Spotted a reference to "NutchReferehPolicy();" :
EntryRefreshPolicy policy=new NutchReferehPolicy();
Typo?
Otis
- Original Message
From: [EMAIL PROTECTED]
To: nutch-commits@lucene.apache.org
Sent: Saturday, May 27, 2006 4:36:56 PM
Subject: [Nutch-cvs] svn commit: r409869 - in
/lucene
This thread seems to have gotten very little attention.
Jérôme - I'm all for extracting sub-libraries that can really live on their own
and are substantial enough to warrant "their own identity".
Personally, I'm most interested in the Language Identifier plugin becoming a
standalone, Nutch-indepen
I'm still on 0.7*, and would welcome a new release.
Otis
- Original Message
From: Piotr Kosiorowski <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thursday, March 9, 2006 3:31:09 PM
Subject: Nutch 0.7.2
Hello,
I would like to release nutch 0.7.2 in a week or two. Some serious
Done.
- Original Message
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Wed 08 Feb 2006 03:15:15 PM EST
Subject: Re: ignore eclipse .project and .classpath
+1
Am 08.02.2006 um 06:16 schrieb Chris Mattmann:
> Hi Folks,
>
>
>
> Just wondering if someon
I would like to decouple Lang Id from Nutch and move it into Lucene contrib/ in
the near future.
Does that sound ok?
Otis
- Original Message
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Mon 23 Jan 2006 02:55:46 PM EST
Subject: Re: lang identifier and
Hi John,
NDFS + MapReduce will soon become a separate Lucene sub-project.
Otis
- Original Message
From: John X <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Fri 20 Jan 2006 07:55:17 PM EST
Subject: tool to mount nutch filesystem
I have created a simple
Yes, everything is in org.apache now, I believe. Thanks for helping out.
Otis
- Original Message
From: Jerry Russell <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Mon 09 Jan 2006 02:20:02 PM EST
Subject: wiki:commandline options classpaths
I noticed that the command line
I think it's safe to strip anchors, as they simply point to a different portion
of the same page for browser rendering. I do that for Simpy while normalizing
URLs, in order not to have duplicates like this.
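
A minimal sketch of that normalization step, assuming the anchor is everything
from the first '#' onward. The class name UrlFragmentStripper is hypothetical;
this is neither Simpy's nor Nutch's actual normalizer code.

```java
// Hypothetical helper: drops the fragment ("anchor") from a URL so that
// page.html and page.html#section2 normalize to the same string.
public class UrlFragmentStripper {

    /** Returns the URL without its fragment; URLs without '#' are returned unchanged. */
    public static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }
}
```

Comparing the stripped strings is then enough to detect this kind of
duplicate before a URL is added to the fetchlist.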
Otis
- Original Message
From: Ken Krugler <[EMAIL PROTECTED]>
To: nutch-dev@lu
I'm late, but better late than never: +1 (I thought Stefan was already a
committer, actually).
Stefan: will you be putting some of those media-style Nutch tutorials in
Nutch's own Wiki?
Otis
- Original Message
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Fuad,
Please stick this in JIRA - it will get lost in a pile of incoming
Nutch email.
Otis
--- Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Thanks Paul,
>
>
> It would be nice to have this piece of code in
> org.apache.nutch.fetcher.Fetcher
>
>
> private static final int DNS_CACHE_TTL_MINUTES =
Hi,
I find it a bit hard to follow your various ideas here, but I'll add my
comments to some parts below.
--- "Fuad Efendi (JIRA)" <[EMAIL PROTECTED]> wrote:
> [
>
http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331892
> ]
>
> Fuad Efendi commented on NUTCH-109:
> ---
Fuad,
I think you are constantly comparing apples and oranges here. It looks
like your new code simply hammers the server by sending multiple requests
to a single server in parallel. That's a big no-no in the web
crawling/spidering/fetching world, as bad as not obeying robots.txt.
The fact that the
Hi Gal,
I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.
Thanks,
Otis
--- Gal Nitzan <[EMAIL PROTECTED]> wrote:
> Hi Michael,
>
> At the moment I have about 3000 domains in my db. I didn't time t
Sounds good to me.
Otis
--- Chris Mattmann <[EMAIL PROTECTED]> wrote:
> Hi Otis,
>
> Point taken. In actuality since both convey the same information I
> think
> that it's okay to support both, but by default say we could code the
> initial
> plugins specified in parse-plugins.xml without the
Well, you have to tell users about order="N" somewhere in the docs.
Instead of telling them about order="N", tell them that the order in
XML matters. Either case requires education, and the latter one
requires less typing and avoids the case described in the proposal.
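
A rough sketch of what "the order in XML matters" could look like in code,
assuming a parse-plugins.xml layout with mimeType elements (name attribute)
that contain plugin elements (id attribute). The element and attribute names
and the class PluginOrder are assumptions made for this example, not a quote
of the proposal or of Nutch code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: the plugin preference for a content type is simply
// the order in which the plugin elements appear in the file, so no
// order="N" attribute is needed and two plugins cannot share one rank.
public class PluginOrder {

    /** Returns plugin ids for the given MIME type, in document order. */
    public static List<String> pluginIdsInOrder(String xml, String mimeType) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse plugin configuration", e);
        }
        List<String> ids = new ArrayList<>();
        // getElementsByTagName returns matches in document order.
        NodeList types = doc.getElementsByTagName("mimeType");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            if (!mimeType.equals(type.getAttribute("name"))) {
                continue;
            }
            NodeList plugins = type.getElementsByTagName("plugin");
            for (int j = 0; j < plugins.getLength(); j++) {
                ids.add(((Element) plugins.item(j)).getAttribute("id"));
            }
        }
        return ids;
    }
}
```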
Otis
--- Sébastien LE CALL
Quick comment about order="N" and the paragraph that describes how to
deal with cases where people mess things up and enter multiple plugins
for the same content type and the same order:
- Why is the order attribute even needed? It looks like a redundant
piece of information - why not derive orde
--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > I, too, am looking forward to this, but I am wondering what that
> will
> > do to Kelvin Tan's recent contribution, especially since I saw that
> > both MapReduce and Kelvin's code change how FetchListEntry works.
> If
> >
> Currently we have three versions of nutch: trunk, 0.7 and mapred.
> This
> increases the chances for conflicts. I would thus like to merge the
> mapred branch into trunk soon. The soonest I could actually start
> this is next week. Are there any objections?
I, too, am looking forward to th
> Glancing at other Apache projects in subversion, I see that httpd
> uses
> branch names like "2.2.x" and tag names like "2.2.4". That's a
> little
> cryptic. I propose that we use branch names like "branch-2.4" and
> tag
> names like "release-2.4.1". What do folks think?
I agree. That is
I see several instances of 'analySer' in comments/javadoc and some
variables. That should probably be changed to the American spelling,
'analyzer', for consistency's sake.
Otis
--- [EMAIL PROTECTED] wrote:
> Author: jerome
> Date: Fri Aug 26 15:47:04 2005
> New Revision: 240359
>
> URL: http://svn.
15.9. is almost a month away. I think it would be good to take a look
at Kelvin's modifications and include it either in 0.7.1, or maybe 0.8
(without map-reduce, which would then be in 0.9).
Kelvin, you should put your code in Jira.
Otis
--- Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
> Hell
From what I heard from Kelvin, the Spring part could be thrown out and
replaced with classes with main().
I think there is a need for having the Fetcher component more separated
from the rest of Nutch. The Fetcher alone is well done and quite
powerful on its own - it has host-based queues, doesn