, 2008 3:56:30 AM
Subject: Re: nutch file content limit
Is there any way to index partial content of doc/xls/rtf files? If it's not
possible, let me know.
ogjunk-nutch wrote:
I *think* you have to fetch the *full* content of MS Word docs (and PDFs and
RTFs and ...) if you want parsers that handle those documents to be able to
parse them. A partial MS Word/PDF/RTF/... document is considered
invalid/broken. Try opening it with MS Word, for example -- it will not
How's this:
$ grep -n content nutch/conf/nutch-default.xml
28: <name>file.content.limit</name>
30: <description>The length limit for downloaded content, in bytes.
31: If this value is nonnegative (>=0), content longer than it will be truncated;
37: <name>file.content.ignored</name>
39: <description>If
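The truncation rule quoted in that description can be sketched in a few lines of Java. This is only an illustration of the stated semantics, not Nutch's actual code: the class and method names (ContentLimitSketch, applyLimit) are hypothetical, and treating a negative limit as "no truncation" is an assumption, since the grep output above is cut off before it says so.

```java
import java.util.Arrays;

// Hypothetical sketch of the file.content.limit semantics; not taken from Nutch.
class ContentLimitSketch {

    // If limit is nonnegative (>= 0), content longer than it is truncated.
    // Assumption: a negative limit means the content is left untouched.
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit >= 0 && content.length > limit) {
            return Arrays.copyOf(content, limit); // keep only the first `limit` bytes
        }
        return content;
    }

    public static void main(String[] args) {
        byte[] doc = new byte[100]; // stand-in for a downloaded document
        System.out.println(applyLimit(doc, 64).length); // truncated to the limit
        System.out.println(applyLimit(doc, -1).length); // negative limit: unchanged
    }
}
```

Note that a document cut off this way is exactly the "partial" MS Word/PDF/RTF file discussed above, which the format-specific parsers will likely reject.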
Works, thanks Andrzej!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wednesday, May 21, 2008 5:43:11 PM
Subject: Re: Adding Otis to JIRA
Otis Gospodnetic wrote:
Thank you! I'll do my best to help out.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, May 8, 2008 1:55:36 PM
Subject: Welcome Otis Gospodnetic as Nutch
Adding some comments to the email below, but here on nutch-dev.
Basically, it is my feeling that whenever fetchlists (and its parts) are not
well balanced, this inefficiency will be seen.
Concretely, whichever task is stuck fetching from the slow server with a lot
of its pages in the fetchlist,
We've got mail.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Forwarded Message
From: Martin Cooper (JIRA) [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, April 19, 2008 2:44:21 PM
Subject: [jira] Closed: (INFRA-1583) Wiki = email not working for Nutch
You are both in agreement, but I don't fully follow as I'm not intimately
familiar with all the files and structures yet.
- Fetcher-s putting info about hosts into crawl_fetch for each fetched segment
makes sense. I see Fetcher(2) uses FetcherOutputFormat, which has its own
RecordWriter,
OK: https://issues.apache.org/jira/browse/INFRA-1583
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Dennis Kubes [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Monday, April 14, 2008 1:04:25 AM
Subject: Re: Wiki - email - nutch-dev?
Hi,
It looks like Nutch's Wiki is not configured to send email to nutch-dev when
its pages are changed. Is this on purpose? Not that I need more email in my
life, but it does help (me) passively keep track of new knowledge posted on the
Wiki.
I see there are recent changes listed on
Hi Amit,
There is no semantic summarizer (What exactly would it do? Can you provide an
example?).
There is a more or less standard snippet/highlighter - a lot like what you
see on Google's search results, for example.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
-
Hi Andrzej,
Sure, that sounds good - thanks!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, April 10, 2008 4:45:08 AM
Subject: Re: [jira] Updated: (NUTCH-627)
Broad question, broad answer: free, scalable, extensible, open-source are a few
characteristics that come to mind.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: minskv [EMAIL PROTECTED]
To: nutch-dev nutch-dev@lucene.apache.org
Sent:
Vinci,
Please use the mailing list to ask questions and discuss first, not JIRA.
Also, please include an example of what you are describing, if you can.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Vinci (JIRA) [EMAIL PROTECTED]
Hi Dennis,
Not too late, I think, just add Nutch + Solr idea to
http://wiki.apache.org/general/SummerOfCode2008 on Monday.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Dennis Kubes [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent:
Hi Susam,
Good question, and I'm afraid we may be a little late:
http://wiki.apache.org/general/SummerOfCodeMentor
I think the main problem is that nobody has time to be the mentor.
As for ideas, I think Solr integration would be very nice to have. Solr, with
its recent support for
I don't think there is anything you can do about this on the Nutch end. I do
know that Java now has the ability to differentiate between different NICs, but
Nutch doesn't have support for that. There may be something you can do on the
OS level, though I don't have any concrete advice there,
Hi Jeff,
Nice. Could you submit this to JIRA as a patch?
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message
From: Jeff Maki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday,
Regarding Thai, there is a Thai Analyzer in Lucene already:
$ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
total 24
drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
-rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
-rw-rw-r-- 1 otis otis 2437 Jun 5 14:27
Sorry if I missed followups to this (catching up on emails after vacation).
This sounds like a good idea (because JIRA is often full of bug reports,
enhancement requests, and only some issues have patches, and those can get
stale, so reviewing and applying them quickly is important).
I took a
That looks right - committed, thanks.
Otis
- Original Message
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, August 24, 2006 4:15:53 AM
Subject: HTTP/1.1 problem
Hello everyone,
There is a small bug in lib-http plugin code that prevents
I might be just tired, but I don't see the difference between those two lines.
Otis
- Original Message
From: Michael Wechner [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, August 22, 2006 9:07:12 AM
Subject: Ontology compile bug
Hi
It seems to me that
Thanks, committed.
Otis
- Original Message
From: Matthew Holt [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Sent: Wednesday, August 9, 2006 9:51:19 AM
Subject: Error in 0.8 regex-urlfilter.txt
I was doing a search and noticed that a 'png' file was
Yes, yes!
Otis
- Original Message
From: Pascal Beis
To: nutch-dev@lucene.apache.org
Sent: Monday, August 7, 2006 4:17:33 AM
Subject: Patch: deflate encoding
Hi all,
I've added support for deflate encoding (next to gzip) to nutch. Is there
interest to include this into
Pascal,
Forgot to say - attachments get stripped. Please put them in JIRA.
Thanks,
Otis
- Original Message
From: Pascal Beis [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Monday, August 7, 2006 4:17:33 AM
Subject: Patch: deflate encoding
Hi all,
I've added support for
Hi,
I have 2 unresolved imports in my IDE's Nutch project:
org.farng.* (used in parse-mp3)
com.etranslate.* (used in parse-rtf)
Where are these classes from? I searched in the trunk, and couldn't find their
jars. I searched online (Krugle, too!), and couldn't find where these
I was going to suggest the same approach. Seems simple enough and would force
the person to edit the config. What is entered in place of EDITME is another
story, but maybe some code can enforce some rules on that, too.
Otis
- Original Message
From: Teruhiko Kurosaka [EMAIL
Hi,
What exactly does this plugin do? I haven't seen it mentioned and the
README.txt doesn't really describe it.
Thanks,
Otis
- Original Message
From: [EMAIL PROTECTED]
To: nutch-commits@lucene.apache.org
Sent: Sunday, June 4, 2006 3:44:23 PM
Subject: [Nutch-cvs] svn commit: r411594
Spotted a reference to NutchReferehPolicy(); :
EntryRefreshPolicy policy=new NutchReferehPolicy();
Typo?
Otis
- Original Message
From: [EMAIL PROTECTED]
To: nutch-commits@lucene.apache.org
Sent: Saturday, May 27, 2006 4:36:56 PM
Subject: [Nutch-cvs] svn commit: r409869 - in
I'm still on 0.7*, and would welcome a new release.
Otis
- Original Message
From: Piotr Kosiorowski [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, March 9, 2006 3:31:09 PM
Subject: Nutch 0.7.2
Hello,
I would like to release nutch 0.7.2 in a week or two. Some serious
Done.
- Original Message
From: Stefan Groschupf [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wed 08 Feb 2006 03:15:15 PM EST
Subject: Re: ignore eclipse .project and .classpath
+1
On 08.02.2006 at 06:16, Chris Mattmann wrote:
Hi Folks,
Just wondering if someone
Hi John,
NDFS + MapReduce will soon become a separate Lucene sub-project.
Otis
- Original Message
From: John X [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Fri 20 Jan 2006 07:55:17 PM EST
Subject: tool to mount nutch filesystem
I have created a simple
Yes, everything is in org.apache now, I believe. Thanks for helping out.
Otis
- Original Message
From: Jerry Russell [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Mon 09 Jan 2006 02:20:02 PM EST
Subject: wiki:commandline options classpaths
I noticed that the command line
I think it's safe to strip anchors, as they simply point to a different portion
of the same page for browser rendering. I do that for Simpy while normalizing
URLs, in order not to have duplicates like this.
Otis
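As a rough illustration of the anchor stripping described in the message above, here is a minimal Java sketch. It is not Simpy's or Nutch's normalizer: the class name, method name, and example.com URLs are hypothetical, and it assumes the fragment is everything from the first '#' onward.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of fragment stripping during URL normalization.
class AnchorStripSketch {

    // Drop the fragment ("#...") so that links into different parts
    // of the same page normalize to one URL.
    static String stripAnchor(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String[] seen = {
            "http://example.com/page.html#top",
            "http://example.com/page.html#section2",
            "http://example.com/page.html"
        };
        Set<String> unique = new LinkedHashSet<>();
        for (String url : seen) {
            unique.add(stripAnchor(url)); // duplicates collapse after stripping
        }
        System.out.println(unique);
    }
}
```

Plain string handling is used here instead of java.net.URI so that the rest of the URL, including any percent-encoding, is passed through unchanged.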
- Original Message
From: Ken Krugler [EMAIL PROTECTED]
To:
Fuad,
I think you are constantly comparing apples and oranges here. It looks
like your new code simply hammers the server sending multiple requests
to a single server in parallel. That's a big no-no in a web
crawling/spidering/fetching world, as bad as not obeying robots.txt.
The fact that the
Hi Gal,
I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.
Thanks,
Otis
--- Gal Nitzan [EMAIL PROTECTED] wrote:
Hi Michael,
At the moment I have about 3000 domains in my db. I didn't time the
Quick comment about order=N and the paragraph that describes how to
deal with cases where people mess things up and enter multiple plugins
for the same content type and the same order:
- Why is the order attribute even needed? It looks like a redundant
piece of information - why not derive order
Sounds good to me.
Otis
--- Chris Mattmann [EMAIL PROTECTED] wrote:
Hi Otis,
Point taken. In actuality since both convey the same information I think
that it's okay to support both, but by default say we could code the initial
plugins specified in parse-plugins.xml without the order=
Currently we have three versions of nutch: trunk, 0.7 and mapred. This
increases the chances for conflicts. I would thus like to merge the
mapred branch into trunk soon. The soonest I could actually start
this is next week. Are there any objections?
I, too, am looking forward to this,
--- Doug Cutting [EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:
I, too, am looking forward to this, but I am wondering what that will
do to Kelvin Tan's recent contribution, especially since I saw that
both MapReduce and Kelvin's code change how FetchListEntry works. If
merging
40 matches