I think this is for [EMAIL PROTECTED]; please remove java-dev@ when
replying.
--- Michael Ji [EMAIL PROTECTED] wrote:
hi,
I saw several discussions about the Distributed Link Analysis Tool before,
and I still have a question about the usage of the next score field in the
Page data structure.
Hi,
I'm interested in Language Identifier plugin that Sami and Jerome put together.
I noticed the list of supported languages does not include CJK languages:
http://wiki.apache.org/nutch/LanguageIdentifierPlugin
I'm wondering:
1. why is that? (technical difficulty of some kind?)
2. are
Fabian - blo.gs, weblogs.com's changes.xml and pingomatic should be sufficient
to get a good coverage (and solid overlap) of the blogosphere. There used to
be FeedMesh, too, run by PubSub, but since PubSub is long gone, so is
FeedMesh, I believe.
Got a site with a public demo?
Otis
to that information or if we can
access to pingomatic services of updated blogs. Do you know something about
this?
Thanks for your answer.
2007/9/3, Otis Gospodnetic [EMAIL PROTECTED]:
Fabian - blo.gs, weblogs.com's changes.xml and pingomatic should be
sufficient to get a good coverage (and solid overlap
Hi,
I'm curious about what Tomislav is asking about, too -- how do searchers know
when to reopen the index? That is, say you have a cluster of fetchers and
every once in a while you end up with a newer version of an index (or indices),
and say that you simply scp those indices to searchers,
Dennis,
Does the tmpfs really help more than the normal FS caching would help?
For example, if you were to force the FS to read the whole index (files), it
would read them into RAM and, hopefully, cache them. Wouldn't that achieve the
same effect as tmpfs? I've done the former with very large
to the application and IO drops to
practically zero when used.
Dennis Kubes
Otis Gospodnetic wrote:
Dennis,
Does the tmpfs really help more than the normal FS caching would
help?
For example, if you were to force the FS to read the whole index
(files), it would read them into RAM and, hopefully, cache
Have you considered using EC2 during your testing/development stage? It would
be safer than investing in the wrong hardware with insufficient knowledge of
the exact demands and requirements.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From:
Hm, I didn't see that comment before. I think indexing incoming text is super
obvious, the equivalent to human annotation/tagging of web pages, no?
As for which anchor texts not to index: hm, not sure. Nothing from spam
pages? Nothing from non-authoritative pages even if they are not
Oleg - just a quick pointer to adaptive refetching - is this not already
available?
See https://issues.apache.org/jira/browse/NUTCH-61
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Oleg Mürk [EMAIL PROTECTED]
To:
It sounds like you really want to create a simplistic crawler for something
that small. Nutch does a *pile* of other stuff that you don't seem to care
about. Google for: open source web crawlers. I think there is one called
Sphynx that is simple.
Otis
--
Sematext -- http://sematext.com/ --
Dennis Co.
Is the 0.15.* - 0.16 upgrade seamless? That is, a jar replacement and that's
it, or is there an explicit HDFS upgrade step involved?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Dennis Kubes [EMAIL PROTECTED]
To:
You can certainly use the Lucene version that your version of Nutch uses.
Lucene had a few releases since the last Nutch release (0.9).
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Duan, Nick [EMAIL PROTECTED]
To:
Siva - you can't really just use the Lucene demo tool nor that luceneweb thing
and expect it to search your Nutch-created Lucene index. The two index
structures (their fields) are quite different. I don't want to self-promote,
but if you can, get a copy of Lucene in Action in order to get a
Aha, I see several answers on the Nutch ML - bravo Tomo! :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Tomislav Poljak [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wednesday, March 5, 2008 1:11:39 PM
Subject: Re: merging
I hate to do this, but here it goes:
Please give volunteers at least 2-3 days to answer your question before
reminding - it doesn't look nice.
Either my mail reader is lying or you sent your reminder email only 30 minutes
after your original email.
Words like "please" and "thank you" also help. :)
Hello Svein,
Quick answers to your questions:
- Nutch does not include an image crawler, though some people have started
working on that a long time ago, and Archive.org is sponsoring this
work/project.
- Nutch has a distributed fetcher. Not sure about Heritrix.
- Nutch is being worked on,
Regarding the Tika error message, I've seen that, too. If you need
motivation, Chris. :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Chris Mattmann [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Saturday, April 5, 2008
. I was seeing about 15 pages/second. I didn't
get a chance to implement the other suggestions because I'll eat all
of the office's bandwidth and get yelled at :)
Maybe I'll make a Nutch Speed Improvements entry in the Wiki.
Cheers,
Bradford Stephens
On Sun, Apr 6, 2008 at 10:06 PM, Otis
Hi,
I noticed that during fetching map tasks get to 100% complete (in the GUI), but
are not marked as completed (also in the GUI), and are in fact really not
complete - the logs show there is fetching still going on (though almost
exclusively timeouts at the end of the fetch run, as expected),
Hi,
Hm, I have to say I'm not sure if I agree 100% with part 1. I think it would
be great to have such flexibility, but I wonder if trying to achieve it would
be over-engineering. Do people really need that? I don't know, maybe! If
they do, then ignore my comment. :)
I'm curious about 2.
: Dennis Kubes [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Sunday, April 13, 2008 5:44:32 PM
Subject: Re: Next Generation Nutch
Otis Gospodnetic wrote:
Hello,
A few quick comments. I don't know how much you track Solr, but the mention
of shards makes me think of SOLR-303
- Original Message
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Monday, April 14, 2008 1:01:37 PM
Subject: Re: Next Generation Nutch
Dennis Kubes wrote:
Otis Gospodnetic wrote:
I suppose the first thing to do would be describe the requirements
Thanks Dennis.
But, hm, I don't get it 100% yet. I looked at Generator.java and I see this:
if (numLists == -1) {                // for politeness make
  numLists = job.getNumMapTasks();   // a partition per fetch task
}
Thus, when -numFetchers is not given, the
You are right, the scripts are missing. I don't know why that is. I do see
them in bin in my local svn checkout of nutch/trunk though.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: nutchvf [EMAIL PROTECTED]
To:
Svein,
It sounds like this should be added to JIRA, though I wonder if this is just
the case of some bad/invalid Javascript that confuses the js parser. You'll
want to include the URL where this problem happens and its source. Probably
best to grab the source with something like curl or wget
Removed the plugin from the config :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Siddhartha Reddy [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Thursday, June 12, 2008 11:41:17 PM
Subject: Re: java.lang.StackOverflowError
I'm not sure -- I try to avoid running a single Nutch job at a time, as I
find overlapping them is more efficient.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Sean Dean [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Thursday, June
of agreeing with me. Running
multiple Nutch processes on a multi-core processor is more efficient than
running one single process on heavily scaled hardware.
Am I correct with this statement?
- Original Message
From: Otis Gospodnetic
To: nutch-user@lucene.apache.org
Sent: Friday
Hi,
You didn't mention URL injection, which makes me think you didn't inject any
seed URLs to crawl. I also suggest figuring out how to run Nutch normally,
from the command-line, before introducing additional variables and
complexities, such as running Nutch from an IDE.
Otis
--
Sematext --
crawl.Injector - Skipping
http://lucene.apache.org/:java.lang.NullPointerException
2008-06-13 22:29:35,101 WARN crawl.Injector - Skipping
http://shopping.yahoo.com/:java.lang.NullPointerException
HB
On Fri, Jun 13, 2008 at 10:55 PM, Otis Gospodnetic
wrote:
Hi,
You didn't
This seems to be a common request - sizing. I think the best you can do is use
existing search engines to estimate how many pages the sites you are interested in
have. You will have to know the exact sites (their URLs) and make use of the
site: search operator (Google, Yahoo). Yahoo also has
Yes, this is a pure CLASSPATH issue. I haven't built a Nutch war in a while,
so I don't recall what is in it, but most likely it has WEB-INF/lib directory
with some jar files. One of these... ah, let's just see. Here:
[EMAIL PROTECTED] trunk]$ unzip -l build/nutch-1.0-dev.war | grep jar |
Don't have the answer, but got a question. Does this happen only when
redirection to the external host are involved?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Drew Hite [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent:
Uhuh, yes, this is most likely due to session IDs that create unique URLs that
Nutch keeps processing.
Look at conf/regex-normalize.xml for how you can clean up URLs. That should
help.
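As a sketch of what such a clean-up rule looks like, entries in conf/regex-normalize.xml follow this shape (the session-ID pattern below is an illustration, not Nutch's shipped rule):

```xml
<!-- Illustrative regex-normalize.xml entry: strips a jsessionid-style
     session ID so otherwise-identical URLs collapse to one record. -->
<regex-normalize>
  <regex>
    <pattern>(?i)(;?\bjsessionid=[0-9a-zA-Z]+)</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>
```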
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Felix
Hi,
There is also a setting for the maximal number of bytes to fetch. If your main
index page is large, maybe it's just getting cut off because of that. The
property has "content" in the name, I believe, so look for that in
nutch-default.xml.
Otis
--
Sematext -- http://sematext.com/ -- Lucene
to miss lots of small, individual sites - I
wonder how Google, MSN, Yahoo do it - they must be getting lists
from ISPs, hosting providers, etc.?
Thanks
Jha,
On Mon, Jun 16, 2008 at 11:15 PM, Otis Gospodnetic
wrote:
This seems to be a common request - sizing. I think the best you
Hi,
Both of you should open some JIRA issues and upload your patches there as you
progress, so others can see the direction you are headed and make suggestions
when appropriate.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Marcus Herou
Don't count on the Admin UI. I believe it was only a prototype that was never
integrated in Nutch and probably never will be (until somebody contributes
something).
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Martin Xu [EMAIL
Hi,
Nutch is a Java application and consists of a number of Java classes that
perform different operations. If you are asking whether you can run these
classes from a C or C++ application -- I'm not sure, I never had to do that.
If you know how to call java classes from C/C++, have a look at
Don't know off the top of my head, but I'd guess no, because Nutch uses
Hadoop/HDFS. HDFS files are write-once, so I doubt you can just update a
single URL's data. But you could write a MapReduce job that goes over the
whole CrawlDb and modifies only the records you need modified. You'll
Hi Ann,
Regarding frames - this is not the problem here (with Nutch), as Nutch doesn't
even seem to be able to connect to your server. It never gets to see the HTML
and frames in it. Perhaps there is something useful in the logs not on the
Nutch side, but on that v4 server.
Otis
--
Just get the latest JDK from Sun. No need for yum, just download, install, set
JAVA_HOME, add JAVA_HOME/bin to PATH and you are set.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Winton Davies [EMAIL PROTECTED]
To:
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 19, 2008 10:54 PM
To: nutch-user@lucene.apache.org
Subject: Re: how does nutch connect to urls internally?
Hi Ann,
Regarding frames - this is not the problem here (with Nutch), as Nutch
doesn't even
Hi,
You can dump the whole CrawlDb and grep for your URL. Not fast, but it will
work. You could also just try looking in your logs first.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Viksit Gaur [EMAIL PROTECTED]
To:
Hi,
Is there an existing method for generating a segment/fetchlist containing only
URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large and old
CrawlDb that knows about a lot of URLs (the ones with db_unfetched status
if you run -stats) and in
Hi,
You really need to ask this question on the Lucene mailing list, as that's
where hit scoring comes from.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Alexander Aristov [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent:
It ain't Nutch, but you can look at the Elevate component in Solr to get some
ideas. There is a Wiki page for the component.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Edward Quick [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Heh, I'll point to Solr's SpellCheckComponent. :) It, too, has a good page on
the Wiki.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Edward Quick [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wednesday, September 24, 2008
Axel, how did this go? I'd love to know if you got to 1B.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Webmaster [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Tuesday, October 7, 2008 1:13:29 AM
Subject: Extensive web
Hi,
Just noticed Hadoop's new fair sharing job scheduler (
https://issues.apache.org/jira/browse/HADOOP-3746
). It seems to be in 0.19, which I think Nutch is not on yet... but still:
- is this something that would benefit Nutch?
The last time I used Nutch I remember having to be careful
PM, Otis Gospodnetic wrote:
By newsgroups do you mean Usenet newsgroups? If so, it might be a lot
simpler to use Solr, unless you want to build an NNTP crawler
I did do something like that over a decade ago. I used it to find people and
build a White Pages directory (this was big
work for and help with Nutch
generate/fetch/parse/etc. operations.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Nutch User List nutch-user@lucene.apache.org
Sent: Thursday, November 20, 2008 3
of webservice?
-John
On Nov 20, 2008, at 4:23 PM, Otis Gospodnetic wrote:
Yes, you'd have to write a mini newsgroup reader, mimic its behaviour, but
then once you grab a post you could send it directly to Solr for indexing.
No
need for intermediate DB, XML files, etc.
Otis
Hi Todd,
This sounds good. I think we've all seen the problem you are describing.
You can see something related at:
- https://issues.apache.org/jira/browse/NUTCH-629
- https://issues.apache.org/jira/browse/NUTCH-628
It would be great if you could incorporate any of the good ideas from the above
Hi,
It would be possible if you index tokens not as words, but as character
ngrams. You'd need a custom analyzer for that. Code for character-based
ngrams already exists in Lucene contrib, but you'd need to add it to Nutch.
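To make the idea concrete, here is a minimal standalone sketch of character n-gramming; Lucene contrib's NGramTokenizer does this inside the analysis chain, and the class and method names below are just for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: slide a window of size n over the term and emit each
// substring as its own indexable token.
public class NGramDemo {
    static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Indexing "nutch" as trigrams lets substring-style queries match.
        System.out.println(ngrams("nutch", 3)); // [nut, utc, tch]
    }
}
```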
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Hi,
Yes, if you want flowers to match flower you will want to apply stemming. You
can use the Snowball analyzer for English. I don't have any code handy, but
you can see how it's done if you look at Lucene's unit test for
SnowballAnalyzer.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr -
Hi,
Unfortunately, there are no Nutch books (nor are any Nutch books in the works
that I know of), and I think the documentation on the Nutch Wiki is the
best/only thing there is.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: opsec
You need to stem both at index time and at search time. Then flowers will be
stemmed to flower in both cases and flower at search time will match the
indexed term flower.
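A toy illustration of why both sides must agree; this naive suffix-stripper stands in for a real Snowball stemmer, which handles far more cases:

```java
// Toy stemmer, for illustration only; real code would run Lucene's
// Snowball-based analyzer at both index time and search time.
public class StemDemo {
    static String stem(String term) {
        // naive English plural stripping
        if (term.endsWith("s") && term.length() > 3) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        String indexedTerm = stem("flowers"); // what goes into the index
        String queryTerm = stem("flower");    // what the query becomes
        System.out.println(indexedTerm.equals(queryTerm)); // terms match
    }
}
```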
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: RanjithStar
Hi Doug,
Nutch is not really meant for this type of stuff. You'd be using a very very
massive hammer for a very small nail if you were to choose Nutch for this task.
:)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Doug Leeper
Check java-user archives on markmail.org and search for Toke and SSD to see
SSD benchmarks done by Toke a few months back.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Sean Dean seand...@rogers.com
To: nutch-user@lucene.apache.org
Tony,
You've sent about 10 emails about this already, both on the Nutch and on the
Solr list.
Please have a bit more patience and wait for Nutch 1.0 release. My guess is
this Nutch-Solr integration will be in Nutch 1.0.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr -
Vishal,
Re 2. - I don't think it's quite true. RAM is still much faster than SSDs.
Also, which version of Lucene are you using? Make sure you're using the latest
one if you care about performance.
Also, if you have extra RAM, you can make your .tii bigger/denser and speed up
searches that
Step one is to identify the exact jar where this class lives. Are you sure
it's in mail.jar? Maybe it's in activation.jar?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Antony Bowesman a...@teamware.com
To: nutch-user@lucene.apache.org
Nutch doesn't make use of sitemaps currently.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: consultas consul...@qualidade.eng.br
To: nutch-user@lucene.apache.org
Sent: Friday, February 27, 2009 12:34:30 PM
Subject: sitemaps
From a
You don't have enough free disk space, that's all.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Tony Wang ivyt...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Tuesday, March 3, 2009 10:58:41 PM
Subject: error when bootstrap DMOZ
Hello,
Comments inlined.
- Original Message
From: Dennis Kubes ku...@apache.org
To: nutch-user@lucene.apache.org
Sent: Friday, March 13, 2009 8:19:37 PM
With the release of Nutch 1.0 I think it is a good time to begin a discussion
about the future of Nutch. Here are some
Hi John,
It would be quite appropriate, actually.
You may want to put a link to it under the Resources section on the front page,
and maybe even on http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
Otis (Nutch committer)
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
See https://issues.apache.org/jira/browse/NUTCH-570 for something relevant.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Raymond Balmès raymond.bal...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Wednesday, May 27, 2009 9:43:02 AM
Ray,
I don't think fetchlist generation sticks URLs from the same domain or host
together. But URLs for the same host do end up in the same queue. This is by
design and it is a good thing -- this is how Nutch can ensure not to hit the
same host with more simultaneous threads than it should
have many URLs per host of course. Need to get all the pages of the
sites, don't understand the question.
-Raymond
2009/5/26 Otis Gospodnetic
But how, Ray, if you have only 1 URL per host?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original
on this?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Otis Gospodnetic ogjunk-nu...@yahoo.com
To: nutch-user@lucene.apache.org
Sent: Wednesday, May 27, 2009 11:38:48 PM
Subject: Re: threads get stuck in spinwaiting
Ray,
I don't think
Unfortunately Lucene doesn't allow that. You have to reindex the whole doc.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Vijay vijay.stanf...@gmail.com
To: nutch-user@lucene.apache.org; java-u...@lucene.apache.org
Sent: Monday, June
Hello,
It really depends on the version of Lucene used in your Nutch instance and
whether the Lucene.NET version you are using is compatible at the index format level.
As for segments dir vs. file, this is just a case of unfortunate naming.
Segments in Lucene means a completely different thing than
Paul,
There was talk of this in the past, at least between some other people here and
me, possibly off-line. Your best bet may be going to what's left of Wikia
Search and getting their old index. But, you see, this is exactly the problem
- the index may be quite outdated by now.
Otis
--
db which had over 1 billion URLs at last count.
So it might be a good starting point for crawling the web. At last count
though it was 250G in size, so not downloadable unless you have a fast
connection. It is available for anyone that wants it though.
Dennis
Otis Gospodnetic wrote
Neeti,
I don't think there is a way to know when a regular web site has been updated.
You can issue GET or HEAD requests and look at the Last-Modified date, but this
is not 100% reliable. You can fetch and compare content, but that's not 100%
reliable either. If you are indexing blogs,
Johan,
Yes, you can fetch and fetch and fetch and only fetch with Nutch and have the
data saved in HDFS (Nutch uses something called Hadoop and that includes HDFS,
a distributed FS that sits on top of regular FS/disk). You can then read the
data from there and index it however you want,
I remember seeing those in the logs, but it's been a while.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: caezar caeza...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Friday, June 26, 2009 3:50:39 AM
Subject: Re: Nutch fetch
Depends on hardware, of course!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Polsnet pols...@163.com
To: nutch-user@lucene.apache.org
Sent: Friday, July 3, 2009 12:03:30 AM
Subject: Nutch 1.0 on the limits of the data
Nutch 1.0
Hi,
See this: http://markmail.org/message/znbu5khl7qbkvhkm
(I didn't double-check CHANGES.txt to see if this made it into 1.0)
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
From:
Hi,
robots.txt is periodically rechecked and the previously denied URL should be
retried when the time to refetch it comes. If robots.txt rules no longer deny
access to it, it should be fetched.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta,
Nutch uses Lucene (Java), not CLucene (C++).
Why are you looking to rewrite Nutch in C++ anyway? Sounds scary.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
From: alx...@aim.com
Mario,
I think text is the only output format.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
From: schroedi schroedi2...@gmail.com
To: nutch-user@lucene.apache.org
Sent:
I don't know of an elegant way, but if you want to hack Nutch sources, you
could set its refetch time to some point in time veeery far in the future,
for example. Or introduce additional status.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch,
Hello,
Lucene sounds like the way to go here. What's more, if you have a copy of
Lucene in Action (1st edition), I wrote a small and simple framework for
file-system indexing. You could define your own parser for your own custom
file format and the indexer will use it. I think it's in
useful, please let me know.
thanks.
Alex.
-Original Message-
From: Otis Gospodnetic
To: nutch-user@lucene.apache.org
Sent: Sun, Aug 2, 2009 8:15 pm
Subject: Re: Nutch in C++
Nutch uses Lucene (Java), not CLucene (C++).
Why are you looking to rewrite Nutch
-Original Message-
From: Otis Gospodnetic [mailto:ogjunk-nu...@yahoo.com]
Sent: 04 August 2009 04:49
To: nutch-user@lucene.apache.org
Subject: Re: Nutch in C++
CLucene is just like Lucene (except a few versions behind), but written in
C++.
Yes, you could rewrite Nutch in C
@lucene.apache.org
Sent: Tuesday, August 4, 2009 12:36:19 PM
Subject: Re: Nutch in C++
Thanks for your comments. Is there anything I could code in C++ that the open
source community could benefit from?
Alex.
-Original Message-
From: Otis Gospodnetic
To: nutch-user@lucene.apache.org
I don't have a fix, but I have a suggestion - have you tried using the very
latest version of PDFBox? I believe it's going through Apache Incubator...
aha, here: http://incubator.apache.org/pdfbox/
Too bad the page doesn't say *when* the release was made, so one can get a
sense of the state
Kenan,
Have you considered using Carrot2? I think Nutch includes a plugin for it
already. Or, if your categories are predefined, you could index with Solr (if
you were to use Nutch 1.0) and use Solr's faceting capabilities.
Otis
--
Sematext is hiring --
Solr is just a search and indexing server. It doesn't do crawling. Nutch does
the crawling and page parsing, and can index into Lucene or into a Solr server.
Nutch is a biggish beast, and if you just need to index a site or even a small
set of them, you may have an easier time with Droids.
I don't recall off the top of my head what that jobtracker.jsp shows, but
judging by the name, it shows your job. Each job is composed of multiple map and
reduce tasks. Drill into your job and you should see multiple tasks running.
Otis
--
Sematext is hiring --
Droids is much simpler if all you want to do is do a little bit of crawling.
Nutch is built to scale to many millions of web pages.
If you need to crawl just a few sites, I'd suggest Droids.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta,
I think in the end what Ken Krugler did with Bixo (limiting crawl time) and
what Julien added in https://issues.apache.org/jira/browse/NUTCH-770 (plus
https://issues.apache.org/jira/browse/NUTCH-769) are solutions to this problem,
in addition to what Andrzej described below.
Can you try
Hello,
For those living in or near NYC, you may be interested in joining (and/or
presenting?) at the NYC Search Discovery Meetup.
Topics are: search, machine learning, data mining, NLP, information gathering,
information extraction, etc.
http://www.meetup.com/NYC-Search-and-Discovery/
Our
Sounds like Nutch for crawling to gather the data, custom tools to read the
gathered data, call the KV store, construct SolrInputDocuments, and index those
to Solr. Whether you want Solr and not Lucene is a bigger question that I
can't answer without knowing the details.
Otis
--
Sematext
Claudio,
If you think synonyms will do, perhaps you should look at Solr, which includes
support for query-time and/or index-time synonym expansion.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
From: Claudio Martella claudio.marte...@tis.bz.it
Hello,
If Search Engine Integration, Deployment and Scaling in the Cloud sounds
interesting to you, and you are going to be in or near New York next Wednesday
(Jan 20) evening:
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/
Sorry for dupes to those of you subscribed to
Use Droids to crawl. It already has hooks to index crawled content with Solr,
e.g.
http://search-lucene.com/c?id=Droids:/droids-solr/src/main/java/org/apache/droids/solr/SolrHandler.java||solr
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search ::