Hello folks,
Those of you in or near NYC and using Lucene or Solr should come to "Lucandra -
a Cassandra-based backend for Lucene and Solr" on April 26th:
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12979971/
The presenter will be Lucandra's author, Jake Luciani.
Please spread the word.
Use Droids to crawl. It already has hooks to index crawled content with Solr,
e.g.
http://search-lucene.com/c?id=Droids:/droids-solr/src/main/java/org/apache/droids/solr/SolrHandler.java||solr
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://
Hello,
If "Search Engine Integration, Deployment and Scaling in the Cloud" sounds
interesting to you, and you are going to be in or near New York next Wednesday
(Jan 20) evening:
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/
Sorry for dupes to those of you subscribed to multiple lists.
Claudio,
If you think synonyms will do, perhaps you should look at Solr, which includes
support for query-time and/or index-time synonym expansion.
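For a rough idea, a query-time synonym setup in Solr's schema.xml looks something like this (the field type name is made up, and synonyms.txt lives in Solr's conf directory):

```xml
<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand="true" replaces each query term with all of its synonyms -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```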
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Claudio Martella
> To: nutch-user@lucene.a
Sounds like Nutch for crawling to gather the data, plus custom tools to read the
gathered data, call the KV store, construct SolrInputDocuments, and index those
to Solr. Whether you want Solr rather than plain Lucene is a bigger question
that I can't answer without knowing the details.
Otis
--
Sematext --
Hello,
For those living in or near NYC, you may be interested in joining (and/or
presenting?) at the NYC Search & Discovery Meetup.
Topics are: search, machine learning, data mining, NLP, information gathering,
information extraction, etc.
http://www.meetup.com/NYC-Search-and-Discovery/
Our
I think in the end what Ken Krugler did with Bixo (limiting crawl time) and
what Julien added in https://issues.apache.org/jira/browse/NUTCH-770 (plus
https://issues.apache.org/jira/browse/NUTCH-769) are solutions to this problem,
in addition to what Andrzej described below.
Can you try https:/
Droids is much simpler if all you want to do is do a little bit of crawling.
Nutch is built to scale to many millions of web pages.
If you need to crawl just a few sites, I'd suggest Droids.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop,
I don't recall off the top of my head what that jobtracker.jsp shows, but
judging by name, it shows your job. Each job is composed of multiple map and
reduce tasks. Drill into your job and you should see multiple tasks running.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?
Solr is just a search and indexing server. It doesn't do crawling. Nutch does
the crawling and page parsing, and can index into Lucene or into a Solr server.
Nutch is a biggish beast, and if you just need to index a site or even a small
set of them, you may have an easier time with Droids.
O
Kenan,
Have you considered using Carrot2? I think Nutch includes a plugin for it
already. Or, if your categories are predefined, you could index with Solr (if
you were to use Nutch 1.0) and use Solr's faceting capabilities.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
I don't have a fix, but I have a suggestion - have you tried using the very
latest version of PDFBox? I believe it's going through Apache Incubator...
aha, here: http://incubator.apache.org/pdfbox/
Too bad the page doesn't say *when* the release was made, so one can get a
sense of the state of
pache.org
> Sent: Tuesday, August 4, 2009 12:36:19 PM
> Subject: Re: Nutch in C++
>
>
> Thanks for your comments. Is there anything that I code in C++ that open
> source
> community could benefit?
>
> Alex.
>
> --
e problem (and you may not see much if
> any).
>
> So if you have a few months to spare
>
>
> Iain
>
> -Original Message-
> From: Otis Gospodnetic [mailto:ogjunk-nu...@yahoo.com]
> Sent: 04 August 2009 04:49
> To: nutch-user@lucene.apache.org
> Subject:
e? contribution to open
> source.
> If you know other projects that may be more useful, please let me know.
>
> thanks.
> Alex.
>
>
> -Original Message-
> From: Otis Gospodnetic
> To: nutch-user@lucene.apache.org
> Sent: Sun, Aug 2, 2009 8:15 pm
> Su
Hello,
Lucene sounds like the way to go here. What's more, if you have a copy of
Lucene in Action (1st edition), I wrote a small and simple framework for
file-system indexing. You could define your own parser for your own custom
file format and the indexer will use it. I think it's in Chapte
I don't know of an elegant way, but if you want to hack Nutch sources, you
could set its refetch time to some point in time very far in the future,
for example. Or introduce additional status.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta,
Mario,
I think text is the only output format.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
> From: schroedi
> To: nutch-user@lucene.apache.org
> Sent: Thursday, July 30, 2009 1
Nutch uses Lucene (Java), not CLucene (C++).
Why are you looking to rewrite Nutch in C++ anyway? Sounds scary.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
> From: "alx...@aim.co
Hi,
robots.txt is periodically rechecked and the previously denied URL should be
retried when the time to refetch it comes. If robots.txt rules no longer deny
access to it, it should be fetched.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, H
Hi,
See this: http://markmail.org/message/znbu5khl7qbkvhkm
(I didn't double-check CHANGES.txt to see if this made it into 1.0)
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
> From:
Depends on hardware, of course!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Polsnet
> To: nutch-user@lucene.apache.org
> Sent: Friday, July 3, 2009 12:03:30 AM
> Subject: Nutch 1.0 on the limits of the data
>
>
> Nutch 1.0 largest n
I remember seeing those in the logs, but it's been a while.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: caezar
> To: nutch-user@lucene.apache.org
> Sent: Friday, June 26, 2009 3:50:39 AM
> Subject: Re: Nutch fetch performance
>
>
>
Johan,
Yes, you can fetch and fetch and fetch and only fetch with Nutch and have the
data saved in HDFS (Nutch uses something called Hadoop and that includes HDFS,
a distributed FS that sits on top of regular FS/disk). You can then read the
data from there and index it however you want, using
Neeti,
I don't think there is a way to know when a regular web site has been updated.
You can issue GET or HEAD requests and look at the Last-Modified date, but this
is not 100% reliable. You can fetch and compare content, but that's not 100%
reliable either. If you are indexing blogs, then
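The Last-Modified heuristic above can be sketched as follows. This is an illustrative fragment, not Nutch code: it only shows the header comparison, with the HTTP HEAD call itself left out so the example stays self-contained.

```java
// Sketch: decide whether a page *may* have changed since the last fetch
// by comparing the server's Last-Modified header to the previous fetch
// time. As noted above, this is not 100% reliable -- servers can lie
// about the header or omit it entirely.
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class LastModifiedCheck {

    // HTTP dates use the RFC 1123 format, e.g. "Tue, 03 Jun 2008 11:05:30 GMT"
    static boolean maybeChangedSince(String lastModifiedHeader,
                                     ZonedDateTime previousFetch) {
        ZonedDateTime lastModified = ZonedDateTime.parse(
            lastModifiedHeader, DateTimeFormatter.RFC_1123_DATE_TIME);
        return lastModified.isAfter(previousFetch);
    }

    public static void main(String[] args) {
        System.out.println(maybeChangedSince(
            "Tue, 03 Jun 2008 11:05:30 GMT",
            ZonedDateTime.parse("2008-01-01T00:00:00Z")));
    }
}
```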
still the url crawl db which had over 1Billion urls at last count.
> So
> it might be a good starting point for crawling the web. At last count though
> it
> was 250G in size, so not downloadable unless you have a fast connection. It is
> available for anyone that wants it thou
Paul,
There was talk of this in the past, at least between some other people here and
me, possibly "off-line". Your best bet may be going to what's left of Wikia
Search and getting their old index. But, you see, this is exactly the problem
- the index may be quite outdated by now.
Otis
--
S
Hello,
It really depends on the version of Lucene used in your Nutch instance and
whether the Lucene.NET version you are using is compatible at the index format level.
As for segments dir vs. file, this is just a case of unfortunate naming.
"Segments" in Lucene means a completely different thing than
Unfortunately Lucene doesn't allow that. You have to reindex the whole doc.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Vijay
> To: nutch-user@lucene.apache.org; java-u...@lucene.apache.org
> Sent: Monday, June 1, 2009 6:32:23 PM
> S
d drops.
Can anyone produce a patch based on this?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Otis Gospodnetic
> To: nutch-user@lucene.apache.org
> Sent: Wednesday, May 27, 2009 11:38:48 PM
> Subject: Re: threads get stuck
lete.
>
> -Raymond-
> 2009/5/27 Raymond Balmès
>
> > I have many URLs per host of course. Need to get all the pages of the
> > sites, don't understand the question.
> >
> > -Raymond
> >
> > 2009/5/26 Otis Gospodnetic
> >
> >
>
Ray,
I don't think fetchlist generation sticks URLs from the same domain or host
together. But URLs for the same host do end up in the same queue. This is by
design and it is a good thing -- this is how Nutch can ensure not to hit the
same host with more simultaneous threads than it should (
See https://issues.apache.org/jira/browse/NUTCH-570 for something relevant.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Raymond Balmès
> To: nutch-user@lucene.apache.org
> Sent: Wednesday, May 27, 2009 9:43:02 AM
> Subject: Re: thread
Hi John,
It would be quite appropriate, actually.
You may want to put a link to it under the Resources section on the front page,
and maybe even on http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
Otis (Nutch committer) --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
But how, Ray, if you have only 1 URL per host?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Raymond Balmès
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, May 26, 2009 4:11:27 PM
> Subject: Re: threads get stuck in spinwaiting
>
>
John, nice!
You should add this to the Nutch Wiki!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: John Whelan
> To: nutch-user@lucene.apache.org
> Sent: Friday, April 17, 2009 10:44:22 PM
> Subject: Nutch-based Application for Windows
ture of Nutch
>
> I just wish there could be some clear documentation for Nutch/Solr
> integration publicly available. Or some developers are already working on
> this?
> - Tony
>
> On Mon, Mar 16, 2009 at 6:50 PM, Otis Gospodnetic wrote:
>
> >
> > Hello,
> &
Eric,
There are a couple of ways you can back up a Lucene index built by Solr:
1) have a look at the Solr replication scripts, specifically snapshooter. This
script creates a snapshot of an index. It's typically triggered by Solr after
its "commit" or "optimize" calls, when the index is "sta
Hello,
Comments inlined.
- Original Message
> From: Dennis Kubes
> To: nutch-user@lucene.apache.org
> Sent: Friday, March 13, 2009 8:19:37 PM
>
> With the release of Nutch 1.0 I think it is a good time to begin a discussion
> about the future of Nutch. Here are some things to cons
You don't have enough free disk space, that's all.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Tony Wang
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, March 3, 2009 10:58:41 PM
> Subject: error when bootstrap DMOZ databases
>
>
Nutch doesn't make use of sitemaps currently.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: consultas
> To: nutch-user@lucene.apache.org
> Sent: Friday, February 27, 2009 12:34:30 PM
> Subject: sitemaps
>
> From a response of a previou
Step one is to identify the exact jar where this class lives. Are you sure
it's in mail.jar? Maybe it's in activate.jar?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Antony Bowesman
> To: nutch-user@lucene.apache.org
> Sent: Friday, J
Vishal,
Re 2. - I don't think it's quite true. RAM is still much faster than SSDs.
Also, which version of Lucene are you using? Make sure you're using the latest
one if you care about performance.
Also, if you have extra RAM, you can make your .tii bigger/denser and speed up
searches that way.
Hi Matthias,
Several years ago when I did crawling/parsing/indexing of full-page content for
Simpy.com I used Nutch in exactly that manner.
For example (this is outdated code, but you'll get the idea):
System.out.println("Urls to fetch: " + _urls.size());
if (_urls.size() == 0)
Tony,
You've sent about 10 emails about this already, both on the Nutch and on the
Solr list.
Please have a bit more patience and wait for Nutch 1.0 release. My guess is
this Nutch-Solr integration will be in Nutch 1.0.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Check java-user archives on markmail.org and search for "Toke" and "SSD" to see
SSD benchmarks done by Toke a few months back.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Sean Dean
> To: nutch-user@lucene.apache.org
> Sent: Thursday, J
Hi Doug,
Nutch is not really meant for this type of stuff. You'd be using a very very
massive hammer for a very small nail if you were to choose Nutch for this task.
:)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Doug Leeper
> To: n
You need to stem both at index time and at search time. Then flowers will be
stemmed to flower in both cases and flower at search time will match the
indexed term flower.
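The point can be illustrated with a toy example. A real setup would use a proper Porter/Snowball analyzer, not this naive suffix rule; the sketch only shows why the same stemming must run on both sides:

```java
// Toy illustration: when the SAME stem function runs at index time and
// at search time, "flowers" on either side reduces to the same term.
// Real stemming uses an analyzer (Porter/Snowball), not this toy rule.
public class ToyStem {

    // naive stand-in for a stemmer: strip a single trailing "s"
    static String toyStem(String term) {
        return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
    }

    public static void main(String[] args) {
        // index time:  "flowers" -> "flower" goes into the index
        // search time: "flowers" -> "flower" matches the indexed term
        System.out.println(toyStem("flowers"));
    }
}
```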
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: RanjithStar
> To:
Hi,
Unfortunately, there are no Nutch books (nor are any Nutch books in the works
that I know of), and I think the documentation on the Nutch Wiki is the
best/only thing there is.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: opsec
>
Hi,
Yes, if you want flowers to match flower you will want to apply stemming. You
can use the Snowball for English. I don't have any code handy, but you can see
how it's done if you look at Lucene's unit test for Snowball Analyzer.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr -
Hi,
It would be possible if you index tokens not as "words", but as "character
ngrams". You'd need a custom analyzer for that. Code for character-based
ngrams already exists in Lucene contrib, but you'd need to add it to Nutch.
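To show what character n-grams are, here is a small standalone sketch; the class and method names are illustrative, not the Lucene contrib API:

```java
// Sketch of character n-gram tokenization: slide a window of size n
// across the term and emit each substring. This is the kind of output
// Lucene's contrib n-gram code produces for a custom analyzer.
import java.util.ArrayList;
import java.util.List;

public class CharNgrams {

    static List<String> ngrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // "nutch" with n=3 -> [nut, utc, tch]
        System.out.println(ngrams("nutch", 3));
    }
}
```

Indexing these 3-grams instead of whole words is what makes substring-style matching possible.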
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutc
Hi Todd,
This sounds good. I think we've all seen the problem you are describing.
You can see something related at:
- https://issues.apache.org/jira/browse/NUTCH-629
- https://issues.apache.org/jira/browse/NUTCH-628
It would be great if you could incorporate any of the good ideas from the above
Allow me to add a related question:
Fetching is faster if you have more machines.
Is the same true for generate and update steps?
In other words, is it faster to generate a fetchlist on a 100-node cluster than
on a 10-node cluster (assuming the same crawldb, etc.)?
Thanks,
Otis
--
Sematext --
this be through a REST Interface
> or
> some sort of webservice?
>
> -John
>
> On Nov 20, 2008, at 4:23 PM, Otis Gospodnetic wrote:
>
> > Yes, you'd have to write a mini newsgroup reader, mimic its behaviour, but
> then once you grab a post you could send it
hink this would work for and help with Nutch
generate/fetch/parse/etc. operations.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
____
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: Nutch User List
Sent: Thursday, November 20, 2008 3:5
v 20, 2008, at 4:03 PM, Otis Gospodnetic wrote:
> By newsgroups do you mean Usenet newsgroups? If so, it might be a lot
> simpler to use Solr, unless you want to build an "NNTP crawler"
>
> I did do something like that over a decade ago. I used it to find people and
> bu
By newsgroups do you mean Usenet newsgroups? If so, it might be a lot simpler
to use Solr, unless you want to build an "NNTP crawler"
I did do something like that over a decade ago. I used it to find people and
build a White Pages directory (this was big in the 90s :) called POPULUS:
http://w
Hi,
Just noticed Hadoop's new fair sharing job scheduler (
https://issues.apache.org/jira/browse/HADOOP-3746
). It seems to be in 0.19, which I think Nutch is not on yet... but still:
- is this something that would benefit Nutch?
The last time I used Nutch I remember having to be careful abo
Axel, how did this go? I'd love to know if you got to 1B.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Webmaster <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, October 7, 2008 1:13:29 AM
> Subject: Extensive we
Heh, I'll point to Solr's SpellCheckComponent. :) It, too, has a good page on
the Wiki.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Edward Quick <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Wednesday, September 24, 2
It ain't Nutch, but you can look at Elevate component in Solr to get some
ideas. There is a Wiki page for the component.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Edward Quick <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
>
Hi,
You really need to ask this question on the Lucene mailing list, as that's
where hit scoring comes from.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Alexander Aristov <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: T
Hi,
This is defined in hadoop-default.xml. Copy the relevant property to a file
called hadoop-site.xml and change the directory to something suitable on your
system. If you think this would be good to document, please edit the relevant
page on the Wiki - anyone can do it, just create an account.
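For example, hadoop-site.xml would contain something like the following (I believe the property is hadoop.tmp.dir, but double-check the name against your hadoop-default.xml; the path is just an example):

```xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- pick a directory with plenty of free space on your system -->
    <value>/var/lib/hadoop/tmp</value>
  </property>
</configuration>
```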
Hi,
If there an existing method for generating a segment/fetchlist containing only
URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large and "old"
CrawlDb that "knows" about a lot of URLs (the ones with "db_unfetched" status
if you run -stats) a
Hi,
You can dump the whole CrawlDb and grep for your URL. Not fast, but it will
work. You could also just try looking in your logs first.
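The dump-and-grep approach looks roughly like this (paths and the URL are illustrative):

```shell
# dump the CrawlDb to plain text, then grep the dump for the URL in question
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -r 'http://example.com/some/page' crawldb-dump
```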
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Viksit Gaur <[EMAIL PROTECTED]>
> To: nutch-user@lu
: URL filter: true
> LinkDb: adding segment: crawl/segments/20080620184000
> LinkDb: adding segment: crawl/segments/20080620184010
> LinkDb: adding segment: crawl/segments/20080620184021
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/
Just get the latest JDK from Sun. No need for yum, just download, install, set
JAVA_HOME, add JAVA_HOME/bin to PATH and you are set.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Winton Davies <[EMAIL PROTECTED]>
> To: nutch-user@lucene.a
Hi Ann,
Regarding frames - this is not the problem here (with Nutch), as Nutch doesn't
even seem to be able to connect to your server. It never gets to see the HTML
and frames in it. Perhaps there is something useful in the logs not on the
Nutch side, but on that v4 server.
Otis
--
Sematext
Don't know off the top of my head, but I'd guess no, because Nutch uses
Hadoop/HDFS. HDFS files are write-once, so I doubt you can just update a
single URL's data. But you could write a MapReduce job that goes over the
whole CrawlDb and modifies only the records you need modified. You'll need
Hi,
Nutch is a Java application and consists of a number of Java classes that
perform different operations. If you are asking whether you can run these
classes from a C or C++ application -- I'm not sure, I never had to do that.
If you know how to call java classes from C/C++, have a look at
Don't count on the Admin UI. I believe it was only a prototype that was never
integrated in Nutch and probably never will be (until somebody contributes
something).
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Martin Xu <[EMAIL PROTECT
Hi,
Both of you should open some JIRA issues and upload your patches there as you
progress, so others can see the direction you are headed and make suggestions
when appropriate.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Marcus Herou
seed list from the known site
> list is that I am sure to miss lots of small, individual sites - I
> wonder how google, msn, yahoo does it - they must be getting list of
> from ISPs, hosting providers, etc?
>
> Thanks
> Jha,
>
>
>
>
> On Mon, Jun 16, 2008
Hi,
There is also a setting for the maximal number of bytes to fetch. If your main
index page is large, maybe it's just getting cut off because of that. The
property has "content" in the name, I believe, so look for that in
nutch-default.xml.
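The property I mean is, I believe, http.content.limit; to raise the cap, override it in nutch-site.xml along these lines (the value is just an example):

```xml
<property>
  <name>http.content.limit</name>
  <!-- max bytes downloaded per page; -1 disables truncation entirely -->
  <value>262144</value>
</property>
```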
Otis
--
Sematext -- http://sematext.com/ -- Lucen
Uhuh, yes, this is most likely due to session IDs that create unique URLs that
Nutch keeps processing.
Look at conf/regex-normalize.xml for how you can clean up URLs. That should
help.
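For instance, a rule in conf/regex-normalize.xml that strips jsessionid-style session IDs could look like this (the pattern here is illustrative; the stock file ships with a similar rule):

```xml
<regex>
  <!-- remove ";jsessionid=..." so each page maps to one canonical URL -->
  <pattern>(?i);jsessionid=[0-9a-z]+</pattern>
  <substitution></substitution>
</regex>
```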
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Felix
Don't have the answer, but got a question: does this happen only when
redirection to an external host is involved?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Drew Hite <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Mo
Yes, this is a pure CLASSPATH issue. I haven't built a Nutch war in a while,
so I don't recall what is in it, but most likely it has WEB-INF/lib directory
with some jar files. One of these ah, let's just see. Here:
[EMAIL PROTECTED] trunk]$ unzip -l build/nutch-1.0-dev.war | grep jar | g
This seems to be a common request - sizing. I think the best you can do is use
existing search engines to estimate how many pages the sites you are interested
in have. You will have to know the exact sites (their URLs) and make use of the
"site:" search operator (Google, Yahoo). Yahoo also has so
an't find rules
> for scope 'inject', using default
> 2008-06-13 22:29:35,101 WARN crawl.Injector - Skipping
> http://lucene.apache.org/:java.lang.NullPointerExcep
> tion
> 2008-06-13 22:29:35,101 WARN crawl.Injector - Skipping
> http://shopping.yahoo.com/:jav
Hi,
You didn't mention URL injection, which makes me think you didn't inject any
seed URLs to crawl. I also suggest figuring out how to run Nutch "normally",
"from the command-line", before introducing additional variables and
complexities, such as running Nutch from an IDE.
Otis
--
Sematext
le of contexts you're sort of agreeing with me. Running
> multiple nutch processes on a multi-core processor is more efficient than
> running one single process on heavily scaled hardware.
>
> Am i correct with this statement?
>
>
> - Original Message
> From: Otis
I'm not sure -- I try to avoid running a single Nutch job at a time, as I find
overlapping them is more efficient.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Sean Dean <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Thursday, Ju
Removed the plugin from the config :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Siddhartha Reddy <[EMAIL PROTECTED]>
> To: nutch-user@lucene.apache.org
> Sent: Thursday, June 12, 2008 11:41:17 PM
> Subject: Re: java.lang.StackOverflowEr
I don't think that's doable, as I *think* CrawlDb doesn't know which segment
the URL is in (or does it? Not looking at the code now, sorry).
But, knowing the segment you should be able to pull the web page data out.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Orig
Svein,
It sounds like this should be added to JIRA, though I wonder if this is just
the case of some bad/invalid Javascript that confuses the js parser. You'll
want to include the URL where this problem happens and its source. Probably
best to grab the source with something like curl or wget
You are right, the scripts are missing. I don't know why that is. I do see
them in bin in my local svn checkout of nutch/trunk though.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: nutchvf <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.
Thanks Dennis.
But, hm, I don't get it 100% yet. I looked at Generator.java and I see this:
if (numLists == -1) {               // for politeness make
  numLists = job.getNumMapTasks();  // a partition per fetch task
}
Thus, when -numFetchers is not given, the nu
Nutch
- Original Message
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Monday, April 14, 2008 1:01:37 PM
Subject: Re: Next Generation Nutch
Dennis Kubes wrote:
>
>
> Otis Gospodnetic wrote:
>> I suppose the first thing to do would be des
From: Dennis Kubes <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Sunday, April 13, 2008 5:44:32 PM
Subject: Re: Next Generation Nutch
Otis Gospodnetic wrote:
> Hello,
>
> A few quick comments. I don't know how much you track Solr, but the mention
> of shard
Hi,
Hm, I have to say I'm not sure if I agree 100% with part 1. I think it would
be great to have such flexibility, but I wonder if trying to achieve it would
be over-engineering. Do people really need that? I don't know, maybe! If
they do, then ignore my comment. :)
I'm curious about 2. -
Hello,
A few quick comments. I don't know how much you track Solr, but the mention of
shards makes me think of SOLR-303 and DistributedSearch page on Solr Wiki.
You'll want to check those out. In short, Solr has the notion of shards and
distributed search, kind of like Nutch with its RPC fra
Hi,
I noticed that during fetching map tasks get to 100% complete (in the GUI), but
are not marked as completed (also in the GUI), and are in fact really not
complete - the logs show there is fetching still going on (though almost
exclusively timeouts at the end of the fetch run, as expected),
increase the threads to 400 per
server, and 3 per host. I was seeing about 15 pages/second. I didn't
get a chance to implement the other suggestions because I'll eat all
of the office's bandwidth and get yelled at :)
Maybe I'll make a "Nutch Speed Improvements" entry in
I cannot tell for sure without looking at the code, but my guess is diacritics
are simply not being stripped anywhere. I imagine you could modify the
NutchAnalyzer to include that ISO...Filter, the same class that you must have
configured in your Solr schema.xml.
Otis
--
Sematext -- http://s
Regarding the Tika error message, I've seen that, too. If you need
motivation, Chris. :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Chris Mattmann <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Saturday, April 5, 2008 2:58:
Hello Svein,
Quick answers to your questions:
- Nutch does not include an image crawler, though some people have started
working on that a long time ago, and Archive.org is sponsoring this
work/project.
- Nutch has a distributed fetcher. Not sure about Heritrix.
- Nutch is being worked on, bu
I hate to do this, but here it goes:
Please give volunteers at least 2-3 days to answer your question before
reminding - it doesn't look nice.
Either my mail reader is lying or you sent your reminder email only 30 minutes
after your original email.
Words like please and thank you also help. :)
Aha, I see several answers on the Nutch ML - bravo Tomo! :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Tomislav Poljak <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Wednesday, March 5, 2008 1:11:39 PM
Subject: Re: merging index
Siva - you can't really just use the Lucene demo tool nor that luceneweb thing
and expect it to search your Nutch-created Lucene index. The two index
structures (their fields) are quite different. I don't want to self-promote,
but if you can, get a copy of Lucene in Action in order to get a be