Lucandra - Lucene/Solr on Cassandra: April 26, NYC

2010-04-22 Thread Otis Gospodnetic
Hello folks, Those of you in or near NYC and using Lucene or Solr should come to "Lucandra - a Cassandra-based backend for Lucene and Solr" on April 26th: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12979971/ The presenter will be Lucandra's author, Jake Luciani. Please spread the

Re: Using Nutch to crawl and use it as input to Solr

2010-01-28 Thread Otis Gospodnetic
Use Droids to crawl. It already has hooks to index crawled content with Solr, e.g. http://search-lucene.com/c?id=Droids:/droids-solr/src/main/java/org/apache/droids/solr/SolrHandler.java||solr Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://

NYC Search in the Cloud meetup: Jan 20

2010-01-12 Thread Otis Gospodnetic
Hello, If "Search Engine Integration, Deployment and Scaling in the Cloud" sounds interesting to you, and you are going to be in or near New York next Wednesday (Jan 20) evening: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/ Sorry for dupes to those of you subscribed to mul

Re: ontology implementation

2010-01-07 Thread Otis Gospodnetic
Claudio, If you think synonyms will do, perhaps you should look at Solr, which includes support for query-time and/or index-time synonym expansion. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Claudio Martella > To: nutch-user@lucene.a

Re: What is the best choice: nutch/lucene or nutch/solr?

2009-12-04 Thread Otis Gospodnetic
Sounds like Nutch for crawling to gather the data, custom tools to read the gathered data, call the KV store, construct SolrInputDocuments, and index those to Solr. Whether you want Solr or Lucene is a bigger question that I can't answer without knowing the details. Otis -- Sematext --
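
As a rough illustration of the last step only, here is a minimal SolrJ sketch (assuming the SolrJ client of that era and hypothetical field names; adapt to your schema):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexSketch {
      public static void main(String[] args) throws Exception {
        // Point at the Solr instance that will receive the documents.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Build one SolrInputDocument per record read from the KV store.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/page1");          // hypothetical field names
        doc.addField("title", "Example page");
        doc.addField("content", "Body text fetched by Nutch ...");

        server.add(doc);
        server.commit();   // make the new documents searchable
      }
    }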

NYC Search & Discovery Meetup

2009-12-01 Thread Otis Gospodnetic
Hello, For those living in or near NYC, you may be interested in joining (and/or presenting?) at the NYC Search & Discovery Meetup. Topics are: search, machine learning, data mining, NLP, information gathering, information extraction, etc. http://www.meetup.com/NYC-Search-and-Discovery/ Our

Re: 100 fetches per second?

2009-11-26 Thread Otis Gospodnetic
I think in the end what Ken Krugler did with Bixo (limiting crawl time) and what Julien added in https://issues.apache.org/jira/browse/NUTCH-770 (plus https://issues.apache.org/jira/browse/NUTCH-769) are solutions to this problem, in addition to what Andrzej described below. Can you try https:/

Re: crawling / data aggregation - is nutch the right tool?

2009-11-15 Thread Otis Gospodnetic
Droids is much simpler if all you want to do is do a little bit of crawling. Nutch is built to scale to many millions of web pages. If you need to crawl just a few sites, I'd suggest Droids. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop,

Re: How to configure nutch to crawl parallelly

2009-11-13 Thread Otis Gospodnetic
I don't recall off the top of my head what that jobtracker.jsp shows, but judging by name, it shows your job. Each job is composed of multiple map and reduce tasks. Drill into your job and you should see multiple tasks running. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?

Re: Nutch/Solr question

2009-11-11 Thread Otis Gospodnetic
Solr is just a search and indexing server. It doesn't do crawling. Nutch does the crawling and page parsing, and can index into Lucene or into a Solr server. Nutch is a biggish beast, and if you just need to index a site or even a small set of them, you may have an easier time with Droids. O

Re: Categorizing search results

2009-08-04 Thread Otis Gospodnetic
Kenan, Have you considered using Carrot2? I think Nutch includes a plugin for it already. Or, if your categories are predefined, you could index with Solr (if you were to use Nutch 1.0) and use Solr's faceting capabilities. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls

Re: PDFBox log file locks Fetcher

2009-08-04 Thread Otis Gospodnetic
I don't have a fix, but I have a suggestion - have you tried using the very latest version of PDFBox? I believe it's going through Apache Incubator... aha, here: http://incubator.apache.org/pdfbox/ Too bad the page doesn't say *when* the release was made, so one can get a sense of the state of

Re: Nutch in C++

2009-08-04 Thread Otis Gospodnetic
pache.org > Sent: Tuesday, August 4, 2009 12:36:19 PM > Subject: Re: Nutch in C++ > > > Thanks for your comments. Is there anything that I code in C++ that open > source > community could benefit? > > Alex. > > > > > > > > --

Re: Nutch in C++

2009-08-04 Thread Otis Gospodnetic
e problem (and you may not see much if > any). > > So if you have a few months to spare > > > Iain > > -Original Message- > From: Otis Gospodnetic [mailto:ogjunk-nu...@yahoo.com] > Sent: 04 August 2009 04:49 > To: nutch-user@lucene.apache.org > Subject:

Re: Nutch in C++

2009-08-03 Thread Otis Gospodnetic
e? contribution to open > source. > If you know other projects that may be more useful, please let me know. > > thanks. > Alex. > > > -Original Message- > From: Otis Gospodnetic > To: nutch-user@lucene.apache.org > Sent: Sun, Aug 2, 2009 8:15 pm > Su

Re: Using Nutch (w/custom plugin) to crawl vs. custom Lucene app

2009-08-02 Thread Otis Gospodnetic
Hello, Lucene sounds like the way to go here. What's more, if you have a copy of Lucene in Action (1st edition), I wrote a small and simple framework for file-system indexing. You could define your own parser for your own custom file format and the indexer will use it. I think it's in Chapte

Re: Meaning of ProtocolStatus.ACCESS_DENIED

2009-08-02 Thread Otis Gospodnetic
I don't know of an elegant way, but if you want to hack Nutch sources, you could set its refetch time to some point in time veeery far in the future, for example. Or introduce an additional status. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta,

Re: Dumping Crawl DB with XML

2009-08-02 Thread Otis Gospodnetic
Mario, I think text is the only output format. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message > From: schroedi > To: nutch-user@lucene.apache.org > Sent: Thursday, July 30, 2009 1

Re: Nutch in C++

2009-08-02 Thread Otis Gospodnetic
Nutch uses Lucene (Java), not CLucene (C++). Why are you looking to rewrite Nutch in C++ anyway? Sounds scary. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message > From: "alx...@aim.co

Re: denied by robots.txt rules

2009-08-02 Thread Otis Gospodnetic
Hi, robots.txt is periodically rechecked and the previously denied URL should be retried when the time to refetch it comes. If robots.txt rules no longer deny access to it, it should be fetched. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, H

Re: Specific fetch list based on url status or score

2009-08-02 Thread Otis Gospodnetic
Hi, See this: http://markmail.org/message/znbu5khl7qbkvhkm (I didn't double-check CHANGES.txt to see if this made it into 1.0) Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message > From:

Re: Nutch 1.0 on the limits of the data

2009-07-03 Thread Otis Gospodnetic
Depends on hardware, of course! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Polsnet > To: nutch-user@lucene.apache.org > Sent: Friday, July 3, 2009 12:03:30 AM > Subject: Nutch 1.0 on the limits of the data > > > Nutch 1.0 largest n

Re: Nutch fetch performance

2009-06-26 Thread Otis Gospodnetic
I remember seeing those in the logs, but it's been a while. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: caezar > To: nutch-user@lucene.apache.org > Sent: Friday, June 26, 2009 3:50:39 AM > Subject: Re: Nutch fetch performance > > >

Re: Using nutch only as a webcrawler?

2009-06-26 Thread Otis Gospodnetic
Johan, Yes, you can fetch and fetch and fetch and only fetch with Nutch and have the data saved in HDFS (Nutch uses something called Hadoop and that includes HDFS, a distributed FS that sits on top of regular FS/disk). You can then read the data from there and index it however you want, using
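
For a sense of what "read the data from there" might look like, here is a sketch that walks a fetched segment with Hadoop's SequenceFile reader (the segment path/layout shown is an assumption based on Nutch 1.x crawl directories; adjust to yours):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SegmentReaderSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Assumed layout of a fetched segment; adjust the path to your crawl dir.
        Path data = new Path("crawl/segments/20090626000000/content/part-00000/data");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
          // key is the URL; value is the fetched content record, ready for your own indexer.
          System.out.println(key + " -> " + value.getClass().getName());
        }
        reader.close();
      }
    }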

Re: recrawling

2009-06-24 Thread Otis Gospodnetic
Neeti, I don't think there is a way to know when a regular web site has been updated. You can issue GET or HEAD requests and look at the Last-Modified date, but this is not 100% reliable. You can fetch and compare content, but that's not 100% reliable either. If you are indexing blogs, then
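
A quick sketch of the HEAD-request check mentioned above, using plain java.net (the URL is just an example, and remember the Last-Modified header is often missing or unreliable):

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Date;

    public class LastModifiedCheck {
      public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/");               // example URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");                          // headers only, no body

        long lastModified = conn.getLastModified();             // 0 if the header is absent
        if (lastModified == 0) {
          System.out.println("No Last-Modified header; cannot tell if the page changed.");
        } else {
          System.out.println("Last-Modified: " + new Date(lastModified));
        }
        conn.disconnect();
      }
    }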

Re: adding pre-indexed DB's together

2009-06-22 Thread Otis Gospodnetic
still the url crawl db which had over 1Billion urls at last count. > So > it might be a good starting point for crawling the web. At last count though > it > was 250G in size so no downloadable unless you have a fast connection. It is > available for anyone that wants it thou

Re: adding pre-indexed DB's together

2009-06-22 Thread Otis Gospodnetic
Paul, There was talk of this in the past, at least between some other people here and me, possibly "off-line". Your best bet may be going to what's left of Wikia Search and getting their old index. But, you see, this is exactly the problem - the index may be quite outdated by now. Otis -- S

Re: Reading Nutch indexes w/ Lucene.NET

2009-06-10 Thread Otis Gospodnetic
Hello, It really depends on the version of Lucene used in your Nutch instance and whether the Lucene.NET version you are using is compatible at the index format level. As for segments dir vs. file, this is just a case of unfortunate naming. "Segments" in Lucene means a completely different thing than

Re: Question on Efficient field updates in the Lucene index in Nutch

2009-06-02 Thread Otis Gospodnetic
Unfortunately Lucene doesn't allow that. You have to reindex the whole doc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Vijay > To: nutch-user@lucene.apache.org; java-u...@lucene.apache.org > Sent: Monday, June 1, 2009 6:32:23 PM > S
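
In practice "reindex the whole doc" means delete-and-re-add, which IndexWriter.updateDocument wraps up; a minimal sketch with Lucene 2.x-era APIs and hypothetical field names:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class ReindexDocSketch {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("index"),
            new StandardAnalyzer(), false, IndexWriter.MaxFieldLength.UNLIMITED);

        // Rebuild the complete document, including the fields that did not change.
        Document doc = new Document();
        doc.add(new Field("url", "http://example.com/page1",
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", "new page text ...",
            Field.Store.NO, Field.Index.ANALYZED));

        // Deletes any existing doc with the same url term and adds the new one.
        writer.updateDocument(new Term("url", "http://example.com/page1"), doc);
        writer.close();
      }
    }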

Re: threads get stuck in spinwaiting

2009-05-27 Thread Otis Gospodnetic
d drops. Can anyone produce a patch based on this? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Otis Gospodnetic > To: nutch-user@lucene.apache.org > Sent: Wednesday, May 27, 2009 11:38:48 PM > Subject: Re: threads get stuck

Re: threads get stuck in spinwaiting

2009-05-27 Thread Otis Gospodnetic
lete. > > -Raymond- > 2009/5/27 Raymond Balmès > > > I have many URLs per host of course. Need to get all the pages of the > > sites, don't understand the question. > > > > -Raymond > > > > 2009/5/26 Otis Gospodnetic > > > > &g

Re: threads get stuck in spinwaiting

2009-05-27 Thread Otis Gospodnetic
Ray, I don't think fetchlist generation sticks URLs from the same domain or host together. But URLs for the same host do end up in the same queue. This is by design and it is a good thing -- this is how Nutch can ensure not to hit the same host with more simultaneous threads than it should (
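
Conceptually, the per-host queueing described here looks something like the following toy sketch (plain Java, not Nutch's actual classes):

    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    public class HostQueueSketch {
      public static void main(String[] args) throws Exception {
        String[] fetchlist = {
            "http://a.example.com/1", "http://a.example.com/2", "http://b.example.com/1"
        };

        // All URLs for the same host land in the same queue, so the fetcher can
        // enforce a crawl delay and a thread limit per host rather than per URL.
        Map<String, Queue<String>> queues = new HashMap<String, Queue<String>>();
        for (String u : fetchlist) {
          String host = new URL(u).getHost();
          Queue<String> q = queues.get(host);
          if (q == null) {
            q = new ArrayDeque<String>();
            queues.put(host, q);
          }
          q.add(u);
        }
        System.out.println(queues);
      }
    }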

Re: threads get stuck in spinwaiting

2009-05-27 Thread Otis Gospodnetic
See https://issues.apache.org/jira/browse/NUTCH-570 for something relevant. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Raymond Balmès > To: nutch-user@lucene.apache.org > Sent: Wednesday, May 27, 2009 9:43:02 AM > Subject: Re: thread

Re: Nutch-based Application for Windows

2009-05-26 Thread Otis Gospodnetic
Hi John, It would be quite appropriate, actually. You may want to put a link to it under the Resources section on the front page, and maybe even on http://wiki.apache.org/nutch/GettingNutchRunningWithWindows Otis (Nutch committer) -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: threads get stuck in spinwaiting

2009-05-26 Thread Otis Gospodnetic
But how, Ray, if you have only 1 URL per host? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Raymond Balmès > To: nutch-user@lucene.apache.org > Sent: Tuesday, May 26, 2009 4:11:27 PM > Subject: Re: threads get stuck in spinwaiting > >

Re: Nutch-based Application for Windows

2009-05-23 Thread Otis Gospodnetic
John, nice! You should add this to the Nutch Wiki! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: John Whelan > To: nutch-user@lucene.apache.org > Sent: Friday, April 17, 2009 10:44:22 PM > Subject: Nutch-based Application for Windows

Re: The Future of Nutch

2009-03-16 Thread Otis Gospodnetic
ture of Nutch > > I just wish there could be some clear documentation for Nutch/Solr > integration publicly available. Or some developers are already working on > this? > - Tony > > On Mon, Mar 16, 2009 at 6:50 PM, Otis Gospodnetic wrote: > > > > > Hello, > &

Re: Index Disaster Recovery

2009-03-16 Thread Otis Gospodnetic
Eric, There are a couple of ways you can back up a Lucene index built by Solr: 1) have a look at the Solr replication scripts, specifically snapshooter. This script creates a snapshot of an index. It's typically triggered by Solr after its "commit" or "optimize" calls, when the index is "sta

Re: The Future of Nutch

2009-03-16 Thread Otis Gospodnetic
Hello, Comments inlined. - Original Message > From: Dennis Kubes > To: nutch-user@lucene.apache.org > Sent: Friday, March 13, 2009 8:19:37 PM > > With the release of Nutch 1.0 I think it is a good time to begin a discussion > about the future of Nutch. Here are some things to cons

Re: error when bootstrap DMOZ databases

2009-03-03 Thread Otis Gospodnetic
You don't have enough free disk space, that's all. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Tony Wang > To: nutch-user@lucene.apache.org > Sent: Tuesday, March 3, 2009 10:58:41 PM > Subject: error when bootstrap DMOZ databases > >

Re: sitemaps

2009-02-27 Thread Otis Gospodnetic
Nutch doesn't make use of sitemaps currently. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: consultas > To: nutch-user@lucene.apache.org > Sent: Friday, February 27, 2009 12:34:30 PM > Subject: sitemaps > > From a response of a previou

Re: Adding new plugin and classloading issues

2009-01-23 Thread Otis Gospodnetic
Step one is to identify the exact jar where this class lives. Are you sure it's in mail.jar? Maybe it's in activation.jar? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Antony Bowesman > To: nutch-user@lucene.apache.org > Sent: Friday, J
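
One quick way to check which jar a class is actually loaded from is a standard JDK call (the class name below is just an example):

    public class WhichJar {
      public static void main(String[] args) throws Exception {
        // Prints the jar (or directory) the class was loaded from.
        Class<?> c = Class.forName("javax.mail.Session");   // example class to locate
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
      }
    }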

Re: Search performance for large indexes (>100M docs)

2009-01-13 Thread Otis Gospodnetic
Vishal, Re 2. - I don't think it's quite true. RAM is still much faster than SSDs. Also, which version of Lucene are you using? Make sure you're using the latest one if you care about performance. Also, if you have extra RAM, you can make your .tii bigger/denser and speed up searches that wa

Re: nutch crawling with java (not shellscript)

2009-01-13 Thread Otis Gospodnetic
Hi Matthias, Several years ago when I did crawling/parsing/indexing of full-page content for Simpy.com I used Nutch in exactly that manner. For example (this is outdated code, but you'll get the idea): System.out.println("Urls to fetch: " + _urls.size()); if (_urls.size() == 0)
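
The general idea today would be driving the Nutch tools from Java much as the bin/nutch script does; a rough sketch, assuming the Nutch 1.0 tools (Injector, Generator, Fetcher, ...) implement Hadoop's Tool interface, which you should verify against your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.crawl.Injector;
    import org.apache.nutch.util.NutchConfiguration;

    public class NutchFromJavaSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();

        // Same arguments the bin/nutch script would pass: <crawldb> <url_dir>
        int res = ToolRunner.run(conf, new Injector(), new String[] { "crawl/crawldb", "urls" });
        // Generator, Fetcher, ParseSegment, etc. can be run the same way in sequence.
        System.exit(res);
      }
    }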

Re: [jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2009-01-09 Thread Otis Gospodnetic
Tony, You've sent about 10 emails about this already, both on the Nutch and on the Solr list. Please have a bit more patience and wait for Nutch 1.0 release. My guess is this Nutch-Solr integration will be in Nutch 1.0. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: Search performance for large indexes (>100M docs)

2009-01-09 Thread Otis Gospodnetic
Check java-user archives on markmail.org and search for "Toke" and "SSD" to see SSD benchmarks done by Toke a few months back. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Sean Dean > To: nutch-user@lucene.apache.org > Sent: Thursday, J

Re: Spider a single url and get tokenzied keyword/phrases

2008-12-23 Thread Otis Gospodnetic
Hi Doug, Nutch is not really meant for this type of stuff. You'd be using a very very massive hammer for a very small nail if you were to choose Nutch for this task. :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Doug Leeper > To: n

Re: Stemming issues

2008-12-17 Thread Otis Gospodnetic
You need to stem both at index time and at search time. Then flowers will be stemmed to flower in both cases and flower at search time will match the indexed term flower. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: RanjithStar > To:

Re: Any "unofficial" howtos for a new Nutch user

2008-12-17 Thread Otis Gospodnetic
Hi, Unfortunately, there are no Nutch books (nor are any Nutch books in the works that I know of), and I think the documentation on the Nutch Wiki is the best/only thing there is. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: opsec >

Re: Stemming issues

2008-12-16 Thread Otis Gospodnetic
Hi, Yes, if you want flowers to match flower you will want to apply stemming. You can use the Snowball analyzer for English. I don't have any code handy, but you can see how it's done if you look at Lucene's unit test for Snowball Analyzer. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr -
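
A bare-bones illustration of the "stem at index time and at search time" point, using Lucene 2.4-era APIs and the contrib SnowballAnalyzer (a sketch, not the unit test referred to above):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class StemmingSketch {
      public static void main(String[] args) throws Exception {
        // The same stemming analyzer must be used at index time and at query time.
        Analyzer analyzer = new SnowballAnalyzer("English");
        Directory dir = new RAMDirectory();

        IndexWriter writer = new IndexWriter(dir, analyzer, true,
            IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", "a field of flowers",
            Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // "flower" stems to the same term as the indexed "flowers", so this matches.
        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser("content", analyzer);
        TopDocs hits = searcher.search(parser.parse("flower"), 10);
        System.out.println("hits: " + hits.totalHits);
        searcher.close();
      }
    }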

Re: Partial word searches?

2008-12-15 Thread Otis Gospodnetic
Hi, It would be possible if you index tokens not as "words", but as "character ngrams". You'd need a custom analyzer for that. Code for character-based ngrams already exists in Lucene contrib, but you'd need to add it to Nutch. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutc
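
Roughly, such a custom analyzer could look like this (a sketch using the contrib NGramTokenFilter from Lucene 2.x; the gram sizes are arbitrary):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;

    public class NGramAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Emit 3- to 5-character grams so a partial word like "flow"
        // can match the grams produced for "flowers".
        return new NGramTokenFilter(stream, 3, 5);
      }
    }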

Re: proposal: fetcher performance improvements

2008-12-10 Thread Otis Gospodnetic
Hi Todd, This sounds good. I think we've all seen the problem you are describing. You can see something related at: - https://issues.apache.org/jira/browse/NUTCH-629 - https://issues.apache.org/jira/browse/NUTCH-628 It would be great if you could incorporate any of the good ideas from the above

Re: Fetching vs. generate and updatedb time ratio

2008-12-10 Thread Otis Gospodnetic
Allow me to add a related question: Fetching is faster if you have more machines. Is the same true for generate and update steps? In other words, is it faster to generate a fetchlist on a 100-node cluster than on a 10-node cluster (assuming the same crawldb, etc.)? Thanks, Otis -- Sematext --

Re: Indexing News groups

2008-11-20 Thread Otis Gospodnetic
this be through a REST Interface > or > some sort of webservice? > > -John > > On Nov 20, 2008, at 4:23 PM, Otis Gospodnetic wrote: > > > Yes, you'd have to write a mini newsgroup reader, mimic its behaviour, but > then once you grab a post you could send it

Re: Hadoop's new fair sharing job scheduler

2008-11-20 Thread Otis Gospodnetic
hink this would work for and help with Nutch generate/fetch/parse/etc. operations. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ____ From: Otis Gospodnetic <[EMAIL PROTECTED]> To: Nutch User List Sent: Thursday, November 20, 2008 3:5

Re: Indexing News groups

2008-11-20 Thread Otis Gospodnetic
v 20, 2008, at 4:03 PM, Otis Gospodnetic wrote: > By newsgroups do you mean Usenet newsgroups? If so, it might be a lot > simpler to use Solr, unless you want to build an "NNTP crawler" > > I did do something like that over a decade ago. I used it to find people and > bu

Re: Indexing News groups

2008-11-20 Thread Otis Gospodnetic
By newsgroups do you mean Usenet newsgroups? If so, it might be a lot simpler to use Solr, unless you want to build an "NNTP crawler" I did do something like that over a decade ago. I used it to find people and build a White Pages directory (this was big in the 90s :) called POPULUS: http://w

Hadoop's new fair sharing job scheduler

2008-11-20 Thread Otis Gospodnetic
Hi, Just noticed Hadoop's new fair sharing job scheduler ( https://issues.apache.org/jira/browse/HADOOP-3746 ). It seems to be in 0.19, which I think Nutch is not on yet... but still: - is this something that would benefit Nutch? The last time I used Nutch I remember having to be careful abo

Re: Extensive web crawl

2008-10-20 Thread Otis Gospodnetic
Axel, how did this go? I'd love to know if you got to 1B. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Webmaster <[EMAIL PROTECTED]> > To: nutch-user@lucene.apache.org > Sent: Tuesday, October 7, 2008 1:13:29 AM > Subject: Extensive we

Re: did you mean?

2008-09-24 Thread Otis Gospodnetic
Heh, I'll point to Solr's SpellCheckComponent. :) It, too, has a good page on the Wiki. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Edward Quick <[EMAIL PROTECTED]> > To: nutch-user@lucene.apache.org > Sent: Wednesday, September 24, 2

Re: keyword match

2008-09-24 Thread Otis Gospodnetic
It ain't Nutch, but you can look at Elevate component in Solr to get some ideas. There is a Wiki page for the component. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Edward Quick <[EMAIL PROTECTED]> > To: nutch-user@lucene.apache.org >

Re: nutch and lucene scoring

2008-08-06 Thread Otis Gospodnetic
Hi, You really need to ask this question on the Lucene mailing list, as that's where hit scoring comes from. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Alexander Aristov <[EMAIL PROTECTED]> > To: nutch-user@lucene.apache.org > Sent: T

Re: how does nutch connect to urls internally?

2008-06-23 Thread Otis Gospodnetic
osoft.icon"> > > > > > > border="4" fr > ameborder="1" scrolling="no"> > > > marginheig > ht="0" scrolling="no" frameborder="1" resize=yes> > > > marginwidth=&q

Re: default hadoop goes to /

2008-06-22 Thread Otis Gospodnetic
Hi, This is defined in hadoop-default.xml. Copy the relevant property to a file called hadoop-site.xml and change the directory to something suitable on your system. If you think this would be good to document, please edit the relevant page on the Wiki - anyone can do it, just create an accou

Fetching only unfetched URLs

2008-06-22 Thread Otis Gospodnetic
Hi, Is there an existing method for generating a segment/fetchlist containing only URLs that have not yet been fetched? I'm asking because I can imagine a situation where one has a large and "old" CrawlDb that "knows" about a lot of URLs (the ones with "db_unfetched" status if you run -stats) a

Re: Querying linkdb for a URL with special characters

2008-06-22 Thread Otis Gospodnetic
Hi, You can dump the whole CrawlDb and grep for your URL. Not fast, but it will work. You could also just try looking in your logs first. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Viksit Gaur <[EMAIL PROTECTED]> > To: nutch-user@lu

Re: how does nutch connect to urls internally?

2008-06-20 Thread Otis Gospodnetic
: URL filter: true > LinkDb: adding segment: crawl/segments/20080620184000 > LinkDb: adding segment: crawl/segments/20080620184010 > LinkDb: adding segment: crawl/segments/20080620184021 > LinkDb: done > Indexer: starting > Indexer: linkdb: crawl/linkdb > Indexer: adding segment: crawl/segments/

Re: GNUgcj problem?

2008-06-20 Thread Otis Gospodnetic
Just get the latest JDK from Sun. No need for yum, just download, install, set JAVA_HOME, add JAVA_HOME/bin to PATH and you are set. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Winton Davies <[EMAIL PROTECTED]> > To: nutch-user@lucene.a

Re: how does nutch connect to urls internally?

2008-06-19 Thread Otis Gospodnetic
Hi Ann, Regarding frames - this is not the problem here (with Nutch), as Nutch doesn't even seem to be able to connect to your server. It never gets to see the HTML and frames in it. Perhaps there is something useful in the logs not on the Nutch side, but on that v4 server. Otis -- Sematext

Re: updating retry inteval

2008-06-19 Thread Otis Gospodnetic
Don't know off the top of my head, but I'd guess no, because Nutch uses Hadoop/HDFS. HDFS files are write-once, so I doubt you can just update a single URL's data. But you could write a MapReduce job that goes over the whole CrawlDb and modifies only the records you need modified. You'll need

Re: Has anybody implemented NUTCH in a C or C++ Application?

2008-06-19 Thread Otis Gospodnetic
Hi, Nutch is a Java application and consists of a number of Java classes that perform different operations. If you are asking whether you can run these classes from a C or C++ application -- I'm not sure, I never had to do that. If you know how to call java classes from C/C++, have a look at

Re: All administration gui links in wiki are broken

2008-06-19 Thread Otis Gospodnetic
Don't count on the Admin UI. I believe it was only a prototype that was never integrated in Nutch and probably never will be (until somebody contributes something). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Martin Xu <[EMAIL PROTECT

Re: where nutch store crawled data

2008-06-17 Thread Otis Gospodnetic
Hi, Both of you should open some JIRA issues and upload your patches there as you progress, so others can see the direction you are headed and make suggestions when appropriate. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Marcus Herou

Re: getting seed list for vertical search engine

2008-06-17 Thread Otis Gospodnetic
seed list form the known site > list is that I am sure to miss lots of small, individual sites - I > wonder how google, msn, yahoo does it - they must be getting list of > from ISPs, hosting providers, etc? > > Thanks > Jha, > > > > > On Mon, Jun 16, 2008

Re: problems with link limits

2008-06-17 Thread Otis Gospodnetic
Hi, There is also a setting for the maximal number of bytes to fetch. If your main index page is large, maybe it's just getting cut off because of that. The property has "content" in the name, I believe, so look for that in nutch-default.xml. Otis -- Sematext -- http://sematext.com/ -- Lucen

Re: infinite loop-problem

2008-06-16 Thread Otis Gospodnetic
Uhuh, yes, this is most likely due to session IDs that create unique URLs that Nutch keeps processing. Look at conf/regex-normalize.xml for how you can clean up URLs. That should help. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Felix

Re: db.ignore.external.links=true and redirects

2008-06-16 Thread Otis Gospodnetic
Don't have the answer, but got a question. Does this happen only when redirection to the external host are involved? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Drew Hite <[EMAIL PROTECTED]> > To: nutch-user@lucene.apache.org > Sent: Mo

Re: ClassNotFoundException: org.apache.nutch.analysis.CommonGrams

2008-06-16 Thread Otis Gospodnetic
Yes, this is a pure CLASSPATH issue. I haven't built a Nutch war in a while, so I don't recall what is in it, but most likely it has WEB-INF/lib directory with some jar files. One of these ah, let's just see. Here: [EMAIL PROTECTED] trunk]$ unzip -l build/nutch-1.0-dev.war | grep jar | g

Re: getting seed list for vertical search engine

2008-06-16 Thread Otis Gospodnetic
This seems to be a common request - sizing. I think the best you can do is use existing search engines to estimate how many pages the sites you are interested in have. You will have to know the exact sites (their URLs) and make use of the "site:" search operator (Google, Yahoo). Yahoo also has so

Re: problem running nutch from eclipse 3.2 in ubuntu hardy.

2008-06-14 Thread Otis Gospodnetic
an't find rules > for scope 'inject', using default > 2008-06-13 22:29:35,101 WARN crawl.Injector - Skipping > http://lucene.apache.org/:java.lang.NullPointerExcep > tion > 2008-06-13 22:29:35,101 WARN crawl.Injector - Skipping > http://shopping.yahoo.com/:jav

Re: problem running nutch from eclipse 3.2 in ubuntu hardy.

2008-06-13 Thread Otis Gospodnetic
Hi, You didn't mention URL injection, which makes me think you didn't inject any seed URLs to crawl. I also suggest figuring out how to run Nutch "normally", "from the command-line", before introducing additional variables and complexities, such as running Nutch from an IDE. Otis -- Sematext

Re: Hardware Specifications

2008-06-12 Thread Otis Gospodnetic
le of contexts your sort of agreeing with me. Running > multiple nutch processes on a multi-core processor is more efficient then > running one single process on heavily scaled hardware. > > Am i correct with this statement? > > > - Original Message > From: Otis

Re: Hardware Specifications

2008-06-12 Thread Otis Gospodnetic
I'm not sure -- I try to avoid running a single Nutch job at a time, as I find overlapping is more efficient. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Sean Dean <[EMAIL PROTECTED]> > To: nutch-user@lucene.apache.org > Sent: Thursday, Ju

Re: java.lang.StackOverflowError in HTMLMetaProcessor.getMetaTagsHelper

2008-06-12 Thread Otis Gospodnetic
Removed the plugin from the config :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Siddhartha Reddy <[EMAIL PROTECTED]> > To: nutch-user@lucene.apache.org > Sent: Thursday, June 12, 2008 11:41:17 PM > Subject: Re: java.lang.StackOverflowEr

Re: Retrieving data for a particular URL from crawldb?

2008-06-12 Thread Otis Gospodnetic
I don't think that's doable, as I *think* CrawlDb doesn't know which segment the URL is in (or does it? Not looking at the code now, sorry). But, knowing the segment you should be able to pull the web page data out. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Orig

Re: Parser bug?

2008-04-18 Thread Otis Gospodnetic
Svein, It sounds like this should be added to JIRA, though I wonder if this is just the case of some bad/invalid Javascript that confuses the js parser. You'll want to include the URL where this problem happens and its source. Probably best to grab the source with something like curl or wget

Re: Files removed from https://svn.apache.org/repos/asf/lucene/nutch/trunk/bin???

2008-04-18 Thread Otis Gospodnetic
You are right, the scripts are missing. I don't know why that is. I do see them in bin in my local svn checkout of nutch/trunk though. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: nutchvf <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.

Re: Parallel operations in fetch

2008-04-15 Thread Otis Gospodnetic
Thanks Dennis. But, hm, I don't get it 100% yet. I looked at Generator.java and I see this:

    if (numLists == -1) {              // for politeness make
      numLists = job.getNumMapTasks(); // a partition per fetch task
    }

Thus, when -numFetchers is not given, the nu

Re: Next Generation Nutch

2008-04-14 Thread Otis Gospodnetic
Nutch - Original Message From: Andrzej Bialecki <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.org Sent: Monday, April 14, 2008 1:01:37 PM Subject: Re: Next Generation Nutch Dennis Kubes wrote: > > > Otis Gospodnetic wrote: >> I suppose the first thing to do would be des

Re: Next Generation Nutch

2008-04-13 Thread Otis Gospodnetic
From: Dennis Kubes <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.org Sent: Sunday, April 13, 2008 5:44:32 PM Subject: Re: Next Generation Nutch Otis Gospodnetic wrote: > Hello, > > A few quick comments. I don't know how much you track Solr, but the mention > of shard

Re: Next Generation Nutch

2008-04-11 Thread Otis Gospodnetic
Hi, Hm, I have to say I'm not sure if I agree 100% with part 1. I think it would be great to have such flexibility, but I wonder if trying to achieve it would be over-engineering. Do people really need that? I don't know, maybe! If they do, then ignore my comment. :) I'm curious about 2. -

Re: Next Generation Nutch

2008-04-11 Thread Otis Gospodnetic
Hello, A few quick comments. I don't know how much you track Solr, but the mention of shards makes me think of SOLR-303 and DistributedSearch page on Solr Wiki. You'll want to check those out. In short, Solr has the notion of shards and distributed search, kind of like Nutch with its RPC fra

Fetch task 100% done, but still fetching

2008-04-10 Thread Otis Gospodnetic
Hi, I noticed that during fetching map tasks get to 100% complete (in the GUI), but are not marked as completed (also in the GUI), and are in fact really not complete - the logs show there is fetching still going on (though almost exclusively timeouts at the end of the fetch run, as expected),

Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

2008-04-08 Thread Otis Gospodnetic
increase the threads to 400 per server, and 3 per host. I was seeing about 15 pages/second. I didn't get a chance to implement the other suggestions because I'll eat all of the office's bandwidth and get yelled at :) Maybe I'll make a "Nutch Speed Improvements" entry in

Re: dealing with utf-8 characters

2008-04-06 Thread Otis Gospodnetic
I cannot tell for sure without looking at the code, but my guess is diacritics are simply not being stripped anywhere. I imagine you could modify the NutchAnalyzer to include that ISO...Filter, the same class that you must have configured in your Solr schema.xml. Otis -- Sematext -- http://s
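
The filter being alluded to is presumably ISOLatin1AccentFilter (the Lucene counterpart of Solr's ISOLatin1AccentFilterFactory); a sketch of wiring it into a custom analyzer, using Lucene 2.x-era APIs:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class AccentStrippingAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Replaces accented characters with their unaccented ASCII equivalents,
        // e.g. "café" is indexed as "cafe".
        return new ISOLatin1AccentFilter(stream);
      }
    }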

Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

2008-04-06 Thread Otis Gospodnetic
Regarding the Tika error message, I've seen that, too. if you need motivation, Chris. :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Chris Mattmann <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.org Sent: Saturday, April 5, 2008 2:58:

Re: Nutch or Heritrix?

2008-04-06 Thread Otis Gospodnetic
Hello Svein, Quick answers to your questions: - Nutch does not include an image crawler, though some people have started working on that a long time ago, and Archive.org is sponsoring this work/project. - Nutch has a distributed fetcher. Not sure about Heritrix. - Nutch is being worked on, bu

Re: url file and crawl filter file - basic question ( may be )

2008-03-29 Thread Otis Gospodnetic
I hate to do this, but here it goes: Please give volunteers at least 2-3 days to answer your question before reminding - it doesn't look nice. Either my mail reader is lying or you sent your reminder email only 30 minutes after your original email. Words like please and thank you also help. :)

Re: merging indexes with nutch

2008-03-11 Thread Otis Gospodnetic
Aha, I see several answers on the Nutch ML - bravo Tomo! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Tomislav Poljak <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.org Sent: Wednesday, March 5, 2008 1:11:39 PM Subject: Re: merging index

Re: What's the way make a nutch index work like a the lucene index?

2008-03-11 Thread Otis Gospodnetic
Siva - you can't really just use the Lucene demo tool nor that luceneweb thing and expect it to search your Nutch-created Lucene index. The two index structures (their fields) are quite different. I don't want to self-promote, but if you can, get a copy of Lucene in Action in order to get a be
