Re: next score usage

2005-10-14 Thread Otis Gospodnetic
I think this is for [EMAIL PROTECTED] please remove java-dev@ when replying. --- Michael Ji [EMAIL PROTECTED] wrote: hi, I saw several discussions about Distributed Link Analysis Tool before. And I still have question about the usage of the field next score in Page data structure.

LanguageIdentifierPlugin and CJK

2006-01-04 Thread Otis Gospodnetic
Hi, I'm interested in the Language Identifier plugin that Sami and Jerome put together. I noticed the list of supported languages does not include CJK languages: http://wiki.apache.org/nutch/LanguageIdentifierPlugin I'm wondering: 1. why is that? (technical difficulty of some kind?) 2. are

Re: pingomatic and pings with nutch

2007-09-03 Thread Otis Gospodnetic
Fabian - blo.gs, weblogs.com's changes.xml and pingomatic should be sufficient to get a good coverage (and solid overlap) of the blogosphere. There used to be FeedMesh, too, run by PubSub, but as PubSub is long gone, so is the FeedMesh, I believe. Got a site with a public demo? Otis

Re: pingomatic and pings with nutch

2007-09-05 Thread Otis Gospodnetic
to that information or if we can access to pingomatic services of updated blogs. Do you know something about this? Thanks for your answer. 2007/9/3, Otis Gospodnetic [EMAIL PROTECTED]: Fabian - blo.gs, weblogs.com's changes.xml and pingomatic should be sufficient to get a good coverage (and solid overlap

Re: help with hardware requirements

2007-09-09 Thread Otis Gospodnetic
Hi, I'm curious about what Tomislav is asking about, too -- how do searchers know when to reopen the index? That is, say you have a cluster of fetchers and every once in a while you end up with a newer version of an index (or indices), and say that you simply scp those indices to searchers,

Re: Hadoop distributed search.

2007-12-06 Thread Otis Gospodnetic
Dennis, Does the tmpfs really help more than the normal FS caching would help? For example, if you were to force the FS to read the whole index (files), it would read them into RAM and, hopefully, cache them. Wouldn't that achieve the same effect as tmpfs? I've done the former with very large

Re: Hadoop distributed search.

2007-12-08 Thread Otis Gospodnetic
to the application and IO drops to practically zero when used. Dennis Kubes Otis Gospodnetic wrote: Dennis, Does the tmpfs really help more than the normal FS caching would help? For example, if you were to force the FS to read the whole index (files), it would read them into RAM and, hopefully, cache

Re: Infrastructure Question

2007-12-23 Thread Otis Gospodnetic
Have you considered using EC2 during your testing/development stage? It would be safer than investing in the wrong hardware with insufficient knowledge of the exact demands and requirements. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From:

Re: Inbound Link Text

2008-01-10 Thread Otis Gospodnetic
Hm, I didn't see that comment before. I think indexing incoming text is super obvious, the equivalent to human annotation/tagging of web pages, no? As for which anchor texts not to index hm, not sure. Nothing from spam pages? Nothing from non-authoritative pages even if they are not

Re: Some questions about Nutch

2008-02-17 Thread Otis Gospodnetic
Oleg - just a quick pointer to adaptive refetching - is this not already available? See https://issues.apache.org/jira/browse/NUTCH-61 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Oleg Mürk [EMAIL PROTECTED] To:

Re: Help needed to crawl webpages

2008-02-18 Thread Otis Gospodnetic
It sounds like you really want to create a simplistic crawler for something that small. Nutch does a *pile* of other stuff that you don't seem to care about. Google for: open source web crawlers . I think there is one called Sphynx that is simple. Otis -- Sematext -- http://sematext.com/ --

Re: nutch vs hadoop versions

2008-02-18 Thread Otis Gospodnetic
Dennis & Co. Is the 0.15.* - 0.16 upgrade seamless? That is, a jar replacement and that's it, or is there an explicit HDFS upgrade step involved? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Dennis Kubes [EMAIL PROTECTED] To:

Re: Nutch and Lucene

2008-02-26 Thread Otis Gospodnetic
You can certainly use the Lucene version that your version of Nutch uses. Lucene had a few releases since the last Nutch release (0.9). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Duan, Nick [EMAIL PROTECTED] To:

Re: What's the way make a nutch index work like a the lucene index?

2008-03-11 Thread Otis Gospodnetic
Siva - you can't really just use the Lucene demo tool nor that luceneweb thing and expect it to search your Nutch-created Lucene index. The two index structures (their fields) are quite different. I don't want to self-promote, but if you can, get a copy of Lucene in Action in order to get a

Re: merging indexes with nutch

2008-03-11 Thread Otis Gospodnetic
Aha, I see several answers on the Nutch ML - bravo Tomo! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Tomislav Poljak [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, March 5, 2008 1:11:39 PM Subject: Re: merging

Re: url file and crawl filter file - basic question ( may be )

2008-03-29 Thread Otis Gospodnetic
I hate to do this, but here it goes: Please give volunteers at least 2-3 days to answer your question before reminding - it doesn't look nice. Either my mail reader is lying or you sent your reminder email only 30 minutes after your original email. Words like please and thank you also help. :)

Re: Nutch or Heritrix?

2008-04-06 Thread Otis Gospodnetic
Hello Svein, Quick answers to your questions: - Nutch does not include an image crawler, though some people have started working on that a long time ago, and Archive.org is sponsoring this work/project. - Nutch has a distributed fetcher. Not sure about Heritrix. - Nutch is being worked on,

Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

2008-04-06 Thread Otis Gospodnetic
Regarding the Tika error message, I've seen that, too. If you need motivation, Chris. :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Chris Mattmann [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Saturday, April 5, 2008

Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

2008-04-09 Thread Otis Gospodnetic
. I was seeing about 15 pages/second. I didn't get a chance to implement the other suggestions because I'll eat all of the office's bandwidth and get yelled at :) Maybe I'll make a Nutch Speed Improvements entry in the Wiki. Cheers, Bradford Stephens On Sun, Apr 6, 2008 at 10:06 PM, Otis

Fetch task 100% done, but still fetching

2008-04-10 Thread Otis Gospodnetic
Hi, I noticed that during fetching map tasks get to 100% complete (in the GUI), but are not marked as completed (also in the GUI), and are in fact really not complete - the logs show there is fetching still going on (though almost exclusively timeouts at the end of the fetch run, as expected),

Re: Next Generation Nutch

2008-04-11 Thread Otis Gospodnetic
Hi, Hm, I have to say I'm not sure if I agree 100% with part 1. I think it would be great to have such flexibility, but I wonder if trying to achieve it would be over-engineering. Do people really need that? I don't know, maybe! If they do, then ignore my comment. :) I'm curious about 2.

Re: Next Generation Nutch

2008-04-13 Thread Otis Gospodnetic
: Dennis Kubes [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Sunday, April 13, 2008 5:44:32 PM Subject: Re: Next Generation Nutch Otis Gospodnetic wrote: Hello, A few quick comments. I don't know how much you track Solr, but the mention of shards makes me think of SOLR-303

Re: Next Generation Nutch

2008-04-14 Thread Otis Gospodnetic
- Original Message From: Andrzej Bialecki [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Monday, April 14, 2008 1:01:37 PM Subject: Re: Next Generation Nutch Dennis Kubes wrote: Otis Gospodnetic wrote: I suppose the first thing to do would be describe the requirements

Re: Parallel operations in fetch

2008-04-15 Thread Otis Gospodnetic
Thanks Dennis. But, hm, I don't get it 100% yet. I looked at Generator.java and I see this:

    if (numLists == -1) {                  // for politeness make
      numLists = job.getNumMapTasks();     // a partition per fetch task
    }

Thus, when -numFetchers is not given, the

Re: Files removed from https://svn.apache.org/repos/asf/lucene/nutch/trunk/bin???

2008-04-18 Thread Otis Gospodnetic
You are right, the scripts are missing. I don't know why that is. I do see them in bin in my local svn checkout of nutch/trunk though. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: nutchvf [EMAIL PROTECTED] To:

Re: Parser bug?

2008-04-18 Thread Otis Gospodnetic
Svein, It sounds like this should be added to JIRA, though I wonder if this is just the case of some bad/invalid Javascript that confuses the js parser. You'll want to include the URL where this problem happens and its source. Probably best to grab the source with something like curl or wget

Re: java.lang.StackOverflowError in HTMLMetaProcessor.getMetaTagsHelper

2008-06-12 Thread Otis Gospodnetic
Removed the plugin from the config :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Siddhartha Reddy [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Thursday, June 12, 2008 11:41:17 PM Subject: Re: java.lang.StackOverflowError

Re: Hardware Specifications

2008-06-12 Thread Otis Gospodnetic
I'm not sure -- I try to avoid running a single Nutch job at a time, as I find overlapping is more efficient. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Sean Dean [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Thursday, June

Re: Hardware Specifications

2008-06-12 Thread Otis Gospodnetic
of agreeing with me. Running multiple nutch processes on a multi-core processor is more efficient than running one single process on heavily scaled hardware. Am I correct with this statement? - Original Message From: Otis Gospodnetic To: nutch-user@lucene.apache.org Sent: Friday

Re: problem running nutch from eclipse 3.2 in ubuntu hardy.

2008-06-13 Thread Otis Gospodnetic
Hi, You didn't mention URL injection, which makes me think you didn't inject any seed URLs to crawl. I also suggest figuring out how to run Nutch normally, from the command-line, before introducing additional variables and complexities, such as running Nutch from an IDE. Otis -- Sematext --

Re: problem running nutch from eclipse 3.2 in ubuntu hardy.

2008-06-14 Thread Otis Gospodnetic
crawl.Injector - Skipping http://lucene.apache.org/:java.lang.NullPointerException 2008-06-13 22:29:35,101 WARN crawl.Injector - Skipping http://shopping.yahoo.com/:java.lang.NullPointerException HB On Fri, Jun 13, 2008 at 10:55 PM, Otis Gospodnetic wrote: Hi, You didn't

Re: getting seed list for vertical search engine

2008-06-16 Thread Otis Gospodnetic
This seems to be a common request - sizing. I think the best you can do is use existing search engines to estimate how many pages sites you are interested in have. You will have to know the exact sites (their URLs) and make use of the site: search operator (Google, Yahoo). Yahoo also has

Re: ClassNotFoundException: org.apache.nutch.analysis.CommonGrams

2008-06-16 Thread Otis Gospodnetic
Yes, this is a pure CLASSPATH issue. I haven't built a Nutch war in a while, so I don't recall what is in it, but most likely it has WEB-INF/lib directory with some jar files. One of these ah, let's just see. Here: [EMAIL PROTECTED] trunk]$ unzip -l build/nutch-1.0-dev.war | grep jar |

Re: db.ignore.external.links=true and redirects

2008-06-16 Thread Otis Gospodnetic
Don't have the answer, but got a question. Does this happen only when redirections to an external host are involved? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Drew Hite [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent:

Re: infinite loop-problem

2008-06-16 Thread Otis Gospodnetic
Uhuh, yes, this is most likely due to session IDs that create unique URLs that Nutch keeps processing. Look at conf/regex-normalize.xml for how you can clean up URLs. That should help. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Felix
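
To illustrate the kind of cleanup meant here, the following is a minimal sketch in plain Java, not Nutch's own normalizer and not the actual contents of conf/regex-normalize.xml. The class name `SessionIdStripper` and the list of parameter names (`jsessionid`, `phpsessid`, `sid`, `sessionid`) are assumptions chosen for the example; a real site may use different names.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips common session-ID markers from a URL so that
// the same page does not show up under endlessly many distinct URLs.
class SessionIdStripper {
    // ";jsessionid=..." appended to the path (assumed pattern).
    private static final Pattern PATH_SESSION =
        Pattern.compile(";jsessionid=[^?#&]*", Pattern.CASE_INSENSITIVE);

    // Session-like query parameters (assumed names), only when they start
    // right after '?' or '&', plus the '&' that follows them if present.
    private static final Pattern QUERY_SESSION =
        Pattern.compile("(?<=[?&])(?:phpsessid|sid|sessionid)=[^&#]*&?",
                        Pattern.CASE_INSENSITIVE);

    static String normalize(String url) {
        String out = PATH_SESSION.matcher(url).replaceAll("");
        out = QUERY_SESSION.matcher(out).replaceAll("");
        // Remove a '?' or '&' left dangling at the end of the URL.
        return out.replaceAll("[?&]+$", "");
    }
}
```

With this sketch, `SessionIdStripper.normalize("http://example.com/p?PHPSESSID=xyz&q=nutch")` yields `http://example.com/p?q=nutch`. In Nutch itself the equivalent rules would be expressed as pattern/substitution pairs in the normalizer configuration rather than in code.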

Re: problems with link limits

2008-06-17 Thread Otis Gospodnetic
Hi, There is also a setting for the maximal number of bytes to fetch. If your main index page is large, maybe it's just getting cut off because of that. The property has content in the name, I believe, so look for that in nutch-default.xml. Otis -- Sematext -- http://sematext.com/ -- Lucene

Re: getting seed list for vertical search engine

2008-06-17 Thread Otis Gospodnetic
to miss lots of small, individual sites - I wonder how google, msn, yahoo does it - they must be getting list of from ISPs, hosting providers, etc? Thanks Jha, On Mon, Jun 16, 2008 at 11:15 PM, Otis Gospodnetic wrote: This seems to be a common request - sizing. I think the best you

Re: where nutch store crawled data

2008-06-17 Thread Otis Gospodnetic
Hi, Both of you should open some JIRA issues and upload your patches there as you progress, so others can see the direction you are headed and make suggestions when appropriate. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Marcus Herou

Re: All administration gui links in wiki are broken

2008-06-19 Thread Otis Gospodnetic
Don't count on the Admin UI. I believe it was only a prototype that was never integrated in Nutch and probably never will be (until somebody contributes something). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Martin Xu [EMAIL

Re: Has anybody implemented NUTCH in a C or C++ Application?

2008-06-19 Thread Otis Gospodnetic
Hi, Nutch is a Java application and consists of a number of Java classes that perform different operations. If you are asking whether you can run these classes from a C or C++ application -- I'm not sure, I never had to do that. If you know how to call java classes from C/C++, have a look at

Re: updating retry inteval

2008-06-19 Thread Otis Gospodnetic
Don't know off the top of my head, but I'd guess no, because Nutch uses Hadoop/HDFS. HDFS files are write-once, so I doubt you can just update a single URL's data. But you could write a MapReduce job that goes over the whole CrawlDb and modifies only the records you need modified. You'll

Re: how does nutch connect to urls internally?

2008-06-19 Thread Otis Gospodnetic
Hi Ann, Regarding frames - this is not the problem here (with Nutch), as Nutch doesn't even seem to be able to connect to your server. It never gets to see the HTML and frames in it. Perhaps there is something useful in the logs not on the Nutch side, but on that v4 server. Otis --

Re: GNUgcj problem?

2008-06-20 Thread Otis Gospodnetic
Just get the latest JDK from Sun. No need for yum, just download, install, set JAVA_HOME, add JAVA_HOME/bin to PATH and you are set. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Winton Davies [EMAIL PROTECTED] To:

Re: how does nutch connect to urls internally?

2008-06-20 Thread Otis Gospodnetic
-Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, June 19, 2008 10:54 PM To: nutch-user@lucene.apache.org Subject: Re: how does nutch connect to urls internally? Hi Ann, Regarding frames - this is not the problem here (with Nutch), as Nutch doesn't even

Re: Querying linkdb for a URL with special characters

2008-06-22 Thread Otis Gospodnetic
Hi, You can dump the whole CrawlDb and grep for your URL. Not fast, but it will work. You could also just try looking in your logs first. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Viksit Gaur [EMAIL PROTECTED] To:

Fetching only unfetched URLs

2008-06-22 Thread Otis Gospodnetic
Hi, Is there an existing method for generating a segment/fetchlist containing only URLs that have not yet been fetched? I'm asking because I can imagine a situation where one has a large and old CrawlDb that knows about a lot of URLs (the ones with db_unfetched status if you run -stats) and in

Re: nutch and lucene scoring

2008-08-06 Thread Otis Gospodnetic
Hi, You really need to ask this question on the Lucene mailing list, as that's where hit scoring comes from. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Alexander Aristov [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent:

Re: keyword match

2008-09-24 Thread Otis Gospodnetic
It ain't Nutch, but you can look at Elevate component in Solr to get some ideas. There is a Wiki page for the component. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Edward Quick [EMAIL PROTECTED] To: nutch-user@lucene.apache.org

Re: did you mean?

2008-09-24 Thread Otis Gospodnetic
Heh, I'll point to Solr's SpellCheckComponent. :) It, too, has a good page on the Wiki. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Edward Quick [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, September 24, 2008

Re: Extensive web crawl

2008-10-20 Thread Otis Gospodnetic
Axel, how did this go? I'd love to know if you got to 1B. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Webmaster [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, October 7, 2008 1:13:29 AM Subject: Extensive web

Hadoop's new fair sharing job scheduler

2008-11-20 Thread Otis Gospodnetic
Hi, Just noticed Hadoop's new fair sharing job scheduler ( https://issues.apache.org/jira/browse/HADOOP-3746 ). It seems to be in 0.19, which I think Nutch is not on yet... but still: - is this something that would benefit Nutch? The last time I used Nutch I remember having to be careful

Re: Indexing News groups

2008-11-20 Thread Otis Gospodnetic
PM, Otis Gospodnetic wrote: By newsgroups do you mean Usenet newsgroups? If so, it might be a lot simpler to use Solr, unless you want to build an NNTP crawler I did do something like that over a decade ago. I used it to find people and build a White Pages directory (this was big

Re: Hadoop's new fair sharing job scheduler

2008-11-20 Thread Otis Gospodnetic
work for and help with Nutch generate/fetch/parse/etc. operations. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Otis Gospodnetic [EMAIL PROTECTED] To: Nutch User List nutch-user@lucene.apache.org Sent: Thursday, November 20, 2008 3

Re: Indexing News groups

2008-11-20 Thread Otis Gospodnetic
of webservice? -John On Nov 20, 2008, at 4:23 PM, Otis Gospodnetic wrote: Yes, you'd have to write a mini newsgroup reader, mimic its behaviour, but then once you grab a post you could send it directly to Solr for indexing. No need for intermediate DB, XML files, etc. Otis

Re: proposal: fetcher performance improvements

2008-12-10 Thread Otis Gospodnetic
Hi Todd, This sounds good. I think we've all seen the problem you are describing. You can see something related at: - https://issues.apache.org/jira/browse/NUTCH-629 - https://issues.apache.org/jira/browse/NUTCH-628 It would be great if you could incorporate any of the good ideas from the above

Re: Partial word searches?

2008-12-15 Thread Otis Gospodnetic
Hi, It would be possible if you index tokens not as words, but as character ngrams. You'd need a custom analyzer for that. Code for character-based ngrams already exists in Lucene contrib, but you'd need to add it to Nutch. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
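
As a rough illustration of why character ngrams enable partial-word matches, here is a small plain-Java sketch. It is not the Lucene contrib analyzer mentioned above; the class name `CharNGrams` and its method are hypothetical and only show the idea of splitting a token into overlapping character sequences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: produces all overlapping character n-grams of a token.
class CharNGrams {
    static List<String> ngrams(String token, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        // Slide a window of width n across the token, one character at a time.
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }
}
```

If each of these grams is indexed as a term, a query for a fragment such as `utc` can match a document containing `nutch`, because `utc` is one of the 3-grams of that word. The trade-off is a noticeably larger index, since every token contributes several terms.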

Re: Stemming issues

2008-12-16 Thread Otis Gospodnetic
Hi, Yes, if you want flowers to match flower you will want to apply stemming. You can use the Snowball for English. I don't have any code handy, but you can see how it's done if you look at Lucene's unit test for Snowball Analyzer. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr -

Re: Any unofficial howtos for a new Nutch user

2008-12-17 Thread Otis Gospodnetic
Hi, Unfortunately, there are no Nutch books (nor are any Nutch books in the works that I know of), and I think the documentation on the Nutch Wiki is the best/only thing there is. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: opsec

Re: Stemming issues

2008-12-17 Thread Otis Gospodnetic
You need to stem both at index time and at search time. Then flowers will be stemmed to flower in both cases and flower at search time will match the indexed term flower. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: RanjithStar
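
The point about stemming on both sides can be shown with a toy example. This sketch assumes a deliberately naive stemmer (lower-case the word and drop one trailing "s") as a stand-in for Snowball, and a tiny in-memory inverted index; the class `StemmedIndex` and its methods are hypothetical and are not Lucene or Nutch APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: the same stem() is applied when adding documents
// and when searching, so "flowers" and "flower" meet at the same term.
class StemmedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Naive stand-in for a real stemmer such as Snowball (assumption).
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // Index time: stem every token before storing it.
    void add(int docId, String text) {
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(stem(token), k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Search time: stem the query term with the same function.
    Set<Integer> search(String term) {
        return postings.getOrDefault(stem(term), Collections.emptySet());
    }
}
```

If only one side were stemmed, the stored term and the query term would differ and the lookup would miss; applying the same function in both places is what makes the two word forms interchangeable.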

Re: Spider a single url and get tokenzied keyword/phrases

2008-12-23 Thread Otis Gospodnetic
Hi Doug, Nutch is not really meant for this type of stuff. You'd be using a very very massive hammer for a very small nail if you were to choose Nutch for this task. :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doug Leeper

Re: Search performance for large indexes (100M docs)

2009-01-09 Thread Otis Gospodnetic
Check java-user archives on markmail.org and search for Toke and SSD to see SSD benchmarks done by Toke a few months back. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Sean Dean seand...@rogers.com To: nutch-user@lucene.apache.org

Re: [jira] Commented: (NUTCH-442) Integrate Solr/Nutch

2009-01-09 Thread Otis Gospodnetic
Tony, You've sent about 10 emails about this already, both on the Nutch and on the Solr list. Please have a bit more patience and wait for Nutch 1.0 release. My guess is this Nutch-Solr integration will be in Nutch 1.0. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr -

Re: Search performance for large indexes (100M docs)

2009-01-13 Thread Otis Gospodnetic
Vishal, Re 2. - I don't think it's quite true. RAM is still much faster than SSDs. Also, which version of Lucene are you using? Make sure you're using the latest one if you care about performance. Also, if you have extra RAM, you can make your .tii bigger/denser and speed up searches that

Re: Adding new plugin and classloading issues

2009-01-23 Thread Otis Gospodnetic
Step one is to identify the exact jar where this class lives. Are you sure it's in mail.jar? Maybe it's in activate.jar? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Antony Bowesman a...@teamware.com To: nutch-user@lucene.apache.org

Re: sitemaps

2009-02-27 Thread Otis Gospodnetic
Nutch doesn't make use of sitemaps currently. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: consultas consul...@qualidade.eng.br To: nutch-user@lucene.apache.org Sent: Friday, February 27, 2009 12:34:30 PM Subject: sitemaps From a

Re: error when bootstrap DMOZ databases

2009-03-03 Thread Otis Gospodnetic
You don't have enough free disk space, that's all. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Tony Wang ivyt...@gmail.com To: nutch-user@lucene.apache.org Sent: Tuesday, March 3, 2009 10:58:41 PM Subject: error when bootstrap DMOZ

Re: The Future of Nutch

2009-03-16 Thread Otis Gospodnetic
Hello, Comments inlined. - Original Message From: Dennis Kubes ku...@apache.org To: nutch-user@lucene.apache.org Sent: Friday, March 13, 2009 8:19:37 PM With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some

Re: Nutch-based Application for Windows

2009-05-26 Thread Otis Gospodnetic
Hi John, It would be quite appropriate, actually. You may want to put a link to it under the Resources section on the front page, and maybe even on http://wiki.apache.org/nutch/GettingNutchRunningWithWindows Otis (Nutch committer) -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: threads get stuck in spinwaiting

2009-05-27 Thread Otis Gospodnetic
See https://issues.apache.org/jira/browse/NUTCH-570 for something relevant. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Raymond Balmès raymond.bal...@gmail.com To: nutch-user@lucene.apache.org Sent: Wednesday, May 27, 2009 9:43:02 AM

Re: threads get stuck in spinwaiting

2009-05-27 Thread Otis Gospodnetic
Ray, I don't think fetchlist generation sticks URLs from the same domain or host together. But URLs for the same host do end up in the same queue. This is by design and it is a good thing -- this is how Nutch can ensure not to hit the same host with more simultaneous threads than it should

Re: threads get stuck in spinwaiting

2009-05-27 Thread Otis Gospodnetic
have many URLs per host of course. Need to get all the pages of the sites, don't understand the question. -Raymond 2009/5/26 Otis Gospodnetic But how, Ray, if you have only 1 URL per host? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original

Re: threads get stuck in spinwaiting

2009-05-27 Thread Otis Gospodnetic
on this? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Otis Gospodnetic ogjunk-nu...@yahoo.com To: nutch-user@lucene.apache.org Sent: Wednesday, May 27, 2009 11:38:48 PM Subject: Re: threads get stuck in spinwaiting Ray, I don't think

Re: Question on Efficient field updates in the Lucene index in Nutch

2009-06-02 Thread Otis Gospodnetic
Unfortunately Lucene doesn't allow that. You have to reindex the whole doc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Vijay vijay.stanf...@gmail.com To: nutch-user@lucene.apache.org; java-u...@lucene.apache.org Sent: Monday, June

Re: Reading Nutch indexes w/ Lucene.NET

2009-06-10 Thread Otis Gospodnetic
Hello, It really depends on the version of Lucene used in your Nutch instance and whether Lucene.NET version you are using is compatible at index format level. As for segments dir vs. file, this is just a case of unfortunate naming. Segments in Lucene means a completely different thing than

Re: adding pre-indexed DB's together

2009-06-22 Thread Otis Gospodnetic
Paul, There was talk of this in the past, at least between some other people here and me, possibly off-line. Your best bet may be going to what's left of Wikia Search and getting their old index. But, you see, this is exactly the problem - the index may be quite outdated by now. Otis --

Re: adding pre-indexed DB's together

2009-06-22 Thread Otis Gospodnetic
db which had over 1 billion urls at last count. So it might be a good starting point for crawling the web. At last count though it was 250G in size so not downloadable unless you have a fast connection. It is available for anyone that wants it though. Dennis Otis Gospodnetic wrote

Re: recrawling

2009-06-24 Thread Otis Gospodnetic
Neeti, I don't think there is a way to know when a regular web site has been updated. You can issue GET or HEAD requests and look at the Last-Modified date, but this is not 100% reliable. You can fetch and compare content, but that's not 100% reliable either. If you are indexing blogs,
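
A sketch of the Last-Modified comparison, under the assumption that the header value has already been obtained from a GET or HEAD response. The class `LastModifiedCheck` and its method are hypothetical; the sketch treats a missing or unparseable header as "possibly changed", which reflects the caveat that this signal is not 100% reliable.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper: decides whether a page may have changed since the
// last fetch, based only on the value of its Last-Modified header.
class LastModifiedCheck {
    static boolean maybeChanged(String lastModifiedHeader, Instant lastFetch) {
        if (lastModifiedHeader == null) {
            return true; // No header: cannot rule out a change.
        }
        try {
            // HTTP dates use the RFC 1123 format, e.g. "Wed, 24 Jun 2009 10:00:00 GMT".
            Instant modified = ZonedDateTime
                .parse(lastModifiedHeader, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
            return modified.isAfter(lastFetch);
        } catch (DateTimeParseException e) {
            return true; // Malformed header: be conservative and refetch.
        }
    }
}
```

Servers that generate pages dynamically often report the current time or omit the header entirely, so a check like this can only reduce refetching for well-behaved sites; it cannot prove that content is unchanged.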

Re: Using nutch only as a webcrawler?

2009-06-26 Thread Otis Gospodnetic
Johan, Yes, you can fetch and fetch and fetch and only fetch with Nutch and have the data saved in HDFS (Nutch uses something called Hadoop and that includes HDFS, a distributed FS that sits on top of regular FS/disk). You can then read the data from there and index it however you want,

Re: Nutch fetch performance

2009-06-26 Thread Otis Gospodnetic
I remember seeing those in the logs, but it's been a while. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: caezar caeza...@gmail.com To: nutch-user@lucene.apache.org Sent: Friday, June 26, 2009 3:50:39 AM Subject: Re: Nutch fetch

Re: Nutch 1.0 on the limits of the data

2009-07-03 Thread Otis Gospodnetic
Depends on hardware, of course! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Polsnet pols...@163.com To: nutch-user@lucene.apache.org Sent: Friday, July 3, 2009 12:03:30 AM Subject: Nutch 1.0 on the limits of the data Nutch 1.0

Re: Specific fetch list based on url status or score

2009-08-02 Thread Otis Gospodnetic
Hi, See this: http://markmail.org/message/znbu5khl7qbkvhkm (I didn't double-check CHANGES.txt to see if this made it into 1.0) Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From:

Re: denied by robots.txt rules

2009-08-02 Thread Otis Gospodnetic
Hi, robots.txt is periodically rechecked and the previously denied URL should be retried when the time to refetch it comes. If robots.txt rules no longer deny access to it, it should be fetched. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta,

Re: Nutch in C++

2009-08-02 Thread Otis Gospodnetic
Nutch uses Lucene (Java), not CLucene (C++). Why are you looking to rewrite Nutch in C++ anyway? Sounds scary. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: alx...@aim.com

Re: Dumping Crawl DB with XML

2009-08-02 Thread Otis Gospodnetic
Mario, I think text is the only output format. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: schroedi schroedi2...@gmail.com To: nutch-user@lucene.apache.org Sent:

Re: Meaning of ProtocolStatus.ACCESS_DENIED

2009-08-02 Thread Otis Gospodnetic
I don't know of an elegant way, but if you want to hack Nutch sources, you could set its refetch time to some point in time veeey far in the future, for example. Or introduce additional status. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch,

Re: Using Nutch (w/custom plugin) to crawl vs. custom Lucene app

2009-08-02 Thread Otis Gospodnetic
Hello, Lucene sounds like the way to go here. What's more, if you have a copy of Lucene in Action (1st edition), I wrote a small and simple framework for file-system indexing. You could define your own parser for your own custom file format and the indexer will use it. I think it's in

Re: Nutch in C++

2009-08-03 Thread Otis Gospodnetic
useful, please let me know. thanks. Alex. -Original Message- From: Otis Gospodnetic To: nutch-user@lucene.apache.org Sent: Sun, Aug 2, 2009 8:15 pm Subject: Re: Nutch in C++ Nutch uses Lucene (Java), not CLucene (C++). Why are you looking to rewrite Nutch

Re: Nutch in C++

2009-08-04 Thread Otis Gospodnetic
-Original Message- From: Otis Gospodnetic [mailto:ogjunk-nu...@yahoo.com] Sent: 04 August 2009 04:49 To: nutch-user@lucene.apache.org Subject: Re: Nutch in C++ CLucene is just like Lucene (except a few versions behind), but written in C++. Yes, you could rewrite Nutch in C

Re: Nutch in C++

2009-08-04 Thread Otis Gospodnetic
@lucene.apache.org Sent: Tuesday, August 4, 2009 12:36:19 PM Subject: Re: Nutch in C++ Thanks for your comments. Is there anything that I code in C++ that open source community could benefit? Alex. -Original Message- From: Otis Gospodnetic To: nutch-user@lucene.apache.org

Re: PDFBox log file locks Fetcher

2009-08-04 Thread Otis Gospodnetic
I don't have a fix, but I have a suggestion - have you tried using the very latest version of PDFBox? I believe it's going through Apache Incubator... aha, here: http://incubator.apache.org/pdfbox/ Too bad the page doesn't say *when* the release was made, so one can get a sense of the state

Re: Categorizing search results

2009-08-04 Thread Otis Gospodnetic
Kenan, Have you considered using Carrot2? I think Nutch includes a plugin for it already. Or, if your categories are predefined, you could index with Solr (if you were to use Nutch 1.0) and use Solr's faceting capabilities. Otis -- Sematext is hiring --

Re: Nutch/Solr question

2009-11-11 Thread Otis Gospodnetic
Solr is just a search and indexing server. It doesn't do crawling. Nutch does the crawling and page parsing, and can index into Lucene or into a Solr server. Nutch is a biggish beast, and if you just need to index a site or even a small set of them, you may have an easier time with Droids.

Re: How to configure nutch to crawl parallelly

2009-11-13 Thread Otis Gospodnetic
I don't recall off the top of my head what that jobtracker.jsp shows, but judging by name, it shows your job. Each job is composed of multiple map and reduce tasks. Drill into your job and you should see multiple tasks running. Otis -- Sematext is hiring --

Re: crawling / data aggregation - is nutch the right tool?

2009-11-15 Thread Otis Gospodnetic
Droids is much simpler if all you want to do is do a little bit of crawling. Nutch is built to scale to many millions of web pages. If you need to crawl just a few sites, I'd suggest Droids. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta,

Re: 100 fetches per second?

2009-11-26 Thread Otis Gospodnetic
I think in the end what Ken Krugler did with Bixo (limiting crawl time) and what Julien added in https://issues.apache.org/jira/browse/NUTCH-770 (plus https://issues.apache.org/jira/browse/NUTCH-769) are solutions to this problem, in addition to what Andrzej described below. Can you try

NYC Search and Discovery Meetup

2009-12-01 Thread Otis Gospodnetic
Hello, For those living in or near NYC, you may be interested in joining (and/or presenting?) at the NYC Search and Discovery Meetup. Topics are: search, machine learning, data mining, NLP, information gathering, information extraction, etc. http://www.meetup.com/NYC-Search-and-Discovery/ Our

Re: What is the best choice: nutch/lucene or nutch/solr?

2009-12-04 Thread Otis Gospodnetic
Sounds like Nutch for crawling to gather the data, custom tools to read the gathered data, call the KV store, construct SolrInputDocuments, and index those to Solr. If you want Solr and not Lucene, which is a bigger question that I can't answer without knowing the details. Otis -- Sematext

Re: ontology implementation

2010-01-07 Thread Otis Gospodnetic
Claudio, If you think synonyms will do, perhaps you should look at Solr, which includes support for query-time and/or index-time synonym expansion. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Claudio Martella claudio.marte...@tis.bz.it

NYC Search in the Cloud meetup: Jan 20

2010-01-12 Thread Otis Gospodnetic
Hello, If Search Engine Integration, Deployment and Scaling in the Cloud sounds interesting to you, and you are going to be in or near New York next Wednesday (Jan 20) evening: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/ Sorry for dupes to those of you subscribed to

Re: Using Nutch to crawl and use it as input to Solr

2010-01-28 Thread Otis Gospodnetic
Use Droids to crawl. It already has hooks to index crawled content with Solr, e.g. http://search-lucene.com/c?id=Droids:/droids-solr/src/main/java/org/apache/droids/solr/SolrHandler.java||solr Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search ::
