Where is database?!

2006-03-06 Thread Dima Mazmanov
Hi! Here are my crawl steps. I started all Hadoop daemons, inserted the URL file into DFS, then started the crawl. Here is part of the crawl log: 060306 124851 parsing file:/usr/home/duche/nutch-nightly/conf/mapred-default.xml 060306 124851 parsing

RE: query site

2006-03-06 Thread Laurent Michenaud
I think it is strange. OR is supported by Lucene, so it should be supported by Nutch, no? -Original Message- From: Jack Tang [mailto:[EMAIL PROTECTED] Sent: Friday, March 3, 2006 18:39 To: nutch-user@lucene.apache.org Subject: Re: query site OR is not supported in nutch yet. On

Re: nutch and multilingualism

2006-03-06 Thread Ivan Sekulovic
Hi Jerome! Would it be possible to generate ngram profiles for the LanguageIdentifier plugin from crawled content rather than from a file? My idea: the best source for content in one language could be wikipedia.org. We would just crawl Wikipedia in the desired language and then create ngram
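The idea of building a profile from crawled text can be sketched as follows. This is a minimal illustration, not the LanguageIdentifier plugin's own code: the class NgramProfileSketch, its method names, the "_" word padding and the tie-breaking order are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: builds a ranked character n-gram
// profile from plain text such as the parsed content of crawled pages.
final class NgramProfileSketch {

    // Counts every n-gram of length minLen..maxLen inside each word.
    // Words are lower-cased and padded with '_' so that word boundaries
    // contribute to the profile (an assumption of this sketch).
    static Map<String, Integer> countNgrams(String text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a non-letter
            }
            String padded = "_" + word + "_";
            for (int n = minLen; n <= maxLen; n++) {
                for (int i = 0; i + n <= padded.length(); i++) {
                    counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns up to 'size' n-grams, most frequent first; ties are broken
    // alphabetically so the result is deterministic.
    static List<String> topNgrams(Map<String, Integer> counts, int size) {
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return grams.subList(0, Math.min(size, grams.size()));
    }
}
```

Feeding the text of every fetched page in one language through countNgrams and keeping the top few hundred entries would give a profile comparable in spirit to one built from a file.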

Re: NullPointerException

2006-03-06 Thread Howie Wang
I didn't see query-basic/query-more in your list of included plugins. This is what usually handles most queries. query-url will only handle parts of the query that look like url:http://www.google.com, and query-site handles site:www.google.com. Nothing seems to be handling just regular text in

Re: Jpeg and Exif Plugin

2006-03-06 Thread Ivan Sekulovic
I think the licence is OK. Using that library for a plugin is really simple. I did some tests some time ago. All you have to do is something like this (content is a byte[]): Metadata metadata = JpegMetadataReader.extractMetadataFromJpegSegmentReader(new JpegSegmentReader(content)); And then

Re: nutch and multilingualism

2006-03-06 Thread Zaheed Haque
I think it's a very good idea. It would be even better if one could create a separate crawl script just for ngram creation, where one could add one's own URLs, for example national library URLs. My thinking is a bin/nutch ngram command, which is similar to the one-shot crawl for intranet searching but

query-more

2006-03-06 Thread Laurent Michenaud
Hi, do you have an example of how the query-more plugin works? For type, is it used for something like this or not? +type:text/html

Re: query-more

2006-03-06 Thread Jérôme Charron
For type, is it used to do something like that or not ? +type:text/html If you just type a query like type:text, type:html or type:text/html, it will return no results. It is a filter, i.e. you must combine it with a search term, for instance: type:html nutch if you want to get nutch related

Re: find duplicate urls in webdb

2006-03-06 Thread Andrzej Bialecki
Elwin wrote: When I read pages out of a webdb and printed out the URL of each page, I found two URLs that are exactly the same. Is it possible to have two pages with the same URL? WebDB should not allow two URLs that are exactly the same (Nutch uses an MD5 signature for that). Please check them
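One way to check whether two printed URLs really are byte-for-byte identical is to compare their MD5 digests. The sketch below uses only the JDK; the class name UrlMd5Sketch is hypothetical, and hashing the UTF-8 bytes of the URL string is an assumption made for illustration, not a description of what WebDB hashes internally.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: MD5 digest of a URL string as hex.
final class UrlMd5Sketch {

    // Returns the 32-character lower-case hex MD5 of the URL's UTF-8 bytes.
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            // BigInteger(1, ...) treats the digest as unsigned; %032x keeps leading zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Two URLs that look the same on screen but differ in a trailing slash, trailing whitespace or letter case produce different digests, which makes such near-duplicates easy to spot.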

RE: query-more

2006-03-06 Thread Laurent Michenaud
It does not work for me. When I search for something, there is no type attribute in the HitDetails. I should see it, no? The plugin is properly activated: 20060306 12:39:28,377 DEBUG.[] plugin: id=query-more name=More Query Filter version=1.0.0 provider=nutch.org class=null 20060306

Offline search (Vicaya 0.1)

2006-03-06 Thread Alexander E Genaud
Hello, I've just released a modified version of nutch071 and tomcat50 that runs cross-platform off a CD-ROM or local hard drive: http://sf.net/projects/vicaya My ambitions are not 'the whole web' but a small and static collection of pages. I intend to allow users to use nutch offline with the

Re: Offline search (Vicaya 0.1)

2006-03-06 Thread Stefan Groschupf
Hi, storing the index on the HDD would be a good idea. Take a look at the nutchBean init method to get an idea of what you need to change. It should be simple: just allow providing a location for the index that is different from the segments folder. Stefan On 06.03.2006 at 12:53

Re: query-more

2006-03-06 Thread Jérôme Charron
When I search something, I've got no attribute type in the HitDetails. I should see it, no ? You should see 3 type fields in HitDetails: one for the primary type, one for the subtype and one for the full content type. Are you sure your index was built with the index-more plugin activated? Jérôme

RE: query-more

2006-03-06 Thread Laurent Michenaud
Thanks, I forgot to activate the index-more plugin. I only activated query-more. -Original Message- From: Jérôme Charron [mailto:[EMAIL PROTECTED] Sent: Monday, March 6, 2006 14:17 To: nutch-user@lucene.apache.org Subject: Re: query-more When I search something, I've got no

Re: how can i go deep?

2006-03-06 Thread Steven Yelton
I'd be glad to, but I need to clean them up a bit (and make them more generic) first. In the meantime, here is a link to an article that I found helpful: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Search for 'recrawl'. You can use this script out of the box

Re: limit fetching by using crawl-urlfilter.txt

2006-03-06 Thread Ravi Chintakunta
You can specify the inclusion and exclusion URL regexes on separate lines or combine them by ORing; it does not make much difference. Make sure that you have this line at the end: -. This ensures that no other sites are crawled. - Ravi On 3/3/06, Jack Tang [EMAIL PROTECTED]
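How an ordered list of + and - regex rules with a final -. behaves can be sketched as below. This is an illustration under stated assumptions, not Nutch's own filter class: UrlFilterSketch is a hypothetical name, the first matching rule is assumed to decide, and a URL that matches no rule is assumed to be rejected. The host example.com is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a first-match-wins URL filter driven by lines
// such as "+^http://([a-z0-9]*\.)*example\.com/" and "-.".
final class UrlFilterSketch {

    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each non-empty, non-comment line must start with '+' (include) or '-' (exclude).
    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // The first rule whose regex is found anywhere in the URL decides the outcome.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false; // assumption of this sketch: unmatched URLs are rejected
    }
}
```

With the two rules from the comment above, http://www.example.com/page is accepted by the first rule, while any other URL falls through to -. and is rejected. Because the first match wins, the catch-all line has to come last.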

Re: Crawl Problem

2006-03-06 Thread Ravi Chintakunta
Nutch is not able to find the urls file you have specified on the command line. The filename you have mentioned is urls.txt and not urls. Correct this by changing the filename or by specifying urls.txt on the command line. - Ravi On 3/3/06, Pine Cone [EMAIL PROTECTED] wrote: Hello, I am

Re: project vitality?

2006-03-06 Thread TDLN
Stefan. I know people who have a 500 million page index, and I personally run crawls at ~300 pages per second. Sorry, but I have to ask: what kind of setup do you have (network, hw, nutch version) that you manage so many pages per second? Unless this is a company secret, it would be very nice to know

Multi dimensional searches

2006-03-06 Thread sudhendra seshachala
So far I have been using Nutch to learn how it works. I have been fairly successful in getting it up and running for some sites on my local machine. I sincerely thank the vibrant group helping me and many others. I have some questions or issues, however

Multi-applications?

2006-03-06 Thread Franz Werfel
Hello, Is it possible to have more than one Nutch application on one Nutch installation? What I would like to do is have several (4-5) indexes relating to independent websites, searchable independently but with just one Nutch install (i.e., one Tomcat webapp). On the indexing side, this

Re: project vitality?

2006-03-06 Thread Stefan Groschupf
Hi Thomas, for this crawl setup we have a test environment of nutch 0.8, 10x AMD boxes, custom linux build, 100Mbit eth1, 1Gb eth0, each box has a 'caching' dns server. Stefan On 06.03.2006 at 15:59, TDLN wrote: Stefan. I know people who have a 500 million page index, and I personally run crawls

Re: project vitality?

2006-03-06 Thread mos
On 3/4/06, Stefan Groschupf: Just a general note, JIRA has voting functionality. This allows everybody to vote on an issue and can show in a very compressed form what the community is looking for. However, it is not used that often yet. It would be great if more users used it. That's a

Ignore external Links

2006-03-06 Thread David Odmark
Hi, We are using 0.8, and I see a property called db.ignore.internal.links that is used by LinkDB to, well, ignore internal links. What we need is a runtime-switchable option that allows the opposite - something like db.ignore.external.links. This is to say, for a given page, we don't want to
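The check such an option would need can be sketched with the JDK alone. This is an assumption-laden illustration, not Nutch code: ExternalLinkSketch is a hypothetical class, "external" is taken to mean "the target host differs from the source host", and unparsable URLs are treated as external.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: decides whether a link leaves the
// host of the page it was found on.
final class ExternalLinkSketch {

    // Returns true when toUrl points to a different host than fromUrl.
    static boolean isExternal(String fromUrl, String toUrl) {
        try {
            String fromHost = new URI(fromUrl).getHost();
            String toHost = new URI(toUrl).getHost();
            if (fromHost == null || toHost == null) {
                return true; // no host to compare (for example a mailto: link)
            }
            // Host names are case-insensitive.
            return !fromHost.equalsIgnoreCase(toHost);
        } catch (URISyntaxException e) {
            return true; // assumption: drop links that cannot be parsed
        }
    }
}
```

A link filter could call isExternal(pageUrl, outlinkUrl) for every outlink and skip those for which it returns true. Comparing whole host names means www.example.com and images.example.com count as different sites; comparing registered domains instead would need extra logic.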

move from nutch 0.71 to 0.8

2006-03-06 Thread Insurance Squared Inc.
I've seen it noted that a complete recrawl is necessary to migrate from 0.71 to 0.8. Is this absolutely necessary? Or could a converter be created to migrate the data? Has anyone created this? I expect at some point I'll have to move versions and something like this would be very useful.

Indexing Excel and Powerpoint

2006-03-06 Thread Laurent Michenaud
I need to index Excel and PowerPoint files in Nutch 0.7.1. I've seen the plugins in nutch-0.8-dev. Is there no version of these plugins for Nutch 0.7.1? And is it possible to index OpenOffice documents? If yes, what version is required? Thanks

HTTPS support?

2006-03-06 Thread David Odmark
Hi, Does Nutch 0.8 support https fetches? If not, are there any active efforts to support it? TIA, David Odmark

Re: Indexing Excel and Powerpoint

2006-03-06 Thread Jérôme Charron
I need to index Excel and Powerpoint files in nutch 0.7.1 ? I've seen the plugins in nutch-0.8-dev. No version of these plugins for nutch 0.7.1 ? Originally, these plugins were written for nutch-0.7.1, and then adapted and committed to nutch-0.8. You can retrieve the original patches in JIRA.

Re: HTTPS support?

2006-03-06 Thread Andrzej Bialecki
David Odmark wrote: Hi, Does Nutch 0.8 support https fetches? If not, are there any active efforts to support it? It does, using the protocol-httpclient plugin. -- Best regards, Andrzej Bialecki

Re: Multi-applications?

2006-03-06 Thread Nutch Newbie
Ravi: Just wondering, did you submit your modification in JIRA? I can't seem to find it. Thanks On 3/6/06, Ravi Chintakunta [EMAIL PROTECTED] wrote: Hi Frank, Have a look at this thread. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03014.html - Ravi On 3/6/06, Franz

RE: Indexing Excel and Powerpoint

2006-03-06 Thread Laurent Michenaud
OK, I found them, thanks. -Original Message- From: Jérôme Charron [mailto:[EMAIL PROTECTED] Sent: Monday, March 6, 2006 18:41 To: nutch-user@lucene.apache.org Subject: Re: Indexing Excel and Powerpoint I need to index Excel and Powerpoint files in nutch 0.7.1 ? I've seen the plugins in

Problem running Nutch Mapred after applying patch for Adaptive refetch

2006-03-06 Thread D . Saravanaraj
Hi Andrzej, I applied your patch for adaptive refetch. In the Indexer.java, the case statement for STATUS_FETCH_UNMODIFIED is missing in the reduce() method. I hope a simple break statement is to be added there. Thanks D.Saravanaraj

Help with bin/nutch server 8081 crawl

2006-03-06 Thread Monu Ogbe
Hello Team, I am having a lot of fun evaluating 0.8-dev, and after following Stefan's and the doc team's tutorials, have got everything working in both local and multi-machine modes using hadoop. In single-machine mode, I have come unstuck, though, trying to expose nutch server on port 8081

RE: query site

2006-03-06 Thread Teruhiko Kurosaka
From: Laurent Michenaud [mailto:[EMAIL PROTECTED] Sent: 2006-3-06 0:13 To: nutch-user@lucene.apache.org Subject: RE: query site I think it is strange. OR is supported by Lucene, so it should be supported by Nutch. No ? No, Nutch doesn't use Lucene's QueryParser. It has its own

Re: Multi-applications?

2006-03-06 Thread Franz Werfel
Ravi, Thanks for your answer, and yes, your problem was very similar to mine -- more complicated, even, since you want to be able to search one or several indices at a time, and I need to search only one. Is your solution available online somewhere, as a patch or plugin? That would be very

Re: project vitality?

2006-03-06 Thread Doug Cutting
Richard Braman wrote: I really do think nutch is great, but I echo Matthias's comments that the community needs to come together and contribute more back. And that comes with the requirement of making sure volunteers are given access to make their contributions part of the project. Here's how

Re: project vitality?

2006-03-06 Thread Doug Cutting
David Wallace wrote: Also, I've lost count of the number of times someone has posted something to the effect of "I'll pay someone to give me Nutch support", simply because they find the existing documentation and mailing lists inadequate. Usually, that person gets told that the best way to get

Re: issues w/ new nutch versions

2006-03-06 Thread Doug Cutting
Florent Gluck wrote: In hadoop jobtracker's log, I can see several tasks being lost, as follows: 060306 184155 Aborting job job_hyhtho 060306 184156 Task 'task_m_7qgat2' has been lost. 060306 184156 Aborting job job_hyhtho 060306 184156 Task 'task_m_lph5qs' has been lost. 060306 184156 Aborting

Re: Help with bin/nutch server 8081 crawl

2006-03-06 Thread Doug Cutting
Monu Ogbe wrote: Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query at java.lang.Class.newInstance0(Unknown Source) at java.lang.Class.newInstance(Unknown Source) at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.jav It

Re: Moving tutorial link to wiki

2006-03-06 Thread Doug Cutting
Matthias Jaekle wrote: Maybe we should move the tutorial to the wiki so it can be commented on. +1 +1 Doug

Re: NullPointerException

2006-03-06 Thread Hasan Diwan
On 06/03/06, Howie Wang [EMAIL PROTECTED] wrote: Is query-basic or query-more included in your nutch-default.xml? It is indeed included in my nutch-site.xml :- <property> <name>plugin.includes</name>

Re: running Nutch

2006-03-06 Thread D . Saravanaraj
Delete the crawl folder that was created by the previous crawl. On 3/7/06, ilango gurusamy [EMAIL PROTECTED] wrote: Hi I am trying to run Nutch by following the instructions given in the tutorial. The environment is SUSE Linux 10, JDK 1.4.2 and Nutch 0.71. And of course Tomcat 5

Re: NullPointerException

2006-03-06 Thread Howie Wang
Hi, Hasan, Looking more carefully at the query-more plugin, it seems that it only adds functionality for date queries and type queries. I think you need to add query-basic to the list also to get it to search the default content. Can you try adding query-basic and running: bin/nutch search http

Re: running Nutch

2006-03-06 Thread ilango gurusamy
Hi, I successfully ran Nutch. Thanks for the tip. Strangely, I remember deleting the crawl directory before, but anyway, you worked the magic for me. By the way, Saravanaraj, are you from TN? What are your research interests with Nutch? ilango D.Saravanaraj [EMAIL PROTECTED] wrote: Delete