CBIR (Re: Jpeg and Exif Plugin)

2006-03-03 Thread Andrzej Bialecki
Jérôme Charron wrote: What do you think about a plug-in for indexing Exif metadata in JPEG files? Do you think it's a good idea? I think it makes sense. For a general search engine it would allow searching on image comments, for instance. For an image search engine it would allow searching on

Re: Jpeg and Exif Plugin

2006-03-03 Thread Philippe EUGENE
I think it makes sense. For a general search engine it would allow searching on image comments, for instance. For an image search engine it would allow searching on technical metadata (exposure time, date, ...) OK. I can try to make this plug-in next week. I can use this Java library:

limit fetching by using crawl-urlfilter.txt

2006-03-03 Thread Michael Ji
Hi, I searched the mailing list posts but still have a problem running my test. I want my crawl to be limited to two sites only, such as *.abc.com/* and *.def.com/*, so I put two lines in crawl-urlfilter.txt: +^http://([a-z0-9]*\.)*.abc.com/ +^http://([a-z0-9]*\.)*.def.com/ But

nutch and multilingualism

2006-03-03 Thread Laurent Michenaud
Hi, What is a good strategy to adopt for multilingual sites? I want Nutch to index a site in its different languages, and then have the search show only results that are in the user's language. Thanks for any advice.

Re: https plugin for Nutch

2006-03-03 Thread Ravi Chintakunta
Another way of crawling a password-protected site is to modify your intranet site to allow the Nutch bot to crawl it without authentication. Since this is your intranet site, this should be simple. You may also have to validate against the crawler machine's IP while allowing the Nutch bot

Re: nutch and multilingualism

2006-03-03 Thread Jérôme Charron
What is a good strategy to adopt for multilingual sites? I want Nutch to index a site in its different languages, and then have the search show only results that are in the user's language. Hi Laurent, What I can suggest is to: 1. use the languageidentifier plugin while crawling in order to

Re: Empty search results using a merged index

2006-03-03 Thread keren nutch
Hi Byron, We use Nutch 0.7.1. What version do you use? Maybe Nutch 0.7.1 doesn't support merged indexes. Keren Byron Miller [EMAIL PROTECTED] wrote: Sounds like it couldn't find your segments. Did catalina.out show your segments were found or report any other errors? --- keren nutch

Re: limit fetching by using crawl-urlfilter.txt

2006-03-03 Thread Michael Ji
Hi, I tried this. Actually, in my case one site ends with .net and the other with .org, so I modified it to +^http://([a-z0-9]*\.)*(abc.net|def.org)/ and ran another test. It doesn't seem to work, because I saw a site other than abc and def being fetched. Any hints? Thanks, Michael ---

query site

2006-03-03 Thread Laurent Michenaud
Hi, How do you use query-site? I've tried: site:http://localhost:8080 but it returns nothing. Thanks

How to set up for merged index

2006-03-03 Thread keren nutch
Hi, I merged the indexes from the directory /home/nutch/segmetns, which contains 20 subdirectories. My outputIndex name is index. Then I moved the index under /home/nutch/merged_index/. In nutch-site.xml, I set 'searcher.dir' to ' /home/nutch/merged_index'. After that, I restarted

RE: query site

2006-03-03 Thread Laurent Michenaud
Hi, I found it; it is: site:localhost Now, how can I search on both site1 and site2? site:site1 OR site:site2 does not work. Thanks -Original Message- From: Laurent Michenaud [mailto:[EMAIL PROTECTED] Sent: Friday, 3 March 2006 17:02 To: nutch-user@lucene.apache.org

RE: Question about Index Writing/Merging

2006-03-03 Thread Tim Patton
Thanks, that's exactly what I was thinking. Do you have any recommendations on maximum index size (obviously we'd be testing ourselves, but it's good to get an idea)? Tim -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, March 02, 2006 7:34 PM To:

Re: limit fetching by using crawl-urlfilter.txt

2006-03-03 Thread Jack Tang
On 3/3/06, Michael Ji [EMAIL PROTECTED] wrote: hi, I tried this, actually in my case, one site ends with .net and the other is .org so I modified it to +^http://([a-z0-9]*\.)*(abc.net|def.org)/ I guess '.' is a metacharacter in regexps, so please try +^http://([a-z0-9]*\.)*(abc\.net|def\.org)/ Good
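The effect of escaping the dots can be checked outside Nutch. The following is a minimal sketch in Python, not Nutch's own Java filter code: it assumes the filter applies each pattern as an anchored search, drops the leading '+' that marks an include rule in crawl-urlfilter.txt, and uses hypothetical test URLs (only abc.net and def.org come from the thread).

```python
import re

# Patterns from the thread, without the leading '+' include marker.
UNESCAPED = r"^http://([a-z0-9]*\.)*(abc.net|def.org)/"
ESCAPED = r"^http://([a-z0-9]*\.)*(abc\.net|def\.org)/"


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern is found in the URL."""
    return re.search(pattern, url) is not None


# Hypothetical URLs chosen for illustration only.
SAMPLE_URLS = [
    "http://www.abc.net/index.html",  # intended site, with subdomain
    "http://def.org/",                # intended site, no subdomain
    "http://abcxnet/page",            # differs from abc.net by one character
    "http://www.example.com/",        # unrelated site
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(url, accepts(UNESCAPED, url), accepts(ESCAPED, url))
```

In this sketch the unescaped pattern additionally accepts only hosts such as abcxnet, where a single arbitrary character stands in for the dot, and both patterns reject an unrelated host. Escaping is therefore the stricter form, but the unescaped dots alone may not account for an unrelated site being fetched.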

Tutorial: indexing

2006-03-03 Thread Patrice Neff
There seems to be another error in the tutorial. The command bin/nutch index indexes crawl/linkdb crawl/segments/* should IMHO read bin/nutch index indexes crawl/crawldb crawl/linkdb crawl/segments/* See also the usage of nutch index: Usage: index crawldb linkdb segment ... Cheers Patrice

Nutch doesn't support Korean?

2006-03-03 Thread Teruhiko Kurosaka
I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; U+ means a Unicode character of the hex value) are not part of the LETTER or CJK class. It seems to me that Nutch cannot handle Korean documents at all. Is anybody successfully using Nutch for Korean?
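To make the range in question concrete, here is a minimal sketch in Python of a membership test for the Hangul Syllables range quoted above (U+AC00 to U+D7AF). It does not reproduce the NutchAnalysis.jj grammar; the function and constant names are hypothetical.

```python
# Range quoted in the message above: Hangul Syllables, U+AC00 ... U+D7AF.
HANGUL_SYLLABLES_START = 0xAC00
HANGUL_SYLLABLES_END = 0xD7AF


def is_hangul_syllable(ch: str) -> bool:
    """Return True if the single character falls in the Hangul Syllables range."""
    return HANGUL_SYLLABLES_START <= ord(ch) <= HANGUL_SYLLABLES_END


def contains_hangul(text: str) -> bool:
    """Return True if any character of the text is a Hangul syllable."""
    return any(is_hangul_syllable(ch) for ch in text)


if __name__ == "__main__":
    for sample in ["한국어", "nutch", "中文"]:
        print(sample, contains_hangul(sample))
```

A check like this could be used to spot documents containing characters that a tokenizer's LETTER or CJK classes would need to cover for Korean text.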

Crawl Problem

2006-03-03 Thread Pine Cone
Hello, I am having a problem when I run bin/nutch crawl urls -dir ct -depth 3 crawl.log I get this error in my crawl.log file: Created webdb at LocalFS, /root/Desktop/nutch/nutch-0.7/ct/db Exception in thread main java.io.FileNotFoundException: urls (No such file or

project vitality?

2006-03-03 Thread Matt Wilkie
Hi there, I'm new around here. The mailing lists seem to have a pretty steady stream of traffic, but the website hasn't been updated since August, and there's only a handful of news items before that. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept or a mature

RE: project vitality?

2006-03-03 Thread Richard Braman
I think it is still very much at the proof-of-concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it are lackluster. I have tried to get the tutorial and FAQs updated, but I haven't heard back. -Original

RE: project vitality?

2006-03-03 Thread Howie Wang
I wouldn't call Nutch 0.7.x proof-of-concept. There are several production sites running it already: http://wiki.apache.org/nutch/PublicServers Plus I think technorati is built on either Nutch and/or Lucene. That said, the doc could be better, and it's probably a good idea if you know Java

language-identifier and language filter

2006-03-03 Thread Teruhiko Kurosaka
Hello, I enabled the language-identifier plugin and indexed some documents. But adding lang:en to the query does not seem to filter the docs by language. Instead, it tries to find documents that have the two terms lang and en. Am I using the wrong syntax? Do I have to do more than adding

Re: project vitality?

2006-03-03 Thread gekkokid
It has passed the concept stage; Technorati uses Lucene. In open source projects the last thing people want to do is documentation. Does anybody know why Yahoo took down their Nutch server? - Original Message - From: Howie Wang [EMAIL PROTECTED] To: [EMAIL PROTECTED];

Re: Nutch doesn't support Korean?

2006-03-03 Thread Cheolgoo Kang
Hello, There was a similar issue with Lucene's StandardTokenizer.jj: http://issues.apache.org/jira/browse/LUCENE-444 and http://issues.apache.org/jira/browse/LUCENE-461 I have almost no experience with Nutch, but you can handle it like those issues above. On 3/4/06, Teruhiko Kurosaka [EMAIL

Re: project vitality?

2006-03-03 Thread sudhendra seshachala
I could not agree with Doug more. This is one of the best. I am trying UIMA too, though UIMA also uses Lucene; as of today, it is still a framework and community in its early stages. In fact, the nightly builds have good improvements over 0.71. Any serious user or adopter should be trying