Jérôme Charron wrote:
What do you think about a plug-in for indexing EXIF metadata in JPEGs?
Do you think it's a good idea?
I think it makes sense.
For a general search engine, it will allow searching on image comments, for
instance.
For an image search engine, it will allow searching on technical metadata
(exposure time, date, ...).
OK. I can try to make this plug-in next week.
I can use this Java library:
Hi,
I searched the mailing-list archives, but I still have a problem
getting my test to run.
Actually, I want my crawl limited to two sites
only,
such as *.abc.com/*
and *.def.com/*,
so I put these two lines in crawl-urlfilter.txt:
+^http://([a-z0-9]*\.)*.abc.com/
+^http://([a-z0-9]*\.)*.def.com/
But
Hi,
What is a good strategy for multilingual sites?
I want Nutch to index a site in its different languages, and
then have the search return only results in the user's language.
Thanks for any advice.
Another way of crawling a password-protected site is to modify your
intranet site to allow the Nutch bot to crawl the site without
authentication. Since this is your intranet site, this should be
simple. You may also have to validate against the crawler
machine's IP while allowing the Nutch bot
What is a good strategy for multilingual sites?
I want Nutch to index a site in its different languages, and
then have the search return only results in the user's language.
Hi Laurent,
What I can suggest is to :
1. use the languageidentifier plugin while crawling in order to
Hi Byron,
We use Nutch 0.7.1. What version do you use? Maybe Nutch 0.7.1 doesn't support
the merged index.
Keren
Byron Miller [EMAIL PROTECTED] wrote: Sounds like it couldn't find your
segments. Did
catalina.out show your segments were found or report
any other errors?
--- keren nutch
hi,
I tried this. Actually, in my case, one site ends with
.net and the other with .org,
so I modified it to
+^http://([a-z0-9]*\.)*(abc.net|def.org)/
and ran another test. It doesn't seem to work, because I
saw a site other than abc and def being fetched.
Any hints?
thanks,
Michael,
---
Hi,
How do you use the site: query?
I've tried site:http://localhost:8080 but it returns nothing.
Thanks
Hi,
I merged indexes from the directory /home/nutch/segments, which contains
20 subdirectories. My output index name is index. Then, I moved the index under
/home/nutch/merged_index/. In nutch-site.xml, I set 'searcher.dir' to
'/home/nutch/merged_index'. After that, I restarted
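For reference, a minimal sketch of the relevant nutch-site.xml setting (assuming Nutch 0.7.x, where the directory named by searcher.dir is expected to contain both the index and the segments directories, which may explain the "couldn't find your segments" reply below):

```xml
<!-- nutch-site.xml: point the search webapp at the merged index.
     In 0.7.x this directory should also contain (or link to) the
     segments directory, or the searcher reports missing segments. -->
<property>
  <name>searcher.dir</name>
  <value>/home/nutch/merged_index</value>
</property>
```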
Hi, I found it; it is:
site:localhost
Now, can I do a search on both site1 and site2?
site:site1 OR site:site2 does not work.
Thanks
-Original Message-
From: Laurent Michenaud [mailto:[EMAIL PROTECTED]
Sent: Friday, March 3, 2006 5:02 PM
To: nutch-user@lucene.apache.org
Thanks, that's exactly what I was thinking. Do you have any recommendations
on maximum index size (obviously we'd be testing ourselves, but it's good to
get an idea)?
Tim
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 02, 2006 7:34 PM
To:
On 3/3/06, Michael Ji [EMAIL PROTECTED] wrote:
hi,
I tried this, actually in my case, one site ends with
.net and the other is .org
so I modified it to
+^http://([a-z0-9]*\.)*(abc.net|def.org)/
I guess '.' is a metacharacter in regexps, so please try
+^http://([a-z0-9]*\.)*(abc\.net|def\.org)/
Good
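To see why the escaping matters, here is a small standalone check (plain java.util.regex, not Nutch itself; "abcxnet" is a made-up host used only to show the difference):

```java
import java.util.regex.Pattern;

public class UrlFilterRegexDemo {
    public static void main(String[] args) {
        // Unescaped '.' matches ANY character, so unintended hosts slip through.
        Pattern loose  = Pattern.compile("^http://([a-z0-9]*\\.)*(abc.net|def.org)/");
        // Escaped '\.' matches only a literal dot.
        Pattern strict = Pattern.compile("^http://([a-z0-9]*\\.)*(abc\\.net|def\\.org)/");

        String unwanted = "http://abcxnet/";      // not one of the two sites
        String wanted   = "http://www.abc.net/";  // should still pass

        System.out.println(loose.matcher(unwanted).find());  // true  (the bug)
        System.out.println(strict.matcher(unwanted).find()); // false (fixed)
        System.out.println(strict.matcher(wanted).find());   // true
    }
}
```

Note that crawl-urlfilter.txt patterns are applied with find-style matching, so the leading ^ anchor is what keeps the host check at the start of the URL.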
There seems to be another error in the tutorial.
The command
bin/nutch index indexes crawl/linkdb crawl/segments/*
should IMHO read
bin/nutch index indexes crawl/crawldb crawl/linkdb crawl/segments/*
See also the usage of nutch index:
Usage: index crawldb linkdb segment ...
Cheers
Patrice
I was browsing NutchAnalysis.jj and found that
Hangul syllables (U+AC00 ... U+D7AF; U+nnnn means
the Unicode character with hex value nnnn) are not
part of the LETTER or CJK class. It seems to me that
Nutch cannot handle Korean documents at all.
Is anybody successfully using Nutch for Korean?
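Not a Nutch patch, but a quick sanity check of the code-point range using the standard JDK (Character.UnicodeBlock), to confirm that real Korean text falls in the block that the grammar omits:

```java
public class HangulRangeCheck {
    public static void main(String[] args) {
        char first = '\uAC00'; // 가, first Hangul syllable
        char last  = '\uD7A3'; // 힣, last assigned Hangul syllable

        // Both fall in the HANGUL_SYLLABLES block (U+AC00..U+D7AF),
        // which the LETTER/CJK classes in NutchAnalysis.jj do not cover.
        System.out.println(Character.UnicodeBlock.of(first) == Character.UnicodeBlock.HANGUL_SYLLABLES); // true
        System.out.println(Character.UnicodeBlock.of(last)  == Character.UnicodeBlock.HANGUL_SYLLABLES); // true
    }
}
```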
Hello,
I am having a problem when I run bin/nutch crawl urls -dir ct -depth
3 > crawl.log.
I get this error in my crawl.log file:
Created webdb at LocalFS, /root/Desktop/nutch/nutch-0.7/ct/db
Exception in thread "main" java.io.FileNotFoundException: urls (No such file
or directory)
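In 0.7.x, bin/nutch crawl expects a flat file of seed URLs (here named urls) in the directory you run it from; the exception above usually just means that file is missing. A minimal sketch, with example.com as a placeholder seed:

```shell
# Create the seed file the crawl command is looking for
echo "http://www.example.com/" > urls

# Then re-run (redirecting output as in the original command):
# bin/nutch crawl urls -dir ct -depth 3 > crawl.log 2>&1
```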
Hi there, I'm new around here. The mailing lists seem to have a pretty
steady stream of traffic, but the website hasn't been updated since
August, and there's only a handful of news items before that. What is
the vitality of the Nutch project? Is it basically a laboratory proof of
concept or a mature
I think it is still very much at the proof-of-concept stage. I think it is
close, but as you have mentioned, the website is severely out of date,
and the information and documentation on it lack luster. I have tried
to get the tutorial and FAQs updated, but I haven't heard back.
-Original
I wouldn't call Nutch 0.7.x proof-of-concept. There are several
production sites running it already:
http://wiki.apache.org/nutch/PublicServers
Plus, I think Technorati is built on Nutch and/or Lucene.
That said, the doc could be better, and it's probably a good idea
if you know Java
Hello,
I enabled the language-identifier plugin and indexed some documents.
But adding lang:en to the query does not seem to filter the
docs by language. Instead, it tries to find documents
that contain the two terms lang and en. Am I using the wrong syntax?
Do I have to do more than adding
It's past the concept stage; Technorati uses Lucene. In open source projects,
the last thing people want to do is documentation.
Anybody know why Yahoo took down their Nutch server?
- Original Message -
From: Howie Wang [EMAIL PROTECTED]
To: [EMAIL PROTECTED];
Hello,
There was similar issue with Lucene's StandardTokenizer.jj.
http://issues.apache.org/jira/browse/LUCENE-444
and
http://issues.apache.org/jira/browse/LUCENE-461
I have almost no experience with Nutch, but you could handle it like
those issues above.
On 3/4/06, Teruhiko Kurosaka [EMAIL
I could not agree with Doug more. This is one of the best. I am trying UIMA
too, though UIMA also uses Lucene. As of today, it is still a framework and
community in its early stages.
In fact, the nightly builds have good improvements over 0.7.1.
Any serious user or adopter should be trying