CBIR (Re: Jpeg and Exif Plugin)
Jérôme Charron wrote:
> What do you think about a plug-in for indexing Exif metadata on Jpeg? Do you think it's a good idea?
> I think it makes sense. For a general search engine it will allow searching on image comments, for instance. For an image search engine it will allow searching on technical metadata (exposure time, date, ...). But what about images without comments, for instance? How do you retrieve them in a general search engine? The more plugins Nutch has, the more useful it is for many purposes, and so for a wide variety of users.

+1. I agree, it would be a useful addition. Also, I think it would be great if someone familiar with CBIR could contribute a plugin for indexing and searching images by their fingerprints - there are several known techniques for doing this (look at imgSeek for inspiration). Nutch would require only minimal changes to support a suitable front-end.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Jpeg and Exif Plugin
> I think it makes sense. For a general search engine it will allow searching on image comments, for instance. For an image search engine it will allow searching on technical metadata (exposure time, date, ...).

Ok. I can try to make this plug-in next week. I can use this Java library: http://www.drewnoakes.com/code/exif/ I hope there is no licensing problem with using this library inside the Nutch project.
--
Philippe
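The library above would do the real Exif parsing for such a plugin. As a rough illustration of what the plugin keys on, here is a minimal stdlib-only sketch (not Nutch or drewnoakes code) that checks whether a JPEG byte stream carries an APP1/Exif segment; a real parser must of course handle every JPEG marker type:

```java
import java.nio.charset.StandardCharsets;

public class ExifSniffer {
    // Returns true if a JPEG byte stream carries an APP1 segment tagged "Exif".
    // Minimal sketch only: a real parser must handle every marker type.
    public static boolean hasExif(byte[] jpeg) {
        if (jpeg.length < 4 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            return false; // no SOI marker, so not a JPEG
        }
        int i = 2;
        while (i + 4 <= jpeg.length && (jpeg[i] & 0xFF) == 0xFF) {
            int marker = jpeg[i + 1] & 0xFF;
            int len = ((jpeg[i + 2] & 0xFF) << 8) | (jpeg[i + 3] & 0xFF);
            if (marker == 0xE1 && i + 8 <= jpeg.length) {
                String tag = new String(jpeg, i + 4, 4, StandardCharsets.US_ASCII);
                if (tag.equals("Exif")) {
                    return true;
                }
            }
            i += 2 + len; // skip marker bytes plus payload (len counts itself)
        }
        return false;
    }
}
```

A plugin would then hand the segment's TIFF payload to the metadata library and index fields like exposure time and date.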
limit fetching by using crawl-urlfilter.txt
Hi,
I searched the mailing list, but I still have a problem running my test. I want my crawl limited to two sites only, such as *.abc.com/* and *.def.com/*, so I put two lines in crawl-urlfilter.txt:

+^http://([a-z0-9]*\.)*.abc.com/
+^http://([a-z0-9]*\.)*.def.com/

But after running the test, the crawl is not limited to those two sites. In the log I found "not found ...urlfilter-prefix". I wonder if the failure is due to not including crawl-urlfilter.txt in my configuration xml, or if there is a syntax error in the lines above.
thanks,
Michael
nutch and multilingualism
Hi,
What is a good strategy to adopt for multilingual sites? I want Nutch to index a site in its different languages, and then have the search only print results that are in the user's language. Thanks for any advice.
Re: https plugin for Nutch
Another way of crawling a password-protected site is modifying your intranet site to allow the Nutch bot to crawl it without authentication. Since this is your intranet site, this should be simple. You may also want to validate against the crawler machine's IP while allowing the Nutch bot to crawl unauthenticated.
- Ravi Chintakunta

On 3/2/06, Richard Braman [EMAIL PROTECTED] wrote:
> Crawling password-protected sites would require two things:
> 1. being able to submit data to the auth page via POST, as most do not accept the login in the query string (some do, but most don't);
> 2. being able to manage the session during the crawl, so that the server thinks the agent is still logged in as it goes from page to page.
> I did this in an intelligent agent I wrote about 6 years ago, but I don't know enough about the Nutch agent to tell whether it is possible.

-Original Message-
From: Mohini Padhye [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 02, 2006 4:26 PM
To: nutch-user@lucene.apache.org
Subject: RE: https plugin for Nutch

Sameer,
Thanks for the reply. I could configure and use the protocol-http plugin for crawling a site that uses the https protocol. Also, has anyone worked with crawling password-protected sites? My requirement is crawling an intranet site that uses https and user authentication. I searched through the forum but couldn't find anybody who has successfully implemented it. I'm also going through the source files for the protocol-http plugin to see if any changes can be made there for my specific requirement.
Thanks,
Mohini

-Original Message-
From: Sameer Tamsekar [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 01, 2006 10:31 PM
To: nutch-user@lucene.apache.org
Subject: Re: https plugin for Nutch

If you use protocol-httpclient (versus protocol-http) then it should support https. I got this reply from one of the mailing-list users.
Regards,
Sameer

On 3/2/06, Mohini Padhye [EMAIL PROTECTED] wrote:
> I am using nutch-0.7.1. I wanted to know if anyone has successfully implemented an https plugin for Nutch. If not, can someone provide guidelines for developing it, and I can start with the implementation?
> -Mohini
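Richard's second point (keeping the server convinced the agent is still logged in between fetches) comes down to replaying the session cookie the login POST set. A hypothetical sketch with `java.net.CookieManager`, not actual Nutch code; the host name and cookie value are made up:

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.CookieStore;
import java.net.HttpCookie;
import java.net.URI;

public class CrawlSession {
    // One cookie store shared across all fetches: the session cookie that the
    // login POST set is replayed on every later request to the same site, so
    // the server keeps treating the crawler as logged in.
    public static CookieStore afterLogin(URI site, String sessionId) {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        // Pretend the auth page answered the POST with this session cookie.
        HttpCookie cookie = new HttpCookie("JSESSIONID", sessionId);
        cookie.setPath("/");
        manager.getCookieStore().add(site, cookie);
        return manager.getCookieStore();
    }
}
```

In a real crawler the store would be fed from the `Set-Cookie` headers of the login response and consulted for every subsequent request to the same host.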
Re: nutch and multilingualism
> What is a good strategy to adopt for multilingual sites? I want Nutch to index a site in its different languages, and then have the search only print results that are in the user's language.

Hi Laurent,
What I can suggest is to:
1. use the languageidentifier plugin while crawling, in order to guess the language of the content;
2. automatically filter the results by adding a lang:user_agent_lang clause to the query (this could be done in the JSP).
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
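Step 2 above can be as small as appending the clause before the query is handed to Nutch. A hypothetical helper (the class and method names are made up for illustration):

```java
public class LangClause {
    // Append the language clause (e.g. "lang:fr") derived from the user's
    // preferred language, as in step 2 above; leave the query untouched
    // when no preference is known.
    public static String withLangFilter(String query, String userLang) {
        if (userLang == null || userLang.length() == 0) {
            return query; // no preferred language known
        }
        return query + " lang:" + userLang.toLowerCase();
    }
}
```

In the JSP the `userLang` argument would typically come from the request's Accept-Language header or the user's locale.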
Re: Empty search results using a merged index
Hi Byron,
We use Nutch 0.7.1. What version do you use? Maybe Nutch 0.7.1 doesn't support the merged index.
Keren

Byron Miller [EMAIL PROTECTED] wrote:
> Sounds like it couldn't find your segments. Did catalina.out show your segments were found, or report any other errors?

--- keren nutch wrote:
> I merged 20 separate indexes into a master index. After I pointed to the master index, I got empty search results. I looked at catalina.out; it says the following:
>
> 060302 092418 13 query request from 127.0.0.1
> 060302 092418 13 query: canada
> 060302 092418 13 searching for 20 raw hits
> 060302 092419 13 total hits: 1319570
>
> It seems that it got results. Please let me know why I got empty search results.
> Best regards, Keren
Re: limit fetching by using crawl-urlfilter.txt
hi,
I tried this. Actually, in my case one site ends with .net and the other with .org, so I modified it to

+^http://([a-z0-9]*\.)*(abc.net|def.org)/

and ran another test. It doesn't seem to work: I saw a site other than abc and def being fetched. Any hints?
thanks,
Michael

--- sudhendra seshachala [EMAIL PROTECTED] wrote:
> Hi, try the following pattern:
> +^http://([a-z0-9]*\.)*(abc|def).com/
> I was able to search a couple of sites using a similar pattern, if this is what you are asking.
>
> Michael Ji [EMAIL PROTECTED] wrote:
> > I want my crawl limited to two sites only, such as *.abc.com/* and *.def.com/* ...

Sudhi Seshachala
http://sudhilogs.blogspot.com/
query site
Hi,
How do you use the query-site plugin? I've tried:
site:http://localhost:8080
but it returns nothing.
Thanks
How to set up for merged index
Hi,
I merged indexes from the directory /home/nutch/segments, which contains 20 subdirectories. My output index name is index. Then I moved the index under /home/nutch/merged_index/. In nutch-site.xml I set 'searcher.dir' to '/home/nutch/merged_index'. After that, I restarted Tomcat. When I did a search, I got this error:

java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:190)
        at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:298)
        at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:138)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:696)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:809)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:200)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:146)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:209)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:596)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:433)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:948)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:144)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:596)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:433)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:948)
        at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2358)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:133)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:596)
        at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:118)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:594)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:116)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:594)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:433)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:948)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:127)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:596)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:433)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:948)
        at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:152)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
        at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683)
        at java.lang.Thread.run(Thread.java:534)
Caused by: java.lang.NullPointerException
        at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:144)
        at org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:163)

Please let me know what's wrong with my settings.
Best regards,
Keren
RE: query site
Hi,
I found it; it is:
site:localhost
Now, can I do a search on both site1 and site2?
site:site1 OR site:site2
does not work.
Thanks

-Original Message-
From: Laurent Michenaud [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 3, 2006 17:02
To: nutch-user@lucene.apache.org
Subject: query site

> How do you use the query-site plugin? I've tried site:http://localhost:8080 but it returns nothing.
RE: Question about Index Writing/Merging
Thanks, that's exactly what I was thinking. Do you have any recommendations on maximum index size? (Obviously we'd be testing ourselves, but it's good to get an idea.)
Tim

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 02, 2006 7:34 PM
To: nutch-user@lucene.apache.org
Subject: Re: Question about Index Writing/Merging

Tim Patton wrote:
> I'm working on a project that uses pieces of Nutch to store a Lucene index in Hadoop (basically I am using the FsDirectory and related classes). When trying to write to an index I got an unsupported-operation exception, since FsDirectory doesn't support seek, which Lucene uses when closing an IndexWriter; the file system is write-once. After looking through the Nutch code I saw that an index is worked on locally, whether being written or merged, and then transferred into the dfs when finished. I was just checking to make sure I understood this correctly.

Yes, this is correct.

> If I were to work on a multi-gigabyte index, I would need that much free space on my local drive to transfer the index to, and it would take a while to copy each way. How does this work for the really huge indexes people want to build with Nutch? Would there be many smaller Lucene indexes in the dfs, since obviously one huge terabyte index couldn't be downloaded? I'm just trying to get a better understanding of how Nutch works.

Terabyte indexes aren't actually very useful, since they take too long to search. So with big collections (100M pages) one keeps multiple indexes and uses distributed search to search them all in parallel.

Doug
Re: limit fetching by using crawl-urlfilter.txt
On 3/3/06, Michael Ji [EMAIL PROTECTED] wrote:
> hi, I tried this. Actually, in my case one site ends with .net and the other with .org, so I modified it to
> +^http://([a-z0-9]*\.)*(abc.net|def.org)/

I guess '.' is a metacharacter in regexps, so please try

+^http://([a-z0-9]*\.)*(abc\.net|def\.org)/

Good luck!

> and ran another test. It doesn't seem to work: I saw a site other than abc and def being fetched. Any hints?
> thanks, Michael

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
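The escaping point is easy to verify with `java.util.regex`. This is a standalone check, not the actual Nutch urlfilter code; note how the unescaped '.' lets an unrelated host through:

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    // Corrected crawl-urlfilter pattern: dots in the host names are escaped.
    static final Pattern ESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc\\.net|def\\.org)/");
    // Original pattern: an unescaped '.' matches any character.
    static final Pattern UNESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc.net|def.org)/");

    static boolean accepts(Pattern p, String url) {
        return p.matcher(url).find();
    }
}
```

With the escaped pattern, http://www.abc.net/ is accepted and a host like www.abcxnet is rejected; the unescaped pattern accepts both, which matches the off-site fetches Michael saw.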
Tutorial: indexing
There seems to be another error in the tutorial. The command

bin/nutch index indexes crawl/linkdb crawl/segments/*

should IMHO read

bin/nutch index indexes crawl/crawldb crawl/linkdb crawl/segments/*

See also the usage of nutch index:

Usage: index <crawldb> <linkdb> <segment> ...

Cheers
Patrice
Nutch doesn't support Korean?
I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means the Unicode character with hex value xxxx) are not part of the LETTER or CJK class. It seems to me that Nutch cannot handle Korean documents at all. Is anybody successfully using Nutch for Korean?
-kuro
Crawl Problem
Hello,
I am having a problem when I run bin/nutch crawl urls -dir ct -depth 3 > crawl.log
I get this error in my crawl.log file:

Created webdb at LocalFS, /root/Desktop/nutch/nutch-0.7/ct/db
Exception in thread "main" java.io.FileNotFoundException: urls (No such file or directory)

My urls.txt file looks like this:

http://localhost:8080/tomcat-docs/introduction.html

My crawl-urlfilter.txt looks like this:

+^http://([a-z0-9]*\.)*localhost:8080/

I am running my Tomcat web server as localhost, and I want to crawl the content of my web server. My web server is not connected to the internet.
Thanks,
P. Cone
project vitality?
Hi there,
I'm new around here. The mailing lists seem to have a pretty steady stream of traffic, but the website hasn't been updated since August, and there's only a handful of news items before that. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept, or a mature, ready-for-production product?
thanks for your time,
--
matt wilkie
Geographic Information, Information Management and Technology, Yukon Department of Environment
10 Burns Road * Whitehorse, Yukon * Y1A 4Y9
867-667-8133 Tel * 867-393-7003 Fax
http://environmentyukon.gov.yk.ca/geomatics/
RE: project vitality?
I think it is still very much at the proof-of-concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. I have tried to get the tutorial and FAQs updated, but I haven't heard back.

-Original Message-
From: Matt Wilkie [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 03, 2006 6:34 PM
To: nutch-user@lucene.apache.org
Subject: project vitality?

> The mailing lists seem to have a pretty steady stream of traffic, but the website hasn't been updated since August. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept, or a mature, ready-for-production product?
RE: project vitality?
I wouldn't call Nutch 0.7.x proof-of-concept. There are several production sites running it already: http://wiki.apache.org/nutch/PublicServers Plus I think Technorati is built on Nutch and/or Lucene. That said, the docs could be better, and it's probably a good idea to know Java, since you might have to tweak the code a bit to get the exact behavior you want. If you don't have special needs, you could get something like a site search up in very little time. The newer versions still seem to be changing a lot, though. I've been waiting for the dust to settle before I see if I want to upgrade.
Howie

> I think it is still very much at the proof-of-concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the documentation on it lacks luster. I have tried to get the tutorial and FAQs updated, but I haven't heard back.
language-identifier and language filter
Hello,
I enabled the language-identifier plugin and indexed some documents. But adding lang:en to the query does not seem to filter the docs by language. Instead, it tries to find documents that have the two terms "lang" and "en". Am I using the wrong syntax? Do I have to do more than adding language-identifier to the plugin list in conf/nutch-site.xml?
-kuro
Re: project vitality?
Passed the concept stage. Technorati uses Lucene. In open source projects, the last thing people want to do is documentation. Anybody know why Yahoo took down their Nutch server?

----- Original Message -----
From: Howie Wang [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; nutch-user@lucene.apache.org
Sent: Saturday, March 04, 2006 1:09 AM
Subject: RE: project vitality?

> I wouldn't call Nutch 0.7.x proof-of-concept. There are several production sites running it already: http://wiki.apache.org/nutch/PublicServers
Re: Nutch doesn't support Korean?
Hello,
There was a similar issue with Lucene's StandardTokenizer.jj:
http://issues.apache.org/jira/browse/LUCENE-444 and http://issues.apache.org/jira/browse/LUCENE-461
I have almost no experience with Nutch, but you can handle it like those issues above.

On 3/4/06, Teruhiko Kurosaka [EMAIL PROTECTED] wrote:
> I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF) are not part of the LETTER or CJK class. It seems to me that Nutch cannot handle Korean documents at all. Is anybody successfully using Nutch for Korean?
> -kuro

--
Cheolgoo
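The gap is easy to demonstrate: the Hangul Syllables range cited in the report is a first-class Unicode block, and a tokenizer's LETTER/CJK character classes would need to cover it for Korean text to be tokenized at all. A minimal stdlib check:

```java
public class HangulCheck {
    // Hangul Syllables block as cited in the report: U+AC00 .. U+D7AF.
    // Any tokenizer that should handle Korean must treat these as letters.
    public static boolean isHangulSyllable(char c) {
        return c >= '\uAC00' && c <= '\uD7AF';
    }
}
```

Adding an equivalent range to the grammar's character classes, as the Lucene issues above did for StandardTokenizer.jj, is the kind of fix being suggested.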
Re: project vitality?
I could not agree with Doug more. This is one of the best. I am trying UIMA too, though UIMA also uses Lucene; as of today it is still a framework and community in its early stages. In fact, the nightly builds have good improvements over 0.7.1. Any serious user or adopter should be trying a snapshot of the nightly build. Doug, it would be better if there were an official 0.8 release, or at least an RC, before a major 1.0 release. I am a newbie, so let me know about ideas on releasing 0.8.
Thanks,
Sudhi

Doug Cutting [EMAIL PROTECTED] wrote:
> Richard Braman wrote:
> > I think it is still very much at the proof-of-concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the documentation on it lacks luster.
>
> It stands to reason that if the documentation lacks luster the project must be dead! Seriously, this is an active project. It is not yet 1.0, so don't expect polish. If it doesn't look easily usable to you then perhaps it is not. It's still for early adopters. The commit list shows a fair amount of activity: http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html
>
> Lots of public sites are using Nutch. Some are listed at http://wiki.apache.org/nutch/PublicServers, but many are not, like http://search.bittorrent.com/.
>
> > I have tried to get the tutorial and faqs updated, but I haven't heard back.
>
> This is an all-volunteer project. If you find a bug, please file a bug report, so that other folks are aware of it. Better yet, if you have a solution or improvement, please construct a patch file (even for documentation) and attach it to a bug report. On the wiki, anyone can make themselves an account and update documentation. We don't boss folks around here, or complain. We pitch in and help.
>
> Doug

Sudhi Seshachala
http://sudhilogs.blogspot.com/