CBIR (Re: Jpeg and Exif Plugin)
Jérôme Charron wrote:
> What do you think about a plug-in for indexing Exif metadata on Jpeg? Do you think it's a good idea?
> I think it makes sense. For a general search engine it will allow searching on image comments, for instance. For an image search engine it will allow searching on technical metadata (exposure time, date, ...). But what about images without comments, for instance? How do you retrieve them in a general search engine? The more plugins Nutch has, the more useful it is for many purposes, and so for a wide variety of users.

+1. I agree, it would be a useful addition. Also, I think it would be great if someone familiar with CBIR could contribute a plugin for indexing and searching images by their fingerprints - there are several known techniques for doing this (look at imgSeek for inspiration). Nutch would require only minimal changes to support a suitable front-end.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Jpeg and Exif Plugin
> I think it makes sense. For a general search engine it will allow searching on image comments, for instance. For an image search engine it will allow searching on technical metadata (exposure time, date, ...).

Ok. I can try to make this plug-in next week. I can use this Java library: http://www.drewnoakes.com/code/exif/ I hope there is no licensing problem with using this library inside the Nutch project.
--
Philippe
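The library above would do the real Exif parsing for such a plugin. As a rough illustration of what the plugin keys on, here is a minimal stdlib-only sketch (not Nutch or drewnoakes code) that checks whether a JPEG byte stream carries an APP1/Exif segment; a real parser must of course handle every JPEG marker type:

```java
import java.nio.charset.StandardCharsets;

public class ExifSniffer {
    // Returns true if a JPEG byte stream carries an APP1 segment tagged "Exif".
    // Minimal sketch only: a real parser must handle every marker type.
    public static boolean hasExif(byte[] jpeg) {
        if (jpeg.length < 4 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            return false; // no SOI marker, so not a JPEG
        }
        int i = 2;
        while (i + 4 <= jpeg.length && (jpeg[i] & 0xFF) == 0xFF) {
            int marker = jpeg[i + 1] & 0xFF;
            int len = ((jpeg[i + 2] & 0xFF) << 8) | (jpeg[i + 3] & 0xFF);
            if (marker == 0xE1 && i + 8 <= jpeg.length) {
                String tag = new String(jpeg, i + 4, 4, StandardCharsets.US_ASCII);
                if (tag.equals("Exif")) {
                    return true;
                }
            }
            i += 2 + len; // skip marker bytes plus payload (len counts itself)
        }
        return false;
    }
}
```

A plugin would then hand the segment's TIFF payload to the metadata library and index fields like exposure time and date.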
limit fetching by using crawl-urlfilter.txt
Hi,
I searched the mailing list, but I still have a problem running my test. I want my crawl limited to two sites only, such as *.abc.com/* and *.def.com/*, so I put two lines in crawl-urlfilter.txt:

+^http://([a-z0-9]*\.)*.abc.com/
+^http://([a-z0-9]*\.)*.def.com/

But after running the test, the crawl is not limited to those two sites. In the log I found "not found ...urlfilter-prefix". I wonder if the failure is due to not including crawl-urlfilter.txt in my configuration xml, or if there is a syntax error in the lines above.
thanks,
Michael
nutch and multilingualism
Hi,
What is a good strategy to adopt for multilingual sites? I want Nutch to index a site in its different languages, and then have the search only print results that are in the user's language. Thanks for any advice.
Re: https plugin for Nutch
Another way of crawling a password-protected site is modifying your intranet site to allow the Nutch bot to crawl it without authentication. Since this is your intranet site, this should be simple. You may also want to validate against the crawler machine's IP while allowing the Nutch bot to crawl unauthenticated.
- Ravi Chintakunta

On 3/2/06, Richard Braman [EMAIL PROTECTED] wrote:
> Crawling password-protected sites would require two things:
> 1. being able to submit data to the auth page via POST, as most do not accept the login in the query string (some do, but most don't);
> 2. being able to manage the session during the crawl, so that the server thinks the agent is still logged in as it goes from page to page.
> I did this in an intelligent agent I wrote about 6 years ago, but I don't know enough about the Nutch agent to tell whether it is possible.

-Original Message-
From: Mohini Padhye [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 02, 2006 4:26 PM
To: nutch-user@lucene.apache.org
Subject: RE: https plugin for Nutch

Sameer,
Thanks for the reply. I could configure and use the protocol-http plugin for crawling a site that uses the https protocol. Also, has anyone worked with crawling password-protected sites? My requirement is crawling an intranet site that uses https and user authentication. I searched through the forum but couldn't find anybody who has successfully implemented it. I'm also going through the source files for the protocol-http plugin to see if any changes can be made there for my specific requirement.
Thanks,
Mohini

-Original Message-
From: Sameer Tamsekar [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 01, 2006 10:31 PM
To: nutch-user@lucene.apache.org
Subject: Re: https plugin for Nutch

If you use protocol-httpclient (versus protocol-http) then it should support https. I got this reply from one of the mailing-list users.
Regards,
Sameer

On 3/2/06, Mohini Padhye [EMAIL PROTECTED] wrote:
> I am using nutch-0.7.1. I wanted to know if anyone has successfully implemented an https plugin for Nutch. If not, can someone provide guidelines for developing it, and I can start with the implementation?
> -Mohini
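Richard's second point (keeping the server convinced the agent is still logged in between fetches) comes down to replaying the session cookie the login POST set. A hypothetical sketch with `java.net.CookieManager`, not actual Nutch code; the host name and cookie value are made up:

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.CookieStore;
import java.net.HttpCookie;
import java.net.URI;

public class CrawlSession {
    // One cookie store shared across all fetches: the session cookie that the
    // login POST set is replayed on every later request to the same site, so
    // the server keeps treating the crawler as logged in.
    public static CookieStore afterLogin(URI site, String sessionId) {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        // Pretend the auth page answered the POST with this session cookie.
        HttpCookie cookie = new HttpCookie("JSESSIONID", sessionId);
        cookie.setPath("/");
        manager.getCookieStore().add(site, cookie);
        return manager.getCookieStore();
    }
}
```

In a real crawler the store would be fed from the `Set-Cookie` headers of the login response and consulted for every subsequent request to the same host.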
Re: nutch and multilingualism
> What is a good strategy to adopt for multilingual sites? I want Nutch to index a site in its different languages, and then have the search only print results that are in the user's language.

Hi Laurent,
What I can suggest is to:
1. use the languageidentifier plugin while crawling, in order to guess the language of the content;
2. automatically filter the results by adding a lang:user_agent_lang clause to the query (this could be done in the JSP).
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
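Step 2 above can be as small as appending the clause before the query is handed to Nutch. A hypothetical helper (the class and method names are made up for illustration):

```java
public class LangClause {
    // Append the language clause (e.g. "lang:fr") derived from the user's
    // preferred language, as in step 2 above; leave the query untouched
    // when no preference is known.
    public static String withLangFilter(String query, String userLang) {
        if (userLang == null || userLang.length() == 0) {
            return query; // no preferred language known
        }
        return query + " lang:" + userLang.toLowerCase();
    }
}
```

In the JSP the `userLang` argument would typically come from the request's Accept-Language header or the user's locale.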
Re: Empty search results using a merged index
Hi Byron,
We use Nutch 0.7.1. What version do you use? Maybe Nutch 0.7.1 doesn't support the merged index.
Keren

Byron Miller [EMAIL PROTECTED] wrote:
> Sounds like it couldn't find your segments. Did catalina.out show your segments were found, or report any other errors?

--- keren nutch wrote:
> I merged 20 separate indexes into a master index. After I pointed to the master index, I got empty search results. I looked at catalina.out; it says the following:
>
> 060302 092418 13 query request from 127.0.0.1
> 060302 092418 13 query: canada
> 060302 092418 13 searching for 20 raw hits
> 060302 092419 13 total hits: 1319570
>
> It seems that it got results. Please let me know why I got empty search results.
> Best regards, Keren
Re: limit fetching by using crawl-urlfilter.txt
hi,
I tried this. Actually, in my case one site ends with .net and the other with .org, so I modified it to

+^http://([a-z0-9]*\.)*(abc.net|def.org)/

and ran another test. It doesn't seem to work: I saw a site other than abc and def being fetched. Any hints?
thanks,
Michael

--- sudhendra seshachala [EMAIL PROTECTED] wrote:
> Hi, try the following pattern:
> +^http://([a-z0-9]*\.)*(abc|def).com/
> I was able to search a couple of sites using a similar pattern, if this is what you are asking.
>
> Michael Ji [EMAIL PROTECTED] wrote:
> > I want my crawl limited to two sites only, such as *.abc.com/* and *.def.com/* ...

Sudhi Seshachala
http://sudhilogs.blogspot.com/
query site
Hi,
How do you use the query-site plugin? I've tried:
site:http://localhost:8080
but it returns nothing.
Thanks
How to set up for merged index
Hi,
I merged indexes from the directory /home/nutch/segments, which contains 20 subdirectories. My output index name is index. Then I moved the index under /home/nutch/merged_index/. In nutch-site.xml I set 'searcher.dir' to '/home/nutch/merged_index'. After that, I restarted Tomcat. When I did a search, I got this error:

java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:190)
        at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:298)
        at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:138)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:696)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:809)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:200)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:146)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:209)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:596)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:433)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:948)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:144)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:596)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:433)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:948)
        at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2358)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:133)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:596)
        at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:118)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:594)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:116)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:594)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:433)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:948)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:127)
        at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:596)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:433)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:948)
        at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:152)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
        at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683)
        at java.lang.Thread.run(Thread.java:534)
Caused by: java.lang.NullPointerException
        at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:144)
        at org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:163)

Please let me know what's wrong with my settings.
Best regards,
Keren
RE: query site
Hi,
I found it; it is:
site:localhost
Now, can I do a search on both site1 and site2?
site:site1 OR site:site2
does not work.
Thanks

-Original Message-
From: Laurent Michenaud [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 3, 2006 17:02
To: nutch-user@lucene.apache.org
Subject: query site

> How do you use the query-site plugin? I've tried site:http://localhost:8080 but it returns nothing.
RE: Question about Index Writing/Merging
Thanks, that's exactly what I was thinking. Do you have any recommendations on maximum index size? (Obviously we'd be testing ourselves, but it's good to get an idea.)
Tim

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 02, 2006 7:34 PM
To: nutch-user@lucene.apache.org
Subject: Re: Question about Index Writing/Merging

Tim Patton wrote:
> I'm working on a project that uses pieces of Nutch to store a Lucene index in Hadoop (basically I am using the FsDirectory and related classes). When trying to write to an index I got an unsupported-operation exception, since FsDirectory doesn't support seek, which Lucene uses when closing an IndexWriter; the file system is write-once. After looking through the Nutch code I saw that an index is worked on locally, whether being written or merged, and then transferred into the dfs when finished. I was just checking to make sure I understood this correctly.

Yes, this is correct.

> If I were to work on a multi-gigabyte index, I would need that much free space on my local drive to transfer the index to, and it would take a while to copy each way. How does this work for the really huge indexes people want to build with Nutch? Would there be many smaller Lucene indexes in the dfs, since obviously one huge terabyte index couldn't be downloaded? I'm just trying to get a better understanding of how Nutch works.

Terabyte indexes aren't actually very useful, since they take too long to search. So with big collections (100M pages) one keeps multiple indexes and uses distributed search to search them all in parallel.

Doug
Re: limit fetching by using crawl-urlfilter.txt
On 3/3/06, Michael Ji [EMAIL PROTECTED] wrote:
> hi, I tried this. Actually, in my case one site ends with .net and the other with .org, so I modified it to
> +^http://([a-z0-9]*\.)*(abc.net|def.org)/

I guess '.' is a metacharacter in regexps, so please try

+^http://([a-z0-9]*\.)*(abc\.net|def\.org)/

Good luck!

> and ran another test. It doesn't seem to work: I saw a site other than abc and def being fetched. Any hints?
> thanks, Michael

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
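The escaping point is easy to verify with `java.util.regex`. This is a standalone check, not the actual Nutch urlfilter code; note how the unescaped '.' lets an unrelated host through:

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    // Corrected crawl-urlfilter pattern: dots in the host names are escaped.
    static final Pattern ESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc\\.net|def\\.org)/");
    // Original pattern: an unescaped '.' matches any character.
    static final Pattern UNESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)*(abc.net|def.org)/");

    static boolean accepts(Pattern p, String url) {
        return p.matcher(url).find();
    }
}
```

With the escaped pattern, http://www.abc.net/ is accepted and a host like www.abcxnet is rejected; the unescaped pattern accepts both, which matches the off-site fetches Michael saw.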
Tutorial: indexing
There seems to be another error in the tutorial. The command

bin/nutch index indexes crawl/linkdb crawl/segments/*

should IMHO read

bin/nutch index indexes crawl/crawldb crawl/linkdb crawl/segments/*

See also the usage of nutch index:

Usage: index <crawldb> <linkdb> <segment> ...

Cheers
Patrice
Nutch doesn't support Korean?
I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means the Unicode character with hex value xxxx) are not part of the LETTER or CJK class. It seems to me that Nutch cannot handle Korean documents at all. Is anybody successfully using Nutch for Korean?
-kuro
Crawl Problem
Hello,
I am having a problem when I run bin/nutch crawl urls -dir ct -depth 3 > crawl.log
I get this error in my crawl.log file:

Created webdb at LocalFS, /root/Desktop/nutch/nutch-0.7/ct/db
Exception in thread "main" java.io.FileNotFoundException: urls (No such file or directory)

My urls.txt file looks like this:

http://localhost:8080/tomcat-docs/introduction.html

My crawl-urlfilter.txt looks like this:

+^http://([a-z0-9]*\.)*localhost:8080/

I am running my Tomcat web server as localhost, and I want to crawl the content of my web server. My web server is not connected to the internet.
Thanks,
P. Cone
project vitality?
Hi there,
I'm new around here. The mailing lists seem to have a pretty steady stream of traffic, but the website hasn't been updated since August, and there's only a handful of news items before that. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept, or a mature, ready-for-production product?
thanks for your time,
--
matt wilkie
Geographic Information, Information Management and Technology, Yukon Department of Environment
10 Burns Road * Whitehorse, Yukon * Y1A 4Y9
867-667-8133 Tel * 867-393-7003 Fax
http://environmentyukon.gov.yk.ca/geomatics/
RE: project vitality?
I think it is still very much at the proof-of-concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the information and documentation on it lacks luster. I have tried to get the tutorial and FAQs updated, but I haven't heard back.

-Original Message-
From: Matt Wilkie [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 03, 2006 6:34 PM
To: nutch-user@lucene.apache.org
Subject: project vitality?

> The mailing lists seem to have a pretty steady stream of traffic, but the website hasn't been updated since August. What is the vitality of the Nutch project? Is it basically a laboratory proof of concept, or a mature, ready-for-production product?
RE: project vitality?
I wouldn't call Nutch 0.7.x proof-of-concept. There are several production sites running it already: http://wiki.apache.org/nutch/PublicServers Plus I think Technorati is built on Nutch and/or Lucene. That said, the docs could be better, and it's probably a good idea to know Java, since you might have to tweak the code a bit to get the exact behavior you want. If you don't have special needs, you could get something like a site search up in very little time. The newer versions still seem to be changing a lot, though. I've been waiting for the dust to settle before I see if I want to upgrade.
Howie

> I think it is still very much at the proof-of-concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the documentation on it lacks luster. I have tried to get the tutorial and FAQs updated, but I haven't heard back.
language-identifier and language filter
Hello,
I enabled the language-identifier plugin and indexed some documents. But adding lang:en to the query does not seem to filter the docs by language. Instead, it tries to find documents that have the two terms "lang" and "en". Am I using the wrong syntax? Do I have to do more than adding language-identifier to the plugin list in conf/nutch-site.xml?
-kuro
Re: project vitality?
Passed the concept stage. Technorati uses Lucene. In open source projects, the last thing people want to do is documentation. Anybody know why Yahoo took down their Nutch server?

----- Original Message -----
From: Howie Wang [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; nutch-user@lucene.apache.org
Sent: Saturday, March 04, 2006 1:09 AM
Subject: RE: project vitality?

> I wouldn't call Nutch 0.7.x proof-of-concept. There are several production sites running it already: http://wiki.apache.org/nutch/PublicServers
Re: Nutch doesn't support Korean?
Hello,
There was a similar issue with Lucene's StandardTokenizer.jj:
http://issues.apache.org/jira/browse/LUCENE-444 and http://issues.apache.org/jira/browse/LUCENE-461
I have almost no experience with Nutch, but you can handle it like those issues above.

On 3/4/06, Teruhiko Kurosaka [EMAIL PROTECTED] wrote:
> I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF) are not part of the LETTER or CJK class. It seems to me that Nutch cannot handle Korean documents at all. Is anybody successfully using Nutch for Korean?
> -kuro

--
Cheolgoo
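The gap is easy to demonstrate: the Hangul Syllables range cited in the report is a first-class Unicode block, and a tokenizer's LETTER/CJK character classes would need to cover it for Korean text to be tokenized at all. A minimal stdlib check:

```java
public class HangulCheck {
    // Hangul Syllables block as cited in the report: U+AC00 .. U+D7AF.
    // Any tokenizer that should handle Korean must treat these as letters.
    public static boolean isHangulSyllable(char c) {
        return c >= '\uAC00' && c <= '\uD7AF';
    }
}
```

Adding an equivalent range to the grammar's character classes, as the Lucene issues above did for StandardTokenizer.jj, is the kind of fix being suggested.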
Re: project vitality?
I could not agree with Doug more. This is one of the best. I am trying UIMA too, though UIMA also uses Lucene; as of today it is still a framework and community in its early stages. In fact, the nightly builds have good improvements over 0.7.1. Any serious user or adopter should be trying a snapshot of the nightly build. Doug, it would be better if there were an official 0.8 release, or at least an RC, before a major 1.0 release. I am a newbie, so let me know about ideas on releasing 0.8.
Thanks,
Sudhi

Doug Cutting [EMAIL PROTECTED] wrote:
> Richard Braman wrote:
> > I think it is still very much at the proof-of-concept stage. I think it is close, but as you have mentioned, the website is severely out of date and the documentation on it lacks luster.
>
> It stands to reason that if the documentation lacks luster the project must be dead! Seriously, this is an active project. It is not yet 1.0, so don't expect polish. If it doesn't look easily usable to you then perhaps it is not. It's still for early adopters. The commit list shows a fair amount of activity: http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html
>
> Lots of public sites are using Nutch. Some are listed at http://wiki.apache.org/nutch/PublicServers, but many are not, like http://search.bittorrent.com/.
>
> > I have tried to get the tutorial and faqs updated, but I haven't heard back.
>
> This is an all-volunteer project. If you find a bug, please file a bug report, so that other folks are aware of it. Better yet, if you have a solution or improvement, please construct a patch file (even for documentation) and attach it to a bug report. On the wiki, anyone can make themselves an account and update documentation. We don't boss folks around here, or complain. We pitch in and help.
>
> Doug

Sudhi Seshachala
http://sudhilogs.blogspot.com/