Re: About regex in the crawl-urlfilter.txt config file

2006-02-23 Thread Jack Tang
Hi I think in the url-filter it uses contain rather than match. /Jack On 2/23/06, Elwin [EMAIL PROTECTED] wrote: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ Will this pattern accept url like this http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/? I think it's not, but in
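
The difference between "contain" and "match" raised above can be sketched with the standard java.util.regex API. This is only an illustration: it assumes the filter behaves like a "find" style search, as Jack suggests, it is not the filter's actual code, and the class name UrlFilterDemo and the sample URLs are made up for the example.

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // Pattern from the thread, without the leading '+' that marks an accept rule.
    static final Pattern ACCEPT =
        Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/");

    // "Contain" semantics: the pattern only has to be found somewhere in the URL.
    // The '^' still anchors it to the start, but nothing anchors the end.
    static boolean containsMatch(String url) {
        return ACCEPT.matcher(url).find();
    }

    // "Match" semantics: the whole URL must be consumed by the pattern.
    static boolean fullMatch(String url) {
        return ACCEPT.matcher(url).matches();
    }

    public static void main(String[] args) {
        String url = "http://www.MY.DOMAIN.NAME/docs/index.html"; // hypothetical URL
        System.out.println(containsMatch(url)); // true
        System.out.println(fullMatch(url));     // false: the path after the host is left over
    }
}
```

Under "contain" semantics a URL with any path below the host is accepted, which is why no trailing `.*` is needed in the rule.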

Re: Nutch and HTTrack Crawler

2006-02-23 Thread Stefan Groschupf
On 23.02.2006 at 01:55, sudhendra seshachala wrote: Is there a way I could use HTTrack for crawling and nutch for just searching? Has anybody done this before and comparison between crawlers. I suggest taking a look at lucene, since I guess it is more work changing nutch to your

Re: retrieve data from index file

2006-02-23 Thread Stefan Groschupf
Hi Wong, take a look at: http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html There are many code snippets on the net that will show you how you can use it. In general I found the book Lucene in Action a useful guide when working with lucene. Stefan Am

Re: About regex in the crawl-urlfilter.txt config file

2006-02-23 Thread Elwin
Oh I have asked a silly question about regex, hehe. 2006/2/23, Jack Tang [EMAIL PROTECTED]: Hi I think in the url-filter it uses contain rather than match. /Jack On 2/23/06, Elwin [EMAIL PROTECTED] wrote: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ Will

Re: Admin GUI

2006-02-23 Thread Stefan Groschupf
Hi Daniel, thanks, we are still working on it. Actually we have to finish something behind the scenes and then we will publish a kind of plugin extension point that will allow other people to contribute. Thanks for the offer, maybe the only thing you can do is to vote for this issue since this

RE: Simple indexation and reindexation

2006-02-23 Thread Vanderdray, Jacob
If you look at the section of the tutorial for doing intranet crawls, you should be able to use that for your small number of websites. The bin/nutch script wraps up all the crawl functions for you (fetching, indexing, deduping, etc). You'll just need to delete the results of your

Re: Simple indexation and reindexation

2006-02-23 Thread Sugra Llistaire
Thanks Jacob, for the help. It is a pity the results of the previous crawl must be removed. Especially because it's a problem to restart the container (JBoss, in my case). Is this a feature inherited from lucene? Or maybe this will be improved in the future? Thanks again. En/na Vanderdray,

Re: Re[6]: parse-swf plugin in 0.7 release

2006-02-23 Thread Raghavendra Prabhu
If you tell me the exact version of nutch which you are using, I can suggest modifications based on that. Rgds Prabhu On 2/23/06, Dima Mazmanov [EMAIL PROTECTED] wrote: Hi Raghavendra, I tried to compile SWFParser.java and got the following errors compile: [echo] Compiling plugin: parse-swf

RE: Simple indexation and reindexation

2006-02-23 Thread Vanderdray, Jacob
That issue gets a lot of discussion on this list and some folks have come up with their own workarounds. Those generally involve different implementations of the search bean. I haven't heard of any definitive solution for the next release. Jake. -Original Message- From: Sugra

meta in search query string

2006-02-23 Thread Poettgen
Hi, I have added two meta tags to my HTML pages for the language and a category (news, articles,...) of the page. <meta name="dc.language" content="en" /> And a meta tag for a category: <meta name="dc.category" content="news" /> How can I build a search query and get the hits, for example: Find all

Re: Nutch on Windows

2006-02-23 Thread Top100Forever
It's incredible... :( I also tried to change the port, to change the jdk version, to change the tomcat version, to add the servlet api to my classpath, but I get the StackOverflowError all the time... The installation instructions are too simple... but what can I be doing wrong??? Look at the

Manage severals NutchConf in one webapp

2006-02-23 Thread Laurent Michenaud
Is it possible to manage several NutchConf in only one webapp? My situation: Inside my webapp, I can manage several web sites with different URLs. I want to have a search engine per web site, so I need to have one NutchBean per web site and one NutchConf per web site. I have

RE: Nutch on Windows

2006-02-23 Thread Steve Betts
My first thought is that the page is forwarding to itself. That would create an infinite loop and cause a stack overflow. Is the language not set somehow? That would make it forward forever. Thanks, Steve Betts [EMAIL PROTECTED] 937-477-1797 -Original Message- From: Top100Forever

Re[8]: parse-swf plugin in 0.7 release

2006-02-23 Thread Dima Mazmanov
Hi, Raghavendra. I have the nutch-0.7 release without any modifications in code. You wrote on 23 February 2006, 19:21:17: If you tell me the exact version of nutch which you are using, I can suggest modifications based on that. Rgds Prabhu On 2/23/06, Dima Mazmanov [EMAIL PROTECTED] wrote: Hi

Re[8]: parse-swf plugin in 0.7 release

2006-02-23 Thread Dima Mazmanov
Hi,Raghavendra. /usr/local/nutch/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf/SWFParser.java:108: cannot resolve symbol [javac] symbol : constructor ParseData (org.apache.nutch.parse.ParseStatus,java.lang.String, org.apache.nutch.parse.Outlink[],

Re: meta in search query string

2006-02-23 Thread TDLN
You can follow the tutorial at http://wiki.apache.org/nutch/WritingPluginExample. Just replace recommended with category, and it will show you what to do. (I just implemented a category filter this way ...) Rgrds, T. On 2/23/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, I have added on

Re: Re[8]: parse-swf plugin in 0.7 release

2006-02-23 Thread Raghavendra Prabhu
I have attached a new parser. I have changed ContentProperties to Properties. Probably your nutch version uses that. I did not check that it compiles. Please look at other parsers and do the same. It should be easy to spot. Hope this helps. All the best Rgds Prabhu On 2/23/06, Dima Mazmanov [EMAIL

Re: Nutch on Windows

2006-02-23 Thread Top100Forever
Ouch, I found the error... this instruction does not work properly: String language = ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale()).getLocale().getLanguage(); language, after this instruction, is "", so for now I have corrected it with this code: - Original Message

Re: Nutch on Windows

2006-02-23 Thread Top100Forever
Ouch, I found the error... this instruction does not work properly: String language = ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale()).getLocale().getLanguage(); language, after this instruction, is "", so for now I have corrected it with this code: language = "en"; Why language
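
One possible explanation, which is not confirmed anywhere in this thread: when ResourceBundle.getBundle() finds no bundle for the requested or default locale and falls back to the base bundle, that bundle's getLocale() is the root locale, whose language is the empty string. A minimal sketch of a less hard-coded workaround than always assigning "en" follows. The class and method names are hypothetical and are not part of nutch.

```java
import java.util.Locale;

public class LanguageFallback {
    // Returns the language of the locale reported by the resource bundle,
    // or the given fallback when that locale is missing or has no language
    // (as is the case for the root locale of a base bundle).
    static String languageOrDefault(Locale bundleLocale, String fallback) {
        if (bundleLocale == null) {
            return fallback;
        }
        String language = bundleLocale.getLanguage();
        return language.isEmpty() ? fallback : language;
    }

    public static void main(String[] args) {
        System.out.println(languageOrDefault(Locale.ROOT, "en"));    // en
        System.out.println(languageOrDefault(Locale.ITALIAN, "en")); // it
    }
}
```

In the JSP this would wrap the existing call, passing the bundle's getLocale() as the first argument, so that a real translation is still used whenever one exists.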

fetcher.threads.fetch

2006-02-23 Thread Raghavendra Prabhu
protocol-httpclient uses the value fetcher.threads.fetch; protocol-http does not use fetcher.threads.fetch. Should this value not be used by protocol-http, protocol-file and protocol-ftp as well? Is it the protocol plugins that control this value, or the fetcher module (Fetcher.java)? Any light

Whole Web Indexing

2006-02-23 Thread sudhendra seshachala
Is invertlinks supported or not? I am using nutch 0.7.1. I am getting a no class def found error. Or should I use a compiled version? Can someone help me here? Whole-web: Indexing Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.

RE: meta in search query string

2006-02-23 Thread Vanderdray, Jacob
One difference you'll want is to change the plugin.xml file so that your query filter gets used just for the fields you're interested in. Instead of fields=DEFAULT in the example, you'll want raw-fields=language and raw-fields=category. Assuming you name the fields language and category

Re: Search Particulars

2006-02-23 Thread Raghavendra Prabhu
Okay, I am new to this topic. But do you add the meta tags to a particular field? If so, should that field not also appear in the field path? Maybe normal nutch does not look at that field at all? Maybe this is the reason? Unless you give the metadata field and search for the keyword. Rgds Prabhu

RE: Search Particulars

2006-02-23 Thread Vanderdray, Jacob
I'm not sure I understand what you're getting at. In this case I've added a comma separated list of names of meta tags that I want to index and search against. I've written a parse filter, an index filter and this query filter that all read in that list of meta tags from the

Re: Search Particulars

2006-02-23 Thread Jack Tang
Hey, the simplest way is to copy the BasicQueryFilter class and rename it, then modify FIELDS/FIELD_BOOSTS by replacing them with your meta tags from the nutch config. And don't forget the configuration in your query filter's plugin.xml. Good luck! /Jack On 2/24/06, Vanderdray, Jacob [EMAIL PROTECTED]

Re: Nutch on Windows

2006-02-23 Thread Stefan Groschupf
P.S. Now finally i could test nutch...:) Puhh, that was a pain! :-) Welcome!

Re: Nutch on Windows

2006-02-23 Thread Stefan Groschupf
Puhh, that was a pain! :-) Welcome! Oops, I hit the send button too fast. :-/ Before people misunderstand that, 'welcome' means 'welcome to nutch'. 'Welcome' in German always means 'welcoming someone to something', sorry.

Re: Manage severals NutchConf in one webapp

2006-02-23 Thread Stefan Groschupf
This should be possible with the latest version, nutch 0.8; you may need to build it from sources. There NutchConf is not static anymore and you can pass it down the stack. Besides that, you may need to store the NutchBean not as a context attribute but in a hashmap that is stored as a context attribute.
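
The hashmap idea can be sketched generically as below. This is an assumption-laden illustration, not nutch code: PerSiteRegistry, forSite and the site keys are hypothetical names, and the factory stands in for whatever builds a NutchBean from that site's NutchConf.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Holds one lazily created instance per web site. A single registry object
// would be stored as the context attribute in place of a single bean.
public class PerSiteRegistry<T> {
    private final Map<String, T> bySite = new ConcurrentHashMap<>();
    private final Function<String, T> factory;

    public PerSiteRegistry(Function<String, T> factory) {
        this.factory = factory;
    }

    // Returns the instance for the given site key, creating it on first use.
    public T forSite(String siteKey) {
        return bySite.computeIfAbsent(siteKey, factory);
    }

    // Number of sites for which an instance has been created so far.
    public int size() {
        return bySite.size();
    }
}
```

A servlet would then look up the registry from the context and call forSite() with a key derived from the request, for example the host name.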

Re: Stop Indexing

2006-02-23 Thread Stefan Groschupf
No. On 22.02.2006 at 22:29, Saravanaraj Duraisamy wrote: Hi, in nutch 0.7.1 is there a way to stop indexing without corrupting the index files in the middle of indexing??? thanks d.saravanaraj - blog: http://www.find23.org company:

exception in thread main.. fun

2006-02-23 Thread Florian Mettetal
I am running cygwin (I know), with jdk1.5.0 and tomcat 4.1 From cygwin I run: bin/nutch crawl urls -dir crawlresults/ -depth 2 -topN 1000 results: run java in C:/program files/java/jdk1.5.0/ 060223 123010 parsing file:/c:/cygwin/home/falieson/nutch/conf/nutch-default.xml 060223 123010

Re: Intranet search - some questions

2006-02-23 Thread Stefan Groschupf
Hi, - Is there any way to perform form based authentication? I know that this is a common request but I haven’t found a “good-enough” answer to it. The only references I’ve found are about basic auth, which I’d prefer to avoid. I ask this because I’ve noticed that SearchBlox,

Re: exception in thread main.. fun

2006-02-23 Thread Poettgen
Hi Florian, Where is your urls file located? If you created urls in the conf folder then you have to call: bin/nutch crawl conf/urls -dir crawlresults/ -depth 2 -topN 1000 Good luck Detlev I am running cygwin (I know), with jdk1.5.0 and tomcat 4.1 From cygwin I run: bin/nutch crawl urls

Re: exception in thread main.. fun

2006-02-23 Thread Florian Mettetal
I made a urls file, yeah, I didn't realize that was what crawl referred to. I thought it would simply grab the url from the conf\urlfilter.txt file. bin/nutch crawl urls -dir crawled -depth 2 crawl.log Well, at least now it ran, but with zero results. file urls

Re: Nutch on Windows

2006-02-23 Thread Top100Forever
Don't worry, I understood what you meant :) But what is the reason for this kind of problem? Why is nutch not able to select the language? Do you have any idea? My solution is not the best... :) Right now I'm studying the nutch architecture... I need, for my purposes, a search engine that give

Re: Nutch on Windows

2006-02-23 Thread Stefan Groschupf
Don't worry, I understood what you meant :) Sorry, my English is too often just terrible. I'm trying to improve it; I feel people too often misunderstand me. Anyway I guess and hope my java is much better. :-) But what is the reason for this kind of problem? Why is nutch not able to select

Nutch 0.8 version required..

2006-02-23 Thread sudhendra seshachala
The latest version I could see in the SVN is 0.7.1, Where can I get 0.8., source code is even better. Could I just grab from nightly builds ? Please let me know.. Thanks Sudhi Seshachala http://sudhilogs.blogspot.com/

Re: Nutch 0.8 version required..

2006-02-23 Thread Stefan Groschupf
http://cvs.apache.org/dist/lucene/nutch/nightly/ On 24.02.2006 at 01:44, sudhendra seshachala wrote: The latest version I could see in the SVN is 0.7.1, Where can I get 0.8., source code is even better. Could I just grab from nightly builds ? Please let me know.. Thanks Sudhi

Re: Admin GUI

2006-02-23 Thread Jack Tang
Hi Stefan The GUI looks great! My idea is to add ajax tech. to reduce the page reload and show the job progress in realtime. If contribution is welcome and no one is working on this, I'd like to take this. Regards /Jack On 2/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Daniel, thanks

exception during fetch using hadoop

2006-02-23 Thread Mike Smith
I have been getting this exception during fetching for almost a month. This exception stops the whole crawl. It happens on and off! Any idea?? We are really stuck with this problem. I am using 3 data nodes and 1 name server. 060223 173809 task_m_b8ibww fetching

Re: Admin GUI

2006-02-23 Thread Jack Tang
On 2/24/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Jack, The GUI looks great! I will forward this to Frank Henze he had done the design and sample. :) Thanks. I'll prepare some utility and debug javascript classes from now on:) My idea is to add ajax tech. to reduce the page reload

Re: Nutch 0.8 version required..

2006-02-23 Thread sudhendra seshachala
Thanks Stefan. But when I compiled, the jar size was just 318kB for 0.8-dev whereas the 0.7.1 release was 718KB. Am I missing something ? Sudhi Stefan Groschupf [EMAIL PROTECTED] wrote: http://cvs.apache.org/dist/lucene/nutch/nightly/ On 24.02.2006 at 01:44, sudhendra seshachala wrote:

Re: Nutch 0.8 version required..

2006-02-23 Thread Jack Tang
On 2/24/06, sudhendra seshachala [EMAIL PROTECTED] wrote: Thanks Stefan. But when I compiled, the jar size was just 318kB for 0.8-dev whereas the 0.7.1 release was 718KB. Am I missing something ? I guess not. All the mapreduce classes were separated from nutch and are hosted in the hadoop project.

[nutch0.8]why map progress become negative?

2006-02-23 Thread 郑昀
Hi, i'm using nutch-2006-02-22.tar.gz(release 0.8) to nutch web. but when i run bin/nutch crawl seeds -dir cnblogs -depth 3 command, i always got negative map progress?! just like this: 060224 135430 seeds\urls.txt:0+23 060224 135431 seeds\urls.txt:0+23 060224 135432 map -32509% reduce 0%

Re: retrieve data from index file

2006-02-23 Thread Jack Tang
Exception in thread main java.lang.NoClassDefFoundError: org/apache/lucene/st ore/FSDirectory ^^^ Why one blank here? On 2/24/06, Wong Ting Kiong [EMAIL PROTECTED] wrote: hi, I had tried some java codes calling lucene lib lucene-1.9-rc1-dev.jar, but got error, my

Re: Search Particulars

2006-02-23 Thread Raghavendra Prabhu
Hi, the code which you sent is only for the query-filter. In the parse-filter and especially in the index-filter, do you add it to any new field which you define?? What I do is store any data which I want to have in a new field (created by me). I guess the index-filter must be storing it in such