crawl-urlfilter.txt mechanics

2005-08-21 Thread Michael Ji
Hi, when I use intranet crawling, i.e. call "bin/nutch crawl ...", crawl-urlfilter.txt works: it filters out the URLs that do not match the domain I included. Actually, when I take a look at crawltool.java, the config files are read in Java Properties by 'NutchConf.get().addConfResource(
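
For readers unfamiliar with the filtering described above, here is a minimal, self-contained sketch of sign-prefixed regex rules. It is not Nutch's actual filter class. It assumes the commonly used rule format (one rule per line, "+" to accept, "-" to reject, "#" for comments, first matching rule wins, no match means reject); the class name UrlFilterSketch, its filter method, and the example.com domain are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a crawl-urlfilter.txt style filter; not the Nutch implementation.
class UrlFilterSketch {
    // One rule: a sign (accept or reject) plus a regular expression.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    // Parses rule lines; blank lines and lines starting with '#' are skipped.
    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // Returns the URL if the first matching rule accepts it, otherwise null.
    // Assumption: a URL that matches no rule is rejected.
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null;
    }
}
```

With the rules "+^http://([a-z0-9]*\.)*example\.com/" followed by "-.", a URL under example.com is passed through and every other URL is dropped, which matches the domain-restricting behaviour reported in the message.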

Re: dump nutch index

2005-08-21 Thread Jack Tang
Hi Michael On 8/22/05, Michael Ji <[EMAIL PROTECTED]> wrote: > hi Jack: > > I guess segread can dump the content of a fetched > segment, but I want to see inside the index > created by running "bin/nutch index", etc. Try searching for the "http/https/ftp/file" (protocol) keywords using NutchBean

Re: dump nutch index

2005-08-21 Thread Michael Ji
hi Jack: I guess segread can dump the content of a fetched segment, but I want to see inside the index created by running "bin/nutch index", etc. thanks, Michael Ji --- Jack Tang <[EMAIL PROTECTED]> wrote: > Hi Michael > > Is the "segread" nutch command what you want? > The corresponding class

Re: dump nutch index

2005-08-21 Thread Jack Tang
Hi Michael, is the "segread" nutch command what you want? The corresponding class is org.apache.nutch.segment.SegmentReader. Regards /Jack On 8/22/05, Michael Ji <[EMAIL PROTECTED]> wrote: > hi Jack: > > I am using Lukeall now and can browse into the index > files; it is a very powerful tool. > > But I w

Re: dump nutch index

2005-08-21 Thread Michael Ji
hi Jack: I am using Lukeall now and can browse into the index files; it is a very powerful tool. But I wonder if I can output the content of the individual files in the index dir in a text format, meaning I can see each text saved in the index files without it being interpreted by Lukeall. thanks, Michael Ji

Re: dump nutch index

2005-08-21 Thread Jack Tang
Hi Michael, hope Luke helps you. http://www.getopt.org/luke/ Regards /Jack On 8/22/05, Michael Ji <[EMAIL PROTECTED]> wrote: > hi there, > > Is there an easy way that I could dump the nutch index to a > human-readable format? > > thanks, > > Michael Ji

dump nutch index

2005-08-21 Thread Michael Ji
hi there, Is there an easy way that I could dump the nutch index to a human-readable format? thanks, Michael Ji

Re: Failing JUnit test

2005-08-21 Thread Jérôme Charron
> I found it and committed the fix. It was not using UTF-8 encoding > sometimes. Thanks Piotr > But while looking at the code I feel a little bit worried about > LanguageIdentifier.identify(InputStream is) - as it reads bytes from > file in chunks and converts each chunk to string separately. If m

Re: Failing JUnit test

2005-08-21 Thread Piotr Kosiorowski
Hello Jérôme, I found it and committed the fix. It was not using UTF-8 encoding sometimes. But while looking at the code I feel a little bit worried about LanguageIdentifier.identify(InputStream is), as it reads bytes from a file in chunks and converts each chunk to a string separately. If multibyte
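
The concern about decoding chunk by chunk can be illustrated with a small, self-contained sketch. This is not the LanguageIdentifier code; the class ChunkedDecoding and both method names are hypothetical. The first method decodes every byte chunk independently, so a multibyte UTF-8 character that straddles a chunk boundary is corrupted; the second wraps the stream in an InputStreamReader, which carries incomplete byte sequences over to the next read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not taken from the Nutch source.
class ChunkedDecoding {
    // Problematic: each chunk is decoded on its own, so a character whose
    // bytes are split across two chunks becomes replacement characters.
    static String decodePerChunk(InputStream in, int chunkSize) {
        try {
            StringBuilder out = new StringBuilder();
            byte[] buf = new byte[chunkSize];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.append(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Safer: the reader's decoder keeps partial byte sequences between reads,
    // so the result does not depend on where the chunk boundaries fall.
    static String decodeWithReader(InputStream in, int chunkSize) {
        try {
            StringBuilder out = new StringBuilder();
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            char[] buf = new char[chunkSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, the UTF-8 bytes of "Jérôme" read in chunks of two bytes split the "é" across a boundary: the per-chunk variant no longer returns the original text, while the reader-based variant does.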