Hi,
When I use intranet crawling, i.e. call
"bin/nutch crawl ...", crawl-urlfilter.txt works: it
filters out the URLs that do not match the domains I
included.
Actually, when I take a look at CrawlTool.java, the
config files are read into Java Properties by
'NutchConf.get().addConfResource(
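
For readers following along, here is a minimal sketch of how a crawl-urlfilter.txt style rule list can be applied to URLs. It is not the Nutch source: the class name UrlFilterSketch, the example.com domain, and the rule lines are made up for illustration, and the "first matching rule decides, no match means the URL is dropped" behaviour is stated here as an assumption about the filter file format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex URL filter; not the Nutch implementation.
class UrlFilterSketch {

    // One rule: '+' accepts and '-' rejects URLs containing a match for the regex.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses filter lines; blank lines and lines starting with '#' are skipped.
    UrlFilterSketch(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // Returns the URL if the first matching rule accepts it, otherwise null.
    // Assumption: a URL that matches no rule at all is dropped.
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(Arrays.asList(
            "# skip image suffixes",
            "-\\.(gif|jpg)$",
            "# accept hosts in the (hypothetical) example.com domain",
            "+^http://([a-z0-9]*\\.)*example\\.com/"));

        System.out.println(filter.filter("http://www.example.com/a.html"));
        System.out.println(filter.filter("http://www.example.com/a.gif"));
        System.out.println(filter.filter("http://other.org/"));
    }
}
```

In this sketch only the first URL is passed through; the image URL hits the reject rule first, and the other.org URL matches no rule, so both come back as null.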
Hi Michael
On 8/22/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> hi Jack:
>
> I guess segread can dump the content of a fetched
> segment, but I want to see inside the index
> created by running "bin/nutch index", etc.
Try searching for the protocol keywords ("http", "https", "ftp", "file") using
NutchBean
hi Jack:
I guess segread can dump the content of a fetched
segment, but I want to see inside the index
created by running "bin/nutch index", etc.
thanks,
Michael Ji
--- Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Michael
>
> Is the "segread" Nutch command what you want?
> Corresponding class
Hi Michael
Is the "segread" Nutch command what you want?
Corresponding class is org.apache.nutch.segment.SegmentReader
Regards
/Jack
On 8/22/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> hi Jack:
>
> I am using Lukeall now and can browse the index
> files; it is a very powerful tool.
>
> But I w
hi Jack:
I am using Lukeall now and can browse the index
files; it is a very powerful tool.
But I wonder if I can output the content of the
individual files in the index dir in a text format, meaning
I can see each piece of text saved in the index files without
it being interpreted by Lukeall.
thanks,
Michael Ji
Hi Michael
Hope Luke helps you.
http://www.getopt.org/luke/
Regards
/Jack
On 8/22/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> hi there,
>
> Is there an easy way that I could dump the nutch index to a
> human-readable format?
>
> thanks,
>
> Michael Ji
hi there,
Is there an easy way that I could dump the nutch index to a
human-readable format?
thanks,
Michael Ji
> I found it and committed the fix. It was not always using UTF-8
> encoding.
Thanks Piotr
> But while looking at the code I became a little worried about
> LanguageIdentifier.identify(InputStream is), as it reads bytes from the
> file in chunks and converts each chunk to a String separately. If m
Hello Jérôme,
I found it and committed the fix. It was not always using UTF-8 encoding.
But while looking at the code I became a little worried about
LanguageIdentifier.identify(InputStream is), as it reads bytes from the
file in chunks and converts each chunk to a String separately. If multibyte
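
To illustrate the concern about converting each chunk to a String separately, here is a small sketch under stated assumptions. It does not reproduce the LanguageIdentifier code, which is not shown in this thread; the class and method names (ChunkDecodingSketch, decodePerChunk, decodeWithReader), the sample text, and the chunk size of 2 are all invented for the example. It compares per-chunk decoding with decoding through an InputStreamReader, which keeps incomplete byte sequences until the next read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from Nutch.
class ChunkDecodingSketch {

    // Pattern under discussion: every byte chunk is decoded on its own, so a
    // multibyte character split across two chunks is decoded as garbage.
    static String decodePerChunk(InputStream in, int chunkSize) {
        try {
            StringBuilder out = new StringBuilder();
            byte[] buf = new byte[chunkSize];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.append(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One possible alternative: the Reader's decoder holds on to an incomplete
    // byte sequence and finishes it when the following bytes arrive.
    static String decodeWithReader(InputStream in, int chunkSize) {
        try {
            StringBuilder out = new StringBuilder();
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            char[] buf = new char[chunkSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "Jerome" with accented e and o, written with escapes so the source
        // file encoding does not matter. Each accented letter is 2 bytes in UTF-8.
        String text = "J\u00e9r\u00f4me";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        String perChunk = decodePerChunk(new ByteArrayInputStream(bytes), 2);
        String viaReader = decodeWithReader(new ByteArrayInputStream(bytes), 2);

        System.out.println(text.equals(perChunk));
        System.out.println(text.equals(viaReader));
    }
}
```

With a chunk size of 2 the first accented letter straddles the first chunk boundary, so the per-chunk result differs from the input (the first line printed is false), while the Reader-based result matches (the second line is true).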