Hi everyone,
I am from Turkey. My language has a special character, ğ. This character is
used only in Turkish, and I have to make a language identifier. I thought that
instead of using n-grams I could simply check whether
the HTML content includes ğ or not. For this reason I need an if check
to make the
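
A minimal sketch of the check described above, assuming the HTML content is already decoded into a Java String. The class and method names are hypothetical and not part of Nutch. Two caveats: ğ also occurs in other languages (Azerbaijani, for example), so this is only a heuristic, and the sketch does not handle pages that write the character as an HTML entity.

```java
class SoftGCheck {
    // U+011F is lowercase ğ, U+011E is uppercase Ğ.
    private static final char LOWER_SOFT_G = '\u011F';
    private static final char UPPER_SOFT_G = '\u011E';

    // Returns true if the decoded HTML text contains ğ or Ğ.
    static boolean containsSoftG(String html) {
        if (html == null) {
            return false;
        }
        return html.indexOf(LOWER_SOFT_G) >= 0 || html.indexOf(UPPER_SOFT_G) >= 0;
    }

    public static void main(String[] args) {
        String page = "<html><body>da\u011F</body></html>";
        if (containsSoftG(page)) {
            System.out.println("content contains the soft g");
        }
    }
}
```

The check only works if the page bytes were decoded with the right charset first; a Turkish page read with the wrong encoding would not contain U+011F at all.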
Hi, I got the latest Nutch release through SVN. I crawled and indexed some
sites without problems. When I tried to extract links into the linkdb, I saw
these lines in
cat linkdb/current/part-0/data
SEQorg.apache.hadoop.io.Text
Hi all,
I finally got Nutch to crawl and search, but when I click the cached page,
it returns an HTTP 500 error:
screen dump
type Exception report
message
description The server encountered an internal error () that prevented it
from fulfilling this request.
Jasper,
Thanks for the reply, yes, that helps my understanding. I had a little
look at the Luke tool which allowed me to see how different analyzers
were handling any given text, and seeing the tokens produced by using
org.apache.lucene.analysis.WhitespaceAnalyzer. I thought I'd attempt to
Hi,
Answering my own question again:
the Tika jar is not placed in the Tomcat webapp in 1.0-dev, which causes this
exception.
Thank you for your attention,
Vinci
Vinci wrote:
Hi all,
I finally got Nutch to crawl and search, but when I click the cached
page, it returns an HTTP 500 error:
The problem is that when I run ../bin/nutch readlinkdb ready/otomotiv/linkdb/
-dump alo, cat alo/part-0 shows nothing.
I fetched 20,000 URLs without problems.
Dennis Kubes-2 wrote:
You are showing the cat output of linkdb which is composed of binary
files. What is your problem?
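
For context on why cat shows unreadable output: Hadoop SequenceFile data starts with the ASCII bytes SEQ followed by a version byte, which is where the "SEQ" prefix in such output comes from. Below is a minimal sketch that checks a file for that header. The class and method names are hypothetical, it uses only the JDK and not Hadoop's own reader API, and it assumes Java 9 or later for readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class SeqMagicCheck {
    // True if the given bytes begin with the three ASCII bytes 'S', 'E', 'Q'.
    static boolean startsWithSeqMagic(byte[] header) {
        return header != null
                && header.length >= 3
                && header[0] == 'S'
                && header[1] == 'E'
                && header[2] == 'Q';
    }

    // Reads the first three bytes of the file and tests them for the magic.
    static boolean looksLikeSequenceFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[3];
            int read = in.readNBytes(header, 0, 3);
            return read == 3 && startsWithSeqMagic(header);
        }
    }
}
```

A file that passes this check is binary and is better inspected with a dump command such as readlinkdb -dump than with cat.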
I would like to know why the following check, which tests whether
a crawl directory already exists, is done.
FileSystem fs = FileSystem.get(job);
// Refuse to continue if the crawl directory is already present.
if (fs.exists(dir)) {
  throw new RuntimeException(dir + " already exists.");
}
Is it only to save the user from overwriting his crawl
Hi folks...
Is there a way to retrieve stats from Nutch - meaning how many webpages
are indexed, to be indexed etc??
When I was working with AspSeek and Mnogosearch in the past I could run
a command to see stats
Thanks again,
Paul
Try this command:-
bin/nutch readdb crawl/crawldb -stats
To get help, try:-
bin/nutch readdb
Regards,
Susam Pal
On Feb 1, 2008 8:21 AM, Paul Stewart [EMAIL PROTECTED] wrote:
Hi folks...
Is there a way to retrieve stats from Nutch - meaning how many webpages
are indexed, to be indexed