Help needed!!

2008-01-31 Thread Volkan Ebil
Hi everyone, I am from Turkey. My language has a special character, ğ. This character is used only in Turkish, and I have to build a language identifier. I have thought that instead of using n-grams I can simply check whether the HTML content includes ğ or not. For this reason I need an if check to make the
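
A minimal sketch of that if check, assuming the page bytes have already been fetched and decoded into a Java String with the correct charset (the class and method names below are illustrative, not part of Nutch):

    // Minimal sketch: treat the presence of the Turkish letter ğ (or Ğ) as a
    // hint that the page is Turkish. This only works if the content was
    // decoded with the right character encoding before the check.
    public class TurkishHint {
        public static boolean looksTurkish(String htmlContent) {
            return htmlContent != null
                && (htmlContent.indexOf('ğ') >= 0 || htmlContent.indexOf('Ğ') >= 0);
        }

        public static void main(String[] args) {
            System.out.println(looksTurkish("Değerli okuyucular")); // true
            System.out.println(looksTurkish("Dear readers"));       // false
        }
    }

This is only a quick heuristic rather than a full identifier; Turkish pages that happen not to contain the letter will be missed.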

linkdb problem

2008-01-31 Thread Uygar BAYAR
Hi, I got the latest Nutch release through SVN. I crawled and indexed some sites without problems. When I tried to extract links into the linkdb, I saw these lines in cat linkdb/current/part-0/data: SEQorg.apache.hadoop.io.Text

Error when request cache page in 1.0-dev

2008-01-31 Thread Vinci
Hi all, I finally got Nutch to crawl and search, but when I click the cached page it throws an HTTP 500 at me: screen dump: type: Exception report; message; description: The server encountered an internal error () that prevented it from fulfilling this request.

Re: Simple question about query terms

2008-01-31 Thread Chaz Hickman
Jasper, thanks for the reply; yes, that helps my understanding. I had a look at the Luke tool, which let me see how different analyzers handle any given text and see the tokens produced by org.apache.lucene.analysis.WhitespaceAnalyzer. I thought I'd attempt to
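
For reference, a small sketch of what Luke displays there: printing the tokens WhitespaceAnalyzer produces for a piece of text. This is written against a recent Lucene API (no-arg constructor, attribute-based TokenStream); the org.apache.lucene.analysis.WhitespaceAnalyzer package path mentioned in the thread belongs to the older Lucene that shipped with Nutch at the time.

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Prints the tokens WhitespaceAnalyzer produces: it splits on whitespace
    // only, so punctuation stays attached to the terms.
    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
            TokenStream ts = analyzer.tokenStream("content", new StringReader("Foo-Bar baz:qux 42"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // "Foo-Bar", "baz:qux", "42"
            }
            ts.end();
            ts.close();
            analyzer.close();
        }
    }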

Re: Error when request cache page in 1.0-dev

2008-01-31 Thread Vinci
Hi, answering myself again: the Tika jar is not placed in the Tomcat webapp in 1.0-dev, and that causes this exception. Thank you for your attention, Vinci. Vinci wrote: Hi all, I finally got Nutch to crawl and search, but when I click the cached page it throws an HTTP 500 at me:

Re: linkdb problem

2008-01-31 Thread Uygar BAYAR
The problem is that when I try ../bin/nutch readlinkdb ready/otomotiv/linkdb/ -dump alo, there is nothing in cat alo/part-0. I fetched 20.000 URLs without problems. Dennis Kubes-2 wrote: You are showing the cat output of the linkdb, which is composed of binary files. What is your problem?
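
Since the linkdb data files are binary Hadoop sequence files, catting them will only show a header like the one quoted above. As an alternative to the -dump option, here is a hedged sketch of reading a part file directly with the older Hadoop API that Nutch used at the time; the path and the Inlinks value class are assumptions based on the standard linkdb layout.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.Inlinks;

    // Reads the binary linkdb part file record by record and prints each URL
    // together with its inlinks, instead of catting the raw bytes.
    public class DumpLinkDb {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path data = new Path("ready/otomotiv/linkdb/current/part-00000/data"); // adjust to your layout
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
            Text url = new Text();
            Inlinks inlinks = new Inlinks();
            while (reader.next(url, inlinks)) {
                System.out.println(url + "\t" + inlinks);
            }
            reader.close();
        }
    }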

Recrawl using org.apache.nutch.crawl.Crawl

2008-01-31 Thread Susam Pal
I am interested to know why the following check is done on whether a crawl directory already exists: FileSystem fs = FileSystem.get(job); if (fs.exists(dir)) { throw new RuntimeException(dir + " already exists."); } Is it only to save the user from overwriting his crawl
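
For reference, a self-contained sketch of that guard with the string literal restored; the surrounding class here is illustrative, while in Nutch the check sits in the Crawl job setup.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Stand-alone version of the guard quoted above: it simply refuses to run
    // if the output directory already exists, so an earlier crawl is never
    // silently overwritten.
    public class CrawlDirGuard {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path dir = new Path(args.length > 0 ? args[0] : "crawl");
            if (fs.exists(dir)) {
                throw new RuntimeException(dir + " already exists.");
            }
            System.out.println(dir + " does not exist yet; a new crawl could write there.");
        }
    }

A recrawl with the one-shot Crawl class therefore needs a fresh directory, or an explicit delete of the old one first.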

Stats?

2008-01-31 Thread Paul Stewart
Hi folks... Is there a way to retrieve stats from Nutch - meaning how many webpages are indexed, still to be indexed, etc.? When I was working with AspSeek and Mnogosearch in the past, I could run a command to see stats. Thanks again, Paul

Re: Stats?

2008-01-31 Thread Susam Pal
Try this command: bin/nutch readdb crawl/crawldb -stats. To get help, try: bin/nutch readdb. Regards, Susam Pal. On Feb 1, 2008 8:21 AM, Paul Stewart [EMAIL PROTECTED] wrote: Hi folks... Is there a way to retrieve stats from Nutch - meaning how many webpages are indexed, still to be indexed