Re: Problem in config nutch-default.xml

2006-11-11 Thread Håvard W. Kongsgård

Related issue?
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06135.html

[EMAIL PROTECTED] wrote:

Hi all.

I have a problem configuring nutch-default.xml. As I am in China, most ftp sites that I
want to crawl are encoded in Chinese, but when nutch crawls these ftp sites it does
not get the correct charset, and the parse results are incomprehensible and
useless. So I changed the property
 <property>
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
 </property>
from <value>windows-1252</value> to <value>gb2312</value> and got a very interesting result: nutch can now crawl
the files and directories in the root directory of Chinese ftp sites without any messy
characters, but can NOT crawl any files in SUBdirectories; it just gets a "404 not found" result.
I know there must be something wrong in the config files, but how and where can I configure nutch to crawl a Chinese ftp site?
I've been working on this problem for half a month and have found no way to solve it. Could anyone help me?
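
A common way to apply this (a sketch, not from the original mail) is to put the override in conf/nutch-site.xml instead of editing nutch-default.xml; gb2312 is the value the poster used:

<property>
  <name>parser.character.encoding.default</name>
  <value>gb2312</value>
  <description>Fallback character encoding to use when a page or ftp
  listing does not declare one.</description>
</property>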


thanks

 




Nutch slow how to speed up?

2006-10-24 Thread Håvard W. Kongsgård
I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory), 
searching with queries like 'China Nuclear Forces' takes 20 – 25 s.


My config:
http.content.limit = 6165536
dfs.replication = 1
mapred.submit.replication = 2
mapred.child.java.opts = -Xmx800m

My data:
TOTAL urls: 3748140
retry 0: 3614731
retry 1: 85999
retry 2: 20772
retry 3: 26638
min score: 0.0
avg score: 0.64956105
max score: 3922.723
status 1 (DB_unfetched): 1316016
status 2 (DB_fetched): 2168397
status 3 (DB_gone): 263727

Status: HEALTHY
Total size: 254534723272 B
Total blocks: 5140 (avg. block size 49520374 B)
Total dirs: 260
Total files: 1466
Over-replicated blocks: 8 (0.15564202 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0015564

The filesystem under path '/' is HEALTHY


Re: Nutch slow how to speed up?

2006-10-24 Thread Håvard W. Kongsgård

DistributedSearch
2x datanodes, 2x Task Trackers
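
For reference, a typical Nutch 0.8 DistributedSearch layout looks roughly like this (a sketch; host names, port and paths are placeholders, not taken from the thread):

# on each search node, serve its local index/segments from the crawl dir
bin/nutch server 9999 /data/nutch/crawl

# on the web front end, point searcher.dir at a directory that contains
# a search-servers.txt file listing the search nodes, one "host port" per line:
#   node1 9999
#   node2 9999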

Sami Siren wrote:
You are using DistributedSearch? and local filesystem to store index 
and related data?


--
 Sami Siren


Håvard W. Kongsgård wrote:
I have nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 
memory), searching with queries like 'China Nuclear Forces' takes 20 
– 25 s.


My config:
http.content.limit = 6165536
dfs.replication = 1
mapred.submit.replication = 2
mapred.child.java.opts = -Xmx800m

My data:
TOTAL urls: 3748140
retry 0: 3614731
retry 1: 85999
retry 2: 20772
retry 3: 26638
min score: 0.0
avg score: 0.64956105
max score: 3922.723
status 1 (DB_unfetched): 1316016
status 2 (DB_fetched): 2168397
status 3 (DB_gone): 263727

Status: HEALTHY
Total size: 254534723272 B
Total blocks: 5140 (avg. block size 49520374 B)
Total dirs: 260
Total files: 1466
Over-replicated blocks: 8 (0.15564202 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0015564

The filesystem under path '/' is HEALTHY








Re: problem parsing documents : word, rtf, excel, etc...

2006-10-20 Thread Håvard W. Kongsgård

Post your conf/nutch-site.xml


Aïcha wrote:
Hi, 


I have a lot of parsing problems when I try to index my directory; only about
50% of the files were indexed.

I asked the nutch-dev group, but I am trying nutch-user too; perhaps somebody has had
these problems and solved them.

Here is a list of the main problems the parsing encountered:

  -  Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be handled as micrsosoft document. org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance, the following exception occured: null 

  - Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't be handled as micrsosoft document. java.util.NoSuchElementException 
  
  - Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't be handled as micrsosoft document. java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256 

  - 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks 
java.net.MalformedURLException: unknown protocol: dsp 
at java.net.URL.<init>(URL.java:574) 
at java.net.URL.<init>(URL.java:464) 
at java.net.URL.<init>(URL.java:413) 
at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78) 
at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35) 
at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111) 
at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84) 
at org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43) 
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82) 
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276) 
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152) 



In the last error, the string after "unknown protocol:" is not always "dsp"; it seems to be different in each case and I don't understand what this string means.

Thanks in advance

Best regards, 
Aïcha







  




Re: why I use site:com to query , but no result return??

2006-10-12 Thread Håvard W. Kongsgård

The site: operator works this way.

China site:www.ndu.edu = only results from http://www.ndu.edu

China site:ndu.edu = only results from http://ndu.edu/

China site:*.ndu.edu = results from http://ndu.edu/ and http://www.ndu.edu

I think you also can use Grouping in nutch: 
http://lucene.apache.org/java/docs/queryparsersyntax.html



xu nutch wrote:

nutch-user,

I downloaded nutch 0.7.2 and have crawled some webpages.
I can find some results by keywords,
and I can also find some results with the query url:com,
but there are no results for the query site:sample.com.
Why?

who can help me ?





nonzero status of 134

2006-10-07 Thread Håvard W. Kongsgård

During a fetch I got this error on one of my nodes:

java.io.IOException: Task process exit with nonzero status of 134.

   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:242)

   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)



Re: Problem in Distributed crawling using nutch 0.8

2006-09-29 Thread Håvard W. Kongsgård
see: 
http://wiki.apache.org/nutch-data/attachments/FrontPage/attachments/Hadoop-Nutch%200.8%20Tutorial%2022-07-06%20%3CNavoni%20Roberto%3E


Before you start tomcat, remember to change the path of your search directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes directory.

#This is an example of my configuration 


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <property>
   <name>fs.default.name</name>
   <value>LSearchDev01:9000</value>
 </property>

 <property>
   <name>searcher.dir</name>
   <value>/user/root/crawld</value>
 </property>

</configuration>



Mohan Lal wrote:

Hi,

Thanks for your valuable information, I have solved that problem. After that
I am facing another problem.
I have 2 slaves:
 1)  MAC1
 2)  MAC2

but the job was running on MAC1 itself, and it takes a long time to finish
the crawling process.
How can I assign the job to the distributed machines I specified in the slaves file?

My crawling process did finish successfully. Also, how can I specify
the searcher dir in the nutch-site.xml file?

 <property>
  <name>searcher.dir</name>
  <value> ? </value>
 </property>


please help me.


I have done the following setting.

[EMAIL PROTECTED] ~]# cd /home/lucene/nutch-0.8.1/
[EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop namenode -format
Re-format filesystem in /tmp/hadoop/dfs/name ? (Y or N) Y
Formatted /tmp/hadoop/dfs/name
[EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
amenode-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting datanode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/ha
doop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
-jobtracker-mohanlal.qburst.local.out
fpo: ssh: fpo: Name or service not known
localhost: starting tasktracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs
/hadoop-root-tasktracker-mohanlal.qburst.local.out
[EMAIL PROTECTED] nutch-0.8.1]# bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
sonu: no tasktracker to stop
stopping namenode
sonu: no datanode to stop
localhost: stopping datanode
[EMAIL PROTECTED] nutch-0.8.1]# bin/start-all.sh
starting namenode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-root-n
amenode-mohanlal.qburst.local.out
sonu: starting datanode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-
root-datanode-sonu.qburst.local.out
localhost: starting datanode, logging to
/home/lucene/nutch-0.8.1/bin/../logs/ha
doop-root-datanode-mohanlal.qburst.local.out
starting jobtracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hadoop-root
-jobtracker-mohanlal.qburst.local.out
localhost: starting tasktracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs
/hadoop-root-tasktracker-mohanlal.qburst.local.out
sonu: starting tasktracker, logging to
/home/lucene/nutch-0.8.1/bin/../logs/hado
op-root-tasktracker-sonu.qburst.local.out
[EMAIL PROTECTED] nutch-0.8.1]# bin/hadoop dfs -put  urls urls
[EMAIL PROTECTED] nutch-0.8.1]# bin/nutch crawl urls -dir crawl.1 -depth 2
-topN 10 crawl started in: crawl.1
rootUrlDir = urls
threads = 100
depth = 2
topN = 10
Injector: starting
Injector: crawlDb: crawl.1/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120038
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120038
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120038
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: crawl.1/segments/20060929120235
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl.1/segments/20060929120235
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.1/crawldb
CrawlDb update: segment: crawl.1/segments/20060929120235
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.1/linkdb
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120038
LinkDb: adding segment: /user/root/crawl.1/segments/20060929120235
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl.1/linkdb
Indexer: adding segment: /user/root/crawl.1/segments/20060929120038
Indexer: adding segment: /user/root/crawl.1/segments/20060929120235
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl.1/indexes
Dedup: done
Adding /user/root/crawl.1/indexes/part-0
Adding 

Re: Tomcat 5 / Nutch web gui timeout blank page

2006-09-29 Thread Håvard W. Kongsgård

I solved the problem by giving tomcat 5 more memory

export JAVA_OPTS="-Xmx528m -Xms128m"


Håvard W. Kongsgård wrote:
I have a problem with my Nutch web gui sometimes returning empty pages 
when I do a search. In Nutch 0.7 this was fixed by giving 
ipc.client.timeout a higher value in my webapp/ROOT/ 
WEB-INF/classes/hadoop-site.xml but this has no effect in nutch 0.8.1, 
the nutch web gui still times out after about 30s.







Re: Problem in Distributed crawling using nutch 0.8

2006-09-28 Thread Håvard W. Kongsgård

Does /user/root/urls exist? Have you uploaded the urls folder to your DFS?

bin/hadoop dfs -mkdir urls
bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt

or

bin/hadoop dfs -put <localsrc> <dst>


Mohan Lal wrote:

Hi all,

While I am trying to crawl using distributed machines, it throws an error:

bin/nutch crawl urls -dir crawl -depth 10 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 10
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Input directory
/user/root/urls in localhost:9000 is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

whats wrong with my configuration,  please help  me..


Regards
Mohan Lal 
  




Indexing in nutch 0.8 / hadoop

2006-09-28 Thread Håvard W. Kongsgård

What is the best way to create a master index on a nutch 0.8 / hadoop system?

Is it to merge all of the segments together, and then create an index?

Or, like Roberto Navoni does in his tutorial, first index all the segments
separately and then merge the indexes into one master index?


-.-.-.-.-.-.-
# Create a new indexe0
bin/nutch
index /user/root/crawld/indexe0 /user/root/crawld/ /user/root/crawld/linkdb
/user/root/crawld/segments/20060722153133
# Create a new index1
bin/nutch
index /user/root/crawld/indexe1 /user/root/crawld/ /user/root/crawld/linkdb
/user/root/crawld/segments/20060722182213
#Dedup the new indexe0
bin/nutch dedup /user/root/crawld/indexe0
#Dedup the new index1
bin/nutch dedup /user/root/crawld/indexe1
#Delete the old index
#Merge the new index merge directory
bin/nutch
merge /user/root/crawld/index /user/root/crawld/indexe0 
/user/root/crawld/indexe1 ...

#(and the other index create for the fetch segments)
-.-.-.-.-.-.-
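
For comparison, the first approach (merge the segments, then build a single index) would look roughly like this; a sketch only, the merged segment name and paths are placeholders and the exact tool options should be checked against your Nutch 0.8 build:

-.-.-.-.-.-.-
# merge all segments into one new segment
bin/nutch mergesegs /user/root/crawld/MERGEDsegments -dir /user/root/crawld/segments
# index the merged segment against the crawldb and linkdb
bin/nutch index /user/root/crawld/index /user/root/crawld/crawldb /user/root/crawld/linkdb /user/root/crawld/MERGEDsegments/<merged-segment>
# remove duplicates from the new index
bin/nutch dedup /user/root/crawld/index
-.-.-.-.-.-.-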


Tomcat 5 / Nutch web gui timeout blank page

2006-09-28 Thread Håvard W. Kongsgård
I have a problem with my Nutch web gui sometimes returning empty pages 
when I do a search. In Nutch 0.7 this was fixed by giving 
ipc.client.timeout a higher value in my webapp/ROOT/ 
WEB-INF/classes/hadoop-site.xml but this has no effect in nutch 0.8.1, 
the nutch web gui still times out after about 30s.




How to search for multiple site:

2006-09-27 Thread Håvard W. Kongsgård

In Google the user can search in more than one specific site using OR

admission site:www.stanford.edu OR site:cmu.edu OR site:mit.edu OR 
site:berkeley.edu


Is this possible in the nutch web gui?



Generate linkDb | hadoop/nutch 0.8

2006-07-20 Thread Håvard W. Kongsgård

When I run “bin/nutch invertlinks linkdb segments” I get this error

Exception in thread "main" java.io.IOException: Input directory 
/user/nutch/segments/parse_data in linux3:9000 is invalid.


at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)

at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:212)

at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:316)


I have tried to create the directory segments/parse_data but still the 
same error.




Indexing segment | nutch 0.8/hadoop

2006-07-20 Thread Håvard W. Kongsgård
When I try to index my second segment “bin/nutch index issep crawldb 
linkdb segments/x” I get this error


Exception in thread "main" java.io.IOException: Output directory 
/user/nutch/issep already exists.
at 
org.apache.hadoop.mapred.OutputFormatBase.checkOutputSpecs(OutputFormatBase.java:39)

at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:279)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:313)



Re: Generate linkDb | hadoop/nutch 0.8

2006-07-20 Thread Håvard W. Kongsgård

Sami Siren wrote:

try “bin/nutch invertlinks linkdb -dir segments”

--
Sami Siren


Håvard W. Kongsgård wrote:


When I run “bin/nutch invertlinks linkdb segments” I get this error

Exception in thread "main" java.io.IOException: Input directory 
/user/nutch/segments/parse_data in linux3:9000 is invalid.


at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)

at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:212)

at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:316)


I have tried to create the directory segments/parse_data but still 
the same error.







Thanks it worked



Re: Best performance approach for single MP machine?

2006-07-20 Thread Håvard W. Kongsgård


http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02394.html


Teruhiko Kurosaka wrote:

   Can I use MapReduce to run Nutch on a multi CPU system?
 


Yes.


   I want to run the index job on two (or four) CPUs
   on a single system.  I'm not trying to distribute the job
   over multiple systems.

   If the MapReduce is the way to go,
   do I just specify config parameters like these:
   mapred.tasktracker.tasks.maximum=2
   mapred.job.tracker=localhost:9001
   mapred.reduce.tasks=2 (or 1?)

   and
   bin/start-all.sh

   ?
 

That should work. You'd probably want to set the default number of map 
tasks to be a multiple of the number of CPUs, and the number of reduce 
tasks to be exactly the number of cpus.


Don't use start-all.sh, but rather just:

bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker


   Must I use NDFS for MapReduce?
 


No.

Doug
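
A minimal mapred configuration along those lines would look roughly like this (a sketch; the values are illustrative assumptions for a single 4-CPU machine, not taken from the thread):

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>8</value>   <!-- a multiple of the number of CPUs -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>   <!-- roughly the number of CPUs -->
</property>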







Doug Cook wrote:

Hi,

I've recently switched to 0.8 from 0.7, and after some initial fits and
starts, I'm past the "get it working at all" stage to the "get reasonable
performance" stage.

I've got a single machine with 4 CPUs and a lot of memory. URL fetching
works great because it's (mostly) multithreaded. But as soon as I hit the
reduce phase of fetch, it's dog slow. I'm down to running on one CPU, and
the phase can take days, leaving me vulnerable to losing everything should a
process fail.

Wait! you say. That's just what Hadoop is for! I'm all ears. I'd love some
help getting my configuration right. I've seen examples/tutorials of
configurations for multiple machines; am I just faking multiple machines
on my single node (will that work?) or is there a cleaner, simpler approach?

Alternatively, I was all excited to get an easy improvement with
-numFetchers, and run 4 fetchers simultaneously to use all my CPUs, but it
looks like -numFetchers has gone away, and though there was an 0.8 version
patch, at a quick glance this didn't seem to have made it into the mainline
source, and I don't see the value of trying to merge this in if there's a
cleaner Hadoop-based approach.

Many thanks for any help.

Doug
  




Nutch 0.8 java 1.4/1.5

2006-07-17 Thread Håvard W. Kongsgård

I am trying to get nutch/hadoop to run on 3 servers with SUSE linux.

I have followed the Nutch Hadoop Tutorial and everything works fine (I 
can run bin/hadoop dfs -ls), but when I run “bin/nutch inject crawldb 
urls” I get this error.


Exception in thread "main" java.lang.UnsupportedClassVersionError: 
org/apache/commons/cli/ParseException (Unsupported major.minor version 49.0)


at java.lang.ClassLoader.defineClass0(Native Method)

at java.lang.ClassLoader.defineClass(ClassLoader.java:539)

at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)

at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)

at java.net.URLClassLoader.access$100(URLClassLoader.java:55)

at java.net.URLClassLoader$1.run(URLClassLoader.java:194)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:187)

at java.lang.ClassLoader.loadClass(ClassLoader.java:289)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)

at java.lang.ClassLoader.loadClass(ClassLoader.java:235)

at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)

at org.apache.nutch.crawl.Injector.inject(Injector.java:138)

at org.apache.nutch.crawl.Injector.main(Injector.java:164)

I have set the JAVA_HOME variable in hadoop-env.sh to 
/usr/java/jdk1.5.0_07/ but nutch still tells me that I use version 48.0 
(java 1.4).


I have also tried to set the JAVA_HOME variable in bin/nutch but with 
the same result.
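
The error means the classes were compiled for Java 5 while the JVM actually running is 1.4. A sketch of what usually resolves it, using the JDK path from the mail above (the exact lines are an assumption, not a quote from the thread):

# in conf/hadoop-env.sh on every node
export JAVA_HOME=/usr/java/jdk1.5.0_07

# and in the shell that launches bin/nutch
export JAVA_HOME=/usr/java/jdk1.5.0_07
export PATH=$JAVA_HOME/bin:$PATH
java -version    # should now report 1.5.0_07, not 1.4.x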




Re: Nutch on Windows

2006-07-14 Thread Håvard W. Kongsgård

Kerry Wilson wrote:
Trying to use nutch on windows and the executables are shell scripts, 
how do you use nutch on windows?



http://wiki.apache.org/nutch/GettingNutchRunningWithWindows


Re: favicon?

2006-04-21 Thread Håvard W. Kongsgård

For Internet Explorer
http://www.favicon.com/ie.html

Firefox
Works for me in nutch 0.7.2
Is it the right size?  
http://www.photoshopsupport.com/tutorials/jennifer/favicon.html




Bill Goffe wrote:

At http://ese.rfe.org I've Nutch running for some time, but I have a minor
question: how to put in my own favicon? In .71, I put my favicon.ico in
src/site/src/documentation/resources/images/ and docs/img/ (wasn't sure
which mattered), did an ant war, and redeployed the resulting war file.
The correct favicon is in webapps/ROOT/img/ and
http://ese.rfe.org/favicon.ico shows the correct icon.

But, it shows inconsistently in Firefox and Internet Explorer on search
results and on http://ese.rfe.org in spite of clearing the cache and
history in both (in fact, after clearing them, it now doesn't show!).
Also, in Firefox, when I drag the blank icon from the address bar to my
list of shortcuts (term?) at the top of the browser, the correct icon
shows up there but still not on the address bar. Ugh!

Thanks,

   Bill

  




Re: Nutch shows same results multiple times.

2006-04-20 Thread Håvard W. Kongsgård

Like this

+http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
-.*

see: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg00479.html
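
Applied to the .ge question quoted below, an equivalent urlfilter entry would look something like this (a sketch; test it against your own URL list before crawling):

+^http://([a-z0-9-]+\.)*[a-z0-9-]+\.ge/
-.*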

Dima Mazmanov wrote:

I'm not adding urls into urlfilter files.
Besides, I still don't understand how to allow only one zone in 
urlfilter.

Let's say I want to index only .ge zone.
Which one of the following filters is correct?

+^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/
+^http://([a-z0-9\-\.]*\.)*.ge/
+^http://([a-z0-9\-\.])*.ge/
+^http://www\..*\.ge/
+^http://www\..*\.*\.ge/

By the way, if the site you are indexing is dynamic you may just
disallow indexing of
www.bbc.co.uk and index only the second one.



So what filter settings do you use?
Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/
Then you will get bbc.co.uk and www.bbc.co.uk, and
since this site is dynamic, the content might be different.
Have the same problem myself :-(




---
Well my script already contains this command




   Run bin/nutch dedup segments dedup.tmp


   Dima Mazmanov wrote:

   Hi all!! I'm running on nutch-0.7.1.

   Here is result of my search.


   ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
   Web Site Our web site has new look and ... link on the ...
   http://www.argosoft.org/RootPages/Default.aspx (Cached) ArGo
   Software Design Homepage [html] - 30.2 k - ... Look of our Web
   Site Our web site has new look and ... link on the ...
   http://www.argosoft.com/rootpages/Default.aspx (Cached) ArGo
   Software Design Homepage [html] - 30.2 k - ... Look of our Web
   Site Our web site has new look and ... link on the ...
   http://www.argosoft.com/RootPages/Default.aspx (Cached) ArGo
   Software Design Homepage [html] - 30.2 k - ... Look of our Web
   Site Our web site has new look and ... link on the ...
   http://www.argosoft.org/rootpages/Default.aspx (Cached)

   As you can see one result is shown multiple times.
   Why so? What is the difference between these links? I don't 
see any..

   So, how can I avoid this problem?
   Thanks, Regards, Dima















Re: Nutch shows same results multiple times.

2006-04-20 Thread Håvard W. Kongsgård

I don't know, but you can try upgrading to 0.7.2.


See Nutch Change Log:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158

Dima Mazmanov wrote:

Hi, Håvard.
Thank you again for your help.
Hmm, there is one more thing I'm curious about...
The search results for several sites display content like the following:

Cool-Warez
[html] - 19.1 k - 11/3/2006
... Avatars   გართობა   კონტაქტი 
როგორ მოვხსნათ www.sendspace.com Многие из Вас ... вопрос: Как качать 
сhttp://www
http://www.cool.caucasus.net/index_moxsna_2.htm (Cached) (More from 
www.cool.caucasus.net)


As you can see there are a lot of spaces between the words. Is this a bug or
what? Maybe it's because of different borders in the web page and nutch inserts
the spaces on its own?

Is there any way to avoid this problem?





Re: Nutch shows same results multiple times.

2006-04-19 Thread Håvard W. Kongsgård

So what filter settings do you use?
Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/
Then you will get bbc.co.uk and www.bbc.co.uk, and
since this site is dynamic, the content might be different.
Have the same problem myself :-(




---
Well my script already contains this command




   Run bin/nutch dedup segments dedup.tmp


   Dima Mazmanov wrote:
 


   Hi all!! I'm running on nutch-0.7.1.

   Here is result of my search.

   


   ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
   Web Site Our web site has new look and ... link on the ...
   http://www.argosoft.org/RootPages/Default.aspx (Cached) ArGo
   Software Design Homepage [html] - 30.2 k - ... Look of our Web
   Site Our web site has new look and ... link on the ...
   http://www.argosoft.com/rootpages/Default.aspx (Cached) ArGo
   Software Design Homepage [html] - 30.2 k - ... Look of our Web
   Site Our web site has new look and ... link on the ...
   http://www.argosoft.com/RootPages/Default.aspx (Cached) ArGo
   Software Design Homepage [html] - 30.2 k - ... Look of our Web
   Site Our web site has new look and ... link on the ...
   http://www.argosoft.org/rootpages/Default.aspx (Cached)

   As you can see one result is shown multiple times.
   Why so? What is the difference between these links? I don't see any..
   So, how can I avoid this problem?
   Thanks, Regards, Dima


   



Re: Nutch shows same results multiple times.

2006-04-18 Thread Håvard W. Kongsgård

Run bin/nutch dedup segments dedup.tmp


Dima Mazmanov wrote:

Hi all!! I'm running on nutch-0.7.1.

Here is result of my search.

ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web 
Site Our web site has new look and ... link on the ... 
http://www.argosoft.org/RootPages/Default.aspx (Cached)
ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web 
Site Our web site has new look and ... link on the ... 
http://www.argosoft.com/rootpages/Default.aspx (Cached)
ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web 
Site Our web site has new look and ... link on the ... 
http://www.argosoft.com/RootPages/Default.aspx (Cached)
ArGo Software Design Homepage [html] - 30.2 k - ... Look of our Web 
Site Our web site has new look and ... link on the ... 
http://www.argosoft.org/rootpages/Default.aspx (Cached)

As you can see one result is shown multiple times.
Why so? What is the difference between these links? I don't see any..
So, how can I avoid this problem?
Thanks, Regards, Dima






How to run bin/nutch dedup when running multiple servers

2006-04-15 Thread Håvard W. Kongsgård

Hi, I am running nutch 0.7.2 on 3 servers | 1 tomcat/db | 2 segment servers on port 8081 |.
Is it possible to run bin/nutch dedup across multiple servers so that nutch removes
all duplicated pages?



Re: Nutch 0.7.2 release | upgrading from 0.7.1?

2006-04-02 Thread Håvard W. Kongsgård

What about upgrading from 0.7.1? Can I use my existing db and segments?

Piotr Kosiorowski wrote:

Hello all,

The 0.7.2 release of Nutch is now available. This is a bug fix release 
for 0.7 branch. See CHANGES.txt 
(http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986) 

 for details. The release is available on 
http://lucene.apache.org/nutch/release/.


Regards,
Piotr





Re: nutch 0.7.1 where is the tutorial? crawldb not found?

2006-02-25 Thread Håvard W. Kongsgård

http://wiki.media-style.com/display/nutchDocu/Home


Roeland Weve wrote:


Hi,

I've installed Nutch 0.7.1 today on Windows XP with Cygwin and tried 
to follow the tutorial at:

http://lucene.apache.org/nutch/tutorial.html
But this tutorial seems to be written for another version of Nutch,
because, first of all, the DmozParser is not available (I couldn't find
it in the nutch-0.7.1.jar file, not under 'crawl', 'tools' or
anywhere else):

java.lang.NoClassDefFoundError: org/apache/nutch/crawl/DmozParser
java.lang.NoClassDefFoundError: org/apache/nutch/tools/DmozParser
Since I'm not really interested in Dmoz data, I continued with
injecting URLs of my own into the database (in the dmoz dir, the file is called 'urls',
with one url on each line). Unfortunately, I got stuck
again. I tried to execute:

bin/nutch inject crawl/crawldb dmoz
The error is:
 060225 212634 parsing 
file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-default.xml
 060225 212635 parsing 
file:/D:/cygwin/home/roeland/nutch-0.7.1/conf/nutch-site.xml
 Usage: WebDBInjector (-local | -ndfs <namenode:port>) <db_dir> 
(-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset 
<subsetDenominator>] [-includeAdultMaterial] [-skew skew] 
[-noDmozDesc] [-topicFile <topic list file>] [-topic <topic> [-topic 
<topic> [...]]]


So I tried to adjust the parameters, with something like:
 bin/nutch inject crawl/crawldb -urlfile dmoz/urls
But this leads to an exception:
Exception in thread "main" java.io.FileNotFoundException: 
crawl\crawldb\webdb\pagesByURL\data


There are some files in the crawldb dir, but not the webdb dir. Is 
there a possibility to create an empty or default database? Or do I 
need Nutch 0.8? If yes, where can I download it?
Hopefully this can be done with Nutch 0.7.1, because I'm not a 
hero at compiling stuff on Cygwin.


The only thing I want is to inject URLs that can be found in a plain 
text file, with one URL on each row. The next step is to crawl those 
URLs. The URLs are all different, so I am not interested in the 
intranet option of Nutch.


Hopefully someone can help me out with this problem.

Roeland






Re: Pdf document title in nutch search

2006-02-21 Thread Håvard W. Kongsgård

Take a look at the Google search result of this rand publication
http://www.google.com/search?hs=z0nhl=enlr=client=firefox-arls=org.mozilla%3Aen-US%3Aofficialq=Implementing+Security+Improvement+Options+at+Los+Angeles+International+Airport+btnG=Search

The pdf document (RAND_DB468-1.sum.pdf) has no pdf title, and google 
doesn't use the first 2 pages of the document for a title!




Jérôme Charron wrote:


It'd be nice if this was changed so that if a PDF has no title then the
first xx words become the new title.
   



I agree with that.
Please create a JIRA issue for this point.


 


(but it seems that the Google title process is more advanced that this)
   



Really?
Take a look at this :
http://www.google.com/search?num=100hl=frsafe=offc2coff=1as_qdr=allq=http%3A%2F%2Fwww.trellix.com%2Fproducts%2Fdownloads%2Fsearchengines_siteopt.pdfbtnG=Rechercherlr=
In fact Google always takes the first characters of the document as the
title.
Google never uses the Title property of the document.
So, when there are some shaded characters at the start of the pdf
document, you get a TTTiiitttllleee llliiikkkeee ttthhhaaattt ... is that
really advanced title processing?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

 





Pdf document title in nutch search

2006-02-20 Thread Håvard W. Kongsgård
When searching with nutch the title of pdf documents is a url to the 
file like:

http://www.ists.dartmouth.edu/library/wse0901.pdf

I have noticed that google and ultraseek creates a normal title like:
WebALPS: A Survey of E-Commerce Privacy and Security Applications

Is it possible to make nutch do the same?


Re: Pdf document title in nutch search

2006-02-20 Thread Håvard W. Kongsgård

Must I have index-more enabled to get the pdf titles to work?
I did a test with some pdf files; all pdf titles were ignored (nutch 0.7.1).



Håvard W. Kongsgård wrote:

It'd be nice if this was changed so that if a PDF has no title then 
the first xx words become the new title.

(but it seems that the Google title process is more advanced that this)



Jérôme Charron wrote:


When searching with nutch the title of pdf documents is a url to the
file like:
http://www.ists.dartmouth.edu/library/wse0901.pdf
  



In Nutch, the title of PDF file is displayed if a title is available,
otherwise the URL
of the document is displayed.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

 








Re: Problem/bug setting java_home in hadoop nightly 16.02.06

2006-02-17 Thread Håvard W. Kongsgård

Thanks it worked. Is there any other path I need to set?

# The java implementation to use.
export JAVA_HOME=/usr/lib/java



Doug Cutting wrote:


Have you edited conf/hadoop-env.sh, and defined JAVA_HOME there?

Doug

Håvard W. Kongsgård wrote:

I am unable to set java_home in bin/hadoop, is there a bug? I have 
used nutch 0.7.1 with the same java path.



localhost: Error: JAVA_HOME is not set.


if [ -f "${HADOOP_HOME}/conf/hadoop-env.sh" ]; then
 source "${HADOOP_HOME}/conf/hadoop-env.sh"
fi

# some Java parameters
if [ "$JAVA_HOME" != "/usr/lib/java" ]; then
 #echo "run java in $JAVA_HOME"
 JAVA_HOME=$JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
 echo "Error: JAVA_HOME is not set."
 exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

System: SUSE 10 64-bit | Java 1.4.2







Problem/bug setting java_home in hadoop nightly 16.02.06

2006-02-16 Thread Håvard W. Kongsgård
I am unable to set java_home in bin/hadoop, is there a bug? I have used 
nutch 0.7.1 with the same java path.



localhost: Error: JAVA_HOME is not set.


if [ -f "${HADOOP_HOME}/conf/hadoop-env.sh" ]; then
 source "${HADOOP_HOME}/conf/hadoop-env.sh"
fi

# some Java parameters
if [ "$JAVA_HOME" != "/usr/lib/java" ]; then
 #echo "run java in $JAVA_HOME"
 JAVA_HOME=$JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
 echo "Error: JAVA_HOME is not set."
 exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

System: SUSE 10 64-bit | Java 1.4.2


Re: Nutch inject problem with hadoop - Missing /tmp/hadoop/mapred/system

2006-02-15 Thread Håvard W. Kongsgård

I get the same error (15.02 nightly build)


Gal Nitzan wrote:


I am getting this error all the time. I can't start inject.

060215 183808 parsing file:/home/nutchuser/nutch/conf/hadoop-site.xml
Exception in thread "main" java.io.IOException: Cannot open
filename /tmp/hadoop/mapred/system/submit_p4w14i/job.jar
   at org.apache.hadoop.ipc.Client.call(Client.java:301)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
   at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source)
   at
org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:261)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:290)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
   at org.apache.nutch.crawl.Injector.main(Injector.java:138)

I noticed the system folder doesn't exist, and created it manually. Now
everything works, but I get strange behavior, like all task trackers
fetching from the same site?

Any idea?



 





Hung threads

2006-01-29 Thread Håvard W. Kongsgård
Hi, I have a problem with last Friday's nightly build. When I try to fetch 
my segment the fetch process freezes with "Aborting with 10 hung threads".
After failing, Nutch tries to run the same urls on another tasktracker 
but fails again.


I have tried turning fetcher.parse off, protocol-httpclient, protocol-http.

nutch-site.xml

<property>
 <name>fs.default.name</name>
 <value>linux3:5</value>
 <description>The name of the default file system.  Either the
 literal string "local" or a host:port for NDFS.</description>
</property>

<property>
 <name>mapred.job.tracker</name>
 <value>linux3:50020</value>
 <description>The host and port that the MapReduce job tracker runs
 at.  If "local", then jobs are run in-process as a single map
 and reduce task.
 </description>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
 In any case you need at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins.
 </description>
</property>

<property>
 <name>http.content.limit</name>
 <value>-1</value>
 <description>The length limit for downloaded content, in bytes.
 If this value is nonnegative (>=0), content longer than it will be truncated;
 otherwise, no truncation at all.
 </description>
</property>

<property>
 <name>fetcher.parse</name>
 <value>false</value>
 <description>If true, fetcher will parse content.</description>
</property>



Re: The parsing is part of the Map or part of the Reduce?

2006-01-28 Thread Håvard W. Kongsgård
So you have been following the quick tutorial for nutch 0.8 and later at 
media-style…

The author has left out the parse and updatedb parts.
After the fetch, simply run bin/nutch parse segment/2006 and then 
bin/nutch updatedb crawldb segment/2006xxx.
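
For reference, one full whole-web iteration with fetcher.parse turned off looks roughly like this (a sketch; the segment name is a placeholder and the paths depend on your layout):

bin/nutch generate crawldb segments
bin/nutch fetch segments/2006xxxxxxxxxx
bin/nutch parse segments/2006xxxxxxxxxx
bin/nutch updatedb crawldb segments/2006xxxxxxxxxx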


Rafit Izhak_Ratzin wrote:


Hi,
In what part of mapred is the parsing done: in the Map part or in 
the Reduce part?


Thanks,
Rafit








Re: The parsing is part of the Map or part of the Reduce?

2006-01-28 Thread Håvard W. Kongsgård

otherwise how does it get the next level of URLs?

bin/nutch updatedb crawldb segment/2006xxx

Rafit Izhak_Ratzin wrote:

I thought that by running the fetch command (bin/nutch fetch ...) it 
already does some kind of parsing; otherwise how does it get the next 
level of URLs?


And in this case, in what part is the parsing done: in the mapping or in 
the reducing of the fetch process?


Thanks again,
Rafit




From: Håvard W. Kongsgård [EMAIL PROTECTED]
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: The parsing is part of the Map or part of the Reduce?
Date: Sat, 28 Jan 2006 23:05:05 +0100

So you have been following the quick tutorial for nutch 0.8 and later 
at media-style…

The author has left out the parse and updatedb parts.
After the fetch, simply run bin/nutch parse segment/2006 and then 
bin/nutch updatedb crawldb segment/2006xxx.


Rafit Izhak_Ratzin wrote:


Hi,
In what part of mapred is the parsing done: in the Map part or in 
the Reduce part?


Thanks,
Rafit















Re: Parsing PDF Nutch Achilles heel?

2006-01-26 Thread Håvard W. Kongsgård

Could you create a new version from the latest xpdf version?
I know that the older versions of pdftotext (before October 2005) had 
some issues with PDF 1.6 (acrobat 7).

Sorry, my mistake!

I have now tested pdftotext and it's faster than pdfbox, but it doesn't 
prevent the nutch freezes.




Håvard W. Kongsgård wrote:


Could you create a new version from the latest xpdf version?
I know that the older versions of pdftotext (before October 2005) had 
some issues with PDF 1.6 (acrobat 7).




Doug Cutting wrote:


Steve Betts wrote:

I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot 
faster,

but it does allow it to complete.




I find xpdf much faster than PDFBox.

http://www.mail-archive.com/nutch-dev@incubator.apache.org/msg00161.html

Does this work any better for you?

Doug








Parsing PDF Nutch Achilles heel?

2006-01-25 Thread Håvard W. Kongsgård
I have been doing some testing on different nutch configurations to see 
what slows down the fetching process on my servers (nutch 0.7.1).

My general experience is that the PDF parse process is nutch's Achilles heel.

Nutch works fine on older computers, but with the combination of 
parse-(text|html|pdf)
and http.content.limit = -1 (needed to get PDF parsing to work) nutch 
sometimes freezes completely.


Is there any improvement planned to the parsing of PDF files in the next 
version of nutch (0.8)?



Re: Parsing PDF Nutch Achilles heel?

2006-01-25 Thread Håvard W. Kongsgård

PDFBox-0.7.2 or one of the nightly builds PDFBox-0.7.3-dev...

Steve Betts wrote:


I should have included the link, but I used PDFBox.

Thanks,

Steve Betts
[EMAIL PROTECTED]
937-477-1797


-Original Message-
From: Håvard W. Kongsgård [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 25, 2006 10:34 AM
To: nutch-user@lucene.apache.org
Subject: Re: Parsing PDF Nutch Achilles heel?

From where do I get the new version http://www.pdfbox.org/ or
http://svn.apache.org/viewcvs.cgi/lucene/nutch/



Steve Betts wrote:

 


There is a bug in the PDF parser tool used with 0.7. You can get a newer
version to replace the jars with the parse-pdf plugin and the freeze will go away.

Thanks,

Steve Betts
[EMAIL PROTECTED]
937-477-1797

-Original Message-
From: Håvard W. Kongsgård [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 25, 2006 10:10 AM
To: nutch-user@lucene.apache.org
Subject: Parsing PDF Nutch Achilles heel?

I have been doing some testing on different nutch configurations to see
what slows down the fetching process on my servers (nutch 0.7.1).
My general experience is that the PDF parse process is nutch's Achilles heel.

Nutch works fine on older computers, but with the combination of
parse-(text|html|pdf)
and http.content.limit = -1 (needed to get PDF parsing to work) nutch
sometimes freezes completely.

Is there any improvement planned to the parsing of PDF files in the next
version of nutch (0.8)?





   





 





Re: Injecting new url

2006-01-24 Thread Håvard W. Kongsgård
If your old urls have not expired (30 days) then a bin/nutch generate 
will process only the new urls.




Ennio Tosi wrote:


Hi, I created an index from an injected url. My problem is that if now
I inject another url in the webdb, the fetcher reprocesses the
starting url too... Is there a way to tell nutch to only process the
latest injected resource?

Thanks,
Ennio


 





Nutch system running on multiple servers | fetcher

2006-01-17 Thread Håvard W. Kongsgård
Hi, I have set up a nutch (0.7.1) system running on multiple servers 
following Stefan Groschupf's tutorial 
(http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever).
I already had a nutch index and a set of segments, so I copied some 
segments to different servers.
Now I want to add some new sites to my engine. This time, however, I don’t 
want to use the main box as a fetcher but a faster one with no local web db.
Can I simply create a new local web db on the new box and then store the 
locally generated segment in localsegements/segments/ ?




Re: Access pasword protected sites?

2006-01-13 Thread Håvard W. Kongsgård

No, the current version of nutch doesn't support password-protected sites;
sites that are password protected show up as http error 404 in the nutch log.



Andy Morris wrote:


Can nutch access password protected sites?
If so how?
Thanks,
Andy


 





Search result is an empty site

2006-01-09 Thread Håvard W. Kongsgård
Hi, I am running a nutch server with a db containing 20 docs. When I 
start tomcat and search for something the browser displays an empty site.

Is this a memory problem? How do I fix it?

System: 2,6 | Memory 1 GB | SUSE 9.2



Re: Search result is an empty site

2006-01-09 Thread Håvard W. Kongsgård
No, I use 0.7.1. I have tested nutch/tomcat with 20 000 docs so I know it 
works. Searching using site:, like "china site:www.fas.org", also works.


Dominik Friedrich wrote:

If you use the mapred version from svn trunk you might have run into 
the same problem as I have. In the mapred version the searcher.dir 
property in nutch-default.xml is set to "crawl" and not "." anymore. If 
you use this version you either have to put the index and the segments 
dirs into a folder called "crawl" and start tomcat from above that 
folder, or change that value in the nutch-site.xml in 
webapps/ROOT/WEB-INF/classes of your tomcat nutch deployment.


regards
Dominik

Håvard W. Kongsgård wrote:

Hi, I am running a nutch server with a db containing 20 docs. 
When I start tomcat and search for something the browser displays an 
empty site.

Is this a memory problem, how do I fix it?

System: 2,6 | Memory 1 GB | SUSE 9.2













fetcher.threads.per.host bug in 0.7.1?

2006-01-09 Thread Håvard W. Kongsgård
Is there a bug in 0.7.1 that causes the fetcher.threads.per.host setting 
to be ignored?




Nutch-site.xml

<property>
  <name>fetcher.server.delay</name>
  <value>15.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>3</value>
  <description>The number of FetcherThreads the fetcher should use.
    This is also determines the maximum number of requests that are
    made at once (each FetcherThread handles one
    connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>



Fetch Log

060109 202235 fetching http://www.fas.org/irp/news/1998/06/prs_rel21.html
060109 202250 fetch of 
http://www.fas.org/irp/news/1998/04/t04141998_t0414asd-3.html failed 
with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 202250 fetch of 
http://www.fas.org/asmp/campaigns/smallarms/sawgconf.PDF failed with: 
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded 
http.max.delays: retry later.

060109 202250 fetching http://www.fas.org/irp/commission/testhaas.htm
060109 202250 fetching http://www.fas.org/asmp/profiles/bahrain.htm
060109 202250 fetching 
http://www.fas.org/irp/cia/product/dci_speech_03082001.html

060109 202306 fetching http://www.fas.org/irp/news/1998/06/980609-drug10.htm
060109 202321 fetch of http://www.fas.org/irp/commission/testhaas.htm 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 202321 fetch of http://www.fas.org/asmp/profiles/bahrain.htm 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 202321 fetching 
http://www.fas.org/irp/news/1998/04/980422-terror2.htm

060109 202321 fetching http://www.fas.org/irp//congress/2004_cr/index.html
060109 202321 fetching http://www.fas.org/irp//congress/2001_rpt/index.html
060109 202338 fetching http://www.fas.org/irp/budget/fy98_navy/0601152n.htm
060109 202354 fetching http://www.fas.org/irp/dia/product/cent21strat.htm
060109 202408 fetch of 
http://www.fas.org/irp/news/1998/04/980422-terror2.htm failed with: 
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded 
http.max.delays: retry later.
060109 202408 fetch of 
http://www.fas.org/irp//congress/2004_cr/index.html failed with: 
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded 
http.max.delays: retry later.

060109 202408 fetching http://www.fas.org/faspir/2001/v54n2/qna.htm
060109 202408 fetching http://www.fas.org/graphics/predator/index.htm
060109 202409 fetching 
http://www.fas.org/irp/doddir/dod/5200-1r/chapter_6.htm

060109 202425 fetching http://www.fas.org/irp//congress/1995_hr/140.htm



Re: Search result is an empty site

2006-01-09 Thread Håvard W. Kongsgård

Never mind, I solved it.

For tomcat 5, run

export JAVA_OPTS="-Xmx128m -Xms128m"



Håvard W. Kongsgård wrote:

No, I use 0.7.1. I have tested nutch/tomcat with 20 000 docs so I know 
it works. Searching using site:, like "china site:www.fas.org", also works.


Dominik Friedrich wrote:

If you use the mapred version from svn trunk you might have run into 
the same problem as I have. In the mapred version the searcher.dir 
property in nutch-default.xml is set to "crawl" and not "." anymore. If 
you use this version you either have to put the index and the 
segments dirs into a folder called "crawl" and start tomcat from above 
that folder, or change that value in the nutch-site.xml in 
webapps/ROOT/WEB-INF/classes of your tomcat nutch deployment.


regards
Dominik

Håvard W. Kongsgård wrote:

Hi, I am running a nutch server with a db containing 20 docs. 
When I start tomcat and search for something the browser displays an 
empty site.

Is this a memory problem, how do I fix it?

System: 2,6 | Memory 1 GB | SUSE 9.2

















No cluster results

2006-01-09 Thread Håvard W. Kongsgård

"No cluster results" is displayed next to the search results.
Is this because I turned clustering on after running the fetch and the 
indexing?


nutch-site.xml

<property>
 <name>plugin.includes</name>
 <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)|clustering-carrot2</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
 In any case you need at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins.
 </description>
</property>



Re: Out of memory exception-while updating

2005-12-20 Thread Håvard W. Kongsgård

<property>
 <name>indexer.max.tokens</name>
 <value>1</value>
 <description>
 The maximum number of tokens that will be indexed for a single field
 in a document. This limits the amount of memory required for
 indexing, so that collections with very large files will not crash
 the indexing process by running out of memory.

 Note that this effectively truncates large documents, excluding
 from the index tokens that occur further in the document. If you
 know your source documents are large, be sure to set this value
 high enough to accomodate the expected size. If you set it to
 Integer.MAX_VALUE, then the only limit is your memory, but you
 should anticipate an OutOfMemoryError.
 </description>
</property>

http://wiki.media-style.com/display/nutchDocu/Hardware

K.A.Hussain Ali wrote:


Hi all,

   I am using Nutch to crawl some sites but I get an Out Of Memory Error
   when I try updating the webdb with a good amount of URLs.

I tried to find a solution on the mailing list but found nothing.

Could anyone offer a suggestion on this?

How much of RAM do Nutch requires for proper updation and indexing with a lack 
of URL's ?

Any help would be greatly appreciated

regards
-Hussain.

 




 





Re: is nutch recrawl possible?

2005-12-19 Thread Håvard W. Kongsgård
About this blocking: you can try to use the urlfilters and change the 
filter between each fetch/generate.


+^http://www.abc.com

-^http://www.bbc.co.uk


Pushpesh Kr. Rajwanshi wrote:


Oh, this is pretty good and quite helpful material, just what I wanted. Thanks Havard
for this. Seems like this will help me write code for the stuff I need :-)

Thanks and Regards,
Pushpesh



On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:
 


Try using the whole-web fetching method instead of the crawl method.

http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling

http://wiki.media-style.com/display/nutchDocu/quick+tutorial


Pushpesh Kr. Rajwanshi wrote:

   


Hi Stefan,

Thanks for the lightning-fast reply. I was amazed to see such a quick response;
really appreciate it.

Actually what I am really looking for is: suppose I run a crawl for some
sites, say 5, and for some depth, say 2. Then what I want is that next time I run
a crawl it should reuse the webdb contents which it populated the first time.
(Assuming a successful crawl. Yes, you are right, a suddenly broken-down
crawl won't work as it has lost its integrity of data.)

As you said, we can run the tools provided by nutch to do the step-by-step
commands needed to crawl, but isn't there some way I can reuse the existing crawl
data? Maybe it involves changing code but that's ok. Just one more quick
question: why does every crawl need a new directory, and isn't there an option
to at least reuse the webdb? Maybe I am asking something silly but I am
clueless :-(

Or, as you said, maybe what I can do is to explore the steps you mentioned
and get what I need.

Thanks again,
Pushpesh


On 12/19/05, Stefan Groschupf [EMAIL PROTECTED] wrote:


 


It is difficult to answer your question since the vocabulary used may be
wrong.
You can refetch pages, no problem. But you cannot continue a crashed
fetch process.
Nutch provides a tool that runs a set of steps like segment
generation, fetching, db updating etc.
So maybe first try to run these steps manually instead of using the
crawl command.
Then you may already get an idea where you can jump in to grab
your needed data.

Stefan

Am 19.12.2005 um 14:46 schrieb Pushpesh Kr. Rajwanshi:



   


Hi,

I am crawling some sites using nutch. My requirement is: when I run
a nutch crawl, then somehow it should be able to reuse the data in the webdb
populated in the previous crawl.

In other words, my question is: suppose my crawl is running and I
cancel it somewhere in the middle, is there some way I can resume the crawl?

I don't know if I can do this at all; if there is some way,
then please throw some light on this.

TIA

Regards,
Pushpesh


 

   





   

 

   



 




 





Re: Problem with fetching segment

2005-12-13 Thread Håvard W. Kongsgård

Sorry I misunderstood the way whole-web crawling works.

One more question: how do I re-fetch the failed urls (those that failed with
"java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later.")?


Is this controlled by



<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>



Stefan Groschupf wrote:

Sorry, I still do not understand what your problem is; maybe it is time
for the weekend... :-)


From your very first mail there is exactly the same in the log:..
060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null

Isn't that the same as


060109 154712 fetching http://www.niap.no/magasinet/layout/set/print



In any case those are just logging statements; what makes you guess that
something crashed?


Stefan




Am 09.12.2005 um 17:44 schrieb Håvard W. Kongsgård:

But when I fetch the other domains (www.sf.net) the output is only:


060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; http:// 
lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer:  
org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors,  
51033 bytes, 8309 ms
060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0  
bytes/page


there is no output like
060109 154712 fetching http://www.niap.no/magasinet/layout/set/print
060109 154712 fetching http://www.niap.no/magasinet/kontakt_oss
060109 154712 fetching http://www.niap.no/magasinet/ezinfo/about
060109 154712 fetching http://www.niap.no/index.php/magasinet/ 
nyheter/midt_sten



Stefan Groschupf wrote:




What is  java.net.SocketTimeoutException?




Cannot connect to the server.

In general you hammer your webserver and it may block the ip of
your server.
You can set up how many threads per host are allowed to load from one
host server.
For an intranet crawl it is a good idea to have fewer threads
(maybe just as many as you plan to use at the same time for the host),
e.g. fetcherThreads = 2, maxThreadsPerHost = 2.
If you have more threads you should increase the retry / delay
configuration, since in case a host is busy with the maximal
threads per host the thread is delayed.
If a thread is delayed too often then you get an "Exceeded
http.max.delays: retry later".


Sometimes I'm asking myself if not a queue based fetching would  be  
better the actually implementation, however this is difficult  to 
change.

HTH
Stefan
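
A minimal conf/nutch-site.xml sketch along those lines (the values are only illustrative; fetcher.server.delay and http.max.delays are the properties that show up in the fetch logs in this thread, while fetcher.threads.fetch and fetcher.threads.per.host are assumed here to be the property names behind Stefan's informal fetcherThreads / maxThreadsPerHost):

<property>
  <name>fetcher.threads.fetch</name>
  <value>2</value>
  <!-- total number of fetcher threads -->
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
  <!-- threads allowed to talk to one host at the same time -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <!-- seconds to wait between successive requests to the same host -->
</property>
<property>
  <name>http.max.delays</name>
  <value>100</value>
  <!-- how often a thread may be delayed before failing with "retry later" -->
</property>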
















Re: Problem with fetching segment

2005-12-09 Thread Håvard W. Kongsgård
When I feed my domain into the database the segment fetch output was 
like this:



-.-.-.-.-.-.-.-.-.-.-.-.-
060109 154622 fetching 
http://www.niap.no/magasinet/nyheter/nord_amerika/usa/israelsk_lobby_sparker_to_ansatte

060109 154622 fetching http://www.niap.no/magasinet/nyheter/afrika
060109 154622 fetching http://www.niap.no/magasinet/nyheter/asia_australia
060109 154622 fetching 
http://www.niap.no/magasinet/nyheter/midtoesten/libya/eu_oensker_aa_oppheve_forbudet_mot_vaapenhandel_med_libya

060109 154622 fetching http://www.niap.no/magasinet/rss/feed/magasinet_rss1
060109 154622 fetching http://www.niap.no/magasinet/content/search
060109 154622 fetching 
http://www.niap.no/magasinet/nyheter/europa/tyrkia/tyrkia_vil_innfoere_fengselstraff_for_utroskap
060109 154622 fetching 
http://www.niap.no/magasinet/nyheter/europa/russland/stalin_vender_tilbake

060109 154622 fetching http://www.niap.no/magasinet/nyheter/nord_amerika
060109 154626 fetch okay, but can't parse 
http://www.niap.no/magasinet/rss/feed/magasinet_rss1, reason: 
failed(2,203): Content-Type not text/html: text/xml
060109 154626 fetching 
http://www.niap.no/magasinet/nyheter/midtoesten/irak/al_queida

060109 154633 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 154633 fetching http://www.niap.no/magasinet/niap/test
060109 154639 fetching 
http://www.niap.no/magasinet/nyheter/europa/italia/pave_benedict_xvi
060109 154642 fetch of 
http://www.niap.no/magasinet/nyheter/nord_amerika/usa/israelsk_lobby_sparker_to_ansatte 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 154642 fetch of 
http://www.niap.no/magasinet/nyheter/asia_australia failed with: 
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded 
http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/afrika 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.

060109 154642 fetching http://www.niap.no/magasinet/nyheter/soer_amerika
060109 154642 fetch of 
http://www.niap.no/magasinet/nyheter/europa/tyrkia/tyrkia_vil_innfoere_fengselstraff_for_utroskap 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 154642 fetch of 
http://www.niap.no/magasinet/nyheter/midtoesten/palestina_israel/israel_bekymret_for_landets_internasjonale_image 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 154642 fetch of 
http://www.niap.no/magasinet/nyheter/midtoesten/libya/eu_oensker_aa_oppheve_forbudet_mot_vaapenhandel_med_libya 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/content/search 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 154642 fetching 
http://www.niap.no/index.php/magasinet/nyheter/s_r_amerika


-.-.-.-.-.-.-
But then

-.-.-.-.-.-
060109 154714 fetch of http://phpadsnew.niap.no/adx.js failed with: 
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded 
http.max.delays: retry later.
060109 154714 fetching 
http://www.niap.no/magasinet/nyheter/midtoesten/syria/russland_selger_luftforsvarssystem_til_syria
060109 154722 fetch of http://www.niap.org/ failed with: 
java.lang.Exception: java.net.SocketTimeoutException: connect timed out
060109 154724 fetch of 
http://www.niap.no/index.php/magasinet/nyheter/nord_amerika failed with: 
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded 
http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet failed with: 
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded 
http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/kontakt_oss failed 
with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 154724 fetch of 
http://www.niap.no/magasinet/magasinet/om_magasinet failed with: 
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded 
http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/layout/set/print 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 154729 fetch of 
http://www.niap.no/magasinet/nyheter/midtoesten/syria/russland_selger_luftforsvarssystem_til_syria 
failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
Exceeded http.max.delays: retry later.
060109 154730 status: segment 20060109154516, 12 pages, 31 errors, 
181559 bytes, 68511 ms
060109 154730 status: 0.17515436 pages/s, 20.703678 kb/s, 15129.917 
bytes/page


-.-.-.-.-.-
What is  java.net.SocketTimeoutException?




Håvard W. Kongsgård wrote:

Is the fetcher not supposed to fetch all the docs?

Problem with fetching segment

2005-12-08 Thread Håvard W. Kongsgård
I have followed the media-style.com quick tutorial, but when I try to 
fetch my segment the fetch is killed!


I have tried setting the system clock forward 30 days; no anti-virus is running 
on the systems.

System SUSE 9.2 and SUSE 10

# bin/nutch fetch segments/20060109014654/
060109 014714 parsing 
file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-default.xml

060109 014715 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-site.xml
060109 014715 No FS indicated, using default:local
060109 014715 Plugins: looking in: /home/hkongsgaard/nutch-0.7.1/plugins
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/query-more
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/query-site/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.site.SiteQueryFilter
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/parse-html/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.html.HtmlParser
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/parse-text/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.text.TextParser

060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-ext
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-pdf
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-rss
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/query-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.basic.BasicQueryFilter
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/index-more

060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-js
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
060109 014715 impl: point=org.apache.nutch.net.URLFilter 
class=org.apache.nutch.net.RegexURLFilter
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/protocol-ftp
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/parse-msword
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/creativecommons

060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/ontology
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/protocol-file
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/protocol-http/plugin.xml
060109 014715 impl: point=org.apache.nutch.protocol.Protocol 
class=org.apache.nutch.protocol.http.Http
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/clustering-carrot2
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/language-identifier
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-prefix
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/query-url/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.url.URLQueryFilter
060109 014715 parsing: 
/home/hkongsgaard/nutch-0.7.1/plugins/index-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.indexer.IndexingFilter 
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060109 014715 not including: 
/home/hkongsgaard/nutch-0.7.1/plugins/protocol-httpclient

060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null
060109 014715 http.proxy.port = 8080
060109 014715 http.timeout = 1
060109 014715 http.content.limit = -1
060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; 
http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 
bytes, 8309 ms

060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page



Re: Crawl auto updated in nutch?

2005-11-29 Thread Håvard W. Kongsgård
So how do I update a crawl? The updating section of the FAQ is empty :-( 
http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6 
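
A rough sketch with the 0.7-style command-line tools (directory names and the -topN value are placeholders; generate only selects pages that are still unfetched or whose db.default.fetch.interval has expired):

bin/nutch generate db segments -topN 1000   # new fetchlist of pages due for (re)fetching
s=`ls -d segments/2* | tail -1`             # the newly generated segment
bin/nutch fetch $s                          # (re)fetch those pages
bin/nutch updatedb db $s                    # update the webdb with the results
bin/nutch index $s                          # index the new segment for searching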





Doug Cutting wrote:


Håvard W. Kongsgård wrote:

- I want to index about 50 – 100 sites with lots of documents; is it 
best to use the Intranet Crawling or the Whole-web Crawling method?




The intranet style is simpler and hence a good place to start.  If 
it doesn't work well for you then you might try the whole-web style.



- Is the crawl auto-updated in nutch, or must I run a cron task?




It is not auto-updated.

Doug








Re: Crawl auto updated in nutch?

2005-11-28 Thread Håvard W. Kongsgård
So how do I update a crawl? The updating section of the FAQ is empty! 
http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6


Doug Cutting wrote:


Håvard W. Kongsgård wrote:

- I want to index about 50 – 100 sites with lots of documents; is it 
best to use the Intranet Crawling or the Whole-web Crawling method?



The intranet style is simpler and hence a good place to start.  If 
it doesn't work well for you then you might try the whole-web style.



- Is the crawl auto-updated in nutch, or must I run a cron task?



It is not auto-updated.

Doug




Crawl auto updated in nutch?

2005-11-25 Thread Håvard W. Kongsgård

Hello, I still have some questions about nutch.

- I want to index about 50 – 100 sites with lots of documents; is it 
best to use the Intranet Crawling or the Whole-web Crawling method?


- Is the crawl auto-updated in nutch, or must I run a cron task?



Intranet crawl folder

2005-11-22 Thread Håvard W. Kongsgård

Hi, I am still testing nutch 0.7.1, but now I have another problem.
When I do a normal intranet crawl on some web folders with 2000 PDFs, 
nutch only fetches 47 PDFs from each folder.




Re: Intranet crawl folder

2005-11-22 Thread Håvard W. Kongsgård

Do you mean http.content.limit? I have set it to -1 already.

There are no "Content truncated at 65536 bytes. Parser can't handle 
incomplete" errors in the log.



Stefan Groschupf wrote:


Check the maximal content limit in nutch-default.xml

On 22.11.2005 at 16:38, Håvard W. Kongsgård wrote:


Hi, I am still testing nutch 0.7.1, but now I have another problem.
When I do a normal intranet crawl on some web folders with 2000  
PDFs, nutch only fetches 47 PDFs from each folder.












Re: Images

2005-11-22 Thread Håvard W. Kongsgård
If you want an out-of-the-box solution with another search engine, try 
this link: http://www.searchtools.com/info/multimedia-search.html


But I don't know whether any of them are open source :-(

Aled Jones wrote:


Hi

It's not very clear from the nutch site what nutch can do with images.
Currently you can set the crawler not to ignore images, but it will only
parse text data.
Can it do an image search like Google?

Kind Regards

Aled








 









Re: PDF indexing support?

2005-11-16 Thread Håvard W. Kongsgård

Thanks, it worked.


Jérôme Charron wrote:


The value you specified is bigger than the maximal int value, so it
throws an exception, and then the default value is used.
As mentioned in the property's description, use a negative value (-1) for
no truncation at all (or a value less than java.lang.Integer.MAX_VALUE).

Regards

Jérôme
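
A minimal conf/nutch-site.xml override along those lines (only a sketch; file.content.limit can be overridden the same way if content is fetched via the file protocol):

<?xml version="1.0"?>
<nutch-conf>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <!-- a negative value disables truncation of downloaded content -->
</property>
</nutch-conf>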

On 11/16/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:
 


I have now added conf/nutch-site.xml but still have the same problem. Related
to the problem? http://sourceforge.net/forum/message.php?msg_id=3391668
http://sourceforge.net/forum/message.php?msg_id=3398773

   


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<nutch-conf>
<property>
  <name>http.content.limit</name>
  <value>45451515565536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
</nutch-conf>

Håvard W. Kongsgård wrote:

 


HTTP


Sébastien LE CALLONNEC wrote:

   


Hej Håvard,

That's because you have to create one yourself. The values you will
set in there will override the default values.

Here are a few more questions to try to solve your problem: where is
your PDF located? What protocol is used to fetch it (HTTP, FTP, etc.)?


Regards,
/sebastien

--- Håvard W. Kongsgård [EMAIL PROTECTED] a écrit :



 


Don't have a conf/nutch-site.xml



Jérôme Charron wrote:



   


conf/nutch-default


   


Check that they are not overridden in conf/nutch-site.
If not, sorry, no more ideas for now :-(

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/




 



   



   



   




   



 



   








 






 



   





--
http://motrech.free.fr/
http://www.frutch.org/

 









Re: PDF indexing support?

2005-11-15 Thread Håvard W. Kongsgård

conf/nutch-default



Jérôme Charron wrote:


http.content.limit=542256565536 and file.content.limit=4541165536
still the same error:
   



where do you specify these values? in nutch-default or nutch-site?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

 









Re: PDF indexing support?

2005-11-15 Thread Håvard W. Kongsgård

Don't have a conf/nutch-site.xml



Jérôme Charron wrote:


conf/nutch-default
   



Check that they are not overridden in conf/nutch-site.
If not, sorry, no more ideas for now :-(

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

 









PDF indexing support?

2005-11-14 Thread Håvard W. Kongsgård

Hello, I am new to nutch; how do I enable PDF indexing support?
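
A sketch of the usual fix, assuming a 0.7-style setup: add the parse-pdf plugin (which the fetch logs above report as "not including") to plugin.includes in conf/nutch-site.xml. The value below is only an example; derive it from the plugin.includes entry in your own nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <!-- parse-pdf added so PDF documents are parsed; raise http.content.limit if large PDFs get truncated -->
</property>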