Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
formation Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > ___ Mischa Tuffield Email: mischa.tuffi...@garlik.com Homepage - http://mmt.me.uk/ Garlik Limited, 2 Sheen Road, R

Nutch config IOException

2009-11-25 Thread Mischa Tuffield
, Mischa Tuffield wrote: > Hello All, > > I am getting the following error in my hadoop.log (see below). It seems to > happen every time I run any of the nutch command line tools :( > > > > Does anyone know what problem I am having? > > Cheers, > > M

Re: Nutch config IOException

2009-11-25 Thread Mischa Tuffield
Hi Andrzej, Yeah, I just noticed that this stack trace is for DEBUG purposes only; I found it in the Hadoop source. Thanks for the info. Regards, Mischa On 25 Nov 2009, at 13:11, Andrzej Bialecki wrote: > Mischa Tuffield wrote: >> Hello Again, Following my previous post below, I hav

Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
ate the signatures. >>> >>> >>> >>> -- >>> Best regards, >>> Andrzej Bialecki <>< >>> ___. ___ ___ ___ _ _ __ >>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web >

Broken segments ?

2009-11-26 Thread Mischa Tuffield
set this default encoding (is > UTF-8?) to the one that I need (ASCII I guess). > > Thanks in advance ;) > -- > View this message in context: > http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html > Sent from the Nutch - User mailing list archiv

Re: newbie questions

2009-12-01 Thread Mischa Tuffield
files of the crawl data. _______ Mischa Tuffield Email: mischa.tuffi...@garlik.com Homepage - http://mmt.me.uk/ Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK +44(0)20 8973 2465 http://www.garlik.com/ Registered in England and Wales 535 7233 VAT # 849 0517 11 Registered office: T

Re: converting nutch crawl output to human readable content

2009-12-15 Thread Mischa Tuffield
lowing: > > ls crawl/crawldb/current/part-0/ > data.data.crc index .index.crc > > How do I convert the output to human readable format ? > > Thanks _______ Mischa Tuffield Email: mischa.tuffi...@garlik.com Homepage - http://mmt.me.

Re: Nutch search works, but no results in Tomcat

2009-12-18 Thread Mischa Tuffield
>> proprietary. The information is intended for the use of the >>>>>>>> individual >>>>>>>> or entity named above. If you are not the intended recipient, be >>>>>>>> aware >>>>>>>> that any disclosure, copying, distribution, or use of the contents of

Re: Crawl specific urls and depth argument

2010-01-08 Thread Mischa Tuffield
> > Can I accomplish this by setting the depth argument for 'crawl' to "0"? > > If I set the depth to 0, I get a message that says "No URLs to fetch - check > your seed list and URL filters.". > > Any help will be greatly appreciated.

Re: Crawl specific urls and depth argument

2010-01-08 Thread Mischa Tuffield
etc., it will never crawl any of the > outlinks. Is that correct? > > Regards, > Kumar. > > Mischa Tuffield wrote: >> Hello Kumar, >> There is a config property you can set in conf/nutch-site.xml, as follows : >> >> This will force nutch to only fetch

Re: Enabling Query Strings in *filter.txt files

2010-01-08 Thread Mischa Tuffield
ike everything is running fine but then the index > doesn't get created. > > Thanks, > Kumar. ___ Mischa Tuffield Email: mischa.tuffi...@garlik.com Homepage - http://mmt.me.uk/ Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK +44(0)20 8973 2465 htt

Re: crawl result is empty

2010-01-11 Thread Mischa Tuffield
Hi, Perhaps you are crawling and writing to HDFS? Have you checked the directory structure of the nutch user in your Hadoop DFS? I was caught out by that early on. Mischa Sent on the move On 11 Jan 2010, at 09:12, zud wrote: i have run nutch 1.0 in eclipse in linux every thing wor

Re: crawl result is empty

2010-01-11 Thread Mischa Tuffield
09669.html > Sent from the Nutch - User mailing list archive at Nabble.com. > ___ Mischa Tuffield Email: mischa.tuffi...@garlik.com Homepage - http://mmt.me.uk/ Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK +44(0)20 8973 2465 http://www.garlik.com/ Regis

Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Mischa Tuffield
>>>> Do you have to set the -Xss flag somewhere else? >>> >>> Yes, in bin/nutch - looking for where it sets -Xmx >>> >>> - Godmar >> >> > > ___ Mischa Tuffield Email: mischa.tuffi...@garlik.com Homepage - http://mmt.me.uk/ Garlik Limited, 2 Sheen Roa

Re: [sed] Extract domain name from URL

2010-01-18 Thread Mischa Tuffield
've tried sed -e 's/.com\/.*//g' 1 >> 2, and got this output > http://www.mydomain > > > Not only it removed everything after .com/, but it also removed the .com/ > > How do I rewrite it, so I could keep the .com/ to have > http://www.mydomain.com/ >