Re: [sed] Extract domain name from URL

2010-01-18 Thread Mischa Tuffield
Not only did it remove everything after .com/, but it also removed the .com/ itself. How do I rewrite it so that I keep the .com/ and end up with http://www.mydomain.com/? Thanks! ___ Mischa Tuffield Email: mischa.tuffi...@garlik.com Homepage - http://mmt.me.uk/ Garlik
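The sed command under discussion is not shown in this excerpt, so the following is only a sketch of one way to get the requested result, not the solution given in the thread. It assumes a sed that supports -E (extended regular expressions, as in GNU and BSD sed); the function name base_url is made up for illustration. The idea is to capture the scheme and host in a group, discard the rest of the URL, and write the trailing slash back explicitly so that the .com/ survives.

```shell
# Hypothetical helper (not from the thread): print "scheme://host/" for
# each URL passed as an argument.
base_url() {
  for url in "$@"; do
    # Group 1 captures "scheme://host" up to the first "/", "?" or "#".
    # Everything after it is matched by ".*" and dropped, and a single
    # trailing slash is added back in the replacement.
    # A comma is used as the s-command delimiter so that the slashes in
    # the pattern need no escaping.
    printf '%s\n' "$url" |
      sed -E 's,^([a-zA-Z][a-zA-Z0-9+.-]*://[^/?#]+).*$,\1/,'
  done
}

base_url "http://www.mydomain.com/some/page.html?x=1"
# prints http://www.mydomain.com/
```

Input that does not start with a scheme followed by :// is passed through unchanged, since the substitution simply does not match.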

Re: crawl result is empty

2010-01-11 Thread Mischa Tuffield
Hi, Perhaps you are crawling and writing to HDFS? Have you checked the directory structure of the nutch user in your Hadoop DFS? I was caught out by that early on. Mischa On 11 Jan 2010, at 09:12, zud praveenmotur...@gmail.com wrote: i have run nutch 1.0 in

Re: crawl result is empty

2010-01-11 Thread Mischa Tuffield
archive at Nabble.com. ___ Mischa Tuffield Email: mischa.tuffi...@garlik.com Homepage - http://mmt.me.uk/ Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK +44(0)20 8973 2465 http://www.garlik.com/ Registered in England and Wales 535 7233 VAT # 849 0517 11 Registered

Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Mischa Tuffield
: java.lang.StackOverflowError On Mon, Jan 11, 2010 at 11:50 AM, Eric Osgood e...@lakemeadonline.com wrote: Do you have to set the -Xss flag somewhere else? Yes, in bin/nutch - looking for where it sets -Xmx - Godmar

Re: Crawl specific urls and depth argument

2010-01-08 Thread Mischa Tuffield
. Can I accomplish this by setting the depth argument for 'crawl' to 0? If I set the depth to 0, I get a message that says No URLs to fetch - check your seed list and URL filters.. Any help will be greatly appreciated. Thanks, Kumar.

Re: Crawl specific urls and depth argument

2010-01-08 Thread Mischa Tuffield
. Is that correct? Regards, Kumar. Mischa Tuffield wrote: Hello Kumar, There is a config property you can set in conf/nutch-site.xml, as follows: <!-- <property> <name>db.max.outlinks.per.page</name> <value>0</value> <description>The maximum number of outlinks that we'll process for a page

Re: Enabling Query Strings in *filter.txt files

2010-01-08 Thread Mischa Tuffield

Re: Nutch search works, but no results in Tomcat

2009-12-18 Thread Mischa Tuffield
? -- -MilleBii-

Re: converting nutch crawl output to human readable content

2009-12-15 Thread Mischa Tuffield
crawl/crawldb/current/part-0/ data .data.crc index .index.crc How do I convert the output to a human-readable format? Thanks

Re: newbie questions

2009-12-01 Thread Mischa Tuffield
: also, I would like to know how to extract flat text files of the crawl data.

Broken segments ?

2009-11-26 Thread Mischa Tuffield
in context: http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

Re: Nutch config IOException

2009-11-25 Thread Mischa Tuffield
Hi Andrzej, Yeah, I just noticed that this stack trace is for DEBUG purposes only; I found it in the Hadoop source. Thanks for the info. Regards, Mischa On 25 Nov 2009, at 13:11, Andrzej Bialecki wrote: Mischa Tuffield wrote: Hello Again, Following my previous post below, I have noticed

Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield