Not only did it remove everything after .com/, it also removed the .com/ itself.
How do I rewrite it so that I keep the .com/ and end up with
http://www.mydomain.com/
Thanks!
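One way to express the rewrite asked about above, as a minimal sketch: keep the scheme and host, and replace whatever follows with a single slash. The pattern and the function name `strip_to_root` are illustrative assumptions, not taken from the original message; the same regex idea would need adapting to wherever the original rewrite rule lives.

```python
import re

# Capture the scheme and host (everything up to the first "/", "?" or "#"
# after "://"); the optional tail is matched so it can be discarded.
_ROOT_PATTERN = re.compile(r"^(https?://[^/?#]+)(?:[/?#].*)?$")


def strip_to_root(url: str) -> str:
    """Reduce a URL to scheme://host/ (hypothetical helper for illustration)."""
    match = _ROOT_PATTERN.match(url)
    if match is None:
        # Not an http(s) URL: leave it untouched rather than guess.
        return url
    # Re-append the trailing slash that a naive pattern would drop.
    return match.group(1) + "/"
```

The key point is that the slash is outside the captured group, so it has to be added back explicitly in the replacement.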
___
Mischa Tuffield
Email: mischa.tuffi...@garlik.com
Homepage - http://mmt.me.uk/
Garlik
Hi,
Perhaps you are crawling and writing to HDFS? Have you checked the
directory structure of the nutch user in your Hadoop DFS? I was caught
out by that early on.
Mischa
Sent on the move
On 11 Jan 2010, at 09:12, zud praveenmotur...@gmail.com wrote:
I have run Nutch 1.0 in
___
Mischa Tuffield
Email: mischa.tuffi...@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered
: java.lang.StackOverflowError
On Mon, Jan 11, 2010 at 11:50 AM, Eric Osgood e...@lakemeadonline.com
wrote:
Do you have to set the -Xss flag somewhere else?
Yes, in bin/nutch. Look for where it sets -Xmx.
- Godmar
___
.
Can I accomplish this by setting the depth argument for 'crawl' to 0?
If I set the depth to 0, I get a message that says "No URLs to fetch - check
your seed list and URL filters."
Any help will be greatly appreciated.
Thanks,
Kumar.
___
. Is that correct?
Regards,
Kumar.
Mischa Tuffield wrote:
Hello Kumar,
There is a config property you can set in conf/nutch-site.xml, as follows :
<!--
<property>
  <name>db.max.outlinks.per.page</name>
  <value>0</value>
  <description>The maximum number of outlinks that we'll process for a page
.
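To illustrate what a per-page outlink limit of this kind means, here is a small sketch. It assumes the behaviour I understand the full property description to state (a non-negative value caps the number of outlinks processed for a page, a negative value means no cap); `limit_outlinks` is a hypothetical helper written for this example, not Nutch code.

```python
from typing import List


def limit_outlinks(outlinks: List[str], max_outlinks_per_page: int) -> List[str]:
    """Apply a per-page outlink cap (assumed semantics, see lead-in)."""
    if max_outlinks_per_page < 0:
        # Assumption: a negative limit means "process all outlinks".
        return list(outlinks)
    # A limit of 0 keeps no outlinks, so only the seed pages are crawled.
    return outlinks[:max_outlinks_per_page]
```

Under that assumption, a value of 0 means no outlinks are followed from any page.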
___
?
--
-MilleBii-
___
crawl/crawldb/current/part-0/
data .data.crc index .index.crc
How do I convert the output to a human-readable format?
Thanks
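For reference, Nutch ships a `readdb` command whose `-dump` option writes the crawldb out as plain text. The sketch below only builds, and optionally runs, that command line; the `bin/nutch` location and the output directory name `crawldb-dump` are assumptions for illustration, and the function names are hypothetical.

```python
import subprocess
from typing import List


def build_readdb_dump_cmd(crawldb: str, out_dir: str,
                          nutch: str = "bin/nutch") -> List[str]:
    """Build the argument list for: bin/nutch readdb <crawldb> -dump <out_dir>."""
    return [nutch, "readdb", crawldb, "-dump", out_dir]


def dump_crawldb(crawldb: str, out_dir: str, nutch: str = "bin/nutch") -> None:
    """Run the dump; raises CalledProcessError if the command fails."""
    subprocess.run(build_readdb_dump_cmd(crawldb, out_dir, nutch), check=True)
```

Calling `dump_crawldb("crawl/crawldb", "crawldb-dump")` from the Nutch install directory should leave text files in `crawldb-dump`, assuming the script path is right for your setup.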
___
:
Also, I would like to know how to extract flat text files of the crawl data.
___
in context:
http://old.nabble.com/Encoding-the-content-got-from-Fetcher-tp26528468p26528468.html
Sent from the Nutch - User mailing list archive at Nabble.com.
___
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
___
Hi Andrzej,
Yeah, I just noticed that this stack trace is for DEBUG purposes only; I found
it in the Hadoop source. Thanks for the info.
Regards,
Mischa
On 25 Nov 2009, at 13:11, Andrzej Bialecki wrote:
Mischa Tuffield wrote:
Hello Again, Following my previous post below, I have noticed
___
14 matches