Re: How do I block/ban a specific domain name or a tld?

2009-11-24 Thread Subhojit Roy
-- View this message in context: http://old.nabble.com/How-do-I-block-ban-a-specific-domain-name-or-a-tld--tp26289091p26306461.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Subhojit Roy Profound Technologies (Search Solutions based on Open Source) email: s

Re: dedup dont delete duplicates !

2009-11-24 Thread Subhojit Roy
Information Retrieval, Semantic Web; Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Subhojit Roy Profound Technologies (Search Solutions based on Open Source

Re: How do I block/ban a specific domain name or a tld?

2009-11-24 Thread Subhojit Roy
Sorry... The regular expression should be: -^http://([a-z0-9]*\.)*who.int/ I had made an error in the previous email. I wonder whether Gmail is playing with the characters in the sent emails... -sroy On Wed, Nov 25, 2009 at 12:00 PM, Subhojit Roy mails...@gmail.com wrote: Try: 1
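A rough way to sanity-check the exclusion pattern from the message above before putting it into a URL filter file is sketched below. It is only an illustration: it uses Python's `re` module rather than Nutch itself, the helper name `is_blocked` and the sample URLs are hypothetical, and it assumes the leading `-` is the filter file's "exclude" marker and not part of the regular expression.

```python
import re

# Pattern from the message, without the leading "-".
# Assumption: the "-" prefix only marks the rule as an exclusion
# in the filter file and is not part of the regex itself.
BLOCK_WHO_INT = re.compile(r"^http://([a-z0-9]*\.)*who.int/")


def is_blocked(url: str) -> bool:
    """Return True if the URL matches the exclusion pattern (hypothetical helper)."""
    return BLOCK_WHO_INT.search(url) is not None


if __name__ == "__main__":
    # Illustrative URLs only; none of them are taken from the thread.
    for url in (
        "http://who.int/",
        "http://www.who.int/en/",
        "http://apps.who.int/page",
        "http://example.com/who.int/",
        "https://www.who.int/",
    ):
        print(url, is_blocked(url))
```

Two details are worth noting. The pattern only covers `http://`, so `https://` URLs on the same domain are not matched. The dot in `who.int` is unescaped, so it matches any single character; writing `who\.int` would be stricter.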

Re: substitute unknown parts of the url

2009-11-19 Thread Subhojit Roy
]*\.)*website.com/unknown-folder/known-folder/ The first folder can vary, whereas the host name and the second folder are known. How can I substitute unknown parts (folders) of the URL? Any help appreciated! Regards, mailusenet -- Subhojit Roy Profound Technologies (Search Solutions based on Open Source

Re: MergeSegments - java.lang.OutOfMemoryError

2009-11-17 Thread Subhojit Roy
, in my older version of Nutch, the same merge works with the default Java max heap setting of only 1G. Does anybody have the same experience? Is there any workaround for this? Thanks Kevin Chen -- Subhojit Roy Profound Technologies (Search Solutions based on Open Source) email: s...@profound.in

Re: How to fetch URLs with special charaters '?' '='

2009-11-16 Thread Subhojit Roy
://old.nabble.com/file/p26197881/urllist.txt urllist.txt -- View this message in context: http://old.nabble.com/How-to-fetch-URLs-with-special-charaters-%27-%27---%27%3D%27-tp26197881p26197881.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Subhojit Roy Profound

Re: crawling / data aggregation - is nutch the right tool?

2009-11-16 Thread Subhojit Roy
right for the job. It's the part that I have a hard time evaluating with Nutch. Some of what I have read from the mailing list suggests it's still not all that easy to do extraction with Nutch, am I wrong? Mark -- Subhojit Roy Profound Technologies (Search Solutions based on Open Source

Re: Nutch near future - strategic directions

2009-11-15 Thread Subhojit Roy
, Semantic Web; Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Subhojit Roy Profound Technologies (Search Solutions based on Open Source) email: s...@profound.in http://www.profound.in

Re: PRUNE : need some help on pruning syntax.

2009-11-15 Thread Subhojit Roy
://old.nabble.com/PRUNE-%3A-need-some-help-on-pruning-syntax.-tp26268447p26268447.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Subhojit Roy Profound Technologies (Search Solutions based on Open Source) email: s...@profound.in http://www.profound.in

Re: Nutch does not crawl pages starting with ~

2009-11-15 Thread Subhojit Roy
? Thanks, Regards, Varish Mulwad -- Subhojit Roy Profound Technologies (Search Solutions based on Open Source) email: s...@profound.in http://www.profound.in