Problem in crawling windows shared folder using Nutch's SMB protocol plugin

2009-12-21 Thread Rupesh Mankar
Hi,

I downloaded SMB protocol plugin from following location:
http://issues.apache.org/jira/browse/NUTCH-427

I configured it with Nutch (as mentioned in read.txt). But when I try to 
crawl, nothing gets crawled and I get the following exception in the Hadoop log.

2009-12-21 16:25:04,728 FATAL smb.SMB - Could not read content of protocol: 
smb://10.88.45.140/shared_folder/
jcifs.smb.SmbException:
jcifs.util.transport.TransportException
java.net.SocketException: Invalid argument or cannot assign requested address
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:525)
  at java.net.Socket.connect(Socket.java:475)
  at java.net.Socket.<init>(Socket.java:372)
  at java.net.Socket.<init>(Socket.java:246)
  at jcifs.smb.SmbTransport.negotiate(SmbTransport.java:244)
  at jcifs.smb.SmbTransport.doConnect(SmbTransport.java:299)
  at jcifs.util.transport.Transport.run(Transport.java:240)
  at java.lang.Thread.run(Thread.java:619)

  at jcifs.util.transport.Transport.run(Transport.java:256)
  at java.lang.Thread.run(Thread.java:619)

  at jcifs.smb.SmbTransport.connect(SmbTransport.java:289)
  at jcifs.smb.SmbTree.treeConnect(SmbTree.java:139)
  at jcifs.smb.SmbFile.connect(SmbFile.java:798)
  at jcifs.smb.SmbFile.connect0(SmbFile.java:768)
  at jcifs.smb.SmbFile.exists(SmbFile.java:1275)
  at org.apache.nutch.protocol.smb.SMBResponse.<init>(SMBResponse.java:74)
  at org.apache.nutch.protocol.smb.SMB.getProtocolOutput(SMB.java:62)
  at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
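
One way to narrow down whether this is a problem in the plugin or plain
SMB/network connectivity is to try the same share from a tiny standalone
jCIFS program, outside of Nutch (a rough sketch; the credentials below are
placeholders, and it assumes the jcifs jar bundled with the plugin is on the
classpath):

import jcifs.smb.NtlmPasswordAuthentication;
import jcifs.smb.SmbFile;

// Standalone jCIFS check, independent of Nutch: can we reach and list the share at all?
public class SmbCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials; use whatever account the crawler is supposed to use.
        NtlmPasswordAuthentication auth =
                new NtlmPasswordAuthentication("WORKGROUP", "user", "password");

        // Same URL that Nutch tries to fetch; note the trailing slash for a directory.
        SmbFile dir = new SmbFile("smb://10.88.45.140/shared_folder/", auth);

        System.out.println("exists: " + dir.exists());
        for (SmbFile f : dir.listFiles()) {
            System.out.println(f.getName());
        }
    }
}

If this small program fails with the same SocketException, the problem is
below Nutch (network, firewall, or the JVM's address selection) rather than
in the plugin configuration; on some JVMs this particular error has been
tied to IPv6 address selection, so running with
-Djava.net.preferIPv4Stack=true may be worth a try.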


Has anyone used the SMB protocol plugin before?

Thanks,
Rupesh




Large files - nutch failing to fetch

2009-12-21 Thread Sundara Kaku
Hi,

   Nutch is throwing errors while fetching large files (files larger than
100 MB). I have a website whose pages point to large files (file sizes vary
from 10 MB to 500 MB), and there are several large files on that site. I want
to fetch all of them with Nutch, but Nutch throws an OutOfMemoryError for the
large ones (I have set the heap size to 2500m). With a 2500m heap, files of
up to about 250 MB are retrieved, but anything larger fails,
and Nutch takes a long time after printing
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0

If there are three files of 100 MB each at the same depth, it also fails
to fetch them (with the same 2500m heap).

I have set http.content.limit to -1.

Is there a way to fetch several large files with Nutch?

I am using Nutch only as a web crawler; I am not using indexing. I want to
download web resources and scan them for viruses with ClamAV.




-- 
Thanks & Regards,
Sundara Kaku


Re: Large files - nutch failing to fetch

2009-12-21 Thread Andrzej Bialecki

On 2009-12-21 17:15, Sundara Kaku wrote:

Hi,

Nutch is throwing errors while fetching large files (files larger than
100 MB). I have a website whose pages point to large files (file sizes vary
from 10 MB to 500 MB), and there are several large files on that site. I want
to fetch all of them with Nutch, but Nutch throws an OutOfMemoryError for the
large ones (I have set the heap size to 2500m). With a 2500m heap, files of
up to about 250 MB are retrieved, but anything larger fails,
and Nutch takes a long time after printing
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0

If there are three files of 100 MB each at the same depth, it also fails
to fetch them (with the same 2500m heap).

I have set http.content.limit to -1.

Is there a way to fetch several large files with Nutch?

I am using Nutch only as a web crawler; I am not using indexing. I want to
download web resources and scan them for viruses with ClamAV.


Nutch is probably not the right tool for you here - you should use wget 
instead. Nutch was designed to fetch many pages of limited size - as a 
temporary step it caches the downloaded content in memory, before 
flushing it out to disk.


(I had to solve this limitation once for a specific case - the solution 
was to implement a variant of the protocol and Content that stored data 
into separate HDFS files without buffering in memory - but it was a 
brittle hack that only worked for that particular scenario).
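
For the download-and-scan use case above, the key property is streaming: the
file should go straight from the network to disk (where ClamAV can pick it
up) without ever being held in memory as a whole. A minimal sketch of that
pattern in plain Java, independent of Nutch (the URL and target path are
placeholders):

import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

// Streams an HTTP resource to disk in fixed-size chunks, so memory use
// stays constant no matter how large the file is.
public class StreamingDownload {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/big-file.iso");          // placeholder URL
        InputStream in = new BufferedInputStream(url.openStream());
        OutputStream out = new FileOutputStream("/tmp/big-file.iso");  // placeholder path
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            out.close();
            in.close();
        }
    }
}

This is essentially what wget does for you; Nutch, by contrast, holds each
fetched document in memory as a whole, which is why the heap has to be large
enough for the biggest file plus everything else the fetcher threads are doing.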


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



domain crawl using bin/nutch

2009-12-21 Thread Ted Yu
Hi,
I found the db.ignore.external.links property.
How do I limit the crawl by excluding some links within the same domain as
well?

Thanks


unicode 2029 paragraph separator

2009-12-21 Thread reinhard schwab
http://www.fileformat.info/info/unicode/char/2029/index.htm

I have found that this Unicode character breaks JSON deserialization when
using Solr and AJAX. It comes from text extracted from a PDF.
Where is the right place to filter out or replace this character: the PDF
parser/text extractor, or the Solr indexer?
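
Wherever it ends up being hooked in, the replacement itself is small; here is
a minimal Java sketch that swaps the paragraph separator (and its sibling
U+2028, the line separator, which tends to break JavaScript/JSON consumers
the same way) for an ordinary space before the text is passed on - the class
name and the place you call it from are up to you:

// Replaces U+2029 (paragraph separator) and U+2028 (line separator)
// with plain spaces so the resulting JSON no longer trips up
// JavaScript-based deserializers.
public final class ParagraphSeparatorCleaner {
    private ParagraphSeparatorCleaner() {}

    public static String clean(String text) {
        if (text == null) {
            return null;
        }
        return text.replace('\u2029', ' ').replace('\u2028', ' ');
    }

    public static void main(String[] args) {
        String fromPdf = "first paragraph\u2029second paragraph";
        System.out.println(clean(fromPdf)); // prints: first paragraph second paragraph
    }
}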
regards
reinhard


Re: domain crawl using bin/nutch

2009-12-21 Thread Jesse Hires
You should be able to do this using one of the variations of the *-urlfilter.txt
files. Instead of putting + in front of a regex, you can tell it to
exclude URLs that match the regex by prefixing it with - instead.

Just a guess, I haven't actually tried it, but you could probably use
something like the following. (I'm sure you would have to fiddle with it to
get it to work correctly).

+^http://([a-z0-9]*\.)*mydomain\.com/
-.*/(pagename1\.php|pagename2\.php)
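
A quick way to sanity-check patterns like these before dropping them into
regex-urlfilter.txt is to run a few sample URLs through java.util.regex
directly (a standalone sketch of the include/exclude idea, not Nutch's
actual filter code; the URLs are made up):

import java.util.regex.Pattern;

// Rough stand-in for the two rules above: a URL is accepted if it matches
// the include pattern and does not match the exclude pattern.
public class UrlFilterCheck {
    private static final Pattern INCLUDE =
            Pattern.compile("^http://([a-z0-9]*\\.)*mydomain\\.com/");
    private static final Pattern EXCLUDE =
            Pattern.compile(".*/(pagename1\\.php|pagename2\\.php)");

    static boolean accepted(String url) {
        return INCLUDE.matcher(url).find() && !EXCLUDE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepted("http://www.mydomain.com/index.html"));    // true
        System.out.println(accepted("http://www.mydomain.com/pagename1.php")); // false
        System.out.println(accepted("http://other.com/index.html"));           // false
    }
}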



Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com



On Mon, Dec 21, 2009 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Hi,
 I found the db.ignore.external.links property.
 How do I limit the crawl by excluding some links within the same domain as
 well?

 Thanks



RE: domain crawl using bin/nutch

2009-12-21 Thread Jun Mao
But how can we tell Nutch to crawl this way every time? I do not want to
edit *-urlfilter.txt for every crawl.

Thanks,
 
Jun

-Original Message-
From: Jesse Hires [mailto:jhi...@gmail.com] 
Sent: 2009-12-22 9:23
To: nutch-user@lucene.apache.org
Subject: Re: domain crawl using bin/nutch

You should be able to do this using one of the variations of the *-urlfilter.txt
files. Instead of putting + in front of a regex, you can tell it to
exclude URLs that match the regex by prefixing it with - instead.

Just a guess, I haven't actually tried it, but you could probably use
something like the following. (I'm sure you would have to fiddle with it to
get it to work correctly).

+^http://([a-z0-9]*\.)*mydomain\.com/
-.*/(pagename1\.php|pagename2\.php)



Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com



On Mon, Dec 21, 2009 at 2:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Hi,
 I found the db.ignore.external.links property.
 How do I limit the crawl by excluding some links within the same domain as
 well?

 Thanks