RE: 'readdb' and 'readseg' commands shows wrong last-modified-date

2010-02-02 Thread Rupesh Mankar

Thanks a lot, Reinhard.
It worked perfectly, and it now shows the correct last-modified date.


-----Original Message-----
From: reinhard schwab [mailto:reinhard.sch...@aon.at]
Sent: Monday, February 01, 2010 5:13 PM
To: nutch-user@lucene.apache.org
Subject: Re: 'readdb' and 'readseg' commands shows wrong last-modified-date

Paul Tomblin has posted a diff for handling last-modified.
I don't know whether an issue has been opened in JIRA.

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15056.html

Rupesh Mankar wrote:
> Hi,
>
> I am using Nutch 1.0. I have successfully crawled our intranet site. But when
> I read the properties of the crawled URLs using the 'readdb' and 'readseg'
> commands, the last modified date is shown as 'Modified time: Thu Jan 01
> 05:30:00 IST 1970' for every URL, which is incorrect.
>
> Why is Nutch setting the wrong 'last modified date'? Is there any way to fix
> this problem?
>
> Thanks,
> Rupesh
>
> DISCLAIMER
> ==
> This e-mail may contain privileged and confidential information which is the
> property of Persistent Systems Ltd. It is intended only for the use of the
> individual or entity to which it is addressed. If you are not the intended
> recipient, you are not authorized to read, retain, copy, print, distribute or
> use this message. If you have received this communication in error, please
> notify the sender and delete all copies of this message. Persistent Systems
> Ltd. does not accept any liability for virus infected mails.






'readdb' and 'readseg' commands shows wrong last-modified-date

2010-02-01 Thread Rupesh Mankar
Hi,

I am using Nutch 1.0. I have successfully crawled our intranet site. But when I
read the properties of the crawled URLs using the 'readdb' and 'readseg'
commands, the last modified date is shown as 'Modified time: Thu Jan 01
05:30:00 IST 1970' for every URL, which is incorrect.

Why is Nutch setting the wrong 'last modified date'? Is there any way to fix
this problem?

Thanks,
Rupesh
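
A note on the value itself: 'Thu Jan 01 05:30:00 IST 1970' is what a timestamp
of 0 milliseconds (the Unix epoch) looks like on a machine set to Indian
Standard Time (UTC+05:30). That suggests the modified time was never filled in,
rather than being set to a wrong date. A minimal sketch to confirm the
arithmetic; the class and method names here are illustrative only and are not
part of Nutch:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class EpochCheck {
    // Converts an epoch-millisecond timestamp to the local wall-clock time
    // in the given time zone.
    static LocalDateTime wallClock(long epochMillis, String zoneId) {
        return Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneId.of(zoneId))
                .toLocalDateTime();
    }

    public static void main(String[] args) {
        // A timestamp of 0 viewed in Indian Standard Time.
        System.out.println(wallClock(0L, "Asia/Kolkata")); // prints 1970-01-01T05:30
    }
}
```

If the value printed by 'readdb' matches this wall-clock time, the field most
likely holds 0, that is, "not set".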



Problem in crawling windows shared folder using Nutch's SMB protocol plugin

2009-12-21 Thread Rupesh Mankar
Hi,

I downloaded the SMB protocol plugin from the following location:
http://issues.apache.org/jira/browse/NUTCH-427

I configured it with Nutch as described in its read.txt. But when I try to
crawl, nothing gets crawled, and I get the following exception in the Hadoop log:

2009-12-21 16:25:04,728 FATAL smb.SMB - Could not read content of protocol: smb://10.88.45.140/shared_folder/
jcifs.smb.SmbException:
jcifs.util.transport.TransportException
java.net.SocketException: Invalid argument or cannot assign requested address
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:525)
  at java.net.Socket.connect(Socket.java:475)
  at java.net.Socket.<init>(Socket.java:372)
  at java.net.Socket.<init>(Socket.java:246)
  at jcifs.smb.SmbTransport.negotiate(SmbTransport.java:244)
  at jcifs.smb.SmbTransport.doConnect(SmbTransport.java:299)
  at jcifs.util.transport.Transport.run(Transport.java:240)
  at java.lang.Thread.run(Thread.java:619)

  at jcifs.util.transport.Transport.run(Transport.java:256)
  at java.lang.Thread.run(Thread.java:619)

  at jcifs.smb.SmbTransport.connect(SmbTransport.java:289)
  at jcifs.smb.SmbTree.treeConnect(SmbTree.java:139)
  at jcifs.smb.SmbFile.connect(SmbFile.java:798)
  at jcifs.smb.SmbFile.connect0(SmbFile.java:768)
  at jcifs.smb.SmbFile.exists(SmbFile.java:1275)
  at org.apache.nutch.protocol.smb.SMBResponse.<init>(SMBResponse.java:74)
  at org.apache.nutch.protocol.smb.SMB.getProtocolOutput(SMB.java:62)
  at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)


Has anyone used the SMB protocol plugin before?

Thanks,
Rupesh
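
The innermost frames of the trace are in java.net.Socket.connect, so the
failure happens while opening the TCP connection, before any SMB traffic is
exchanged. One way to narrow this down might be to try the same connection
outside Nutch. The sketch below is only a diagnostic aid under that assumption;
TcpProbe and canConnect are made-up names, and ports 445 and 139 are the
standard SMB and NetBIOS session ports rather than values taken from the
plugin's configuration:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class TcpProbe {
    // Returns true if a plain TCP connection to host:port can be opened
    // within the timeout; prints the exception and returns false otherwise.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            System.err.println(host + ":" + port + " -> " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Host from the log above, used only as a default for illustration.
        String host = args.length > 0 ? args[0] : "10.88.45.140";
        for (int port : new int[] {445, 139}) {
            System.out.println(port + " reachable: " + canConnect(host, port, 3000));
        }
    }
}
```

If this probe fails with the same SocketException when run as the same user on
the crawl machine, the problem lies in the JVM or network setup rather than in
the plugin.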




Is there a way to set a plugin execution order in Nutch?

2009-12-15 Thread Rupesh Mankar
Hi,

Suppose I have 3 plugins: A, B and C. I want to execute plugin A first, then
plugin B, and finally plugin C. I listed the plugins in nutch-site.xml under
the 'plugin.includes' property as follows:

<name>plugin.includes</name>
<value>A|B|C|protocol-http|urlfilter-regex|parse-(html|pdf|msword|text|xml|msexcel|mspowerpoint)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

But the plugin execution order is still not fixed. How does Nutch determine this
order? Can I set the execution order externally?

I am using Nutch 1.0.

Thanks,
Rupesh




Optimization in crawling and indexing

2009-12-14 Thread Rupesh Mankar
I want to see whether any bandwidth optimization is possible while using Nutch.

a) Crawling: After the initial crawl, can Nutch fetch ONLY the updated
documents? With 'db.fetch.interval.default' set to 6 hours, a re-crawl every
6 hours crawls and fetches all documents again. It should fetch only the
updated documents.

Does Nutch internally use a HEAD request to check whether a document (HTML,
PDF, DOC) has changed?

b) Indexing: Can I find out, based on a timestamp, how many documents have
changed since the last re-crawl?


Thanks,
Rupesh
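
To make the HEAD question concrete, this is roughly what such a check looks
like in plain Java: a HEAD request carrying an If-Modified-Since header, which
a cooperating server answers with 304 Not Modified when the document is
unchanged. It is a sketch of the HTTP mechanism only and says nothing about
what Nutch does internally; ConditionalFetch and changedSince are hypothetical
names:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class ConditionalFetch {
    // Sends a HEAD request with If-Modified-Since set to lastFetchMillis.
    // Returns false only when the server answers 304 Not Modified.
    // Note: HttpURLConnection omits the header when the value is 0.
    static boolean changedSince(String url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setIfModifiedSince(lastFetchMillis);
            return conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ConditionalFetch <url> <last-fetch-epoch-millis>
        System.out.println(changedSince(args[0], Long.parseLong(args[1])));
    }
}
```

The saving depends on the server: one that ignores If-Modified-Since always
answers 200, and the document then has to be fetched in full anyway.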



RE: How to successfully crawl and index office 2007 documents in Nutch 1.0

2009-12-07 Thread Rupesh Mankar
Is there a ready-made plug-in for Office 2007 documents available, or do I have
to write one myself?


-----Original Message-----
From: yangfeng [mailto:yea...@gmail.com]
Sent: Monday, December 07, 2009 4:35 PM
To: nutch-user@lucene.apache.org
Subject: Re: How to successfully crawl and index office 2007 documents in Nutch 
1.0

docx files need to be parsed. A plugin can be used to parse them. You can get
some guidance from the parse-html plugin and similar plugins.

2009/12/4 Rupesh Mankar rupesh_man...@persistent.co.in

> Hi,
>
> I am new to Nutch. I want to crawl and search Office 2007 documents (.docx,
> .pptx etc.) with Nutch. But when I try to crawl, the crawler throws the
> following error:
>
> fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
> Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx:
> org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://10.88.45.140:8081/tutorial/Office-2007-document.docx
>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
>        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
> When I add the zip plugin to plugin.includes in nutch-site.xml, crawling
> succeeds, but the document contents cannot be found by searching.
>
> How can we successfully crawl and search the contents of Office 2007 documents?
>
> Thanks,
> Rupesh
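
For background on the error above: a .docx file is a ZIP container, which is
why the content type is reported as application/zip, and the body text sits in
the entry word/document.xml inside <w:t> elements. A parser for the format
therefore has to unzip the file and read that XML. The following is a rough
sketch of that extraction step in plain Java. It is not a Nutch parse plugin
(the plugin wiring is not covered in this thread), and DocxText and extract are
made-up names:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class DocxText {
    // Namespace of WordprocessingML elements such as <w:t>.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the concatenated text runs of word/document.xml.
    // Paragraph breaks and other formatting are ignored in this sketch.
    static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                // Read the entry into memory first so the XML parser does not
                // close the underlying zip stream.
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                NodeList runs = xml.getElementsByTagNameNS(W_NS, "t");
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < runs.getLength(); i++) {
                    text.append(runs.item(i).getTextContent());
                }
                return text.toString();
            }
        }
        throw new IOException("word/document.xml not found; not a .docx container?");
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxText <path-to-docx>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extract(in));
        }
    }
}
```

Enabling the zip plugin alone only unpacks the container; without a step like
the one above, the XML inside is never turned into indexable text.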


