RE: 'readdb' and 'readseg' commands show wrong last-modified-date
Thanks a lot, Reinhard. It worked perfectly and the correct last-modified date is now shown.

-----Original Message-----
From: reinhard schwab [mailto:reinhard.sch...@aon.at]
Sent: Monday, February 01, 2010 5:13 PM
To: nutch-user@lucene.apache.org
Subject: Re: 'readdb' and 'readseg' commands show wrong last-modified-date

Paul Tomblin has posted a diff for handling last-modified. I don't know whether an issue has been opened in JIRA.
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15056.html

Rupesh Mankar schrieb:
Hi, I am using Nutch 1.0. I have successfully crawled our intranet site. But when I read the properties of the crawled URLs using the 'readdb' and 'readseg' commands, the last-modified date is shown as 'Modified time: Thu Jan 01 05:30:00 IST 1970' for every URL, which is incorrect. Why is Nutch setting the wrong last-modified date? Is there any way to fix this problem? Thanks, Rupesh

DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
'readdb' and 'readseg' commands show wrong last-modified-date
Hi, I am using Nutch 1.0. I have successfully crawled our intranet site. But when I read the properties of the crawled URLs using the 'readdb' and 'readseg' commands, the last-modified date is shown as 'Modified time: Thu Jan 01 05:30:00 IST 1970' for every URL, which is incorrect. Why is Nutch setting the wrong last-modified date? Is there any way to fix this problem? Thanks, Rupesh
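As context for the reported date: a last-modified field that was never populated is stored as Unix time 0, and formatting 0 in the IST zone (UTC+05:30) produces exactly the timestamp in the message. A small sketch (the epoch arithmetic is standard Python; that Nutch leaves the field at 0 is inferred from the observed output, not confirmed from the source):

```python
from datetime import datetime, timedelta, timezone

# IST is UTC+05:30; formatting Unix time 0 in that zone reproduces
# the "wrong" date shown by readdb/readseg for an unset field.
ist = timezone(timedelta(hours=5, minutes=30), name="IST")
unset = datetime.fromtimestamp(0, tz=ist)
print(unset.strftime("%a %b %d %H:%M:%S %Z %Y"))  # Thu Jan 01 05:30:00 IST 1970
```

In other words, the value is not "wrong" so much as never set; the fix in the linked diff is about actually recording the server's Last-Modified header.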
Problem crawling a Windows shared folder using Nutch's SMB protocol plugin
Hi, I downloaded the SMB protocol plugin from the following location: http://issues.apache.org/jira/browse/NUTCH-427 and configured it with Nutch as described in read.txt. But when I try to crawl, nothing gets crawled and I get the following exception in the hadoop log:

2009-12-21 16:25:04,728 FATAL smb.SMB - Could not read content of protocol: smb://10.88.45.140/shared_folder/
jcifs.smb.SmbException: jcifs.util.transport.TransportException
java.net.SocketException: Invalid argument or cannot assign requested address
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:525)
        at java.net.Socket.connect(Socket.java:475)
        at java.net.Socket.<init>(Socket.java:372)
        at java.net.Socket.<init>(Socket.java:246)
        at jcifs.smb.SmbTransport.negotiate(SmbTransport.java:244)
        at jcifs.smb.SmbTransport.doConnect(SmbTransport.java:299)
        at jcifs.util.transport.Transport.run(Transport.java:240)
        at java.lang.Thread.run(Thread.java:619)
        at jcifs.util.transport.Transport.run(Transport.java:256)
        at java.lang.Thread.run(Thread.java:619)
        at jcifs.smb.SmbTransport.connect(SmbTransport.java:289)
        at jcifs.smb.SmbTree.treeConnect(SmbTree.java:139)
        at jcifs.smb.SmbFile.connect(SmbFile.java:798)
        at jcifs.smb.SmbFile.connect0(SmbFile.java:768)
        at jcifs.smb.SmbFile.exists(SmbFile.java:1275)
        at org.apache.nutch.protocol.smb.SMBResponse.<init>(SMBResponse.java:74)
        at org.apache.nutch.protocol.smb.SMB.getProtocolOutput(SMB.java:62)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)

Has anyone used the SMB protocol plugin before? Thanks, Rupesh
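A SocketException of "Invalid argument or cannot assign requested address" at connect time usually indicates a basic TCP reachability or interface problem rather than a bug in the plugin itself. One way to narrow it down is to check whether the SMB port on the share host answers at all, before debugging jcifs. A hedged diagnostic sketch (`can_reach` is a hypothetical helper, not part of the plugin):

```python
import socket

def can_reach(host: str, port: int = 445, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: test the share host from the error message first.
# can_reach("10.88.45.140")  # False would point at network/firewall, not jcifs
```

If this returns False from the crawl machine, the problem lies in networking (routing, firewall, or NetBIOS vs. direct-SMB port choice) rather than in the Nutch plugin configuration.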
Is there a way to set a plugin execution order in Nutch?
Hi, Suppose I have 3 plugins: A, B, and C. I want to execute plugin A first, then plugin B, and finally plugin C. I specified the plugin entries in nutch-site.xml under the 'plugin.includes' property as follows:

<property>
  <name>plugin.includes</name>
  <value>A|B|C|protocol-http|urlfilter-regex|parse-(html|pdf|msword|text|xml|msexcel|mspowerpoint)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

But the plugin execution order is still not fixed. How does Nutch determine this order? Can I set the execution order externally? I am using Nutch 1.0. Thanks, Rupesh
Optimization in crawling and indexing
I want to see if there is any possible bandwidth optimization while using Nutch.

a) Crawling: after the initial crawl, can I fetch ONLY updated documents? A re-crawl every 6 hours ('db.fetch.interval.default' is set to 6 hours) crawls and fetches all documents again; it should bring back only the documents that have changed. Does Nutch internally use a HEAD request to check whether a document (HTML, PDF, Word) has changed or not?

b) Indexing: can I find out, based on a timestamp, how many documents have changed since the last re-crawl?

Thanks, Rupesh
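For reference, the interval mentioned above lives in nutch-site.xml; the property value is in seconds, so 6 hours is 21600. A sketch of the fragment (using the property name quoted in the message):

```xml
<property>
  <name>db.fetch.interval.default</name>
  <value>21600</value><!-- 6 hours, expressed in seconds -->
</property>
```

Lowering this interval controls how often pages become due for re-fetch, but it does not by itself make the fetch conditional on the document having changed.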
RE: How to successfully crawl and index Office 2007 documents in Nutch 1.0
Is there any ready-made plug-in for Office 2007 documents available, or do I have to write one on my own?

-----Original Message-----
From: yangfeng [mailto:yea...@gmail.com]
Sent: Monday, December 07, 2009 4:35 PM
To: nutch-user@lucene.apache.org
Subject: Re: How to successfully crawl and index Office 2007 documents in Nutch 1.0

A plugin can be used to parse .docx files; the parse-html plugin and others are a useful starting point.

2009/12/4 Rupesh Mankar rupesh_man...@persistent.co.in

Hi, I am new to Nutch. I want to crawl and search Office 2007 documents (.docx, .pptx, etc.) with Nutch. But when I try to crawl, the crawler throws the following error:

fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx
Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-document.docx: org.apache.nutch.parse.ParseException: parser not found for contentType=application/zip url=http://10.88.45.140:8081/tutorial/Office-2007-document.docx
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

When I add the zip plugin to plugin.includes in nutch-site.xml, the crawl succeeds but searches return nothing. How can we successfully crawl and search the contents of Office 2007 documents?

Thanks, Rupesh
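The `contentType=application/zip` in the error above is expected: an Office 2007 .docx is an OOXML container, i.e. an ordinary ZIP archive holding XML parts, so a byte-level content-type sniffer sees ZIP magic bytes rather than a Word-specific type. A small illustration (the in-memory archive is a stand-in for a real document):

```python
import io
import zipfile

# Build a minimal OOXML-style container in memory. A real .docx is a ZIP
# archive whose entries include [Content_Types].xml, which is why naive
# content-type detection reports application/zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")

print(zipfile.is_zipfile(buf))  # True: the bytes start with ZIP magic
```

This is also why merely enabling a generic zip plugin makes the crawl "succeed" without producing searchable text: the archive is unpacked, but nothing interprets the Word XML inside it.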