Re: Why did my crawl fail?
Unfortunately I blew away those particular logs when I fetched the svn trunk. I just tried it again (well, I started it again at noon and it just finished) and this time it worked fine, so it seems kind of heisenbug-like. Maybe it has something to do with which page types it can't handle?

On Mon, Jul 27, 2009 at 11:27 AM, xiao yang wrote: > Hi, Paul > > Can you post the error messages in the log file > (file:/Users/ptomblin/nutch-1.0/logs)? > > On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin wrote: > > Actually, I got that error the first time I used it, and then again when > I > > blew away the downloaded nutch and grabbed the latest trunk from > Subversion. > > > > On Mon, Jul 27, 2009 at 1:11 AM, xiao yang > wrote: > > > >> You must have crawled for several times, and some of them failed > >> before the parse phase. So the parse data was not generated. > >> You'd better delete the whole directory > >> file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you > >> will know the exact reason why it failed in the parse phase from the > >> output information. > >> > >> Xiao > >> > >> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin > wrote: > >> > I installed nutch 1.0 on my laptop last night and set it running to > crawl > >> my > >> > blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10 > >> > it was still running strong when I went to bed several hours later, > and > >> this > >> > morning I woke up to this: > >> > > >> > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 > >> > -activeThreads=0 > >> > Fetcher: done > >> > CrawlDb update: starting > >> > CrawlDb update: db: crawl.blog/crawldb > >> > CrawlDb update: segments: [crawl.blog/segments/20090724010303] > >> > CrawlDb update: additions allowed: true > >> > CrawlDb update: URL normalizing: true > >> > CrawlDb update: URL filtering: true > >> > CrawlDb update: Merging segment data into db. 
> >> > CrawlDb update: done > >> > LinkDb: starting > >> > LinkDb: linkdb: crawl.blog/linkdb > >> > LinkDb: URL normalize: true > >> > LinkDb: URL filter: true > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250 > >> > LinkDb: adding segment: > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303 > >> > Exception in thread "main" > >> org.apache.hadoop.mapred.InvalidInputException: > >> > Input path does not exist: > >> > > >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data > >> > at > >> > > >> > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179) > >> > at > >> > > >> > org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39) > >> > at > >> > > >> > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190) > >> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797) > >> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142) > >> > at 
org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170) > >> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147) > >> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:129) > >> > > >> > > >> > -- > >> > http://www.linkedin.com/in/paultomblin > >> > > >> > > > > > > > > -- > > http://www.linkedin.com/in/paultomblin > > > -- http://www.linkedin.com/in/paultomblin
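[Editor's note] The InvalidInputException above boils down to one segment (20090723154530) lacking a parse_data directory, so LinkDb dies when it tries to read it. A quick way to spot incomplete segments before running LinkDb — a sketch in Python, using throwaway demo paths; the helper name is invented here, but the segments/<timestamp>/parse_data layout matches the log output:

```python
# Sketch: find crawl segments that are missing parse_data, the condition
# behind the InvalidInputException above. The demo layout below is made up;
# point incomplete_segments() at a real crawl dir such as crawl.blog.
import os
import tempfile

def incomplete_segments(crawl_dir):
    """Return segment names under crawl_dir/segments that lack parse_data."""
    seg_root = os.path.join(crawl_dir, "segments")
    return sorted(
        name for name in os.listdir(seg_root)
        if not os.path.isdir(os.path.join(seg_root, name, "parse_data"))
    )

# Demo with a throwaway layout: one failed segment, one healthy one.
crawl = tempfile.mkdtemp()
os.makedirs(os.path.join(crawl, "segments", "20090723154530"))           # parse failed
os.makedirs(os.path.join(crawl, "segments", "20090724010303", "parse_data"))
print(incomplete_segments(crawl))  # ['20090723154530']
```

Segments flagged this way can be deleted (or re-parsed with bin/nutch parse) before the LinkDb step, instead of deleting the whole crawl directory as suggested earlier in the thread.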
Support needed
I need someone with substantial knowledge of Nutch, Java and Lucene who has customised the system before. In particular, this relates to image indexing and geo-positioning, if possible (either one is fine as well). The role involves providing support and advice on how to go about implementing such a system. This includes: 1. answering questions and providing guidance on implementation, and 2. reviewing code and suggesting how to improve it. Please let me know if you're interested. -- View this message in context: http://www.nabble.com/Support-needed-tp24688172p24688172.html Sent from the Nutch - User mailing list archive at Nabble.com.
Using Nutch (w/custom plugin) to crawl vs. custom Lucene app
Hi,

I've been familiarizing myself with Nutch, in preparation for putting together a proof-of-concept (POC) that we want. Basically, we have some files of a proprietary file type, and we want to be able to search on specific "fields" within these files. The files are physically stored on the local filesystem.

Thus far, I've gotten an initial Nutch instance working, and also a 2nd Nutch instance configured for crawling the local filesystem. These test instances just use the "out-of-box" Nutch and Nutch plugins, e.g., the PDF plugin, just to let me get familiar with the Nutch software.

Having done that, my original idea was to write some Nutch plugins that could be used with a Nutch crawl. However, we already have some previously-built Java apps that basically "crawl" the local filesystem (e.g., they do a recursive directory search) and find all of these files. So I'm wondering if it might make more sense (and, I think, be easier) to take one of those existing apps and just enhance it to build Lucene indexes, which could then be used by the Nutch web app (as a web-based search app)?

As I said, I'm really new to Nutch, and also to Lucene, but from what I've researched so far it *looks like* it'd be fairly easy to extend some of the existing apps to generate Lucene indexes, and I have some questions:

- If my custom Java app can be extended to "just" build indexes using Lucene, is that all it needs to do in order for these to work with the Nutch web app?
- Am I underestimating the effort needed to build the Lucene indexes that the Nutch web app could use?

Has anyone here been through a similar situation (Nutch plugin for a custom file type vs. a custom crawl app that builds Lucene indexes for the Nutch web app to use)? Any other thoughts on all of this would be greatly appreciated from the Nutch/Lucene experts here!

Thanks, Jim
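[Editor's note] The core idea in the question — index per-field text from proprietary files so each field is separately searchable — can be illustrated in miniature. This is emphatically *not* the Lucene API (a real implementation would use Lucene's IndexWriter and Field classes); every name below is invented for the sketch, which only shows the data shape the custom crawler app would hand off to Lucene:

```python
# Toy, Lucene-free illustration of per-field indexing: each document is a
# dict of field -> text, and the index maps (field, term) -> doc ids, so a
# query can be restricted to one field. Concept sketch only.
from collections import defaultdict

class FieldIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> {doc_id, ...}

    def add(self, doc_id, doc):
        """Index one document, given as {field_name: field_text}."""
        for field, text in doc.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Return sorted doc ids whose given field contains the term."""
        return sorted(self.postings.get((field, term.lower()), set()))

idx = FieldIndex()
idx.add(1, {"title": "Quarterly Report", "author": "Jim"})
idx.add(2, {"title": "Design Notes", "author": "Paul"})
print(idx.search("author", "jim"))   # [1]
print(idx.search("title", "notes"))  # [2]
```

The open question from the post still stands: whether indexes built outside Nutch carry the fields (segment, boost, etc.) that the Nutch web app expects, which is where the effort estimate could go wrong.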
Re: question
I believe it can. Check your configuration files, nutch-site.xml and nutch-default.xml; you will find something like:

plugin.includes
protocol-http|urlfilter-regex|parse-(text|html|swf|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

This is a regular expression naming the plugin directory names to include; any plugin not matching this expression is excluded. In any case you need to include at least the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS, please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.

Add "msword" to the parsers: change parse-(text|html|swf|pdf) to parse-(text|html|swf|pdf|msword). There is a plugin in the plugins folder, parse-msword, which parses MS Word documents. I have not tried it so far.

Jair Piedrahita Vargas schrieb: > Can Nutch search inside the content of an msword file? I've tried, but it > says "parser not found for contentType=application/msword" > What can I do to correct this Error? > > Thanks > > JAIR PIEDRAHITA VARGAS > Gerencia de Investigación y Nuevas Tecnologías > Teléfono: 404 Ext 41632 > Av. los Industriales Cra 48 # 26-85 piso 6B > BANCOLOMBIA S.A > > > > El contenido de este mensaje puede ser información privilegiada y > confidencial. Si usted no es el destinatario real del mismo, por favor > informe de ello a quien lo envía y destrúyalo en forma inmediata. Está > prohibida su retención, grabación, utilización o divulgación con cualquier > propósito. Este mensaje ha sido verificado con software antivirus; en > consecuencia, el remitente de éste no se hace responsable por la presencia en > él o en sus anexos de algún virus que pueda generar daños en los equipos o > programas del destinatario. 
> ** > This communication (including all attachments) may contain information that > is private, confidential and privileged. If you have received this > communication in error; please notify the sender immediately, delete this > communication from all data storage devices and destroy all hard copies. Any > use, dissemination, distribution, copying or disclosure of this message and > any attachments, in whole or in part, by anyone other than the intended > recipient(s) is strictly prohibited. This message has been checked with an > antivirus software; accordingly, the sender is not liable for the presence of > any virus in attachments that causes or may cause damage to the recipient's > equipment or software. > >
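[Editor's note] The plugin.includes change described above can be sanity-checked outside Nutch. This sketch applies the quoted regex to plugin directory names, before and after adding msword to the parse-(...) alternation; treating it as a full match is my reading of how Nutch applies the pattern, so verify against your version:

```python
# Show why "parser not found for contentType=application/msword" happens:
# the default plugin.includes regex (quoted in the reply above) does not
# admit the parse-msword plugin directory until the alternation is extended.
import re

default_pat = (r"protocol-http|urlfilter-regex|parse-(text|html|swf|pdf)|"
               r"index-(basic|anchor)|query-(basic|site|url)|"
               r"response-(json|xml)|summary-basic|scoring-opic|"
               r"urlnormalizer-(pass|regex|basic)")
patched_pat = default_pat.replace("parse-(text|html|swf|pdf)",
                                  "parse-(text|html|swf|pdf|msword)")

def included(pattern, plugin_dir):
    """True if the plugin directory name matches the include pattern."""
    return re.fullmatch(pattern, plugin_dir) is not None

print(included(default_pat, "parse-msword"))  # False -> parser not found
print(included(patched_pat, "parse-msword"))  # True
print(included(patched_pat, "parse-pdf"))     # True, existing parsers keep working
```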
question
Can Nutch search inside the content of an msword file? I've tried, but it says "parser not found for contentType=application/msword". What can I do to correct this error?

Thanks

JAIR PIEDRAHITA VARGAS
Gerencia de Investigación y Nuevas Tecnologías
Teléfono: 404 Ext 41632
Av. los Industriales Cra 48 # 26-85 piso 6B
BANCOLOMBIA S.A
Re: Why did my crawl fail?
Hi, Paul Can you post the error messages in the log file (file:/Users/ptomblin/nutch-1.0/logs)? On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin wrote: > Actually, I got that error the first time I used it, and then again when I > blew away the downloaded nutch and grabbed the latest trunk from Subversion. > > On Mon, Jul 27, 2009 at 1:11 AM, xiao yang wrote: > >> You must have crawled for several times, and some of them failed >> before the parse phase. So the parse data was not generated. >> You'd better delete the whole directory >> file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you >> will know the exact reason why it failed in the parse phase from the >> output information. >> >> Xiao >> >> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin wrote: >> > I installed nutch 1.0 on my laptop last night and set it running to crawl >> my >> > blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10 >> > it was still running strong when I went to bed several hours later, and >> this >> > morning I woke up to this: >> > >> > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 >> > -activeThreads=0 >> > Fetcher: done >> > CrawlDb update: starting >> > CrawlDb update: db: crawl.blog/crawldb >> > CrawlDb update: segments: [crawl.blog/segments/20090724010303] >> > CrawlDb update: additions allowed: true >> > CrawlDb update: URL normalizing: true >> > CrawlDb update: URL filtering: true >> > CrawlDb update: Merging segment data into db. 
>> > CrawlDb update: done >> > LinkDb: starting >> > LinkDb: linkdb: crawl.blog/linkdb >> > LinkDb: URL normalize: true >> > LinkDb: URL filter: true >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250 >> > LinkDb: adding segment: >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303 >> > Exception in thread "main" >> org.apache.hadoop.mapred.InvalidInputException: >> > Input path does not exist: >> > >> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data >> > at >> > >> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179) >> > at >> > >> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39) >> > at >> > >> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190) >> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797) >> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142) >> > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170) >> > at 
org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147) >> > at org.apache.nutch.crawl.Crawl.main(Crawl.java:129) >> > >> > >> > -- >> > http://www.linkedin.com/in/paultomblin >> > >> > > > > -- > http://www.linkedin.com/in/paultomblin >
Re: Nutch crawling status
I've found the script here: http://wiki.apache.org/nutch/MonitoringNutchCrawls. But I'm not sure how I can use it when Hadoop runs on a farm of 15 machines. Maybe I should somehow use the Hadoop TaskTracker instead of this script? caezar wrote: > > Hi All, > > Is there a way, to retrieve nutch crawling status at runtime? Let me > describe what I mean. For instance if currently fetch job is running, I > want to retrieve that fetch is running, how many URLs already fetched, how > many errors occured. Hadoop farm is used. > > Thanks for any ideas. > -- View this message in context: http://www.nabble.com/Nutch-crawling-status-tp24681707p24681949.html Sent from the Nutch - User mailing list archive at Nabble.com.
Nutch crawling status
Hi All, Is there a way to retrieve Nutch's crawling status at runtime? Let me describe what I mean: for instance, if a fetch job is currently running, I want to see that fetch is running, how many URLs have already been fetched, and how many errors have occurred. A Hadoop farm is used. Thanks for any ideas. -- View this message in context: http://www.nabble.com/Nutch-crawling-status-tp24681707p24681707.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to index other fields in solr
On Mon, Jul 27, 2009 at 09:34, Saurabh Suman wrote: > > I am using Solr for searching. I used the class SolrIndexer, but I can search > on content only. I want to search on author also. How do I index author? You need to write your own query plugin. Take a look at the query-basic plugin under src/plugin. > -- > View this message in context: > http://www.nabble.com/How-to-index-other-fields-in-solr-tp24674208p24674208.html > Sent from the Nutch - User mailing list archive at Nabble.com. > > -- Doğacan Güney
Re: Why did my crawl fail?
Actually, I got that error the first time I used it, and then again when I blew away the downloaded nutch and grabbed the latest trunk from Subversion. On Mon, Jul 27, 2009 at 1:11 AM, xiao yang wrote: > You must have crawled for several times, and some of them failed > before the parse phase. So the parse data was not generated. > You'd better delete the whole directory > file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you > will know the exact reason why it failed in the parse phase from the > output information. > > Xiao > > On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin wrote: > > I installed nutch 1.0 on my laptop last night and set it running to crawl > my > > blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10 > > it was still running strong when I went to bed several hours later, and > this > > morning I woke up to this: > > > > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 > > -activeThreads=0 > > Fetcher: done > > CrawlDb update: starting > > CrawlDb update: db: crawl.blog/crawldb > > CrawlDb update: segments: [crawl.blog/segments/20090724010303] > > CrawlDb update: additions allowed: true > > CrawlDb update: URL normalizing: true > > CrawlDb update: URL filtering: true > > CrawlDb update: Merging segment data into db. 
> > CrawlDb update: done > > LinkDb: starting > > LinkDb: linkdb: crawl.blog/linkdb > > LinkDb: URL normalize: true > > LinkDb: URL filter: true > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250 > > LinkDb: adding segment: > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303 > > Exception in thread "main" > org.apache.hadoop.mapred.InvalidInputException: > > Input path does not exist: > > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data > > at > > > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179) > > at > > > org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39) > > at > > > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190) > > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797) > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142) > > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170) > > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147) > > at 
org.apache.nutch.crawl.Crawl.main(Crawl.java:129) > > > > > > -- > > http://www.linkedin.com/in/paultomblin > > > -- http://www.linkedin.com/in/paultomblin
Re: How to index other fields in solr
Wouldn't that be done using facets, as per http://wiki.apache.org/solr/SimpleFacetParameters? On Mon, Jul 27, 2009 at 2:34 AM, Saurabh Suman wrote: > > I am using Solr for searching. I used the class SolrIndexer, but I can search > on content only. I want to search on author also. How do I index author? > -- > View this message in context: > http://www.nabble.com/How-to-index-other-fields-in-solr-tp24674208p24674208.html > Sent from the Nutch - User mailing list archive at Nabble.com. > > -- http://www.linkedin.com/in/paultomblin
Re: crawl-tool.xml
It's not only confusing me, it's also confusing the author, FrankMcCown, of the Nutch tutorial http://wiki.apache.org/nutch/NutchTutorial

Crawl Command: Configuration

To configure things for the crawl command you must:

* Create a directory with a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls/nutch containing the url of just the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain: http://lucene.apache.org/nutch/

* Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read: +^http://([a-z0-9]*\.)*apache.org/ This will include any url in the domain apache.org.

* Until someone can explain this: when I use the file crawl-urlfilter.txt, the filter doesn't work. Instead of it, I use the file conf/regex-urlfilter.txt and change the last line from "+." to "-.".

reinhard schwab schrieb: > i have tried the recrawl script of susam pal and have wondered why > url filtering no longer works. > http://wiki.apache.org/nutch/Crawl > > the mystery is > > only Crawl.java adds crawl-tool.xml to the NutchConfiguration. > > Configuration conf = NutchConfiguration.create(); > conf.addResource("crawl-tool.xml"); > > Fetcher.java and all the other tools which filter the outlinks do not > add this. > this is really confusing me and i have spent some time to figure this out. > > regards > reinhard
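[Editor's note] The filter behavior discussed above can be sketched outside Nutch. This is a simplification of how I understand regex-urlfilter.txt rules to work (first rule whose pattern matches the URL decides; '+' accepts, '-' rejects), using the tutorial's apache.org rule and the "-." final line the post recommends — verify the exact matching semantics against your Nutch version:

```python
# Sketch of regex-urlfilter.txt style filtering: ordered (sign, regex) rules,
# first match wins. The apache.org pattern is the one quoted from the
# tutorial; the final rule is the '-.' catch-all (reject everything else).
import re

rules = [
    ("+", r"^http://([a-z0-9]*\.)*apache.org/"),
    ("-", r"."),  # last line changed from '+.' to '-.', as described above
]

def accepted(url):
    """Apply rules in order; the first regex that matches decides the URL."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched

print(accepted("http://lucene.apache.org/nutch/"))  # True: in-domain
print(accepted("http://example.com/"))              # False: caught by '-.'
```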