how to upgrade a java application with nutch?
Hi! I have a Java application that I would like to extend with Nutch. Which jars should I add to my application's lib directory so that I can use Nutch features from some of my app pages and business logic classes? I've tried nutch-1.0.jar, generated by the war target, without success. Which Nutch build.xml target should I run for this, and which of the generated jars need to be included in my app? Apart from nutch-1.0.jar, are all the jars in nutch-1.0\lib compulsory, or just a few of them? Thanks in advance!
Re: how to upgrade a java application with nutch?
2009/10/1 Jaime Martín james...@gmail.com
> I've tried with nutch-1.0.jar generated by war target without success.

Maybe I'm doing it wrong, but I used the nutch-1.0.job file instead of the jar.

-- http://www.linkedin.com/in/paultomblin
Nutch randomly skipping locations during crawl
This is strange. I manage the webservers for a large university library. On our site we have a staff directory where each user has a location for information. The URLs take the form of: http://mydomain.edu/staff/userid I've added the staff URL to the urls seed file. But even with a crawl set to depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems to only fetch about 50% of the locations in this area of the site. What should I look for to find out why this is happening? -- View this message in context: http://www.nabble.com/Nutch-randomly-skipping-locations-during-crawl-tp25696893p25696893.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: how to upgrade a java application with nutch?
Jaime Martín wrote:
> What jars should I add to my lib application to make it possible to use nutch features from some of my app pages and business logic classes?

Nutch is not designed for embedding in other applications, so you may face numerous problems. I did such an integration once, and it was far from obvious. A lot also depends on whether you want to run it on a distributed cluster or in a single JVM (local mode).

Take a look at build/nutch*.job: it's a jar file that contains all the dependencies needed to run Nutch except for the Hadoop libraries (which are also required).

-- Best regards, Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Nutch randomly skipping locations during crawl
tsmori wrote:
> But even with a crawl set to depth of 8 and unlimited files, i.e. no topN setting, the crawl still seems to only fetch about 50% of the locations in this area of the site. What should I look for to find out why this is happening?

* Check that the pages there are not forbidden by robot rules (which may be embedded inside HTML meta tags of index.html, or in the top-level robots.txt).
* Check that your crawldb actually contains entries for these pages; perhaps they are being filtered out.
* Check your segments to see whether these URLs were scheduled for fetching and, if so, what the fetch status was.

-- Best regards, Andrzej Bialecki
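The crawldb check suggested above can be sketched with the `readdb` tool. This is only a sketch under stated assumptions: the install path /usr/local/nutch-1.0, the crawl directory `crawl`, and the sample staff URL are placeholders rather than details confirmed for this site, and the `run` helper is a hypothetical wrapper that prints each command and executes it only if the nutch launcher is present.

```shell
#!/bin/sh
# Assumptions (placeholders, adjust to your setup): Nutch 1.0 is installed in
# /usr/local/nutch-1.0, the crawl output is in ./crawl, and STAFF_URL is one
# of the pages that went missing.
NUTCH="${NUTCH:-/usr/local/nutch-1.0/bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
STAFF_URL="${STAFF_URL:-http://mydomain.edu/staff/userid}"

# Hypothetical helper: print the command, run it only if the launcher exists.
run() {
  echo "+ $*"
  if [ -x "$1" ]; then "$@"; fi
}

check_crawldb() {
  # Overall counts of crawldb entries per status.
  run "$NUTCH" readdb "$CRAWL_DIR/crawldb" -stats
  # The crawldb record for a single URL, if one exists.
  run "$NUTCH" readdb "$CRAWL_DIR/crawldb" -url "$STAFF_URL"
}

check_crawldb
```

If the single-URL lookup finds no record for a staff page, the URL never reached the crawldb, which points at link discovery or URL filtering rather than at the fetcher.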
Re: how to upgrade a java application with nutch?
Thank you for the info. That's really a problem. I have a Java project, and for some of its new features I would like to use Nutch. As I need to customise Nutch, my idea was the following:
- 1st: change whatever my requirements need in my downloaded Nutch and generate a Nutch library
- 2nd: add that library to the other project and invoke the library's features when needed

Is that not advisable? What is the best way, then, to generate a Nutch library to be used in other Java projects? Or is that not possible without going crazy over configuration issues?

2009/10/1 Andrzej Bialecki a...@getopt.org
> Nutch is not designed for embedding in other applications, so you may face numerous problems.
RE: Nutch randomly skipping locations during crawl
Yes. Also check whether some userids contain characters such as ?, @, *, !, or =. URLs containing them are filtered out by default: -[...@=]

Date: Thu, 1 Oct 2009 18:15:38 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Nutch randomly skipping locations during crawl

> * check that your crawldb actually contains entries for these pages - perhaps they are being filtered out.
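One way to test this suggestion is to run the staff URLs through the same character class outside of Nutch. The sketch below assumes the stock skip rule is `-[?*!@=]`, which is how I recall it from the conf/crawl-urlfilter.txt shipped with Nutch 1.0 and which matches the characters listed above; check the rule in your own conf directory before relying on it. The function name `would_be_skipped` and the sample URLs are hypothetical.

```shell
#!/bin/sh
# Assumption: the default skip rule is the character class [?*!@=];
# verify this against conf/crawl-urlfilter.txt in your own install.
SKIP_CHARS='[?*!@=]'

# would_be_skipped URL: exit status 0 if the URL contains one of the characters.
would_be_skipped() {
  printf '%s\n' "$1" | grep -q "$SKIP_CHARS"
}

# Hypothetical sample URLs; replace them with the real staff URLs.
for url in \
  'http://mydomain.edu/staff/jsmith' \
  'http://mydomain.edu/staff/index.php?id=jsmith'
do
  if would_be_skipped "$url"; then
    echo "SKIPPED  $url"
  else
    echo "kept     $url"
  fi
done
```

Any URL reported as SKIPPED would be dropped by such a rule before it is added to the crawldb, so it would not show up in the fetch logs at all.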
Re: how to upgrade a java application with nutch?
Hi Jaime,

Depending on what exactly you're trying to do, there are some other projects that offer crawler functionality which could be easier to embed. The two I know about are:
- Droids (http://incubator.apache.org/droids/), though I haven't really used it.
- Bixo (http://bixo.101tec.com/), which is a project I'm actively working on.

-- Ken

On Oct 1, 2009, at 9:37am, Jaime Martín wrote:
> what is the best way then to generate a nutch library to be used in other java projects?

-- Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-210-6378
RE: how to upgrade a java application with nutch?
Hi Jaime,

You don't have to embed. Try a (simplified) Nutch + SOLR setup (Nutch has a plugin for SOLR), and use the SolrJ client for SOLR from your application. This is very easy.

-Fuad
http://www.linkedin.com/in/liferay

-----Original Message-----
From: Jaime Martín [mailto:james...@gmail.com]
Sent: October-01-09 5:59 AM
To: nutch-user@lucene.apache.org
Subject: how to upgrade a java application with nutch?
Re: R: Using Nutch for only retriving HTML
BELLINI ADAM wrote:
> Hi, but how do I dump the content? I tried this command:
>
>   ./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto
>
> and it said:
>
>   Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate
>
> But crawl_generate is in this path:
>
>   /usr/local/nutch-1.0/crawl/segments/20091001120102
>
> and not in this one:
>
>   /usr/local/nutch-1.0/crawl/segments/20091001120102/content
>
> Can you please just give me the correct command?

This command will dump just the content part:

  ./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext

-- Best regards, Andrzej Bialecki
RE: Nutch randomly skipping locations during crawl
Both good ideas. Unfortunately, the content for each user is the same. It's a static php file that simply calls information out of our LDAP. It's very strange because I cannot see any difference between the user files/directories that are fetched and those that aren't. In checking both the crawl log and the hadoop log, the missing users are not even fetched. If it's a permissions issue, it's a very odd one. All the directories here have the same group membership and all files and directories under it are owner, group, and world readable/executable. The issue seems to be that they're not fetched and there's no indication in the logs why they aren't.
Re: Nutch randomly skipping locations during crawl
tsmori wrote:
> In checking both the crawl log and the hadoop log, the missing users are not even fetched.

Check the segment's crawl_generate and crawl_fetch, and also check your crawldb for status. Logs don't always contain this information.

> The issue seems to be that they're not fetched and there's no indication in the logs why they aren't.

See above.

-- Best regards, Andrzej Bialecki
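The segment check described above can be sketched with the `readseg` tool, dumping only the crawl_generate and crawl_fetch parts of each segment and then searching the dumps for a missing URL. This is a sketch under stated assumptions: the install path, the crawl directory, the output directory `segdump`, and the sample URL are placeholders, and the `run` helper is a hypothetical wrapper that prints each command and executes it only if the nutch launcher is present.

```shell
#!/bin/sh
# Assumptions (placeholders, adjust to your setup): Nutch 1.0 is installed in
# /usr/local/nutch-1.0, the crawl output is in ./crawl, dumps go to ./segdump,
# and STAFF_URL is one of the pages that went missing.
NUTCH="${NUTCH:-/usr/local/nutch-1.0/bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
DUMP_DIR="${DUMP_DIR:-segdump}"
STAFF_URL="${STAFF_URL:-http://mydomain.edu/staff/userid}"

# Hypothetical helper: print the command, run it only if the launcher exists.
run() {
  echo "+ $*"
  if [ -x "$1" ]; then "$@"; fi
}

check_segments() {
  # Summary of every segment under the segments directory.
  run "$NUTCH" readseg -list -dir "$CRAWL_DIR/segments"
  for seg in "$CRAWL_DIR"/segments/*; do
    [ -d "$seg" ] || continue
    # Keep crawl_generate and crawl_fetch; leave out content and parse data.
    run "$NUTCH" readseg -dump "$seg" "$DUMP_DIR/$(basename "$seg")" \
      -nocontent -noparse -noparsedata -noparsetext
  done
  # Search the text dumps for the URL that was expected to be fetched.
  if [ -d "$DUMP_DIR" ]; then
    grep -rF "$STAFF_URL" "$DUMP_DIR" || echo "not found in any segment dump"
  fi
}

check_segments
```

A URL that appears in no dump was never scheduled for fetching; one that appears with a fetch record carries a status there that the crawl log may not show.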
Re: Something wrong with nutch.wiki
2009/9/29 Ольга Пескова opesk...@mail.ru:
> Hello!
> Please check the url: http://wiki.apache.org/nutch/
> I can't find any content there.

Just as a point of reference, I got the FrontPage to pull up just prior to sending this e-mail. I'm not sure what is wrong with your connection to it, but I don't believe it is the server.

Kirby
Re: Something wrong with nutch.wiki
2009/10/1 Kirby Bohling kirby.bohl...@gmail.com:
> I'm not sure what is wrong with your connection to it, but I don't believe it is the server.

It was down for a number of hours today, but evidently it's back up now.

-- http://www.linkedin.com/in/paultomblin
Fetcher problems with stable version of nutch-1.0 ?
Hi all,

I am trying to use nutch to crawl and index a list of about 50K URLs with depth=1. I am running indexing with the command:

  nutch-1.0/bin/nutch crawl urls/ -depth 1 -topN 10

with appropriate changes to the configuration files. I find that the fetching always terminates prematurely and the logs show an error that looks like:

  activeThreads=200, spinWaiting=200, fetchQueues.totalSize=1
  Aborting with 200 hung threads.
  Fetcher: done

I have not seen this particular error message when using nutch-0.9. Is it advisable to revert to using nutch-0.9? Or do we have some kind of patch to fix this error?

Thanks, Vijay
RE: Something wrong with nutch.wiki
FWIW, I often have problems getting to wiki.apache.org. I could not get there this morning, and had to read what I needed out of the Google cache.

|-----Original Message-----
|From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul Tomblin
|Sent: Thursday, October 01, 2009 4:32 PM
|To: nutch-user@lucene.apache.org
|Subject: Re: Something wrong with nutch.wiki
|
|It was down for a number of hours today, but evidently it's back up now.