Why are specific URLs not fetched?
Hi, First of all, sorry for the long signature, but it's configured by an administrator. I'm using the pre-configured Nutch package [1], which contains some plugins and configuration to add fields used for integration with the TYPO3 CMS. Nutch 1.8 is used and in most cases it works like a charm. For one server, however, the whole process basically ends after fetching the seed URLs: nothing is listed in the parsing phase, and any run after the first one ends immediately with the notification that there was nothing to do. The seed URLs are publicly accessible (publications from a local government) and do not produce any errors in the browser dev tools. The content can be fetched with wget from the same server where Nutch is running. I'm looking for a way to find out what went wrong here. Where can I find information on what goes wrong during the fetch phase? I tried the IRC channel a few times, but at those times my only company was ChanServ ;-) Thanks in advance for any pointers! [1] https://github.com/dkd/nutch-typo3-cms -- Kind regards, Jigal van Hemert | Developer, alterNET Internet BV | www.alternet.nl
RE: Crawl URL with varying query parameter values
Hi - you probably have URL filtering enabled, the regex filter specifically. By default it filters out URLs with query strings. Check your URL filters. Markus

-----Original message-----
From: Krishnanand, Kartik <kartik.krishnan...@bankofamerica.com>
Sent: Friday 12th September 2014 13:04
To: user@nutch.apache.org
Subject: Crawl URL with varying query parameter values

Hi, Nutch Gurus, I need to crawl two dynamic pages: 1. http://example.com and 2. http://example.com?request_locale=es_US. The difference is that when the query parameter request_locale equals es_US, Spanish content is loaded. We would like to be able to crawl both URLs if possible. I have passed these URLs in my seed.txt, but the logs show that only the first URL is being crawled, not the second. I modified regex-normalize.xml so that it does not strip out query parameters; it is given below. How do I configure Nutch to crawl both URLs? Kartik

<regex-normalize>
<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>
<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/$3</substitution>
</regex>
-->
<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>
<!-- cleans ?&amp;var=value into ?var=value -->
<regex>
  <pattern>\?&amp;</pattern>
  <substitution>\?</substitution>
</regex>
<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
  <pattern>&amp;{2,}</pattern>
  <substitution>&amp;</substitution>
</regex>
<!-- removes trailing ? -->
<regex>
  <pattern>[\?&amp;\.]$</pattern>
  <substitution></substitution>
</regex>
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
</regex-normalize>
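To act on Markus' advice, the usual change is in conf/regex-urlfilter.txt rather than in regex-normalize.xml. A minimal sketch, assuming the stock filter file shipped with Nutch 1.x: either comment out the rule that rejects probable query URLs, or add an explicit accept rule above it. The last accept pattern below is illustrative, built only for the example.com URLs from the question:

# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]          <- commented out so URLs with ?request_locale=... pass the filter

# alternatively, accept only the locale-switch URLs, placed before any reject rules
+^http://example\.com/?(\?request_locale=.*)?$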
RE: generatorsortvalue
Hi - if you need inlinks as input, you need to change how Nutch works. By default, inlinks are only used when indexing. So depending on the scoring filter you implement, you also need to process inlinks at that stage (generator or updater). This is going to be a costly process, because the linkdb can grow quickly and becomes slow to process. Markus

-----Original message-----
From: Benjamin Derei <stygm...@gmail.com>
Sent: Saturday 13th September 2014 14:19
To: user@nutch.apache.org
Subject: Re: generatorsortvalue

Hi, But where can I get the inlinks containing URL and anchors? Ben. Sent from my iPad

On 10 Sep 2014, at 16:02, Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu> wrote:

Hi, Actually the generatorSortValue() method does not have access to the ParseData object (which holds all the info extracted by the parsers from the webpage raw content), as you pointed out. Essentially this method is used in the Generator class at a very early stage of the crawling process, well before the URL has been fetched or parsed (which is where the outlinks, i.e. the new links, come from). The best approach is to use generatorSortValue(), which assigns the initial score and will (as you figured out) get you where you want. How do you put your ismarked key into the CrawlDatum? Do you put it in the metadata? Perhaps you could alter the score in the CrawlDatum directly, since the default implementation of the scoring plugins for this method is: datum.getScore() * initSort. Taking into account what you're trying to do, I think you could use the passScoreAfterParsing() method of the ScoringFilter interface. This method gets called by the Fetcher after the parse process is done, so you'll have access to the parse metadata and can alter the value there. I'm not sure this will work, but it is at least worth checking out. One open question about this approach is whether the CrawlDatum score is synchronized with the Parse/Content score. Regards,

On Sep 10, 2014, at 3:24 AM, Benjamin Derei <stygm...@gmail.com> wrote:

Hello, I'm using Nutch 1.9. I want to alter the score used for sorting the topN pages for the next fetch/parse round. I found it works by modifying the return value of generatorSortValue() in a scoring filter plugin. But this function doesn't have anchor text among its inputs... I wrote some inelegant and inefficient code that puts an ismarked key in the CrawlDatum to record whether the anchor text or URL contains certain words... From which function should I do this? Is there a complete schema of the data path through each plugin type's functions? Benjamin. Sent from my iPad

On 10 Sep 2014, at 04:02, Jorge Luis Betancourt Gonzalez <jlbetanco...@uci.cu> wrote:

You'll need to write a couple of plugins to accomplish this. Which version of Nutch are you using? In the first case, is the score you want to alter the score that's indexed into Solr (i.e. your backend)? Regards,

On Sep 9, 2014, at 2:38 PM, Benjamin Derei <stygm...@gmail.com> wrote:

Hi, I'm a beginner in Java and Nutch. I want to steer the crawl with two rules: - if the language identifier plugin detects that the page is not French, the score used for sorting should be divided by two; - if an anchor text or a link pointing to this page contains certain terms, the score used for sorting should be multiplied by ten. Any help? Benjamin.
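For reference, a minimal sketch of the kind of scoring filter discussed in this thread, written against the Nutch 1.x ScoringFilter interface as I understand it. The "ismarked" metadata key and the "multiply by ten" idea are taken from the thread; the package name, class name, boost factor and the choice to leave the other methods as pass-throughs are all illustrative, not a definitive implementation. The plugin would still need a plugin.xml registering it against the org.apache.nutch.scoring.ScoringFilter extension point.

package org.example.scoring;                       // hypothetical package

import java.util.Collection;
import java.util.List;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class MarkerScoringFilter implements ScoringFilter {

  // hypothetical metadata key that some earlier step put into the CrawlDatum
  private static final Text MARKER = new Text("ismarked");

  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  // Boost marked URLs so that generate -topN prefers them.
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    float sort = datum.getScore() * initSort;     // default behaviour mentioned in the thread
    if (datum.getMetaData().containsKey(MARKER)) {
      sort *= 10.0f;                              // illustrative boost factor
    }
    return sort;
  }

  // The remaining ScoringFilter methods are pass-throughs in this sketch.
  public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException {}
  public void initialScore(Text url, CrawlDatum datum) throws ScoringFilterException {
    datum.setScore(0.0f);
  }
  public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
      throws ScoringFilterException {}
  public void passScoreAfterParsing(Text url, Content content, Parse parse)
      throws ScoringFilterException {}
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
      throws ScoringFilterException {
    return adjust;
  }
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) throws ScoringFilterException {}
  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
      throws ScoringFilterException {
    return initScore;
  }
}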
RE: Fetch Job Started Failing on Hadoop Cluster
Hi - you made Nutch believe that hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a segment, but it is not. So either no segment was created, or it was written to the wrong location. I don't know what kind of script you are using, but you should check the return code of the generator; it gives a -1 when no segment was created. Markus

-----Original message-----
From: Meraj A. Khan <mera...@gmail.com>
Sent: Monday 15th September 2014 7:02
To: user@nutch.apache.org
Subject: Fetch Job Started Failing on Hadoop Cluster

Hello Folks, My Nutch crawl, which was running fine, started failing in the first Fetch job/application. I am unable to figure out what is going on here. I have attached the last snippet of the log below; can someone please let me know what is going wrong? What I noticed is that even though the generate phase created a segment 20140915004940, the fetch phase is only looking at the segments directory itself. Thanks.

14/09/15 00:50:07 INFO crawl.Generator: Generator: finished at 2014-09-15 00:50:07, elapsed: 00:00:59
ls: cannot access crawldirectory/segments/: No such file or directory
Operating on segment :
Fetching :
14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: starting at 2014-09-15 00:50:09
14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: segment: crawldirectory/segments
14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher Timelimit set for : 1410767409664
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/09/15 00:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at server1.mydomain.com/170.75.152.162:8040
14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at server1.mydomain.com/170.75.152.162:8040
14/09/15 00:50:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/df/.staging/job_1410742329411_0010
14/09/15 00:50:12 WARN security.UserGroupInformation: PriviledgedActionException as:df (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
14/09/15 00:50:12 WARN security.UserGroupInformation: PriviledgedActionException as:df (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
14/09/15 00:50:12 ERROR fetcher.Fetcher: Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
        at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:108)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1349)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1385)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1358)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
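To make the failure visible in a script, the check Markus describes can look roughly like this. A minimal sketch, assuming a bin/crawl-style loop running against HDFS; the crawldirectory path is taken from the log above and the -topN/-threads values are illustrative only:

bin/nutch generate crawldirectory/crawldb crawldirectory/segments -topN 50000
if [ $? -ne 0 ]; then
  echo "Generator created no new segment, nothing to fetch in this cycle"
  exit 0
fi
# pick the newest segment on HDFS instead of handing the bare segments/ dir to the fetcher
# (a plain local 'ls' fails here, as the "ls: cannot access" line in the log shows)
SEGMENT=$(hadoop fs -ls crawldirectory/segments/ | grep segments/20 | awk '{print $NF}' | sort | tail -n 1)
bin/nutch fetch "$SEGMENT" -threads 10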
Re: Running Crawls via REST API
If you investigate the bin/nutch script, you will notice that each command supported by Nutch calls a Java class. You can use the same approach in your Java code, that is, call the appropriate Java class with the required parameters. Regards Ameer
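A minimal sketch of what Ameer describes, assuming the Nutch 1.x classes are on the classpath; the crawldb/seed paths and the choice of jobs are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class RunNutchSteps {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // same arguments the bin/nutch wrapper would pass on the command line
    ToolRunner.run(conf, new Injector(), new String[] { "crawl/crawldb", "urls" });
    ToolRunner.run(conf, new Generator(), new String[] { "crawl/crawldb", "crawl/segments" });
    // fetch, parse, updatedb, index, etc. can be chained in the same way
  }
}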
Re: Running Crawls via REST API
If possible I would prefer not to write a single line of Java. That is why I am wondering whether it is possible to do everything via REST. But so far it seems like ssh might be the better remote interface. Kind Regards Johannes
Re: Why are specific URLs not fetched?
Hi, Thanks for your reply. On 16 September 2014 12:28, Markus Jelsma <markus.jel...@openindex.io> wrote:

Hi - it is usually a problem with URL filters, which by default do not accept query strings etc. Check your URL filters.

In regex-urlfilter.txt the line -[?*!@=] is already disabled (commented out as #-[?*!@=]). prefix-urlfilter.txt contains both http:// and https://. I've checked the rest of the files in the configuration, but can't find anything that is either not the default or that would match the start URLs: http://lochem.raadsinformatie.nl/sitemap/meetings/2013/ (the others have a different year at the end). The pages have one validation error in the W3C validator (a style tag without a type attribute), but I don't think this should be a problem. Any ideas? -- Kind regards, Jigal van Hemert | Developer, alterNET Internet BV | www.alternet.nl
RE: Why are specific URLs not fetched?
You can use the bin/nutch parsechecker tool to see whether the URLs are properly extracted from the webpages. Then use the bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined tool to see whether some filter removes your URLs. They may also be normalized to something undesirable, but that's not usually the case.

-----Original message-----
From: Jigal van Hemert | alterNET internet BV <ji...@alternet.nl>
Sent: Tuesday 16th September 2014 16:04
To: user@nutch.apache.org
Subject: Re: Why are specific URLs not fetched?
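For completeness, those checks can be run from the Nutch runtime directory roughly as follows. The seed URL is the one from this thread; the exact options and output format may differ slightly per Nutch version, so treat this as a sketch:

# does the page fetch and parse, and which outlinks are extracted?
bin/nutch parsechecker http://lochem.raadsinformatie.nl/sitemap/meetings/2013/

# does any URL filter reject the URL? (reads URLs from stdin, prints + or - per URL)
echo "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

# how the URL looks after normalization
echo "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/" | bin/nutch org.apache.nutch.net.URLNormalizerChecker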
RE: Revisiting Loops Job in Nutch Trunk
Hi - So you are not using it for scoring, right, but to inspect the graph of the web. Then there's certainly no need to weed out loops using the Loops algorithm, nor a need to run the LinkRank job. Markus

-----Original message-----
From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Sent: Thursday 11th September 2014 19:53
To: user@nutch.apache.org
Subject: Re: Revisiting Loops Job in Nutch Trunk

Hi Markus, On Wed, Sep 10, 2014 at 10:28 PM, user-digest-h...@nutch.apache.org wrote:

Weird, I didn't see my own mail arriving on the list; I sent it via KMail but am on webmail now, which seems to work. Sigh ;) Anyway, for vertical search on a whole website I would rely on your (customized) Lucene similarity and proper analysis, but also on downgrading 'bad' pages, for which you can make custom classifier plugins in Nutch.

Yep, this sounds much more appropriate for the task at hand. I have debugged the WebGraph code as well as some of the tools within this environment... it is not an apples-to-apples fit for what I am trying to achieve.

That way you can, for example, get rid of hub pages and promote actual content.

Yeah, I understand.

Anyway, it all depends on what you want to achieve, which is? :)

- Networks. Specifically, domain-specific networks...
- How they are formed and where they come from.
- Where the traffic comes from (by server host, server IP, client IP and by content relevance).
- What the graph looks like within these domain-specific networks.

By the way, within this context, I think that a dense graph is probably OK. I am looking for this, actually.
Re: Fetch Job Started Failing on Hadoop Cluster
Markus, Thanks. The issue was that I was setting the PATH variable inside the bin/crawl script; once I removed it and set it outside of the bin/crawl script, it started working fine. On Tue, Sep 16, 2014 at 6:39 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

Hi - you made Nutch believe that hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a segment, but it is not. So either no segment was created, or it was written to the wrong location. I don't know what kind of script you are using, but you should check the return code of the generator; it gives a -1 when no segment was created. Markus
Re: Running Crawls via REST API
Hi Johannes, On Tue, Sep 16, 2014 at 10:19 AM, user-digest-h...@nutch.apache.org wrote:

Is it possible to have Nutch as a kind of stand-alone crawl server, only spoken to via the REST API?

Yes, this is possible. We just finished a Google Summer of Code project which addresses exactly this via a Wicket-based web application. We are working on the final aspects of the patch before it is attached to the relevant issue: https://issues.apache.org/jira/browse/NUTCH-841

I found the generic tutorial to set up the Nutch server with Cassandra and found this wiki page https://wiki.apache.org/nutch/NutchRESTAPI but it leaves me a bit confused about how I can actually start some full fetch cycles.

Yep, this is something we need to add to the documentation. We will do this in due course.

I probably need to create some fetch job, but what is actually the full command with options to send via REST?

https://wiki.apache.org/nutch/NutchRESTAPI#Create_job

Might anybody point me to some working examples? I started digging through the Java code, but it seems to be only generic key-value settings.

A fully fledged crawl command has been deprecated in Nutch for a while. Therefore the REST commands you submit to the Nutch 2.X REST API (I suggest you use Nutch 2.3-SNAPSHOT) need to be chained together sequentially. I've been testing this out over the summer using the RESTClient plugin for Firefox... it's been working well. Hope this helps you out. Lewis
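As an illustration of that chaining, a job-creation call can look roughly like the following. This is a sketch from memory of the wiki page linked above; the port, endpoint and in particular the args key names are assumptions to verify against your Nutch 2.x version, not the definitive API:

# assumes the REST server was started with: bin/nutch nutchserver
curl -X POST -H "Content-Type: application/json" http://localhost:8081/job/create -d '{
  "crawlId": "crawl-01",
  "type": "INJECT",
  "confId": "default",
  "args": { "seedDir": "/path/to/seed/urls" }
}'
# subsequent GENERATE, FETCH, PARSE and UPDATEDB jobs are created the same way,
# polling the job status endpoint until each one finishes before submitting the next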
Re: Revisiting Loops Job in Nutch Trunk
Hi Markus, On Tue, Sep 16, 2014 at 10:19 AM, user-digest-h...@nutch.apache.org wrote:

Hi - So you are not using it for scoring, right, but to inspect the graph of the web.

Yeah, I think that this is a pretty accurate statement.

Then there's certainly no need to weed out loops using the Loops algorithm, nor a need to run the LinkRank job.

OK doke Markus. I'm going to revisit the documentation for all of the WebGraph classes in an attempt to further define, explain and clarify the data structures. Thanks for your feedback. Lewis
Re: Plugin loading and NUTCH-609
On 15 Sep 2014, at 11:36, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

Hi Julien, see my inline replies.

Hi Edoardo, See my comments below. On 12 September 2014 11:11, Edoardo Causarano <edoardo.causar...@gmail.com> wrote:

Hi all, I'm completely lost, can anyone help me out here? I have this job.jar which contains all the Nutch code, dependencies and plugins. I don't understand how I keep getting this error:

2014-09-12 11:51:04,458 WARN [main] org.apache.nutch.plugin.PluginRepository: Plugins: not a file: url. Can't load plugins from: jar:file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/oracle/appcache/application_1410512500237_0003/filecache/10/job.jar/job.jar!/lib/plugins

OK, I have to admit that I'm mucking around with the project structure, but having found NUTCH-609 it seems that the PluginManifestParser does not support loading plugins from the job payload itself. Is this the case? If so, can anyone tell me where I need to unpack these plugins so that the loader will pick them up?

Why do you build the job jar yourself instead of using the one that our ant script builds? If you look at it, the plugins are in /classes/plugins/ within the jar.

Well, basically because I'm not familiar at all with Ivy and wanted to dive into the tool to understand a bit how it works :) But yes, I solved the issue by correcting the target folder in the Maven assembly descriptor (so that Hadoop unpacks this folder as well).

Side note: is anyone interested in overhauling this loading mechanism? The XML manifest could be replaced with an annotation class, although I would be happy enough if I could include it in the jar and load it from there.

I like the idea of replacing the XML manifest with annotations - or maybe initially allowing both. In an ideal world plugins would be handled as dependencies and we could just get the jars for them. I am sure there would be a way of making the XML file a part of the artefact, but if we don't have to, and can have a pom and a jar, then it would certainly be simpler. Feel free to open a new JIRA for this and contribute a patch if you can.

I was already looking into that, and had to hack away a bit at the plugin manifest parser. It seems to work alright but then explodes at runtime when loading the class (an NPE in the classloader, if I remember correctly). I mixed several changes together, so I'll have to clean up and organize my thoughts ;) I was thinking the following would work: plugin jars stay where they are, move plugin.xml into the jar's META-INF/nutch, iterate over the plugin paths, parse the XML and load the classes declared in it. Does the plugin also need to export lib folders?

Best, Edoardo

Thanks Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
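For readers following along, the XML manifest being discussed is the per-plugin plugin.xml file parsed by PluginManifestParser. A typical one looks roughly like the sketch below, modelled on the manifests shipped with Nutch's bundled plugins (here the stock regex URL filter); attribute values such as version and provider-name may differ in your checkout:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="urlfilter-regex" name="Regex URL Filter" version="1.0.0" provider-name="nutch.org">
  <runtime>
    <!-- the jar built for this plugin; exported classes are visible to other plugins -->
    <library name="urlfilter-regex.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
    <import plugin="lib-regex-filter"/>
  </requires>
  <!-- binds the implementation class to the URLFilter extension point -->
  <extension id="org.apache.nutch.net.urlfilter.regex" name="Nutch Regex URL Filter"
             point="org.apache.nutch.net.URLFilter">
    <implementation id="RegexURLFilter" class="org.apache.nutch.urlfilter.regex.RegexURLFilter"/>
  </extension>
</plugin>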
Re: Running Crawls via REST API
Hi Lewis, yes, it is helping a bit. But especially the pre-seeding is confusing me a bit for the moment. Can/will the REST API create the needed directories? The code for the webapp would be interesting to look at, but I actually need to use Nutch as a middle step in another piece of software, so I would not be showing some other app. Kind Regards Johannes