Why are specific URLs not fetched?

2014-09-16 Thread Jigal van Hemert | alterNET internet BV
Hi,

First of all sorry for the long signature, but it's configured by an
administrator.

I'm using a pre-configured Nutch package [1] which contains some plugins
and configuration to add fields that are used for integration with the
TYPO3 CMS. Nutch 1.8 is used and in most cases it works like a charm.

For one server the whole process basically ends after fetching the
seed URLs. Nothing is listed in the parsing phase. Any run after the
first one ends immediately with the notification that there was nothing
to do.
The seed URLs are publicly accessible (publications from a local
government) and do not produce any errors in browser dev tools. The
content can be fetched with wget from the same server where Nutch is
running.

I'm looking for a way to find out what went wrong here. Where can I
find information on what goes wrong during the fetch phase?

I tried the IRC channel a few times, but at those times my only
company was ChanServ ;-)

Thanks in advance for any pointers!

[1] https://github.com/dkd/nutch-typo3-cms
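One way to inspect what happened to individual URLs after a run is the crawldb reader; a sketch, where the crawl directory name and the URL are placeholders:

bin/nutch readdb crawldirectory/crawldb -stats
bin/nutch readdb crawldirectory/crawldb -url http://example.com/somepage

The per-URL output shows the fetch status (db_unfetched, db_fetched, db_gone, ...), and in a local-mode run logs/hadoop.log records the fetcher's per-URL activity.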

-- 


Kind regards,


Jigal van Hemert | Developer



Langesteijn 124
3342LG Hendrik-Ido-Ambacht

T. +31 (0)78 635 1200
F. +31 (0)848 34 9697
KvK. 23 09 28 65

ji...@alternet.nl
www.alternet.nl




RE: Crawl URL with varying query parameters values

2014-09-16 Thread Markus Jelsma
Hi - you probably have URL filtering enabled, the regex filter specifically. By
default it filters out URLs with query strings. Check your URL filters.

Markus
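For reference, the rule in question lives in conf/regex-urlfilter.txt; in the stock 1.x configuration it looks roughly like the snippet below, and commenting it out (or narrowing it) lets URLs with query strings through. This is a sketch of the default file, so check the exact contents of your own copy:

# skip URLs containing certain characters as probable queries, etc.
# disable or relax this rule if you need to crawl URLs with query strings
#-[?*!@=]

# accept anything else
+.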




 
 
-Original message-
 From:Krishnanand, Kartik kartik.krishnan...@bankofamerica.com
 Sent: Friday 12th September 2014 13:04
 To: user@nutch.apache.org
 Subject: Crawl URL with varying query parameters values
 
 Hi, Nutch Gurus,
 
 I need to crawl two dynamically generated pages:
 
 1. http://example.com and
 
 2. http://example.com?request_locale=es_US
 
 The difference is that when the query parameter request_locale equals
 es_US, Spanish content is loaded. We would like to be able to crawl both
 URLs if possible. I have added these URLs to my seed.txt, but the logs
 show that only the first URL is being crawled, not the second.
 
 I modified regex-normalize.xml so that it does not strip out query
 parameters; it is given below. How do I configure Nutch to crawl both URLs?
 
 Kartik
 
 <regex-normalize>
 
 <!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
 <regex>
   <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
   <substitution>$4</substitution>
 </regex>
 
 <!-- changes default pages into standard for /index.html, etc. into /
 <regex>
   <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
   <substitution>/$3</substitution>
 </regex> -->
 
 <!-- removes interpage href anchors such as site.com#location -->
 <regex>
   <pattern>#.*?(\?|&amp;|$)</pattern>
   <substitution>$1</substitution>
 </regex>
 
 <!-- cleans ?&amp;var=value into ?var=value -->
 <regex>
   <pattern>\?&amp;</pattern>
   <substitution>\?</substitution>
 </regex>
 
 <!-- cleans multiple sequential ampersands into a single ampersand -->
 <regex>
   <pattern>&amp;{2,}</pattern>
   <substitution>&amp;</substitution>
 </regex>
 
 <!-- removes trailing ? -->
 <regex>
   <pattern>[\?&amp;\.]$</pattern>
   <substitution></substitution>
 </regex>
 
 <!-- removes duplicate slashes -->
 <regex>
   <pattern>(?&lt;!:)/{2,}</pattern>
   <substitution>/</substitution>
 </regex>
 
 </regex-normalize>
 
 



RE: generatorsortvalue

2014-09-16 Thread Markus Jelsma
Hi - if you need inlinks as input you need to change how Nutch works. By
default, inlinks are only used when indexing. So depending on the scoring
filter you implement, you also need to process inlinks at that stage (generator
or updater). This is going to be a costly process because the linkdb can grow
quickly and become slow to process.

Markus
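A minimal sketch of the approach discussed further down in this thread: a scoring filter that boosts the generator sort value when an earlier step has stored a marker in the CrawlDatum metadata. The "ismarked" key and the class name are made up for illustration, and the remaining ScoringFilter methods (injectedScore, passScoreAfterParsing, etc.) would still have to be implemented, e.g. as no-ops:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;

// Partial sketch; add "implements org.apache.nutch.scoring.ScoringFilter"
// and the other interface methods to turn this into a real plugin.
public class MarkerScoringFilter {

  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }

  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    // default behaviour of the stock scoring plugins
    float score = datum.getScore() * initSort;
    // promote pages that were marked earlier (hypothetical metadata key)
    if (datum.getMetaData().containsKey(new Text("ismarked"))) {
      score *= 10.0f;
    }
    return score;
  }
}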




 
 
-Original message-
 From:Benjamin Derei stygm...@gmail.com
 Sent: Saturday 13th September 2014 14:19
 To: user@nutch.apache.org
 Subject: Re: generatorsortvalue
 
 Hi,
 
 But where can I get the inlinks containing URLs and anchors?
 
 Ben.
 
 Sent from my iPad
 
  On 10 Sep 2014, at 16:02, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote:
  
  Hi, 
  
  Actually the generatorSortValue() method does not have access to the 
  ParseData object (which holds all the info extracted by the parsers from 
  the webpage's raw content), as you pointed out. Essentially this method is 
  used in the Generator class at a very early stage of the crawling process, 
  way before the URLs have been fetched or parsed (which is where the 
  outlinks, i.e. the new links, come from). 
  
  The best approach is to use generatorSortValue(), which assigns the 
  initial sort value and actually will (as you figured out) get you where you 
  want. 
  
  How do you put your ismarked key into the CrawlDatum? Do you put it in the 
  metadata? Perhaps you could alter the score in the CrawlDatum directly, since 
  the default implementation of the scoring plugins for this method is: 
  datum.getScore() * initSort;
  
  Taking into account what you're trying to do, I think you could use the 
  passScoreAfterParsing() method of the ScoringFilter interface. This method 
  gets called by the Fetcher after the parse process is done, so you'll have 
  access to the parse metadata and you can alter this value. I'm not sure if 
  this will work, but it is at least worth checking out. One open question about 
  this approach is whether the CrawlDatum score is synchronized with the 
  Parse/Content score.
  
  Regards,
  
  On Sep 10, 2014, at 3:24 AM, Benjamin Derei stygm...@gmail.com wrote:
  
  Hello,
  
  I'm using Nutch 1.9.
  I want to alter the score used for sorting the topN pages for the next 
  fetch/parse round.
  I got it working by modifying the return value of generatorSortValue() in a 
  ScoringFilter plugin.
  But this function doesn't have the anchor texts among its inputs...
  I wrote some inelegant and inefficient code that puts an ismarked key in the 
  CrawlDatum to record whether the anchor text or URL contains certain words... 
  From which function should I do this?
  Is there a complete schema of the data flow through each plugin type's 
  functions?
  
  Benjamin.
  
  Sent from my iPad
  
  On 10 Sep 2014, at 04:02, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote:
  
  You'll need to write a couple of plugins to accomplish this. Which 
  version of Nutch are you using? In the first case, is the score you want to 
  alter the one that's indexed into Solr (i.e. your backend)? 
  
  Regards,
  
  On Sep 9, 2014, at 2:38 PM, Benjamin Derei stygm...@gmail.com wrote:
  
  Hi,
  
  I'm a beginner in Java and Nutch.
  
  I want to steer the crawl with two rules:
  - if the language identifier plugin detects that a page is not French, its
  sort score should be divided by two;
  - if an anchor text or a link targeting this page contains certain terms, its
  sort score should be multiplied by ten.
  
  Any help?
  
  Benjamin.
  
 



RE: Fetch Job Started Failing on Hadoop Cluster

2014-09-16 Thread Markus Jelsma
Hi - you made Nutch believe that 
hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a 
segment, but it is not. So either no segment was created, or it was written to 
the wrong location.

I don't know what kind of script you are using, but you should check the return 
code of the generator; it returns -1 when no segment was created.

Markus
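If the steps are driven from a shell script, a minimal guard around the generate step could look like the sketch below. The paths are placeholders and the exact exit-code convention should be verified against your Nutch version (a -1 from the Java tool typically shows up as a non-zero code in the shell):

bin/nutch generate crawldirectory/crawldb crawldirectory/segments -topN 50000
RC=$?
if [ $RC -ne 0 ]; then
  echo "Generator created no new segment (exit code $RC), skipping fetch"
  exit 0
fi
# fetch only the newest segment actually created by the generator
SEGMENT=`ls -d crawldirectory/segments/2* | tail -1`
bin/nutch fetch "$SEGMENT"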




 
 
-Original message-
 From:Meraj A. Khan mera...@gmail.com
 Sent: Monday 15th September 2014 7:02
 To: user@nutch.apache.org
 Subject: Fetch Job Started Failing on Hadoop Cluster
 
 Hello Folks,
 
 My Nutch crawl, which was running fine, started failing in the first Fetch
 job/application. I am unable to figure out what is going on here; I have
 attached the last snippet of the log below. Can someone please let me know
 what is going on?
 
 What I noticed is that even though the generate phase created a
 segment (20140915004940), the fetch phase is only looking at the segments
 directory itself, not at the individual segment.
 
 Thanks.
 
 14/09/15 00:50:07 INFO crawl.Generator: Generator: finished at 2014-09-15
 00:50:07, elapsed: 00:00:59
 ls: cannot access crawldirectory/segments/: No such file or directory
 Operating on segment :
 Fetching :
 14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: starting at 2014-09-15
 00:50:09
 14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: segment:
 crawldirectory/segments
 14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher Timelimit set for :
 1410767409664
 Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
 /opt/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0 which might have disabled
 stack guard. The VM will try to fix the stack guard now.
 It's highly recommended that you fix the library with 'execstack -c
 libfile', or link it with '-z noexecstack'.
 14/09/15 00:50:10 WARN util.NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at
 server1.mydomain.com/170.75.152.162:8040
 14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at
 server1.mydomain.com/170.75.152.162:8040
 14/09/15 00:50:12 INFO mapreduce.JobSubmitter: Cleaning up the staging area
 /tmp/hadoop-yarn/staging/df/.staging/job_1410742329411_0010
 14/09/15 00:50:12 WARN security.UserGroupInformation:
 PriviledgedActionException as:df (auth:SIMPLE)
 cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
 exist: hdfs://
 server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
 14/09/15 00:50:12 WARN security.UserGroupInformation:
 PriviledgedActionException as:df (auth:SIMPLE)
 cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
 exist: hdfs://
 server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
 14/09/15 00:50:12 ERROR fetcher.Fetcher: Fetcher:
 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
 hdfs://
 server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
 at
 org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
 at
 org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:108)
 at
 org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
 at
 org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
 at
 org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
 at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
 at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
 at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
 at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1349)
 at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1385)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1358)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 

Re: Running Crawls via REST API

2014-09-16 Thread atawfik
If you investigate the bin/nutch script, you will notice that each command
supported by Nutch calls a Java program or class. You can use the same
approach in your own Java code, i.e. call the appropriate Java class with the
required parameters.

Regards
Ameer
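For example, the inject step could be driven from Java roughly like this; it mirrors what bin/nutch does for the inject command. The crawldb and seed-directory paths are placeholders:

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class InjectFromJava {
  public static void main(String[] args) throws Exception {
    // Injector is a Hadoop Tool, so it can be run programmatically
    // with ToolRunner, just like bin/nutch inject does.
    int res = ToolRunner.run(NutchConfiguration.create(), new Injector(),
        new String[] { "crawl/crawldb", "urls" });
    System.exit(res);
  }
}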





Re: Running Crawls via REST API

2014-09-16 Thread Johannes Goslar
If possible I do not want to write a single line of Java. That is why I am 
wondering whether it is possible to do everything via REST. But so far it seems 
like SSH might be the better remote interface.

Kind Regards
Johannes

Re: Why are specific URLs not fetched?

2014-09-16 Thread Jigal van Hemert | alterNET internet BV
Hi,

Thanks for your reply.

On 16 September 2014 12:28, Markus Jelsma markus.jel...@openindex.io wrote:
 Hi - it is usually a problem with URL filters, which by default do not accept 
 query strings etc. Check your URL filters.

In regex-urlfilter.txt the line
#-[?*!@=]
is already disabled.

prefix-urlfilter.txt contains both http:// and https://

I've checked the rest of the files in the configuration, but can't
find anything that is either not the default or that would match the
start URLs:
http://lochem.raadsinformatie.nl/sitemap/meetings/2013/
(the others have a different year at the end)
The pages have one validation error in the w3c validator (a style tag
without a type attribute), but I don't think this should be a problem.

Any ideas?


-- 

Kind regards,

Jigal van Hemert | Developer


RE: Why are specific URLs not fetched?

2014-09-16 Thread Markus Jelsma
You can use the bin/nutch parsechecker tool to see if the URLs are properly 
extracted from web pages. Then use the bin/nutch 
org.apache.nutch.net.URLFilterChecker -allCombined tool to see whether some 
filter removes your URLs. They may also be normalized to something undesirable, 
but that's not usually the case. 
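For example, using the seed URL from this thread (check each tool's usage message for the exact options available in your version):

bin/nutch parsechecker -dumpText http://lochem.raadsinformatie.nl/sitemap/meetings/2013/

echo "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/" | \
  bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined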
 
-Original message-
 From:Jigal van Hemert | alterNET internet BV ji...@alternet.nl
 Sent: Tuesday 16th September 2014 16:04
 To: user@nutch.apache.org
 Subject: Re: Why are specific URLs not fetched?
 
 Hi,
 
 Thanks for your reply.
 
 On 16 September 2014 12:28, Markus Jelsma markus.jel...@openindex.io wrote:
  Hi - it is usually a problem with URL filters, which by default do not 
  accept query strings etc. Check your URL filters.
 
 In regex-urlfilter.txt the line
 #-[?*!@=]
 is already disabled.
 
 prefix-urlfilter.txt contains both http:// and https://
 
 I've checked the rest of the files in the configuration, but can't
 find anything that is either not the default or that would match the
 start URLs:
 http://lochem.raadsinformatie.nl/sitemap/meetings/2013/
 (the others have a different year at the end)
 The pages have one validation error in the w3c validator (a style tag
 without a type attribute), but I don't think this should be a problem.
 
 Any ideas?
 
 
 -- 
 
 Kind regards,
 
 Jigal van Hemert | Developer
 


RE: Revisiting Loops Job in Nutch Trunk

2014-09-16 Thread Markus Jelsma
Hi - so you are not using it for scoring, then, but to inspect the graph of the 
web. In that case there's certainly no need to weed out loops using the loops 
algorithm, nor any need to run the LinkRank job.
Markus

 
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Thursday 11th September 2014 19:53
 To: user@nutch.apache.org
 Subject: Re: Revisiting Loops Job in Nutch Trunk
 
 Hi Markus,
 
 On Wed, Sep 10, 2014 at 10:28 PM, user-digest-h...@nutch.apache.org wrote:
 
 
  Weird, i didn't see my own mail arriving on the list, i sent it via kmail
  but am on webmail now, which seems to work.
 
 
 sigh ;)
 
 
  Anyway, for vertical search on a whole website i would rely on your
  (customized) Lucene similarity and proper analysis, but also downgrading
  `bad` pages for which you can make custom classifier plugins in Nutch.
 
 
 Yep, this sounds much more appropriate for the task at hand. I have
 debugged the Webgraph code as well as some of the tools within this
 environment... it is not an apple-for-apple fit for what I am trying to
 achieve.
 
 
  That way you can, for example, get rid of hub pages and promote actual
  content.
 
 
 Yeah. I understand.
 
 
 
  Anyway, it all depends on what you want to achieve, which is? :)
 
 
 
- Networks. Specifically, domain specific networks...
- how they are formed and where they come from.
- Where the traffic comes from (by server host, server IP, client IP and
by content relevance)
- what the graph looks like within these domain specific, networks. By
the way, within this context, I think that a dense graph is probably OK. I
am looking for this actually.
 


Re: Fetch Job Started Failing on Hadoop Cluster

2014-09-16 Thread Meraj A. Khan
Markus,

Thanks. The issue was that I was setting the PATH variable in the bin/crawl
script; once I removed it and set it outside of the bin/crawl script, it
started working fine.



On Tue, Sep 16, 2014 at 6:39 AM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi - you made Nutch believe that
 hdfs://server1.mydomain.com:9000/user/df/crawldirectory/segments/ is a
 segment, but it is not. So either no segment was created or written to the
 wrong location.

 I don't know what kind of script you are using but you should check the
 return
 code of the generator, if gives a -1 for no segment created.

 Markus






 -Original message-
  From:Meraj A. Khan mera...@gmail.com
  Sent: Monday 15th September 2014 7:02
  To: user@nutch.apache.org
  Subject: Fetch Job Started Failing on Hadoop Cluster
 
  Hello Folks,
 
  My Nutch crawl which was running fine , started failing in the first
 Fetch
  Job/Application, I am unable to figure out whats going on here, I have
  attached the last snippet of the log below , can some please let me know
  whats going on here ?
 
  What I noticed is that even though the generate phase created a
  segment 20140915004940
  , the fetch phase is only looking up to the segments directory for the
  segments.
 
  Thanks.
 
  14/09/15 00:50:07 INFO crawl.Generator: Generator: finished at 2014-09-15
  00:50:07, elapsed: 00:00:59
  ls: cannot access crawldirectory/segments/: No such file or directory
  Operating on segment :
  Fetching :
  14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: starting at 2014-09-15
  00:50:09
  14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher: segment:
  crawldirectory/segments
  14/09/15 00:50:09 INFO fetcher.Fetcher: Fetcher Timelimit set for :
  1410767409664
  Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
  /opt/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0 which might have disabled
  stack guard. The VM will try to fix the stack guard now.
  It's highly recommended that you fix the library with 'execstack -c
  libfile', or link it with '-z noexecstack'.
  14/09/15 00:50:10 WARN util.NativeCodeLoader: Unable to load
 native-hadoop
  library for your platform... using builtin-java classes where applicable
  14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at
  server1.mydomain.com/170.75.152.162:8040
  14/09/15 00:50:10 INFO client.RMProxy: Connecting to ResourceManager at
  server1.mydomain.com/170.75.152.162:8040
  14/09/15 00:50:12 INFO mapreduce.JobSubmitter: Cleaning up the staging
 area
  /tmp/hadoop-yarn/staging/df/.staging/job_1410742329411_0010
  14/09/15 00:50:12 WARN security.UserGroupInformation:
  PriviledgedActionException as:df (auth:SIMPLE)
  cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
  exist: hdfs://
  server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
  14/09/15 00:50:12 WARN security.UserGroupInformation:
  PriviledgedActionException as:df (auth:SIMPLE)
  cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
  exist: hdfs://
  server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
  14/09/15 00:50:12 ERROR fetcher.Fetcher: Fetcher:
  org.apache.hadoop.mapred.InvalidInputException: Input path does not
 exist:
  hdfs://
  server1.mydomain.com:9000/user/df/crawldirectory/segments/crawl_generate
  at
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
  at
 
 org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
  at
  org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:108)
  at
 
 org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
  at
 
 org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
  at
 
 org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
  at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
  at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
  at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
  at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
  at
  org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
  at 

Re: Running Crawls via REST API

2014-09-16 Thread Lewis John Mcgibbney
Hi Johannes

On Tue, Sep 16, 2014 at 10:19 AM, user-digest-h...@nutch.apache.org wrote:


 is it possible to have nutch as a kind of stand-alone crawl server only
 spoken to via the REST API?


Yes this is possible.
We just finished a Google Summer of Code project which addresses exactly
this via a Wicket-based Web Application. We are working on the final
aspects of the patch before this is attached to the relevant issue
https://issues.apache.org/jira/browse/NUTCH-841


 I found the generic tutorial to setup nutch server with Cassandra and
 found this wiki page https://wiki.apache.org/nutch/NutchRESTAPI but it
 leaves me a bit confused about How I can actually start some full fetch
 cycles.


Yep this is something we need to add to the documentation. We will do this
in due course.


 I probably need to create some fetch job, but what is actually the full
 command with options to send via REST?


https://wiki.apache.org/nutch/NutchRESTAPI#Create_job


 Might anybody maybe point to some working examples, I started digging
 through the java code, but it seems to be only generic key-value setting.



A fully fledged crawl command has been deprecated in Nutch for a while.
Therefore the REST commands you submit to the Nutch 2.X REST API (I suggest
you use Nutch 2.3-SNAPSHOT) need to be chained together sequentially.

I've been testing this out over the summer using RESTClient plugin for
Firefox... it's been working well.
Hope this helps you out.
Lewis
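As a rough illustration of that chaining, the first two steps could be submitted like the sketch below. The endpoint path, job type names, argument keys and port follow the NutchRESTAPI wiki page linked earlier, so treat all of them as assumptions to verify against your 2.3-SNAPSHOT build:

# inject the seed list, then generate; FETCH, PARSE, UPDATEDB, INDEX follow
# the same pattern, each submitted after the previous job has finished
curl -X POST -H "Content-Type: application/json" \
  -d '{"crawlId":"crawl01","type":"INJECT","confId":"default","args":{"seedDir":"/data/seeds"}}' \
  http://localhost:8081/job/create

curl -X POST -H "Content-Type: application/json" \
  -d '{"crawlId":"crawl01","type":"GENERATE","confId":"default","args":{}}' \
  http://localhost:8081/job/create

# poll the job list to see when a step has finished
curl http://localhost:8081/job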


Re: Revisiting Loops Job in Nutch Trunk

2014-09-16 Thread Lewis John Mcgibbney
Hi Markus,

On Tue, Sep 16, 2014 at 10:19 AM, user-digest-h...@nutch.apache.org wrote:


 Hi - So you are not using it for scoring right, but to inspect the graph
 of the web.


Yeah, I think that this is a pretty accurate statement.


 Then there's certainly no need to weed out loops using the loops
 algorithm, neither a need to run the linkrank job


OK doke Markus.
I'm going to revisit the documentation for all of the WebGraph classes in
an attempt to further define, explain and clarify the data structures.
Thanks for your feedback.
Lewis


Re: Plugin loading and NUTCH-609

2014-09-16 Thread Edoardo Causarano

On 15 sep. 2014, at 11:36, Julien Nioche lists.digitalpeb...@gmail.com wrote:

Hi Julien,

see my inline replies

 Hi Edoardo,
 
 See my comments below
 
 On 12 September 2014 11:11, Edoardo Causarano edoardo.causar...@gmail.com
 wrote:
 
 Hi all,
 
 I'm completely lost, can anyone help me out here?
 
 I have this job.jar which contains all Nutch code, dependencies and
 plugins. I don't understand how I keep getting this error:
 
 2014-09-12 11:51:04,458 WARN [main]
 org.apache.nutch.plugin.PluginRepository: Plugins: not a file: url. Can't
 load plugins from:
 jar:file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/oracle/appcache/application_1410512500237_0003/filecache/10/job.jar/job.jar!/lib/plugins
 
 Ok, I have to admit that I'm mucking around with the project structure but
 having found NUTCH-609 it seems that the PluginManifestParser does not
 support loading plugins from the job payload itself. Is this the case? If
 so, can anyone tell me where I need to unpack these plugins so that the
 loader will pick them up?
 
 
 why do you build the job jar yourself instead of using the one that our ant
 script builds? If you look at it the plugins are in /classes/plugins/
 within the jar.

Well, basically because I'm not familiar at all with Ivy and wanted to dive 
into the tool to understand a bit how it works :) But yes, I solved the issue 
by correcting the target folder in the Maven assembly descriptor (so that 
Hadoop unpacks this folder as well).  

 Side note: is anyone interested in overhauling this loading mechanism? The
 XML manifest could be replaced with an annotation class, although I would
 be happy enough if I could include and load it into the jar.
 
 
 I like the idea of replacing the XML manifest with annotations - or maybe
 initially allow both. In an ideal world plugins would be handled as
 dependencies and we could just get the jars for them. I am sure there would
 be a way of making the XML file a part of the artefact but if we don't have
 to and can have a pom and a jar then it would certainly be simpler.
 
 Feel free to open a new JIRA for this and contribute a patch if you can.

I was already looking into that, and had to hack away a bit at the plugin 
manifest parser. Seems to work alright but then explodes at runtime when 
loading the class (an NPE on the classloader if I remember correctly.)  

I mixed several changes so I'll have to clean up and organize my thoughts. ;) I 
was thinking the following would work: plugin jars stay where they are, move 
plugin.xml into jar META-INF/nutch, iterate over plugin paths, parse XML and 
load the classes declared in the XML. Does the plugin also need to export lib 
folders?  


Best,
Edoardo

 Thanks
 
 Julien
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
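For reference, the manifest format under discussion currently looks roughly like this; the sketch is modelled on the stock index-basic plugin, and third-party jars shipped in the plugin's lib/ folder are simply listed as additional library entries:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="index-basic" name="Basic Indexing Filter"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <!-- the plugin's own jar; extra jars from lib/ get their own <library> -->
    <library name="index-basic.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.indexer.basic"
             name="Nutch Basic Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="BasicIndexingFilter"
                    class="org.apache.nutch.indexer.basic.BasicIndexingFilter"/>
  </extension>
</plugin>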



Re: Running Crawls via REST API

2014-09-16 Thread Johannes Goslar
Hi Lewis, yes, that helps a bit. But especially the pre-seeding is confusing 
me at the moment. Can/will the REST API create the needed directories?
The code for the webapp would be interesting to look at, but I actually need to 
use Nutch as an intermediate step in another piece of software, so I would not 
be exposing some other app.
Kind Regards
Johannes