Re: [nutchgora] - proposal to support distributed indexing

2012-02-22 Thread Markus Jelsma
. -sujit On Feb 22, 2012, at 10:24 AM, Markus Jelsma wrote: Hi, We're in the process of testing Solr trunk's cloud features that recently include initial work for distributed indexing. With it, there is no need anymore for doing the partitioning client side because Solr

Re: Exception in thread main java.io.IOException: Job failed!

2012-02-23 Thread Markus Jelsma
Unfetched, unparsed or just a bad corrupt segment. Remove that segment and try again. Many thanks Remi. Finally, after a reboot of the computer (I sent my question just before leaving my desk), Nutch started to crawl (amazing :))) ) But now, during the crawl process, I got this:

Re: Large Shared Drive Crawl

2012-02-28 Thread Markus Jelsma
- User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: How to crowl AJAX populated pages

2012-02-28 Thread Markus Jelsma
www.gettinhahead.co.in -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-crowl-AJAX-populated-pages-tp3783398p3783398.html Sent from the Nutch - User mailing list archive at Nabble.com. -- *Lewis* -- Markus Jelsma - CTO - Openindex

Re: crawldb modifications

2012-02-28 Thread Markus Jelsma
-- View this message in context: http://lucene.472066.n3.nabble.com/crawldb-modifications-tp3781740p3781740.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: crawldb modifications

2012-02-28 Thread Markus Jelsma
In that case i suggest using crawldbscanner tool or the new regex feature for the crawldbreader tool in trunk. On Tuesday 28 February 2012 13:04:47 remi tassing wrote: I think he meant to remove some specific URLs not everything On Tue, Feb 28, 2012 at 1:51 PM, Markus Jelsma markus.jel

Re: [blog post] Accumulo, Nutch, and GORA

2012-02-29 Thread Markus Jelsma
Impressive! On Tue, 28 Feb 2012 20:41:58 -0500, Jason Trost jason.tr...@gmail.com wrote: Blog post for anyone who's interested. I cover a basic howto for getting Nutch to use Apache Gora to store web crawl data in Accumulo. Let me know if you have any questions. Accumulo, Nutch, and GORA

Re: too few db_fetched

2012-02-29 Thread Markus Jelsma
Short answer: continue crawling! When going to crawl a large amount of records i wouldn't encourage you to use the crawl command. It's better to build a small shell script that repeats the crawl cycle over and over. Remember, the depth parameter is nothing more than a crawl cycle executed
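The repeated crawl cycle suggested above can be sketched roughly as follows. This is only an illustration, not a script from the thread: the `bin/nutch` path, the directory names, the cycle count, and the helper names `newest_segment`, `post_generate_commands` and `run_cycles` are all assumptions, and it relies on Nutch 1.x naming segments by timestamp so that the newest one sorts last.

```python
import os
import subprocess
from typing import List

NUTCH = "bin/nutch"  # assumed location of the Nutch 1.x launcher script


def newest_segment(segments_dir: str) -> str:
    """Return the most recent segment; timestamp names sort chronologically."""
    names = sorted(os.listdir(segments_dir))
    if not names:
        raise RuntimeError("no segment found in " + segments_dir)
    return os.path.join(segments_dir, names[-1])


def post_generate_commands(crawldb: str, segment: str) -> List[List[str]]:
    """Commands that follow 'generate' in one crawl cycle."""
    return [
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", crawldb, segment],
    ]


def run_cycles(crawldb: str, segments_dir: str, cycles: int, top_n: int) -> None:
    """Repeat generate/fetch/parse/updatedb; each pass is one 'depth' step."""
    for _ in range(cycles):
        subprocess.run(
            [NUTCH, "generate", crawldb, segments_dir, "-topN", str(top_n)],
            check=True,  # stops the loop if generate fails or has nothing to do
        )
        segment = newest_segment(segments_dir)
        for command in post_generate_commands(crawldb, segment):
            subprocess.run(command, check=True)
```

A call such as `run_cycles("crawl/crawldb", "crawl/segments", cycles=10, top_n=1000)` would then play the role of the crawl command with a depth of 10, while leaving room to add indexing or cleanup steps between cycles.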

Re: Featured link support in Nutch

2012-03-01 Thread Markus Jelsma
but I like what it's offering. I got the basic setup working. I was wondering how would we implement 'Featured link' using Nutch-Solr. I would like to hear your thoughts. Thanks in advance. -Stan -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Markus Jelsma
on one machine. I notice that they are conflicting because they all access /tmp/hadoop-username/mapred How do I change the location of this folder? Do I have to use Hadoop to run multiple crawlers each specific to a site? thanks Jeremy -- Markus Jelsma - CTO - Openindex http://www.linkedin.com

Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Markus Jelsma
to: $NUCHT_DIR/runtime/local/conf/nutch-site.xml Jeremy On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma wrote: you can either: 1. run on hadoop 2. not run multiple concurrent jobs on a local machine 3. set a hadoop.tmp.dir per job 4. merge all crawls to a single crawl
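Option 3 in the list above (a hadoop.tmp.dir per job) could look something like the sketch below. It assumes the Nutch tool being launched accepts Hadoop's generic `-D` options, as the `generate -Dgenerate.restrict.status=...` usage elsewhere in this listing suggests; the base directory, crawler names and helper names are hypothetical.

```python
import posixpath
from typing import List


def tmp_dir_option(base: str, crawler: str) -> str:
    """Build a -D option giving each crawler its own Hadoop temp directory."""
    return "-Dhadoop.tmp.dir=" + posixpath.join(base, crawler)


def generate_command(crawler: str, crawldb: str, segments_dir: str,
                     base: str = "/tmp/hadoop-crawls",  # assumed base directory
                     nutch: str = "bin/nutch") -> List[str]:
    """Generic -D options go before the tool's own arguments."""
    return [nutch, "generate", tmp_dir_option(base, crawler), crawldb, segments_dir]
```

Two crawlers named `site-a` and `site-b` would then write to `/tmp/hadoop-crawls/site-a` and `/tmp/hadoop-crawls/site-b` instead of sharing one mapred directory. Setting the same property in a per-crawler copy of nutch-site.xml is an alternative if command-line options are inconvenient.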

Re: Only fetching initial seedlist

2012-03-01 Thread Markus Jelsma
this message in context: http://lucene.472066.n3.nabble.com/Only-fetching-initial-seedlist-tp3791957p3791957.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: different fetch interval for each depth urls

2012-03-01 Thread Markus Jelsma
wrote: Hello, I need to have different fetch intervals for initial seed urls and urls extracted from them at depth 1. How can this be achieved? I tried -adddays option in generate command but it seems it cannot be used to solve this issue. Thanks in advance. Alex. -- Markus Jelsma - CTO

Re: Featured link support in Nutch

2012-03-01 Thread Markus Jelsma
to get results by using 'starts with' or prefix query. e.g. Return all results where url starts with http://auto.yahoo.com [1] Thanks again! On Thu, Mar 1, 2012 at 3:59 PM, Markus Jelsma wrote: Hi What is a featured link? Maybe Solr's elevator component is what you are looking for? cheers

Re: different fetch interval for each depth urls

2012-03-03 Thread Markus Jelsma
records restricted by status: generate -Dgenerate.restrict.status=status Thanks. Alex. -Original Message- From: Markus Jelsma To: user Cc: nutch-user Sent: Thu, Mar 1, 2012 10:30 pm Subject: Re: different fetch interval for each depth urls Well, you could set a new default fetch

Re: Incompatible format version 2 expected 1 or lower

2012-03-04 Thread Markus Jelsma
.nabble.com/Incompatible-format-version-2-expected-1-or-lower-tp3796473p3796473.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Nutch as crawler for text analysis: setup ? version ?

2012-03-09 Thread Markus Jelsma
this in a larger setup. thanks ! pvremort -- Markus Jelsma - CTO - Openindex

Re: Blacklisted Tasktracker / AlreadyBeingCreatedException

2012-03-16 Thread Markus Jelsma
. -- Markus Jelsma - CTO - Openindex

Re: Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

2012-03-20 Thread Markus Jelsma
to alter these settings to point to the non-default Hadoop? Regards, Dean. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Job failed while creating SolrIndex

2012-03-20 Thread Markus Jelsma
at Nabble.com. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Nutch 1.4 with Hadoop - how does Nutch know where Hadoop is running

2012-03-20 Thread Markus Jelsma
dean.pul...@semantico.com wrote: Thanks for your reply. I understand what you've said, but how does Nutch know where the Hadoop jobtracker is running? Regards, Dean. On 20/03/2012 11:03, Markus Jelsma wrote: This is not a Nutch thing. A Nutch job, any job, is submitted to the Hadoop Jobtracker

Re: urls won't get crawled

2012-03-20 Thread Markus Jelsma
/urls-won-t-get-crawled-tp3650610p3842066.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: urls won't get crawled

2012-03-20 Thread Markus Jelsma
. -- Markus Jelsma - CTO - Openindex

Re: NutchHadoopTutorial Updated

2012-03-20 Thread Markus Jelsma
of the great technologies. We would really appreciate feedback as there will undoubtedly be some errors or data missing. Thanks Lewis [0] http://wiki.apache.org/nutch/NutchHadoopTutorial -- Markus Jelsma - CTO - Openindex

Re: Too much logging

2012-03-22 Thread Markus Jelsma
mapred.JobClient: Map output records=5* === = Regards Andy -- Markus Jelsma - CTO - Openindex

Re: Generator taking time

2012-03-22 Thread Markus Jelsma
, but nothing happened. -- View this message in context: http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848158.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: crawl and update one url already in crawldb

2012-03-22 Thread Markus Jelsma
a database that could potentially be locked at any point in time? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/crawl-and-update-one-url-already-in-crawldb-tp3848358p3848358.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma

Re: crawl and update one url already in crawldb

2012-03-22 Thread Markus Jelsma
scripting and locking horror and it's an I/O consumer. -- View this message in context: http://lucene.472066.n3.nabble.com/crawl-and-update-one-url-already-in-crawldb-tp3848358p3848423.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: crawl and update one url already in crawldb

2012-03-22 Thread Markus Jelsma
.472066.n3.nabble.com/crawl-and-update-one-url-already-in-crawldb-tp3848358p3848665.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: canonical tag support

2012-03-22 Thread Markus Jelsma
This is not supported by Nutch and there's no issue ticket yet. Feel free to open one. On Thu, 22 Mar 2012 14:32:26 -0500, thomas.j.lut...@wellsfargo.com wrote: Ran across a posting for the Nutch roadmap mentioning support for the canonical tag.

Re: Relative urls, interpage href anchors

2012-03-28 Thread Markus Jelsma
http://lucene.472066.n3.nabble.com/Relative-urls-interpage-href-anchors-tp3861215p3861215.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex http

Re: Out-of-the-box Nutch indexing url source to Solr

2012-03-28 Thread Markus Jelsma
be the command to do that? -- View this message in context: http://lucene.472066.n3.nabble.com/Out-of-the-box-Nutch-indexing-url-source-to-Solr-tp3855918p3855918.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: recrawl a single page explicit

2012-04-02 Thread Markus Jelsma
manually set the recrawl interval or the crawl date, or any other explicit way to make nutch invalidate a page? We have got 70k+ pages in the index and a full recrawl would take too long. Thanks Jan -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: request about snippets (with attachement)

2012-04-05 Thread Markus Jelsma
like a result. How can I skip this raw during my crawling? Is it possible to exclude this raw? Thank you in advance, alessio -- *Lewis* -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Class in the code that handles parsing of html files and selection of URLs

2012-04-06 Thread Markus Jelsma
, Anastasia -- View this message in context: http://lucene.472066.n3.nabble.com/Class-in-the-code-that-handles-parsing-of-html-files-and-selection-of-URLs-tp3890250p3890250.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: How to handle failures in nutch?

2012-04-10 Thread Markus Jelsma
hi, On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), nutch.bu...@gmail.com nutch.bu...@gmail.com wrote: Hi There are some scenarios of failure in nutch which I'm not sure how to handle. 1. I run nutch on a huge amount of urls and some kind of OOM exception is thrown, or one of those cannot

Re: How to handle failures in nutch?

2012-04-10 Thread Markus Jelsma
input file. Any other insights on these issues will be appreciated Markus Jelsma-2 wrote hi, On Mon, 9 Apr 2012 22:43:44 -0700 (PDT), nutch.buddy@ nutch.buddy@ wrote: Hi There are some scenarios of failure in nutch which I'm not sure how to handle. 1. I run nutch on a huge amount of urls

WebGraph Outlinks.reduce OOM

2012-04-10 Thread Markus Jelsma
Hi, Recently a reducer got killed because of this. Increasing heap did work but the next job some days later also failed. I looked at the code and i cannot seem to find why it would take more than 400MB of RAM to process outlinks of a single record. We do limit outlinks so the HashSets pages

Re: Limiting Nutch crawl

2012-04-11 Thread Markus Jelsma
this functionality? Best regards, --Anders Rask www.findwise.com -- Markus Jelsma - CTO - Openindex

Re: Limiting Nutch crawl

2012-04-11 Thread Markus Jelsma
in order to recrawl sites then the total number of URLs that are crawled for one site will not be limited by the generate.max.count parameter. Am I right? Best regards, --Anders Rask www.findwise.com Den 11 april 2012 17:14 skrev Markus Jelsma markus.jel...@openindex.io: Check these properties

Re: WebGraph Outlinks.reduce OOM

2012-04-11 Thread Markus Jelsma
somewhere? We have had this URL for a longer time and it happily passed all jobs many times before. On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Recently a reducer got killed because of this. Increasing heap did work but the next job some days later also

Re: Having trouble running nutch on large xlsx files

2012-04-11 Thread Markus Jelsma
Debugging this with a stand-alone Tika would certainly make things easier. There may be an issue in Tika or even in the parser implementation itself. On Wed, 11 Apr 2012 09:37:04 -0700 (PDT), nutch.bu...@gmail.com nutch.bu...@gmail.com wrote: I'm running nutch on large xlsx file (100-150mb),

Re: How to do detailed postmortem analysis (and visualization) of Nutch crawl data

2012-04-15 Thread Markus Jelsma
The CrawlDB is not a suitable data source but the WebGraph's NodeDB is. You could probably write a new MR tool reading the NodeDB and outputting data in a format such a visualization tool understands. I think the only real problem would be the size of the data. On Sun, 15 Apr 2012 12:43:57

Re: How to do detailed postmortem analysis (and visualization) of Nutch crawl data

2012-04-15 Thread Markus Jelsma
, but I don't see a nodedb folder. Thanks in advance. Safdar On Sun, Apr 15, 2012 at 4:17 PM, Markus Jelsma wrote: The CrawlDB is not a suitable data source but the WebGraph's NodeDB is. You could probably write a new MR tool reading the NodeDB and outputting data in a format

Re: Failing to copy activation jar to build/lib

2012-04-15 Thread Markus Jelsma
This error? [javac] warning: [path] bad path element /home/markus/projects/apache/nutch/trunk/build/lib/activation.jar: no such file or directory On Sun, 15 Apr 2012 20:42:42 +0100, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Whilst doing some testing on Nutchgora within

Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-04-16 Thread Markus Jelsma
Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Failing to copy activation jar to build/lib

2012-04-16 Thread Markus Jelsma
. On Sun, Apr 15, 2012 at 10:46 PM, Markus Jelsma markus.jel...@openindex.io wrote: This error? [javac] warning: [path] bad path element /home/markus/projects/apache/nutch/trunk/build/lib/activation.jar: no such file or directory On Sun, 15 Apr 2012 20:42:42 +0100, Lewis John Mcgibbney

Re: WebGraph Outlinks.reduce OOM

2012-04-16 Thread Markus Jelsma
an OutlinkDB can make a mess out of itself? Should we enforce uniqueness in the mean time? On Tue, 10 Apr 2012 21:33:36 +0200, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Recently a reducer got killed because of this. Increasing heap did work but the next job some days later also failed. I

Re: WebGraph Outlinks.reduce OOM

2012-04-16 Thread Markus Jelsma
Will provide a patch tomorrow. https://issues.apache.org/jira/browse/NUTCH-1335 On Mon, 16 Apr 2012 20:19:46 +0200, Markus Jelsma markus.jel...@openindex.io wrote: It seems a single URL has about half a million outlinks connected to it in the OutlinkDB! A pattern of 50 URL's repeats a 100.000
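The idea of enforcing uniqueness on a record's outlinks, raised earlier in this thread, can be illustrated with a small order-preserving de-duplication. This is a generic sketch and not the NUTCH-1335 patch; the function name and the example URLs are made up for illustration.

```python
from typing import Iterable, List


def unique_outlinks(outlinks: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each outlink URL, preserving order."""
    seen = set()
    result = []
    for url in outlinks:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# A small pattern of URLs repeated many times collapses back to the pattern.
pattern = ["http://example.com/page%d" % i for i in range(50)]
repeated = pattern * 1000
deduplicated = unique_outlinks(repeated)
```

Memory use then grows with the number of distinct outlinks rather than with the number of stored repetitions, which is the property a reducer would need to avoid running out of heap on a record like the one described.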

Re: Help getting started

2012-04-22 Thread Markus Jelsma
On Sat, 21 Apr 2012 17:44:49 -0700 (PDT), benmccann benjamin.j.mcc...@gmail.com wrote: Hi, I have a few questions about getting started. Is there a good tutorial anywhere? Questions I have: * How do I restrict the crawling or saving of pages to only those matching certain regexes? With

Re: Help getting started

2012-04-22 Thread Markus Jelsma
the status in the Hadoop web gui. I'm doing a local crawl. Does this mean the Hadoop web gui is unavailable? Is there any way to check status of a local crawl? What's the URL for the hadoop web gui? Thanks! -Ben On Sun, Apr 22, 2012 at 7:33 AM, Markus Jelsma-2 [via Lucene] ml-node

Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Markus Jelsma
of monickr: http://monickr.com [3] 01926 813736 | 07973 156616 _-- _ Links: -- [1] http://[domain]/solr/ [2] http://www.tellura.co.uk/ [3] http://monickr.com/ -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Generator OOM

2012-04-26 Thread Markus Jelsma
Hi, We sometimes see the generator running OOM. This happens because we either have a too high topN value or too many segments to generate. In any case, a very large amount of records is being generated with the same (lowest) score and ends up in a single reducer. We limit the generator by

Re: Changing from Indexing Filter

2012-04-27 Thread Markus Jelsma
of Nutch info on the web... http://wiki.apache.org/nutch/ http://wiki.apache.org/nutch/PluginCentral hth Lewis -- Lewis -- Markus Jelsma - CTO - Openindex

Re: Crawl sites with hashtags in url

2012-05-01 Thread Markus Jelsma
. With kind regard, Roberto Gardenier -- Markus Jelsma - CTO - Openindex

Re: Hadoop not doing anything

2012-05-01 Thread Markus Jelsma
Do you have running task trackers and data nodes? Which Nutch job did you start? Any custom code? Check the logs of the four Hadoop daemons, there may be something there. On Tue, 01 May 2012 16:26:31 +0100, Dean Pullen dean.pul...@semantico.com wrote: Hi all, If this is definitely a

Re: fields foreach document

2012-05-02 Thread Markus Jelsma
FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Generator OOM

2012-05-03 Thread Markus Jelsma
of reducers or, slightly increase the host or domain limit value. On Thu, 26 Apr 2012 21:02:58 +0200, Markus Jelsma markus.jel...@openindex.io wrote: Hi, We sometimes see the generator running OOM. This happens because we either have a too high topN value or too many segments to generate. In any case

Re: Indexing meta tags in Nutch 1.4

2012-05-03 Thread Markus Jelsma
of that command I don't see any keywords or description fields :( just the usual ones (site,title,content,etc). Am I missing something here? Also let me know if you need more details or my nutch-site.xml config file... Regards -- Markus Jelsma - CTO - Openindex http://www.linkedin.com

Re: Indexing meta tags in Nutch 1.4

2012-05-03 Thread Markus Jelsma
to an indexed document. From: Markus Jelsma markus.jel...@openindex.io To: ML mail mlnos...@yahoo.com Cc: Lewis John Mcgibbney lewis.mcgibb...@gmail.com; user@nutch.apache.org Sent: Thursday, May 3, 2012 9:32 AM Subject: Re: Indexing meta tags in Nutch 1.4

Re: Avoid crawling nonsense calendar webpage

2012-05-05 Thread Markus Jelsma
Hi, This is a tough problem indeed. We partially mitigate this problem by using several regular expressions, linkrank scores with domain limiting generator for regular crawls and a second shallow crawl, only following links from the home page. A custom URLFilter as Ferdy explains is a good

Re: link without href

2012-05-07 Thread Markus Jelsma
html snippet as a link? <tr onclick='clickOnLink("http://www.example.com/link");'>...</tr> Thanks, Mohammad -- Markus Jelsma - CTO - Openindex

Re: Is it possible to control the segment size?

2012-05-07 Thread Markus Jelsma
. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Is it possible to control the segment size?

2012-05-08 Thread Markus Jelsma
many segments of ~N records are generated. Markus Jelsma-2 wrote On Mon, 7 May 2012 22:31:43 -0700 (PDT), nutch.buddy@ nutch.buddy@ wrote: In a previous discussion about handling of failures in nutch, it was mentioned that a broken segment cannot be fixed and its urls should be re

Re: HTML documents with TXT extension

2012-05-08 Thread Markus Jelsma
Hi Nutch should parse an HTML file with a .txt extension just as a normal HTML file, at least, here it does. What does your parserchecker say? In any case you must strip potential left-over HTML in your Solr analyzer, if left like this it's a bad XSS vulnerability. Cheers On Tue, 8 May

Re: Lower case URLs - correct regex?

2012-05-08 Thread Markus Jelsma
/?page=2633pid=1043ELEsite=191;1;db_unfetched;Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;null Notice the URL starts with an L? (Thus not matching http/https in another config). Is this some problem with the regex above? Regards, Dean Pullen -- Markus Jelsma

Re: Lower case URLs - correct regex?

2012-05-08 Thread Markus Jelsma
a custom URL Normalizer to get this to work. But why? It doesn't seem alright. On Tue, 08 May 2012 14:46:14 +0200, Markus Jelsma markus.jel...@openindex.io wrote: I'm not sure this is going to work as a lowercase flag is used on the regular expressions. On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen

Re: HTTP ERROR 400

2012-05-09 Thread Markus Jelsma
] mailto:krist...@yahoo-inc.com [22] http://webmail.openindex.io/tel:%2B49%20%280%2989%20231%2097%20207 [23] http://webmail.openindex.io/tel:%2B49%20%280%29%20162%2028899%2002 [24] http://webmail.openindex.io/tel:%28408%29%20349%203300 [25] http://webmail.openindex.io/tel:%28408%29%20349%203301 -- Markus

Re: Focused Crawling with Nutch (IndexingFilter:filter)

2012-05-09 Thread Markus Jelsma
...@gmail.com [1] http://www8.org/w8-papers/5a-search-query/crawling/ [2] http://www.cse.iitb.ac.in/~soumen/focus/ [3] http://nutch.apache.org/apidocs-1.3/org/apache/nutch/indexer/IndexingFilter.html -- Markus Jelsma - CTO - Openindex

Re: De-duplication of Nutch parsed data

2012-05-09 Thread Markus Jelsma
that CrawlDB would not allow duplicate links to get inside it? What link deduplication do you mean? CrawlDB records have a unique key on the URL. Regards | Vikas www.knoldus.com -- Markus Jelsma - CTO - Openindex

Re: Make Nutch to crawl internal urls only

2012-05-09 Thread Markus Jelsma
-- View this message in context: http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local..

2012-05-10 Thread Markus Jelsma
is mentioned. Tried to upgrade to hadoop-core-0.20.203.0.jar but then this is thrown: Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration Can someone, please, shed some light on this? Thanks. Igor -- Markus Jelsma - CTO - Openindex

Re: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local..

2012-05-10 Thread Markus Jelsma
and there is plenty free space. All the best, Igor On Thu, May 10, 2012 at 10:35 AM, Markus Jelsma wrote: Plenty of disk space does not mean you have enough room in your hadoop.tmp.dir which is /tmp by default. On Thu, 10 May 2012 10:26:00 +0200, Igor Salma wrote: Hi, Adriana, Sebastian, We

Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Markus Jelsma
Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Fwd: Re: Make Nutch to crawl internal urls only

2012-05-10 Thread Markus Jelsma
-tp3974397p3976568.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: De-duplication of Nutch parsed data

2012-05-10 Thread Markus Jelsma
hi On Thursday 10 May 2012 15:19:09 Vikas Hazrati wrote: Hi Markus, Thanks for your response. My responses inline On Thu, May 10, 2012 at 12:34 AM, Markus Jelsma markus.jel...@openindex.io wrote: hi On Thu, 10 May 2012 00:26:40 +0530, Vikas Hazrati vi...@knoldus.com wrote

Re: HTTP error 400

2012-05-10 Thread Markus Jelsma
should upgrade accordingly in trunk. Thanks Lewis On Thu, May 10, 2012 at 1:56 PM, Michael Erickson erickson.mich...@gmail.com wrote: On May 10, 2012, at 1:42 AM, Markus Jelsma wrote: Hi, On Thu, 10 May 2012 09:10:04 +0300, Tolga to...@ozses.net wrote: Hi

Re: Crawl-tool for iterative crawling?

2012-05-10 Thread Markus Jelsma
to work? Thanks Matthias -- Markus Jelsma - CTO - Openindex

Re: HTTP error 400

2012-05-10 Thread Markus Jelsma
, it works similar and uses the same signature algorithm as Nutch has. Please consult the Solr wiki page on deduplication. Good luck On Thu, 10 May 2012 22:54:37 +0300, Tolga to...@ozses.net wrote: Hi Markus, On 05/10/2012 09:42 AM, Markus Jelsma wrote: Hi, On Thu, 10 May 2012 09:10:04 +0300

Re: HTTP error 400

2012-05-11 Thread Markus Jelsma
, Markus Jelsma wrote: thanks This is a known issue: https://issues.apache.org/jira/browse/NUTCH-1100 I have not been able to find the bug nor do i know how to reproduce it from scratch. If you have a public site with which we can reproduce it please comment on the Jira ticket. Make sure you use

Re: Separate logger for nutch

2012-05-11 Thread Markus Jelsma
mode. Also I want some urls filtered by my urlfilter to be stored in an external flat file. How can I achieve this. -- *Thanks Regards* * * *Vijith V* -- Markus Jelsma

Re: Heap space problem when running nutch on cluster

2012-05-13 Thread Markus Jelsma
in fact it uses much less memory than it can. Any idea? -- View this message in context: http://lucene.472066.n3.nabble.com/Heap-space-problem-when-running-nutch-on-cluster-tp3983561.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-14 Thread Markus Jelsma
://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: webpage download

2012-05-15 Thread Markus Jelsma
yes On Tuesday 15 May 2012 12:45:28 Taeseong Kim wrote: is whole web content download possible? include Flash, Image, CSS, JavaScript

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-15 Thread Markus Jelsma
you for your help. -- View this message in context: http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599p3983627.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Markus Jelsma - CTO - Openindex

Re: HTTP error 400

2012-05-15 Thread Markus Jelsma
/11/12 9:40 AM, Markus Jelsma wrote: Ah, that means don't use the crawl command and do a little shell scripting to execute the separate crawl cycle commands, see the nutch wiki for examples. And don't do solrdedup. Search the Solr wiki for deduplication. cheers On Fri, 11 May 2012 07

Re: Crawl-tool for iterative crawling?

2012-05-15 Thread Markus Jelsma
? Matthias On Thu, May 10, 2012 at 8:39 PM, Markus Jelsma markus.jel...@openindex.io wrote: By default each crawl is iterative. The crawl command is nothing more than a wrapper around the individual crawl cycle commands. The depth parameter is nothing

RE: Exclude certain mime-types

2012-05-18 Thread Markus Jelsma
-Original message- From: Matthias Paul magethle.nu...@gmail.com Sent: Fri 18-May-2012 14:57 To: user@nutch.apache.org Subject: Exclude certain mime-types How can I exclude certain mime-types from crawling, for example Word-documents? If I have parse-tika in plugin.includes it

RE: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Markus Jelsma
] Apache Nutch 1.5 release rc #1 When will Nutch 1.5 be released? Matthias On Wed, Apr 18, 2012 at 1:46 PM, Bharat Goyal bharat.go...@shiksha.com wrote: +1 On Monday 16 April 2012 12:34 PM, Markus Jelsma wrote: +1 On Mon, 16 Apr 2012 05:43:22 +, Mattmann, Chris

RE: Setting the Fetch time with a CustomFetchSchedule

2012-05-21 Thread Markus Jelsma
Yes, you can pass ParseMeta keys to the FetchSchedule as part of the CrawlDatum's meta data as i did with: https://issues.apache.org/jira/browse/NUTCH-1024 -Original message- From:Vikas Hazrati vi...@knoldus.com Sent: Mon 21-May-2012 13:44 To: user@nutch.apache.org Subject:

RE: error parsing some xml

2012-05-21 Thread Markus Jelsma
Hi Which version do you use? It should list the troubling URL. What's the stack trace? Cheers -Original message- From:Ing. Eyeris Rodriguez Rueda eru...@uci.cu Sent: Mon 21-May-2012 17:07 To: user@nutch.apache.org Subject: error parsing some xml Hi all. When I try to crawl

RE: error parsing some xml

2012-05-21 Thread Markus Jelsma
) - Mensaje original - De: Markus Jelsma markus.jel...@openindex.io Para: user@nutch.apache.org Enviados: Lunes, 21 de Mayo 2012 11:41:40 Asunto: RE: error parsing some xml Hi Which version do you use? It should list the troubling URL

RE: PDF not crawled/indexed

2012-05-22 Thread Markus Jelsma
Please read the description. -Original message- From:Tolga to...@ozses.net Sent: Tue 22-May-2012 11:37 To: user@nutch.apache.org Subject: Re: PDF not crawled/indexed What is that value's unit? kilobytes? My PDF file is 4.7mb. On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:

RE: URL filtering and normalization

2012-05-22 Thread Markus Jelsma
-Original message- From:Bai Shen baishen.li...@gmail.com Sent: Tue 22-May-2012 19:40 To: user@nutch.apache.org Subject: URL filtering and normalization Somehow my crawler started fetching youtube. I'm not really sure why as I have db.ignore.external.links set to true. Weird!

RE: Apache Nutch release 1.5 RC2

2012-05-22 Thread Markus Jelsma
Great! My +1 for a new release based on the state of the codebase. -Original message- From:Julien Nioche lists.digitalpeb...@gmail.com Sent: Tue 22-May-2012 22:19 To: d...@nutch.apache.org Cc: user@nutch.apache.org Subject: Re: Apache Nutch release 1.5 RC2 Read

RE: Multiple nutch jobs on a Hadoop cluster simultaneosuly

2012-05-24 Thread Markus Jelsma
Hi, Yes, this is no problem. Cheers -Original message- From:Dustine Rene Bernasor dust...@thecyberguardian.com Sent: Thu 24-May-2012 12:58 To: user@nutch.apache.org Subject: Multiple nutch jobs on a Hadoop cluster simultaneosuly Hello I was wondering, would it be possible to

RE: No links to process, is the webgraph empty?

2012-05-29 Thread Markus Jelsma
Hi, That's a patch for the fetcher. The error you are seeing is quite simple actually. Because you set those two link.ignore parameters to true, no links between the same domain and host are aggregated, only links from/to external hosts and domains. This is a good setting for wide web crawls.

RE: No links to process, is the webgraph empty?

2012-05-29 Thread Markus Jelsma
and link.ignore.limit.domain to false and the link.ignore.internal.xxx can be set to true? Or should I just set all of the link.ignore.xxx.xxx values to false? On 5/29/2012 4:43 PM, Markus Jelsma wrote: Hi, That's a patch for the fetcher. The error you are seeing is quite simple actually. Because you set

RE: Setting the Fetch time with a CustomFetchSchedule

2012-05-29 Thread Markus Jelsma
<value>com.custom.CustomEventFetchScheduler</value> </property> How do I include my custom logic so that it gets picked as a part of the crawl cycle. Regards | Vikas On Mon, May 21, 2012 at 6:14 PM, Markus Jelsma markus.jel...@openindex.io wrote: Yes, you can pass ParseMeta keys to the FetchSchedule as part

RE: How to configure nutch to fetch only recent documents

2012-06-04 Thread Markus Jelsma
Hi, The generator can only do it the other way around via the addDays parameter. To make it work your way you can modify the generator to restrict to documents younger than 48 hours. Cheers -Original message- From:Shameema Umer shem...@gmail.com Sent: Mon 04-Jun-2012 08:33 To:
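The "younger than 48 hours" restriction mentioned above boils down to a simple age test on each record. The sketch below shows only that test in isolation; it is not generator code, and which timestamp a modified generator would compare (fetch time, modified time, or something else) is left open here as an assumption. The names `is_recent` and `select_recent` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def is_recent(doc_time: datetime, now: datetime,
              max_age: timedelta = timedelta(hours=48)) -> bool:
    """True when the document timestamp is no older than max_age."""
    return now - doc_time <= max_age


def select_recent(doc_times: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the URLs whose timestamp passes the age test, sorted by URL."""
    return sorted(url for url, t in doc_times.items() if is_recent(t, now))
```

Passing `now` explicitly keeps the check deterministic and easy to test; a real job would fix the reference time once at job start so every record is judged against the same instant.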
