RE: Solr integration in nutch-1.1dev

2010-05-25 Thread Markus Jelsma
,   -Original message- From: Brian Tingle brian.tin...@ucop.edu Sent: Tue 25-05-2010 20:47 To: user@nutch.apache.org; Markus Jelsma markus.jel...@buyways.nl; Subject: RE: Solr integration in nutch-1.1dev Update the solr schema.xml so that it allows multiple values for that field? |-Original
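The schema change suggested above can be sketched as a Solr schema.xml fragment; the field name "somefield" is a placeholder, since the thread does not name the offending field:

```xml
<!-- Hypothetical schema.xml fragment: allow Nutch to write several
     values into the same field without a multiValued error.
     "somefield" stands in for the actual field from the thread. -->
<field name="somefield" type="string" stored="true" indexed="true"
       multiValued="true"/>
```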

Re: Solr integration in nutch-1.1dev

2010-05-26 Thread Markus Jelsma
Confirmed! It was the old schema.xml file. Next time i'd better check for differences :) On Tuesday 25 May 2010 21:38:45 Markus Jelsma wrote: Hi Brian, Again, thanks for the help. I have looked up the schema file from the trunk and 1.0 tag using web svn. It seems you are right

prefixed space in subcollection field

2010-06-15 Thread Markus Jelsma
Hi list,   Fields created by the subcollection plugin end up with a prefixed space in my Solr index but the name and id fields in my subcollection.xml don't have that same space prefixed, i checked it three times just to be certain i didn't mess up the configuration. I am unsure where the

RE: [ANNOUNCE] Apache Nutch 1.1 released

2010-06-21 Thread Markus Jelsma
Hi,   Wonderful, i'll check it out! But, i could only find it through your announcement. It cannot be found on the old lucene release URL - probably because Nutch is a TLP now - but i cannot find it on the Nutch download page either [1], that's a 404!   [1]: http://nutch.apache.org/release/

The parse-tika plug-in in 1.1

2010-06-21 Thread Markus Jelsma
Well, where is it now? The parse-plugins.xml still refers to it, but it's not present in the plugins/ directory.    

Solr ID field still multiValued in 1.1

2010-06-22 Thread Markus Jelsma
Hi,   I sent my first update command to Solr with 1.1 and an earlier problem persists:   SEVERE: org.apache.solr.common.SolrException: ERROR: multiple values encountered for non multiValued copy field id: http://HOST/index.php/2009/December/30/   Well, i didn't load Nutch' shipped

Fetch queue's total size

2010-06-29 Thread Markus Jelsma
Hi,   I'm wondering why this value never exceeds 500? While watching the fetch log, i cannot determine the number of remaining fetches because as long as there are more than 500 due, the threads just wiggle between 490 and 500.   Is there a way to configure this? I haven't found a setting

RE: Host or domain www.abc123.com has more than 100 URLs for all 1 segments - skipping

2010-07-08 Thread Markus Jelsma
Hi,   This is what you're looking for: <property> <name>generate.max.per.host</name> <value>100</value> </property>   Cheers   -Original message- From: brad b...@bcs-mail.net Sent: Thu 08-07-2010 02:24 To: user@nutch.apache.org; Subject: Host or domain www.abc123.com has more

Re: Crawl fails - Input path does not exist

2010-07-27 Thread Markus Jelsma
crawl/crawldb $SEGMENT -filter -normalize Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Querying case-sensitive fields

2010-08-17 Thread Markus Jelsma
, Jeroen Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Removing URLs from index

2010-08-17 Thread Markus Jelsma
is, how do I remove these documents from the index? Regards, Jeroen Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Removing URLs from index

2010-08-17 Thread Markus Jelsma
then optionally overwrite (delete) duplicates. [1]: http://wiki.apache.org/solr/Deduplication Thanks and best regards, Jeroen Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Not getting all documents

2010-08-17 Thread Markus Jelsma
and just taking whatever it gets before moving on. Maybe I should increase my wait times... On Tue, Aug 17, 2010 at 4:56 AM, Markus Jelsma markus.jel...@buyways.nlwrote: Well, the CrawlDB tells us you only got ~9000 URL's in total. Perhaps the seeding didn't go too well? Make sure

RE: Why do nutch has Content Parsing in two places

2010-09-02 Thread Markus Jelsma
In small crawls, you could parse the document right away. For large crawls, however, there may not be enough resources to fetch and parse at the same time.   -Original message- From: Nayanish Hinge nayanish.hi...@gmail.com Sent: Thu 02-09-2010 07:39 To: user@nutch.apache.org; Subject: Why

Re: Nutch crawl failure

2010-09-02 Thread Markus Jelsma
the crawling right from where we left? I mean, starting with only the unfetched urls. Thanks Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Subcollection is not really multi valued

2010-09-06 Thread Markus Jelsma
://issues.apache.org/jira/browse/NUTCH-716 Cheers Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

RE: Re: Subcollection is not really multi valued

2010-09-06 Thread Markus Jelsma
, Thanks. I would prefer that you don't reopen an already resolved issue. Just file a new issue and link it back to this one. Thanks for the heads up! Cheers, Chris On 9/6/10 4:57 AM, Markus Jelsma markus.jel...@buyways.nl wrote: Hi, It seems the NUTCH-716 [1] patch does not really produce a multi

Nutch 1.2 parser fails on application-zip

2010-09-07 Thread Markus Jelsma
the warning. Should i create a new ticket? At least i couldn't find a corresponding issue as of yet. Cheers, Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Nutch 1.2 parser fails on application-zip

2010-09-07 Thread Markus Jelsma
to find out the url which is causing the problem so that we can reproduce the issue. Could be another case of a file trimmed to the max size allowed during the fetching which puts the parser in trouble. We'll see. Best, Julien Markus Jelsma - Technisch Architect - Buyways BV http

Re: Nutch 1.2 parser fails on application-zip

2010-09-07 Thread Markus Jelsma
of a file trimmed to the max size allowed during the fetching which puts the parser in trouble. We'll see. Best, Julien Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Subcollection is not really multi valued

2010-09-07 Thread Markus Jelsma
Issue has been solved in the current branch-1.2. I was using a too old nightly build. Thanks! On Monday 06 September 2010 20:00:52 Mattmann, Chris A (388J) wrote: Thanks! On 9/6/10 9:44 AM, Markus Jelsma markus.jel...@buyways.nl wrote: Done https://issues.apache.org/jira/browse

Re: Nutch 1.2 parser fails on application-zip

2010-09-07 Thread Markus Jelsma
Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
, Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: ERROR tika.TikaParser org.apache.pdfbox.io.PushBackInputStream

2010-09-08 Thread Markus Jelsma
/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050

Re: Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
the tokenization myself and i don't want my index to be polluted with this non-information =) Anyone knows how to configure the index-more plug-in? The wiki isn't very helpful. Cheers, Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620

Re: Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
) improvement? Just adding the option to disable the split? Or also add an option that spits out up to three distinct fields? Cheers M. Cheers, Chris On 9/8/10 2:27 AM, Markus Jelsma markus.jel...@buyways.nl wrote: Hi, I'm testing the index-more plug-in but, to my surprise

Re: Mime type via index-more plugin

2010-09-08 Thread Markus Jelsma
90089 USA ++ Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Input path does not exist revisited

2010-09-09 Thread Markus Jelsma
for unclear reasons. Madness! Can anyone try to explain what's really going on and why so many users suffer from this issue? FYI: I'm still running Nutch locally. A Hadoop cluster isn't set up yet. Cheers, Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050

RE: multiple values encountered for non multiValued field title

2010-09-09 Thread Markus Jelsma
 And now there's also a PDF giving this kind of trouble: http://gemeente.groningen.nl/assets/pdf/adviesrapport-wijkkranten-15012007.pdf   -Original message- From: Markus Jelsma markus.jel...@buyways.nl Sent: Thu 09-09-2010 18:06 To: user@nutch.apache.org; Subject: multiple values

RE: Re: multiple values encountered for non multiValued field title

2010-09-09 Thread Markus Jelsma
more than one value. For example the creative commons field has a lot of values for the same document (by, nc, us, etc...) I have <field name="cc" multiValued="true" type="string" stored="true" indexed="true"/> Hope this helps, André Ricardo On Thu, Sep 9, 2010 at 5:28 PM, Markus Jelsma markus.jel

RE: [Solved] Input path does not exist revisited

2010-09-14 Thread Markus Jelsma
processes' tmp data.   So, don't run multiple jobs on the local machine using the same hadoop.tmp.dir setting.   Cheers,   -Original message- From: Markus Jelsma markus.jel...@buyways.nl Sent: Fri 10-09-2010 15:52 To: user@nutch.apache.org; Subject: RE: Input path does not exist revisited

Re: java.net.UnknownHostException and Timeout during Fetching?

2010-09-20 Thread Markus Jelsma
-- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Mime type via index-more plugin

2010-09-20 Thread Markus Jelsma
Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

RE: Funky duplicate url's

2010-09-22 Thread Markus Jelsma
/office/www.microsoft.com/office/antivirus     This doesn't look good at all. Anyone got a suggestion or some pointer?       -Original message- From: Markus Jelsma markus.jel...@buyways.nl Sent: Wed 22-09-2010 12:12 To: user@nutch.apache.org; Subject: Funky duplicate url's Hi

RE: Re: Funky duplicate url's

2010-09-22 Thread Markus Jelsma
@nutch.apache.org; Subject: Re: Funky duplicate url's the conf/regex-urlfilter.txt file has an exclusion rule that should skip these viral urls. # skip URLs with slash-delimited segment that repeats 3+ times, to break loops -.*(/[^/]+)/[^/]+\1/[^/]+\1/ -aj On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma

Re: Duplicate URLs

2010-09-24 Thread Markus Jelsma
* doing there? It shouldn't. Thanks for all your help Raj -Original Message- From: Markus Jelsma [mailto:markus.jel...@buyways.nl] Sent: Thursday, September 23, 2010 4:52 PM To: user@nutch.apache.org Subject: RE: Duplicate URLs bin/nutch solrdedup Usage: SolrDeleteDuplicates

Re: Nutch 1.2 solrdedup and OutOfMemoryError

2010-09-24 Thread Markus Jelsma
Is there something else I need to do? Some change to the Solr or Tomcat config I have missed. Config: Nutch Release 1.2 - 08/07/2010 CentOS Linux 5.5 Linux 2.6.18-194.3.1.el5 on x86_64 Intel(R) Xeon(R) CPU X3220 @ 2.40GHz 8gb of ram Thanks Brad Markus Jelsma - Technisch

RE: Nutch 1.2 solrdedup and OutOfMemoryError

2010-09-26 Thread Markus Jelsma
not as thorough as the regular dedup process (URL, Content, highest score, shortest URL), but I think it will work. Brad -Original Message- From: Markus Jelsma [mailto:markus.jel...@buyways.nl] Sent: Friday, September 24, 2010 5:27 AM To: user@nutch.apache.org Subject: Re: Nutch 1.2

RE: Duplicate URLs

2010-09-26 Thread Markus Jelsma
by an exact hashing algorithm such as MD5, it won't allow you to use the TextProfileSignature algorithm in Solr for fuzzy matching.   -Original message- From: Nemani, Raj raj.nem...@turner.com Sent: Fri 24-09-2010 23:18 To: user@nutch.apache.org; Markus Jelsma markus.jel...@buyways.nl

CrawlDB, very slow

2010-09-28 Thread Markus Jelsma
to try running Nutch on a Hadoop cluster (which i don't have) or try to let Hadoop take advantage of my multiple cores? Cheers, Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: crawl www

2010-09-28 Thread Markus Jelsma
? It used to be: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ Thanks Dennis Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: CrawlDB, very slow

2010-09-28 Thread Markus Jelsma
very slow) it becomes a rocket... Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: CrawlDB, very slow

2010-09-28 Thread Markus Jelsma
, will it then still make use of multiple cores? Cheers, On Tuesday 28 September 2010 14:20:02 Andrzej Bialecki wrote: On 2010-09-28 14:02, Markus Jelsma wrote: Hi, My test setup (only local) now has just over 20 million URL's, i fetched 3m already and the rest needs to be fetched. It's

Re: Output for plugin.PluginRepository repeats in logs

2010-09-28 Thread Markus Jelsma
I see, complex indeed. I'll manage for now. Thanks for your answer. On Tuesday 28 September 2010 14:18:06 Andrzej Bialecki wrote: On 2010-09-28 13:55, Markus Jelsma wrote: Thanks. Could we modify the code so it will only output the info before the tasks are initialized? If so, how to proceed

Re: crawl www

2010-09-28 Thread Markus Jelsma
understand. How do I update your DB's?, What should I do about crawl-urlfilter.txt? Thanks Dennis --- On Tue, 9/28/10, Markus Jelsma markus.jel...@buyways.nl wrote: From: Markus Jelsma markus.jel...@buyways.nl Subject: Re: crawl www To: user@nutch.apache.org Date: Tuesday, September

Re: crawl www

2010-09-28 Thread Markus Jelsma
: Sorry for interrupting, Markus, But I'm not quite understand. How do I update your DB's?, What should I do about crawl-urlfilter.txt? Thanks Dennis --- On Tue, 9/28/10, Markus Jelsma markus.jel...@buyways.nl wrote: From: Markus Jelsma markus.jel...@buyways.nl Subject: Re: crawl

Re: crawl www

2010-09-28 Thread Markus Jelsma
September 2010 15:51:11 Dennis wrote: Thanks, Markus, Another question, the script will stop, right? I mean, I am not going to crawl for 100 days, I need it finish it's job. Dennis --- On Tue, 9/28/10, Markus Jelsma markus.jel...@buyways.nl wrote: From: Markus Jelsma markus.jel

Re: CrawlDB, very slow

2010-09-28 Thread Markus Jelsma
Thanks for your comments. I'll consult this thread later when i've got the time to test the distributed mode and possibly set up HDFS immediately as i'm going to need it anyway. On Tuesday 28 September 2010 16:26:48 Andrzej Bialecki wrote: On 2010-09-28 14:27, Markus Jelsma wrote: Thanks

Re: Nutch use case : SimilarPages

2010-09-28 Thread Markus Jelsma
://digitalpebble.blogspot.com/2010/09/similarpages-is-out.html and of course http://www.similarpages.com/ itself. Best, Julien Nioche Markus Jelsma - Technisch Architect - Buyways BV http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Funky duplicate url's, getting much worse!

2010-09-28 Thread Markus Jelsma
/overijssel fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/   -Original message- From: Markus Jelsma markus.jel...@buyways.nl Sent: Wed 22-09-2010 20:47 To: user@nutch.apache.org; Subject: RE: Re: Funky duplicate url's Thanks! I've

RE: Re: Funky duplicate url's, getting much worse!

2010-09-28 Thread Markus Jelsma
troubles with honoring the page's BASE tag when resolving relative outlinks. However, I don't see this BASE tag being used in the HTML pages you provide so that might not be it. Mathijs On Sep 28, 2010, at 18:51 , Markus Jelsma wrote: Anyone? Where is a proper solution for this issue

Re: Funky duplicate url's, getting much worse!

2010-09-29 Thread Markus Jelsma
that going through a single huge file J. On 29 September 2010 10:11, Markus Jelsma markus.jel...@buyways.nl wrote: Yes but i need a little more testing. Anyone knows how i can only test that class? I currently use ant -v test -l logfile and need to dig through

RE: Nutch-Eclipse

2010-10-05 Thread Markus Jelsma
It seems you're trying to fetch 0 url's. Inject correct url's or adjust your url filters so as not to filter out your injected url's.   -Original message- From: Yavuz Selim YILMAZ yvzslmyilm...@gmail.com Sent: Tue 05-10-2010 13:16 To: user user@nutch.apache.org; Subject: Nutch-Eclipse I

Can't find org.gora.sql.store.SqlStore

2010-10-07 Thread Markus Jelsma
Hi, I've finally fetched the latest trunk, added Gora as described in NUTCH-873 but i'm getting the following exception Exception in thread main java.lang.ClassNotFoundException: org.gora.sql.store.SqlStore It can't find the class configured in storage.data.store.class. Is it perhaps the

Re: fetcher.store.content and fetcher.parse

2010-10-07 Thread Markus Jelsma
Storing content will take up about as much disk space as the content you are fetching. If you don't store, there is nothing to parse. On Thu, 7 Oct 2010 05:42:00 -0700 (PDT), webdev1977 webdev1...@gmail.com wrote: Could someone please clarify the relationship between these two properties? I

Re: Ip filtering

2010-10-07 Thread Markus Jelsma
I suppose you would create a URL filter. It, as i understand, filters URL's that are about to enter the CrawlDB (during UpdateDB) as well as those read from the CrawlDB (by the generator). The LinkDB just holds a list of anchors for URL's that are in the CrawlDB. Be sure to have a local DNS cache
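A URL filter of the sort suggested above lives in conf/regex-urlfilter.txt; the rule below, skipping URLs whose host is a literal IP address, is a hypothetical example rather than one taken from the thread:

```
# Hypothetical regex-urlfilter.txt rules:
# skip URLs whose host part is a literal IPv4 address
-^https?://[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(:[0-9]+)?/
# accept anything else
+.
```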

Re: fetcher.store.content and fetcher.parse

2010-10-07 Thread Markus Jelsma
On Thu, 7 Oct 2010 09:48:57 -0700 (PDT), webdev1977 webdev1...@gmail.com wrote: So how is it that one is able to crawl huge websites with the crawl script and not use the parse = false? You would have to have enormous amounts of disk space to run the parse later. You can run smaller batches

Re: Can't find org.gora.sql.store.SqlStore

2010-10-11 Thread Markus Jelsma
, Where are you seeing this ClassNotFoundException? When you look at it in an IDE (e.g., Eclipse), or at runtime? Or building using Ant/Ivy? It seems like it built OK, so just trying to figure out how you are running Nutch. Cheers, Chris On 10/11/10 4:24 AM, Markus Jelsma markus.jel

Re: side by side versions of Nutch

2010-10-11 Thread Markus Jelsma
to keep the old project as is for now. Of course for production I will have it on two different servers, as you can not run multiple instances of nutch on the same server/cluster. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: [ANNOUNCE] Welcome Markus Jelsma as a Nutch Committer

2010-10-19 Thread Markus Jelsma
! Cheers! Hi Folks, A while back I nominated Markus Jelsma for Nutch committership and PMC membership. The VOTE tallies in Nutch PMC-ville have occurred and I'm happy to announce that Markus is now a Nutch committer! Markus, feel free to say a little bit about yourself, and, welcome aboard

Re: Crawl the whole blog, but store just the last post

2010-10-21 Thread Markus Jelsma
Fetch and parse the feeds and store the newly discovered URL's in the CrawlDB. Then generate a new fetch list, fetch and parse and index the most recent item. The remaining problem is how to know which is the most recent. Maybe you should create a plugin that will only add the most recent URL

Re: http.agent and unsupported browser

2010-10-21 Thread Markus Jelsma
Well, you could set a fake user agent. As I crawl more websites I'm finding I'm encountering more and more websites that reject the crawl by basically redirecting it to an HTML page that states something along the lines of: HTTP 602 Unsupported Browser The browser you are using
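Setting the agent string as suggested above is done in conf/nutch-site.xml via the http.agent.name property; the browser-like value below is a hypothetical example:

```xml
<!-- Hypothetical nutch-site.xml fragment: present a browser-like
     user agent to sites that reject unknown crawlers.
     The value shown is an illustrative placeholder. -->
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MyCrawler/1.0)</value>
</property>
```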

Re: Any changes to setting up solr with nutch 1.2?

2010-10-27 Thread Markus Jelsma
wrong somewhere so I'm going over the whole set-up again, slowly. Joe -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, October 26, 2010 3:04 PM To: user@nutch.apache.org Subject: RE: Any changes to setting up solr with nutch 1.2? I

Re: Any changes to setting up solr with nutch 1.2?

2010-10-27 Thread Markus Jelsma
and solrindex-mapping.xml/schema.xml file in conf is so that nutch can index the solr database, while the separate solr instance is used to search the result? Thanks, Steve On Tue, Oct 26, 2010 at 3:43 PM, Markus Jelsma markus.jel...@openindex.iowrote: Hi, You'll need a 1.3

Re: http authentication and multicore

2010-10-27 Thread Markus Jelsma
Hi. I'm using nutch 1.2 to crawl a site and after that I want to index to solr using solrIndexer command. The problem is that the solr server needs a Digest authentication, Is there a way to authenticate from nutch? Never tried the authentication part, but i guess

Re: Writing a Book on Nutch

2010-11-02 Thread Markus Jelsma
help me to know the following: 1) What types of things you would want explained in a book / videos on Nutch? 2) What are the biggest problems you face using Nutch? 3) Anything special you would like answered or explained? Thanks in advance for any responses. Dennis -- Markus Jelsma

Boosts on NutchDocuments

2010-11-04 Thread Markus Jelsma
but what about the document itself? I believe not, but need to make sure. Cheers, -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Boosts on NutchDocuments

2010-11-04 Thread Markus Jelsma
sets 0.0f boosts on some documents or something else is going wrong. On Thursday 04 November 2010 14:26:17 Markus Jelsma wrote: Hi, Quick question: does Nutch set document boosts on documents that i send to Solr? I've got some trouble with fieldNorms which are calculated from document/field

Re: nutch - solrindex map reduce error

2010-11-04 Thread Markus Jelsma
What version of Solr are you indexing to and what is the Solr log telling you? Hello I am trying to get nutch to work after upgrading from nutch 1.0 to 1.2: solrindex map is working but as soon as I hit the reduce stage I start getting errors. I fixed a couple of the errors but I don't

Re: NullPointerException on solrdedup job in 1.2

2010-11-04 Thread Markus Jelsma
Found the problem. The boost field was removed but it seems the dedup job needs it. I haven't tested it but since i recently removed the field it makes sense. Why would i need the boost field anyway and why does the dedup job need it? Hi all, For some reason i get an exception on this job.

Re: Stop Nutch

2010-11-04 Thread Markus Jelsma
- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, November 04, 2010 11:41 AM To: user@nutch.apache.org Cc: Eric Martin Subject: Re: Stop Nutch How did you start it? Are you running it on Hadoop at all? I can't find a way to stop Nutch 1.2 via command line. I use

Re: Stop Nutch

2010-11-04 Thread Markus Jelsma
Regards Alexander Aristov On 4 November 2010 22:00, Markus Jelsma markus.jel...@openindex.io wrote: Kill it! I guess it just runs standalone, just like executed jobs from the command line which you can just terminate with CTRL+C. I don't recommend stopping Nutch while executing jobs

Re: nutch - solrindex map reduce error

2010-11-04 Thread Markus Jelsma
Regards Alexander Aristov On 4 November 2010 21:40, Markus Jelsma markus.jel...@openindex.io wrote: What version of Solr are you indexing to and what is the Solr log telling you? Hello I am trying to get nutch to work after upgrading from nutch 1.0 to 1.2: solrindex map is working

Re: How to Index Mail using Nutch?

2010-11-11 Thread Markus Jelsma
Yes! But i wouldn't recommend it if you're using Solr as your search server as it can index e-mail boxes [1] via its data import handler [2]. However, using Nutch is possible too but it depends on your setup whether it's easy or not. Nutch can crawl and index your file system and the mail is

Entities get translated back while parsing?

2010-11-15 Thread Markus Jelsma
elements afterwards is only a temporary work-around in this case. Cheers, -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Fetch error during crawling

2010-11-16 Thread Markus Jelsma
at Nabble.com. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: Fetch error during crawling

2010-11-18 Thread Markus Jelsma
headers value be correct. To do this, I modified the nutch-default.xml file: <name>http.agent.name</name> <value>Mozilla/4.0</value> Is it enough? Thanks 2010/11/16 Markus Jelsma-2 [via Lucene] ml-node+1912155-1367979579-224...@n3.nabble.comml-node%2B1912155-136797957 9-224...@n3

Re: Entities get translated back while parsing?

2010-11-23 Thread Markus Jelsma
Perhaps something for the Tika list? On Monday 15 November 2010 17:57:13 Markus Jelsma wrote: Hi, A quite awful issue just occurred and i traced it back down the line. Apparently the parser seems to translate HTML entities back to their original form, &lt; to < and &gt; to > etc

Re: How to run Solr that comes with the Nutch distribution (1.2)?

2010-11-23 Thread Markus Jelsma
how to run Solr as the search engine along with Nutch. I've downloaded the latest stable release of Nutch (1.2) and I see that Solr is already integrated with it out-of-the-box. Question is: How are we supposed to use Solr within Nutch? Thanks -- Markus Jelsma - CTO - Openindex http

Re: indexing only custom fields to solr

2010-11-23 Thread Markus Jelsma
are appreciated. Thanks Guido By the way: Where does nutch/conf/schema.xml come into play? I assume that it is just a template to replace solr/conf/schema.xml. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350

Re: adding -dir argument to Indexer and SolrIndexer

2010-11-26 Thread Markus Jelsma
As reference to other readers: https://issues.apache.org/jira/browse/NUTCH-939 On Friday 26 November 2010 11:59:26 Claudio Martella wrote: Hello list, I'm porting recrawl script to use hadoop (on an already existing hadoop cluster). I attach my version. What i found out is that Indexer

Re: adding -dir argument to Indexer and SolrIndexer

2010-11-26 Thread Markus Jelsma
2.0 and if possible 1.3, although the latter might not see daylight. Thanks for the patch! On Friday 26 November 2010 16:19:58 Claudio Martella wrote: Markus, with trunk you mean 1.3 or 2.0? The patches should apply to all 1.x. On 11/26/10 3:15 PM, Markus Jelsma wrote: As reference

Re: org.apache.hadoop.util.DiskChecker$DiskErrorException

2010-11-26 Thread Markus Jelsma
invertlinks fail because of the missing of directories in my merged segment is there anything i can do ? mehdi -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Little Help for nutch Newbe...

2010-12-02 Thread Markus Jelsma
help me? Klaus. -- e-Mail : kl...@tachtler.net Homepage: http://www.tachtler.net DokuWiki: http://www.dokuwiki.tachtler.net -- Markus Jelsma - CTO

Re: Crawl Script

2010-12-10 Thread Markus Jelsma
Seems to be a carriage return issue. Remove them first. Hello List Using windows XP Cygwin to execute Nutch-1.2 Currently trying out various crawl scripts and have hit a problem using the one located here http://wiki.apache.org/nutch/Crawl I made some minor adjustments, however
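The carriage-return fix suggested above can be sketched in the shell; "crawl.sh" is a hypothetical name for the script copied from the wiki page, stood in for here by a small CRLF file:

```shell
# crawl.sh stands in for the script pasted from the wiki; we create a
# stand-in with Windows CRLF line endings to demonstrate the problem.
printf 'echo hello\r\n' > crawl.sh

# Strip the carriage returns that break the script under Cygwin/bash.
tr -d '\r' < crawl.sh > crawl-unix.sh
chmod +x crawl-unix.sh
```

Tools like dos2unix do the same thing in place, if available in the Cygwin install.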

Re: How to dump the crawled Html pages?

2010-12-17 Thread Markus Jelsma
Hi, Check out the readseg command. Cheers, Hi I am new to Nutch. I just started to use Nutch to crawl an intranet and extract a certain field from the html pages. The first step I would like to do is to dump all the html pages to a directory. I guess I should add a filter class to do it,

Re: What's the difference between crawl-urlfilter.txt and regex-urlfilter.txt

2010-12-24 Thread Markus Jelsma
One is being used by the crawl command. Their contents are very similar, are they being used by two different plugins? Why are there two files?

Re: anchor text in crawldb/Generator

2010-12-24 Thread Markus Jelsma
Nobin Mathew -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Exception on segment merging

2011-01-04 Thread Markus Jelsma
Use the hadoop.tmp.dir setting in nutch-site.xml to point to a disk where plenty of space is available. Other users have previously reported similar problems which were due to a lack of space on disk as suggested by this *Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException:
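The setting mentioned above can be sketched as a nutch-site.xml fragment; the path shown is a placeholder for whatever volume has enough free space:

```xml
<!-- Hypothetical nutch-site.xml fragment: move Hadoop's local temp
     data to a disk with enough free space. /data/hadoop-tmp is a
     placeholder path. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>
```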

Re: How store only home page of domains but crawl all the pages to detect all different domains

2011-01-13 Thread Markus Jelsma
on Twitested Lib. And I want to learn more :-). I didn't know about different url filters for fetching, updating etc. Where can I change those filters? Thank you, 2011/1/12 Markus Jelsma markus.jel...@openindex.io: Hi, This is rather tricky. You can crawl a lot but index a little if you

Re: Connecting MySQL to Apache Nutch

2011-01-13 Thread Markus Jelsma
the SolrWriter.java class and place my mysql connector there but nothing happens, so can you please explain a little more, with an example of code, exactly which part of the SolrWriter class is going to be replaced by the mysql connector. -Thanks You Very Much On 1/13/11, Markus Jelsma markus.jel

Re: Connecting MySQL to Apache Nutch

2011-01-13 Thread Markus Jelsma
IOException ioe = new IOException(); ioe.initCause(e); return ioe; } } -Thanks you very much On 1/13/11, Markus Jelsma markus.jel...@openindex.io wrote: public void write gets called for each NutchDocument and collects them in inputDocs. You could, after line 60, call a customer

Re: Problems bu upgrading Nutch-1.0 - Nutch-1.2

2011-01-23 Thread Markus Jelsma
It seems this is the root of the problem. Caused by: java.lang.OutOfMemoryError: Java heap space

Re: Can Nutch detect modified and deleted URLs?

2011-01-23 Thread Markus Jelsma
Nutch can detect 404's by recrawling existing URL's. The mutation, however, is not pushed to Solr at the moment. As far as I know, Nutch can only discover new URLs to crawl and send the parsed content to Solr. But what about maintaining the index? Say that you have a daily Nutch script that

Re: Can Nutch detect modified and deleted URLs?

2011-01-24 Thread Markus Jelsma
byte STATUS_DB_GONE = 0x03; http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup Where is that information stored? it could be then easily used to issue deletes on solr. On 1/23/11 10:32 PM, Markus Jelsma wrote: Nutch can detect

Re: resuming the nutch crawl after interruption

2011-01-24 Thread Markus Jelsma
point where it has been interrupted. Is there any way that i can resume the crawl after interruption from the same point. Regards Amna Waqar -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

Re: Few questions from a newbie

2011-01-25 Thread Markus Jelsma
These values come from the CrawlDB and have the following meaning. db_unfetched This is the number of URL's that are to be crawled when the next batch is started. This number is usually limited with the generate.max.per.host setting. So, if there are 5000 unfetched and generate.max.per.host is

Re: Regarding crawling of short URL's

2011-01-25 Thread Markus Jelsma
Reading a URL from the DB returns the HTTP response of that URL, some header information and body. Crawling a URL with a HTTP redirect won't result in the HTTP response of the redirection target for that redirecting URL. Hi, My application needs to crawl a set of urls which I give to the

Re: Can Nutch detect modified and deleted URLs?

2011-01-26 Thread Markus Jelsma
for those entries. Is that what you guys have in mind? Should i file a JIRA? On 1/24/11 10:26 AM, Markus Jelsma wrote: Each item in the CrawlDB carries a status field. Reading the CrawlDB will return this information as well, the same goes for a complete dump with which you could create

Re: Can Nutch detect modified and deleted URLs?

2011-01-26 Thread Markus Jelsma
and it issues a delete to solr for those entries. Is that what you guys have in mind? Should i file a JIRA? On 1/24/11 10:26 AM, Markus Jelsma wrote: Each item in the CrawlDB carries a status field. Reading the CrawlDB will return this information as well, the same goes for a complete dump
