Re: Indexing specific metadata tags with urlmeta

2012-01-16 Thread Lewis John Mcgibbney
If this was done after you indexed your content, then you will need to reindex all of your content to make this field searchable in your Solr index. On Mon, Jan 16, 2012 at 5:31 AM, Vijith vijithkv...@gmail.com wrote: Hi Lewis, Ya it was when I added a field like - field dest=keywords
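
For reference, Lewis's point is that a new field only becomes searchable once documents that actually carry it are written to the index, i.e. the content has to be reindexed after the mapping change. Below is a minimal SolrJ sketch of what (re)indexing a document with such a field amounts to, assuming a local Solr at http://localhost:8983/solr; it is illustrative only, not the Nutch solrindex code.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Illustrative only: a field such as "keywords" is searchable only for
// documents indexed after the field/mapping was added, hence the need to
// reindex existing content.
public class ReindexSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/page");
        doc.addField("keywords", "nutch, crawling, metadata");  // the urlmeta-mapped field
        solr.add(doc);
        solr.commit();
    }
}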

Re: Indexing specific metadata tags with urlmeta

2012-01-16 Thread Vijith
I'm indexing it right away while crawling (using -solr). I am using the 'crawl' command. Should I use individual commands for inject, fetch, etc.? I clear off the crawl data and Solr index before I crawl. Any clue? On Mon, Jan 16, 2012 at 1:48 PM, Lewis John Mcgibbney

Re: Focused crawling with nutch

2012-01-16 Thread Markus Jelsma
You would need a parsing fetcher for this to work. Also the fetch filter may offer some insights. https://issues.apache.org/jira/browse/NUTCH-828 We do similar things with outlinks while fetching. Hi Lewis, Thanks for the reply. What I really want to achieve is to find the occurrence of
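
What Markus describes is essentially scoring a page's parsed text against the topic and only following its outlinks when it matches. The sketch below is a small, self-contained Java illustration of that idea; the class and method names are hypothetical, and it is not the NUTCH-828 fetch-filter code.

import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of focused crawling: follow a page's outlinks
// only if the page's parsed text mentions at least one topic keyword.
public class KeywordOutlinkFilter {

    private final List<String> keywords;

    public KeywordOutlinkFilter(List<String> keywords) {
        this.keywords = keywords;
    }

    /** Returns the outlinks worth following, given the parsed text of the page. */
    public List<String> filter(String parsedText, List<String> outlinks) {
        String text = parsedText.toLowerCase();
        for (String keyword : keywords) {
            if (text.contains(keyword.toLowerCase())) {
                return new ArrayList<String>(outlinks);  // on topic: keep all outlinks
            }
        }
        return new ArrayList<String>();  // off topic: follow nothing from this page
    }
}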

Re: Deletion of duplicates fails with org.apache.lucene.search.BooleanQuery$TooManyClauses

2012-01-16 Thread Markus Jelsma
hi Hi, I started having this problem recently. For some reason, I did not have it before, when working with Nutch 1.4 pre-release code. The stack trace would be: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200) at
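
TooManyClauses is Lucene's guard against a BooleanQuery holding more than maxClauseCount clauses (1024 by default); Solr exposes the same limit as maxBooleanClauses in solrconfig.xml. The following is a minimal Lucene 3.x-era illustration of the limit and how to raise it, not the SolrDeleteDuplicates code itself.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Illustrative only (Lucene 3.x-era API): BooleanQuery throws TooManyClauses
// once it holds more than maxClauseCount clauses, 1024 by default.
public class ClauseLimitDemo {
    public static void main(String[] args) {
        // Raise the limit before building a very large OR query.
        BooleanQuery.setMaxClauseCount(4096);

        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < 2000; i++) {
            // Without the raised limit, the 1025th add() would throw
            // BooleanQuery.TooManyClauses.
            query.add(new TermQuery(new Term("id", "doc" + i)), BooleanClause.Occur.SHOULD);
        }
        System.out.println("clauses: " + query.clauses().size());
    }
}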

Re: Focused crawling with nutch

2012-01-16 Thread Vijith
Thanks Markus. I think that will give me a good starting point. On Mon, Jan 16, 2012 at 2:11 PM, Markus Jelsma markus.jel...@openindex.io wrote: You would need a parsing fetcher for this to work. Also the fetch filter may offer some insights. https://issues.apache.org/jira/browse/NUTCH-828

Couldn't get robots.txt and EMPTY_RULES

2012-01-16 Thread remi tassing
Hello all, one of the sites I'm crawling doesn't have a robots.txt file, so I decided to modify RobotRulesParser.java to give it default rules (EMPTY_RULES). But apparently, Nutch doesn't crawl it properly. Is this the correct way to handle this? Is there a better alternative? Remi
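
The behaviour Remi is after is the common robots.txt convention: a 200 response means parse and obey the rules, while a 404 means there is no robots.txt, so fall back to empty rules and allow everything. Nutch's RobotRulesParser implements this internally; the small, self-contained Java sketch below only demonstrates the fallback decision itself.

import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative only: decide whether a missing robots.txt should be
// treated as "empty rules" (allow all). Not the Nutch RobotRulesParser code.
public class RobotsFallbackCheck {

    /** Returns true if a missing robots.txt should be treated as allow-all. */
    public static boolean allowAllWhenMissing(String siteBase) throws Exception {
        URL robotsUrl = new URL(siteBase + "/robots.txt");
        HttpURLConnection conn = (HttpURLConnection) robotsUrl.openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();
        conn.disconnect();
        return status == HttpURLConnection.HTTP_NOT_FOUND;  // 404 => empty rules
    }

    public static void main(String[] args) throws Exception {
        System.out.println(allowAllWhenMissing("http://example.com"));
    }
}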

invalid uri with three dots

2012-01-16 Thread remi tassing
Hello all, I'm getting an invalid URI error with some links that have three dots in them. They work perfectly well in browsers (IE and Chrome) but, apparently, not with Nutch. Is this a known issue? Any idea how to handle it? Remi

Re: relative url problem with Nutch

2012-01-16 Thread remi tassing
OK, for the time being I'll stand by and wait for a solution. This is way beyond my competence :-( On Thu, Jan 12, 2012 at 11:47 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Remi, WRT fixing Nutch 1.2 I can't comment; we do not support this version any longer and it is no longer

Re: invalid uri with three dots

2012-01-16 Thread remi tassing
It shows up as a java.lang.IllegalArgumentException. On Mon, Jan 16, 2012 at 3:58 PM, remi tassing tassingr...@gmail.com wrote: Hello all, I'm getting an invalid URI error with some links that have three dots in them. They work perfectly well in browsers (IE and Chrome) but,
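
The java.lang.IllegalArgumentException matches what java.net.URI.create throws when a link fails strict URI syntax checks that browsers quietly tolerate. The following is a small, illustrative Java sketch of pre-validating outlinks before fetching; it is not Nutch's actual URL handling.

import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only: filter out link strings that java.net.URI refuses to parse.
// Browsers are far more forgiving than the URI spec, which is why such links
// work in IE/Chrome but can make a strict parser throw.
public class OutlinkValidator {

    /** Returns true if the string is a syntactically valid, absolute URI. */
    public static boolean isValid(String link) {
        try {
            URI uri = new URI(link);  // throws URISyntaxException on bad syntax
            return uri.isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // URI.create wraps the same failure in java.lang.IllegalArgumentException,
        // which matches the exception reported in this thread.
        System.out.println(isValid("http://example.com/page?From=stats"));  // true
        System.out.println(isValid("http://example .com/bad link"));        // false (spaces)
    }
}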

Re: invalid uri with three dots

2012-01-16 Thread Markus Jelsma
Copy the stack trace, please. On Monday 16 January 2012 14:58:46 remi tassing wrote: Hello all, I'm getting an invalid URI error with some links that have three dots in them. They work perfectly well in browsers (IE and Chrome) but, apparently, not with Nutch. Is this a known issue? Any idea

Re: invalid uri with three dots

2012-01-16 Thread remi tassing
Hello, this is a snapshot of the log:
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96

Re: Couldn't get robots.txt and EMPTY_RULES

2012-01-16 Thread remi tassing
Hello, after crawling is completed, I output the crawled URLs with the following command: bin/nutch readdb crawl/crawldb -dump output. Of 170 crawled URLs, only one shows as db_fetched. That's why I think something is wrong. When I asked for the correct way to handle this, I meant what is

Re: Couldn't get robots.txt and EMPTY_RULES

2012-01-16 Thread Markus Jelsma
On Monday 16 January 2012 15:17:21 remi tassing wrote: Hello, after crawling is completed, I output the crawled URLs with the following command: bin/nutch readdb crawl/crawldb -dump output. Of 170 crawled URLs, only one shows as db_fetched. That's why I think something is wrong. The

Re: invalid uri with three dots

2012-01-16 Thread Markus Jelsma
This? https://uri1...From=stats That's not a correct or valid URL if you ask me. On Monday 16 January 2012 15:12:51 remi tassing wrote: Hello, this is a snapshot of the log: -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96 -activeThreads=10, spinWaiting=9,

Re: invalid uri with three dots

2012-01-16 Thread remi tassing
Hello Markus, thanks for the help! Just to clarify a little bit: in my previous message, uri1 represented a normal, ordinary URL; I just didn't want to copy the exact URL. The weird part is that it all works in the browser... On Mon, Jan 16, 2012 at 4:35 PM, Markus Jelsma

RE: Deletion of duplicates fails with org.apache.lucene.search.BooleanQuery$TooManyClauses

2012-01-16 Thread Arkadi.Kosmynin
hi Hi, I started having this problem recently. For some reason, I did not have it before, when working with Nutch 1.4 pre-release code. The stack trace would be: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits