Re: Indexing specific metadata tags with urlmeta

2012-01-16 Thread Lewis John Mcgibbney
If this was done after you indexed your content, then you will need to reindex all of your content to make this field searchable in your Solr index. On Mon, Jan 16, 2012 at 5:31 AM, Vijith vijithkv...@gmail.com wrote: Hi Lewis, Ya it was when I added a field like - field dest=keywords
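
For reference, Lewis's point is that a new field only becomes searchable once documents that actually carry it are written to the index, i.e. the content has to be reindexed after the mapping change. Below is a minimal SolrJ sketch of what (re)indexing a document with such a field amounts to, assuming a local Solr at http://localhost:8983/solr; it is illustrative only, not the Nutch solrindex code.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Illustrative only: a field such as "keywords" is searchable only for
// documents indexed after the field/mapping was added, hence the need to
// reindex existing content.
public class ReindexSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/page");
        doc.addField("keywords", "nutch, crawling, metadata");  // the urlmeta-mapped field
        solr.add(doc);
        solr.commit();
    }
}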

Re: Indexing specific metadata tags with urlmeta

2012-01-16 Thread Vijith
I'm indexing it right away while crawling (using -solr). I am using the 'crawl' command. Should I use individual commands for inject, fetch, etc.? I clear off the crawl data and Solr index before I crawl. Any clue? On Mon, Jan 16, 2012 at 1:48 PM, Lewis John Mcgibbney

Re: Focused crawling with nutch

2012-01-16 Thread Markus Jelsma
You would need a parsing fetcher for this to work. Also the fetch filter may offer some insights. https://issues.apache.org/jira/browse/NUTCH-828 We do similar things with outlinks while fetching. Hi Lewis, Thanks for the reply. What I really want to achieve is to find the occurrence of
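
What Markus describes is essentially scoring a page's parsed text against the topic and only following its outlinks when it matches. The sketch below is a small, self-contained Java illustration of that idea; the class and method names are hypothetical, and it is not the NUTCH-828 fetch-filter code.

import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of focused crawling: follow a page's outlinks
// only if the page's parsed text mentions at least one topic keyword.
public class KeywordOutlinkFilter {

    private final List<String> keywords;

    public KeywordOutlinkFilter(List<String> keywords) {
        this.keywords = keywords;
    }

    /** Returns the outlinks worth following, given the parsed text of the page. */
    public List<String> filter(String parsedText, List<String> outlinks) {
        String text = parsedText.toLowerCase();
        for (String keyword : keywords) {
            if (text.contains(keyword.toLowerCase())) {
                return new ArrayList<String>(outlinks);  // on topic: keep all outlinks
            }
        }
        return new ArrayList<String>();  // off topic: follow nothing from this page
    }
}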

Re: Deletion of duplicates fails with org.apache.lucene.search.BooleanQuery$TooManyClauses

2012-01-16 Thread Markus Jelsma
hi Hi, I started having this problem recently. For some reason, I did not have it before, when working with Nutch 1.4 pre-release code. The stack trace would be: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200) at
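
TooManyClauses is Lucene's guard against a BooleanQuery holding more than maxClauseCount clauses (1024 by default); Solr exposes the same limit as maxBooleanClauses in solrconfig.xml. The following is a minimal Lucene 3.x-era illustration of the limit and how to raise it, not the SolrDeleteDuplicates code itself.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Illustrative only (Lucene 3.x-era API): BooleanQuery throws TooManyClauses
// once it holds more than maxClauseCount clauses, 1024 by default.
public class ClauseLimitDemo {
    public static void main(String[] args) {
        // Raise the limit before building a very large OR query.
        BooleanQuery.setMaxClauseCount(4096);

        BooleanQuery query = new BooleanQuery();
        for (int i = 0; i < 2000; i++) {
            // Without the raised limit, the 1025th add() would throw
            // BooleanQuery.TooManyClauses.
            query.add(new TermQuery(new Term("id", "doc" + i)), BooleanClause.Occur.SHOULD);
        }
        System.out.println("clauses: " + query.clauses().size());
    }
}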

Re: Focused crawling with nutch

2012-01-16 Thread Vijith
Thanks Markus. I think that will give me a good starting point. On Mon, Jan 16, 2012 at 2:11 PM, Markus Jelsma markus.jel...@openindex.io wrote: You would need a parsing fetcher for this to work. Also the fetch filter may offer some insights. https://issues.apache.org/jira/browse/NUTCH-828

Couldn't get robots.txt and EMPTY_RULES

2012-01-16 Thread remi tassing
Hello all, one of the sites I'm crawling doesn't have a robots.txt file, so I decided to modify RobotRulesParser.java to give it default rules (EMPTY_RULES). But apparently, Nutch doesn't crawl it properly. Is this the correct way to handle this? Is there a better alternative? Remi
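
The behaviour Remi is after is the common robots.txt convention: a 200 response means parse and obey the rules, while a 404 means there is no robots.txt, so fall back to empty rules and allow everything. Nutch's RobotRulesParser implements this internally; the small, self-contained Java sketch below only demonstrates the fallback decision itself.

import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative only: decide whether a missing robots.txt should be
// treated as "empty rules" (allow all). Not the Nutch RobotRulesParser code.
public class RobotsFallbackCheck {

    /** Returns true if a missing robots.txt should be treated as allow-all. */
    public static boolean allowAllWhenMissing(String siteBase) throws Exception {
        URL robotsUrl = new URL(siteBase + "/robots.txt");
        HttpURLConnection conn = (HttpURLConnection) robotsUrl.openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();
        conn.disconnect();
        return status == HttpURLConnection.HTTP_NOT_FOUND;  // 404 => empty rules
    }

    public static void main(String[] args) throws Exception {
        System.out.println(allowAllWhenMissing("http://example.com"));
    }
}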

invalid uri with three dots

2012-01-16 Thread remi tassing
Hello all, I'm getting an invalid URI error with some links that have three dots in them. They work perfectly well in browsers (IE and Chrome) but, apparently, not with Nutch. Is this a known issue? Any idea how to handle it? Remi

Re: relative url problem with Nutch

2012-01-16 Thread remi tassing
OK, for the time being I'll stand by and wait for a solution. This is way beyond my competence :-( On Thu, Jan 12, 2012 at 11:47 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Remi, WRT fixing Nutch 1.2 I can't comment; we do not support this version any longer and it is no longer

Re: invalid uri with three dots

2012-01-16 Thread remi tassing
It shows up as a java.lang.IllegalArgumentException. On Mon, Jan 16, 2012 at 3:58 PM, remi tassing tassingr...@gmail.com wrote: Hello all, I'm getting an invalid URI error with some links that have three dots in them. They work perfectly well in browsers (IE and Chrome) but,
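
The java.lang.IllegalArgumentException matches what java.net.URI.create throws when a link fails strict URI syntax checks that browsers quietly tolerate. The following is a small, illustrative Java sketch of pre-validating outlinks before fetching; it is not Nutch's actual URL handling.

import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only: filter out link strings that java.net.URI refuses to parse.
// Browsers are far more forgiving than the URI spec, which is why such links
// work in IE/Chrome but can make a strict parser throw.
public class OutlinkValidator {

    /** Returns true if the string is a syntactically valid, absolute URI. */
    public static boolean isValid(String link) {
        try {
            URI uri = new URI(link);  // throws URISyntaxException on bad syntax
            return uri.isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // URI.create wraps the same failure in java.lang.IllegalArgumentException,
        // which matches the exception reported in this thread.
        System.out.println(isValid("http://example.com/page?From=stats"));  // true
        System.out.println(isValid("http://example .com/bad link"));        // false (spaces)
    }
}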

Re: invalid uri with three dots

2012-01-16 Thread Markus Jelsma
Copy the stack trace, please. On Monday 16 January 2012 14:58:46 remi tassing wrote: Hello all, I'm getting an invalid URI error with some links that have three dots in them. They work perfectly well in browsers (IE and Chrome) but, apparently, not with Nutch. Is this a known issue? Any idea

Re: invalid uri with three dots

2012-01-16 Thread remi tassing
Hello, this is a snapshot of the log:
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96

Re: Couldn't get robots.txt and EMPTY_RULES

2012-01-16 Thread remi tassing
Hello, after crawling is completed, I output the crawled URLs with the following command: bin/nutch readdb crawl/crawldb -dump output. Of 170 crawled URLs, only one shows as db_fetched. That's why I think something is wrong. When I asked for the correct way to handle this, I meant what is

Re: Couldn't get robots.txt and EMPTY_RULES

2012-01-16 Thread Markus Jelsma
On Monday 16 January 2012 15:17:21 remi tassing wrote: Hello, after crawling is completed, I output the crawled URLs with the following command: bin/nutch readdb crawl/crawldb -dump output. Of 170 crawled URLs, only one shows as db_fetched. That's why I think something is wrong. The

Re: invalid uri with three dots

2012-01-16 Thread Markus Jelsma
This? https://uri1...From=stats That's not a correct or valid URL if you ask me. On Monday 16 January 2012 15:12:51 remi tassing wrote: Hello, this is a snapshot of the log: -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=96 -activeThreads=10, spinWaiting=9,

Re: invalid uri with three dots

2012-01-16 Thread remi tassing
Hello Markus, thanks for the help! Just to clarify a little bit: in my previous message, uri1 represented a normal, ordinary URL; I just didn't want to copy the exact URL. The weird part is that it all works in the browser... On Mon, Jan 16, 2012 at 4:35 PM, Markus Jelsma

RE: Deletion of duplicates fails with org.apache.lucene.search.BooleanQuery$TooManyClauses

2012-01-16 Thread Arkadi.Kosmynin
hi Hi, I started having this problem recently. For some reason, I did not have it before, when working with Nutch 1.4 pre-release code. The stack trace would be: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits