Re: exclude some urls from crawling

2012-04-13 Thread alessio crisantemi
org.apache.nutch.net.URLFilterChecker -allCombined Remi On Tuesday, April 10, 2012, alessio crisantemi wrote: Dear All, I am trying to exclude some URLs of my website from the crawling process, but without success. To exclude them, I added this code to my regex-urlfilter.txt file BEFORE writing the home page
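The point about placing exclusions BEFORE the home page rule can be illustrated with a small sketch. It assumes the usual regex-urlfilter.txt behaviour: rules are read top to bottom, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is dropped. The host www.example.com, the /private/ path, and the helper names are hypothetical, and Python's `re` module stands in for Java regular expressions, so this is only an approximation of what URLFilterChecker would report.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: the URL is dropped

# Hypothetical rule files: same two rules, different order.
EXCLUDE_FIRST = """
-^http://www.example.com/private/
+^http://www.example.com/
"""
INCLUDE_FIRST = """
+^http://www.example.com/
-^http://www.example.com/private/
"""

url = "http://www.example.com/private/page.html"
print(accepts(parse_rules(EXCLUDE_FIRST), url))
print(accepts(parse_rules(INCLUDE_FIRST), url))
```

With the exclusion listed first the URL is rejected; with the site-wide `+` rule listed first the exclusion is never reached, which is one common reason an exclude rule appears to have no effect.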

Re: request about snippets (with attachement)

2012-04-07 Thread alessio crisantemi
and port number plus search query... If you can provide the URL you wish to remove some particular HTML tag from then at least we can see what it is that you are having trouble with. Sorry if I've not made myself clear enough. Lewis 2012/4/6 alessio crisantemi alessio.crisant...@gmail.com

Re: request about snippets (with attachement)

2012-04-07 Thread alessio crisantemi
to speed with plugins on our wiki.[0] Once you have something that requires help get on to the list and let us know. Lewis [0] http://wiki.apache.org/nutch/PluginCentral On Sat, Apr 7, 2012 at 2:33 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: may be it'd my cause with my

Re: request about snippets (with attachement)

2012-04-06 Thread alessio crisantemi
? 2012/4/6 alessio crisantemi alessio.crisant...@gmail.com any suggestions for my case? On 5 April 2012 at 23:20, alessio crisantemi alessio.crisant...@gmail.com wrote: here is a part of the results: [2] Live Score - GiocoNews - Tutto su casinò, poker, giochi online http

Fwd: request about snippets (with attachement)

2012-04-06 Thread alessio crisantemi
or this: http://pc-alessio:8983/*WoWSolrWebApp/search?query=giocosubmit=Search* -- Forwarded message -- From: alessio crisantemi alessio.crisant...@gmail.com Date: 6 April 2012 22:42 Subject: Re: request about snippets (with attachement) To: user@nutch.apache.org that's can

Fwd: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
-- Forwarded message -- From: alessio crisantemi alessio.crisant...@gmail.com Date: 5 April 2012 22:32 Subject: request about snippets To: user@nutch.apache.org Dear all, I configured my Nutch (1.4) to work with Solr (1.4.1), and I crawled and indexed my website successfully. I

Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
wrote: Hi Alessio, You need to determine in which field the unwanted content exists. Once you've done this you could write an indexing filter to remove this from your document prior to indexing. Lewis On Thu, Apr 5, 2012 at 9:41 PM, alessio crisantemi alessio.crisant...@gmail.com wrote
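The idea Lewis describes (find the field that holds the unwanted text, then strip it before the document is indexed) can be sketched in a few lines. This is plain Python for illustration only, not the Nutch indexing filter plugin interface, which is written in Java; the document fields, the sample text, and the function name `strip_unwanted` are all hypothetical.

```python
import re

def strip_unwanted(doc, field, pattern):
    """Return a copy of `doc` with text matching `pattern` removed from `field`."""
    cleaned = dict(doc)  # shallow copy so the original document is left untouched
    if field in cleaned:
        cleaned[field] = re.sub(pattern, "", cleaned[field]).strip()
    return cleaned

# Hypothetical document: a navigation menu leaked into the "content" field.
doc = {
    "title": "Example page",
    "content": "MENU Home | News | Contact Actual article text.",
}

cleaned = strip_unwanted(doc, "content", r"^MENU Home \| News \| Contact\s*")
print(cleaned["content"])
```

The hard part in practice is the first step, working out which field carries the unwanted text, since the pattern only helps once that field is known.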

Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
wrote: I can't see any of your attachments as they're not permitted on list. Can you provide a URL? On Thu, Apr 5, 2012 at 9:56 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: Dear Lewis, thank you for your fast reply. But that's just my problem! I don't understand which

Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
on list. Can you provide a URL? On Thu, Apr 5, 2012 at 9:56 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: Dear Lewis, thank you for your fast reply. But that's just my problem! I don't understand which field creates this row. But I see a date (e.g. Mercoledì Apr 04

Re: crawling a website

2012-04-02 Thread alessio crisantemi
/alpha, http://ww.mywebsite.com/beta, http://ww.mywebsite.com/gamma - ^http://ww.mywebsite.com/.*/$ This will exclude any URL that ends with / I would suggest you get familiar with regular expressions (in case you aren't yet) Remi On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi
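A quick way to see what the quoted pattern catches is to run it against a few URLs. The sketch below uses Python's `re` module as a stand-in for the Java regular expressions Nutch uses, and the test URLs (built from the beta and gamma paths mentioned in the message) are assumptions made for illustration.

```python
import re

# Regex from the quoted rule, without the leading "-" exclusion sign.
pattern = re.compile(r"^http://ww.mywebsite.com/.*/$")

# Hypothetical URLs used only to probe the pattern.
urls = [
    "http://ww.mywebsite.com/beta",
    "http://ww.mywebsite.com/beta/",
    "http://ww.mywebsite.com/gamma/page.html",
    "http://ww.mywebsite.com/",
]

excluded = [u for u in urls if pattern.search(u)]
for u in urls:
    print("excluded" if u in excluded else "kept    ", u)
```

In this sketch only the URL with a trailing slash after a path segment is excluded. The bare site root is kept, because the pattern needs a second `/` after the host's own slash.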

Re: nutch crawling file system SOLVED

2012-03-17 Thread alessio crisantemi
name=urlfile:/C:/Documents and Settings/Alessio/Documenti//str /doc suggestions? Thanks, alessio On 12 March 2012 at 09:39, alessio crisantemi alessio.crisant...@gmail.com wrote: I added the path of my directory to regex-urlfilter, but Nutch also crawls other directories... And more: I follow

Re: nutch crawling file system SOLVED

2012-03-17 Thread alessio crisantemi
I would like the result of my search to be the text of my PDF file, not the list of documents in the directory and the path address. On 17 March 2012 at 21:11, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alessio, On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi

Re: nutch crawling file system SOLVED

2012-03-12 Thread alessio crisantemi
06:06, remi tassing tassingr...@gmail.com wrote: Using crawl-urlfilter (or regex-urlfilter depending on which one you're using), you should be able to solve this. Unless you're not clear on what folders to exclude...? On Sunday, March 11, 2012, alessio crisantemi alessio.crisant

Re: nutch crawling file system SOLVED

2012-03-11 Thread alessio crisantemi
file, I have just this row... And that's not a simple way. There is another method, I suppose? Thank you, alessio On 11 March 2012 at 18:32, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Please see below On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi alessio.crisant

Re: nutch crawling file system SOLVED

2012-03-10 Thread alessio crisantemi
that the guide writes for crawl-urlfilter in regex-urlfilter, everything works. I wanted to share this case. Thank you, alessio On 4 March 2012 at 17:02, alessio crisantemi alessio.crisant...@gmail.com wrote: Hi all, I need to crawl a directory with a lot of PDF files. But I only know the step-by-step

Re: nutch craling file system

2012-03-04 Thread alessio crisantemi
/nutch/FAQ#How_do_I_index_my_local_file_system.3F [2] http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch [3] http://stackoverflow.com/questions/941519/how-to-make-nutch-crawl-file-system On Sun, Mar 4, 2012 at 5:02 PM, alessio crisantemi alessio.crisant...@gmail.com