org.apache.nutch.net.URLFilterChecker -allCombined
Remi
On Tuesday, April 10, 2012, alessio crisantemi wrote:
Dear All,
I am trying to exclude some URLs of my website from the crawling process, but
without success.
To exclude them, I added this rule to my regex-urlfilter.txt file BEFORE the
rule that accepts the home page
and the
port number plus search query...
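A minimal regex-urlfilter.txt along those lines might look like this (the host, port, and paths below are placeholders, not the real site; rules are evaluated top-down and the first match wins, so exclusions must come before the accept rule):

```
# skip the search-result pages (port number plus query string)
-^http://www\.example\.com:8983/search\?
# skip anything under a section we do not want crawled
-^http://www\.example\.com/private/
# accept everything else on the site
+^http://www\.example\.com/
```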
If you can provide the URL you wish to remove some particular HTML tag from
then at least we can see what it is that you are having trouble with. Sorry
if I've not made myself clear enough.
Lewis
2012/4/6 alessio crisantemi alessio.crisant...@gmail.com
to speed with plugins on our wiki.[0]
Once you have something that requires help get on to the list and let us
know.
Lewis
[0] http://wiki.apache.org/nutch/PluginCentral
On Sat, Apr 7, 2012 at 2:33 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
maybe that is the cause of my problem?
2012/4/6 alessio crisantemi alessio.crisant...@gmail.com
any suggestions for my problem?
On 05 April 2012 23:20, alessio crisantemi
alessio.crisant...@gmail.com wrote:
here a part of results:
[2] Live Score - GiocoNews - Tutto su casinò, poker, giochi online
http
or this:
http://pc-alessio:8983/WoWSolrWebApp/search?query=gioco&submit=Search
-- Forwarded message --
From: alessio crisantemi alessio.crisant...@gmail.com
Date: 06 April 2012 22:42
Subject: Re: request about snippets (with attachment)
To: user@nutch.apache.org
that's can
-- Forwarded message --
From: alessio crisantemi alessio.crisant...@gmail.com
Date: 05 April 2012 22:32
Subject: request about snippets
To: user@nutch.apache.org
Dear all,
I configured my Nutch (1.4) to work with Solr (1.4.1), and I crawl and
index my website successfully.
I
wrote:
Hi Alessio,
You need to determine in which field the unwanted content exists. Once
you've done this you could write an indexing filter to remove this from
your document prior to indexing.
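Nutch indexing filters are Java plugins, but the idea Lewis describes can be sketched in a few lines: take the document's fields, strip the unwanted fragment from the offending field, and pass the cleaned document on to indexing. The field name "content" and the Italian-weekday date pattern below are assumptions for illustration, not Alessio's actual data:

```python
import re

# The unwanted row apparently starts with an Italian weekday name followed
# by a month and day, e.g. "Mercoledì Apr 04" (pattern is an assumption).
DATE_ROW = re.compile(
    r"(Lunedì|Martedì|Mercoledì|Giovedì|Venerdì|Sabato|Domenica)\s+\w+\s+\d+"
)

def clean_document(doc: dict) -> dict:
    """Return a copy of the document with the date row stripped from
    the 'content' field, leaving all other fields untouched."""
    cleaned = dict(doc)
    if "content" in cleaned:
        cleaned["content"] = DATE_ROW.sub("", cleaned["content"])
    return cleaned

doc = {"url": "http://www.example.com/",
       "content": "Live scores Mercoledì Apr 04 more text"}
print(clean_document(doc)["content"])
```

In an actual Nutch plugin the same transformation would live in an `IndexingFilter` implementation registered via plugin.xml, so it runs on every document before it reaches Solr.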
Lewis
On Thu, Apr 5, 2012 at 9:41 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
I can't see any of your attachments as they're not permitted on list.
Can you provide an URL?
On Thu, Apr 5, 2012 at 9:56 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
Dear Lewis, thank you for your fast reply.
But that is exactly my problem! I don't understand which field creates
this row.
But I see a date (e.g. Mercoledì Apr 04
http://ww.mywebsite.com/alpha,
http://ww.mywebsite.com/beta,
http://ww.mywebsite.com/gamma
-^http://ww.mywebsite.com/.*/$
This will exclude any URL that ends with /
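Before running the crawler, the rule can be sanity-checked against sample URLs (this is the same kind of check the `URLFilterChecker` tool mentioned above performs against the configured filter chain; the quick sketch below just uses Python's `re` on the one pattern):

```python
import re

# The exclusion pattern from the filter rule: reject URLs ending with "/".
rule = re.compile(r"^http://ww\.mywebsite\.com/.*/$")

for url in [
    "http://ww.mywebsite.com/alpha/",   # trailing slash -> excluded
    "http://ww.mywebsite.com/alpha",    # no trailing slash -> kept
    "http://ww.mywebsite.com/beta/x/",  # trailing slash -> excluded
]:
    verdict = "excluded" if rule.match(url) else "kept"
    print(url, "->", verdict)
```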
I would suggest you get familiar with regular expressions (in case you
aren't already).
Remi
On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi
<str name="url">file:/C:/Documents and Settings/Alessio/Documenti/</str>
</doc>
suggestions?
tx
alessio
On 12 March 2012 09:39, alessio crisantemi
alessio.crisant...@gmail.com wrote:
I added the path of my directory to regex-urlfilter, but Nutch also crawls
other directories...
And more: I followed
I would like the result of my search to be the text of my PDF file, not the
list of documents in the directory and the path address...
On 17 March 2012 21:11, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Alessio,
On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi
06:06, remi tassing tassingr...@gmail.com wrote:
Using crawl-urlfilter (or regex-urlfilter depending on which one you're
using), you should be able to solve this. Unless you're not clear on what
folders to exclude...?
On Sunday, March 11, 2012, alessio crisantemi
alessio.crisant
file, I have just this row... and that's not a simple way.
Is there another method, perhaps?
thank you
alessio
On 11 March 2012 18:32, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Please see below
On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi
alessio.crisant
what the guide gives for crawl-urlfilter
into regex-urlfilter, everything works.
I would like to understand this case.
thank you
alessio
On 04 March 2012 17:02, alessio crisantemi
alessio.crisant...@gmail.com wrote:
Hi all,
I need to crawl a directory with a lot of PDF files.
But I only know the step-by-step
/nutch/FAQ#How_do_I_index_my_local_file_system.3F
[2]
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
[3]
http://stackoverflow.com/questions/941519/how-to-make-nutch-crawl-file-system
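The guides linked above usually boil down to two configuration changes, sketched here from memory (verify against your own Nutch 1.4 conf/ files, since defaults vary between versions):

```
# conf/regex-urlfilter.txt — the default rules skip file: URLs with a line
# like:
#   -^(file|ftp|mailto):
# Change it to keep skipping only ftp/mailto, and accept the directory you
# want crawled (spaces in the path may need escaping):
-^(ftp|mailto):
+^file:/C:/Documents and Settings/Alessio/Documenti/

# conf/nutch-site.xml — override plugin.includes to swap protocol-http for
# protocol-file, e.g. (value shown is the 1.4-era default with that one swap):
#   <property>
#     <name>plugin.includes</name>
#     <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
#   </property>
```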
On Sun, Mar 4, 2012 at 5:02 PM, alessio crisantemi
alessio.crisant...@gmail.com