Re: exclude some urls from crawling

2012-04-13 Thread alessio crisantemi
Thank you, Remi, for your valuable help, but I have another problem now:
I can crawl and index my website, but when I search, I get results ONLY
if the query terms appear in the title of a document, not when they
appear only in the document body.
Why do you think this happens? Did the crawl fail?
Thanks again,
alessio
On 13 April 2012 15:46, remi tassing tassingr...@gmail.com wrote:

 To exclude index.php and index.html just use:
 -index\.html
 -index\.php

 You can do the same for video and live-score.

 To ultimately make sure if a URL is blocked or not, try:
 echo URL | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

 Remi
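
To illustrate why the rules from the original question let these pages through, here is a minimal Python sketch. It assumes the usual description of regex-urlfilter.txt evaluation: rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL matching no rule is dropped. It is a simulation in Python, not Nutch's own code. The `check` helper, the sample URLs, and the final `+^http://www.mywebsite.it/` accept rule are assumptions made for the example.

```python
import re

def check(rules, url):
    """Return True if the URL is accepted, False if it is rejected.

    Assumed semantics: the first rule whose regex is found in the URL wins;
    a URL that matches no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Rules as posted in the question, each ending in a trailing "/".
original = [
    r"-[?*!@=]",
    r"-^http://www.mywebsite.it/video/",
    r"-^http://www.mywebsite.it/live-score.html/",
    r"-^http://www.mywebsite.it/([a-z0-9\-A-Z]*\.)*php/",
    r"-^http://www.mywebsite.it/([a-z0-9\-A-Z]*\.)*index.html/",
    r"+^http://www.mywebsite.it/",  # assumed accept rule for the site
]

# Rules along the lines suggested in the reply (no trailing "/").
suggested = [
    r"-[?*!@=]",
    r"-^http://www\.mywebsite\.it/video",
    r"-live-score\.html",
    r"-index\.html",
    r"-index\.php",
    r"+^http://www\.mywebsite\.it/",  # assumed accept rule for the site
]

# Hypothetical sample URLs.
urls = [
    "http://www.mywebsite.it/live-score.html",
    "http://www.mywebsite.it/generale/index.php",
    "http://www.mywebsite.it/video.html",
    "http://www.mywebsite.it/generale/article-1.html",
]

for url in urls:
    print(url, "original:", check(original, url), "suggested:", check(suggested, url))
```

In this simulation the original rules accept all four URLs, because none of the real URLs ends with the extra "/" the patterns require, while the suggested rules reject everything except the article page.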

 On Tuesday, April 10, 2012, alessio crisantemi wrote:

  Dear All,
  I am trying to exclude some URLs of my website from the crawl, but
  without success.
 
  To exclude them, I added these rules to my regex-urlfilter.txt file,
  BEFORE the line that accepts the home page to crawl:
 
  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]
  -^http://www.mywebsite.it/video/
  -^http://www.mywebsite.it/live-score.html/
  -^http://www.mywebsite.it/([a-z0-9\-A-Z]*\.)*php/
  -^http://www.mywebsite.it/([a-z0-9\-A-Z]*\.)*index.html/
  # 
 
 
  
 
  I added these rules because I want to exclude the sub-URLs 'video' and
  'live-score', and every page whose link ends in '/index.php' or
  'index.html' (every section has one as its main link).
 
  But these rules don't work, and these subdirectories still show up in
  my results list.
  What can I do?
  Any suggestions?
  Thank you in advance,
  alessio
 



Re: request about snippets (with attachement)

2012-04-07 Thread alessio crisantemi
No, Lewis,
I'm sorry for the misunderstanding!


But I don't know which link it is, because this line is a fixed line in my
website template.
And even when I look at the HTML source of my home page, I can't see this
line.

So I can only see it in the XML results from Solr.
This is a snippet from my results:

-leaf label= id=VF162 webpage title=Nuove regole sulle slot
machine: la Grecia invia proposta alla Commissione Ue - GiocoNews - Tutto
su rank=30 url=
http://www.gioconews.it/generale/nuove-regole-sulle-slot-machine-la-grecia-invia-proposta-alla-commissione-ue-23813.html;
Nuove regole sulle slot machine: la Grecia invia proposta alla Commissione
Ue - GiocoNews - Tutto su casinò, poker, giochi online Mercoledì Apr 04
parent Home NEWSLOT/VLT SCOMMESSE ONLINE ... LOTTERIE Politica Video Live
Score Home Esteri Generale Nuove regole sulle slot machine: la Grecia invia
proposta alla Commissione Ue HOT NEWS Turchi (Aams): “Scommesse, è far west
in Italia: m... » Non ... ... Cronache Esteri Ippica Videogiochi Bingo
Normativa Gioco e Fisco Personaggi Flipper Sfoglia Rivista Nuove regole
sulle slot machine: la Grecia invia proposta alla Commissione Ue Scritto da
Sm Mercoledì 04 ... : #FF9900;
}//--slot-machine-la-grecia-invia-proposta-alla-commissione-ue-23813.html
target=_blankNuove regole ... sulle slot machine: la Grecia invia
proposta alla Commissione UeMercoledì 04 Aprile 2012© 2012 - a href
/webpage /leaf




This is the line that I don't want in my results: GiocoNews - Tutto su
casinò, poker, giochi online Mercoledì Apr 04 parent Home NEWSLOT/VLT
SCOMMESSE ONLINE ... LOTTERIE

Thanks,
alessio


On 7 April 2012 12:09, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

 No I mean the URL that you are having trouble with not your solr server and
 port number plus search query...

 If you can provide the URL you wish to remove some particular HTML tag from
 then at least we can see what it is that you are having trouble with. Sorry
 if I've not made myself clear enough.

 Lewis

Re: request about snippets (with attachement)

2012-04-07 Thread alessio crisantemi
Thank you again, Lewis,
but do you think the strange content field is the cause of my problem?
I ask because I disabled indexing for almost all fields.

This is my schema:

<fields>
  <field name="id" type="string" stored="true" indexed="true"/>
  <!-- core fields -->
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>
  <!-- fields for index-basic plugin -->
  <field name="host" type="url" stored="false" indexed="false"/>
  <field name="site" type="string" stored="true" indexed="false"/>
  <field name="url" type="url" stored="true" indexed="false" required="true"/>
  <field name="content" type="text" stored="true" indexed="true"/>
  <field name="title" type="text" stored="true" indexed="false"/>
  <field name="cache" type="string" stored="true" indexed="false"/>
  <field name="tstamp" type="date" stored="true" indexed="false"/>
  <!-- fields for index-anchor plugin -->
  <field name="anchor" type="string" stored="true" indexed="false" multiValued="true"/>
  <!-- fields for index-more plugin -->
  <field name="type" type="string" stored="true" indexed="false" multiValued="true"/>
  <field name="contentLength" type="long" stored="true" indexed="false"/>
  <field name="lastModified" type="date" stored="false" indexed="false"/>
  <field name="date" type="date" stored="true" indexed="false"/>
  <!-- fields for languageidentifier plugin -->
  <field name="lang" type="string" stored="true" indexed="false"/>
  <!-- fields for subcollection plugin -->
  <field name="subcollection" type="string" stored="true" indexed="false" multiValued="true"/>
  <!-- fields for feed plugin (tag is also used by microformats-reltag) -->
  <field name="author" type="string" stored="true" indexed="true"/>
  <field name="tag" type="string" stored="true" indexed="true" multiValued="false"/>
  <field name="feed" type="string" stored="true" indexed="false"/>
  <field name="publishedDate" type="date" stored="true" indexed="false"/>
  <field name="updatedDate" type="date" stored="true" indexed="false"/>
  <!-- fields for creativecommons plugin -->
  <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
</fields>

what do you think?

alessio
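
As a quick way to see which fields a schema like this makes searchable, the following Python sketch lists the fields marked indexed="true". It assumes the definitions above are ordinary Solr `<field .../>` elements; the fragment embedded here copies only five of them, and the `indexed_fields` helper is a name invented for the example.

```python
import xml.etree.ElementTree as ET

# Five field definitions copied from the schema in this message
# (markup written out as standard Solr <field/> elements, an assumption).
SCHEMA_FRAGMENT = """
<fields>
  <field name="id" type="string" stored="true" indexed="true"/>
  <field name="url" type="url" stored="true" indexed="false" required="true"/>
  <field name="content" type="text" stored="true" indexed="true"/>
  <field name="title" type="text" stored="true" indexed="false"/>
  <field name="author" type="string" stored="true" indexed="true"/>
</fields>
"""

def indexed_fields(schema_xml):
    """Return the names of all fields declared with indexed="true"."""
    root = ET.fromstring(schema_xml.strip())
    return [f.get("name") for f in root.iter("field") if f.get("indexed") == "true"]

print(indexed_fields(SCHEMA_FRAGMENT))
```

For this fragment the sketch reports id, content and author. The title field is stored but not indexed, so it would be returned in results without being searchable on its own.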


On 7 April 2012 21:57, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

 From the limited HTML that I've seen I can only assume that the offending
 xhtml is in the content field.

 If this is the case then you will need to write a custom plugin
 implementation that removes this. There is loads of info allowing you to
 get up to speed with plugins on our wiki.[0]

 Once you have something that requires help get on to the list and let us
 know.

 Lewis

 [0] http://wiki.apache.org/nutch/PluginCentral
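
The kind of clean-up such a plugin would perform can be sketched in a few lines. Nutch plugins are written in Java, so the Python below only illustrates the text transformation, not the plugin API. The `NAV_LINE` pattern and the `strip_navigation` helper are assumptions built from the navigation line quoted in this thread (a date, the word "parent", then the menu labels up to "Live Score").

```python
import re

# Assumed shape of the unwanted line: a date such as "Mercoledì Apr 04",
# optionally the word "parent", then the site menu from "Home NEWSLOT/VLT"
# through "Live Score".
NAV_LINE = re.compile(
    r"\w+ \w{3} \d{2} (?:parent )?Home NEWSLOT/VLT .*?Live Score\s*"
)

def strip_navigation(content):
    """Remove every occurrence of the navigation line from the page text."""
    return NAV_LINE.sub("", content)

# Sample text assembled from the result snippets quoted in this thread.
sample = (
    "Tutto su casinò, poker, giochi online Mercoledì Apr 04 parent "
    "Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score "
    "Home Esteri Generale"
)

print(strip_navigation(sample))
```

A pattern this specific would need adjusting if the menu labels change, so it is a starting point for experiments rather than a finished filter.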

 On Sat, Apr 7, 2012 at 2:33 PM, alessio crisantemi 
 alessio.crisant...@gmail.com wrote:

  Maybe the cause is my schema?
  I chose to index only title, author and content.
 
  Can you help me set up a parse filter?
  Thank you,
  alessio
 
 



Re: request about snippets (with attachement)

2012-04-06 Thread alessio crisantemi
Would this be OK?
http://192.168.1.5:8983/WoWSolrWebApp/search?query=gioco&submit=Search
On 6 April 2012 22:29, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

 It would be easier if you could provide an URL and people can see exactly
 what you are struggling with please?


 2012/4/6 alessio crisantemi alessio.crisant...@gmail.com

  Any suggestions for my case?
 
  On 5 April 2012 23:20, alessio crisantemi
  alessio.crisant...@gmail.com wrote:
 
   Here is part of the results:
  
[2] Live Score - GiocoNews - Tutto su casinò, poker, giochi online
  http://www.gioconews.it/live-score.html  Live
   Score - *Gioco*News - Tutto su casinò, poker, giochi online Mercoledì
 Apr
   04 Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score
   Home Live Score Questa opzione non funziona ... correttamente.
   Sfortunatamente, il tuo browser non supporta gli Inline Frame
 Visualizza
  *
   Gioco*News sul tuo Smart Phone Detect Version | Versione Mobile |
   Versione Standard Ripristina configurazione standard ... © Copyright
 2012
   *Gioco*News.it powered by GNMedia s.r.l. P.iva 01419700552, Tutti i
   diritti riservati  http://www.gioconews.it/live-score.html[3]
 Curcio
   (Sapar): Sviluppo consapevole del gioco da parte di tutti gli
  operatori -
   GiocoNe
 
 http://www.gioconews.it/generale/curcio-sapar-sviluppo-consapevole-del-gioco-da-parte-di-tutti-gli-operatori-23848.html
 
   Curcio
   (Sapar): Sviluppo consapevole del *gioco* da parte di tutti gli
   operatori - *Gioco*News - Tutto su casinò, poker, giochi online
   Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ... ONLINE
 LOTTERIE
   Politica Video Live Score Home NEWSLOT/VLT Generale Curcio (Sapar):
   Sviluppo consapevole del *gioco* da parte di tutti gli operatori HOT
   NEWS Turchi (Aams): ?Scommesse, è far west in Italia ... , ...
 Serpelloni
   (Dip. Antidroga): ?Sul *gioco* necessarie... » Servono ?linee di
   indirizzo comuni a livello nazionale per riuscire a monitorare il
  fenom...
   Curcio (Sapar): Sviluppo consapevole del *gioco* da... » ?Da ... ,
   ottenuto nei... Cronache Esteri Ippica Videogiochi Bingo Normativa
  *Gioco*e Fisco Personaggi Flipper Sfoglia Rivista Curcio (Sapar):
 Sviluppo
   consapevole del *gioco* da parte di tutti gli operatori Scritto da ...
   Sm Mercoledì 04 Aprile 2012 16:45 ?Da parte della commissione c?è
  l?intento
   di approfondire i numeri in possesso e i dati del settore del *gioco*.
 Da
   parte nostra abbiamo cercato di chiarire le cifre e
  
 
 http://www.gioconews.it/generale/curcio-sapar-sviluppo-consapevole-del-gioco-da-parte-di-tutti-gli-operatori-23848.html
   [4] Serpelloni (Dip. Antidroga): ?Sul gioco necessarie linee di
 indirizzo
   per la cura delle patologie? -
 
 http://www.gioconews.it/generale/serpelloni-dip.-antidroga-sul-gioco-necessarie-linee-di-indirizzo-per-la-cura-delle-patologie-23847.html
 
   Serpelloni
   (Dip. Antidroga): ?Sul *gioco* necessarie linee di indirizzo per la
 cura
   delle patologie? - *Gioco*News - Tutto su casinò, poker, giochi online
   Mercoledì Apr 04 parent Home NEWSLOT ... /VLT SCOMMESSE ONLINE
 LOTTERIE
   Politica Video Live Score Home Politica Generale Serpelloni (Dip.
   Antidroga): ?Sul *gioco* necessarie linee di indirizzo per la cura
 delle
   patologie? HOT NEWS Turchi (Aams): ?Scommesse ... a tutti gli eccessi,
  ...
   Serpelloni (Dip. Antidroga): ?Sul *gioco* necessarie... » Servono
 ?linee
   di indirizzo comuni a livello nazionale per riuscire a monitorare il
   fenom... Curcio (Sapar): Sviluppo ... consapevole del *gioco* da... »
   ?Da parte della commissione c?è l?intento di approfondire i numeri in
   possesso e i dati de... Scommesse sportive: il 9 aprile apertura
   anticipat... » Aams comunica che, per la ... montepremi complessivo
 delle
   vincite, ottenuto nei... Cronache Esteri Ippica Videogiochi Bingo
  Normativa
   *Gioco* e Fisco Personaggi Flipper Sfoglia Rivista Serpelloni (Dip.
   Antidroga): ?Sul *gioco* necessarie
  
 
 http://www.gioconews.it/generale/serpelloni-dip.-antidroga-sul-gioco-necessarie-linee-di-indirizzo-per-la-cura-delle-patologie-23847.html
   [5] Generale - GiocoNews - Tutto su casinò, poker, giochi online
  http://www.gioconews.it/generale/index.php  Generale
   - *Gioco*News - Tutto su casinò, poker, giochi online Mercoledì Apr 04
   parent Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live
   Score Home Politica Generale HOT NEWS Turchi ... sensibili e attenti a
   tutti gli eccessi, ... Serpelloni (Dip. Antidroga): ?Sul
  *gioco*necessarie... » Servono ?linee di indirizzo comuni a livello
  nazionale per
   riuscire a monitorare il fenom... Curcio (Sapar ... ): Sviluppo
   consapevole del *gioco* da... » ?Da parte della commissione c?è
 l?intento
   di approfondire i numeri in possesso e i dati de... Scommesse sportive:
  il
   9 aprile apertura anticipat... » Aams comunica che ... previsto, il
   montepremi complessivo delle vincite

Fwd: request about snippets (with attachement)

2012-04-06 Thread alessio crisantemi
Or this one:

http://pc-alessio:8983/WoWSolrWebApp/search?query=gioco&submit=Search


-- Forwarded message --
From: alessio crisantemi alessio.crisant...@gmail.com
Date: 6 April 2012 22:42
Subject: Re: request about snippets (with attachement)
To: user@nutch.apache.org



Would this be OK?
http://192.168.1.5:8983/WoWSolrWebApp/search?query=gioco&submit=Search

Fwd: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
-- Forwarded message --
From: alessio crisantemi alessio.crisant...@gmail.com
Date: 5 April 2012 22:32
Subject: request about snippets
To: user@nutch.apache.org


Dear all,
I configured Nutch (1.4) to work with Solr (1.4.1), and I can crawl and
index my website successfully.

My only problem is with the search results.
In every result, the snippet contains a line listing all the categories
of my website. I attached a screenshot to explain: the unwanted line is
"Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE
Politica Video Live Score".

This is a problem because Solr reads the same line on every page, so when
my query matches a word in that line (e.g. 'ONLINE'), every document in
my Solr index is returned as a result.

How can I skip this line during crawling? Is it possible to exclude it?
Thank you in advance,
alessio


Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
Dear Lewis, thank you for your fast reply.
But that is exactly my problem! I don't understand which field creates
this line.

I see a date (e.g. Mercoledì Apr 04) followed by the word "parent",
and after that the names of the categories (Home NEWSLOT/VLT SCOMMESSE
ONLINE LOTTERIE Politica Video Live Score).

Do you know which field of the default Nutch configuration generates the
'parent' line?

As you can see in the attachment, this line is in the content field,
between 'str' tags.
..
Any suggestions?
Thanks,
a.

On 5 April 2012 22:45, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

 Hi Alessio,

 You need to determine in which field the unwanted content exists. Once
 you've done this you could write an indexing filter to remove this from
 your document prior to indexing.

 Lewis
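
One simple way to find the field that carries the unwanted text is to scan a returned document for a phrase from it. The Python sketch below does this on a document represented as a dictionary, which is roughly what a JSON response from Solr would yield. The `fields_containing` helper and the sample document are illustrative assumptions; the values are shortened from results quoted in this thread.

```python
def fields_containing(doc, phrase):
    """Return the names of the fields whose value contains the phrase.

    Multi-valued fields (lists) are checked value by value.
    """
    hits = []
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        if any(phrase in str(v) for v in values):
            hits.append(name)
    return hits

# Hypothetical document, shortened from the results quoted in this thread.
doc = {
    "title": "Live Score - GiocoNews - Tutto su casinò, poker, giochi online",
    "url": "http://www.gioconews.it/live-score.html",
    "content": (
        "Live Score - GiocoNews Mercoledì Apr 04 Home NEWSLOT/VLT "
        "SCOMMESSE ONLINE LOTTERIE Politica Video Live Score"
    ),
}

print(fields_containing(doc, "NEWSLOT/VLT"))
```

For this sample only the content field is reported, which agrees with the observation elsewhere in the thread that the line sits in the content field.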


 --
 *Lewis*



Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
What is a 'breadcrumb', Markus?

On 5 April 2012 23:08, Markus Jelsma
markus.jel...@openindex.io wrote:

 Seems to me it's just the breadcrumb of the page popping up in Solr's
 highlighter snippet?



 On Thu, 5 Apr 2012 22:02:31 +0100, Lewis John Mcgibbney
 lewis.mcgibb...@gmail.com wrote:

 I can't see any of your attachments as they're not permitted on list.

 Can you provide an URL?



 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350



Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
, ...
Serpelloni (Dip. Antidroga): ?Sul *gioco* necessarie... » Servono ?linee di
indirizzo comuni a livello nazionale per riuscire a monitorare il fenom...
Curcio (Sapar): Sviluppo ... consapevole del *gioco* da... » ?Da parte
della commissione c?è l?intento di approfondire i numeri in possesso e i
dati de... Scommesse sportive: il 9 aprile apertura anticipat... » Aams
comunica che, per la ... Iori, presidente Conagga, al convegno dedicato al *
gioco*... Visualizzazioni: 238 Da: redazione Intervista a Francesco...
Categoria: News - Interviste Intervista a Francesco Ginestra presidente di
Asso ... Snai Visualizzazioni: 169 Da: redazione Il Presidente Udc Rocco...
Categoria: News - Interviste Il Presidente Udc Rocco Buttiglione parla di *
gioco* e regolamentazione Visualizzazioni: 192 Da: redazione *Gioco*
http://www.gioconews.it/video.html

On 5 April 2012 23:02, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

 I can't see any of your attachments as they're not permitted on list.

 Can you provide an URL?


 --
 *Lewis*



Re: crawling a website

2012-04-02 Thread alessio crisantemi
Dear Remi,
thank you for your reply, but it doesn't work in my case:
the first rule stops my crawl at the first section, and the second stops
it right at the start point.

I see that each section of my website has as its first page a URL ending
in 'index.php' (e.g. http://ww.mywebsite.com/beta/index.php).
So, to crawl the whole section (http://ww.mywebsite.com/beta) without
parsing the http://ww.mywebsite.com/beta/index.php page,
which is the correct rule?

(Maybe the following?
- ^http://ww.mywebsite.com/index-php$ ) or something similar?
Thanks,
alessio



On 2 April 2012 11:40, remi tassing tassingr...@gmail.com wrote:

 It depends on the structure of your site and you can modify
 regex-urlfilter.txt to reach your goal.

 From the examples you gave, you can do this:
 -^http://ww.mywebsite.com/[^/]*$
 This will exclude http://ww.mywebsite.com/alpha,
 http://ww.mywebsite.com/beta, and http://ww.mywebsite.com/gamma

 -^http://ww.mywebsite.com/.*/$
 This will exclude any URL that ends with /

 I would suggest you get familiar with regular expressions (in case you
 aren't yet)

 Remi
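
The effect of these two patterns can be checked quickly outside Nutch. The Python sketch below applies them to a few of the example URLs from the thread; the `excluded` helper is a name made up for the illustration, and Python's `re` module stands in for the Java regex engine, which behaves the same way for these simple patterns.

```python
import re

# The two exclusion patterns from the reply above.
top_level = re.compile(r"^http://ww.mywebsite.com/[^/]*$")
trailing_slash = re.compile(r"^http://ww.mywebsite.com/.*/$")

def excluded(url):
    """Return True if either exclusion pattern matches the URL."""
    return bool(top_level.search(url) or trailing_slash.search(url))

for url in [
    "http://ww.mywebsite.com/alpha",
    "http://ww.mywebsite.com/alpha/",
    "http://ww.mywebsite.com/alpha/artcle1.html",
    "http://ww.mywebsite.com/",
]:
    print(url, excluded(url))
```

Note that the bare site root, http://ww.mywebsite.com/, also matches the first pattern. If the crawl is seeded with the root, the seed itself would be filtered out. In addition, a section page that is filtered out is presumably never fetched, so the article links on it would not be discovered through that page.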

 On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi 
 alessio.crisant...@gmail.com wrote:

  Dear All,
  I would like to change my crawl, but I don't know how to do it.
 
  To crawl my website I used the following command:
 
  $ bin/nutch crawl urls -solr http://localhost:8983/solr -threads 3 -depth 35 -topN 10
 
  to crawl with Nutch and index the results in Solr.
 
  However, I don't want to index the section pages of my website, only
  the individual article pages.
 
  For example, consider a site, www.mywebsite.com, composed of 3 sections:
 
  http://ww.mywebsite.com/alpha
  http://ww.mywebsite.com/beta
  http://ww.mywebsite.com/gamma
 
  I want my results to contain only the individual article pages, and
  not the article lists of these directories.
 
  So, for example, I want these files to be parsed:
 
  http://ww.mywebsite.com/alpha/artcle1.html
  http://ww.mywebsite.com/alpha/artcle3.html
  ...
 
  and I don't want the parent section to be parsed:
 
  http://ww.mywebsite.com/alpha/
 
  How can I do this?
  Any suggestions?
  Sorry if this is not entirely clear.
  Thank you,
  alessio
 



Re: nutch crawling file system SOLVED

2012-03-17 Thread alessio crisantemi
This is the output after crawling with Nutch and indexing in Solr:

<doc>
<float name="boost">0.298293</float>
<str name="content">
Index of C:\Documents and Settings\Alessio\Documenti Index of C:\Documents
and Settings\Alessio\Documenti ../ - - - 003_C_001_Alessio_2004_08_13.dvf
Tue, 17 Aug 2004 20:09:52 GMT 151552 13542 Christmas on Stars Affiliate
Banners 21.11/ Fri, 02 Dec 2011 20:45:37 GMT - 5 GETTONI 22 EURO.doc Mon,
08 Sep 2003 16:18:58 GMT 20480 adesione 191 con password.doc Fri, 29 Aug
2003 13:45:00 GMT 116736 art inceneritore.doc Sun, 18 Feb 2007 18:08:54 GMT
32768 articoli/ Wed, 26 Apr 2006 08:12:43 GMT - auguri pcplayer.it.jpg Sun,
24 Dec 2006 20:56:48 GMT 13888 Bluetooth Exchange Folder/ Wed, 22 Mar 2006
19:50:24 GMT - BluetoothXpSp2.pdf Mon, 11 Jul 2005 08:24:58 GMT 120812
Brevettato il disco volante.doc Wed, 30 Aug 2006 13:57:19 GMT 20992 Busta
Rhymes feat maria c.mp3 Sat, 05 Jul 2003 12:17:58 GMT 2595736 CALENDARIO
FANTA2005.doc Wed, 08 Sep 2004 13:13:20 GMT 141824 Cartella Scambio
Bluetooth/ Mon, 20 Mar 2006 22:03:45 GMT - cc_20111224_234440.reg Sat, 24
Dec 2011 22:44:44 GMT 1860 CD musicali 01 -01- 2003.xls Mon, 27 Jan 2003
00:10:00 GMT 515584 CLASSIFICA fantacalcio 2005.doc Wed, 08 Sep 2004
13:15:38 GMT 43008 Collegamento a My Shared Folder.lnk Thu, 29 Sep 2005
16:56:56 GMT 533 conte tagliaferri.doc Thu, 09 Oct 2008 00:00:36 GMT 29184
Corel User Files/ Sun, 16 Apr 2006 11:56:24 GMT - Curriculum ANGELO
CONTILI.doc Wed, 19 Jan 2005 22:35:34 GMT 42496 currriculum Alessio
agg2004.doc Thu, 27 Jan 2005 21:56:06 GMT 950784 currriculum Alessio.doc
Thu, 10 Apr 2003 19:29:28 GMT 44544 Default.rdp Mon, 11 Sep 2006 17:13:17
GMT 1166 desktop.ini Wed, 28 Sep 2011 20:52:02 GMT 75 DNSadsl.txt Mon, 01
Aug 2005 13:21:52 GMT 942 DNStabella.xls Mon, 01 Aug 2005 13:16:26 GMT
33792 Download/ Fri, 09 Mar 2012 23:03:54 GMT - Eseguibili JAVA.doc Mon, 20
Jun 2005 11:58:42 GMT 23552 FANTACALCIO/ Mon, 20 Mar 2006 22:04:34 GMT -
Fax/ Mon, 20 Mar 2006 22:04:40 GMT - File ricevuti/ Mon, 19 Oct 2009
12:50:30 GMT - FINALE TORNEO 06.doc Tue, 08 May 2007 17:47:41 GMT 49664
Finest/ Sat, 06 Mar 2010 15:05:46 GMT - FORMAZIONItipo2005.doc Mon, 13 Sep
2004 20:26:56 GMT 49664 free3gp/ Tue, 25 May 2010 16:23:43 GMT - Futurando/
Mon, 20 Mar 2006 09:59:24 GMT - GOL/ Mon, 20 Mar 2006 22:24:46 GMT -
guidadownloadconmirc.doc Sun, 20 Feb 2005 10:29:40 GMT 264704 HAPPY DAYS/
Mon, 20 Mar 2006 21:45:36 GMT - Happy Days2007/ Sun, 27 Jan 2008 15:53:33
GMT - hijackthis.log Fri, 04 Jul 2008 08:49:37 GMT 8573 Immagini/ Wed, 28
Sep 2011 20:52:03 GMT - Immagini.lnk Fri, 15 Aug 2008 16:26:58 GMT 375
intervisteEnada/ Mon, 20 Mar 2006 10:36:53 GMT - IP Pentima.txt Fri, 01 Jul
2005 08:19:56 GMT 99 L'AUTOMATICO/ Sun, 14 Jan 2007 17:30:41 GMT -
lavatr1h.mp3 Wed, 03 Oct 2001 15:19:52 GMT 2586624 lionsleeps_hq.wmv Tue,
17 May 2005 13:29:02 GMT 1842905 lista flip.docx Sat, 29 Mar 2008 14:04:24
GMT 13559 Masterizzare giochi con NERO BURNING ROM.doc Sun, 06 Mar 2005
19:32:28 GMT 23040 masterizzarre CD protetti.txt Thu, 20 Jan 2005 20:36:00
GMT 2326 Matlab 65 serial.txt Thu, 09 Oct 2003 22:34:00 GMT 86
MessageLog.xsl Sun, 21 Dec 2008 20:45:03 GMT 12160 mirc istruz.txt Sun, 09
Mar 2003 16:44:00 GMT 1123 Musica/ Wed, 28 Sep 2011 20:52:04 GMT - My Skype
Content/ Sat, 06 May 2006 12:04:04 GMT - My Skype Pictures/ Wed, 27 Apr
2011 19:54:03 GMT - My Skype Received Files/ Thu, 18 May 2006 16:50:34 GMT
- natale_flip.jpg Sat, 23 Dec 2006 17:18:52 GMT 118507 niagara.JPG Fri, 18
Aug 2006 16:53:48 GMT 1017782 niagara2.JPG Fri, 18 Aug 2006 16:53:44 GMT
988143 Norton AntiVirus_Key.txt Sun, 31 Oct 2004 19:28:24 GMT 357
postepay.txt Wed, 16 Jul 2008 07:48:38 GMT 16 presentazione_FB.pdf Thu, 09
Mar 2006 08:58:00 GMT 700629 richiesta.doc Sun, 16 Nov 2003 18:14:44 GMT
124928 ROSE FANTACALCIO 2005.doc Wed, 08 Sep 2004 13:59:54 GMT 45568
scudettoicona.ico Mon, 22 Sep 2003 19:55:10 GMT 13502
serial_akkxMDYwMTE0ODM5.txt Tue, 05 Aug 2003 20:19:26 GMT 155 Siti Web/
Sun, 03 Jun 2007 13:21:44 GMT - SitoTernanaGiochi/ Fri, 25 May 2007
19:52:51 GMT - sitoTGver1.1.pub Sun, 06 Mar 2005 20:23:56 GMT 1637888
starry(d).jpg Sun, 02 Apr 2006 10:10:26 GMT 2138166 suonerie/ Fri, 15 Aug
2008 16:29:09 GMT - Symantec/ Sun, 13 Aug 2006 12:09:04 GMT - Thumbs.db
Sun, 11 Feb 2007 14:45:34 GMT 71168 vecchioDocumenti/ Wed, 14 Jul 2010
15:42:28 GMT - virtualDub/ Mon, 20 Mar 2006 10:16:47 GMT - Voice Files/
Mon, 27 Mar 2006 11:57:35 GMT - ZbThumbnail.info Mon, 09 Jun 2008 08:25:30
GMT 2920 zurigo.doc Thu, 13 Apr 2006 15:24:45 GMT 27648
</str>
<str name="digest">6717a734c4f78c7f7f2dbc9a7324199e</str>
<str name="id">file:/C:/Documents and Settings/Alessio/Documenti/</str>
<str name="segment">20120317175631</str>
<str name="title">
Index of C:\Documents and Settings\Alessio\Documenti
</str>
<date name="tstamp">2012-03-17T16:56:39.014Z</date>
<str name="url">file:/C:/Documents and Settings/Alessio/Documenti/</str>
</doc>

Any suggestions?
Thanks,
alessio

On 12 March 2012 09:39, alessio crisantemi
alessio.crisant...@gmail.com wrote:

 I add the path

Re: nutch crawling file system SOLVED

2012-03-17 Thread alessio crisantemi
I would like my search results to show the text of my PDF files, not the
list of documents in the directory and its path.




On 17 March 2012 21:11, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

 Hi Alessio,

 On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi 
 alessio.crisant...@gmail.com wrote:

 
 
  suggestions?
 

 For what?



Re: nutch crawling file system SOLVED

2012-03-12 Thread alessio crisantemi
I added the path of my directory to regex-urlfilter, but Nutch also crawls
other directories...

Also, I followed your suggestions and indexed my root again, but I
still get an index with the names of my PDF files and not their
content.

I don't understand.
alessio

On 12 March 2012 06:06, remi tassing tassingr...@gmail.com wrote:

 Using crawl-urlfilter (or regex-urlfilter, depending on which one you're
 using), you should be able to solve this. Unless you're not clear on which
 folders to exclude...?

 On Sunday, March 11, 2012, alessio crisantemi 
 alessio.crisant...@gmail.com
 wrote:
  Thank you, Remi, for your precious help. I will try again and send you the
  results.
  But I have another small question: how can I limit the crawl
  to my selected root only?
 
  Nutch always crawls the parent directories too. I read that the
  code responsible for this is in
  org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File
  f).
 
  Someone suggested changing the following line:
  this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
  true);
 
  to
  this.content = list2html(f.listFiles(), path, false);
 
  and recompiling.
 
  But my class file has exactly this line... and this is not a simple
  approach.
 
  Is there another way, I wonder?
 
  thank you
 
  alessio
 
 
 
  On 11 March 2012 18:32, Lewis John Mcgibbney
  lewis.mcgibb...@gmail.com wrote:
 
  Please see below
 
  On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi 
  alessio.crisant...@gmail.com wrote:
 
  
   [1]
  http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
  
 
  I've now updated this link, thanks for pointing this out.
 
 
   And now I have another problem:
   I crawled my local file system: a directory with a lot of PDF files.
   Everything works, and Nutch indexes the results in Solr.
  
 
  OK
 
 
   But this is the problem: when I submit a query to Solr, I see only
   a list of files, and not the PDF contents.
   Why, in your opinion?
  
 
  Well, this might be to do with your file.content.limit in nutch-site.xml;
  maybe your documents are being truncated if they are too large.
  Additionally, your Solr mappings and/or schema configuration may need to
  be tweaked slightly to permit you to view snippets of the PDF content
  within your Solr search results. In your schema configuration for
  index-basic, try changing
 
  <field name="content" type="text" stored="false" indexed="true"/>
 
  to
 
  <field name="content" type="text" stored="true" indexed="true"/>
 
 
  You will need to reindex your content if you wish to see the results
  through Solr.
 
 



Re: nutch crawling file system SOLVED

2012-03-11 Thread alessio crisantemi
Thank you, Remi, for your precious help. I will try again and send you the
results.
But I have another small question: how can I limit the crawl
to my selected root only?

Nutch always crawls the parent directories too. I read that the
code responsible for this is in
org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File
f).

Someone suggested changing the following line:
this.content = list2html(f.listFiles(), path, "/".equals(path) ? false :
true);

to
this.content = list2html(f.listFiles(), path, false);

and recompiling.

But my class file has exactly this line... and this is not a simple
approach.

Is there another way, I wonder?

thank you

alessio
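
The directory listing indexed elsewhere in this thread contains a "../" entry, which is presumably how the crawl climbs to the parent directories. As a rough illustration (plain Python, not Nutch code), the sketch below resolves that link against the listing URL and shows that a simple prefix check on the chosen root rejects it without recompiling anything. ROOT, in_root() and the file:/// URL form are assumptions made for this example.

```python
from urllib.parse import urljoin

# Hypothetical crawl root, written in file:/// form for urllib.
ROOT = "file:///C:/Documents and Settings/Alessio/Documenti/"


def in_root(url, root=ROOT):
    """Accept only URLs at or below the chosen root directory."""
    return url.startswith(root)


# The "../" link in a directory listing resolves to the parent directory,
# while a subdirectory link stays below the root.
parent = urljoin(ROOT, "../")
child = urljoin(ROOT, "articoli/")

if __name__ == "__main__":
    print(parent, in_root(parent))
    print(child, in_root(child))
```

In a URL filter file the same idea would be an accept rule anchored on the root path followed by a reject-everything rule; the exact pattern depends on how the file: URLs are written in the crawl.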



On 11 March 2012 18:32, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

 Please see below

 On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi 
 alessio.crisant...@gmail.com wrote:

 
  [1]
 http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
 

 I've now updated this link, thanks for pointing this out.


  And now I have another problem:
  I crawled my local file system: a directory with a lot of PDF files.
  Everything works, and Nutch indexes the results in Solr.
 

 OK


  But this is the problem: when I submit a query to Solr, I see only a
  list of files, and not the PDF contents.
  Why, in your opinion?
 

 Well, this might be to do with your file.content.limit in nutch-site.xml;
 maybe your documents are being truncated if they are too large.
 Additionally, your Solr mappings and/or schema configuration may need to be
 tweaked slightly to permit you to view snippets of the PDF content within
 your Solr search results. In your schema configuration for index-basic, try
 changing

 <field name="content" type="text" stored="false" indexed="true"/>

 to

 <field name="content" type="text" stored="true" indexed="true"/>


 You will need to reindex your content if you wish to see the results
 through Solr.
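
Before reindexing, it may help to confirm which fields in the schema are explicitly marked as not stored. The sketch below is only an illustration: the SAMPLE_SCHEMA string is a made-up fragment rather than the real Nutch schema.xml, and unstored_fields() is a hypothetical helper that reports only fields carrying stored="false".

```python
import xml.etree.ElementTree as ET

# Made-up schema fragment for illustration; a real schema.xml has more fields.
SAMPLE_SCHEMA = """<schema name="example">
  <fields>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
  </fields>
</schema>"""


def unstored_fields(schema_xml):
    """Return the names of fields explicitly declared with stored="false"."""
    root = ET.fromstring(schema_xml)
    return [
        field.get("name")
        for field in root.iter("field")
        if field.get("stored") == "false"
    ]


if __name__ == "__main__":
    print(unstored_fields(SAMPLE_SCHEMA))
```

A field listed here is searchable when indexed="true", but its text is not returned in results, which would match the symptom of seeing titles but no content.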



Re: nutch crawling file system SOLVED

2012-03-10 Thread alessio crisantemi
It's partially solved.
Following the tutorial, I configured Nutch to crawl a local file system,
thank you.

But I have a doubt: why do all the tutorials and guides about Nutch talk
about a 'crawl-urlfilter.txt' file, when the default Nutch configuration
doesn't have this file? Yet if I put the rules that the guide gives for
crawl-urlfilter into regex-urlfilter, everything works.
I would like to understand this.
Thank you,
alessio

On 4 March 2012 17:02, alessio crisantemi
alessio.crisant...@gmail.com wrote:

 Hi all,
 I need to crawl a directory with a lot of PDF files.
 But I only know the step-by-step procedure for crawling a website.
 How can I do this for a local root directory?
 Thank you for your help,
 alessio



Re: nutch craling file system

2012-03-04 Thread alessio crisantemi
Thank you for the fast reply!
I use Solr 1.4.1 and Nutch 1.4. Do these solutions work with those versions?
Thanks,
a.

On 4 March 2012 17:06, remi tassing tassingr...@gmail.com wrote:

 Please try Googling that first!

 If you don't find anything then try these:
 [1] http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
 [2] http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
 [3] http://stackoverflow.com/questions/941519/how-to-make-nutch-crawl-file-system


 On Sun, Mar 4, 2012 at 5:02 PM, alessio crisantemi 
 alessio.crisant...@gmail.com wrote:

  Hi all,
  I need to crawl a directory with a lot of PDF files.
  But I only know the step-by-step procedure for crawling a website.
  How can I do this for a local root directory?
  Thank you for your help,
  alessio