Re: exclude some urls from crawling
Thank you, Remi, for your valuable help, but I have another problem now. I can crawl and index my website, but when I search, I get results ONLY if the query term appears in the title of a document, not when it appears in the body of my documents. Why do you think that is? Did the crawl fail?

Thanks again,
alessio

On 13 April 2012 15:46, remi tassing <tassingr...@gmail.com> wrote:

To exclude index.php and index.html just use:

-index\.html
-index\.php

You can do the same for video and live-score. To check definitively whether a URL is blocked or not, try:

echo URL | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Remi

On Tuesday, April 10, 2012, alessio crisantemi wrote:

Dear All,

I am trying to exclude some URLs of my website from the crawling process, but without success. To exclude them, I added these lines to my regex-urlfilter.txt file BEFORE the rule for the home page to crawl:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
-^http://www.mywebsite.it/video/
-^http://www.mywebsite.it/live-score.html/
-^http://www.mywebsite.it/([a-z0-9\-A-Z]*\.)*php/
-^http://www.mywebsite.it/([a-z0-9\-A-Z]*\.)*index.html/
# http://www.gioconews.it/video/ http://www.gioconews.it/live-score.html http://www.gioconews.it/([a-z0-9\-A-Z]*\.)*php http://www.gioconews.it/([a-z0-9\-A-Z]*\.)*index.html

I wrote these rules because I want to exclude the sub-URLs 'video' and 'live-score' and all pages under the links '/index.php' and 'index.html' (every section has this main link). But these rules don't work, and these subdirectories still appear in my results list every time. What can I do? Any suggestions?

Thank you in advance,
alessio
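To see why the original rules may never have fired, it can help to imitate how such a rule list is evaluated. The sketch below is a Python approximation, not Nutch code: it assumes the rules are tried top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The function name `check_url`, the two rule lists and the final `+.` catch-all are illustrative assumptions.

```python
import re

def check_url(url, rules):
    """Return True if the URL is accepted, False if it is rejected.

    Each rule is a string: '+' or '-' followed by a regular expression.
    The first rule whose pattern is found in the URL decides the outcome.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected

# One of the original rules (note the trailing slash), then an accept-all rule.
original = [r"-^http://www.mywebsite.it/live-score.html/", r"+."]
# Short, unanchored patterns in the style suggested in the reply.
suggested = [r"-index\.html", r"-index\.php", r"-live-score", r"-/video/", r"+."]

url = "http://www.mywebsite.it/live-score.html"
print(check_url(url, original))   # True: the trailing '/' prevents a match
print(check_url(url, suggested))  # False: '-live-score' matches first
```

Under these assumptions the trailing `/` in the original patterns is enough to let the page through, because the real URL ends at `.html`. The URLFilterChecker command quoted above remains the authoritative check.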
Re: request about snippets (with attachement)
No, Lewis, I'm sorry for the misunderstanding! But I don't know this link, because this line is a fixed line in my website template. Also, if I look at the HTML source of my home page, I can't see this line. So I can only read it in the XML results from Solr. This is one snippet from my results:

-leaf label= id=VF162 webpage title=Nuove regole sulle slot machine: la Grecia invia proposta alla Commissione Ue - GiocoNews - Tutto su rank=30 url= http://www.gioconews.it/generale/nuove-regole-sulle-slot-machine-la-grecia-invia-proposta-alla-commissione-ue-23813.html; Nuove regole sulle slot machine: la Grecia invia proposta alla Commissione Ue - GiocoNews - Tutto su casinò, poker, giochi online Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ONLINE ... LOTTERIE Politica Video Live Score Home Esteri Generale Nuove regole sulle slot machine: la Grecia invia proposta alla Commissione Ue HOT NEWS Turchi (Aams): “Scommesse, è far west in Italia: m... » Non ... ... Cronache Esteri Ippica Videogiochi Bingo Normativa Gioco e Fisco Personaggi Flipper Sfoglia Rivista Nuove regole sulle slot machine: la Grecia invia proposta alla Commissione Ue Scritto da Sm Mercoledì 04 ... : #FF9900; }//--slot-machine-la-grecia-invia-proposta-alla-commissione-ue-23813.html target=_blankNuove regole ... sulle slot machine: la Grecia invia proposta alla Commissione UeMercoledì 04 Aprile 2012© 2012 - a href /webpage /leaf

This is the line that I don't want in my results:

GiocoNews - Tutto su casinò, poker, giochi online Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ONLINE ... LOTTERIE

Thanks,
alessio

On 7 April 2012 12:09, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

No I mean the URL that you are having trouble with not your solr server and port number plus search query... If you can provide the URL you wish to remove some particular HTML tag from then at least we can see what it is that you are having trouble with.
Sorry if I've not made myself clear enough.

Lewis
Re: request about snippets (with attachement)
Thank you again, Lewis. But do you think my strange content field is the cause of my problem? I ask because I disabled indexing for almost all fields. This is my schema:

<fields>
  <field name="id" type="string" stored="true" indexed="true"/>
  <!-- core fields -->
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>
  <!-- fields for index-basic plugin -->
  <field name="host" type="url" stored="false" indexed="false"/>
  <field name="site" type="string" stored="true" indexed="false"/>
  <field name="url" type="url" stored="true" indexed="false" required="true"/>
  <field name="content" type="text" stored="true" indexed="true"/>
  <field name="title" type="text" stored="true" indexed="false"/>
  <field name="cache" type="string" stored="true" indexed="false"/>
  <field name="tstamp" type="date" stored="true" indexed="false"/>
  <!-- fields for index-anchor plugin -->
  <field name="anchor" type="string" stored="true" indexed="false" multiValued="true"/>
  <!-- fields for index-more plugin -->
  <field name="type" type="string" stored="true" indexed="false" multiValued="true"/>
  <field name="contentLength" type="long" stored="true" indexed="false"/>
  <field name="lastModified" type="date" stored="false" indexed="false"/>
  <field name="date" type="date" stored="true" indexed="false"/>
  <!-- fields for languageidentifier plugin -->
  <field name="lang" type="string" stored="true" indexed="false"/>
  <!-- fields for subcollection plugin -->
  <field name="subcollection" type="string" stored="true" indexed="false" multiValued="true"/>
  <!-- fields for feed plugin (tag is also used by microformats-reltag) -->
  <field name="author" type="string" stored="true" indexed="true"/>
  <field name="tag" type="string" stored="true" indexed="true" multiValued="false"/>
  <field name="feed" type="string" stored="true" indexed="false"/>
  <field name="publishedDate" type="date" stored="true" indexed="false"/>
  <field name="updatedDate" type="date" stored="true" indexed="false"/>
  <!-- fields for creativecommons plugin -->
  <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
</fields>

What do you think?

alessio

On 7 April 2012 21:57, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

From the limited HTML that I've seen I can only assume that the offending xhtml is in the content field. If this is the case then you will need to write a custom plugin implementation that removes this. There is loads of info allowing you to get up to speed with plugins on our wiki.[0] Once you have something that requires help get on to the list and let us know.

Lewis

[0] http://wiki.apache.org/nutch/PluginCentral

On Sat, Apr 7, 2012 at 2:33 PM, alessio crisantemi <alessio.crisant...@gmail.com> wrote:

Maybe the cause is my schema? I chose to index only title, author and content. Can you help me set up a parse filter?

Thank you,
alessio
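The reply above proposes a custom plugin that removes the offending text from the content field. A real Nutch plugin would be written in Java against the plugin API described on the wiki; the Python below is only a minimal sketch of the text transformation such a plugin might perform. The function name `strip_boilerplate`, the `NAV` constant and the sample `page` string are illustrative assumptions, and the navigation string is taken from the snippets quoted in this thread.

```python
def strip_boilerplate(content, phrases):
    """Remove each known boilerplate phrase and normalise whitespace."""
    for phrase in phrases:
        content = content.replace(phrase, " ")
    return " ".join(content.split())

# Assumption: the navigation menu appears verbatim in every page's content.
NAV = "Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score"
page = "GiocoNews Mercoledì Apr 04 parent " + NAV + " Nuove regole sulle slot machine"

print(strip_boilerplate(page, [NAV]))
# GiocoNews Mercoledì Apr 04 parent Nuove regole sulle slot machine
```

An exact-string match like this is fragile if the template changes; it is meant only to make the idea concrete before writing the plugin itself.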
Re: request about snippets (with attachement)
Would this be suitable? http://192.168.1.5:8983/WoWSolrWebApp/search?query=giocosubmit=Search

On 6 April 2012 22:29, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

It would be easier if you could provide an URL and people can see exactly what you are struggling with please?

2012/4/6 alessio crisantemi <alessio.crisant...@gmail.com>:

Any suggestions for my case?

On 5 April 2012 23:20, alessio crisantemi <alessio.crisant...@gmail.com> wrote:

Here is part of the results:

[2] Live Score - GiocoNews - Tutto su casinò, poker, giochi online
http://www.gioconews.it/live-score.html
Live Score - *Gioco*News - Tutto su casinò, poker, giochi online Mercoledì Apr 04 Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score Home Live Score Questa opzione non funziona ... correttamente. Sfortunatamente, il tuo browser non supporta gli Inline Frame Visualizza *Gioco*News sul tuo Smart Phone Detect Version | Versione Mobile | Versione Standard Ripristina configurazione standard ... © Copyright 2012 *Gioco*News.it powered by GNMedia s.r.l. P.iva 01419700552, Tutti i diritti riservati
http://www.gioconews.it/live-score.html

[3] Curcio (Sapar): Sviluppo consapevole del gioco da parte di tutti gli operatori - GiocoNe
http://www.gioconews.it/generale/curcio-sapar-sviluppo-consapevole-del-gioco-da-parte-di-tutti-gli-operatori-23848.html
Curcio (Sapar): Sviluppo consapevole del *gioco* da parte di tutti gli operatori - *Gioco*News - Tutto su casinò, poker, giochi online Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ... ONLINE LOTTERIE Politica Video Live Score Home NEWSLOT/VLT Generale Curcio (Sapar): Sviluppo consapevole del *gioco* da parte di tutti gli operatori HOT NEWS Turchi (Aams): “Scommesse, è far west in Italia ... , ... Serpelloni (Dip. Antidroga): “Sul *gioco* necessarie... » Servono “linee di indirizzo comuni a livello nazionale per riuscire a monitorare il fenom... Curcio (Sapar): Sviluppo consapevole del *gioco* da... » “Da ... , ottenuto nei... Cronache Esteri Ippica Videogiochi Bingo Normativa *Gioco* e Fisco Personaggi Flipper Sfoglia Rivista Curcio (Sapar): Sviluppo consapevole del *gioco* da parte di tutti gli operatori Scritto da ... Sm Mercoledì 04 Aprile 2012 16:45 “Da parte della commissione c’è l’intento di approfondire i numeri in possesso e i dati del settore del *gioco*. Da parte nostra abbiamo cercato di chiarire le cifre e
http://www.gioconews.it/generale/curcio-sapar-sviluppo-consapevole-del-gioco-da-parte-di-tutti-gli-operatori-23848.html

[4] Serpelloni (Dip. Antidroga): “Sul gioco necessarie linee di indirizzo per la cura delle patologie” -
http://www.gioconews.it/generale/serpelloni-dip.-antidroga-sul-gioco-necessarie-linee-di-indirizzo-per-la-cura-delle-patologie-23847.html
Serpelloni (Dip. Antidroga): “Sul *gioco* necessarie linee di indirizzo per la cura delle patologie” - *Gioco*News - Tutto su casinò, poker, giochi online Mercoledì Apr 04 parent Home NEWSLOT ... /VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score Home Politica Generale Serpelloni (Dip. Antidroga): “Sul *gioco* necessarie linee di indirizzo per la cura delle patologie” HOT NEWS Turchi (Aams): “Scommesse ... a tutti gli eccessi, ... Serpelloni (Dip. Antidroga): “Sul *gioco* necessarie... » Servono “linee di indirizzo comuni a livello nazionale per riuscire a monitorare il fenom... Curcio (Sapar): Sviluppo ... consapevole del *gioco* da... » “Da parte della commissione c’è l’intento di approfondire i numeri in possesso e i dati de... Scommesse sportive: il 9 aprile apertura anticipat... » Aams comunica che, per la ... montepremi complessivo delle vincite, ottenuto nei... Cronache Esteri Ippica Videogiochi Bingo Normativa *Gioco* e Fisco Personaggi Flipper Sfoglia Rivista Serpelloni (Dip. Antidroga): “Sul *gioco* necessarie
http://www.gioconews.it/generale/serpelloni-dip.-antidroga-sul-gioco-necessarie-linee-di-indirizzo-per-la-cura-delle-patologie-23847.html

[5] Generale - GiocoNews - Tutto su casinò, poker, giochi online
http://www.gioconews.it/generale/index.php
Generale - *Gioco*News - Tutto su casinò, poker, giochi online Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score Home Politica Generale HOT NEWS Turchi ... sensibili e attenti a tutti gli eccessi, ... Serpelloni (Dip. Antidroga): “Sul *gioco* necessarie... » Servono “linee di indirizzo comuni a livello nazionale per riuscire a monitorare il fenom... Curcio (Sapar ... ): Sviluppo consapevole del *gioco* da... » “Da parte della commissione c’è l’intento di approfondire i numeri in possesso e i dati de... Scommesse sportive: il 9 aprile apertura anticipat... » Aams comunica che ... previsto, il montepremi complessivo delle vincite
Fwd: request about snippets (with attachement)
or this: http://pc-alessio:8983/WoWSolrWebApp/search?query=giocosubmit=Search

-- Forwarded message --
From: alessio crisantemi <alessio.crisant...@gmail.com>
Date: 6 April 2012 22:42
Subject: Re: request about snippets (with attachement)
To: user@nutch.apache.org

Would this be suitable? http://192.168.1.5:8983/WoWSolrWebApp/search?query=giocosubmit=Search
Fwd: request about snippets (with attachement)
-- Forwarded message --
From: alessio crisantemi <alessio.crisant...@gmail.com>
Date: 5 April 2012 22:32
Subject: request about snippets
To: user@nutch.apache.org

Dear all,

I configured Nutch (1.4) to work with Solr (1.4.1), and I can crawl and index my website successfully. My only problem is with the search results. In every result, the snippet contains a line listing all the categories of my website. I attached a screenshot to explain; there, the unwanted line is:

Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score

This is a problem because Solr reads the same line on every page, so when my query is a word from this line (e.g. 'ONLINE'), every document in my Solr index is returned as a result. How can I skip this line during crawling? Is it possible to exclude it?

Thank you in advance,
alessio
Re: request about snippets (with attachement)
Dear Lewis,

thank you for your fast reply. But that is exactly my problem! I don't understand which field creates this line. I see a date (e.g. Mercoledì Apr 04), followed by the word 'parent' and then the names of the categories (Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score). Do you know which field of the default Nutch configuration generates the 'parent' line? As you can see in the attachment, this line is in the content field, between 'str' tags. Any suggestions?

Thanks,
a.

On 5 April 2012 22:45, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

Hi Alessio,

You need to determine in which field the unwanted content exists. Once you've done this you could write an indexing filter to remove this from your document prior to indexing.

Lewis

On Thu, Apr 5, 2012 at 9:41 PM, alessio crisantemi <alessio.crisant...@gmail.com> wrote:

-- Forwarded message --
From: alessio crisantemi <alessio.crisant...@gmail.com>
Date: 5 April 2012 22:32
Subject: request about snippets
To: user@nutch.apache.org

Dear all,

I configured Nutch (1.4) to work with Solr (1.4.1), and I can crawl and index my website successfully. My only problem is with the search results. In every result, the snippet contains a line listing all the categories of my website. I attached a screenshot to explain; there, the unwanted line is:

Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score

This is a problem because Solr reads the same line on every page, so when my query is a word from this line (e.g. 'ONLINE'), every document in my Solr index is returned as a result. How can I skip this line during crawling? Is it possible to exclude it?

Thank you in advance,
alessio

-- Lewis
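The first step suggested above, finding which field holds the unwanted text, can be done mechanically once a result document is available. The following is a small Python sketch under stated assumptions: the `doc` dictionary is a hypothetical stand-in for one document of a Solr response (field names taken from this thread), and `fields_containing` is an illustrative helper, not part of Nutch or Solr.

```python
def fields_containing(doc, phrase):
    """Return the names of the fields whose text contains the phrase."""
    hits = []
    for name, value in doc.items():
        # Multi-valued fields arrive as lists; treat single values alike.
        values = value if isinstance(value, list) else [value]
        if any(phrase in str(v) for v in values):
            hits.append(name)
    return hits

# Hypothetical document, shaped like one entry of a search response.
doc = {
    "title": "Live Score - GiocoNews",
    "content": "Mercoledì Apr 04 parent Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE",
    "url": "http://www.gioconews.it/live-score.html",
}

print(fields_containing(doc, "NEWSLOT/VLT"))  # ['content']
```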
Re: request about snippets (with attachement)
What is a 'breadcrumb', Markus?

On 5 April 2012 23:08, Markus Jelsma <markus.jel...@openindex.io> wrote:

Seems to me it's just the breadcrumb of the page popping up in Solr's highlighter snippet?

On Thu, 5 Apr 2012 22:02:31 +0100, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

I can't see any of your attachments as they're not permitted on list. Can you provide an URL?
Re: request about snippets (with attachement)
, ... Serpelloni (Dip. Antidroga): ?Sul *gioco* necessarie... » Servono ?linee di indirizzo comuni a livello nazionale per riuscire a monitorare il fenom... Curcio (Sapar): Sviluppo ... consapevole del *gioco* da... » ?Da parte della commissione c?è l?intento di approfondire i numeri in possesso e i dati de... Scommesse sportive: il 9 aprile apertura anticipat... » Aams comunica che, per la ... Iori, presidente Conagga, al convegno dedicato al * gioco*... Visualizzazioni: 238 Da: redazione Intervista a Francesco... Categoria: News - Interviste Intervista a Francesco Ginestra presidente di Asso ... Snai Visualizzazioni: 169 Da: redazione Il Presidente Udc Rocco... Categoria: News - Interviste Il Presidente Udc Rocco Buttiglione parla di * gioco* e regolamentazione Visualizzazioni: 192 Da: redazione *Gioco*
http://www.gioconews.it/video.html

On 5 April 2012 23:02, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

I can't see any of your attachments as they're not permitted on list. Can you provide an URL?
Re: crawling a website
Dear Remi,

thank you for your reply, but that doesn't work in my case: the first rule stops my crawl at the first section, and the second stops it right at the start point. I see that each section of my website has as its first page a URL with 'index.php' (e.g. http://ww.mywebsite.com/beta/index.php). So, to crawl the whole section (http://ww.mywebsite.com/beta) without parsing the http://ww.mywebsite.com/beta/index.php page, which is the correct rule? Maybe the following, or something similar?

-^http://ww.mywebsite.com/index-php$

Thanks,
alessio

On 2 April 2012 11:40, remi tassing <tassingr...@gmail.com> wrote:

It depends on the structure of your site and you can modify regex-urlfilter.txt to reach your goal. From the examples you gave, you can do this:

-^http://ww.mywebsite.com/[^/]*$

It will exclude http://ww.mywebsite.com/alpha, http://ww.mywebsite.com/beta, http://ww.mywebsite.com/gamma

-^http://ww.mywebsite.com/.*/$

This will exclude any URL that ends with /

I would suggest you get familiar with regular expressions (in case you aren't yet).

Remi

On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi <alessio.crisant...@gmail.com> wrote:

Dear All,

I would like to change my crawling operation but I don't know how. To crawl my website I used the following command:

$ bin/nutch crawl urls -solr http://localhost:8983/solr -threads 3 -depth 35 -topN 10

to crawl with Nutch and index the results in Solr. But I don't want to crawl the section pages of my website, only the individual pages. For example, consider a site www.mywebsite.com composed of 3 sections:

http://ww.mywebsite.com/alpha
http://ww.mywebsite.com/beta
http://ww.mywebsite.com/gamma

I want my results to contain only the individual article pages, not the article lists of these directories. So I would like, for example, these files to be parsed:

http://ww.mywebsite.com/alpha/artcle1.html
http://ww.mywebsite.com/alpha/artcle3.html
...

and I don't want the parent section to be parsed:

http://ww.mywebsite.com/alpha/

How can I do this? Any suggestions? Sorry if this is not all clear.

Thank you,
alessio
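A quick way to reason about the patterns in this exchange is to run them against the example URLs. The Python sketch below does only that; it does not model Nutch itself. The two quoted patterns come from the reply above, while `INDEX_PAGE` is an assumption added here to illustrate one possible way of matching a section's index.php page; the helper `matching` and the URL list are likewise illustrative.

```python
import re

SECTION = r"^http://ww.mywebsite.com/[^/]*$"       # first pattern from the reply
TRAILING_SLASH = r"^http://ww.mywebsite.com/.*/$"  # second pattern from the reply
INDEX_PAGE = r"/index\.php$"                       # assumption, not from the thread

urls = [
    "http://ww.mywebsite.com/alpha",
    "http://ww.mywebsite.com/alpha/",
    "http://ww.mywebsite.com/alpha/artcle1.html",
    "http://ww.mywebsite.com/beta/index.php",
]

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [u for u in candidates if re.search(pattern, u)]

for name, pattern in [("section", SECTION),
                      ("slash", TRAILING_SLASH),
                      ("index", INDEX_PAGE)]:
    print(name, matching(pattern, urls))
```

One likely reason the crawl stopped early with the first two rules is that a page rejected by the URL filter is not fetched, so the article links it contains are never discovered. If that is the case here, excluding the section pages at crawl time would also hide the articles, and the exclusion may need to happen at a later stage instead.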
Re: nutch crawling file system SOLVED
This is what is returned after crawling with Nutch and indexing in Solr:

<doc>
<float name="boost">0.298293</float>
<str name="content">Index of C:\Documents and Settings\Alessio\Documenti Index of C:\Documents and Settings\Alessio\Documenti
../  -  -  -
003_C_001_Alessio_2004_08_13.dvf  Tue, 17 Aug 2004 20:09:52 GMT  151552
13542 Christmas on Stars Affiliate Banners 21.11/  Fri, 02 Dec 2011 20:45:37 GMT  -
5 GETTONI 22 EURO.doc  Mon, 08 Sep 2003 16:18:58 GMT  20480
adesione 191 con password.doc  Fri, 29 Aug 2003 13:45:00 GMT  116736
art inceneritore.doc  Sun, 18 Feb 2007 18:08:54 GMT  32768
articoli/  Wed, 26 Apr 2006 08:12:43 GMT  -
auguri pcplayer.it.jpg  Sun, 24 Dec 2006 20:56:48 GMT  13888
Bluetooth Exchange Folder/  Wed, 22 Mar 2006 19:50:24 GMT  -
BluetoothXpSp2.pdf  Mon, 11 Jul 2005 08:24:58 GMT  120812
Brevettato il disco volante.doc  Wed, 30 Aug 2006 13:57:19 GMT  20992
Busta Rhymes feat maria c.mp3  Sat, 05 Jul 2003 12:17:58 GMT  2595736
CALENDARIO FANTA2005.doc  Wed, 08 Sep 2004 13:13:20 GMT  141824
Cartella Scambio Bluetooth/  Mon, 20 Mar 2006 22:03:45 GMT  -
cc_20111224_234440.reg  Sat, 24 Dec 2011 22:44:44 GMT  1860
CD musicali 01 -01- 2003.xls  Mon, 27 Jan 2003 00:10:00 GMT  515584
CLASSIFICA fantacalcio 2005.doc  Wed, 08 Sep 2004 13:15:38 GMT  43008
Collegamento a My Shared Folder.lnk  Thu, 29 Sep 2005 16:56:56 GMT  533
conte tagliaferri.doc  Thu, 09 Oct 2008 00:00:36 GMT  29184
Corel User Files/  Sun, 16 Apr 2006 11:56:24 GMT  -
Curriculum ANGELO CONTILI.doc  Wed, 19 Jan 2005 22:35:34 GMT  42496
currriculum Alessio agg2004.doc  Thu, 27 Jan 2005 21:56:06 GMT  950784
currriculum Alessio.doc  Thu, 10 Apr 2003 19:29:28 GMT  44544
Default.rdp  Mon, 11 Sep 2006 17:13:17 GMT  1166
desktop.ini  Wed, 28 Sep 2011 20:52:02 GMT  75
DNSadsl.txt  Mon, 01 Aug 2005 13:21:52 GMT  942
DNStabella.xls  Mon, 01 Aug 2005 13:16:26 GMT  33792
Download/  Fri, 09 Mar 2012 23:03:54 GMT  -
Eseguibili JAVA.doc  Mon, 20 Jun 2005 11:58:42 GMT  23552
FANTACALCIO/  Mon, 20 Mar 2006 22:04:34 GMT  -
Fax/  Mon, 20 Mar 2006 22:04:40 GMT  -
File ricevuti/  Mon, 19 Oct 2009 12:50:30 GMT  -
FINALE TORNEO 06.doc  Tue, 08 May 2007 17:47:41 GMT  49664
Finest/  Sat, 06 Mar 2010 15:05:46 GMT  -
FORMAZIONItipo2005.doc  Mon, 13 Sep 2004 20:26:56 GMT  49664
free3gp/  Tue, 25 May 2010 16:23:43 GMT  -
Futurando/  Mon, 20 Mar 2006 09:59:24 GMT  -
GOL/  Mon, 20 Mar 2006 22:24:46 GMT  -
guidadownloadconmirc.doc  Sun, 20 Feb 2005 10:29:40 GMT  264704
HAPPY DAYS/  Mon, 20 Mar 2006 21:45:36 GMT  -
Happy Days2007/  Sun, 27 Jan 2008 15:53:33 GMT  -
hijackthis.log  Fri, 04 Jul 2008 08:49:37 GMT  8573
Immagini/  Wed, 28 Sep 2011 20:52:03 GMT  -
Immagini.lnk  Fri, 15 Aug 2008 16:26:58 GMT  375
intervisteEnada/  Mon, 20 Mar 2006 10:36:53 GMT  -
IP Pentima.txt  Fri, 01 Jul 2005 08:19:56 GMT  99
L'AUTOMATICO/  Sun, 14 Jan 2007 17:30:41 GMT  -
lavatr1h.mp3  Wed, 03 Oct 2001 15:19:52 GMT  2586624
lionsleeps_hq.wmv  Tue, 17 May 2005 13:29:02 GMT  1842905
lista flip.docx  Sat, 29 Mar 2008 14:04:24 GMT  13559
Masterizzare giochi con NERO BURNING ROM.doc  Sun, 06 Mar 2005 19:32:28 GMT  23040
masterizzarre CD protetti.txt  Thu, 20 Jan 2005 20:36:00 GMT  2326
Matlab 65 serial.txt  Thu, 09 Oct 2003 22:34:00 GMT  86
MessageLog.xsl  Sun, 21 Dec 2008 20:45:03 GMT  12160
mirc istruz.txt  Sun, 09 Mar 2003 16:44:00 GMT  1123
Musica/  Wed, 28 Sep 2011 20:52:04 GMT  -
My Skype Content/  Sat, 06 May 2006 12:04:04 GMT  -
My Skype Pictures/  Wed, 27 Apr 2011 19:54:03 GMT  -
My Skype Received Files/  Thu, 18 May 2006 16:50:34 GMT  -
natale_flip.jpg  Sat, 23 Dec 2006 17:18:52 GMT  118507
niagara.JPG  Fri, 18 Aug 2006 16:53:48 GMT  1017782
niagara2.JPG  Fri, 18 Aug 2006 16:53:44 GMT  988143
Norton AntiVirus_Key.txt  Sun, 31 Oct 2004 19:28:24 GMT  357
postepay.txt  Wed, 16 Jul 2008 07:48:38 GMT  16
presentazione_FB.pdf  Thu, 09 Mar 2006 08:58:00 GMT  700629
richiesta.doc  Sun, 16 Nov 2003 18:14:44 GMT  124928
ROSE FANTACALCIO 2005.doc  Wed, 08 Sep 2004 13:59:54 GMT  45568
scudettoicona.ico  Mon, 22 Sep 2003 19:55:10 GMT  13502
serial_akkxMDYwMTE0ODM5.txt  Tue, 05 Aug 2003 20:19:26 GMT  155
Siti Web/  Sun, 03 Jun 2007 13:21:44 GMT  -
SitoTernanaGiochi/  Fri, 25 May 2007 19:52:51 GMT  -
sitoTGver1.1.pub  Sun, 06 Mar 2005 20:23:56 GMT  1637888
starry(d).jpg  Sun, 02 Apr 2006 10:10:26 GMT  2138166
suonerie/  Fri, 15 Aug 2008 16:29:09 GMT  -
Symantec/  Sun, 13 Aug 2006 12:09:04 GMT  -
Thumbs.db  Sun, 11 Feb 2007 14:45:34 GMT  71168
vecchioDocumenti/  Wed, 14 Jul 2010 15:42:28 GMT  -
virtualDub/  Mon, 20 Mar 2006 10:16:47 GMT  -
Voice Files/  Mon, 27 Mar 2006 11:57:35 GMT  -
ZbThumbnail.info  Mon, 09 Jun 2008 08:25:30 GMT  2920
zurigo.doc  Thu, 13 Apr 2006 15:24:45 GMT  27648
</str>
<str name="digest">6717a734c4f78c7f7f2dbc9a7324199e</str>
<str name="id">file:/C:/Documents and Settings/Alessio/Documenti/</str>
<str name="segment">20120317175631</str>
<str name="title">Index of C:\Documents and Settings\Alessio\Documenti</str>
<date name="tstamp">2012-03-17T16:56:39.014Z</date>
<str name="url">file:/C:/Documents and Settings/Alessio/Documenti/</str>
</doc>

Any suggestions?
Thanks,
alessio

On 12 March 2012 at 09:39, alessio crisantemi alessio.crisant...@gmail.com wrote:

I add the path
Re: nutch crawling file system SOLVED
I would like my search results to show the text of my PDF files, not the list of documents in the directory and its path.

On 17 March 2012 at 21:11, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Alessio,

On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi alessio.crisant...@gmail.com wrote:

suggestions?

For what?
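To make the request above concrete: the document shown earlier in this thread is the directory listing page itself (its URL ends in "/" and its title starts with "Index of"), not a parsed file. Below is a minimal Python sketch, not part of Nutch or Solr, that separates such listing documents from file documents after a search. The dictionary layout and the title of the second entry are assumptions made for illustration; only the URLs and the "Index of" title come from the thread.

```python
def is_directory_listing(doc):
    """Heuristic based on the listing document in this thread:
    its URL ends in '/' and its title starts with 'Index of'."""
    url = doc.get("url", "")
    title = doc.get("title", "").strip()
    return url.endswith("/") or title.startswith("Index of")


# Hypothetical search results, already converted to plain dictionaries.
docs = [
    {
        "url": "file:/C:/Documents and Settings/Alessio/Documenti/",
        "title": "Index of C:\\Documents and Settings\\Alessio\\Documenti",
    },
    {
        # The title here is invented; a real one depends on the PDF parser.
        "url": "file:/C:/Documents and Settings/Alessio/Documenti/BluetoothXpSp2.pdf",
        "title": "BluetoothXpSp2",
    },
]

# Keep only documents that look like real files.
file_docs = [d for d in docs if not is_directory_listing(d)]
print([d["url"] for d in file_docs])
```

This only hides the listing pages from the result list; it does not explain why the PDF text itself is missing from the index.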
Re: nutch crawling file system SOLVED
I added the path of my directory to regex-urlfilter, but Nutch still crawls other directories as well. Also, I followed your suggestions and indexed my root again, but the index still contains the names of my PDF files and not their content. I don't understand why.
alessio

On 12 March 2012 at 06:06, remi tassing tassingr...@gmail.com wrote:

Using crawl-urlfilter (or regex-urlfilter, depending on which one you're using), you should be able to solve this. Unless you're not clear on which folders to exclude...?

On Sunday, March 11, 2012, alessio crisantemi alessio.crisant...@gmail.com wrote:

Thank you Remi for your precious help. I will try again and write back with the results. But I have another small question: how can I limit the crawl to my selected root only? Nutch always crawls the parent directories as well. I read that the code responsible for this is in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f), and someone suggested changing the following line:
this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
to
this.content = list2html(f.listFiles(), path, false);
and recompiling. But in my class file I have just this line... and that is not a simple way to do it. Is there another method, I suppose?
Thank you,
alessio

On 11 March 2012 at 18:32, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Please see below

On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi alessio.crisant...@gmail.com wrote:

[1] http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F

I've now updated this link, thanks for pointing this out.

And now I have another problem. I crawled my local file system: a directory with a lot of PDF files. Everything works, and Nutch indexes the results in Solr. But here is the problem: when I submit a query to Solr, I see only a list of files and not the PDF contents. Why, in your opinion?
Well, this might be to do with your file.content.limit in nutch-site.xml; maybe your documents are being truncated if they are too large. Additionally, your Solr mappings and/or schema configuration may need to be tweaked slightly to permit you to view snippets of the PDF content within your Solr search results. In your schema configuration for index-basic, try changing
<field name="content" type="text" stored="false" indexed="true"/>
to
<field name="content" type="text" stored="true" indexed="true"/>
You will need to reindex your content if you wish to see the results through Solr.
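On the filter question at the top of this message: Nutch's regex URL filter reads its rules from top to bottom, the first rule that matches a URL decides whether it is kept ("+") or dropped ("-"), and a URL matching no rule is dropped. The Python sketch below only imitates that behaviour so a rule set can be tried outside Nutch; it is not Nutch code, and Python's regex dialect is not identical to Java's. The directory is the one from the Solr output in this thread, and the exact URL form that a given crawl produces (file:/C:/... versus file:///C:/...) is an assumption to check against your own crawl.

```python
import re

# Rule text in the style of regex-urlfilter.txt: '+' accepts, '-' rejects.
RULES = """
# accept only the chosen directory and everything beneath it
+^file:/C:/Documents and Settings/Alessio/Documenti/
# reject everything else, including the parent directories
-.
"""


def parse_rules(text):
    """Turn rule text into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """The first matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES)
for url in [
    "file:/C:/Documents and Settings/Alessio/Documenti/zurigo.doc",
    "file:/C:/Documents and Settings/Alessio/",
    "file:/C:/",
]:
    print(url, "->", "kept" if accepts(url, rules) else "dropped")
```

Rule order matters here: if a catch-all "+." line came before the directory rule, every URL would be accepted and the parent directories would still be crawled.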
Re: nutch crawling file system SOLVED
Thank you Remi for your precious help. I will try again and write back with the results. But I have another small question: how can I limit the crawl to my selected root only? Nutch always crawls the parent directories as well. I read that the code responsible for this is in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f), and someone suggested changing the following line:
this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
to
this.content = list2html(f.listFiles(), path, false);
and recompiling. But in my class file I have just this line... and that is not a simple way to do it. Is there another method, I suppose?
Thank you,
alessio

On 11 March 2012 at 18:32, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Please see below

On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi alessio.crisant...@gmail.com wrote:

[1] http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F

I've now updated this link, thanks for pointing this out.

And now I have another problem. I crawled my local file system: a directory with a lot of PDF files. Everything works, and Nutch indexes the results in Solr. But here is the problem: when I submit a query to Solr, I see only a list of files and not the PDF contents. Why, in your opinion?

Well, this might be to do with your file.content.limit in nutch-site.xml; maybe your documents are being truncated if they are too large. Additionally, your Solr mappings and/or schema configuration may need to be tweaked slightly to permit you to view snippets of the PDF content within your Solr search results. In your schema configuration for index-basic, try changing
<field name="content" type="text" stored="false" indexed="true"/>
to
<field name="content" type="text" stored="true" indexed="true"/>
You will need to reindex your content if you wish to see the results through Solr.
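Both settings Lewis mentions can be checked before re-crawling. The Python sketch below, using only the standard library, reads file.content.limit from a nutch-site.xml style document and the stored attribute of a field from a Solr schema. The two XML strings are small hypothetical stand-ins for the real files, and the value 65536 is just an example; in practice you would load your own conf/nutch-site.xml and schema.xml instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for conf/nutch-site.xml (example value only).
NUTCH_SITE = """<configuration>
  <property>
    <name>file.content.limit</name>
    <value>65536</value>
  </property>
</configuration>"""

# Hypothetical stand-in for the relevant part of Solr's schema.xml.
SCHEMA = """<schema name="example">
  <fields>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="title" type="text" stored="true" indexed="true"/>
  </fields>
</schema>"""


def nutch_property(xml_text, name):
    """Return the value of a Nutch property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def field_is_stored(xml_text, field_name):
    """Return True only if the named schema field has stored="true"."""
    root = ET.fromstring(xml_text)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            return field.get("stored") == "true"
    return False


limit = int(nutch_property(NUTCH_SITE, "file.content.limit"))
print("file.content.limit:", limit)
print("content stored:", field_is_stored(SCHEMA, "content"))
print("title stored:", field_is_stored(SCHEMA, "title"))
```

If the content field is not stored, Solr can still match queries against it but cannot return its text in the results, which fits the symptom of seeing titles without content.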
Re: nutch crawling file system SOLVED
I have partially solved it. Following the tutorial, I configured Nutch to crawl a local file system, thank you. But I have a doubt: why do all the tutorials and guides about Nutch talk about a crawl-urlfilter.txt file, when the default Nutch configuration doesn't include this file? If I put the rules that the guide gives for crawl-urlfilter into regex-urlfilter instead, everything works. I would like to understand why.
Thank you,
alessio

On 04 March 2012 at 17:02, alessio crisantemi alessio.crisant...@gmail.com wrote:

Hi all, I need to crawl a directory with a lot of PDF files, but I only know the step-by-step procedure for crawling a website. How can I do this for a root directory? Thank you for your help,
alessio
Re: nutch craling file system
Thank you for this fast reply! I use Solr 1.4.1 and Nutch 1.4. Do these solutions work with those versions?
Thanks,
a.

On 04 March 2012 at 17:06, remi tassing tassingr...@gmail.com wrote:

Please try Googling that first! If you don't find anything, then try these:
[1] http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
[2] http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
[3] http://stackoverflow.com/questions/941519/how-to-make-nutch-crawl-file-system

On Sun, Mar 4, 2012 at 5:02 PM, alessio crisantemi alessio.crisant...@gmail.com wrote:

Hi all, I need to crawl a directory with a lot of PDF files, but I only know the step-by-step procedure for crawling a website. How can I do this for a root directory? Thank you for your help,
alessio