[General] Webboard: exclude mime types

2016-10-13 Thread bar
Author: fabien
Email: fabien.lahau...@gmail.com
Message:
Hi,

I tried the Disallow statements today, and they work like a charm! :)
I can now exclude the typical useless URLs before they get downloaded by the
indexer.

Thanks for your help and for your work!


___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: exclude mime types

2016-10-12 Thread bar
Author: Alexander Barkov
Email: 
Message:
> And to be more precise, what I ultimately want is to index only HTML pages and
> not any of the other types of data (CSS/JS/pictures/PDF/RSS/...).
> 

Something like this should do the trick:

NoIndexIf NoMatch Content-Type text/html*
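
As a rough illustration of how such a rule reads, the following Python sketch mimics the decision for a single document. It is not mnoGoSearch code: the function name skip_indexing and the use of Python's fnmatch module are assumptions made for illustration, and the case-insensitive wildcard comparison is inferred from the "Match Wild Insensitive" wording that indexer prints in its log. Whether the real Content-Type value carries a "; charset=..." suffix is also an assumption; the trailing * in the pattern would cover it either way.

```python
from fnmatch import fnmatchcase


def skip_indexing(content_type: str, pattern: str, negate: bool = False) -> bool:
    """Return True when a NoIndexIf-style rule would mark a document as not for indexing.

    negate=False models "NoIndexIf Content-Type <pattern>".
    negate=True models "NoIndexIf NoMatch Content-Type <pattern>".
    """
    # Case-insensitive wildcard comparison (assumed, see the note above).
    matched = fnmatchcase(content_type.lower(), pattern.lower())
    return not matched if negate else matched


for ctype in ("text/html", "text/html; charset=utf-8", "application/rss+xml", "text/css"):
    verdict = "skip" if skip_indexing(ctype, "text/html*", negate=True) else "index"
    print(ctype, "->", verdict)
```

With the NoMatch form, everything that is not text/html is skipped, so there is no need to list css, js, pdf and so on one by one.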


Additionally, try using the Disallow command to reduce the number of URLs that
indexer actually has to download.
See here for details:
http://www.mnogosearch.org/board/message.php?id=21793


> Fabien.
> 





[General] Webboard: exclude mime types

2016-10-12 Thread bar
Author: Alexander Barkov
Email: 
Message:
> > Thanks for your quick answer.
> > 
> > I tried adding the NoIndexIf command, but I cannot get it to work.
> > 
> > I used the default indexer.conf file and added the following two lines at
> > the end of that file:
> > Server http://www.wearethelous.com/feed/
> > NoIndexIf Content-Type application/rss+xml
> 
> I tried the same thing, and it seems to work fine.
> This page is not returned in search results.
> 
> If I remove the NoIndexIf command, this page IS returned by search results.
> 
> 
> Note that indexer shows the URL in its log because it still has to
> download this URL to learn its content type.
> But the fact that you can see the "SectionFilter:..." line in the log
> tells you that indexer marks it as "not for indexing" and thus stores no data
> in the underlying tables cachedcopy and bdicti, so "indexer --index" later
> does not see it when creating the search index.
> 


Note: if you know that documents under a certain location return
application/rss+xml or some other undesired content type,
then consider using Disallow instead. In that case indexer will
not even download these documents.

NoIndexIf is meant for cases where it's not possible to describe "bad"
documents by their URL pattern.
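
To make the difference concrete, here is a small Python sketch of the two stages. It only models the idea and is not how indexer is implemented: crawl_decisions, fake_fetch, the example.com URLs and the "*.css" pattern are all hypothetical names chosen for illustration. A URL rejected by its pattern never triggers a download, while a content-type rule can only be applied after the document has been fetched.

```python
from fnmatch import fnmatchcase
from typing import Callable, Iterable, List, Tuple


def crawl_decisions(
    urls: Iterable[str],
    disallow_patterns: Iterable[str],
    noindex_types: Iterable[str],
    fetch_content_type: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair every URL with what a two-stage filter would do with it."""
    disallow = [p.lower() for p in disallow_patterns]
    noindex = [t.lower() for t in noindex_types]
    decisions = []
    for url in urls:
        # Stage 1: URL pattern check, no network traffic needed.
        if any(fnmatchcase(url.lower(), p) for p in disallow):
            decisions.append((url, "disallowed, not downloaded"))
            continue
        # Stage 2: the document must be downloaded to learn its type.
        ctype = fetch_content_type(url).lower()
        if any(fnmatchcase(ctype, t) for t in noindex):
            decisions.append((url, "downloaded, not indexed"))
        else:
            decisions.append((url, "downloaded, indexed"))
    return decisions


# Hypothetical stand-in for real HTTP requests.
FAKE_TYPES = {
    "http://www.example.com/": "text/html",
    "http://www.example.com/style.css": "text/css",
    "http://www.example.com/news?format=rss": "application/rss+xml",
}
downloads = []


def fake_fetch(url: str) -> str:
    downloads.append(url)  # record every simulated download
    return FAKE_TYPES[url]


for url, decision in crawl_decisions(FAKE_TYPES, ["*.css"], ["application/rss+xml"], fake_fetch):
    print(url, "->", decision)
```

In this toy run the stylesheet is never fetched, whereas the RSS URL has no telltale pattern and must be downloaded before it can be rejected by type.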







[General] Webboard: exclude mime types

2016-10-12 Thread bar
Author: Alexander Barkov
Email: 
Message:
> Thanks for your quick answer.
> 
> I tried adding the NoIndexIf command, but I cannot get it to work.
> 
> I used the default indexer.conf file and added the following two lines at
> the end of that file:
> Server http://www.wearethelous.com/feed/
> NoIndexIf Content-Type application/rss+xml

I tried the same thing, and it seems to work fine.
This page is not returned in search results.

If I remove the NoIndexIf command, this page IS returned by search results.


Note that indexer shows the URL in its log because it still has to
download this URL to learn its content type.
But the fact that you can see the "SectionFilter:..." line in the log
tells you that indexer marks it as "not for indexing" and thus stores no data in
the underlying tables cachedcopy and bdicti, so "indexer --index" later does
not see it when creating the search index.
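
If you want to check this for many URLs at once, a small script can pair each "URL:" line in the log with a later "SectionFilter: NoIndexIf" line from the same process and thread. The following Python sketch assumes the "[pid]{thread} message" layout seen in your log, with unquoted and unwrapped lines; the function name urls_marked_noindex is made up for illustration and is not part of mnoGoSearch.

```python
import re
from typing import Dict, Iterable, List, Tuple

# "[71600]{01} message" -> pid, thread id, message text
LOG_LINE = re.compile(r"^\[(\d+)\]\{(\S+)\}\s+(.*)$")


def urls_marked_noindex(log_lines: Iterable[str]) -> List[str]:
    """Return URLs that are followed by a SectionFilter NoIndexIf line."""
    current: Dict[Tuple[str, str], str] = {}  # (pid, thread) -> URL in progress
    flagged: List[str] = []
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not follow the assumed layout
        key = (m.group(1), m.group(2))
        message = m.group(3)
        if message.startswith("URL: "):
            current[key] = message[len("URL: "):].strip()
        elif message.startswith("SectionFilter: NoIndexIf") and key in current:
            flagged.append(current[key])
    return flagged


sample = [
    "[71600]{01} URL: http://www.wearethelous.com/feed/",
    "[71600]{01} Status: 200 OK",
    "[71600]{01} SectionFilter: NoIndexIf Match Wild Insensitive 'Content-Type' 'application/rss+xml'",
]
print(urls_marked_noindex(sample))
```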

> 
> I got the following log:
> 
> [71598]{--} Clearing
> [71598]{--} Clearing done   0.01
> [71600]{--} indexer from mnogosearch-3.4.1-mysql-pqsql started with 
> '/etc/mnogosearch/indexer.conf'
> [71600]{01} URL: http://www.wearethelous.com/feed/
> [71600]{01} Server Path Allow 'http://www.wearethelous.com/feed/'
> [71600]{01} Allow by default
> [71600]{01} ROBOTS: http://www.wearethelous.com/robots.txt
> [71600]{01} Request.Accept-Encoding: gzip,deflate,compress
> [71600]{01} Request.Host: www.wearethelous.com
> [71600]{01} Request.User-Agent: MnoGoSearch/3.4.1
> [71600]{01} Response.Connection: close
> [71600]{01} Response.Content-Encoding: gzip
> [71600]{01} Response.Content-Length: 67
> [71600]{01} Response.Content-Type: text/plain
> [71600]{01} Response.Date: Wed, 12 Oct 2016 20:42:46 GMT
> [71600]{01} Response.Link: ; 
> rel="https://api.w.org/;
> [71600]{01} Response.ResponseLine: HTTP/1.1 200 OK
> [71600]{01} Response.ResponseSize: 475
> [71600]{01} Response.ResponseTime: 2261
> [71600]{01} Response.Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 
> OpenSSL/1.0.1e-fips mod_bwlimited/1.4
> [71600]{01} Response.Server-Charset: utf-8
> [71600]{01} Response.Status: 200
> [71600]{01} Response.URL: http://www.wearethelous.com/robots.txt
> [71600]{01} Response.URL_ID: 1928115922
> [71600]{01} Response.Vary: Accept-Encoding,User-Agent
> [71600]{01} Response.X-Powered-By: PHP/5.5.29
> [71600]{01} Response.X-Robots-Tag: noindex, follow
> [71600]{01} Request.Accept-Encoding: gzip,deflate,compress
> [71600]{01} Request.Host: www.wearethelous.com
> [71600]{01} Request.User-Agent: MnoGoSearch/3.4.1
> [71600]{01} Response.body: 
> [71600]{01} Response.Charset: 
> [71600]{01} Response.Connection: close
> [71600]{01} Response.Content-Encoding: gzip
> [71600]{01} Response.Content-Language: 
> [71600]{01} Response.Content-Length: 2337
> [71600]{01} Response.Content-Type: application/rss+xml
> [71600]{01} Response.crc32: 0
> [71600]{01} Response.crc32old: 0
> [71600]{01} Response.Date: Wed, 12 Oct 2016 20:42:48 GMT
> [71600]{01} Response.ETag: "7059155a990290887650add31475f88e"
> [71600]{01} Response.Hops: 0
> [71600]{01} Response.ID: 5
> [71600]{01} Response.ilinktext: 
> [71600]{01} Response.Last-Modified: Thu, 29 Sep 2016 12:48:50 GMT
> [71600]{01} Response.Link: ; 
> rel="https://api.w.org/;
> [71600]{01} Response.MaxDocPerSite: 0
> [71600]{01} Response.MaxHops: 256
> [71600]{01} Response.meta.description: 
> [71600]{01} Response.meta.keywords: 
> [71600]{01} Response.msg.from: 
> [71600]{01} Response.msg.subject: 
> [71600]{01} Response.msg.to: 
> [71600]{01} Response.PrevStatus: 0
> [71600]{01} Response.ResponseLine: HTTP/1.1 200 OK
> [71600]{01} Response.ResponseSize: 2842
> [71600]{01} Response.ResponseTime: 1455
> [71600]{01} Response.Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 
> OpenSSL/1.0.1e-fips mod_bwlimited/1.4
> [71600]{01} Response.Server-Charset: utf-8
> [71600]{01} Response.Server_id: -2050898686
> [71600]{01} Response.Status: 200
> [71600]{01} Response.title: 
> [71600]{01} Response.URL: http://www.wearethelous.com/feed/
> [71600]{01} Response.url.file: 
> [71600]{01} Response.url.host: 
> [71600]{01} Response.url.path: 
> [71600]{01} Response.url.proto: 
> [71600]{01} Response.URL_ID: -2050898686
> [71600]{01} Response.Vary: Accept-Encoding,User-Agent
> [71600]{01} Response.X-Powered-By: PHP/5.5.29
> [71600]{01} Response.X-Robots-Tag: noindex, follow
> [71600]{01} Status: 200 OK
> [71600]{01} Guesser: Lang: , Charset: utf-8
> [71600]{01} SectionFilter: NoIndexIf Match Wild Insensitive 'Content-Type' 
> 'application/rss+xml'
> [71600]{01} Flushing word cache
> [71600]{01} Flushing word cache done0.00
> [71600]{01} Done (4 seconds, 1 documents, 2842 bytes,  0.69 Kbytes/sec.)
> 
> I see that the section filter mentions the NoIndexIf filter that I added,
> but the URL is still indexed.
> So what could be wrong?
> 
> Thanks in advance for your help.
> Fabien.
> 
> 
> > Hi,
> > 
> > > Hi all,
> > > 
> > > Is it possible to exclude certain mime types 

[General] Webboard: exclude mime types

2016-10-12 Thread bar
Author: fabien
Email: fabien.lahau...@gmail.com
Message:
And to be more precise, what I ultimately want is to index only HTML pages and
not any of the other types of data (CSS/JS/pictures/PDF/RSS/...).

Fabien.

[General] Webboard: exclude mime types

2016-10-12 Thread bar
Author: Alexander Barkov
Email: 
Message:
Hi,

> Hi all,
> 
> Is it possible to exclude certain MIME types such as RSS feeds?
> 

This can be done using the NoIndexIf command:

http://www.mnogosearch.org/doc34/msearch-cmdref-noindexif.html

Put this command into indexer.conf to disallow a certain Content-Type:

NoIndexIf Content-Type application/rss+xml


Another option is to use NoIndexIf in combination with a user-defined
section, to check raw content fragments:

http://www.mnogosearch.org/doc34/msearch-cmdref-section.html#cmdref-section-user-defined

The idea is to define a user section using a regex pattern to catch some known 
RSS text fragments, and then use NoIndexIf with this section.
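
To show the kind of regex that could serve here, this Python sketch tests raw content for typical feed markers. It is only an illustration of the pattern idea, not Section syntax for indexer.conf: the names RSS_MARKER and looks_like_rss, the 2048-character window and the choice of the <rss ...> and <feed ...> root elements as markers are all assumptions.

```python
import re

# Root elements commonly found at the top of RSS and Atom feeds.
RSS_MARKER = re.compile(r"<rss\b[^>]*>|<feed\b[^>]*>", re.IGNORECASE)


def looks_like_rss(raw: str, window: int = 2048) -> bool:
    """Return True when a feed root element appears near the start of the content."""
    # Only the beginning is inspected: the root element comes first in a feed,
    # and this avoids matching markup quoted deep inside an HTML page.
    return RSS_MARKER.search(raw[:window]) is not None


feed = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel></channel></rss>'
page = "<html><head><title>Feed me</title></head><body>Hello</body></html>"
print(looks_like_rss(feed), looks_like_rss(page))
```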


> Thanks in advance,
> Fabien.




[General] Webboard: exclude mime types

2016-10-12 Thread bar
Author: fabien
Email: fabien.lahau...@gmail.com
Message:
Hi all,

Is it possible to exclude certain MIME types such as RSS feeds?

Thanks in advance,
Fabien.
