[General] Webboard: if...then block in template

2013-02-28 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:


> I'd like to have a block in the results template that would only display if I 
> had less than 5 results for the current search. Is this possible? Can someone 
> point me towards documentation supporting this?
> 
> Thanks

Conditional operators in templates are described here:
http://www.mnogosearch.org/doc33/msearch-templates-oper.html


I think what you're looking for is:


I AM HERE


Note, this code is OK for the results section of the template (the part that is
only printed when search results are displayed).


If you want to put this in the "<!--top-->" section, or any other section
which is printed even if no search has been done
(i.e. the start page with an empty form, or when no words have
been typed), then you'll need an additional condition.
The "WE" variable (word hit statistics) is a good candidate
for this:




I AM HERE
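A minimal sketch of such a condition, assuming the <!IF>/<!ELSE>/<!ENDIF>
operators described on the documentation page above, and assuming that an empty
"WE" value means that no search has been performed yet:

<!IF NAME="WE" CONTENT="">
<!-- no word statistics: no search has been done, so print nothing here -->
<!ELSE>
I AM HERE
<!ENDIF>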





Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: search.cgi crashes with buffer overflow

2013-02-28 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
This bug is most likely fixed in 3.3.13. Please upgrade,
and report back if the problem remains.

From Changelog:
http://www.mnogosearch.org/doc33/msearch-changelog.html#changelog-3-3-13
* Bug#4803 "buffer overflow detected with search.cgi" was fixed. 


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: PGExec error when having a ' in the file name

2013-02-28 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> It seems that postgresql has changed the way they handle escape characters. 
> When setting 'standard_conforming_strings = off' in postgresql.conf the error 
> is gone and indexer seem to finish successfully.
> 
> For more see 
> http://www.postgresql.org/docs/9.1/static/sql-syntax-lexical.html.

This problem was fixed in 3.3.13.
From ChangeLog:

Improved compatibility with the latest versions of PostgreSQL. Escaping of the 
SQL character literals for PostgreSQL >= 9 was changed from the C-like 
style (using backslash) to the standard SQL style. 
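For illustration (generic PostgreSQL syntax, not a quote from the ChangeLog),
the two quoting styles differ like this when a literal contains a quote character:

-- old, C-like escape-string style (relies on backslash escaping):
SELECT E'O\'Brien';

-- standard SQL style, used for PostgreSQL >= 9 since 3.3.13 (the quote is doubled):
SELECT 'O''Brien';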

Please upgrade.

> 
> Unfortunately search.cgi crashes. 

Can you please provide more information?
Thanks.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Fix PHP Extension Debian Squeeze 64bit

2013-02-28 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi, 
> 
> anyone having trouble compiling the php extension on a 64-bit system with a 
> relocation / fPIC error caused by libmnogosearch?
> 
> do this: run the install.pl script with shared lib creation turned on 
> and set these environment variables before running make.
> 
> export CC='gcc -fPIC'
> export CXX='g++ -fPIC'
> 
> then go into the php extensions source directory and run phpize, configure, 
> make.
> 
> Hope this helps you save some time.

Just tried on Ubuntu 64, it compiled without problems.

Can you please post output of these commands:

udm-config --version
udm-config --libs
udm-config --cflags


Also, try without install.pl, just run configure directly,
on a fresh source tree. For example,

rm -rf mnogosearch-3.3.13
tar -zxf mnogosearch-3.3.13.tar.gz
cd mnogosearch-3.3.13
./configure --with-mysql --prefix=/tmp/mnogosearch33test
make
make install
cd php
phpize
./configure --with-mnogosearch=/tmp/mnogosearch33test
make


> 
> Cheers
> Jens


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: No MySQL in latest 3.3.13 snapshot

2013-03-06 Thread bar
Author: Amar Bouchibane
Email: 
Message:
Hi Alexander,

sorry, this was my fault: the MySQL libraries weren't there anymore!
So, the building of mnoGoSearch works when I add "--with-mysql=...".

best regards,
Amar

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: New office suite parsers for mnoGoSearch (*.docx), (*.pptx), (*.xlsx), (*.wps), (*.wpd) , (*.odt) and (*.sxw)

2013-04-17 Thread bar
Author: Yannick
Email: yl...@laposte.net
Message:

New office suite parsers for mnoGoSearch (*.docx), (*.pptx), (*.xlsx), (*.wps), 
(*.wpd) , (*.odt) and (*.sxw)


*** Microsoft  ***

PATCH to add docx2txt (*.docx) (MS Word 2007 and later) parser
configuration.
http://www.mnogosearch.org/bugs/bugs.php?id=4815

PATCH to add pptx2txt (*.pptx) (MS Powerpoint 2007 and later) parser
configuration
http://www.mnogosearch.org/bugs/bugs.php?id=4816

PATCH to add xlsx2csv (*.xlsx) (MS Excel 2007 and later) parser configuration.
http://www.mnogosearch.org/bugs/index.php?id=4823

PATCH to add libwps (*.wps) parser configuration for MS Works docs.
http://www.mnogosearch.org/bugs/index.php?id=4817

*** Corel ***

PATCH to add libwpd (*.wpd) parser configuration for indexing WordPerfect docs
http://www.mnogosearch.org/bugs/index.php?id=4814

*** LibreOffice / OpenOffice / StarOffice ***

PATCH to add odfreader parser configuration (parse *.odt docs)
http://www.mnogosearch.org/bugs/index.php?id=4813

PATCH to fix dead URLs for parsers and add SofficeToHtml parser configuration
http://www.mnogosearch.org/bugs/index.php?id=4810


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index all post in the forum or website category?

2013-04-21 Thread bar
Author: nanang
Email: nh3...@yahoo.co.id
Message:
Can you help me with the indexer.conf settings needed to index all the 
links or posts in a forum or website category?

For example, I want to index all the posts in a "tips & tricks" category on any 
website found, so that I do not need to define a Server path for each of the many 
websites in the world.

On a website the link will look like _http://website.ws/category/css-tips/,
but on a forum it will be something like _http://forum.ws/forum-x.html (MyBB) or maybe 
_http://forum.ws/forum/14/ (vBulletin).

How can I index all the posts in a category? I tried IndexIf *tips* on the title, 
but this is bad: only the category pages get indexed, and all search 
results contain the phrase "tips".

Thanks for the help

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index all post in the forum or website category?

2013-04-22 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Can you help me, how indexer.conf settings in order to be able to index all 
> the links or post in the forum or website category?
> 
> for example I want to index all the posts on tips & tricks category, at any 
> website found so I do not need to define a server path on which so many 
> websites in the world
> 
> on the website link will be like this _http://website.ws/category/css-tips/
> but on the forum it will be just as _http://forum.ws/forum-x.html (MyBB) or 
> may _http://forum.ws/forum/14/ (vBulletin)
> 
> how so I can index all the posts in a category? I tried to use the title 
> IndexIf *tips* but this is bad, only categories that can index and all search 
> results contained the phrase "tips"
> 
> thanks for the help

Can you please clarify: what is the condition telling
that a forum or a website belongs to a certain category?


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index all post in the forum or website category?

2013-04-23 Thread bar
Author: nanang
Email: nh3...@yahoo.co.id
Message:
I want to index all sites about "templates":
postings in forum categories and in categories on websites.

My current indexer setup looks like this, but I'm confused because a lot of 
Wikipedia and Amazon links get indexed by the crawl:

Realm http://*.com/*
Realm http://*.org/*
Realm http://*.net/*
Realm http://*.info/*

Tag template
Server http://www.templatesite.com/
Server http://www.template.com/
Server http://www.alltemplate.com/

I tried to use IndexIf *template*, but this is bad: only the category pages get 
indexed (not the posts), and all search results contain the phrase "template" in the 
title.
I do not use MaxHops, because I want to index as many sites as the crawl finds, 
but only the posts about templates.

I also tried this setup, Realm http://*.com/*template*, but I'm not sure that's 
right, because I only saw it delete from the database the URLs that are not matched 
by a Server command.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Upgrading from 3.2?

2013-04-30 Thread bar
Author: Ian
Email: 
Message:
Hi, we successfully installed mnogosearch on a Linux/Apache server website back 
in 2007. This was version 3.2 and it has worked very well since then. However, 
we recently changed server hardware and upgraded the server software as well. 
Here is our current setup:

Apache version      2.2.23
PHP version         5.2.17
MySQL version       5.0.96-community
Architecture        x86_64
Operating system    linux

When visitors now try to use search they get an internal server error, e.g.:

http://fourthirds-user.com/cgi-bin/search.cgi/search.html?q=lens

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to 
complete your request.

Please contact the server administrator, webmas...@fourthirds-user.com and 
inform them of the time the error occurred, and anything you might have done 
that may have caused the error.

More information about this error may be available in the server error log.

Additionally, a 404 Not Found error was encountered while trying to use an 
ErrorDocument to handle the request.
Apache/2.2.23 (Unix) mod_ssl/2.2.23 OpenSSL/1.0.0-fips mod_auth_passthrough/2.1 
mod_bwlimited/1.4 FrontPage/5.0.2.2635 PHP/5.2.17 Server at fourthirds-user.com 
Port 80

My initial assumption is that we should upgrade to a more recent version of 
mnogosearch. Does this make sense?

If so, is there a guide concerning how to do this someone can point me to?

Thanks,

Ian

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Upgrading from 3.2?

2013-04-30 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi, we successfully installed mnogosearch on a Linux/Apache server website 
> back in 2007. This was version 3.2 and it has worked very well since then. 
> However, we recently changed server hardware and upgraded the server software 
> as well. Here is our current setup:
> 
> Apache version    2.2.23
> PHP version   5.2.17
> MySQL version 5.0.96-community
> Architecture  x86_64
> Operating system  linux
> 
> When visitors now try to use search they get an internal server error, e.g.:
> 
> http://fourthirds-user.com/cgi-bin/search.cgi/search.html?q=lens
> 
> Internal Server Error
> 
> The server encountered an internal error or misconfiguration and was unable 
> to complete your request.
> 
> Please contact the server administrator, webmas...@fourthirds-user.com and 
> inform them of the time the error occurred, and anything you might have done 
> that may have caused the error.
> 
> More information about this error may be available in the server error log.
> 

Possibly, it does not find some shared libraries.

Have a look into Apache's error.log,
or try to run search.cgi from command line.



> 
> My initial assumption is that we should upgrade to a more recent version of 
> mnogosearch. Does this make sense?
> 

I'd suggest upgrading to the latest 3.3.

> If so, is there a guide concerning how to do this someone can point me to?

There is no special upgrade guide.
The easiest way is just to install the new version,
copy and adjust the configuration files, and crawl your
site again from scratch.
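A minimal sketch of that sequence for a source install (the version number,
prefix and --with-mysql choice below are assumptions; adjust them to your setup):

tar -zxf mnogosearch-3.3.x.tar.gz
cd mnogosearch-3.3.x
./configure --with-mysql --prefix=/usr/local/mnogosearch
make
make install
# copy indexer.conf and search.htm from the old installation and adjust them,
# then re-crawl from scratch:
indexer -Cw      # clear the old crawler data without asking for confirmation
indexer
indexer -Eblob   # only needed if DBAddr uses ?dbmode=blob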

Feel free to send your configuration files 
(indexer.conf and search.htm) for review, to check
whether they need any adjustments for 3.3.

> 
> Thanks,
> 
> Ian


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index all post in the forum or website category?

2013-04-30 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I want to index all the sites, about "template"
> Indexing posting in the forum categories and categories on the website
> 
> current indexer setup like this, but I'm confused a lot of link wikipedia and 
> amazon are indexed by crawl
> 
> Realm http://*.com/*
> Realm http://*.org/*
> Realm http://*.net/*
> Realm http://*.info/*
> 
> Tag template
> Server http://www.templatesite.com/
> Server http://www.template.com/
> Server http://www.alltemplate.com/
> 
> I tried to use IndexIf *template* but this is bad, only categories that can 
> index (not post) and all search results contained the phrase "template" in 
> the title
> I do not use maxhops, because I want to index as many sites that found by the 
> crawl, only post about template
> 
> I tried using this setup, Realm http://*.com/*template* but I'm not sure 
> that's true because I saw just delete the url of the database that is not 
> defined by Server
> 

Is my understanding correct:
you want to find all pages on the Internet that contain
the word "template"?

I'm afraid the mnoGoSearch crawler is not very suitable for that
purpose. There are about 190 million active sites in the world,
according to the latest surveys. Every site usually has several 
documents (some sites have hundreds or thousands of documents).
It would take forever to crawl all these volumes of data.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Upgrading from 3.2?

2013-04-30 Thread bar
Author: Ian
Email: 
Message:
Thanks Alexander - where should I send the config files as you suggested?

Ian

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Upgrading from 3.2?

2013-04-30 Thread bar
Author: Ian
Email: 
Message:
Also, which version should I download - RPM or Deb?

Ian

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Upgrading from 3.2?

2013-04-30 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Thanks Alexander - where should I send the config files as you suggested?
> 
Feel free to post here, or to send to my personal address:
b...@mnogosearch.org



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Upgrading from 3.2?

2013-04-30 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Also, which version should I download - RPM or Deb?
> 
> Ian

Which Linux is it?

What does "cat /etc/os-release" print?


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Upgrading from 3.2?

2013-05-02 Thread bar
Author: Ian
Email: 
Message:
The following error suggests a library is missing:

[Wed May 01 10:31:47 2013] [error] [client 66.249.73.220] Premature end of
script headers: search.cgi [Wed May 01 10:31:47 2013] [error] [client
66.249.73.220] File does not exist: /home/fourth/public_html/500.shtml [Wed
May 01 10:31:57 2013] [error] [client 66.249.73.220] search.cgi: error while
loading shared libraries: libmysqlclient.so.14: cannot open shared object
file: No such file or directory

I am talking to our server support people about this - presumably this is a 
server OS library and not part of mnogosearch?

Ian

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index all post in the forum or website category?

2013-05-09 Thread bar
Author: nanang
Email: nh3...@yahoo.co.id
Message:
My mistake was choosing an unlimited category; it requires large hardware, like 
Google's.
I want to make it more specific: a selling search for my country.
Is it possible to configure the indexer to index only new data, selected by 
month and year?
So it would index a post only if it was posted in 2013; older posts would be 
skipped, and their data would be deleted from the database automatically.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index all post in the forum or website category?

2013-05-09 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> My mistake choosing the unlimited category, it requires a large hardware like 
> google
> I want to make more specific, it's about selling search on my country
> is it possible if I configure the indexer only do index when the data is new? 
> custom by month and year
> so just do the index when the post was in 2013, when posting on longer then 
> skip the index and data on dabatase will be deleted automatically



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index all post in the forum or website category?

2013-05-09 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> My mistake choosing the unlimited category, it requires a large hardware like 
> google
> I want to make more specific, it's about selling search on my country
> is it possible if I configure the indexer only do index when the data is new? 
> custom by month and year
> so just do the index when the post was in 2013, when posting on longer then 
> skip the index and data on dabatase will be deleted automatically

There is no feature like this.
I'm afraid it cannot be done easily on a single computer.
Your crawler will have to crawl through the whole Internet anyway.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: $(body) result optimization

2013-06-10 Thread bar
Author: Olivier Obéron
Email: 
Message:
Hi,

When I search for something with mnoGoSearch 3.3.14 (installed on Debian) on my 
website, the body part of each result is always the same: it displays the 
website menu instead of focusing on the searched word and highlighting it, as done on 
the mnogosearch site (e.g.: 
http://www.mnogosearch.org/search/index.html?q=search&x=0&y=0).

For example, if I search for the word "car" it will display "Home Menu1 Menu2 
..." instead of displaying "there was a CAR in the street...and the CAR was 
here...".


What should I do in order to correct it?

Thanks

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: $(body) result optimization

2013-06-10 Thread bar
Author: Olivier Obéron
Email: olivier.obe...@capgemini.com
Message:
Problem solved... I found this:
http://mnogosearch.org/doc33/msearch-stored.html. The problem was I didn't 
configure mnogosearch with zlib...

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Tags

2013-08-30 Thread bar
Author: Paul Stewart
Email: p...@paulstewart.org
Message:
Hi there...

I'm looking to build a search solution for a site I'm working on.

The site has a web directory aspect which looks like "province/city/service" or 
similar.

Are there any limitations on how many tags you can use and/or their length?

I'd like to use the above example as the actual tag within Mnogosearch and then, 
using a variable within search.htm, have it output the category and a link. The 
link I'd like to use is http://www.domain.com/directory/%tag%, for example - is 
this possible?

Basically the user does a search for "xyz" and all the Mnogosearch grouped 
results show up and with each result there will be a link to the directory 
category it belongs to (using tags feature).

Thanks,
Paul

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Tags

2013-08-31 Thread bar
Author: Alexander Barkov
Email: 
Message:
> Hi there...
> 
> I'm looking to build a search solution for a site I'm working on.
> 
> The site has a web directory aspect which looks like "province/city/service" 
> or similar.
> 
> Is there any limitations on how many tags you can use and/or lengths?
> 
> I'd like to use the above example as the actual tag within Mnogosearch and 
> then using a variable within search.htm have it output the category and a 
> link.  The link I'd like to use http://www.domain.com/directory/%tag% for 
> example - is this possible?
> 
> Basically the user does a search for "xyz" and all the Mnogosearch grouped 
> results show up and with each result there will be a link to the directory 
> category it belongs to (using tags feature).

Tag value is not limited. 
But, grouping results by tag value is not possible.
Grouping can only be done by site.

One of the ideas to achieve this would be
to use a set of ReverseAlias and Alias commands
to rewrite URLs 

from:

http://www.domain.com/directory/tag/page.html

to something like this:

http://www.domain.com-directory-tag/page.html

Then grouping will be possible by 
the original site name plus directory name + tag name.

Then, at search time, you'll need to rewrite URLs back
to their original form.
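A rough sketch of how that could look in indexer.conf (the regex form of
ReverseAlias is the one from the manual; the host and path names are simply taken
from the example above, and the whole thing is an illustration rather than a
tested configuration):

# store URLs with the directory/tag level folded into the host name
ReverseAlias regex ^http://www[.]domain[.]com/directory/([^/]+)/(.*) http://www.domain.com-directory-$1/$2

# let the crawler still download those documents from their real location
Alias http://www.domain.com-directory- http://www.domain.com/directory/

At search time a matching rewrite in the opposite direction is needed so that
the original URLs are displayed.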

> 
> Thanks,
> Paul

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Unable to remove SESSIONID with AliasProg

2013-10-08 Thread bar
Author: monsieurpaul
Email: 
Message:
Hi,

I'm quite new to mnogosearch and I'm experiencing an issue with AliasProg.

I want to index a page like 
http://myserver/myapp/doc.do%3Bjsessionid=12CC078A31A8321D910CC965932A0F36?idDoc=5032

I want to remove the jsessionid, so I try using AliasProg :
AliasProg "echo $1 | sed 's/%3Bjsession[^?]*//g'"

My server line :
Server site http://myserver/myapp

With indexer -v 5 I can see that everything SEEMS to be fine:

indexer[24738]: [24738]{01} URL: 
http://myserver/myapp/doc.do%3Bjsessionid=12CC078A31A8321D910CC965932A0F36?idDoc=5032
indexer[24738]: [24738]{01} Starting AliasProg: 'echo 
http://myserver/myapp/doc.do%3Bjsessionid=12CC078A31A8321D910CC965932A0F36?idDoc=5032
 | sed 's/%3Bjsession[^?]*//g''
indexer[24738]: [24738]{01} AliasProg result: 
'http://myserver/myapp/doc.do?idDoc=5032'
indexer[24738]: [24738]{01} Server Site Allow 'http://myserver/'
indexer[24738]: [24738]{01} Allow by default
indexer[24738]: [24738]{01} Alias: 'http://myserver/myapp/doc.do?idDoc=5032'


But if I search for this document, I see that the URL WITH the jsessionid part is 
stored instead of the URL alias.

What am I doing wrong?

Thank you for your help


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Unable to remove SESSIONID with AliasProg

2013-10-09 Thread bar
Author: monsieurpaul
Email: 
Message:
OK, after a good night, I found this page:
http://www.mnogosearch.org/doc/msearch-indexer-configuration.html#alias-reverse


and then the way to do it using ReverseAlias:
ReverseAlias regex (http.*)%3Bjsession[^?]*(.*) $1$2

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: erwan plop
Email: 
Message:
Hi,
 I am trying to use mnoGoSearch 3.3.14 on my website, but I have some trouble.
The indexing works fine, but when I try to run a query I always get no results.
I activated the log and noticed something weird: even when I put in an irrelevant 
password or username, mnoGoSearch seems to have no complaint, says it is connected 
to the database and runs the query, and of course in that case I get no results.
So, according to the log file, everything works fine, but it's not...
Do you know what could be the problem here?

Thanks

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: monsieurpaul
Email: 
Message:
hi,

did you check that you have activated the right database connection in your 
search.htm?

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi,
>  I try to use mnoGoSearch 3.3.14 on my website but I have some trouble.
> The indexation works fine but after when I try to do a query, I always have 
> no results. I activated the log and I noticed something weird, indeed, even 
> when I put irrelevant password or username, mnoGoSearch seems to have no 
> complaint and tell it's connected to the database and will do the query and 
> of course for this case I have got no results. So, according to the log file, 
> everything works fine but it's not...
> Do you know what could be the problem here ?
> 
> Thanks

Which log file do you mean?
search.cgi does not produce logs by default.

Make sure DBAddr in indexer.conf and in search.htm are the same.
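For example, both files should contain an identical line of this form (user,
password, host and SID here are placeholders in the style of the DBAddr that
appears later in this thread):

DBAddr oracle://user:password@127.0.0.1:1521/orcl/?dbmode=blob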


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: erwan plop
Email: 
Message:
Thanks for the reply.
By log file, I mean that I uncommented the line 'LogLevel 6' in search.htm.

I've got the same DBAddr in indexer.conf and in search.htm.



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Thanks for the reply.
> By log file, I uncommented this line 'LogLevel 6' in the search.htm.
> 
> I've got the same DBADDR in indexer.conf and in search.htm.
> 
> 

Can you please try a wrong database name instead of
a wrong user name or password?


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: erwan plop
Email: 
Message:
It's the same thing, no error shows up in the log.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> It's the same thing, no error shows up in the log.

Please post the output from the log.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: erwan plop
Email: 
Message:
Here is an example of the log.

Oct 21 05:15:30 fedora16 search.cgi[8300]: search.cgi started with 
'/usr/local/mnogosearch/etc/search.htm'
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start UdmFind
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start Prepare
Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  Prepare 0.00
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start FindWords
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start FindWordsDB for 
oracle://:@127.0.0.1:1521/orcl/?dbmode=blob
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start loading limits
Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  loading limits  0.00 
(0 URLs found)
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start fetching words
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start search for 'homme'
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start fetching
Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  FindWordsDB:0.21
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start UdmConvert
Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  UdmConvert: 0.00
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start Excerpts
Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  Excerpts:   0.00
Oct 21 05:15:30 fedora16 search.cgi[8300]: Start WordInfo
Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  WordInfo:   0.00
Oct 21 05:15:31 fedora16 search.cgi[8300]: Stop  UdmFind:0.21

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Here is an example of the log.
> 
> Oct 21 05:15:30 fedora16 search.cgi[8300]: search.cgi started with 
> '/usr/local/mnogosearch/etc/search.htm'
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start UdmFind
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start Prepare
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  Prepare 0.00
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start FindWords
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start FindWordsDB for 
> oracle://:@127.0.0.1:1521/orcl/?dbmode=blob




Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Here is an example of the log.
> 
> Oct 21 05:15:30 fedora16 search.cgi[8300]: search.cgi started with 
> '/usr/local/mnogosearch/etc/search.htm'
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start UdmFind
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start Prepare
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  Prepare 0.00
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start FindWords
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start FindWordsDB for 
> oracle://:@127.0.0.1:1521/orcl/?dbmode=blob
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start loading limits
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  loading limits  0.00 
> (0 URLs found)
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start fetching words
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start search for 'homme'
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start fetching
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  FindWordsDB:0.21
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start UdmConvert
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  UdmConvert: 0.00
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start Excerpts
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  Excerpts:   0.00
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Start WordInfo
> Oct 21 05:15:30 fedora16 search.cgi[8300]: Stop  WordInfo:   0.00
> Oct 21 05:15:31 fedora16 search.cgi[8300]: Stop  UdmFind:0.21



The error message is not in fact printed in the error log.
It's printed in the search output, like this:

   An error occurred!

DB err: Oracle: InitDB: ORA-12154: TNS:could not resolve the connect identifier 
specified! -



Btw, is "orlc" a valid Oracle SID that is configured in 
tnsnames.ora and/or listener.ora ?


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Yes, orcl is a valid SID for the database in question.
> 
> If I use the search.cgi by command line and an error occured, the error will 
> be printed on the stdout/stderr ?
> Because mnoGoSearch never printed any error relating to the connection to the 
> database.

As I said in the previous message, the error is printed to stdout,
like this:


An error occurred!

DB err: Oracle: InitDB: ORA-12154: TNS:could not resolve the connect identifier 
specified! -

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: erwan plop
Email: 
Message:
Yes, orcl is a valid SID for the database in question.

If I run search.cgi from the command line and an error occurs, will the error be 
printed to stdout/stderr?
Because mnoGoSearch has never printed any error relating to the connection to the 
database.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: erwan plop
Email: 
Message:
OK, I don't have any error message; the output always looks the same (the one I 
copied in a previous message), no matter what I put in the DBAddr. That's why I 
really don't know how to resolve this issue.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-21 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Ok, I don't have any error message, the output always looks the same (the one 
> I copy in a previous message), no matter what I put in the DBADDR. That's why 
> I really don't know how to resolve this issue.

Please try to run it from command line:

./search.cgi test > test.html

then check test.html in the browser.

Does it display the error message?

Make sure that you have the part of the template
that is responsible for printing the error message:




An error occurred!
$(E)
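For reference, in the stock search.htm that fragment looks roughly like this
(the <!--error--> section delimiters are an assumption based on the default
template; the essential part is printing $(E)):

<!--error-->
An error occurred!
<P>$(E)</P>
<!--/error-->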




Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-22 Thread bar
Author: erwan plop
Email: 
Message:
Hi,
 I ran ./search.cgi test > test.html and this time I've got an error, which is:
Unsupported DBAddr.
My database is an Oracle 11g and for the configure I did: ./configure 
--with-oracle8i --enable-news. Is that correct, or is this where I made a 
mistake?


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-22 Thread bar
Author: erwan plop
Email: 
Message:
Sorry, I made a mistake: I forgot to replace the DBAddr line from a previous 
test...
In fact, I don't have any error and it seems to work, since I get a result 
("Search results: test : 64.").
So the problem comes from my node.xml; I'll look into that.

Thanks for your time

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-22 Thread bar
Author: erwan plop
Email: 
Message:
So from the command line it works, but when I try to use it from my website, 
I get the following error: DB err: Oracle: InitDB: ORA-12154: TNS:could not 
resolve the connect identifier specified! - 


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-22 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> So from a command line it works but when I try to use it from from my 
> website, I've got the following error : DB err: Oracle: InitDB: ORA-12154: 
> TNS:could not resolve the connect identifier specified! - 
> 

Perhaps it wants environment variables like ORACLE_HOME to be
set.

Configure your web server to set ORACLE_HOME and
all other environment variables that Oracle might need.
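For example, with Apache this can be done with mod_env directives along these
lines (the Oracle paths below are placeholders; use the ones from your
installation):

SetEnv ORACLE_HOME /u01/app/oracle/product/11.2.0/dbhome_1
SetEnv LD_LIBRARY_PATH /u01/app/oracle/product/11.2.0/dbhome_1/lib
SetEnv TNS_ADMIN /u01/app/oracle/product/11.2.0/dbhome_1/network/admin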



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-22 Thread bar
Author: erwan plop
Email: 
Message:
I checked all the environment variables related to Oracle and everything is 
correct.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: database connection

2013-10-22 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I checked all the environment variable related to Oracle and everything is 
> correct.

Something is different between when you run search.cgi from
command line and from the web server.


Possibly, the user that's running the web server
(usually "apache", or sometimes "nobody")
does not have read permissions to the oracle directories.

Try it from command line:
change user to "apache", then run search.cgi
and check its output.
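For example (assuming a typical sudo setup and the "test" query used earlier in
this thread):

sudo -u apache ./search.cgi test > test.html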




Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Section meta.description

2013-10-23 Thread bar
Author: Brett
Email: brett.albert...@zootweb.com
Message:
I cannot get the indexer to index the information in the meta tags. The 
indexer.conf section looks like this. Am I missing something?

# Standard HTML sections: body, title
Section body1   256
Section title   2   128
# META tags
# For example <META NAME="KEYWORDS" CONTENT="some keywords">
#
Section meta.keywords   3   256
Section meta.description4   256

the body and title index fine


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Section meta.description

2013-10-23 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I cannot get the indexer to index the information in the meta tags. The 
> indexer.conf section looks like this. Am I missing something?
> 
> # Standard HTML sections: body, title
> Section   body1   256
> Section title 2   128
> # META tags
> # For example 
> #
> Section meta.keywords 3   256
> Section   meta.description4   256
> 
> the body and title index fine
> 

Can you please clarify what happens?
Does search not find the words from the meta tags?
Can you please send your indexer.conf and search.htm
to b...@mnogosearch.org, so I can check them?


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: parameter tag

2013-10-24 Thread bar
Author: erwan plop
Email: 
Message:
Hi,
 I upgraded to version 3.3.14 from version 3.3.4, and now when I use the 
tag parameter it doesn't work anymore. I don't know exactly how this 
parameter works. Could you help me understand this problem?

Thanks

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: parameter tag

2013-10-25 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi,
>  I upgraded to the version 3.3.14 from the version 3.3.4 and now when I use 
> the parameter tag, it doesn't work anymore. I don't know  exactly how this 
> parameter works. Could you help me understand this problem ?
> 
> Thanks

Can you please clarify what happens? 
Does search.cgi ignore the t=xxx query string parameter
and return all results, or does it return an empty result?
What does the full URL look like?


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: parameter tag

2013-10-25 Thread bar
Author: erwan plop
Email: 
Message:
When I use this URL: node.xml?ps=500&m=all&wm=wrd&wf=111F&q=personnel,
it returns the results that I want, but when I use this one: 
node.xml?ps=500&m=all&wm=wrd&wf=111F&q=personnel&tag=nomenclature, it returns 
nothing.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: parameter tag

2013-10-25 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> When I use this URL : node.xml?ps=500&m=all&wm=wrd&wf=111F&q=personnel,
> it returns the results that I want but when I use this one : 
> node.xml?ps=500&m=all&wm=wrd&wf=111F&q=personnel&tag=nomenclature, it returns 
> nothing.
> 

Does this query return any records:

SELECT url.rec_id FROM url, server s WHERE (s.tag LIKE 'nomenclature') AND 
s.rec_id=url.server_id

?


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: parameter tag

2013-10-25 Thread bar
Author: erwan plop
Email: 
Message:
The query returns more than 7000 records

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: parameter tag

2013-10-25 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> The query returns more than 7000 records

Please try the following:

1. Add these commands into node.xml:

Log2Stderr yes
LogLevel 6

2. Run search.cgi from command line like this:

./search.cgi -d /path/to/node.xml 
"personnel&ps=500&m=all&wm=wrd&wf=111F&tag=nomenclature" >output.xml

It will print search results to output.xml,
and debug information to stderr.

Does output.xml have still no results?

How does stderr output look like?



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: parameter tag

2013-10-25 Thread bar
Author: erwan plop
Email: 
Message:
output.xml is still empty and this is the copy of the terminal output :

search.cgi[3152]: Start loading limits
search.cgi[3152]: WHERE limit loaded. 6297 URLs found
search.cgi[3152]: Stop  loading limits  0.38 (6297 URLs found)
search.cgi[3152]: Start fetching words
search.cgi[3152]: Start search for 'personnel'
search.cgi[3152]: Start fetching
search.cgi[3152]: Stop  fetching0.00
search.cgi[3152]: Start BlobAddCoords
search.cgi[3152]: Secno=1 len=2375
search.cgi[3152]: Secno=7 len=735
search.cgi[3152]: Stop  BlobAddCoords   0.00
search.cgi[3152]: Stop  search for 'personnel'  0.01 (0 coords found)
search.cgi[3152]: Stop  fetching words: 0.01
search.cgi[3152]: Start merging 0 lists
search.cgi[3152]: Stop  merging:0.00 (0 sections)
search.cgi[3152]: Start GroupByURL 0 sections
search.cgi[3152]: Stop  GroupByURL  0.00 (0 docs found)
search.cgi[3152]: Stop  FindWordsDB:0.39
search.cgi[3152]: Stop  FindWords   0.39
search.cgi[3152]: Start UdmConvert
search.cgi[3152]: Stop  UdmConvert: 0.00
search.cgi[3152]: Start Excerpts
search.cgi[3152]: Stop  Excerpts:   0.00
search.cgi[3152]: Start WordInfo
search.cgi[3152]: Stop  WordInfo:   0.00
search.cgi[3152]: Stop  UdmFind:0.40


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: parameter tag

2013-10-25 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> output.xml is still empty and this is the copy of the terminal output :
> 
> search.cgi[3152]: Start loading limits
> search.cgi[3152]: WHERE limit loaded. 6297 URLs found
> search.cgi[3152]: Stop  loading limits  0.38 (6297 URLs found)
> search.cgi[3152]: Start fetching words
> search.cgi[3152]: Start search for 'personnel'
> search.cgi[3152]: Start fetching
> search.cgi[3152]: Stop  fetching0.00
> search.cgi[3152]: Start BlobAddCoords
> search.cgi[3152]: Secno=1 len=2375
> search.cgi[3152]: Secno=7 len=735
> search.cgi[3152]: Stop  BlobAddCoords   0.00
> search.cgi[3152]: Stop  search for 'personnel'  0.01 (0 coords found)
> search.cgi[3152]: Stop  fetching words: 0.01
> search.cgi[3152]: Start merging 0 lists
> search.cgi[3152]: Stop  merging:0.00 (0 sections)
> search.cgi[3152]: Start GroupByURL 0 sections
> search.cgi[3152]: Stop  GroupByURL  0.00 (0 docs found)
> search.cgi[3152]: Stop  FindWordsDB:0.39
> search.cgi[3152]: Stop  FindWords   0.39
> search.cgi[3152]: Start UdmConvert
> search.cgi[3152]: Stop  UdmConvert: 0.00
> search.cgi[3152]: Start Excerpts
> search.cgi[3152]: Stop  Excerpts:   0.00
> search.cgi[3152]: Start WordInfo
> search.cgi[3152]: Stop  WordInfo:   0.00
> search.cgi[3152]: Stop  UdmFind:0.40
> 

From the output it seems that the word 'personnel'
is just not found in the documents that have
the given tag value.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Laurent
Email: 
Message:
Hi Guys,

It's been a long time since my udm-gw script back in Y2K.
I am back on mnoGoSearch and facing a newbie issue I can't solve.

I want to index a server but exclude some specific patterns on it.
I tried Disallow together with Server; everything fails.
A Server command with a disallow pattern does not seem possible, so I did not try that.

Here I want to index www.a.com/
without www.a.com/news/*/2000/*
and www.a.com/index.html?*setlang=za

I did:
Disallow regex www.a.com/news/*/2000/*
Disallow regex www.a.com/index.html\?*setlang=za
Server allow www.a.com/

I also tried using .* as the pattern for "anything" instead of *, with no success.

Any help appreciated :-)

Thanks

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi Guys,
> 
> It's a long since my udm-gw script in Y2K.
> I am back on mnoGosearch and face a newbie issue I cant solve.
> 
> I want to index a server but not some specific regex on it.
> I tried disallow with server, all fails.

Can you please clarify what fails?
Does it crawl the entire site?
Or does it crawl nothing?

> Server disallow with pattern is not possible to me, no try.
> 
> Here I want to index www.a.com/
> without www.a.com/news/*/2000/*
> and www.a.com/index.html?*setlang=za
> 
> I did:
> Disallow regex www.a.com/news/*/2000/*
> Disallow regex www.a.com/index.html\?*setlang=za
> Server allow www.a.com/

The correct command is:

Server http://www.a.com/

Notice the "http://"; prefix.

> 
> I also tried using .* as pattern for any instead of *, no success.

".*" is correct.

Btw, which version are you using?

> 
> Any help appreciated :-)
> 
> Thanks


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Laurent
Email: 
Message:
Hi Alex,

Thanks for your answer.

I did not write the URL exactly right in my previous message.
What you wrote is what I did, and it does not work, apparently.
I am on FreeBSD, mnoGoSearch 3.3.14.

Disallow regex www.a.com/news/*/2000/*
Disallow regex www.a.com/index.html\?*setlang=za
Server https://allow www.a.com/

Is this the correct format?

In the log, I see https://www.a.com/index.php?title=Toto&value=1&setlang=za
as well as:
https://www.a.com/index.html?Special/file_2007_Conference

thanks



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Hi Alex,
> 
> Thanks for your answer.
> 
> I did not wrote perfectly the URL.
> What you wrote is what I did and it does not work, apparently.
> I am on FreeBSD, mnoGo 3.3.14
> 
> Disallow regex www.a.com/news/*/2000/*
> Disallow regex www.a.com/index.html\?*setlang=za
> Server https://allow www.a.com/
> 
> Is this the correct format ?

Try this:

Disallow regex "www[.]a[.]com/news/.*/2000/.*"
Disallow regex "www[.]a[.]com/index[.]html[?].*setlang=za"
Server allow https://www.a.com/

If it does not help, try this command:

indexer -amv6 -u "https://www.a.com/index.php?title=Toto&value=1&setlang=za"

It will print debug output and explain why this URL
is accepted or rejected. Please post its output here.


> 
> In the log, I see https://www.a.com/index.php?title=Toto&value=1&setlang=za
> as well as:
> https://www.a.com/index.html?Special/file_2007_Conference
> 
> thanks
> 
> 

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Content-type

2013-11-09 Thread bar
Author: Laurent
Email: 
Message:
Hi Guys,

While indexing, I see the "unsupported content-type" counters growing hugely.

Since I Disallow, for example, *.png, and have also tried listing it as a specific 
type and as CheckOnly to try to reduce this, I don't understand why it is detected 
as an unsupported content type.

It should not be indexed, and so should not be listed as unsupported, no?

Thanks

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Laurent
Email: 
Message:
indexer from mnogosearch-3.3.14-mysql started with 
'/usr/local/etc/mnogosearch/indexer.conf'
[57177]{01} URL: https://www.a.com/index.php/code_2007_:_Selection
[57177]{01} Server Path Allow 'https://www.a.com/'
[57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'
[57177]{01} ROBOTS: https://www.a.com/robots.txt
[57177]{01} Request.Accept-Encoding: gzip,deflate,compress
[57177]{01} Request.Accept-Language: en, fr
[57177]{01} Request.From: b...@toto.com
[57177]{01} Request.Host: www.a.com
[57177]{01} Request.User-Agent: bot
[57177]{01} Response.Accept-Ranges: bytes
[57177]{01} Response.Connection: close
[57177]{01} Response.Content-Encoding: gzip
[57177]{01} Response.Content-Length: 0
[57177]{01} Response.Content-Type: text/plain
[57177]{01} Response.Date: Sun, 10 Nov 2013 07:41:01 GMT
[57177]{01} Response.DefaultLang: en
[57177]{01} Response.DetectClones: 1
[57177]{01} Response.ETag: "1ea26b-0-4e0f93dabf240"
[57177]{01} Response.Last-Modified: Mon, 08 Jul 2013 05:23:13 GMT
[57177]{01} Response.Method: Disallow
[57177]{01} Response.Period: 604800
[57177]{01} Response.Request.Accept-Language: en, fr
[57177]{01} Response.Request.From: b...@toto.com
[57177]{01} Response.Request.User-Agent: bot
[57177]{01} Response.ResponseLine: HTTP/1.1 200 OK
[57177]{01} Response.ResponseSize: 360
[57177]{01} Response.Server: Apache
[57177]{01} Response.Status: 200
[57177]{01} Response.Tag: www_en
[57177]{01} Response.URL: https://www.a.com/robots.txt
[57177]{01} Response.URL_ID: -1277106540
[57177]{01} Response.Vary: Accept-Encoding
[57177]{01} Response.VaryLang: en fr
[57177]{01} Response.X-Frame-Options: Deny
[57177]{01} Response.X-XSS-Protection: 1; mode=block
[57177]{01} Request.Accept-Encoding: gzip,deflate,compress
[57177]{01} Request.Accept-Language: en, fr
[57177]{01} Request.From: b...@toto.com
[57177]{01} Request.Host: www.a.com
[57177]{01} Request.User-Agent: bot
[57177]{01} Response.body: 
[57177]{01} Response.Cache-Control: private, must-revalidate, max-age=0
[57177]{01} Response.CachedCopy: 
[57177]{01} Response.Charset: 
[57177]{01} Response.Connection: close
[57177]{01} Response.Content-Encoding: gzip
[57177]{01} Response.Content-Language: en
[57177]{01} Response.Content-Length: 7496
[57177]{01} Response.Content-Type: text/html
[57177]{01} Response.crc32: 1003223498
[57177]{01} Response.crc32old: 1003223498
[57177]{01} Response.crosswords: 
[57177]{01} Response.Date: Sun, 10 Nov 2013 07:41:01 GMT
[57177]{01} Response.DefaultLang: en
[57177]{01} Response.DetectClones: 1
[57177]{01} Response.Expires: Thu, 01 Jan 1970 00:00:00 GMT
[57177]{01} Response.Hops: 14
[57177]{01} Response.ID: 405428
[57177]{01} Response.Last-Modified: Mon, 14 Oct 2013 15:14:00 GMT
[57177]{01} Response.MaxDocPerSite: 0
[57177]{01} Response.MaxHops: 256
[57177]{01} Response.meta.description: 
[57177]{01} Response.meta.keywords: 
[57177]{01} Response.Method: Disallow
[57177]{01} Response.msg.from: 
[57177]{01} Response.msg.subject: 
[57177]{01} Response.msg.to: 
[57177]{01} Response.Period: 604800
[57177]{01} Response.PrevStatus: 200
[57177]{01} Response.Request.Accept-Language: en, fr
[57177]{01} Response.Request.From: b...@toto.com
[57177]{01} Response.Request.User-Agent: bot
[57177]{01} Response.ResponseLine: HTTP/1.1 200 OK
[57177]{01} Response.ResponseSize: 7952
[57177]{01} Response.Server: Apache
[57177]{01} Response.Server-Charset: utf-8
[57177]{01} Response.Server_id: -1149994654
[57177]{01} Response.Site_id: -1149994654
[57177]{01} Response.Status: 200
[57177]{01} Response.Tag: www_en
[57177]{01} Response.title: 
[57177]{01} Response.URL: https://www.a.com/index.php/code_2007_:_Selection
[57177]{01} Response.url.file: 
[57177]{01} Response.url.host: 
[57177]{01} Response.url.path: 
[57177]{01} Response.url.proto: 
[57177]{01} Response.URL_ID: 1908964734
[57177]{01} Response.Vary: Accept-Encoding,Cookie
[57177]{01} Response.VaryLang: en fr
[57177]{01} Response.X-Content-Type-Options: nosniff
[57177]{01} Response.X-Frame-Options: Deny
[57177]{01} Response.X-XSS-Protection: 1; mode=block
[57177]{01} Status: 200 OK
[57177]{01} Stored rec_id: 405428 Size: 25459 Ratio: 29.35%
[57177]{01} Guesser: Lang: en, Charset: utf-8
[57177]{01} SectionFilter: Allow by default
[57177]{01} Link '/favicon.ico' https://www.a.com/favicon.ico
[57177]{01}  Server applied: site_id: -1149994654 URL: https://www.a.com/
[57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'
[57177]{01} Link '/opensearch_desc.php' https://www.a.com/opensearch_desc.php

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Indexer with regex

2013-11-11 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> indexer from mnogosearch-3.3.14-mysql started with 
> '/usr/local/etc/mnogosearch/indexer.conf'
> [57177]{01} URL: https://www.a.com/index.php/code_2007_:_Selection
> [57177]{01} Server Path Allow 'https://www.a.com/'
> [57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'

Can you please send your indexer.conf to b...@mnogosearch.org?
Thanks.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Indexer with regex

2013-11-13 Thread bar
Author: Laurent
Email: 
Message:
Hi Alex,

OK, I finally found the issue...

First, there was an
Allow NoMatch Regex \.php$|\.cgi$|\.pl$

activated. Because of it, almost all URLs were accepted.

This was because this Allow came before the Disallow commands related to the servers.
It totally changed my approach to the indexing file.

Before, I had the Allow/Disallow of specific wide patterns (*.suffix etc.), then the 
Disallow of URLs, and then the Allow of URLs.
Now I disallow servers first, then allow/disallow wide patterns, and finally the 
Server allows.

This strongly lowered my unsupported content-type statistics as well :-)

Thanks for your support!

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Content-type

2013-11-13 Thread bar
Author: Laurent
Email: 
Message:
solved !

See http://www.mnogosearch.org/board/message.php?id=21584

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: mnoGo improvement in SQL storage ?

2013-11-13 Thread bar
Author: Laurent
Email: 
Message:
Hi Guys,

Digging in the urlinfo database table, I see it contains many sname values with the 
full name of the URL response field (e.g. Content-Type).

I wonder if it would not be a good idea to reduce these names to a much shorter 
value, directly inside mnoGo, to reduce storage as well?

Just a suggestion, if valuable...

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: mnoGo improvement in SQL storage ?

2013-11-19 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi Guys,
> 
> Digging in the urlinfo datadase, I see it contains many sname with the full 
> URL response type (e.g. Content-Type).
> 
> I wonder if it would not be a good idea to reduce these names to a much 
> shorter value, directly inside mnoGo, to reduce storage as well ?
> 
> Just a suggestion, if valuable...

In case of MySQL you can ALTER the table
to use the ENUM data type for the column "sname",
instead of a VARCHAR. The list of all possible
values is known from the "Section" commands in indexer.conf.
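A rough illustration of that ALTER for MySQL (the ENUM value list here is
hypothetical and must contain exactly the section names from your indexer.conf):

ALTER TABLE urlinfo
  MODIFY sname ENUM('Content-Type','Content-Language','Charset',
                    'title','meta.keywords','meta.description');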

In the next major version 3.4.x the table structure will 
be different: the table "urlinfo" will be used only
for user-defined variables. It won't be used
for things like Content-Type.


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index immediately specific URL ?

2013-11-27 Thread bar
Author: Laurent
Email: 
Message:
Hi Guys,

mnoGoSearch works perfectly now, apparently :-)

I want to index a specific URL immediately; how can I do that?

When I force the reindex with -am, it does not index it right away, it just 
confirms that it has to be done :-(

Note: there is an alias (Server URL file) for this index.

Thx

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Index immediately specific URL ?

2013-11-27 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi Guys,
> 
> mnoGoSearch works perfectly now, apparently :-)
> 
> I wanted to index immediately a specific URL, how can I do that ?
> 
> when I force (-am) the reindex, it does not index it now, just confirm 
> it has to be done :-(

Check this:

indexer -S -u http://www.site.com/specific-url.html

It will report empty or non-empty statistics, depending
on whether the given URL is already in the database.

Then do either of the following:

1. If the URL is not in the database yet, then run this command:

indexer -i -u http://www.site.com/specific-url.html

It will insert the URL into the database and download it.


2. If the URL *is* already in the database, and you want to force
the crawler to download it again, then run this:

indexer -am -u http://www.site.com/specific-url.html


> 
> Nota : there is an alias (Server URL file) for this index.
> 
> Thx

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Regex syntax for sections with multiple matches

2013-11-27 Thread bar
Author: Felix Heller
Email: felix.hel...@aimcom.de
Message:
Hello,

I've installed and configured MnoGoSearch as a powerful full text search engine 
for 
CMS websites a few days ago. But right now I am a little bit confused about the 
configuration of document sections.

I would like to index the headlines (<h1>, <h2>, <h3>) in special fields so that I 
can weight them more in comparison to the body text.

There is one example given in indexer.conf:
Section h1  26  128  "<h1>(.*)</h1>" $1

This works fine because normally there is only one <h1> on a webpage. But when I try 
to index all <h2> headlines using the regular expression "<h2>(.*)</h2>" $1, the 
whole content between the first <h2> and the last </h2> gets indexed. What I would 
like to get is only the text between the <h2>...</h2> tags.

Could somebody please tell me if there is a solution for that problem?

Thanks a lot for your help
Felix

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Regex syntax for sections with multiple matches

2013-11-27 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hello,

> Hello,
> 
> I've installed and configured MnoGoSearch as a powerful full text search 
> engine for 
> CMS websites a few days ago. But right now I am a little bit confused about 
> the 
> configuration of document sections.
> 
> I would like to index the headlines (<h1>, <h2>, <h3>) in special fields so that I 
> can weight them more in comparison to the body text.
> 
> There is one example given in indexer.conf:
> Section h1  26  128  "<h1>(.*)</h1>" $1
> 
> This works fine because normally there is only one <h1> on a webpage. But when I try 
> to index all <h2> headlines using the regular expression "<h2>(.*)</h2>" $1, the 
> whole content between the first <h2> and the last </h2> gets indexed. What I would 
> like to get is only the text between the <h2>...</h2> tags.
> 
> Could somebody please tell me if there is a solution for that problem?

There are two problems here:
1. Nested tags, e.g. <h2>text <xxx>text</xxx> text</h2>.

Unfortunately, there is no general solution for this,
because the underlying regexp library does not support
so-called "non-greedy quantifiers". We definitely need
to switch to the PCRE library eventually to make it possible.

But there is a workaround that I think should work for <h2> and <h3>.
The idea is that <h2> and <h3> usually do not have nested tags,
so the regexp can scan everything until the next '<' character:

Section h2  27  128  "<h2>([^<]*)</h2>" $1
Section h3  28  128  "<h3>([^<]*)</h3>" $1

It will work for: <h2>text text</h2>

It will not work for: <h2>text <xxx>text</xxx> text</h2>
where xxx is some other tag. 

Do you know any tags that are possible inside <h2> or <h3>?


2. Multiple <h2> or <h3> tags.
The user-defined sections do not support multiple entries.
They catch only the first match. Adding support for multiple
matches (e.g. to concatenate them) will need some coding.


> 
> Thanks a lot for your help
> Felix


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Monogosearch error (crawl won't start)

2013-12-01 Thread bar
Author: Mamadoo
Email: 
Message:
Hi there,

I'm using the latest version of Mnogosearch for UNIX with MySQL support.
MySQL has been installed using MAMP.
The MySQL server is started.

When launching this command :
sudo ./indexer -am -u http://www.mywebsite.com

Here is what it says :
indexer[54150] : indexer from mnogosearch-3.3.14-mysql started with 
'/usr/local/mnogosearch/etc/indexer.conf'
indexer[54150] : [54150]{01} Done (0 seconds, 0 documents, 0 bytes,  
0.00 
Kbytes/sec.)

Any idea of what's going wrong ?

Many thanks !

I'm running it from Mac OS X command line tool.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Monogosearch error (crawl won't start)

2013-12-02 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi there,
> 
> I'm using the last version of Mnogosearch for UNIX with mysql support.
> Mysql has been installed using MAMP.
> Mysql server is started.
> 
> When launching this command :
> sudo ./indexer -am -u http://www.mywebsite.com

Try this to crawl the home page only:

./indexer -am -u http://www.mywebsite.com/

or this to crawl the entire site:

./indexer -am -u "http://www.mywebsite.com/%"



> 
> Here is what it says :
> indexer[54150] : indexer from mnogosearch-3.3.14-mysql started with 
> '/usr/local/mnogosearch/etc/indexer.conf'
> indexer[54150] : [54150]{01} Done (0 seconds, 0 documents, 0 bytes,  
> 0.00 
> Kbytes/sec.)
> 
> Any idea of what's going wrong ?
> 
> Many thanks !
> 
> I'm running it from Mac OS X command line tool.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Monogosearch error (crawl won't start)

2013-12-03 Thread bar
Author: Mamadoo
Email: 
Message:
Thanks, I had forgotten to uncomment the Server line in indexer.conf...
After having done this, everything worked.

Thanks for help

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Working on Mac OSX

2013-12-03 Thread bar
Author: Mamadoo
Email: 
Message:
Hi,

Just wanted to say THANK YOU SO MUCH to the creator(s) of this wonderful tool.
I'm running it successfully on Mac OS X Mavericks and MAMP.

Thanks !

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Working on Mac OSX

2013-12-03 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi,
> 
> Just wanted to say THANK YOU SO MUCH to the creator(s) of this wonderful tool.
> I'm running it successfully on Mac OS X Mavericks and MAMP.

You're very welcome. Thanks for using it!

> 
> Thanks !

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: In/out links and fetching time for each page + xpath

2013-12-04 Thread bar
Author: Mamadoo
Email: fohoi...@gmail.com
Message:
Hi there,

Is it possible to obtain this information after having crawled a website:
- Fetching / downloading time of each page
- Total in and out links (from the website structure itself)

Would it be possible to add XPath support instead of regex for Sections?
Using a plugin or natively.

Many thanks !

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: In/out links and fetching time for each page + xpath

2013-12-05 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi there,
> 
> Is it possible to obtain these informations after having crawled a website :
> - Fetching / downloading time of each page
> - Total in and out links (from the website structure itself)

This is possible in mnogosearch-3.4.0, which is in pre-alpha stage at the 
moment. If you'd like to give it a try, please download it from here:
http://www.mnogosearch.org/Download/mnogosearch-3.4.0.tar.gz
(note, this is not the final 3.4.0).

- See the ResponseTime special purpose section here:
http://www.mnogosearch.org/doc34/msearch-cmdref-section.html#cmdref-section-special

- The structure of the table "links" has changed.
It now can store all links between the pages.
Please see here how to configure it:
http://www.mnogosearch.org/doc34/msearch-cmdref-collectlinks.html
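
In indexer.conf terms that would be roughly (the command names are taken from
the 3.4 documents linked above; the section number, length and the exact
CollectLinks argument are placeholders, check the docs):

Section ResponseTime 40 16
CollectLinks yes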


> 
> Would it be possible to add xpath support instead of regex for Sections ?
> Using a plugin or natively.

I guess you need this for XML files.

XPath is currently not possible. We could take advantage
of libxml2 to add XPath support. But this needs some
development efforts.

> 
> Many thanks !


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: In/out links and fetching time for each page + xpath

2013-12-05 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:

> 
> I guess you need this for XML files.
> 
> XPath is currently not possible. We could take advantage
> of libxml2 to add XPath support. But this needs some
> development efforts.
> 

Btw, simple extraction from a given XML tag is supported
in 3.3.x, with the help of the Section command.

For example:

<xml>
  <a>
   <b>I want to extract this</b>
  </a>
</xml>

A command like this will do the trick:

Section xml.a.b  10 128



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: In/out links and fetching time for each page + xpath

2013-12-06 Thread bar
Author: Mamadoo
Email: fohoi...@gmail.com
Message:
For fetching time, OK thanks! Great news!
For the in/out links per page, any chance you will add this one day?

For XPath, thanks but no, it's not for XML parsing.
I would need it, for example, to scrape specific content from my pages.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: In/out links and fetching time for each page + xpath

2013-12-06 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> For fetching time, ok thanks ! Great news !
> For the in / out links per page, any chance you add this one day ?

As I said in the previous message, in 3.4.0
*ALL* in/out links can be collected into the table "links".
It's trivial to count incoming and outgoing links
for any URL with a simple SQL query.
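
For instance (assuming the "links" table keeps one row per link, with the
source URL id and the target URL id in two columns; the column names "ot" and
"k" below are assumptions, check your actual schema):

-- outgoing links of one page
SELECT COUNT(*) FROM links
 WHERE ot = (SELECT rec_id FROM url WHERE url='http://www.site.com/page.html');

-- incoming links of the same page
SELECT COUNT(*) FROM links
 WHERE k = (SELECT rec_id FROM url WHERE url='http://www.site.com/page.html');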


> 
> For xpath, thanks but no, it's not for XML parsing.
> I would need it, for example, to scrap specific content on my pages.

XPath is a query language for addressing various parts of an XML document. It 
assumes well-formed XML, so it does not work for an arbitrary HTML file.



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: In/out links and fetching time for each page + xpath

2013-12-06 Thread bar
Author: Mamadoo
Email: fohoi...@gmail.com
Message:
Many thanks

I use XPath every day to find content in xHTML documents and it works pretty well.

Thank you so much for your answers.

Any idea of when the 3.4 could be released ?

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Search of...Indexing on 2 DB

2013-12-07 Thread bar
Author: Laurent
Email: 
Message:
Hi Guys,

To improve performance, I split my index database (reindexing from scratch) across 2 
different platforms.

Separately, search.htm works perfectly, limited to each of the indexes of 
course.

I would now like to merge the search so data is taken from the 2 SQL servers. I 
saw the brief explanation in the doc, but it is a bit confusing to me.
search.htm is PHP and the explanations are made for the more risky CGI.

Does someone know the trick for the PHP version of the search script?

Thanks in advance

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Search of...Indexing on 2 DB

2013-12-09 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi Guys,
> 
> To improve performance, I split my index database (reindexing from start) on 
> 2 different platforms.
> 
> Separetely, the search.htm works perfectly, limited in each of the indexes of 
> course.
> 
> I would now like to merge the search so data is taken from the 2 SQL servers. 
> I saw in the doc the brief explanation, but it is a bit confusing to me.
> search.htm is PHP and the explanations are made for a more risky CGI.
> 
> Does someone knows the trick for the PHP version of the search script ?

You can use Udm_Alloc_Agent_Array() to specify multiple databases:
http://www.php.net/manual/en/function.udm-alloc-agent-array.php
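
A minimal sketch (the two DBAddr strings below are made-up placeholders; use
your real connection strings):

$agent = udm_alloc_agent_array(array(
    "mysql://user:pass@host1/search1/?dbmode=blob",   // placeholder DBAddr #1
    "mysql://user:pass@host2/search2/?dbmode=blob"    // placeholder DBAddr #2
));
$res = udm_find($agent, $query);
// fetch the hits with udm_get_res_param()/udm_get_res_field() as usual
udm_free_res($res);
udm_free_agent($agent);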

However, the PHP module does not support parallel execution.
It queries the databases sequentially.

Note, the CGI version queries the databases in parallel.
So it should be faster.

Btw, how many documents do you have? 
What is the output from "indexer -S"?

> 
> Thanks in advance

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: In/out links and fetching time for each page + xpath

2013-12-09 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Many thanks
> 
> I use Xpath everyday to find content on xHTML content and it works pretty 
> well.

xHTML is valid XML, so XPath should work.

> 
> Thank you so much for your answers.
> 
> Any idea of when the 3.4 could be released ?

Around January 2014, if everything goes fine.



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Search of...Indexing on 2 DB

2013-12-09 Thread bar
Author: Laurent
Email: 
Message:
Hi Alex,

Thanks for your reply.

Currently, I don't have that many documents.
I am talking about roughly 300K in the main DB and 100K in the other one.
But the robot is currently frozen due to lack of disk space.

During Christmas I'll upgrade to 2x600 GB and, after that, I'll let the indexer loose 
again. I expect millions of URLs to be indexed in the end, so I am just anticipating this.
I already see the difference when BLOBing: from 1800s to 800s, just because I 
split into 2 logical groups of information.

About parallel search, to avoid the more risky CGI, would it be smart to 
make (for example) a Perl search front-end able to thread and merge the 
results, and then just use PHP to fetch and display them?

Maybe a major improvement idea to consider :-)

Thx

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Saving html code in database

2013-12-10 Thread bar
Author: fasfuuiios
Email: 
Message:
I'm trying to use mnogosearch as a simple parser because, in my opinion, it is much 
better than other scripts that were created specially for data 
extraction and analysis. Is it possible to store the full 
HTML code in the database using "Section"? I have tried, but it always strips 
HTML tags. CachedCopy looks encrypted. I want to save full pages and 
then explore the dump with a prepared parser to extract structured data.

If such a thing is not possible with "Section" by default, which source 
code files should I explore? Is any simple hack possible? 

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Saving html code in database

2013-12-10 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I'm trying to use mnogosearch as simple parser because it is much 
> better than other scripts that were created specially for data 
> extraction and analysis in my opinion. Is it possible to store full 
> html code in database using "Section"? I have tried but it always strip 
> html tags. CachedCopy looks encrypted.

It's compressed content (using "deflate"), wrapped into base64.
So to get the full HTML code, you can do a base64 decode, followed by 
zlib's inflate. This needs some programming. A simple PHP program
should do the trick.
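
For example, something along these lines (a minimal sketch; error handling is
omitted, and since the stored data may be zlib-wrapped or raw deflate, both are
tried):

<?php
// $cached is the base64 text taken from the cached copy column in the database
$raw  = base64_decode($cached);
$html = @gzuncompress($raw);      // zlib-wrapped deflate
if ($html === false) {
    $html = gzinflate($raw);      // raw deflate
}
echo $html;
?>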

Alternatively, you can extract cached copies using search.cgi,
like this:
./search.cgi "&cc=1&URL=http://www.site.com/test.html"


> I want to save full pages and 
> than explore dump with prepared parser to extract structured data.
> 
> If such thing is not possible with "Section" by default what source 
> code files I must explore? Any simple hack is possible? 

Storing the original HTML code is possible in the version 3.4.
You can download a pre-release of 3.4.0 from here:
http://www.mnogosearch.org/Download/mnogosearch-3.4.0.tar.gz

3.4 stores cached copies differently (compared to 3.3):
- in a new table "urlinfob", separately from the "Section" values.
- without base64 encoding (in a "BLOB" instead of "TEXT" column)
- compressed by default using deflate,
  but with an option to switch compression off.

To store cached copies uncompressed, add this command
into indexer.conf:

CachedCopyEncoding identity

Note, the table name "urlinfob" will probably change to "cachedcopy"
in the final 3.4.0 release.

The 3.4 manual is already online.
These pages might be of interest for you:
http://www.mnogosearch.org/doc34/msearch-changelog.html
http://www.mnogosearch.org/doc34/msearch-cmdref-cachedcopyencoding.html


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Antispam algorythm

2013-12-11 Thread bar
Author: fasfuuiios
Email: 
Message:
Currently it looks like there is no way to stop indexing of spammed 
sites. Link spammers even spam this board automatically from time to 
time. That software is very pluggable and can be adapted for any type 
of CMS and submit forms. 

I thought about a global, dirty solution that could hunt spam during 
the indexing process. Here is the idea.

-

Say we have a new option for 3.4+ versions:

ExternalLinkCount [maxlinks] [maxpages] [nofollow]

maxlinks is the limit for external links on a page. (Spammers are trying 
to add direct links for PageRank etc.)

maxpages is the limit for probably-spammed pages on the same host.

nofollow is true or false: filter only spam pages with or without 
rel="nofollow".

---

Examples:

ExternalLinkCount 20

This will delete any page which has more than 20 external links.

ExternalLinkCount 20 20

This will automatically ban and remove a site that has more than 20 
pages where each page has more than 20 external links.

ExternalLinkCount 20 20 true

This will do the previous thing, with and without nofollow links.

ExternalLinkCount 20 20 false

Only for direct links that play with PageRank etc.

---

This is not ideal. It can cut normal pages. But those webmasters who 
use nofollow as Google recommended are rather safe. This can cut blog 
pages with tons of good comments.
Big scientific pages, catalogs and wikis are probably not safe from 
such dirty filtering. 

Anyway, this is probably the simplest way to catch those sites that 
have tons of spammed pages. With high limits it could probably help.



Example of a site that is currently under a spam attack. It generates 
thousands of such spammed pages. That is why I thought about this 
problem in a very basic but cruel way.

http://www.gksbeton.ru/index.php/peremychki-pb/item/35-novost-1/35-
novost-1?start=400


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Antispam algorythm

2013-12-11 Thread bar
Author: fasfuuiios
Email: 
Message:
I'm not completely sure that it's a good idea, but it is probably better 
than nothing at all to stop this. Of course, it needs tests and 
analysis. I believe that a normal HTML page has no more than 5 external 
links. Currently even paid links are usually limited to 3, and they are 
located inside the article to avoid Google filter penalties etc. 

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: cpu usage

2013-12-11 Thread bar
Author: fasfuuiios
Email: 
Message:
I have noted that even if I start indexer with 5, 10, 20 or 40 
threads via the CrawlerThreads option in indexer.conf, the top command 
never shows more than 40% CPU and only very rarely does it rise up 
to 55%. With more threads it can slightly DDoS some sites and they 
give a 503 error or even 508. Munin server monitoring shows 
rather stable performance without high CPU and memory usage 
during indexing. Sometimes indexer hangs, but I check it with cron 
each minute and start it again if it is not active:

* * * * * root  pgrep indexer > /dev/null || /usr/local/mnogosearch/sbin/indexer -l

Does mnogosearch have some internal performance limitations in indexer 
to make parallel searches and indexing possible? Or maybe I have 
missed something in the compile options or some special options in 
indexer.conf? I have not experimented with more than one indexer 
process. Is it possible to achieve 80% CPU usage constantly? If 
yes, what is the safest and most stable way to do it, if the server is used 
only for indexing?

Or maybe it is good practice to limit indexer? I have seen PHP 
crawlers that can easily eat 90% of the CPU. Of course, their slow 
performance is no match for mnogosearch's high speed. It works very 
fast. But still, it is interesting how to load the server completely 
during indexing.

 

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: cpu usage

2013-12-13 Thread bar
Author: fasfuuiios
Email: 
Message:
Regarding these tests, I have forgotten to add configuration-specific 
details. 

I use PostgreSQL, tuned with 
http://pgfoundry.org/projects/pgtune/ 
on each node.

Nodes are simple and old.

1) Pentium(R) Dual-Core CPU   T4500  @ 2.30GHz x2 
with 4 Gb memory
with usual HDD
OS Debian 32bit

2) Intel(R) Celeron(R) CPUE1400  @ 2.00GHz x2 
with 1 Gb memory
with usual HDD
OS Debian 32bit

3) AMD Athlon(tm) 64 X2 Dual Core Processor 5600+ 2x2800 MHz
with 2 Gb memory
with Debian 64bit
with two regular HDDs in software RAID 1 

I have noted that an SSD OpenVZ VPS can work much faster. This is 
understandable; SSD is always recommended for such database workloads.

But anyway, none of these nodes can be pushed to high CPU usage by 
indexer with 5/10/20/40/50 threads. I have not tested more threads because in 
some cases it becomes a little DDoS attack. CrawlDelay is not used.

It seems that the hard drive is always the main bottleneck.

Currently I think that maybe sysctl.conf must be edited to make it work 
faster.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Saving html code in database

2013-12-13 Thread bar
Author: fasfuuiios
Email: 
Message:
With such options, mnogosearch can be positioned not only as a search 
engine but also as a universal data miner, collecting and analyzing 
data with external parsing libraries. In most cases so-called parsers 
can't crawl sites properly, so if anyone needs to download a site it 
is better to use mnogosearch. With wget it becomes unpredictable. 
Probably the only competitor of mnogosearch is the Python library named 
Scrapy. But it also needs preparation for everything, and it is 
unpredictable on high volumes of data, in my opinion.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: cpu usage

2013-12-15 Thread bar
Author: fasfuuiios
Email: 
Message:
Found this related thread: 
http://www.mnogosearch.org/board/message.php?id=19643 

I have tried to start 2 instances of indexer. 
indexer.conf has 
CrawlerThreads 50

I thought that maybe it is related to the number of cores. But it looks 
like there is no difference between 1 and 2 indexer instances with 
CrawlerThreads 50 defined.
Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: n grams / stemmed n grams

2013-12-18 Thread bar
Author: Mamadoo
Email: fohoi...@gmail.com
Message:
Hi,

How can I extract the n-grams or stemmed n-grams of a page that has been crawled by 
mnogosearch?

For example, I give mnogosearch the URL of a page and it gives me its n-grams and 
stemmed n-grams.

Thanks

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Antispam algorythm

2013-12-18 Thread bar
Author: fasfuuiios
Email: 
Message:
I have forgotten to add that this "black hat" SEO program is still 
under active development, because since the end of November spam activity 
has grown.

They say on black-hat forums that this program can currently recognize 
up to 100,000 text-based captchas, and it can collect these 
questions and send them to the developers' servers for analysis. 

One of the ways to stop them is using the database from
http://www.stopforumspam.com/

But this needs statistics and preparation. So it looks like a simple 
solution as described is much better. It can save a lot of traffic and 
keep search results clean. But it can also harm normal sites. 



Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Core Dump when using ServerTable

2014-01-31 Thread bar
Author: momma
Email: 
Message:
I just moved to a new server and upgraded mnogosearch to 3.3.15 from 
3.3.7. Now when I run indexer, I immediately get a core dump. As soon 
as I comment out the one ServerTable command I have, all is well.

Old server = Red Hat Enterprise 32-bit, mnogosearch v3.3.7
New Server = CentOS 6.5 64-bit mnogosearch v3.3.15

The indexer configcheck option shows no problems.

Is anyone else using 3.3.15 and a ServerTable directive with no 
problems?

The one main difference I see between the standard server table and my custom 
server table is that the url field is a blob in the standard one and mine is 
varchar(255)... but that worked before the upgrade.

I cannot get to the core dump file because 'abrt' is gobbling it up 
and I have not figured out how to get around that yet.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Core Dump when using ServerTable

2014-02-02 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I just moved to a new server and upgraded mnogosearch to 3.3.15 from 
> 3.3.7. Now when I run indexer, I immediately get a core dump. As soon 
> as I comment out the one ServerTable command I have, all is well.
> 
> Old server = Red Hat Enterprise 32-bit, mnogosearch v3.3.7
> New Server = CentOS 6.5 64-bit mnogosearch v3.3.15
> 
> The indexer configcheck option shows no problems.
> 
> Is anyone else using 3.3.15 and a ServerTable directive with no 
> problems.

Can you please send your indexer.conf to b...@mnogosearch.org?
I'll test it on my Fedora 64-bit box.

> 
> The one main difference I see between the server table and my custom 
> server table is that the url field is a blob in the server and mine is 
> varchar(255)...but, that worked before the upgrade.
> 
> I can not get to the core dump file because 'abrt' is gobbling it up 
> and I have not figured out how to get around that yet.

Can you please try running it under gdb?
Getting a backtrace would be the most helpful way to find the reason
for the crash.
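
Something like this (the binary path is an assumption, adjust it to your
installation):

gdb /usr/local/mnogosearch/sbin/indexer
(gdb) run -am
... wait for the crash ...
(gdb) bt full

Or, if you manage to grab the core file:

gdb /usr/local/mnogosearch/sbin/indexer /path/to/core
(gdb) bt full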


Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: n grams / stemmed n grams

2014-02-02 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

Sorry for the late reply, I did not see this message before.

> Hi,
> 
> How can I extract n grams or stemmed n grams of a page that has been crawled 
> by 
> mnogosearch ?
> 
> For example i give mnogosearch the url of a page and it gives me n grams, 
> stemmed n grams.
> 

Can you please clarify what you mean by n grams and stemmed n grams?

If you want a list of all search terms that indexer found on
the page, you can do the following:

- configure indexer.conf to crawl this URL:
Server page http://somesite/somepage.html
- crawl it by running "indexer"
- index it by running "indexer --index"
- check the result of "select word, secno from bdict" 
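
A rough walk-through of the same steps (the database access details are up to
your setup):

# indexer.conf
Server page http://somesite/somepage.html

# shell
indexer
indexer --index

# SQL, e.g. via the mysql client against your search database
SELECT word, secno FROM bdict;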


> Thanks

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Core Dump when using ServerTable

2014-02-02 Thread bar
Author: momma
Email: 
Message:
OK, give me a day or 2. Thank you.

Reply: 

___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


  1   2   3   >