Re: [sqlite] HTML Tokenizer

2014-02-13 Thread RSmith


On 2014/02/13 22:35, Petite Abeille wrote:
> While we are at it, www.sqlite.org exhibits many validation errors:
> http://validator.w3.org/check?uri=http%3A%2F%2Fwww.sqlite.org%2F=%28detect+automatically%29=Inline=0=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices#result


Yeah, I actually had some web zealot once point out to me that one of my sites produced 25 errors in the W3C validator. I 
laughed and pointed out that www.google.com scored 30 errors (I see they are down to 23 errors and 4 warnings now) and that the 
first 3 of his sites I checked all had over 100 (which is quite normal, actually: Amazon.com gets over 500 errors, but I did not 
mention this).


Point is, I wouldn't lose any sleep over the 37 SQLite errors, and neither should 
any SQLite web admin :)

For ref: Google:
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2F=%28detect+automatically%29=Inline=0=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices

Amazon:
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.amazon.com%2F=%28detect+automatically%29=Inline=0=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices




___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Petite Abeille

On Feb 13, 2014, at 9:52 PM, Jan Nijtmans wrote:

> But if you put the validator in HTML5 mode, there are far fewer errors:

Possibly. But it says 'HTML 4.01 Strict' on the tin:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

Either way, a bunch of errors.




Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Jan Nijtmans
2014-02-13 21:35 GMT+01:00 Petite Abeille:
>
> On Feb 13, 2014, at 9:08 PM, Petite Abeille wrote:
>
>> curl -s http://www.sqlite.org | lynx -nolist -stdin -dump
>
> While we are at it, www.sqlite.org exhibits many validation errors:
>
> http://validator.w3.org/check?uri=http%3A%2F%2Fwww.sqlite.org%2F=%28detect+automatically%29=Inline=0=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices#result

But if you put the validator in HTML5 mode, there are far fewer errors:



Regards,
Jan Nijtmans


Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Petite Abeille

On Feb 13, 2014, at 9:08 PM, Petite Abeille wrote:

> curl -s http://www.sqlite.org | lynx -nolist -stdin -dump

While we are at it, www.sqlite.org exhibits many validation errors:

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.sqlite.org%2F=%28detect+automatically%29=Inline=0=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices#result





Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Scott Robison
My current project needed to tokenize the text in HTML without the tags.
The easy solution for us was to license a library from Chilkat that
supports text extraction, then tokenize the extracted text. I'm on my phone
at the moment but could supply more details later if desired.

SDR
On Feb 13, 2014 1:02 PM, "David King" wrote:

> > I'm new to SQLite. Does anybody know whether there is an HTML tokenizer
> > for full-text search, or do I need to implement my own?
>
> There isn't an HTML tokeniser. But the default tokeniser considers
> punctuation like <> to be word breaks, so it may already work for you, with
> the downside that input like <div class="foo">Hello!</div> will yield
> "div", "class", "foo", and "hello" as words (rather than just the "hello"
> that you may be after).


Re: [sqlite] HTML Tokenizer

2014-02-13 Thread Petite Abeille

On Feb 13, 2014, at 8:48 PM, Wang, Baoping wrote:

> I'm new to SQLite. Does anybody know whether there is an HTML tokenizer for full-text search,

No.

> or do I need to implement my own?

If you feel the urge. Otherwise, try lynx -dump.

For example:

curl -s http://www.sqlite.org | lynx -nolist -stdin -dump
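
For anyone who would rather stay in-process than shell out to lynx, the same strip-the-markup-first idea can be sketched in Python with the standard library's html.parser. This is only an illustrative sketch, not a lynx replacement: the names TextExtractor and html_to_text are made up for this example, and it makes no attempt at lynx's layout handling. The resulting plain text is what would then be handed to the full-text index.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes only, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent blocks do not run together,
        # then collapse the whitespace left behind by the removed markup.
        # Caveat: a word split by inline tags (<b>Hel</b>lo) becomes two words.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(html_to_text('<div class="foo">Hello!</div>'))
```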



Re: [sqlite] HTML Tokenizer

2014-02-13 Thread David King
> I'm new to SQLite. Does anybody know whether there is an HTML tokenizer for full-text search,
> or do I need to implement my own?

There isn't an HTML tokeniser. But the default tokeniser considers punctuation 
like <> to be word breaks, so it may already work for you, with the downside 
that input like <div class="foo">Hello!</div> will yield "div", "class", 
"foo", and "hello" as words (rather than just the "hello" that you may be 
after).
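
To illustrate the point, here is a rough Python approximation of a tokenizer that treats every non-alphanumeric ASCII character as a word break and lowercases what is left. It is not SQLite's tokenizer code, just a stand-in built on assumptions: the function name simple_tokens is made up, the sample input is a guess at the kind of markup meant above, and non-ASCII characters are ignored for simplicity.

```python
import re


def simple_tokens(text):
    """Split on anything that is not an ASCII letter or digit, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


# Tag and attribute names end up as terms alongside the visible text.
print(simple_tokens('<div class="foo">Hello!</div>'))
# prints ['div', 'class', 'foo', 'hello', 'div']
```

Note that the closing tag contributes a second "div", so markup-heavy pages would fill the index with tag names unless the tags are removed first.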




[sqlite] HTML Tokenizer

2014-02-13 Thread Wang, Baoping
I'm new to SQLite. Does anybody know whether there is an HTML tokenizer for full-text search,
or do I need to implement my own?

Thanks




