Re: [sqlite] Fuzzy Matching

2008-07-06 Thread Stephen Woodbridge
OK, thanks for that. I had seen the references to the n-gram stuff and 
just started reading about it, and now my head is ready to explode! 
Between n-grams, bit vectors, Bloom filters, perfect hashes, and lots of 
academic papers spouting fancy equations and statistics, this isn't 
gelling into any course of action for me. While this technology seems 
very cool, I still get the sense that you have to scan each document's 
n-gram vector against the query n-gram vector to get a set of potential 
documents that might have what I want in them. Or am I missing 
something? If so, what? While this type of search is probably faster 
than some alternatives, it does not seem to lend itself to the 
fundamentals of an RDBMS or to indexed searches.
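For reference, here is a rough sketch of how a trigram inverted index
can narrow the candidate set without scanning every stored n-gram
vector (Python; the word list, function names, and the min_shared
threshold are made up purely for illustration):

    from collections import defaultdict

    def trigrams(word):
        # pad so prefixes/suffixes contribute their own trigrams
        padded = "  " + word.lower() + "  "
        return {padded[i:i+3] for i in range(len(padded) - 2)}

    index = defaultdict(set)   # trigram -> words containing it

    def add_word(word):
        for g in trigrams(word):
            index[g].add(word)

    def candidates(query, min_shared=2):
        # only the query's own trigrams are looked up -- no full scan
        counts = defaultdict(int)
        for g in trigrams(query):
            for w in index[g]:
                counts[w] += 1
        return [w for w, c in counts.items() if c >= min_shared]

    for w in ("smith", "smyth", "smithe", "jones"):
        add_word(w)
    print(candidates("smiht"))   # smith, smithe, smyth; jones never touched

Only words sharing enough trigrams with the query ever get scored, so
the expensive distance computation is limited to a short candidate list.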

I would appreciate any thoughts on how to proceed with this.

Levenshtein edit distance is a good way to score the results of a fuzzy 
match. I have also used metaphone, double metaphone, and soundex in the 
past for fuzzy searching.
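For scoring inside the database itself, a custom SQL function works. A
minimal sketch using Python's sqlite3 module (the function name
"editdist", the words table, and the threshold of 2 are illustrative
only):

    import sqlite3

    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    con = sqlite3.connect(":memory:")
    con.create_function("editdist", 2, levenshtein)
    con.execute("CREATE TABLE words(w TEXT)")
    con.executemany("INSERT INTO words VALUES (?)",
                    [("smith",), ("smyth",), ("jones",)])
    for row in con.execute(
            "SELECT w, editdist(w, :q) AS d FROM words "
            "WHERE editdist(w, :q) <= 2 ORDER BY d", {"q": "smiht"}):
        print(row)   # ('smith', 2)

In practice you would only score a candidate set (for example the
output of an n-gram lookup), not the whole table, since the function is
evaluated per row.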

-Steve

Harold Wood wrote:
> I can't go into too much detail because of my current job, but for
> fuzzy matching Levenshtein isn't very good; you need to try looking
> into n-gram matching techniques. They are absolutely awesome at
> reducing over- and under-matches.
> 
> Woody
> 
> --- On Sat, 7/5/08, Stephen Woodbridge <[EMAIL PROTECTED]>
> wrote:
> 
> From: Stephen Woodbridge <[EMAIL PROTECTED]>
> Subject: Re: [sqlite] Fuzzy Matching
> To: "General Discussion of SQLite Database"
> Date: Saturday, July 5, 2008, 11:24 PM
> 
> Stephen Woodbridge wrote:
>> I would be interested in having something like this also.
>> 
>> What I don't understand in your approach is how you compute the 
>> (Levenshtein) distance during a search. It seems like you have a
>> fixed set of tokens from your document text and these are indexed.
>> Then you have a query token that you want to compare to the index
>> based on some fuzzy distance. Since every query can be different, I
>> think you have to compute the distance for every key in the index,
>> and that would require doing a full index scan.
>> 
>> If there were a function that you could run a token through that
>> would give you that token's "location" in some space, then you could
>> generate a similar "location" for the query token and then use the
>> rtree and distance. I'm not aware of any such function, but my
>> expertise is more in GIS than search searching.
> 
> Hmmm, that was supposed to say text searching.
> 
>> Thoughts?
>> 
>> Best, -Steve
>> 
>> Martin Pfeifle wrote:
>>> Hi, I think there is nothing available except FTS. Doing a full
>>> table scan and computing for each string the (Levenshtein)
>>> distance to the query object is too time-consuming. So what I
>>> would like to see is the implementation of a generic metric index
>>> which takes a metric distance function as one parameter. Based on
>>> such a distance function you could then do similarity search on
>>> any objects, e.g. images, strings, etc. One possible index would
>>> be the M-tree (which you can also organize relationally, as was
>>> done with the R*-tree). The idea is that you have a hierarchical
>>> index and each node is represented by a database object o and a
>>> covering radius r reflecting the maximal distance of all objects
>>> in that subtree to the object o. If you do a range query, you
>>> compute the distance of your query object to the object o. If
>>> this distance minus the covering radius r is bigger than your
>>> query range, you can prune that subtree. You can either implement
>>> such a similarity module as its own extension, similar to FTS or
>>> the Spatial module, or integrate it into FTS and use it only for
>>> strings. Personally, I need the second solution because I'd like
>>> to do full and fuzzy text search. Are there any plans to implement
>>> something like this? If yes, I would like to take part in such a
>>> development. Best Martin
>>> 
>>> 
>>> 
>>> 
>>> ----- Original Message -----
>>> From: Alberto Simões <[EMAIL PROTECTED]>
>>> To: General Discussion of SQLite Database
>>> Sent: Thursday, July 3, 2008, 21:52:05
>>> Subject: [sqlite] Fuzzy Matching
>>> 
>>> Hello
>>> 
>>> Although I am quite certain that the answer is that SQLite does
>>> not provide any mechanism to help me with this, it doesn't hurt to
>>> ask. Who knows if anybody has a suggestion.
>>> 
>>> Basically, I am using SQLite for a dictionary, and I want to let
>>> the user do fuzzy searches. OK, some simple Levenshtein distance
>>> of one or two would do the trick, probably.
>>> 
>>> I imagine that SQLite (given the "lite") does not provide any kind
>>> of near-misses search. But perhaps somebody here has done something
>>> similar in some language?
>>> 
>>> Cheers Alberto
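Martin's M-tree pruning rule quoted above can be sketched roughly as
follows (Python; the node layout and the tiny numeric example are
invented for illustration, not taken from any M-tree implementation):

    def range_query(node, q, rng, dist, results):
        for entry in node["entries"]:
            d = dist(q, entry["obj"])
            # prune: if d(q, o) - r > range, then by the triangle
            # inequality nothing in the subtree can be within range of q
            if d - entry["radius"] > rng:
                continue
            child = entry.get("child")
            if child is None:          # leaf entry holds a data object
                if d <= rng:
                    results.append(entry["obj"])
            else:                      # routing entry: descend
                range_query(child, q, rng, dist, results)
        return results

    leaf = {"entries": [{"obj": 3, "radius": 0}, {"obj": 9, "radius": 0}]}
    root = {"entries": [{"obj": 5, "radius": 4, "child": leaf}]}
    print(range_query(root, 2, 2, lambda a, b: abs(a - b), []))   # [3]

The same traversal works for any metric distance function, strings with
edit distance included, which is the appeal of a generic metric index.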

Re: [sqlite] PRAGMA synchronous = OFF on transaction-safe file system TFAT WinCE

2008-07-06 Thread Karl Tomlinson
Daniel wrote:

> Is it safe to set PRAGMA synchronous = OFF when a
> transaction-safe file system is used?
>
> We are working on WinCE with TFAT (see below) - but this might
> be a general question.

> TFAT:
> ... By making file operations 
> transaction-safe, TFAT stabilizes the file system and 
> ensures that the file system is not corrupted when an interruption occurs.

One thing you'd need to ensure is that data is safe as well as the
filesystem.

It seems that TransactData provides this:

 http://msdn.microsoft.com/en-us/library/aa915463.aspx
 "By default, only modifications to a directory and the FAT are
  backed up during a transaction using TFAT. To back up
  modifications to the data of a file, you must set the TransactData
  registry key to 1."

For database consistency, the other thing you'd need to ensure is
that file system transactions are committed in the same order as
they are scheduled.  ForceWriteThrough or FILE_FLAG_WRITE_THROUGH
would be sufficient to ensure this.

 "If you set this value and also set the ForceWriteThrough value to
  1, all successful write operations are committed."

I don't know whether write through is necessary with TFAT but it
sounds like it may be:

 http://msdn.microsoft.com/en-us/library/aa916300.aspx
 "All databases and memory-mapped files must commit every WriteFile
  operation to the FAT atomically and set the
  FILE_FLAG_WRITE_THROUGH value to TRUE."

Donald wrote:

> A transaction in the database would usually involve multiple disk
> writes, and I don't see how the filesystem would know what constitutes
> an SQLite transaction.

SQLite would still need to keep a journal.

IIUC, synchronous = OFF doesn't disable the journal; it merely
disables the syncing of the journal.

It looks to me like TFAT can be safe with respect to keeping a
consistent database with synchronous = OFF, but if that requires
turning on write-through, then there is no gain: every write becomes
synchronous anyway, and the journal will still be the same size.
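To illustrate the first point, a small sketch using Python's sqlite3
module (file name is illustrative): setting synchronous = OFF leaves
the rollback journal in place; only the fsync() calls are skipped.

    import sqlite3

    con = sqlite3.connect("example.db")
    con.execute("PRAGMA synchronous = OFF")   # skip syncing; journal still used
    print(con.execute("PRAGMA journal_mode").fetchone())   # ('delete',)
    con.execute("CREATE TABLE IF NOT EXISTS t(x)")
    with con:   # one transaction: journal written, then removed at commit
        con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
    con.close()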

Daniel wrote:

> We have a 14MB SQLite database on a 16MB flash disk. The journal
> file gets too big on some queries, which results in a SQLITE_FULL
> error.
> 
> Any other ideas to make data storage secure?

I can't think of a way to implement transactions without a journal.
You don't have much room to play with here.
All I can suggest is reducing the size of the transactions (or
recording them more efficiently in the journal).
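For example, something along these lines keeps each journal small by
committing in batches (Python; the table, file name, and batch size are
illustrative, and you do lose atomicity across batches):

    import sqlite3

    def insert_in_batches(con, rows, batch_size=500):
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:
                with con:   # commit: the rollback journal is released
                    con.executemany("INSERT INTO t VALUES (?)", batch)
                batch = []
        if batch:
            with con:
                con.executemany("INSERT INTO t VALUES (?)", batch)

    con = sqlite3.connect("small_flash.db")
    con.execute("CREATE TABLE IF NOT EXISTS t(x)")
    insert_in_batches(con, ((i,) for i in range(10000)))
    con.close()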
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users