Re: [sqlite] 'insert or ignore' vs self join?

2008-05-28 Thread Petite Abeille
Hello,

On May 27, 2008, at 9:07 PM, Stephen Oberholtzer wrote:

> Well, the first thing you should bring away from this experience is  
> that the
> number of VM instructions isn't really an indicator of how efficient  
> the
> query is :)

Good point :)

> Now, I'm not sure exactly why one is faster than the other,  
> especially since
> you didn't post your exact schema and indices,

The DDL is rather straightforward:

 create table if not exists token
 (
 id  integer primary key not null,
 nametext not null
 )

 create unique index if not exists token_name on token( name )

http://dev.alt.textdrive.com/browser/HTTP/Finder.ddl#L60

> and I have no idea how many rows there are in either table.

The incoming data set size varies from 108 to 3345 rows, with a  
average size of around 930 rows. The target table size is about 75,148  
rows. Over time, most of the incoming rows will already exist in the  
target table.

> But if I had to guess, it's because of the ORDER BY clause.  In  
> general, an
> ORDER BY means that SQLite needs to generate a temporary table with  
> all the
> rows to be selected/inserted,
> then sort that temporary table.  The INSERT OR IGNORE version has to
> unconditionally sort the entire 'stage' table; your second query  
> only has to
> sort those rows in 'stage' that don't already exist in 'table'.  If  
> each
> table fits comfortably in your computer's disk cache, the extra pass  
> won't
> matter so much.

Ah... yes... the order by clause... good point... indeed removing the  
'order by' from the 'insert or ignore' statement brings down its  
execution time a whisker away from the 'self join' version :)

Thanks for the explaination!

Cheers,

--
PA.
http://alt.textdrive.com/nanoki/
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] 'insert or ignore' vs self join?

2008-05-27 Thread Stephen Oberholtzer
On Tue, May 27, 2008 at 2:41 PM, Petite Abeille <[EMAIL PROTECTED]>
wrote:

> Hello,
>
> % sqlite3 -version
> 3.5.9
>
> I'm trying to figure out a frugal way to handle a unique key
> constrain...
>
> I tried using both 'insert or ignore' and a self join. The self join
> seems to be noticeably faster even though 'insert or ignore' would
> empirically appear to be the better deal (shorter query plan, less VM
> instructions).
>
> Specifically, given the following DML:
>
> insert  or ignore
> intotoken( name )
> select  stage.token as name
> fromstage
> order bystage.token;
>
> One gets a query plan like such:
>
> 0|0|TABLE stage
>
> And 'explain' reports 58 VM instructions.
>
>
> On the other hand, the following self join...
>
> insert
> intotoken( name )
> select  stage.token as name
> fromstage
> left join   token on token.name = stage.token
> where   token.id is null
> order bystage.token;
>
> ... uses a query plan like such:
>
> 0|0|TABLE stage
> 1|1|TABLE token WITH INDEX token_name
>
> ... and 82 VM instructions.
>
> Nonetheless, the self join would appear to be around 10% faster than
> the 'insert or ignore' flavor.
>
> Not sure why this is the case though... considering the apparent
> overhead incurred by the join.
>
> Thoughts?
>

Well, the first thing you should bring away from this experience is that the
number of VM instructions isn't really an indicator of how efficient the
query is :)

Now, I'm not sure exactly why one is faster than the other, especially since
you didn't post your exact schema and indices, and I have no idea how many
rows there are in either table.
But if I had to guess, it's because of the ORDER BY clause.  In general, an
ORDER BY means that SQLite needs to generate a temporary table with all the
rows to be selected/inserted,
then sort that temporary table.  The INSERT OR IGNORE version has to
unconditionally sort the entire 'stage' table; your second query only has to
sort those rows in 'stage' that don't already exist in 'table'.  If each
table fits comfortably in your computer's disk cache, the extra pass won't
matter so much.

In any case, I invite you to try the following:

1. Add an index: [[ create index stage_token_ix on stage(token);  ]] SQLite
will use that index to improve the ORDER BY.

2. Try the following variation:

insert intotoken( name )
select  stage.token as name
fromstage
where not exists(select 1 from token where token.name = stage.token)
order bystage.token;

-- 
-- Stevie-O
Real programmers use COPY CON PROGRAM.EXE
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] 'insert or ignore' vs self join?

2008-05-27 Thread Petite Abeille
Hello,

% sqlite3 -version
3.5.9

I'm trying to figure out a frugal way to handle a unique key  
constrain...

I tried using both 'insert or ignore' and a self join. The self join  
seems to be noticeably faster even though 'insert or ignore' would  
empirically appear to be the better deal (shorter query plan, less VM  
instructions).

Specifically, given the following DML:

insert  or ignore
intotoken( name )
select  stage.token as name
fromstage
order bystage.token;

One gets a query plan like such:

0|0|TABLE stage

And 'explain' reports 58 VM instructions.


On the other hand, the following self join...

insert
intotoken( name )
select  stage.token as name
fromstage
left join   token on token.name = stage.token
where   token.id is null
order bystage.token;

... uses a query plan like such:

0|0|TABLE stage
1|1|TABLE token WITH INDEX token_name

... and 82 VM instructions.

Nonetheless, the self join would appear to be around 10% faster than  
the 'insert or ignore' flavor.

Not sure why this is the case though... considering the apparent  
overhead incurred by the join.

Thoughts?

--
PA.
http://alt.textdrive.com/nanoki/


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users