Re: [HACKERS] Netflix Prize data

2006-10-05 Thread Heikki Linnakangas

Mark Woodward wrote:


I tried to cluster the data along a particular index but had to cancel it
after 3 hours.


If the data is in random order, it's faster to do

SELECT * INTO foo_sorted FROM foo ORDER BY bar

followed by CREATE INDEX, than to run CLUSTER.

That's because CLUSTER does a full index scan of the table, which is 
slower than a seqscan + sort if the table is not already clustered.
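
For concreteness, a minimal sketch of that approach (table and column
names here are assumptions, borrowed from elsewhere in the thread):

-- one seqscan + sort instead of CLUSTER's full index scan
SELECT * INTO ratings_by_client FROM ratings ORDER BY client;
CREATE INDEX ratings_by_client_idx ON ratings_by_client (client);
-- then drop the old table and rename the new one into place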


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Mark Woodward
> "Mark Woodward" <[EMAIL PROTECTED]> writes:
>> The rating, however, is one char 1~9. Would making it a char(1) buy
>> anything?
>
> No, that would actually hurt because of the length word for the char
> field.  Even if you used the "char" type, which really is only one byte,
> you wouldn't win anything because of alignment issues.  Personally I'd
> just go for three ints and a date, rather than trying to be cute with
> the rating.


Actually, the date is just days, right? I don't actually need much
precision. So, store the date as a smallint and do this: smalldate =
rdate - date('1970-01-01'). And use a smallint for the rating as well.

 Column |   Type   | Modifiers
--------+----------+-----------
 movie  | integer  |
 client | integer  |
 day    | smallint |
 rating | smallint |
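
A sketch of that conversion (names are assumptions pieced together from
the thread: ratings2 is the date-typed table mentioned elsewhere, and
subtracting two dates yields an integer number of days):

SELECT item AS movie, client,
       (rdate - date '1970-01-01')::smallint AS day,
       rating::smallint AS rating
  INTO ratings3
  FROM ratings2;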





Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Tom Lane
"Mark Woodward" <[EMAIL PROTECTED]> writes:
> The rating, however, is one char 1~9. Would making it a char(1) buy anything?

No, that would actually hurt because of the length word for the char
field.  Even if you used the "char" type, which really is only one byte,
you wouldn't win anything because of alignment issues.  Personally I'd
just go for three ints and a date, rather than trying to be cute with
the rating.

regards, tom lane



Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Mark Woodward
>
> "Greg Sabino Mullane" <[EMAIL PROTECTED]> writes:
>
>> CREATE TABLE rating (
>>   movie  SMALLINT NOT NULL,
>>   person INTEGER  NOT NULL,
>>   rating SMALLINT NOT NULL,
>>   viewed DATE NOT NULL
>> );
>
> You would probably be better off putting the two smallints first
> followed by the integer and date. Otherwise both the integer and the
> date field will have an extra two bytes of padding, wasting 4 bytes of
> space.
>
> If you reorder the fields that way you'll be down to 28 bytes of tuple
> header overhead and 12 bytes of data. There's actually another 4 bytes
> in the form of the line pointer, so a total of 44 bytes per record.
> I.e., almost 73% of the disk i/o you're seeing is actually per-record
> overhead.

That's good advice; however, Netflix reportedly has more than 64K
movies, so while the test data may fit in a smallint, I doubt the
overall system would.

The rating, however, is one char 1~9. Would making it a char(1) buy anything?

I wonder what would happen if I packed movie ID and rating into one
int: one byte for rating, three bytes for movie ID. That could reduce
the data size by at least half a gig.
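
A sketch of that packing (illustrative only; it uses the movie, person,
and viewed columns from the table quoted above, and assumes movie IDs
fit in three bytes):

-- pack: movie id in the high three bytes, rating in the low byte
SELECT (movie::integer << 8) | rating::integer AS movie_rating,
       person, viewed
  INTO rating_packed
  FROM rating;

-- unpack on the way back out
SELECT movie_rating >> 8 AS movie, movie_rating & 255 AS rating
  FROM rating_packed;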



Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Gregory Stark

"Greg Sabino Mullane" <[EMAIL PROTECTED]> writes:

> CREATE TABLE rating (
>   movie  SMALLINT NOT NULL,
>   person INTEGER  NOT NULL,
>   rating SMALLINT NOT NULL,
>   viewed DATE NOT NULL
> );

You would probably be better off putting the two smallints first followed by
the integer and date. Otherwise both the integer and the date field will have
an extra two bytes of padding, wasting 4 bytes of space.

If you reorder the fields that way you'll be down to 28 bytes of tuple header
overhead and 12 bytes of data. There's actually another 4 bytes in the form of
the line pointer, so a total of 44 bytes per record. I.e., almost 73% of the
disk i/o you're seeing is actually per-record overhead.
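
Concretely, the reordering described above would be:

CREATE TABLE rating (
  movie  SMALLINT NOT NULL,
  rating SMALLINT NOT NULL,
  person INTEGER  NOT NULL,
  viewed DATE NOT NULL
);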

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Mark Woodward
> "Mark Woodward" <[EMAIL PROTECTED]> writes:
>> The one thing I notice is that it is REAL slow.
>
> How fast is your disk?  Counting on my fingers, I estimate you are
> scanning the table at about 47MB/sec, which might or might not be
> disk-limited...
>
>> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06"
>
> So why aren't you storing it as type "date"?
>

You are assuming I gave it any thought at all. :-)

I converted it to a date type (create table ratings2 as )
[EMAIL PROTECTED]:~/netflix/download$ time psql -c "select count(*) from
ratings" netflix
   count
-----------
 100480507
(1 row)


real    1m29.852s
user    0m0.002s
sys     0m0.005s

That's about the right speedup given the reduction in data size.

OK, I guess I am crying wolf; 47MB/sec isn't all that bad for this system.



Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Mark Woodward
>> I signed up for the Netflix Prize. (www.netflixprize.com)
>> and downloaded their data and have imported it into PostgreSQL.
>> Here is how I created the table:
>
> I signed up as well, but have the table as follows:
>
> CREATE TABLE rating (
>   movie  SMALLINT NOT NULL,
>   person INTEGER  NOT NULL,
>   rating SMALLINT NOT NULL,
>   viewed DATE NOT NULL
> );
>
> I also recommend not loading the entire file until you get further
> along in the algorithm solution. :)
>
> Not that I have time to really play with this

As luck would have it, I wrote a recommendations system based on music
ratings a few years ago.

After reading the NYT article, it seems as though one or more of the people
behind "Net Perceptions" is either helping them or built their system; I'm
not sure. I wrote my system because Net Perceptions was too slow and did a
lousy job.

I think the notion of "communities" in general is an interesting study in
statistics, but everything I've seen in the form of bad recommendations
shows that while [N] people may share certain tastes, that doesn't
necessarily mean that what one likes the others will. This is especially
flawed with movie rentals because it is seldom a 1:1 ratio of movies to
people: there are often multiple people in a household, and a movie is
almost always watched by more than one person.

Anyway, good luck! (Not better than me, of course :-)



Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Greg Sabino Mullane


> I signed up for the Netflix Prize. (www.netflixprize.com)
> and downloaded their data and have imported it into PostgreSQL.
> Here is how I created the table:

I signed up as well, but have the table as follows:

CREATE TABLE rating (
  movie  SMALLINT NOT NULL,
  person INTEGER  NOT NULL,
  rating SMALLINT NOT NULL,
  viewed DATE NOT NULL
);

I also recommend not loading the entire file until you get further
along in the algorithm solution. :)

Not that I have time to really play with this

--
Greg Sabino Mullane [EMAIL PROTECTED]
PGP Key: 0x14964AC8 200610041827
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8





Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Gregory Stark

"Mark Woodward" <[EMAIL PROTECTED]> writes:

> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06" So,
> the raw data is 23 bytes, the date string will probably be rounded up to
> 12 bytes, that's 24 bytes per row of data. What is the overhead per
> variable? per row?
>
> Is there any advantage to using "varchar(10)" over "text" ?

I'll second the "use a date" comment.

But to answer the questions, text and varchar are handled identically in
almost every respect. The overhead per variable width field (like text or
varchar) is 4 bytes. The overhead per row depends on a few factors, but figure
28 bytes.
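
(Back-of-the-envelope, using those numbers on the table as posted: 3 ints =
12 bytes, the text date = 4-byte header + 10 characters, padded to 16, plus
the 28 bytes of row header, comes to roughly 56 bytes per row before line
pointers.)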

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Tom Lane
"Mark Woodward" <[EMAIL PROTECTED]> writes:
> The one thing I notice is that it is REAL slow.

How fast is your disk?  Counting on my fingers, I estimate you are
scanning the table at about 47MB/sec, which might or might not be
disk-limited...

> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06"

So why aren't you storing it as type "date"?
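
A sketch of such a conversion (table and column names taken from Mark's
original post; to_date parses the text form):

CREATE TABLE ratings2 AS
  SELECT item, client, rating, to_date(rdate, 'YYYY-MM-DD') AS rdate
    FROM ratings;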

regards, tom lane



Re: [HACKERS] Netflix Prize data

2006-10-04 Thread Luke Lonergan
Mark,

On 10/4/06 1:43 PM, "Mark Woodward" <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED]:~/netflix$ time psql netflix -c "select count(*) from ratings"
>    count
> -----------
>  100480507
> (1 row)
>
> real    2m6.270s
> user    0m0.004s
> sys     0m0.005s

I think you are getting about 40MB/s on your sequential scan of about 5GB of
heap data in this case.  I calculate the size of the data as:

3 Integers (12 bytes), one text date field (10 bytes ?) and tuple overhead
(24 bytes) = 46 bytes per row

100 million rows x 46 bytes / row = 4.6 Gbytes
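
(As a check: 4.6 GB over the 126 s quoted above works out to roughly
36 MB/s.)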

- Luke





[HACKERS] Netflix Prize data

2006-10-04 Thread Mark Woodward
I signed up for the Netflix Prize. (www.netflixprize.com) and downloaded
their data and have imported it into PostgreSQL. Here is how I created the
table:
Table "public.ratings"
 Column |  Type   | Modifiers
+-+---
 item   | integer |
 client | integer |
 rating | integer |
 rdate  | text|
Indexes:
"ratings_client" btree (client)
"ratings_item" btree (item)

[EMAIL PROTECTED]:~/netflix$ time psql netflix -c "select count(*) from ratings"
   count
-----------
 100480507
(1 row)


real    2m6.270s
user    0m0.004s
sys     0m0.005s


The one thing I notice is that it is REAL slow. I know it is, in fact, 100
million records, but I don't think PostgreSQL is usually slow like this.
I'm going to check with some other machines to see if there is a problem
with my test machine or if something is weird about PostgreSQL and large
numbers of rows.

I tried to cluster the data along a particular index but had to cancel it
after 3 hours.

I'm using 8.1.4. The "rdate" field looks something like "2005-09-06". So
the raw data is 23 bytes; the date string will probably be rounded up to
12 bytes, making 24 bytes per row of data. What is the overhead per
variable? Per row?

Is there any advantage to using "varchar(10)" over "text" ?


