Re: missing rows after using performance.py

James Heather Tue, 08 Sep 2015 12:39:50 -0700

Thanks.

I've discovered that the cause is even simpler. With 100M rows, you get collisions in the primary key in the CSV file. An experiment (capturing the CSV file, and counting the rows with a unique primary key) reveals that the number of unique primary keys is about 500 short of the full 100M. So the upserting is working as it should!
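
For anyone wanting to repeat the experiment, here is a minimal sketch of the counting step in Python. The function name and the choice of key columns are hypothetical: it assumes you know which CSV columns make up the table's primary key, and it uses a tiny in-memory sample rather than the real file.

```python
import csv
import io


def count_unique_keys(rows, key_columns):
    """Return (total_rows, unique_keys) for an iterable of parsed CSV rows.

    key_columns is a tuple of column indexes forming the primary key.
    Which columns those are is an assumption the caller must supply.
    """
    seen = set()
    total = 0
    for row in rows:
        total += 1
        # Rows sharing this tuple would overwrite each other on upsert.
        seen.add(tuple(row[i] for i in key_columns))
    return total, len(seen)


# Tiny made-up sample: the first and third rows share a key on columns 0-2.
sample = io.StringIO(
    "a,x,1,10\n"
    "b,y,2,20\n"
    "a,x,1,30\n"
)
total, unique = count_unique_keys(csv.reader(sample), key_columns=(0, 1, 2))
print(total, unique, total - unique)  # 3 2 1
```

At 100M rows a set of tuples needs a lot of memory, so sorting the key columns on disk and counting distinct lines may be more practical.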

I don't know if there's a way round this, because it does produce rather suspicious-looking results. It might be worth having the program emit a warning to this effect if the parameter size is large, or finding a way to increase the entropy in the primary keys that are generated, to ensure that there won't be collisions.
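
As a rough guide to when such a warning would be worth emitting, the birthday approximation gives the expected number of rows lost to duplicate keys as about n^2 / (2N), for n rows drawn from N possible keys. This is only a sketch: it assumes keys are drawn uniformly at random, which may not match how the script actually generates them, and the function names are hypothetical.

```python
def expected_collisions(n_rows, key_space):
    """Approximate rows lost to duplicate keys, assuming uniform random keys.

    Birthday approximation n^2 / (2N); reasonable while n_rows is much
    smaller than key_space.
    """
    return n_rows * n_rows / (2.0 * key_space)


def implied_key_space(n_rows, observed_collisions):
    """Invert the approximation: estimate the key space from observed loss."""
    return n_rows * n_rows / (2.0 * observed_collisions)


n = 100_000_000
space = implied_key_space(n, 500)
print(f"{space:.1e}")                 # 1.0e+13
print(expected_collisions(n, space))  # 500.0
```

Under that assumption, about 500 missing rows out of 100M would correspond to an effective key space of around 10^13 distinct keys.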

It's a bit surprising no one has run into this before! Hopefully this script has been run on that many rows before... it seems a reasonable number for testing performance of a scalable database... (in fact I was planning to increase the row count somewhat).


James

On 08/09/15 20:16, James Taylor wrote:

Hi James,
Looks like currently you'll get an error log message generated if a row is attempted to be imported but cannot be (usually due to the data not being compatible with the schema). For psql.py, this would be the client-side log and messages would look like this:
    LOG.error("Error upserting record {}: {}", csvRecord, errorMessage);
FWIW, we have a "strict" option for CSV loading (using the -s or --strict option) which is meant to cause the load to abort if bad data is found, but it doesn't look like this is currently checked (when bad data is encountered). I've filed PHOENIX-2239 for this.
Thanks,
James
On Tue, Sep 8, 2015 at 11:26 AM, James Heather <james.heat...@mendeley.com> wrote:
    I've had another go running the performance.py script to upsert
    100,000,000 rows into a Phoenix table, and again I've ended up
    with around 500 rows missing.

    Can anyone explain this, or reproduce it?

    It is rather concerning: I'm reluctant to use Phoenix if I'm not
    sure whether rows will be silently dropped.

    James
