Thanks James. Filed https://issues.apache.org/jira/browse/PHOENIX-2240.
On Tue, Sep 8, 2015 at 12:38 PM, James Heather <[email protected]> wrote:

> Thanks.
>
> I've discovered that the cause is even simpler. With 100M rows, you get
> collisions in the primary key in the CSV file. An experiment (capturing the
> CSV file, and counting the rows with a unique primary key) reveals that the
> number of unique primary keys is about 500 short of the full 100M. So the
> upserting is working as it should!
>
> I don't know if there's a way round this, because it does produce rather
> suspicious-looking results. It might be worth having the program emit a
> warning to this effect if the parameter size is large, or finding a way to
> increase the entropy in the primary keys that are generated, to ensure that
> there won't be collisions.
>
> It's a bit surprising no one has run into this before! Hopefully this
> script has been run on that many rows before... it seems a reasonable
> number for testing performance of a scalable database... (in fact I was
> planning to increase the row count somewhat).
>
> James
>
>
> On 08/09/15 20:16, James Taylor wrote:
>
> Hi James,
> Looks like currently you'll get a error log message generated if a row is
> attempted to be imported but cannot be (usually due to the data not being
> compatible with the schema). For psql.py, this would be the client side log
> and messages would look like this:
>     LOG.error("Error upserting record {}: {}", csvRecord, errorMessage);
>
> FWIW, we have a "strict" option for CSV loading (using the -s or --strict
> option) which is meant to cause the load to abort if bad data is found, but
> it doesn't look like this is currently checked (when bad data is
> encountered). I've filed PHOENIX-2239 for this.
>
> Thanks,
> James
>
> On Tue, Sep 8, 2015 at 11:26 AM, James Heather <[email protected]>
> wrote:
>
>> I've had another go running the performance.py script to upsert
>> 100,000,000 rows into a Phoenix table, and again I've ended up with around
>> 500 rows missing.
>>
>> Can anyone explain this, or reproduce it?
>>
>> It is rather concerning: I'm reluctant to use Phoenix if I'm not sure
>> whether rows will be silently dropped.
>>
>> James
>>
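For anyone wanting to repeat the experiment described in the quoted message (counting the rows with a unique primary key in the captured CSV), a minimal sketch in Python might look like the following. The function name, the sample data, and the choice of key columns are all hypothetical; the thread does not say which columns of the performance.py table make up the primary key, so adjust `key_columns` to match the actual schema.

```python
import csv
import io


def count_unique_keys(rows, key_columns=(0,)):
    """Return (total_rows, unique_keys) for an iterable of CSV records.

    key_columns holds the positions of the primary-key columns; the
    default of (0,) is an assumption, not the real performance.py schema.
    """
    seen = set()
    total = 0
    for record in rows:
        total += 1
        seen.add(tuple(record[i] for i in key_columns))
    return total, len(seen)


# Tiny made-up sample: rows 1 and 3 share the same two-column key, so an
# upsert of all three rows would leave only two rows in the table.
sample = io.StringIO("a,1,x\nb,2,y\na,1,z\n")
total, unique = count_unique_keys(csv.reader(sample), key_columns=(0, 1))
print("rows:", total, "unique keys:", unique, "collisions:", total - unique)
```

On a real file, pass `csv.reader(open(path, newline=""))` instead of the in-memory sample. Holding 100M key tuples in a Python set needs a lot of memory, so for a file of that size it may be more practical to store a compact hash of each key, or to sort the key columns externally and count distinct lines.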
