Hi Jerome, Good point!! Really a nice usage of static columns! BTW, wouldn't the EmpID be static as well?
Cheers Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso> On 7 July 2015 at 14:42, Jérôme Mainaud <jer...@mainaud.com> wrote: > Hello, > > You can slightly adapt Carlos answer to reduce repliation of data that > don't change for month to month. > Static columns are great for this. > > The table become: > > CREATE TABLE salaries ( > EmpID varchar, > FN varchar *static*, > LN varchar *static*, > Phone varchar *static*, > Address varchar *static*, > month integer, > basic integer, > flexible_allowance float, > PRIMARY KEY(EmpID, month) > ) > > There is only one copy of static column per partition the value is shared > between all rows of the partition. > When Employee data change you can update it with the partition key in the > where clause. > When you insert a new month entry you just fill non static columns. > The table can be queried the same way as the original one. > > Cheers > > > > -- > Jérôme Mainaud > jer...@mainaud.com > > 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services < > rory.bramw...@devopservices.com>: > >> Hi, >> >> I've been following this thread and my thoughts are inline with Carlos' >> latest response... Model your data to suite your queries. That is one of >> the data model / design considerations in Cassandra that differs from the >> RDBMS world. Embrace demoralization and data duplication. Disk space is >> cheapest, so exploit how your data is laid out in order to optimize for >> faster reads (which are more costly than writes). >> >> Regards, >> >> Rory Bramwell >> Founder and CEO >> DevOp Services >> >> Skype: devopservices >> Email: rory.bramw...@devopservices.com >> Web: www.devopservices.com >> On Jul 7, 2015 4:02 AM, "Carlos Alonso" <i...@mrcalonso.com> wrote: >> >>> I guess you're right, using my proposal, getting last employee's record >>> is straightforward and quick, but also, as Peter pointed, getting all slips >>> for a particular month requires you to know all the employee IDs and, >>> ideally, run a query for each employee. This would work depending on how >>> many employees you're managing. >>> >>> At this moment I'm beginning to feel that maybe using both approaches is >>> the best way to go. And I think this is one of Cassandra's recommendations: >>> Write your data in several formats if required to fit your reads. Therefore >>> I'd use my suggestion for getting a salary by employee ID and I'd also have >>> Peter's one to run the end of the month query. >>> Does it make sense? >>> >>> Cheers! >>> >>> Carlos Alonso | Software Engineer | @calonso >>> <https://twitter.com/calonso> >>> >>> On 7 July 2015 at 09:07, Srinivasa T N <seen...@gmail.com> wrote: >>> >>>> Thanks for the inputs. >>>> >>>> Now my question is how should the app populate the duplicate data, >>>> i.e., if I have an employee record (along with his FN, LN,..) for the month >>>> of Apr and later I am populating the same record for the month of may (with >>>> salary changed), should my application first read/fetch the corresponding >>>> data for apr and re-insert with modification for month of may? >>>> >>>> Regards, >>>> Seenu. >>>> >>>> On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded <oded.p...@rsa.com> wrote: >>>> >>>>> The data model suggested isn’t optimal for the “end of month” query >>>>> you want to run since you are not querying by partition key. >>>>> >>>>> The query would look like “select EmpID, FN, LN, basic from salaries >>>>> where month = 1” which requires filtering and has unpredictable >>>>> performance. >>>>> >>>>> >>>>> >>>>> For this type of query to be fast you can use the “month” column as >>>>> the partition key and the “EmpID” and the clustering column. >>>>> >>>>> This approach also has drawbacks: >>>>> >>>>> 1. This data model creates a wide row. Depending on the number of >>>>> employees this partition might be very large. You should limit partition >>>>> sizes to 25MB >>>>> >>>>> 2. Distributing data according to month means that only a small number >>>>> of nodes will hold all of the salary data for a specific month which might >>>>> cause hotspots on those nodes. >>>>> >>>>> >>>>> >>>>> Choose the approach that works best for you. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> *From:* Carlos Alonso [mailto:i...@mrcalonso.com] >>>>> *Sent:* Monday, July 06, 2015 7:04 PM >>>>> *To:* user@cassandra.apache.org >>>>> *Subject:* Re: Example Data Modelling >>>>> >>>>> >>>>> >>>>> Hi Srinivasa, >>>>> >>>>> >>>>> >>>>> I think you're right, In Cassandra you should favor denormalisation >>>>> when in RDBMS you find a relationship like this. >>>>> >>>>> >>>>> >>>>> I'd suggest a cf like this >>>>> >>>>> CREATE TABLE salaries ( >>>>> >>>>> EmpID varchar, >>>>> >>>>> FN varchar, >>>>> >>>>> LN varchar, >>>>> >>>>> Phone varchar, >>>>> >>>>> Address varchar, >>>>> >>>>> month integer, >>>>> >>>>> basic integer, >>>>> >>>>> flexible_allowance float, >>>>> >>>>> PRIMARY KEY(EmpID, month) >>>>> >>>>> ) >>>>> >>>>> >>>>> >>>>> That way the salaries will be partitioned by EmpID and clustered by >>>>> month, which I guess is the natural sorting you want. >>>>> >>>>> >>>>> >>>>> Hope it helps, >>>>> >>>>> Cheers! >>>>> >>>>> >>>>> Carlos Alonso | Software Engineer | @calonso >>>>> <https://twitter.com/calonso> >>>>> >>>>> >>>>> >>>>> On 6 July 2015 at 13:01, Srinivasa T N <seen...@gmail.com> wrote: >>>>> >>>>> Hi, >>>>> >>>>> I have basic doubt: I have an RDBMS with the following two tables: >>>>> >>>>> Emp - EmpID, FN, LN, Phone, Address >>>>> Sal - Month, Empid, Basic, Flexible Allowance >>>>> >>>>> My use case is to print the Salary slip at the end of each month >>>>> and the slip contains emp name and his other details. >>>>> >>>>> Now, if I want to have the same in cassandra, I will have a single >>>>> cf with emp personal details and his salary details. Is this the right >>>>> approach? Should we have the employee personal details duplicated each >>>>> month? >>>>> >>>>> Regards, >>>>> Seenu. >>>>> >>>>> >>>>> >>>> >>>> >>> >