> As I do not have Billions of input records (but a max of 10 Million) the
> added benefit of scaling out the per-line processing is probably not worth
> the additional setup and operations effort of Hadoop.

I would start with a regular app and then go to Hadoop if needed, assuming you are only dealing with a few MB's of data.
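[Editor's note] The "regular app" approach amounts to a single streaming export loop. A minimal Java sketch, assuming the Cassandra read side is stubbed out as a plain iterator of pre-built lines (the actual driver paging calls are omitted):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;

// Sketch of the single-process export: stream article lines from some
// source (here a stub iterator standing in for a paged Cassandra read)
// and write them straight to one output file. At ~10 million rows this
// is mostly I/O bound, so one JVM with a buffered writer is plausible.
public class SingleProcessExport {

    static void export(Iterator<String> articleLines, Path out) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (articleLines.hasNext()) {
                w.write(articleLines.next()); // one output line per article
                w.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // In the real app this iterator would page through Cassandra with
        // the Java driver instead of using a fixed list.
        Iterator<String> lines = List.of("P100-42,red,M", "P100-42,blue,L").iterator();
        Path out = Files.createTempFile("articles", ".csv");
        export(lines, out);
        System.out.println(Files.readAllLines(out));
    }
}
```

Because each line is produced independently, the same loop could later be split across workers if the data ever outgrows one process.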
There can be a significant human startup cost to bringing Hadoop into a C* setup. I recommend using http://www.datastax.com/what-we-offer/datastax-enterprise in a development environment to see if it's something you want to do (requires a licence to run in prod).

Cheers

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/08/2013, at 8:15 PM, Jan Algermissen <jan.algermis...@nordsc.com> wrote:

> Hi,
>
> I have a specific use case to address with Cassandra and I can't get my head
> around whether using Hadoop on top creates any significant benefit or not.
>
> Situation:
>
> I have product data, and each product 'contains' a number of articles (<100 per
> product) representing individual colors/sizes etc.
>
> My plan is to store each product in Cassandra as a wide row containing all
> the articles for that product. I chose this design because sometimes I need to
> work with all the articles in a product and sometimes I just need to pick one
> of them per product.
>
> My understanding is that picking a certain 'row' from all the 'rows' in a
> wide row is efficient (because it works on a per-row basis) and that any other
> approach would require a scan over essentially all the rows (not good).
>
> So, after selecting one, some, or all of the 'rows' (articles) from every
> single wide row (product), the input to my data processing is essentially a
> bunch of articles.
>
> The final output of the overall processing will be an export file (XML or
> CSV) containing one line (or element) per article. There is no 'cross-article'
> analysis going on; it is really a sort of one-in/one-out process.
>
> I am looking at Hadoop because I see MapReduce as a nice fit given the
> independence of the per-article transformation into an output 'line'.
> What I am worried about is whether Hadoop will actually give me a real
> benefit: while there will be processing (mostly string operations) going on
> to create lines from articles, the output still needs to be pulled over the
> wire to some place to create the single output file.
>
> I wonder whether it would not work equally well to pull the necessary data
> from Cassandra per article and create the output file in a single process
> (in my case a Java Web app). As I do not have Billions of input records (but
> a max of 10 Million), the added benefit of scaling out the per-line
> processing is probably not worth the additional setup and operations effort
> of Hadoop.
>
> Any idea how I could make a judgement call here?
>
> Another question: I read in a C* 1.1 related slide deck that Hadoop output to
> CFS is only possible with DSE and not with DSC - that with DSC the Hadoop
> output would be HDFS. Is that correct? For homogeneity, I would certainly
> want to store the output files in CFS, too.
>
> Sorry that this was a bit of a longer question/explanation.
>
> Jan
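[Editor's note] The one-in/one-out, per-article string transformation described above could look like the following in Java. The `Article` fields (sku, color, size, price) and the CSV layout are hypothetical stand-ins, not anything from the actual schema:

```java
// Sketch of the per-article transformation: one article in, one CSV line
// out. Because it is a pure function with no cross-article state, it
// parallelises trivially -- with or without Hadoop.
public class ArticleCsv {

    static final class Article {
        final String sku, color, size;
        final double price;
        Article(String sku, String color, String size, double price) {
            this.sku = sku; this.color = color; this.size = size; this.price = price;
        }
    }

    // Pure string work, independent per article.
    static String toCsvLine(Article a) {
        return String.join(",",
                escape(a.sku), escape(a.color), escape(a.size),
                String.format(java.util.Locale.ROOT, "%.2f", a.price));
    }

    // Minimal CSV escaping: quote fields containing commas or quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        Article a = new Article("P100-42", "red", "M", 19.9);
        System.out.println(toCsvLine(a)); // prints: P100-42,red,M,19.90
    }
}
```

If the export ever did move to Hadoop, this function would become the body of the map step unchanged, which is exactly the independence property that makes MapReduce a fit here.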