> As I do not have Billions of input records (but a max of 10 Million) the
> added benefit of scaling out the per-line processing is probably not worth
> the additional setup and operations effort of Hadoop.

I would start with a regular app and then go to Hadoop if needed, assuming you are only dealing with a few MB's of data.
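[Editor's note] The "regular app" approach amounts to a single streaming export loop. A minimal Java sketch, assuming the Cassandra read side is stubbed out as a plain iterator of pre-built lines (the actual driver paging calls are omitted):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;

// Sketch of the single-process export: stream article lines from some
// source (here a stub iterator standing in for a paged Cassandra read)
// and write them straight to one output file. At ~10 million rows this
// is mostly I/O bound, so one JVM with a buffered writer is plausible.
public class SingleProcessExport {

    static void export(Iterator<String> articleLines, Path out) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (articleLines.hasNext()) {
                w.write(articleLines.next()); // one output line per article
                w.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // In the real app this iterator would page through Cassandra with
        // the Java driver instead of using a fixed list.
        Iterator<String> lines = List.of("P100-42,red,M", "P100-42,blue,L").iterator();
        Path out = Files.createTempFile("articles", ".csv");
        export(lines, out);
        System.out.println(Files.readAllLines(out));
    }
}
```

Because each line is produced independently, the same loop could later be split across workers if the data ever outgrows one process.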
There can be a significant human startup cost to bringing Hadoop into a C* setup. I recommend using http://www.datastax.com/what-we-offer/datastax-enterprise in a development environment to see if it's something you want to do (requires a licence to run in prod).

Cheers

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/08/2013, at 8:15 PM, Jan Algermissen <jan.algermis...@nordsc.com> wrote:

> Hi,
>
> I have a specific use case to address with Cassandra and I can't get my head
> around whether using Hadoop on top creates any significant benefit or not.
>
> Situation:
>
> I have product data, and each product 'contains' a number of articles (<100 per
> product) representing individual colors/sizes etc.
>
> My plan is to store each product in Cassandra as a wide row containing all
> the articles for that product. I chose this design because sometimes I need to
> work with all the articles in a product and sometimes I just need to pick one
> of them per product.
>
> My understanding is that picking a certain 'row' from all the 'rows' in a
> wide row is efficient (because it works on a per-row basis) and that any other
> approach would require a scan over essentially all the rows (not good).
>
> So, after selecting one, some, or all of the 'rows' (articles) from every
> single wide row (product), the input to my data processing is essentially a
> bunch of articles.
>
> The final output of the overall processing will be an export file (XML or
> CSV) containing one line (or element) per article. There is no 'cross-article'
> analysis going on; it is really a sort of one-in/one-out process.
>
> I am looking at Hadoop because I see MapReduce as a nice fit given the
> independence of the per-article transformation into an output 'line'.
> What I am worried about is whether Hadoop will actually give me a real
> benefit: while there will be processing (mostly string operations) going on
> to create lines from articles, the output still needs to be pulled over the
> wire to some place to create the single output file.
>
> I wonder whether it would not work equally well to pull the necessary data
> from Cassandra per article and create the output file in a single process
> (in my case a Java Web app). As I do not have Billions of input records (but
> a max of 10 Million), the added benefit of scaling out the per-line
> processing is probably not worth the additional setup and operations effort
> of Hadoop.
>
> Any idea how I could make a judgement call here?
>
> Another question: I read in a C* 1.1 related slide deck that Hadoop output to
> CFS is only possible with DSE and not with DSC - that with DSC the Hadoop
> output would be HDFS. Is that correct? For homogeneity, I would certainly
> want to store the output files in CFS, too.
>
> Sorry that this was a bit of a longer question/explanation.
>
> Jan
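[Editor's note] The one-in/one-out, per-article string transformation described above could look like the following in Java. The `Article` fields (sku, color, size, price) and the CSV layout are hypothetical stand-ins, not anything from the actual schema:

```java
// Sketch of the per-article transformation: one article in, one CSV line
// out. Because it is a pure function with no cross-article state, it
// parallelises trivially -- with or without Hadoop.
public class ArticleCsv {

    static final class Article {
        final String sku, color, size;
        final double price;
        Article(String sku, String color, String size, double price) {
            this.sku = sku; this.color = color; this.size = size; this.price = price;
        }
    }

    // Pure string work, independent per article.
    static String toCsvLine(Article a) {
        return String.join(",",
                escape(a.sku), escape(a.color), escape(a.size),
                String.format(java.util.Locale.ROOT, "%.2f", a.price));
    }

    // Minimal CSV escaping: quote fields containing commas or quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        Article a = new Article("P100-42", "red", "M", 19.9);
        System.out.println(toCsvLine(a)); // prints: P100-42,red,M,19.90
    }
}
```

If the export ever did move to Hadoop, this function would become the body of the map step unchanged, which is exactly the independence property that makes MapReduce a fit here.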