Re: MultiInput/MultiGet CF in MapReduce

2013-03-29 Thread Alicia Leong
This is the current flow for ColumnFamilyInputFormat. Please correct me if I'm wrong. 1) In ColumnFamilyInputFormat, get all nodes' token ranges using *client.describe_ring* 2) Get CfSplit using *client.describe_splits_ex* with the token range 3) new ColumnFamilySplit with start range, end range a

Re: MultiInput/MultiGet CF in MapReduce

2013-03-29 Thread Edward Capriolo
You can use the output of describe_ring along with partitioner information to determine which nodes the data lives on. On Fri, Mar 29, 2013 at 12:33 PM, Alicia Leong wrote: > Hi All > > I’m thinking to do in this way. > > 1) get_slice ( MMDDHH ) from Index Table. > > 2) With th
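The lookup suggested in this reply can be sketched roughly as follows. This is a minimal illustration only: it assumes each node owns the token range ending at its own token (start exclusive, end inclusive, with wraparound), and it uses hypothetical node names and small integer tokens. It does not call the Thrift API, and mapping a row key to a token is left to the partitioner.

```python
from bisect import bisect_left

def owner_for_token(ring, token):
    """Return the node owning `token`.

    `ring` is a list of (end_token, node) pairs sorted by end_token.
    Each node is assumed to own (previous_end_token, end_token]; tokens
    beyond the last end_token wrap around to the first node.
    """
    end_tokens = [t for t, _ in ring]
    i = bisect_left(end_tokens, token)
    return ring[i % len(ring)][1]

# Hypothetical three-node ring with illustrative tokens.
ring = [(0, "node-a"), (100, "node-b"), (200, "node-c")]
for t in (50, 100, 101, 250):
    print(t, owner_for_token(ring, t))
```

With a real cluster, the (end_token, node) pairs would come from the describe_ring output, and replication would add further endpoints per range.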

Re: Insert v/s Update performance

2013-03-29 Thread Jay Svc
Hi Aaron, Thank you for your input. I have been monitoring my GC activity and looking at my heap; it shows fairly linear activity, without any spikes. When I look at CPU, it shows higher utilization during writes alone. I also expect heavy read traffic. When I tried compaction_throughput

Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Edward Capriolo
It should be easy to control the number of map tasks. http://wiki.apache.org/hadoop/HowManyMapsAndReduces. In standard HDFS you might run into a directory with 10,000 small files and you do not want 10,000 map tasks. This is what the CombineFileInputFormat classes do: they help you control the number of map

Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Edward Capriolo
Yes, but my point is that with 50 map slots you can only be processing 50 at once. So it will take 1000/50 "waves" of mappers to complete the job. On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis wrote: > My point is that if you have over 16MB of data per node, you're going > to get thousands of map

Re: MultiInput/MultiGet CF in MapReduce

2013-03-29 Thread Alicia Leong
Hi All, I’m thinking to do in this way. 1) get_slice ( MMDDHH ) from Index Table. 2) With the returned list of ROWKEYs 3) Pass it to multiget_slice ( keys …) But my question is how to ensure ‘Data Locality’? On Tue, Mar 19, 2013 at 3:33 PM, aaron morton wrot

Re: CQL queries timing out (and had worked)

2013-03-29 Thread David McNelis
Final reason for problem: We'd had one node's config for rpc type changed from sync to hsha... So that mismatch can break rpc across the cluster, apparently. It would be nice if there was a good way to set that in a single spot for the cluster or handle the mismatch differently. Otherwise, if y

Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Jonathan Ellis
My point is that if you have over 16MB of data per node, you're going to get thousands of map tasks (that is: hundreds per node) with or without vnodes. On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo wrote: > Every map reduce task typically has a minimum Xmx of 256MB memory. See > mapred.child.

Cassandra/MapReduce ‘Data Locality’

2013-03-29 Thread Alicia Leong
Hi All, The CfSplit highlighted in red is in *d2t0053g*. But why is it being submitted to *d2t0051g*, not *d2t0053g*? Is this normal? Does this matter? In this case there is no longer ‘Data Locality’, correct? I’m using hadoop-1.1.2 & apache-cassandra-1.2.3. TokenRange (1) >> 1276058875953519237

Re: CQL queries timing out (and had worked)

2013-03-29 Thread David McNelis
Appears that restarting a node makes CQL available on that node again, but only that node. Looks like I'll be doing a rolling restart. On Fri, Mar 29, 2013 at 10:26 AM, David McNelis wrote: > I'm running 1.2.3 and have both CQL3 tables and old school style CFs in my > cluster. > > I'd had a la

CQL queries timing out (and had worked)

2013-03-29 Thread David McNelis
I'm running 1.2.3 and have both CQL3 tables and old school style CFs in my cluster. I'd had a large insert job running the last several days, which just ended; it had been inserting using CQL3 insert statements into a CQL3 table. Now, I show no compactions going on in my cluster but for some reas

Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Edward Capriolo
This is the second person on the list who has mentioned that Hadoop performance has tanked after switching to vnodes. On Fri, Mar 29, 2013 at 10:42 AM, Edward Capriolo wrote: > Every map reduce task typically has a minimum Xmx of 256MB memory. See > mapred.child.java.opts... > So if you have a 10 no

Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Edward Capriolo
Every map reduce task typically has a minimum Xmx of 256MB memory. See mapred.child.java.opts... So if you have a 10 node cluster with 256 vnodes... You will need to spawn 2,560 map tasks to complete a job. And a 10 node hadoop cluster with 5 map slots a node... You have 50 map slots. Wouldn't it
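The arithmetic in this message can be sketched as follows. This is a minimal illustration, assuming one map task per vnode as the message describes; the function name and its parameters are hypothetical and are not part of Hadoop or Cassandra.

```python
import math

def map_waves(nodes, vnodes_per_node, map_slots_per_node):
    """Return (map_tasks, map_slots, waves), assuming one map task per vnode."""
    map_tasks = nodes * vnodes_per_node
    map_slots = nodes * map_slots_per_node
    # Only map_slots tasks run at once, so the job proceeds in waves.
    waves = math.ceil(map_tasks / map_slots)
    return map_tasks, map_slots, waves

tasks, slots, waves = map_waves(nodes=10, vnodes_per_node=256, map_slots_per_node=5)
print(tasks, slots, waves)  # 2560 50 52
```

Under these figures the 2,560 tasks need 52 waves across 50 slots, since the last wave is only partly filled.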

Lost data after expanding cluster c* 1.2.3-1

2013-03-29 Thread Kais Ahmed
Hi all, I followed this tutorial to expand a 4-node C* cluster (production) by adding 3 new nodes. Datacenter: eu-west === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 10.34.142.xxx

Re: Vnodes - HUNDRED of MapReduce jobs

2013-03-29 Thread Jonathan Ellis
I still don't see the hole in the following reasoning: - Input splits are 64k by default. At this size, map processing time dominates job creation. - Therefore, if job creation time dominates, you have a toy data set (< 64K * 256 vnodes = 16 MB) Adding complexity to our inputformat to improve pe
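The threshold quoted in this message can be checked with a short sketch. It takes the figures at face value (a 64K split size and 256 vnodes per node, with the product read as 16 MB as in the message); the constant and function names are hypothetical.

```python
SPLIT_SIZE = 64 * 1024    # "64k" default split size, as stated in the message
VNODES_PER_NODE = 256     # vnode count used in the message

def toy_dataset_threshold(split_size=SPLIT_SIZE, vnodes=VNODES_PER_NODE):
    """Per-node size below which each vnode holds less than one full split."""
    return split_size * vnodes

threshold = toy_dataset_threshold()
print(threshold, threshold // (1024 * 1024))  # 16777216 16
```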