Create multiple rows from elements in array on a single row

2015-06-08 Thread Bill Q
Hi, I have an RDD with the following structure:
row1: key: Seq[a, b]; value: value1
row2: key: Seq[a, c, f]; value: value2
Is there an efficient way to "de-flat" the rows into?
row1: key: a; value: value1
row2: key: a; value: value2
row3: key: b; value: value1
row4: key: c; value: value2
row5: ke
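The de-flattening asked for here is a flatMap over the key Seq. A minimal sketch on plain Scala collections, where flatMap has the same semantics it has on a Spark RDD (on a real pair RDD the identical closure would be passed to rdd.flatMap; the names are placeholders):

```scala
object DeFlatten extends App {
  // Each input row pairs a Seq of keys with one value.
  val rows = Seq(
    (Seq("a", "b"), "value1"),
    (Seq("a", "c", "f"), "value2")
  )

  // Emit one (key, value) row per element of the key Seq.
  val deflattened = rows.flatMap { case (keys, value) =>
    keys.map(k => (k, value))
  }

  deflattened.foreach(println)
}
```

On an RDD this is a narrow transformation (no shuffle), so it is about as efficient as the de-flattening can get.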

Re: How to use spark to access HBase with Security enabled

2015-05-21 Thread Bill Q
val c = it.next()
val family = Bytes.toString(CellUtil.cloneFamily(c))
val qualifier = Bytes.toString(CellUtil.cloneQualifier(c))
val value = Bytes.toString(CellUtil.cloneValue(c))
val tm = c.getTimestamp
printl

Re: How to use spark to access HBase with Security enabled

2015-05-20 Thread Bill Q
I have a similar problem: I can no longer pass the HBase configuration directory as an extra classpath to Spark using spark.executor.extraClassPath=MY_HBASE_CONF_DIR in Spark 1.3. We used to run this in 1.2 without any problem. On Tuesday, May 19, 2015, donhoff_h <165612...@qq.com> wrote: > > Sor

Spark 1.3 classPath problem

2015-05-19 Thread Bill Q
Hi, We have some Spark jobs that ran well under Spark 1.2 using spark-submit --conf "spark.executor.extraClassPath=/etc/hbase/conf", and the Java HBase driver code that Spark calls could pick up the HBase settings, such as the ZooKeeper addresses. But after upgrading to CDH 5.4.1 Spark 1.3, the Spark cod
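The Spark 1.2-era submission the post describes can be sketched as follows; only the extraClassPath conf comes from the post itself, while the jar name and main class are placeholders:

```shell
# Put the HBase conf dir on the executor classpath at submit time.
spark-submit \
  --conf "spark.executor.extraClassPath=/etc/hbase/conf" \
  --class com.example.MyEtlJob \
  my-etl-job.jar
```

For the driver side, a commonly suggested companion on 1.x was passing the same directory via --driver-class-path; whether that resolves this particular 1.3 regression is not established in the thread.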

Re: Map one RDD into two RDD

2015-05-07 Thread Bill Q
ipelines / DAGs within the Spark Framework
> RDD1 = RDD.filter()
> RDD2 = RDD.filter()
> From: Bill Q [mailto:bill.q@gmail.com]
> Sent: Thursday, May 7, 2015 4:55 PM
> To: Evo Eftimov
> Cc: user@spark.apache.org
> Subject:

CompositeInputFormat implementation in Spark

2015-05-07 Thread Bill Q
Hi, We are trying to join two sets of data. One of them is smaller and pretty stable; the other data set is volatile and much larger. But neither can be loaded in memory. So our idea is to pre-sort the smaller data set and cache it in multiple partitions. And we use the same logic to sort the larg
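The CompositeInputFormat-style map-side join described here maps onto Spark's co-partitioning: when both pair RDDs share the same partitioner, join does not re-shuffle the cached side. A sketch under that assumption (data and partition count are placeholders):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("copartitioned-join").setMaster("local[*]"))
  val partitioner = new HashPartitioner(8)

  // Small, stable side: partition once and cache the partitioned form.
  val stable = sc.parallelize(Seq(("k1", "s1"), ("k2", "s2")))
    .partitionBy(partitioner)
    .cache()

  // Large, volatile side: use the SAME partitioner so the join
  // can proceed without shuffling the cached side again.
  val volatile = sc.parallelize(Seq(("k1", "v1"), ("k3", "v3")))
    .partitionBy(partitioner)

  volatile.join(stable).collect().foreach(println)
  sc.stop()
}
```

The join keeps only keys present on both sides, so here it yields the single pair for "k1".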

Re: Map one RDD into two RDD

2015-05-07 Thread Bill Q
Thanks for the replies. We decided to use concurrency in Scala to do the two mappings using the same source RDD in parallel. So far, it seems to be working. Any comments? On Wednesday, May 6, 2015, Evo Eftimov wrote:
> RDD1 = RDD.filter()
> RDD2 = RDD.filter()
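The concurrency approach mentioned in the reply can be sketched with Scala Futures: each derived job is launched from its own Future, so Spark's scheduler may run both action graphs concurrently against the shared cached source. The data and predicates are placeholder assumptions:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

object ParallelMappings extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("parallel-mappings").setMaster("local[*]"))

  // Shared source, cached so it is computed only once.
  val source = sc.parallelize(1 to 100).cache()

  // Each action runs in its own Future, so the two Spark jobs
  // can be scheduled at the same time.
  val fEven = Future(source.filter(_ % 2 == 0).count())
  val fOdd  = Future(source.filter(_ % 2 != 0).count())

  val evens = Await.result(fEven, Duration.Inf)
  val odds  = Await.result(fOdd, Duration.Inf)
  println(s"$evens even, $odds odd")
  sc.stop()
}
```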

Map one RDD into two RDD

2015-05-05 Thread Bill Q
Hi all, I have a large RDD that I map a function over. Based on the nature of each record in the input RDD, the function generates two types of data. I would like to save each type into its own RDD, but I can't seem to find an efficient way to do it. Any suggestions? Many thanks. Bill -- Many thank
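The pattern suggested in the replies (derive each output with its own filter over a cached source, so the source is computed only once) can be sketched like this; the record type and predicate are placeholder assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SplitRdd extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("split-rdd").setMaster("local[*]"))

  // Placeholder input: ints standing in for records of two kinds.
  // Cache so both filters reuse one computation of the source.
  val source = sc.parallelize(1 to 10).cache()

  val typeA = source.filter(_ % 2 == 0) // records of the first kind
  val typeB = source.filter(_ % 2 != 0) // records of the second kind

  println(typeA.count() + " / " + typeB.count())
  sc.stop()
}
```

Without the cache(), each filter would recompute the full source lineage, which is the inefficiency the original question is worried about.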

Re: Too many files open with Spark 1.1 and CDH 5.1

2014-10-31 Thread Bill Q
void it. The new sort-based shuffle might help in this regard.
> On Fri, Oct 31, 2014 at 3:25 PM, Bill Q wrote:
> Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL
> processes that are currently done by Hive 0.12.

Too many files open with Spark 1.1 and CDH 5.1

2014-10-31 Thread Bill Q
Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem that I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell by using "ulimit -n 4096 & spar
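The two mitigations the thread converges on, sketched as shell; the flag names are from the Spark 1.x configuration and whether they suit this particular cluster is an assumption:

```shell
# Raise the per-process open-file limit in the shell that launches Spark.
ulimit -n 4096

# Then either switch to the sort-based shuffle (Spark 1.1+),
# which opens far fewer files per reduce task...
spark-shell --conf spark.shuffle.manager=sort

# ...or stay on the hash shuffle but consolidate its output files:
# spark-shell --conf spark.shuffle.consolidateFiles=true
```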