Hi,
I have an RDD with the following structure:
row1: key: Seq[a, b]; value: value 1
row2: key: Seq[a, c, f]; value: value 2
Is there an efficient way to "flatten" the rows into the following?
row1: key: a; value: value 1
row2: key: a; value: value 2
row3: key: b; value: value 1
row4: key: c; value: value 2
row5: key: f; value: value 2
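A flatMap does this in one pass. A minimal sketch, assuming the RDD is an
RDD[(Seq[String], String)] (the variable names are illustrative):

// explode each element of the key sequence into its own (key, value) row
val exploded = rdd.flatMap { case (keys, value) =>
  keys.map(k => (k, value))
}

Each output row pairs one element of the original key sequence with that
row's value, which matches the expected rows above.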
>> val c = it.next()
>> val family = Bytes.toString(CellUtil.cloneFamily(c))
>> val qualifier = Bytes.toString(CellUtil.cloneQualifier(c))
>> val value = Bytes.toString(CellUtil.cloneValue(c))
>> val tm = c.getTimestamp
>> println(...)
I have a similar problem: I can no longer pass the HBase configuration
directory as an extra classpath to Spark using
spark.executor.extraClassPath=MY_HBASE_CONF_DIR in Spark 1.3. We used
to run this in 1.2 without any problem.
On Tuesday, May 19, 2015, donhoff_h <165612...@qq.com> wrote:
>
> Sor...
Hi,
We have a Spark job that ran well under Spark 1.2 using spark-submit
--conf "spark.executor.extraClassPath=/etc/hbase/conf", and the Java HBase
driver code that Spark calls could pick up the HBase settings, such as the
ZooKeeper addresses.
But after upgrading to CDH 5.4.1 (Spark 1.3), the Spark code...
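For reference, the shape of the command that worked under 1.2 (a sketch; the
class and jar names are placeholders, and adding the conf dir to the driver
classpath as well is an assumption worth testing on 1.3):

spark-submit \
  --conf "spark.executor.extraClassPath=/etc/hbase/conf" \
  --driver-class-path /etc/hbase/conf \
  --class com.example.MyHBaseJob \
  my-hbase-job.jar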
pipelines / DAGs within the Spark Framework
>
> RDD1 = RDD.filter()
>
> RDD2 = RDD.filter()
>
>
> *From:* Bill Q [mailto:bill.q@gmail.com]
> *Sent:* Thursday, May 7, 2015 4:55 PM
> *To:* Evo Eftimov
> *Cc:* user@spark.apache.org
>
> *Subject:*
Hi,
We are trying to join two sets of data. One of them are smaller and pretty
stable. The other data set is volatile and much larger. But neither can be
loaded in memory.
So our idea is to pre-sort the smaller data set, cache them in multiple
partitions. Any we use the same logic to sort the larg
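A sketch of the co-partitioning idea, using hash partitioning rather than a
full sort (both sides are assumed to be key-value RDDs; the partition count
and variable names are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val part = new HashPartitioner(200) // illustrative partition count

// partition the small, stable side once and keep it around
val small = smallRdd.partitionBy(part).persist(StorageLevel.MEMORY_AND_DISK)

// the large side uses the same partitioner, so join() can match partitions
// directly instead of re-shuffling the cached side on every run
val joined = largeRdd.partitionBy(part).join(small)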
Thanks for the replies. We decided to use concurrency in Scala to run the
two mappings over the same source RDD in parallel. So far, it seems to be
working. Any comments?
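Roughly what this looks like (a minimal sketch; the predicates and output
paths are placeholders, and the source RDD is cached so both jobs reuse it):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val source = inputRdd.cache() // computed once, shared by both jobs

// kick off the two Spark jobs from separate threads
val jobA = Future { source.filter(isTypeA).saveAsTextFile("hdfs:///out/typeA") }
val jobB = Future { source.filter(isTypeB).saveAsTextFile("hdfs:///out/typeB") }

Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf)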
On Wednesday, May 6, 2015, Evo Eftimov wrote:
> RDD1 = RDD.filter()
>
> RDD2 = RDD.filter()
>
Hi all,
I have a large RDD that I map a function over. Based on the nature of each
record in the input RDD, the function generates two types of data, and I
would like to save each type into its own RDD. But I can't seem to find an
efficient way to do it. Any suggestions?
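For concreteness, the obvious two-pass version looks like this (a minimal
sketch; classify is a hypothetical function that tags each record with its
type):

// tag each record with its type, then split by filtering the cached result
val tagged = inputRdd.map(r => (classify(r), r)).cache() // classify is hypothetical

val typeA = tagged.filter { case (t, _) => t == "A" }.values
val typeB = tagged.filter { case (t, _) => t == "B" }.values

Caching means the map runs only once, but each filter is still a separate
pass over the tagged data.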
Many thanks.
Bill
--
Many thanks.
...avoid it. The new sort-based shuffle might help in this regard.

On Fri, Oct 31, 2014 at 3:25 PM, Bill Q wrote:
> Hi,
> I am trying to make Spark SQL 1.1 work to replace part of our ETL
> processes that are currently done by Hive 0.12.
Hi,
I am trying to make Spark SQL 1.1 work to replace part of our ETL
processes that are currently done by Hive 0.12.
A common problem that I have encountered is the "Too many open files"
error. Once that happens, the query just fails. I started the
spark-shell by using "ulimit -n 4096 && spark-shell ...
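For what it's worth, the mitigations usually suggested for this in the
Spark 1.1 era look like the following (a sketch; the limit value is
illustrative, and the settings should be checked against your exact
version):

# raise the per-process open-file limit before launching the shell
ulimit -n 65536

# sort-based shuffle writes far fewer intermediate files than hash shuffle
spark-shell --conf spark.shuffle.manager=sort

# or keep hash shuffle but consolidate its map output files
spark-shell --conf spark.shuffle.consolidateFiles=true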