Distributed R-Trees are not very common. Most "big data" spatial solutions
collapse multi-dimensional data into a distributed one-dimensional index
using a space-filling curve. Many implementations exist outside of Spark,
e.g. for HBase or Accumulo. It's simple enough to write a map function that
takes a multi-dimensional point and produces its one-dimensional position
on the curve.
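A minimal sketch of such a map function for two dimensions, using a Z-order
(Morton) curve; this code is illustrative and not from the thread:

// Z-order (Morton) encoding: interleave the bits of two 32-bit
// coordinates into one 64-bit key, so nearby (x, y) points tend to
// land on nearby keys in a sorted store like HBase or Accumulo.
public final class ZOrder {
    // Spread the bits of x so they occupy the even bit positions.
    private static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    // Interleave x (even bits) and y (odd bits) into one key.
    public static long encode(int x, int y) {
        return (spread(y) << 1) | spread(x);
    }
}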
Hi, David,
This is the code that I use to create a JavaPairRDD from an Accumulo table:
JavaSparkContext sc = new JavaSparkContext(conf);
Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
    ClientConfiguration.loadDefault());  // truncated in the original; ClientConfiguration here is an assumption
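The snippet above is truncated; the remaining setup would look something
like the following sketch. The principal, token, and table name below are
placeholders, not from the original message:

AccumuloInputFormat.setConnectorInfo(hadoopJob, "user",
    new PasswordToken("secret"));                             // placeholder credentials
AccumuloInputFormat.setInputTableName(hadoopJob, "mytable");  // placeholder table
AccumuloInputFormat.setScanAuthorizations(hadoopJob, new Authorizations());

// newAPIHadoopRDD then yields a JavaPairRDD of Accumulo's Key and Value:
JavaPairRDD<Key, Value> rdd = sc.newAPIHadoopRDD(hadoopJob.getConfiguration(),
    AccumuloInputFormat.class, Key.class, Value.class);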
Hi, Tao,
When I used newAPIHadoopRDD (Accumulo not HBase) I found that I had to
specify executor-memory and num-executors explicitly on the command line or
else I didn't get any parallelism across the cluster.
I used --executor-memory 3G --num-executors 24, but obviously other
parameters will be better suited to other clusters.
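For reference, those flags go on the spark-submit command line; a minimal
sketch, where the class and jar names are placeholders:

spark-submit --class com.example.AccumuloSparkJob \
  --executor-memory 3G \
  --num-executors 24 \
  accumulo-spark-job.jar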
No, they do not implement Serializable. There are a couple of places where
I've had to do a Text->String conversion but generally it hasn't been a
problem.
-Russ
On Wed, Sep 24, 2014 at 10:27 AM, Steve Lewis wrote:
> Do your custom Writable classes implement Serializable - I think that is
> the
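A hedged sketch of the kind of conversion Russ describes, assuming rdd is
the JavaPairRDD<Key, Value> produced by newAPIHadoopRDD as in the other
snippets in this thread; the Writables are converted to plain Strings
before anything is shuffled or collected:

// Requires: import scala.Tuple2;
JavaPairRDD<String, String> strings = rdd.mapToPair(entry ->
    new Tuple2<>(entry._1().getRow().toString(),  // Text -> String
                 new String(entry._2().get())));  // Value bytes -> String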
I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using
Accumulo's Key and Value classes, both of which extend Writable. Works like
a charm. I use the same InputFormat for all my MR jobs.
-Russ
On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis wrote:
> I tried newAPIHadoopFile an
query time down to 30s from 18 minutes, and I'm seeing much better
utilization of my Accumulo tablet servers.
-Russ
On Tue, Sep 9, 2014 at 5:13 PM, Russ Weeks wrote:
> Hi,
>
> I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
> Not sure if I should be asking on the Spark list or the Accumulo list,
It's very straightforward to set up a Hadoop RDD to use
AccumuloInputFormat. Something like this will do the trick:
private JavaPairRDD<Key, Value> newAccumuloRDD(JavaSparkContext sc,
    AgileConf agileConf, String appName, Authorizations auths)
    throws IOException, AccumuloSecurityException {
  Job hadoopJob = Job.getInstance(agileConf, appName);
  // connector info, table and auths configured here (truncated in the original)
  return sc.newAPIHadoopRDD(hadoopJob.getConfiguration(),
      AccumuloInputFormat.class, Key.class, Value.class);
}
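Calling it is then a one-liner; the app name and variables here are
placeholders:

JavaPairRDD<Key, Value> rdd =
    newAccumuloRDD(sc, agileConf, "example-app", auths);
System.out.println("scanned " + rdd.count() + " entries");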
Hi,
I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
Not sure if I should be asking on the Spark list or the Accumulo list, but
I'll try here. The problem is that the workload to process SQL queries
doesn't seem to be distributed across my cluster very well.
My Spark SQL
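For what it's worth, here is a hedged sketch of how the Accumulo pair RDD
can be exposed to Spark SQL, using the Spark 1.x-era JavaSQLContext API
that was current when this thread was written; the bean class and all
names are illustrative, not from the original message:

// Spark SQL infers a schema from a Serializable JavaBean.
public static class CellBean implements Serializable {
    private String row;
    private String value;
    public String getRow() { return row; }
    public void setRow(String row) { this.row = row; }
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}

JavaSQLContext sqlContext = new JavaSQLContext(sc);
// Convert the non-Serializable Key/Value pairs into beans first.
JavaRDD<CellBean> cells = rdd.map(entry -> {
    CellBean b = new CellBean();
    b.setRow(entry._1().getRow().toString());
    b.setValue(new String(entry._2().get()));
    return b;
});
sqlContext.applySchema(cells, CellBean.class).registerTempTable("cells");
JavaSchemaRDD results = sqlContext.sql("SELECT row FROM cells WHERE value = 'x'");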