Hi Marc,

If your rows are small, you can use the WholeRowIterator to get all of a row's values, with their keys, in one consuming function. If your rows are big but you know up front that you'll only need a small part of each row, you could put a filter in front of the WholeRowIterator.
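To make that concrete, here's roughly what I mean. This is an untested sketch: the priority 25 is arbitrary (pick one above your other scan-time iterators), and job/sparkContext are the ones from your snippet below.

import java.io.IOException;
import java.util.Map.Entry;
import java.util.SortedMap;

import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

// Stack the WholeRowIterator on the scan so each row comes back as a
// single encoded Key/Value pair instead of one pair per column.
AccumuloInputFormat.addIterator(job, new IteratorSetting(25, WholeRowIterator.class));

JavaPairRDD<Key, Value> rows = sparkContext.newAPIHadoopRDD(
    job.getConfiguration(), AccumuloInputFormat.class, Key.class, Value.class);

rows.foreach(new VoidFunction<Tuple2<Key, Value>>() {
  @Override
  public void call(Tuple2<Key, Value> pair) throws IOException {
    // decodeRow reverses the iterator's encoding, handing back every
    // column of the row keyed by its original Key.
    SortedMap<Key, Value> wholeRow = WholeRowIterator.decodeRow(pair._1(), pair._2());
    for (Entry<Key, Value> entry : wholeRow.entrySet()) {
      // consume all of the row's values here
    }
  }
});

One gotcha: Spark serializes that anonymous function, so in real code I'd make it a static nested class rather than let it capture the enclosing object.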
I expect there's a performance hit (I haven't done any benchmarks myself) because of the extra serialization/deserialization, but it's a very convenient way of working with rows in Spark.

Regards,
-Russ

On Mon, May 4, 2015 at 8:46 AM, Marc Reichman <[email protected]> wrote:

> Has anyone done any testing with Spark and AccumuloRowInputFormat? I have
> no problem doing this for AccumuloInputFormat:
>
> JavaPairRDD<Key, Value> pairRDD =
>     sparkContext.newAPIHadoopRDD(job.getConfiguration(),
>         AccumuloInputFormat.class,
>         Key.class, Value.class);
>
> But I run into a snag trying to do a similar thing:
>
> JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
>     sparkContext.newAPIHadoopRDD(job.getConfiguration(),
>         AccumuloRowInputFormat.class,
>         Text.class, PeekingIterator.class);
>
> The compilation error is (big, sorry):
>
> Error:(141, 97) java: method newAPIHadoopRDD in class org.apache.spark.api.java.JavaSparkContext cannot be applied to given types;
>   required: org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
>   found: org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
>   reason: inferred type does not conform to declared bound(s)
>     inferred: org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
>     bound(s): org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>
>
> I've tried a few things; the signature of the method is:
>
> public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
>     JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass,
>         Class<K> kClass, Class<V> vClass)
>
> I guess it's having trouble with the format extending InputFormatBase with
> its own additional generic parameters (the Map.Entry inside
> PeekingIterator).
>
> This may be an issue to chase with Spark vs. Accumulo, unless something can
> be tweaked on the Accumulo side, or I could wrap the InputFormat with my own
> somehow.
>
> Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.
>
> Stopping short of this, can anyone think of a good way to use
> AccumuloInputFormat to get what I'm getting from the Row version in a
> performant way? It doesn't necessarily have to be an iterator approach, but
> I'd need all my values with the key in one consuming function. I'm looking
> into ways to do it in Spark functions but trying to avoid any major
> performance hits.
>
> Thanks,
>
> Marc
>
> p.s. The summit was absolutely great, thank you all for having it!
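One more thought on the compile error above: I believe the bound fails because PeekingIterator.class is a raw Class<PeekingIterator>, so V gets inferred without its Map.Entry type argument and no longer matches what AccumuloRowInputFormat declares. The usual (ugly) workaround in Java is an unchecked double cast on the value class. Untested, but something like:

// Unchecked double cast to recover the parameterized Class object that
// the compiler can't express from a raw class literal.
@SuppressWarnings("unchecked")
Class<PeekingIterator<Map.Entry<Key, Value>>> vClass =
    (Class<PeekingIterator<Map.Entry<Key, Value>>>) (Class<?>) PeekingIterator.class;

JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
    sparkContext.newAPIHadoopRDD(job.getConfiguration(),
        AccumuloRowInputFormat.class, Text.class, vClass);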
