thanx; I have already done that and it works... I am trying something that will work faster for many files in a directory. I just want to list the file directory and read and parse the RFiles directly (much like the rfile print utility class does with the RFile reader; however, I need to decouple it for external use).
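
For anyone finding this thread in the archives, here is a minimal Scala sketch of that direct-read idea, assuming the Accumulo 2.x public RFile API (org.apache.accumulo.core.client.rfile.RFile) and one Spark task per file. The directory path, app name, and partitioning choice are placeholders, so treat this as a sketch under those assumptions rather than a vetted recipe:

import org.apache.accumulo.core.client.rfile.RFile
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object DirectRFileRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rfile-read").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical directory of rfiles; adjust to your cluster layout.
    val dir = "hdfs:///accumulo/tables/1/default_tablet"

    // List the *.rf files on the driver.
    val conf = new Configuration()
    val fs = FileSystem.get(new java.net.URI(dir), conf)
    val files = fs.listStatus(new Path(dir))
      .map(_.getPath.toString)
      .filter(_.endsWith(".rf"))

    // One task per file; each task opens the rfile with the public
    // RFile reader and emits (row, value) pairs.
    val rdd = sc.parallelize(files.toSeq, math.max(1, files.length)).flatMap { file =>
      val taskFs = FileSystem.get(new java.net.URI(file), new Configuration())
      val scanner = RFile.newScanner()
        .from(file)
        .withFileSystem(taskFs)
        .withAuthorizations(Authorizations.EMPTY)
        .build()
      try {
        scanner.iterator().asScala.map { e =>
          (e.getKey.getRow.toString, e.getValue.toString)
        }.toList // materialize before closing the scanner
      } finally {
        scanner.close()
      }
    }

    rdd.take(10).foreach(println)
    spark.stop()
  }
}

Note that Key and Value are converted to plain strings inside the task and materialized before the scanner is closed, since the raw entries are not usable once the scanner goes away (and are not friendly to Spark serialization).
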
On Mon, Aug 3, 2020 at 5:38 PM Jim Hughes <[email protected]> wrote:
> Good question. As a very general note, one can leverage Hadoop
> InputFormats to create Spark RDDs.
>
> As a rather non-trivial example, you could check out GeoMesa's
> implementation of mapping Accumulo entries to geospatial data types.
>
> The basic strategy is to make a Hadoop Configuration object representing
> what to scan in Accumulo and call SparkContext.newAPIHadoopRDD to get an
> RDD.
>
> If you want a DataFrame/DataSet, you'll need to implement the Spark
> DataSource API.
>
> Hope that helps!
>
> Cheers,
>
> Jim
>
> 1. Current implementation; decently refactored.
> https://github.com/locationtech/geomesa/blob/main/geomesa-accumulo/geomesa-accumulo-spark/src/main/scala/org/locationtech/geomesa/spark/accumulo/AccumuloSpatialRDDProvider.scala#L52-L82
>
> 2. Older implementation; less refactoring, may be more clear.
> https://github.com/locationtech/geomesa/blob/geomesa_2.11-1.3.0/geomesa-accumulo/geomesa-accumulo-spark/src/main/scala/org/locationtech/geomesa/spark/accumulo/AccumuloSpatialRDDProvider.scala#L51-L100
>
> p.s. Alternatively, if you just want to get a little data out of
> Accumulo, you could just query for it on the master, and fan the data
> out on the cluster. *shrugs*
>
> On 8/3/20 4:46 PM, Bulldog20630405 wrote:
> >
> > We would like to read RFiles directly outside an active Accumulo
> > instance using Spark. Is there an example to do this?
> >
> > note: I know there is a utility to print RFiles and I could start
> > there and build my own, but was hoping to leverage something already
> > there.
> >
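
To make the "configure a Hadoop Configuration and call SparkContext.newAPIHadoopRDD" strategy above concrete, here is a short sketch assuming the Accumulo 2.x mapreduce module (org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat); the instance name, zookeepers, credentials, and table name are all placeholder values:

import java.util.Properties
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

object AccumuloRddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("accumulo-rdd").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical client properties; normally these come from
    // an accumulo-client.properties file.
    val props = new Properties()
    props.setProperty("instance.name", "myinstance")
    props.setProperty("instance.zookeepers", "zk1:2181")
    props.setProperty("auth.type", "password")
    props.setProperty("auth.principal", "user")
    props.setProperty("auth.token", "secret")

    // Build a Hadoop Configuration describing the scan, then hand it to Spark.
    val job = Job.getInstance()
    AccumuloInputFormat.configure()
      .clientProperties(props)
      .table("mytable")
      .store(job)

    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[AccumuloInputFormat],
      classOf[Key],
      classOf[Value])

    println(rdd.count())
    spark.stop()
  }
}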
