https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameReader.html#json(org.apache.spark.rdd.RDD)
You can pass an RDD to spark.read.json (spark here is the SparkSession).
Also, it works completely fine with a smaller dataset in a table, but with 1B
records it takes forever, and more importantly the network throughput is only
2.2 KB/s, which is far too low. It should be somewhere in MB/s.

On Sat, Nov 26, 2016 at 10:09 AM, Anastasios Zouzias <zouz...@gmail.com> wrote:

> Hi there,
>
> spark.read.json usually takes a filesystem path (usually HDFS) where there
> is a file containing one JSON document per line. See also
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> Hence, in your case
>
> val df4 = spark.read.json(rdd) // This line takes forever
>
> seems wrong. I guess you might want to first store the rdd as a text file
> on HDFS and then read it using spark.read.json.
>
> Cheers,
> Anastasios
>
> On Sat, Nov 26, 2016 at 9:34 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Apache Spark (or the Spark-Cassandra connector) doesn't look like it is
>> reading multiple partitions in parallel.
>> <http://stackoverflow.com/questions/40797231/apache-spark-or-spark-cassandra-connector-doesnt-look-like-it-is-reading-multipl?noredirect=1#>
>>
>> Here is my code using spark-shell:
>>
>> import org.apache.spark.sql._
>> import org.apache.spark.sql.types.StringType
>> spark.sql("""CREATE TEMPORARY VIEW hello USING
>> org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db",
>> cluster "Test Cluster", pushdown "true")""")
>> val df = spark.sql("SELECT test from hello")
>> val df2 = df.select(df("test").cast(StringType).as("test"))
>> val rdd = df2.rdd.map { case Row(j: String) => j }
>> val df4 = spark.read.json(rdd) // This line takes forever
>>
>> I have about 700 million rows, each row is about 1 KB, and the line
>>
>> val df4 = spark.read.json(rdd)
>>
>> takes forever: I get the following output after 1 hr 30 mins
>>
>> Stage 1:==========> (4866 + 2) / 25256]
>>
>> so at this rate it will probably take days.
>>
>> I measured the network throughput of the Spark worker nodes using iftop
>> and it is about 2.2 KB/s (kilobytes per second), which is too low. That
>> tells me it is not reading partitions in parallel, or at the very least
>> it is not reading a good chunk of data; otherwise it would be in MB/s.
>> Any ideas on how to fix this?
>>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>
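
For reference, a rough spark-shell sketch of the staging approach Anastasios
suggests (write the JSON strings out as text on HDFS, then read them back with
spark.read.json). The path hdfs:///tmp/hello_json is only a placeholder;
everything else mirrors the quoted snippet.

import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType

// Same Cassandra-backed view and projection as in the quoted code.
spark.sql("""CREATE TEMPORARY VIEW hello USING
  org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db",
  cluster "Test Cluster", pushdown "true")""")
val df  = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }  // RDD[String] of JSON documents

// Stage the JSON strings on HDFS as plain text, one document per line.
val stagingPath = "hdfs:///tmp/hello_json"          // placeholder path
rdd.saveAsTextFile(stagingPath)

// Read the staged files back; spark.read.json scans them and infers the schema.
val df4 = spark.read.json(stagingPath)
df4.printSchema()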