Could you have a DataFrame with a column that stores JSON (as a string)? Alternatively, you could use an array-type column that holds all of the cities matching your query.
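Something like the untested sketch below might work. It assumes your documents expose a `city` field (adjust that to your actual mapping), and it loads the index once on the driver and joins, because esDF cannot be called from inside foreach: the closure runs on the executors, where the SQLContext is null, which is where your NullPointerException comes from.

import org.apache.spark.sql.functions.{collect_list, struct, to_json}
import org.elasticsearch.spark.sql._

// esDF must run on the driver, not inside foreach on the executors.
// Load the index once, then join against the list of cities.
import sqlContext.implicits._
val cities = Seq("New York").toDF("name")
val docs = sqlContext.esDF("cities/docs")

// Option 1: one row per city, with an array column of all matches.
// `city` is an assumed field name; change it to match your index.
val asArray = docs
  .join(cities, docs("city") === cities("name"))
  .groupBy(cities("name"))
  .agg(collect_list(docs("city")).as("matching_cities"))

// Option 2: one row per match, with the document serialized
// to a JSON string column.
val asJson = docs
  .join(cities, docs("city") === cities("name"))
  .select(cities("name"), to_json(struct(docs("city"))).as("doc_json"))

If the index is too large to load and join like this, you could instead collect the city names to the driver, call esDF once per city, and union the results, but the join keeps the work distributed.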
On Fri, Dec 28, 2018 at 2:48 AM <em...@yeikel.com> wrote:
> Hi community,
>
> As shown in other answers online, Spark does not support nesting one DataFrame inside another, but what are the options?
>
> I have the following scenario:
>
> dataFrame1 = a list of cities
>
> dataFrame2 = created by searching Elasticsearch for each city in dataFrame1
>
> I've tried:
>
> val cities = sc.parallelize(Seq("New York")).toDF()
>
> cities.foreach(r => {
>   val cityName = r.getString(0)
>   println(cityName)
>   // returns a DataFrame of all the cities matching the entry in `cities`
>   val dfs = sqlContext.esDF("cities/docs", "?q=" + cityName)
> })
>
> Which triggers the expected NullPointerException:
>
> java.lang.NullPointerException
>     at org.elasticsearch.spark.sql.EsSparkSQL$.esDF(EsSparkSQL.scala:53)
>     at org.elasticsearch.spark.sql.EsSparkSQL$.esDF(EsSparkSQL.scala:51)
>     at org.elasticsearch.spark.sql.package$SQLContextFunctions.esDF(package.scala:37)
>     at Main$$anonfun$main$1.apply(Main.scala:43)
>     at Main$$anonfun$main$1.apply(Main.scala:39)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>     at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
>     at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
> 2018-12-28 02:01:00 ERROR TaskSetManager:70 - Task 7 in stage 0.0 failed 1 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): java.lang.NullPointerException
>
> What options do I have?
>
> Thank you.