[Pyspark 2.4] Large number of row groups in parquet files created using spark

2019-07-24 Thread Rishi Shah
Hi All,

I have the following code, which produces one 600 MB parquet file as expected;
however, within this parquet file there are 42 row groups! I would expect it
to create at most 6 row groups. Could someone please shed some light on this?
Is there any config setting I can enable when submitting the application with
spark-submit?

df = spark.read.parquet(INPUT_PATH)
df.coalesce(1).write.parquet(OUT_PATH)

I did try --conf spark.parquet.block.size and spark.dfs.blocksize, but that
made no difference.
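
One hedged avenue to explore (an assumption on my part, not verified against
this exact workload): the Parquet writer takes its target row-group size from
the Hadoop property parquet.block.size, which is not a spark.* setting, so it
has to reach the Hadoop configuration, e.g. via
--conf spark.hadoop.parquet.block.size=268435456 on spark-submit. Below is a
minimal Scala sketch of the same thing done programmatically (Scala shown for
brevity; the spark-submit flag works equally for the PySpark job above), using
an assumed 256 MB target and the placeholder paths from the code above. Note
the writer may still flush smaller row groups, e.g. under memory pressure, so
this is a sketch rather than a guaranteed fix.

import org.apache.spark.sql.SparkSession

// Sketch: route the Parquet row-group size (bytes) through the Hadoop
// configuration; spark.hadoop.* properties passed to spark-submit end up in
// the same place.
val spark = SparkSession.builder().appName("parquet-row-groups").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .setInt("parquet.block.size", 256 * 1024 * 1024)

val df = spark.read.parquet("INPUT_PATH")   // placeholder path
df.coalesce(1).write.parquet("OUT_PATH")    // placeholder path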

-- 
Regards,

Rishi Shah


[Spark SQL] dependencies to use test helpers

2019-07-24 Thread James Pirz
I have a Scala application in which I have added some extra rules to
Catalyst. While adding some unit tests, I am trying to use some existing
functions from Catalyst's test code, specifically comparePlans() and
normalizePlan() under PlanTestBase [1].

I am just wondering which additional dependencies I need to add to my
project to access them. Currently I have the dependencies below, but they
do not cover the above APIs.

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.4.3"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.3"
libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.4.3"


[1] sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala

Thanks,
James


How to get Peak CPU Utilization Rate in Spark

2019-07-24 Thread Praups Kumar
Hi Spark dev

MapReduce can use ResourceCalculatorProcessTree in Task.java to get the
peak CPU utilization.

The same is done at the YARN NodeManager level in the class
ContainersMonitorImpl.

However, I am not able to find any way to get the peak CPU utilization of an
executor in Spark.

Please help me with the same.
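
For what it is worth, Spark 2.x does not expose a per-executor peak-CPU metric
the way the YARN NodeManager does. A hedged workaround, sketched below rather
than a built-in API (the listener class name is mine): TaskMetrics exposes
executorCpuTime in nanoseconds per task, which a SparkListener can sum per
executor; dividing by wall-clock time and core count then gives an average
utilization, which is still not a true peak.

import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: accumulate per-task CPU time (ns) by executor. Listener callbacks
// arrive on the single listener-bus thread, so a plain mutable map suffices.
class ExecutorCpuTimeListener extends SparkListener {
  private val cpuNanosByExecutor =
    mutable.Map.empty[String, Long].withDefaultValue(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null && taskEnd.taskInfo != null) {
      cpuNanosByExecutor(taskEnd.taskInfo.executorId) += metrics.executorCpuTime
    }
  }

  def report(): Unit =
    cpuNanosByExecutor.foreach { case (execId, nanos) =>
      println(f"executor $execId: ${nanos / 1e9}%.1f s of CPU time")
    }
}

// Usage, e.g. right after building the SparkSession:
//   val cpuListener = new ExecutorCpuTimeListener
//   spark.sparkContext.addSparkListener(cpuListener)
//   ... run jobs ...
//   cpuListener.report()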


Re: Spark 2.3 DataFrame GroupBy operation throws IllegalArgumentException on large dataset

2019-07-24 Thread Chris Teoh
This might be a hint. Maybe invalid data?

Caused by: java.lang.IllegalArgumentException: Missing required char
':' at 'struct>'
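
As a quick, hypothetical sanity check (tiny made-up data, not the original
dataset): collect_set over array("RightData") produces a nested
array<array<int>> column, and printing the schema of the aggregated frame
shows the exact type the ORC writer later has to describe:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, collect_set}

// Hypothetical, minimal reproduction of the aggregation quoted below, only to
// inspect the type it produces before any ORC writing is involved.
val spark = SparkSession.builder().appName("schema-check").getOrCreate()
import spark.implicits._

val dframe = Seq((1, 10), (1, 11), (2, 20)).toDF("LeftData", "RightData")
val wind_2 = dframe.groupBy("LeftData").agg(collect_set(array("RightData")))
wind_2.printSchema()   // the aggregated column should come out as array<array<int>>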


On Wed., 24 Jul. 2019, 2:15 pm Balakumar iyer S wrote:

> Hi Bobby Evans,
>
> I apologise for the delayed response. Yes, you are right, I missed pasting
> the complete stack trace exception. Herewith I have attached the complete
> YARN log for the same.
>
> Thank you. It would be helpful if you could assist me with this error.
>
>
> -
> Regards
> Balakumar Seetharaman
>
>
> On Mon, Jul 22, 2019 at 7:05 PM Bobby Evans wrote:
>
>> You are missing a lot of the stack trace that could explain the
>> exception. All it shows is that an exception happened while writing out
>> the ORC file, not what the underlying exception is; there should be at
>> least one more "Caused by" under the one you included.
>>
>> Thanks,
>>
>> Bobby
>>
>> On Mon, Jul 22, 2019 at 5:58 AM Balakumar iyer S wrote:
>>
>>> Hi ,
>>>
>>> I am trying to perform a group by followed by an aggregate collect_set
>>> operation on a two-column data set with schema (LeftData int, RightData
>>> int).
>>>
>>> code snippet
>>>
>>>   val wind_2 = dframe.groupBy("LeftData").agg(collect_set(array("RightData")))
>>>
>>>  wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
>>>
>>> The above code works fine on a smaller dataset but throws the following
>>> error on a large dataset (where each key in the LeftData column needs to
>>> be grouped with approximately 64k values).
>>>
>>> Could someone assist me with this? Should I set any configuration to
>>> accommodate such large values?
>>>
>>> ERROR
>>> -
>>> Driver stacktrace:
>>> at org.apache.spark.scheduler.DAGScheduler.org
>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>>> at scala.Option.foreach(Option.scala:257)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>>> at
>>> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>>>
>>>
>>> Caused by: org.apache.spark.SparkException: Task failed while writing
>>> rows.
>>> at
>>> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>>>
>>> --
>>> REGARDS
>>> BALAKUMAR SEETHARAMAN
>>>
>>>
>
> --
> REGARDS
> BALAKUMAR SEETHARAMAN
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org