Hello Dr Mich Talebzadeh,

> Can you kindly advise on your number of nodes, the cores for each node and the RAM for each node.
I have a 32 node cluster (1 executor per node currently). All of these nodes
have 512 GB of memory, and most have either 16 or 20 physical cores (without
HT enabled). HDFS is configured to run on a different set of nodes, but they
are all part of the same rack / subnet.
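
For illustration, that layout corresponds roughly to a session configured
along the lines below (the app name and values are placeholders, not the
exact settings in use):

// Rough sketch only: one executor per node as described above; executor
// memory is kept below the 512 GB per node to leave headroom for the OS
// and overhead.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-read-parallelism")
  .config("spark.executor.instances", "32") // one executor per node
  .config("spark.executor.cores", "16")     // 16-20 physical cores per node
  .config("spark.executor.memory", "450g")  // illustrative value
  .getOrCreate()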
> Is this a parquet file?
Yes, it is a parquet directory.

> What I don't understand is why you end up with 200 files whereas your partition says 25
I do have some kind of hack ;) in place that roughly sizes each file to the
block size of my HDFS, so that the number of parts created is optimized for
HDFS storage. But I wanted to understand why it allocates a smaller number of
cores during a read cycle. My current workaround for this problem is to run
multiple parallel queries of this kind :( (basically scala Future - fork-join
magic; a rough sketch is at the bottom of this mail). But this seems
incorrect. I also have some parquet directories that end up with as few as 9
partitions, even though they contain 200 part files.

Here is a sample from the Spark 2.0.0 shell that I tried...

case class Customer(number: Int)
import org.apache.spark.sql._
import spark.implicits._
val parquetFile = "hdfs://myip:port/tmp/dummy.parquet"
spark.createDataset(1 to 10000).map(Customer).repartition(200).write.mode(SaveMode.Overwrite).parquet(parquetFile)

scala> spark.read.parquet(parquetFile).toJavaRDD.partitions.size()
res1: Int = 23

scala> spark.read.parquet(parquetFile).toJavaRDD.partitions.size()
res2: Int = 20

Can I suspect something with dynamic allocation perhaps?

Please advise,
Muthu

On Sat, Aug 6, 2016 at 3:23 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> 720 cores. Wow, that is a hell of a lot of cores Muthu :)
>
> Ok let us take a step back.
>
> Can you kindly advise on your number of nodes, the cores for each node and
> the RAM for each node.
>
> What I don't understand is why you end up with 200 files whereas your
> partition says 25. Now you have 2.2GB of size, so each file only has
> 2.2GB/200 = 11MB. That is a lot of files for nothing. The app has to load
> each file.
> Is this a parquet file?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
> On 6 August 2016 at 23:09, Muthu Jayakumar <bablo...@gmail.com> wrote:
>
>> Hello Dr Mich Talebzadeh,
>>
>> Thank you for looking into my question. W.r.t.
>> > However, in reality the number of partitions should not exceed the
>> > total number of cores in your cluster?
>> I do have 720 cores available in the cluster for this to run. It does run
>> with dynamic allocation.
>>
>> On a side note, I was expecting the partition count to match up to what
>> you have. But :( my numbers above now ask me to understand the APIs
>> better :).
>>
>> Please advise,
>> Muthu
>>
>>
>> On Sat, Aug 6, 2016 at 1:54 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Muthu,
>>>
>>> Interesting question.
>>>
>>> I have the following:
>>>
>>> scala> val s = HiveContext.table("dummy_parquet").toJavaRDD.partitions.size()
>>> s: Int = 256
>>>
>>> and on HDFS it has
>>>
>>> hdfs dfs -ls /user/hive/warehouse/oraclehadoop.db/dummy_parquet | wc -l
>>> 16/08/06 21:50:45 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 257
>>>
>>> which is somehow consistent.
>>>
>>> Its size:
>>>
>>> hdfs dfs -du -h -s /user/hive/warehouse/oraclehadoop.db/dummy_parquet
>>> 16/08/06 21:51:50 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 5.9 G  /user/hive/warehouse/oraclehadoop.db/dummy_parquet
>>>
>>> nearly 6GB
>>>
>>> sc.defaultParallelism
>>> res6: Int = 1
>>>
>>> However, in reality the number of partitions should not exceed the total
>>> number of cores in your cluster?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> On 6 August 2016 at 20:56, Muthu Jayakumar <bablo...@gmail.com> wrote:
>>>
>>>> Hello there,
>>>>
>>>> I am trying to understand how I could improve (or increase) the
>>>> parallelism of tasks that run for a particular spark job.
>>>> Here is my observation...
>>>>
>>>> scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size()
>>>> 25
>>>>
>>>> > hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l
>>>> 200
>>>>
>>>> > hadoop fs -du -h -s hdfs://somefile
>>>> 2.2 G
>>>>
>>>> I notice that the number of part files created on HDFS during the save
>>>> operation depends on the repartition / coalesce value. Meaning the
>>>> number of part files can be tweaked with this parameter.
>>>>
>>>> But how do I control the 'partitions.size()'? Meaning, I want it to be
>>>> 200 (without having to repartition during the read operation), so that
>>>> more tasks would run for this job. This has a major impact in terms of
>>>> the time it takes to perform query operations on this job.
>>>>
>>>> On a side note, I do understand that 200 parquet part files for the
>>>> above 2.2 G seems overkill for a 128 MB block size. Ideally it should
>>>> be 18 parts or so.
>>>>
>>>> Please advise,
>>>> Muthu
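
P.S. The "scala Future - fork-join" workaround I mentioned above is
essentially along the lines of the sketch below. The queries themselves are
made-up placeholders; the only point is that several independent actions are
submitted concurrently so that more cores stay busy.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Reuses the parquetFile val from the sample above. Each collect() is an
// independent Spark job, so the jobs can be scheduled concurrently across
// the cluster instead of a single query running only ~20-25 tasks.
val customers = spark.read.parquet(parquetFile)
customers.createOrReplaceTempView("customers")

val queries = Seq(
  "select count(*) from customers",
  "select max(number) from customers",
  "select min(number) from customers")

val futures = queries.map(q => Future(spark.sql(q).collect()))
val results = Await.result(Future.sequence(futures), 10.minutes)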
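
P.P.S. On the original question of controlling partitions.size() at read
time: one knob I have come across, but have not yet verified against this
data, is the Spark 2.0 file-source split size. The file source packs several
small files into one input partition, and lowering the target split size
should yield more read partitions. The values below are illustrative only.

// Untested sketch: shrink the target split size so that the parquet read
// produces more input partitions (and hence more tasks).
spark.conf.set("spark.sql.files.maxPartitionBytes", 16 * 1024 * 1024L) // target bytes per input partition
spark.conf.set("spark.sql.files.openCostInBytes", 1024 * 1024L)        // estimated cost of opening a file

println(spark.read.parquet(parquetFile).toJavaRDD.partitions.size())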