On Tue, Apr 30, 2019 at 6:48 PM Vatsal Patel <vatsal.pa...@flipkart.com> wrote:
> Issue:
>
> When I am reading a sequence file in Spark, I can specify the number of
> partitions as an argument to the API:
>
>     public <K, V> JavaPairRDD<K, V> sequenceFile(String path,
>         Class<K> keyClass, Class<V> valueClass, int minPartitions)
>
> In newAPIHadoopFile(), this support has been removed. Below are the APIs:
>
>     public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
>         JavaPairRDD<K, V> newAPIHadoopFile(String path, Class<F> fClass,
>         Class<K> kClass, Class<V> vClass, Configuration conf)
>
>     public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
>         JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass,
>         Class<K> kClass, Class<V> vClass)
>
> Is there a way to specify the number of partitions when reading an Avro
> file using newAPIHadoopFile()? I explored this and found that we can pass
> a Hadoop Configuration and set various Hadoop properties in it, including
> the split size via the property "mapred.max.split.size" (e.g. "50mb").
> Based on that, Spark calculates the number of partitions, but then each
> partition's size may or may not end up equal to or smaller than the
> specified size.
>
> Note: I am looking for a way other than repartition().
>
> Execution environment:
>
> - Spark (Java) version: 2.4.0
> - JDK version: 1.8
> - Spark artifactId: spark-core_2.11
> - Avro version: 1.8.2
>
> Please help us understand why this happens.
>
> Thanks,
> Vatsal
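For reference, here is a minimal sketch of the split-size approach described
above, reading an Avro container file through newAPIHadoopFile(). The class
name, app name, and input path are illustrative, not from the original mail.
Two caveats worth noting: the new mapreduce API reads the cap from
"mapreduce.input.fileinputformat.split.maxsize" (the quoted
"mapred.max.split.size" is its deprecated old-API alias in Hadoop 2), and the
value must be a number of bytes, so a string like "50mb" would not parse.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public final class AvroSplitSizeExample {
        public static void main(String[] args) {
            JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("avro-split-size"));

            // New-API key for the maximum split size; the value is a
            // long number of bytes.
            Configuration conf = new Configuration(jsc.hadoopConfiguration());
            conf.set("mapreduce.input.fileinputformat.split.maxsize",
                     String.valueOf(50L * 1024 * 1024)); // 50 MB

            // One partition per input split. Splits are capped at ~50 MB,
            // but the last split of each file can be smaller, which is why
            // partition sizes are bounded above rather than made equal.
            JavaPairRDD<AvroKey<GenericRecord>, NullWritable> records =
                jsc.newAPIHadoopFile(
                    "hdfs:///tmp/input/data.avro", // hypothetical path
                    AvroKeyInputFormat.class,
                    AvroKey.class,
                    NullWritable.class,
                    conf);

            System.out.println("partitions = " + records.getNumPartitions());
            jsc.stop();
        }
    }

As the mail already observes, this only bounds partition sizes and thereby
influences the partition count; unlike sequenceFile(), newAPIHadoopFile()
exposes no minPartitions argument to fix the count directly.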