On Tue, Apr 30, 2019 at 6:48 PM Vatsal Patel <vatsal.pa...@flipkart.com> wrote:
> Issue:
>
> When I am reading a sequence file in Spark, I can specify the number of
> partitions as an argument to the API:
>
>     public <K, V> JavaPairRDD<K, V> sequenceFile(String path,
>         Class<K> keyClass, Class<V> valueClass, int minPartitions)
>
> In newAPIHadoopFile(), this support has been removed. Below are the APIs:
>
>     public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
>         JavaPairRDD<K, V> newAPIHadoopFile(String path, Class<F> fClass,
>         Class<K> kClass, Class<V> vClass, Configuration conf)
>
>     public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
>         JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass,
>         Class<K> kClass, Class<V> vClass)
>
> Is there a way to specify the number of partitions when reading an Avro
> file using newAPIHadoopFile()? I explored this and found that we can pass
> a Hadoop Configuration and set various Hadoop properties in it, including
> the split size via the property "mapred.max.split.size" (e.g. "50mb").
> Based on that, Spark calculates the number of partitions, but then each
> partition's size may or may not end up equal to or smaller than the
> specified size.
>
> Note: I am looking for a way other than repartition().
>
> Execution environment:
>
> - Spark (Java) version: 2.4.0
> - JDK version: 1.8
> - Spark artifactId: spark-core_2.11
> - Avro version: 1.8.2
>
> Please help us understand why this happens.
>
> Thanks,
> Vatsal
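For reference, here is a minimal sketch of the split-size approach described
above, reading an Avro container file through newAPIHadoopFile(). The class
name, app name, and input path are illustrative, not from the original mail.
Two caveats worth noting: the new mapreduce API reads the cap from
"mapreduce.input.fileinputformat.split.maxsize" (the quoted
"mapred.max.split.size" is its deprecated old-API alias in Hadoop 2), and the
value must be a number of bytes, so a string like "50mb" would not parse.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public final class AvroSplitSizeExample {
        public static void main(String[] args) {
            JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("avro-split-size"));

            // New-API key for the maximum split size; the value is a
            // long number of bytes.
            Configuration conf = new Configuration(jsc.hadoopConfiguration());
            conf.set("mapreduce.input.fileinputformat.split.maxsize",
                     String.valueOf(50L * 1024 * 1024)); // 50 MB

            // One partition per input split. Splits are capped at ~50 MB,
            // but the last split of each file can be smaller, which is why
            // partition sizes are bounded above rather than made equal.
            JavaPairRDD<AvroKey<GenericRecord>, NullWritable> records =
                jsc.newAPIHadoopFile(
                    "hdfs:///tmp/input/data.avro", // hypothetical path
                    AvroKeyInputFormat.class,
                    AvroKey.class,
                    NullWritable.class,
                    conf);

            System.out.println("partitions = " + records.getNumPartitions());
            jsc.stop();
        }
    }

As the mail already observes, this only bounds partition sizes and thereby
influences the partition count; unlike sequenceFile(), newAPIHadoopFile()
exposes no minPartitions argument to fix the count directly.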