Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?
Ran into this need myself. Does Spark have an equivalent of
"mapreduce.input.fileinputformat.list-status.num-threads"?

Thanks.

On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote:
> Hi,
>
> I am wondering if anyone has successfully enabled
> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
> usually set this property to 25 to speed up file listing in MR jobs (Hive
> and Pig). But for some reason, this property does not take effect in
> Spark's HadoopRDD, resulting in serious delays in file listing.
>
> I verified that the property is indeed set in HadoopRDD by logging its
> value in the getPartitions() function. I also attached VisualVM to the
> Spark and Pig clients, with the following results:
>
> In Pig, I can see 25 threads running in parallel for file listing.
> [image: Inline image 1]
>
> In Spark, I only see 2 threads running in parallel for file listing.
> [image: Inline image 2]
>
> What's strange is that the number of concurrent threads in Spark is
> throttled no matter how high I set
> "mapreduce.input.fileinputformat.list-status.num-threads".
>
> Is anyone using Spark with this property enabled? If so, can you please
> share how you do it?
>
> Thanks!
> Cheolsoo
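For reference, one way to pass this Hadoop property into a Spark job is the `spark.hadoop.*` passthrough: Spark copies any configuration key with that prefix into the Hadoop `Configuration` it hands to `HadoopRDD`. A minimal sketch (the jar name and application class are hypothetical):

```shell
# Forward the Hadoop property into the job's Hadoop Configuration so that
# FileInputFormat lists input paths with 25 threads instead of 1.
spark-submit \
  --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=25 \
  --class com.example.MyJob \
  my-job.jar
```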
Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?
Alex, see this jira: https://issues.apache.org/jira/browse/SPARK-9926

On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <alex.nastet...@vervemobile.com> wrote:
> Ran into this need myself. Does Spark have an equivalent of
> "mapreduce.input.fileinputformat.list-status.num-threads"?
>
> Thanks.
>
> [quoted message from Cheolsoo Park, Thu, Jul 23, 2015, trimmed]
Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?
Thanks. I was actually able to get
mapreduce.input.fileinputformat.list-status.num-threads working in Spark
against a regular fileset in S3, in Spark 1.5.2 ... looks like the issue is
isolated to Hive.

On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park wrote:
> Alex, see this jira:
> https://issues.apache.org/jira/browse/SPARK-9926
>
> [earlier quoted messages trimmed]
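For anyone trying the same thing: the property can also be set programmatically on the SparkContext's Hadoop configuration, as long as it is set before the first action runs (listing happens when HadoopRDD.getPartitions() is first invoked). A minimal Scala sketch against a hypothetical S3 path, assuming a running SparkContext named sc (e.g. in spark-shell):

```scala
// Must be set before the first action, because input listing runs in
// HadoopRDD.getPartitions() when the job is first planned.
sc.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.list-status.num-threads", "25")

// Hypothetical S3 input path, for illustration only.
val events = sc.textFile("s3n://my-bucket/events/2016/01/*")

// Triggers getPartitions(); the listing above should now use 25 threads.
println(events.partitions.length)
```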