[ https://issues.apache.org/jira/browse/SPARK-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960908#comment-13960908 ]
Xusen Yin commented on SPARK-1415:
----------------------------------

Hi Matei, I just looked around in those Hadoop APIs. I found that the new Hadoop API deprecates minSplits; instead of minSplits, it uses minSplitSize and maxSplitSize to control the split. minSplits is negatively correlated with maxSplitSize, so I think we have two ways to fix the issue:
1. Provide a new API with maxSplitSize, say, wholeTextFiles(path: String, maxSplitSize: Long);
2. Write a delegation that computes maxSplitSize from minSplits (easy to write, taking the old Hadoop API as an example), and provide the API wholeTextFiles(path: String, minSplits: Int).
We could also provide the two APIs simultaneously. What do you think?

> Add a minSplits parameter to wholeTextFiles
> -------------------------------------------
>
>                 Key: SPARK-1415
>                 URL: https://issues.apache.org/jira/browse/SPARK-1415
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Matei Zaharia
>            Assignee: Xusen Yin
>              Labels: Starter
>
> This probably requires adding one to newAPIHadoopFile too.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
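The delegation in option 2 could be sketched roughly as below. This is a minimal illustration, not the actual Spark patch: the helper name `maxSplitSizeFor` and the assumption that the total input length is known up front are both hypothetical.

```scala
// Hypothetical sketch of option 2: derive the new Hadoop API's
// maxSplitSize from a requested minSplits, given the total byte
// length of the input files. With at most ceil(totalLen / minSplits)
// bytes per split, at least minSplits splits are produced.
object SplitDelegation {
  def maxSplitSizeFor(totalLen: Long, minSplits: Int): Long = {
    // Guard against a non-positive minSplits; fall back to one split.
    val splits = math.max(minSplits, 1)
    math.ceil(totalLen.toDouble / splits).toLong
  }
}
```

For example, 100 bytes of input with minSplits = 4 yields a maxSplitSize of 25 bytes, which the new API's input format would then use to cap each split.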