[ 
https://issues.apache.org/jira/browse/SPARK-20071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938475#comment-15938475
 ] 

Barry Becker commented on SPARK-20071:
--------------------------------------

Yes. I agree. I wanted to report the issue, but wasn't sure if it should be a 
bug or enhancement request. The main problem for me was that it took a fair bit 
of debugging to discover what the problem was. It might be nice to provide a 
warning or more info in the exception. I will find a way to work around it. 

> StringIndexer overflows Kryo serialization buffer when run on column with 
> many long distinct values
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20071
>                 URL: https://issues.apache.org/jira/browse/SPARK-20071
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>            Priority: Minor
>
> I marked this as minor because there are workarounds.
> I have a 2 million row dataset with a string column that is mostly unique and 
> contains many very long values.
> Most of the values are between 1,000 and 40,0000 characters long.
> I am using Kryoserializer and increased the  spark.kryoserializer.buffer.max 
> to 256m. 
> If I try to run StringIndexer.fit on this column, I will get an OutOfMemory 
> exception or more likely a Buffer overflow error like 
> {code}
> org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
> Available: 0, required: 23. 
> To avoid this, increase spark.kryoserializer.buffer.max 
> value.org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>  
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> This result is not that surprising given that we are trying to index a column 
> like this, however, I can think of some suggestions that would help avoid the 
> error and maybe help performance.
> These possible enhancements to StringIndexer might be hard, but I thought I 
> would suggest them anyway, just in case they are not.
> 1) Add param for Top N values. I know that StringIndexer gives lower indices 
> to the more commonly occurring values. It would be great if one could specify 
> that I only want to index the top N values and long everything else into a 
> special "Other" value.
> 2) Add param for label length limit. Only consider the first L characters of 
> labels when doing the indexing.
> Either of these enhancements would work, but I suppose they can also be 
> implemented with additional work as steps preceding the indexer in the 
> pipeline. Perhaps topByKey could be used to replace the column with one that 
> has the top values and "Other" as suggesed in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to