[jira] [Commented] (SPARK-20339) Issue in regex_replace in Apache Spark Java
[ https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971054#comment-15971054 ]

Nischay commented on SPARK-20339:
---------------------------------

Sure, I will not add redundant code in future, and I will use u...@spark.apache.org.

"For such a huge sequence of generating columns you are probably much better off constructing a Row directly in a transformation"

We do not understand this suggestion. Could you please explain it in detail?

We tried a UDF, but we are getting a "Task not serializable" exception.

    UDF1 removeSpecialCharaters = new UDF1() {
        public String call(final String types) throws Exception {
            while(names.hasMoreElements()) {
                String str = (String) names.nextElement();
                types.replaceAll(str, manufacturerNames.get(str).toString());
            }
            return types;
        }
    };
    sqlContext.udf().register("removeSpecialCharatersUDF", removeSpecialCharaters, DataTypes.StringType);
    dataFileContent.createOrReplaceTempView("DataSetOfinterest");
    Dataset newDF = sqlContext.sql("select removeSpecialCharatersUDF(ManufacturerSource) FROM DataSetOfinterest");

> Issue in regex_replace in Apache Spark Java
> -------------------------------------------
>
>                 Key: SPARK-20339
>                 URL: https://issues.apache.org/jira/browse/SPARK-20339
>             Project: Spark
>          Issue Type: Question
>          Components: Java API, Spark Core, SQL
>    Affects Versions: 2.1.0
>            Reporter: Nischay
>
> We are currently facing a couple of issues:
> 1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB.
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a Major bug in the Apache Spark Jira:
> https://issues.apache.org/jira/browse/SPARK-18492
> We hit these issues with the following program. We are trying to replace
> each manufacturer name with its equivalent alternate name.
> The issues occur only when there is a huge number of alternate names to
> replace; with a small number of replacements it works without problems.
> dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> Kindly suggest an alternative method or a workaround for this problem.
> {code}
> Hashtable manufacturerNames = new Hashtable();
> Enumeration names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool Group");
> manufacturerNames.put("Jacobs","Apex Tool Group");
> manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> // Show all balances in hash table.
> names = manufacturerNames.keys();
> Dataset dataFileContent = sqlContext.load("com.databricks.spark.csv", options);
>
> while(names.hasMoreElements()) {
>     str = (String) names.nextElement();
>     dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
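Editorial sketch, not part of the original thread. Three things in the UDF above stand out. Java strings are immutable, so the result of `types.replaceAll(...)` is discarded and the UDF returns its input unchanged. The UDF also captures `names`, an `Enumeration`, which is not serializable and can only be walked once; that is a plausible (unconfirmed) source of the "Task not serializable" exception. Finally, keys such as "H.K. Porter" or "Black+Decker" contain regex metacharacters. The plain-Java class below is one possible way to do all replacements in a single pass per value. The class name `ManufacturerNormalizer` and its methods are hypothetical, and wiring it into a Spark UDF is left out and would be an assumption.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the thread): replaces every known
// manufacturer name in one pass using a single combined pattern.
final class ManufacturerNormalizer implements Serializable {
    private static final long serialVersionUID = 1L;

    private final HashMap<String, String> names; // HashMap is Serializable
    private final Pattern pattern;               // Pattern is Serializable

    ManufacturerNormalizer(Map<String, String> names) {
        this.names = new HashMap<>(names);
        List<String> keys = new ArrayList<>(names.keySet());
        // Longest keys first, so "Apex-Geta" is tried before "Apex".
        keys.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder alternation = new StringBuilder();
        for (String key : keys) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Pattern.quote treats the key as literal text, not as a regex.
            alternation.append(Pattern.quote(key));
        }
        this.pattern = Pattern.compile(alternation.toString());
    }

    String normalize(String input) {
        if (input == null || names.isEmpty()) {
            return input;
        }
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(names.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> sample = new HashMap<>();
        sample.put("Apex", "Apex Tool Group");
        sample.put("Apex-Geta", "Apex Tool Group");
        sample.put("H.K. Porter", "Apex Tool Group");
        sample.put("Black+Decker", "StanleyBlack & Decker");
        ManufacturerNormalizer normalizer = new ManufacturerNormalizer(sample);
        System.out.println(normalizer.normalize("Apex-Geta"));    // Apex Tool Group
        System.out.println(normalizer.normalize("Black+Decker")); // StanleyBlack & Decker
    }
}
```

Because each value is scanned once, already replaced text is never matched again. With the chained replacements in the issue, a later "Apex" rule could match inside an earlier "Apex Tool Group" result, depending on the iteration order of the `Hashtable`. If a helper like this were called from a UDF body, each row would need one function call rather than one `regexp_replace` expression per name, which may also keep the generated code small; that has not been verified against Spark 2.1.0.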
[jira] [Updated] (SPARK-20339) Issue in regex_replace in Apache Spark Java
[ https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nischay updated SPARK-20339:
----------------------------
    Description:

We are currently facing a couple of issues:
1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB.
2. "java.lang.StackOverflowError"

The first issue is reported as a Major bug in the Apache Spark Jira:
https://issues.apache.org/jira/browse/SPARK-18492

We hit these issues with the following program. We are trying to replace each manufacturer name with its equivalent alternate name.
The issues occur only when there is a huge number of alternate names to replace; with a small number of replacements it works without problems.

dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));

Kindly suggest an alternative method or a workaround for this problem.

Hashtable manufacturerNames = new Hashtable();
Enumeration names;
String str;
double bal;
manufacturerNames.put("Allen","Apex Tool Group");
manufacturerNames.put("Armstrong","Apex Tool Group");
manufacturerNames.put("Campbell","Apex Tool Group");
manufacturerNames.put("Lubriplate","Apex Tool Group");
manufacturerNames.put("Delta","Apex Tool Group");
manufacturerNames.put("Gearwrench","Apex Tool Group");
manufacturerNames.put("H.K. Porter","Apex Tool Group");
manufacturerNames.put("Jacobs","Apex Tool Group");
manufacturerNames.put("Jobox","Apex Tool Group");
manufacturerNames.put("Lufkin","Apex Tool Group");
manufacturerNames.put("Nicholson","Apex Tool Group");
manufacturerNames.put("Plumb","Apex Tool Group");
manufacturerNames.put("Wiss","Apex Tool Group");
manufacturerNames.put("Covert","Apex Tool Group");
manufacturerNames.put("Apex-Geta","Apex Tool Group");
manufacturerNames.put("Dotco-Airetool","Apex Tool Group");
manufacturerNames.put("Apex","Apex Tool Group");
manufacturerNames.put("Cleco","Apex Tool Group");
manufacturerNames.put("Dotco","Apex Tool Group");
manufacturerNames.put("Erem","Apex Tool Group");
manufacturerNames.put("Master Power","Apex Tool Group");
manufacturerNames.put("Recoules Quackenbush","Apex Tool Group");
manufacturerNames.put("Apex-Utica","Apex Tool Group");
manufacturerNames.put("Weller","Apex Tool Group");
manufacturerNames.put("Xcelite","Apex Tool Group");
manufacturerNames.put("JET","JPW Industries");
manufacturerNames.put("Powermatic","JPW Industries");
manufacturerNames.put("Wilton","JPW Industries");
manufacturerNames.put("Black+Decker","StanleyBlack & Decker");
manufacturerNames.put("BlackhawkBy Proto","StanleyBlack & Decker");
manufacturerNames.put("Bostitch","StanleyBlack & Decker");
manufacturerNames.put("Cribmaster","StanleyBlack & Decker");
manufacturerNames.put("DeWALT","StanleyBlack & Decker");
manufacturerNames.put("Expert (Hand Tools & Accessories); Expert (Wrenches)","StanleyBlack & Decker");
manufacturerNames.put("Facom","StanleyBlack & Decker");
manufacturerNames.put("Mac","StanleyBlack & Decker");
manufacturerNames.put("Lista","StanleyBlack & Decker");
manufacturerNames.put("Porter-Cable","StanleyBlack & Decker");
manufacturerNames.put("Powers","StanleyBlack & Decker");
manufacturerNames.put("Proto","StanleyBlack & Decker");
manufacturerNames.put("Stanley","StanleyBlack & Decker");
manufacturerNames.put("Vidmar","StanleyBlack & Decker");
manufacturerNames.put("Abell-Howe","Columbus McKinnon");
manufacturerNames.put("Budgit Hoists","Columbus McKinnon");
manufacturerNames.put("Cady Lifters","Columbus McKinnon");
manufacturerNames.put("Chester Hoist","Columbus McKinnon");
[jira] [Created] (SPARK-20339) Issue in regex_replace in Apache Spark Java
Nischay created SPARK-20339:
-------------------------------

             Summary: Issue in regex_replace in Apache Spark Java
                 Key: SPARK-20339
                 URL: https://issues.apache.org/jira/browse/SPARK-20339
             Project: Spark
          Issue Type: Question
          Components: Java API, Spark Core, SQL
    Affects Versions: 2.1.0
            Reporter: Nischay

We are currently facing a couple of issues:
1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB.
2. "java.lang.StackOverflowError"

The first issue is reported as a Major bug in the Apache Spark Jira:
https://issues.apache.org/jira/browse/SPARK-18492

We hit these issues with the following program. We are trying to replace each manufacturer name with its equivalent alternate name.
The issues occur only when there is a huge number of alternate names to replace; with a small number of replacements it works without problems.

`dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));`

Kindly suggest an alternative method or a workaround for this problem.

Hashtable manufacturerNames = new Hashtable();
Enumeration names;
String str;
double bal;
manufacturerNames.put("Allen","Apex Tool Group");
manufacturerNames.put("Armstrong","Apex Tool Group");
manufacturerNames.put("Campbell","Apex Tool Group");
manufacturerNames.put("Lubriplate","Apex Tool Group");
manufacturerNames.put("Delta","Apex Tool Group");
manufacturerNames.put("Gearwrench","Apex Tool Group");
manufacturerNames.put("H.K. Porter","Apex Tool Group");
manufacturerNames.put("Jacobs","Apex Tool Group");
manufacturerNames.put("Jobox","Apex Tool Group");
manufacturerNames.put("Lufkin","Apex Tool Group");
manufacturerNames.put("Nicholson","Apex Tool Group");
manufacturerNames.put("Plumb","Apex Tool Group");
manufacturerNames.put("Wiss","Apex Tool Group");
manufacturerNames.put("Covert","Apex Tool Group");
manufacturerNames.put("Apex-Geta","Apex Tool Group");
manufacturerNames.put("Dotco-Airetool","Apex Tool Group");
manufacturerNames.put("Apex","Apex Tool Group");
manufacturerNames.put("Cleco","Apex Tool Group");
manufacturerNames.put("Dotco","Apex Tool Group");
manufacturerNames.put("Erem","Apex Tool Group");
manufacturerNames.put("Master Power","Apex Tool Group");
manufacturerNames.put("Recoules Quackenbush","Apex Tool Group");
manufacturerNames.put("Apex-Utica","Apex Tool Group");
manufacturerNames.put("Weller","Apex Tool Group");
manufacturerNames.put("Xcelite","Apex Tool Group");
manufacturerNames.put("JET","JPW Industries");
manufacturerNames.put("Powermatic","JPW Industries");
manufacturerNames.put("Wilton","JPW Industries");
manufacturerNames.put("Black+Decker","StanleyBlack & Decker");
manufacturerNames.put("BlackhawkBy Proto","StanleyBlack & Decker");
manufacturerNames.put("Bostitch","StanleyBlack & Decker");
manufacturerNames.put("Cribmaster","StanleyBlack & Decker");
manufacturerNames.put("DeWALT","StanleyBlack & Decker");
manufacturerNames.put("Expert (Hand Tools & Accessories); Expert (Wrenches)","StanleyBlack & Decker");
manufacturerNames.put("Facom","StanleyBlack & Decker");
manufacturerNames.put("Mac","StanleyBlack & Decker");
manufacturerNames.put("Lista","StanleyBlack & Decker");
manufacturerNames.put("Porter-Cable","StanleyBlack & Decker");
manufacturerNames.put("Powers","StanleyBlack & Decker");
manufacturerNames.put("Proto","StanleyBlack & Decker");
manufacturerNames.put("Stanley","StanleyBlack & Decker");
manufacturerNames.put("Vidmar","StanleyBlack & Decker");
manufacturerNames.put("Abell-Howe","Columbus McKinnon");
manufacturerNames.put("Budgit Hoists","Columbu