[jira] [Commented] (SPARK-20339) Issue in regex_replace in Apache Spark Java

2017-04-17 Thread Nischay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971054#comment-15971054
 ] 

Nischay commented on SPARK-20339:
-

Sure, I will not add redundant code in future, and I'll use u...@spark.apache.org.

"For such a huge sequence of generating columns you are probably much better 
off constructing a Row directly in a transformation"
We do not understand this suggestion. Could you please explain it in detail?

We tried a UDF, but we are getting a "Task not serializable" exception.
 
UDF1<String, String> removeSpecialCharaters = new UDF1<String, String>() {
    public String call(final String types) throws Exception {
        while (names.hasMoreElements()) {
            String str = (String) names.nextElement();
            types.replaceAll(str, manufacturerNames.get(str).toString());
        }
        return types;
    }
};
sqlContext.udf().register("removeSpecialCharatersUDF", removeSpecialCharaters, DataTypes.StringType);
dataFileContent.createOrReplaceTempView("DataSetOfinterest");
Dataset newDF = sqlContext.sql(
    "select removeSpecialCharatersUDF(ManufacturerSource) FROM DataSetOfinterest");
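
A note on the UDF attempt above. The full stack trace is not shown, so the cause is an assumption, but a likely reason for "Task not serializable" is that the anonymous UDF1 captures the `names` Enumeration (and possibly the enclosing class), which Java serialization cannot handle. Separately, `types.replaceAll(...)` returns a new String and the result is discarded, so the UDF would return its input unchanged. The following is a minimal sketch in plain Java, with no Spark dependency, of a replacer that holds only a map of strings and survives a serialization round trip. The class name `ManufacturerReplacer` and the helper `roundTrip` are hypothetical. It also swaps the regex-based `replaceAll` for the literal `String.replace`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: its only state is a map of plain strings, so it is serializable.
class ManufacturerReplacer implements Serializable {
    private static final long serialVersionUID = 1L;

    private final LinkedHashMap<String, String> names;

    ManufacturerReplacer(Map<String, String> names) {
        this.names = new LinkedHashMap<>(names);
    }

    // Same shape as a one-argument String-to-String UDF call.
    public String call(String value) {
        if (value == null) {
            return null;
        }
        String result = value;
        for (Map.Entry<String, String> e : names.entrySet()) {
            // Strings are immutable: the result must be reassigned.
            // String.replace treats the key as literal text, not as a regex.
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    // Serializes and deserializes the replacer to check that it can be shipped.
    static ManufacturerReplacer roundTrip(ManufacturerReplacer replacer) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(replacer);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ManufacturerReplacer) in.readObject();
            }
        } catch (Exception e) {
            throw new IllegalStateException("replacer is not serializable", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> names = new LinkedHashMap<>();
        names.put("Gearwrench", "Apex Tool Group");
        names.put("Wilton", "JPW Industries");
        ManufacturerReplacer copy = roundTrip(new ManufacturerReplacer(names));
        System.out.println(copy.call("Wilton")); // prints "JPW Industries"
    }
}
```

In a Spark job the same body could live in a named class implementing `UDF1<String, String>` instead of an anonymous inner class, so that nothing from the driver-side method is captured by accident.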


> Issue in regex_replace in Apache Spark Java
> ---
>
> Key: SPARK-20339
> URL: https://issues.apache.org/jira/browse/SPARK-20339
> Project: Spark
>  Issue Type: Question
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nischay
>
> We are currently facing a couple of issues:
> 1. 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a Major bug in the Apache Spark JIRA: 
> https://issues.apache.org/jira/browse/SPARK-18492
> We hit these issues with the following program, which replaces each 
> manufacturer name with its equivalent alternate name.
> The issues occur only when there is a large number of alternate names to 
> replace; with a small number of replacements it works without problems.
> dataFileContent = dataFileContent.withColumn("ManufacturerSource", 
> regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));
> Kindly suggest an alternative method or a workaround for this 
> problem.
> {code}
> Hashtable manufacturerNames = new Hashtable();
> Enumeration names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool Group");
> manufacturerNames.put("Jacobs","Apex Tool Group");
> manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> // Show all balances in hash table.
> names = manufacturerNames.keys();
> Dataset dataFileContent = sqlContext.load("com.databricks.spark.csv", options);
>
> while(names.hasMoreElements()) {
>     str = (String) names.nextElement();
>     dataFileContent = dataFileContent.withColumn("ManufacturerSource",
>         regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> {code}
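
The loop in the description adds one `regexp_replace` expression per name, so the expression that Spark has to compile grows with the size of the table. One possible alternative is to apply every replacement inside a single function call per value, for example from one UDF or one map transformation. The sketch below shows only the string-level part in plain Java, with no Spark dependency. The class name `SinglePassReplace` and its methods are hypothetical. It assumes the keys are meant as literal text, and it prefers the longest key when several keys start at the same position.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: one combined pattern, one pass over each input value.
class SinglePassReplace {

    // Builds one alternation from all keys. Longer keys come first so that
    // "Apex-Utica" is tried before "Apex" at the same position.
    static Pattern compile(Map<String, String> names) {
        String alternation = names.keySet().stream()
            .sorted((a, b) -> Integer.compare(b.length(), a.length()))
            .map(Pattern::quote) // treat every key as literal text
            .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Replaces every matched key with its mapped value in a single scan.
    // Replacement text is never rescanned, so a value such as
    // "Apex Tool Group" is not replaced again by the key "Apex".
    static String replaceAll(Pattern pattern, Map<String, String> names, String input) {
        if (input == null || names.isEmpty()) {
            return input;
        }
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(names.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> names = new LinkedHashMap<>();
        names.put("Apex", "Apex Tool Group");
        names.put("Apex-Utica", "Apex Tool Group");
        names.put("JET", "JPW Industries");
        Pattern p = compile(names);
        System.out.println(replaceAll(p, names, "Apex-Utica")); // prints "Apex Tool Group"
        System.out.println(replaceAll(p, names, "JET"));        // prints "JPW Industries"
    }
}
```

With the chained approach the outcome can depend on the iteration order of the Hashtable, because an earlier replacement such as "Allen" to "Apex Tool Group" produces text that a later key such as "Apex" matches again. A single scan avoids that.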






[jira] [Updated] (SPARK-20339) Issue in regex_replace in Apache Spark Java

2017-04-14 Thread Nischay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nischay updated SPARK-20339:

Description: 
We are currently facing a couple of issues:

1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
2. "java.lang.StackOverflowError"
The first issue is reported as a Major bug in the Apache Spark JIRA: 
https://issues.apache.org/jira/browse/SPARK-18492

We hit these issues with the following program, which replaces each 
manufacturer name with its equivalent alternate name.

The issues occur only when there is a large number of alternate names to replace; 
with a small number of replacements it works without problems.
dataFileContent = dataFileContent.withColumn("ManufacturerSource", 
regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));

Kindly suggest an alternative method or a workaround for this problem.

Hashtable manufacturerNames = new Hashtable();
Enumeration names;
String str;
double bal;

manufacturerNames.put("Allen","Apex Tool Group");
manufacturerNames.put("Armstrong","Apex Tool Group");
manufacturerNames.put("Campbell","Apex Tool Group");
manufacturerNames.put("Lubriplate","Apex Tool Group");
manufacturerNames.put("Delta","Apex Tool Group");
manufacturerNames.put("Gearwrench","Apex Tool Group");
manufacturerNames.put("H.K. Porter","Apex Tool Group");
manufacturerNames.put("Jacobs","Apex Tool Group");
manufacturerNames.put("Jobox","Apex Tool Group");
manufacturerNames.put("Lufkin","Apex Tool Group");
manufacturerNames.put("Nicholson","Apex Tool Group");
manufacturerNames.put("Plumb","Apex Tool Group");
manufacturerNames.put("Wiss","Apex Tool Group");
manufacturerNames.put("Covert","Apex Tool Group");
manufacturerNames.put("Apex-Geta","Apex Tool Group");
manufacturerNames.put("Dotco-Airetool","Apex Tool Group");
manufacturerNames.put("Apex","Apex Tool Group");
manufacturerNames.put("Cleco","Apex Tool Group");
manufacturerNames.put("Dotco","Apex Tool Group");
manufacturerNames.put("Erem","Apex Tool Group");
manufacturerNames.put("Master Power","Apex Tool Group");
manufacturerNames.put("Recoules Quackenbush","Apex Tool Group");
manufacturerNames.put("Apex-Utica","Apex Tool Group");
manufacturerNames.put("Weller","Apex Tool Group");
manufacturerNames.put("Xcelite","Apex Tool Group");
manufacturerNames.put("JET","JPW Industries");
manufacturerNames.put("Powermatic","JPW Industries");
manufacturerNames.put("Wilton","JPW Industries");
manufacturerNames.put("Black+Decker","StanleyBlack & Decker");
manufacturerNames.put("BlackhawkBy Proto","StanleyBlack & Decker");
manufacturerNames.put("Bostitch","StanleyBlack & Decker");
manufacturerNames.put("Cribmaster","StanleyBlack & Decker");
manufacturerNames.put("DeWALT","StanleyBlack & Decker");
manufacturerNames.put("Expert (Hand Tools & Accessories); Expert (Wrenches)","StanleyBlack & Decker");
manufacturerNames.put("Facom","StanleyBlack & Decker");
manufacturerNames.put("Mac","StanleyBlack & Decker");
manufacturerNames.put("Lista","StanleyBlack & Decker");
manufacturerNames.put("Porter-Cable","StanleyBlack & Decker");
manufacturerNames.put("Powers","StanleyBlack & Decker");
manufacturerNames.put("Proto","StanleyBlack & Decker");
manufacturerNames.put("Stanley","StanleyBlack & Decker");
manufacturerNames.put("Vidmar","StanleyBlack & Decker");
manufacturerNames.put("Abell-Howe","Columbus McKinnon");
manufacturerNames.put("Budgit Hoists","Columbus McKinnon");
manufacturerNames.put("Cady Lifters","Columbus McKinnon");
manufacturerNames.put("Chester Hoist","Columbus McKinnon");
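
One more point about the table above: several keys, such as "Black+Decker", "H.K. Porter" and "Expert (Hand Tools & Accessories); Expert (Wrenches)", contain characters that have a special meaning in a regular expression. Spark's `regexp_replace` uses Java regular expressions, and so does `String.replaceAll`, so these keys may match something other than the literal name. The short sketch below shows the difference in plain Java. The class `LiteralKeys` and its two methods are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper comparing regex and literal interpretation of a key.
class LiteralKeys {

    // Interprets the key as a regular expression, as the original loop does.
    static String replaceAsRegex(String input, String key, String replacement) {
        return input.replaceAll(key, Matcher.quoteReplacement(replacement));
    }

    // Quotes the key first, so every character in it is matched literally.
    static String replaceAsLiteral(String input, String key, String replacement) {
        return input.replaceAll(Pattern.quote(key), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String key = "Black+Decker";
        String target = "StanleyBlack & Decker";
        // As a regex, "k+" means one or more 'k', so the literal '+' never matches.
        System.out.println(replaceAsRegex("Black+Decker", key, target));   // prints "Black+Decker"
        System.out.println(replaceAsLiteral("Black+Decker", key, target)); // prints "StanleyBlack & Decker"
    }
}
```

The same quoted pattern string could be passed to `regexp_replace`, assuming literal matching is what is intended for every key.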

[jira] [Created] (SPARK-20339) Issue in regex_replace in Apache Spark Java

2017-04-14 Thread Nischay (JIRA)
Nischay created SPARK-20339:
---

 Summary: Issue in regex_replace in Apache Spark Java
 Key: SPARK-20339
 URL: https://issues.apache.org/jira/browse/SPARK-20339
 Project: Spark
  Issue Type: Question
  Components: Java API, Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Nischay


We are currently facing a couple of issues:
1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
2. "java.lang.StackOverflowError"
The first issue is reported as a Major bug in the Apache Spark JIRA: 
https://issues.apache.org/jira/browse/SPARK-18492

We hit these issues with the following program, which replaces each 
manufacturer name with its equivalent alternate name.

The issues occur only when there is a large number of alternate names to replace; 
with a small number of replacements it works without problems.
`dataFileContent = dataFileContent.withColumn("ManufacturerSource", 
regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));`

Kindly suggest an alternative method or a workaround for this problem.

Hashtable manufacturerNames = new Hashtable();
Enumeration names;
String str;
double bal;

manufacturerNames.put("Allen","Apex Tool Group");
manufacturerNames.put("Armstrong","Apex Tool Group");
manufacturerNames.put("Campbell","Apex Tool Group");
manufacturerNames.put("Lubriplate","Apex Tool Group");
manufacturerNames.put("Delta","Apex Tool Group");
manufacturerNames.put("Gearwrench","Apex Tool Group");
manufacturerNames.put("H.K. Porter","Apex Tool Group");
manufacturerNames.put("Jacobs","Apex Tool Group");
manufacturerNames.put("Jobox","Apex Tool Group");
manufacturerNames.put("Lufkin","Apex Tool Group");
manufacturerNames.put("Nicholson","Apex Tool Group");
manufacturerNames.put("Plumb","Apex Tool Group");
manufacturerNames.put("Wiss","Apex Tool Group");
manufacturerNames.put("Covert","Apex Tool Group");
manufacturerNames.put("Apex-Geta","Apex Tool Group");
manufacturerNames.put("Dotco-Airetool","Apex Tool Group");
manufacturerNames.put("Apex","Apex Tool Group");
manufacturerNames.put("Cleco","Apex Tool Group");
manufacturerNames.put("Dotco","Apex Tool Group");
manufacturerNames.put("Erem","Apex Tool Group");
manufacturerNames.put("Master Power","Apex Tool Group");
manufacturerNames.put("Recoules Quackenbush","Apex Tool Group");
manufacturerNames.put("Apex-Utica","Apex Tool Group");
manufacturerNames.put("Weller","Apex Tool Group");
manufacturerNames.put("Xcelite","Apex Tool Group");
manufacturerNames.put("JET","JPW Industries");
manufacturerNames.put("Powermatic","JPW Industries");
manufacturerNames.put("Wilton","JPW Industries");
manufacturerNames.put("Black+Decker","StanleyBlack & Decker");
manufacturerNames.put("BlackhawkBy Proto","StanleyBlack & Decker");
manufacturerNames.put("Bostitch","StanleyBlack & Decker");
manufacturerNames.put("Cribmaster","StanleyBlack & Decker");
manufacturerNames.put("DeWALT","StanleyBlack & Decker");
manufacturerNames.put("Expert (Hand Tools & Accessories); Expert (Wrenches)","StanleyBlack & Decker");
manufacturerNames.put("Facom","StanleyBlack & Decker");
manufacturerNames.put("Mac","StanleyBlack & Decker");
manufacturerNames.put("Lista","StanleyBlack & Decker");
manufacturerNames.put("Porter-Cable","StanleyBlack & Decker");
manufacturerNames.put("Powers","StanleyBlack & Decker");
manufacturerNames.put("Proto","StanleyBlack & Decker");
manufacturerNames.put("Stanley","StanleyBlack & Decker");
manufacturerNames.put("Vidmar","StanleyBlack & Decker");
manufacturerNames.put("Abell-Howe","Columbus McKinnon");
  manufacturerNames.put("Budgit Hoists","Columbu