[jira] [Created] (SPARK-8493) Fisher Vector Feature Transformer
Feynman Liang created SPARK-8493: Summary: Fisher Vector Feature Transformer Key: SPARK-8493 URL: https://issues.apache.org/jira/browse/SPARK-8493 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Fisher vectors provide a vocabulary-based encoding for images (see https://hal.inria.fr/hal-00830491/file/journal.pdf). This representation is useful because its reduced dimensionality provides regularization as well as increased scalability. An implementation of FVs in Spark ML should provide a way to both train a GMM vocabulary and compute Fisher kernel encodings of provided images. The vocabulary trainer can be implemented as a standalone GMM pipeline. The feature transformer can be implemented as an org.apache.spark.ml.UnaryTransformer. It should accept a vocabulary (Array[Array[Double]]) as well as an image (Array[Double]) and produce the Fisher kernel encoding (Array[Double]). See Enceval (http://www.robots.ox.ac.uk/~vgg/software/enceval_toolkit/) for a reference implementation in MATLAB/C++. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
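A hypothetical skeleton (not part of the ticket) of how such a transformer could be wired into the pipeline API, assuming the existing org.apache.spark.ml.UnaryTransformer contract; the vocabulary setter and the encoding body are placeholders rather than a real Fisher kernel implementation:
{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, DoubleType}

class FisherVectorTransformer(override val uid: String)
  extends UnaryTransformer[Seq[Double], Seq[Double], FisherVectorTransformer] {

  def this() = this(Identifiable.randomUID("fisherVec"))

  // Simplified vocabulary holder: GMM component means, one Array[Double] per component.
  // A real implementation would model this as a Param and also carry weights/covariances.
  var vocabulary: Array[Array[Double]] = Array.empty

  def setVocabulary(v: Array[Array[Double]]): this.type = { vocabulary = v; this }

  override protected def createTransformFunc: Seq[Double] => Seq[Double] = { image =>
    // Placeholder encoding: a real Fisher vector accumulates GMM posterior-weighted
    // gradients with respect to the mixture parameters (see the Enceval reference).
    vocabulary.flatMap(mean => mean.zip(image).map { case (m, x) => x - m }).toSeq
  }

  override protected def outputDataType: DataType = ArrayType(DoubleType, containsNull = false)
}
{code}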
[jira] [Resolved] (SPARK-8452) expose jobGroup API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8452. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6889 [https://github.com/apache/spark/pull/6889] expose jobGroup API in SparkR - Key: SPARK-8452 URL: https://issues.apache.org/jira/browse/SPARK-8452 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Hossein Falaki Fix For: 1.5.0, 1.4.1 Following job management calls are missing in SparkR: {code} setJobGroup() cancelJobGroup() clearJobGroup() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
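For reference, the missing SparkR functions correspond to existing methods on the Scala-side SparkContext; a minimal sketch of those Scala counterparts (the R signatures in the pull request may differ slightly):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object JobGroupExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("jobGroups"))

    // Tag all jobs submitted from this thread with a group id and description.
    sc.setJobGroup("nightly-etl", "nightly aggregation jobs", interruptOnCancel = true)
    sc.parallelize(1 to 1000, 4).map(_ * 2).count()

    // Cancel everything still running in the group, then stop tagging subsequent jobs.
    sc.cancelJobGroup("nightly-etl")
    sc.clearJobGroup()
    sc.stop()
  }
}
{code}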
[jira] [Updated] (SPARK-8452) expose jobGroup API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8452: - Assignee: Hossein Falaki expose jobGroup API in SparkR - Key: SPARK-8452 URL: https://issues.apache.org/jira/browse/SPARK-8452 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Hossein Falaki Assignee: Hossein Falaki Fix For: 1.4.1, 1.5.0 Following job management calls are missing in SparkR: {code} setJobGroup() cancelJobGroup() clearJobGroup() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8315) Better error when saving to parquet with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-8315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594144#comment-14594144 ] Yin Huai commented on SPARK-8315: - I tried the following in 1.4 {code} import org.apache.spark.sql.functions._ val df1 = Seq((1, 1)).toDF("i", "j").as("t1") val df2 = Seq((1, 1)).toDF("i", "j").as("t2") val joined = df1.join(df2, col("t1.i") === col("t2.j")) joined.explain(true) joined.write.format("parquet").saveAsTable("yinParquetSameColumnNames") {code} and I got an analysis exception {{org.apache.spark.sql.AnalysisException: Reference 'i' is ambiguous, could be: i#30, i#34.;}}. It seems this is fixed in 1.4, but it would be good to add a regression test. Better error when saving to parquet with duplicate columns -- Key: SPARK-8315 URL: https://issues.apache.org/jira/browse/SPARK-8315 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Critical Parquet allows you to silently write out files with duplicate column names and then emits a very confusing error when trying to read the data back in: {code} Error in SQL statement: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 21.0 failed 4 times, most recent failure: Lost task 4.3 in stage 21.0 (TID 2767, ...): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ... {code} We should throw a better error before attempting to write out an invalid file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
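As a sketch of the kind of pre-write validation this ticket asks for (not the actual fix), a check along these lines could fail fast with a readable message before Parquet writes a file that cannot be read back; treating column names case-insensitively is an assumption here:
{code}
import org.apache.spark.sql.DataFrame

def checkNoDuplicateColumns(df: DataFrame): Unit = {
  // Assumes case-insensitive column resolution; drop .toLowerCase for a case-sensitive check.
  val names = df.schema.fieldNames.map(_.toLowerCase)
  val duplicates = names.groupBy(identity).collect { case (name, occurrences) if occurrences.length > 1 => name }
  require(duplicates.isEmpty,
    s"Duplicate column name(s) ${duplicates.mkString(", ")} found; rename them before saving to Parquet.")
}
{code}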
[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593895#comment-14593895 ] Yu Ishikawa commented on SPARK-8477: Oh, I'm sorry. I didn't know we already have {{inSet}}. It seems that the function of {{inSet}} is almost the same as that of {{in}}. Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593893#comment-14593893 ] Michael Armbrust commented on SPARK-8470: - Additional info on how this is being run: {code} We're using the normal command line: --- bin/spark-submit --properties-file ./spark-submit.conf --class com.rr.data.visits.VisitSequencerRunner ./mvt-master-SNAPSHOT-jar-with-dependencies.jar --- Our jar contains both com.rr.data.visits.orc.OrcReadWrite (which you can see in the stack trace) and the unfound com.rr.data.Visit. {code} MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with classpath [ lots and lots of paths and jars, but not the app assembly jar] not found at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at com.ipcoop.spark.sql.SqlEnv$$typecreator1$1.apply(SqlEnv.scala:87) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) {code} Another report: {code} Hi, I use spark 0.14. 
I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath
[jira] [Updated] (SPARK-8485) Feature transformers for image processing
[ https://issues.apache.org/jira/browse/SPARK-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8485: - Component/s: ML Feature transformers for image processing - Key: SPARK-8485 URL: https://issues.apache.org/jira/browse/SPARK-8485 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Many transformers exist to convert from image representations into more compact descriptors amenable to standard ML techniques. We should implement these transformers in Spark to support machine learning on richer content types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593905#comment-14593905 ] Martin Zapletal commented on SPARK-8484: I can work on this one. Can you please assign it to me? Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
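A rough sketch of the selection logic described above, using the existing Estimator/Evaluator/ParamMap interfaces (the eventual TrainValidationSplit API may differ): fit each candidate ParamMap on the training split and keep the model that scores best on the validation split.
{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

def trainValidationSplit[M <: Model[M]](
    estimator: Estimator[M],
    evaluator: Evaluator,
    paramMaps: Array[ParamMap],
    dataset: DataFrame,
    trainRatio: Double = 0.75): M = {
  // Single random split instead of CrossValidator's k folds.
  val Array(training, validation) = dataset.randomSplit(Array(trainRatio, 1 - trainRatio))
  training.cache(); validation.cache()
  val scored = paramMaps.map { params =>
    val model = estimator.fit(training, params)
    (model, evaluator.evaluate(model.transform(validation, params)))
  }
  // Assumes a metric where larger is better; a real implementation would handle both directions.
  scored.maxBy(_._2)._1
}
{code}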
[jira] [Created] (SPARK-8486) SIFT Feature Extractor
Feynman Liang created SPARK-8486: Summary: SIFT Feature Extractor Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features that are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, the Difference-of-Gaussian approximation to the Laplacian of Gaussian described by Lowe can be further sped up using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
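To illustrate the Difference-of-Gaussians approximation mentioned above: blurring an image at two nearby scales and subtracting approximates the Laplacian of Gaussian that SIFT uses for keypoint detection. A minimal, plain-Scala sketch (border handling by clamping and the scale ratio k = 1.6 are arbitrary choices, not part of the ticket):
{code}
def gaussianKernel(sigma: Double): Array[Double] = {
  val radius = math.ceil(3 * sigma).toInt
  val raw = (-radius to radius).map(x => math.exp(-x * x / (2 * sigma * sigma)))
  val sum = raw.sum
  raw.map(_ / sum).toArray  // normalized 1D Gaussian
}

def blurRow(row: Array[Double], kernel: Array[Double]): Array[Double] = {
  val r = kernel.length / 2
  row.indices.map { i =>
    kernel.indices.map { k =>
      val j = math.min(math.max(i + k - r, 0), row.length - 1)  // clamp at the borders
      kernel(k) * row(j)
    }.sum
  }.toArray
}

// Separable blur: filter rows, then columns (via transpose).
def blur(img: Array[Array[Double]], sigma: Double): Array[Array[Double]] = {
  val kernel = gaussianKernel(sigma)
  val rows = img.map(blurRow(_, kernel))
  rows.transpose.map(blurRow(_, kernel)).transpose
}

// DoG(sigma) = G(k * sigma) * I - G(sigma) * I, an approximation to the Laplacian of Gaussian.
def differenceOfGaussians(img: Array[Array[Double]], sigma: Double, k: Double = 1.6): Array[Array[Double]] =
  blur(img, sigma * k).zip(blur(img, sigma)).map { case (a, b) => a.zip(b).map { case (x, y) => x - y } }
{code}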
[jira] [Closed] (SPARK-7360) Compare Pyrolite performance affected by useMemo
[ https://issues.apache.org/jira/browse/SPARK-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-7360. Resolution: Done Fix Version/s: 1.4.0 Assignee: Xiangrui Meng (was: Nicholas Chammas) Target Version/s: 1.4.0 (was: 1.5.0) I marked this as done. [~davies] mentioned that we need to serialize a lot of classes without useMemo, which hurts performance. So turning useMemo off globally is not a good option. Compare Pyrolite performance affected by useMemo Key: SPARK-7360 URL: https://issues.apache.org/jira/browse/SPARK-7360 Project: Spark Issue Type: Task Components: PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 As discussed in SPARK-6288, disabling useMemo shows a significant performance improvement on some ML tasks. We should test whether this is true across PySpark, and consider patching Pyrolite for Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
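A minimal sketch of the comparison being discussed, assuming Pyrolite exposes a Pickler(useMemo: Boolean) constructor; memoization deduplicates repeated object references at the cost of extra bookkeeping on every value written, which is the trade-off the ticket set out to measure:
{code}
import net.razorvine.pickle.Pickler

def timePickle(useMemo: Boolean, batches: Int = 1000): Long = {
  val pickler = new Pickler(useMemo)  // assumption: Pyrolite's useMemo toggle
  val batch: java.util.List[Array[Double]] =
    java.util.Arrays.asList(Array.fill(100)(Array.fill(10)(1.0)): _*)
  val start = System.nanoTime()
  var i = 0
  while (i < batches) { pickler.dumps(batch); i += 1 }
  (System.nanoTime() - start) / 1000000  // elapsed milliseconds
}

// println(s"useMemo=true: ${timePickle(true)} ms, useMemo=false: ${timePickle(false)} ms")
{code}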
[jira] [Updated] (SPARK-8486) SIFT/SURF Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Description: Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Array[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Furthermore, the implementation should support various approximations to the Laplacian of Gaussian. In addition to the Difference of Gaussian approximation (as described by Lowe), we should support * SURF approximation using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf) * DAISY was: Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Array[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Depending on performance, approximating the Laplacian of Gaussian by the Difference of Gaussian (traditional SIFT) as described by Lowe can be further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). SIFT/SURF Feature Transformer - Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Array[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Furthermore, the implementation should support various approximations to the Laplacian of Gaussian. In addition to the Difference of Gaussian approximation (as described by Lowe), we should support * SURF approximation using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf) * DAISY -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8486) SIFT/SURF/DAISY Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Summary: SIFT/SURF/DAISY Feature Transformer (was: SIFT/SURF Feature Transformer) SIFT/SURF/DAISY Feature Transformer --- Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Array[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Furthermore, the implementation should support various approximations to the Laplacian of Gaussian. In addition to the Difference of Gaussian approximation (as described by Lowe), we should support * SURF approximation using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf) * DAISY -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8494) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3
[ https://issues.apache.org/jira/browse/SPARK-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated SPARK-8494: -- Description: I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a SPARK-1923 Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.None$ java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} was: I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.None$ java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3 --- Key: SPARK-8494 URL: https://issues.apache.org/jira/browse/SPARK-8494 Project: Spark Issue Type: Bug Components: Spark Core Reporter: PJ Fanning Assignee: Patrick Wendell I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. 
If I remove the spray 1.3.3 jars, the test case passes but has a SPARK-1923 Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.None$ java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native
[jira] [Updated] (SPARK-8494) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3
[ https://issues.apache.org/jira/browse/SPARK-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated SPARK-8494: -- Description: I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.None$ java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} was: I just wanted to document this for posterity. I had an issue when running a Spark 1.0 app locally with sbt. The issue was that if you both: 1. Reference a scala class (e.g. None) inside of a closure. 2. Run your program with 'sbt run' It throws an exception. Upgrading the scalaVersion to 2.10.4 in sbt solved this issue. Somehow scala classes were not being loaded correctly inside of the executors: Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.None$ java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3 --- Key: SPARK-8494 URL: https://issues.apache.org/jira/browse/SPARK-8494 Project: Spark Issue Type: Bug Components: Spark Core Reporter: PJ Fanning Assignee: Patrick Wendell I found a similar issue to SPARK-1923 but with Scala 2.10.4. 
I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.None$ java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354)
[jira] [Commented] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594145#comment-14594145 ] Apache Spark commented on SPARK-6749: - User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/6912 Make metastore client robust to underlying socket connection loss - Key: SPARK-6749 URL: https://issues.apache.org/jira/browse/SPARK-6749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, if the metastore gets restarted, we have to restart the driver to get a new connection to the metastore client because the underlying socket connection is gone. We should make the metastore client robust to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
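A sketch of the general pattern (not the actual patch linked above): wrap metastore calls so that a dropped connection triggers a reconnect and a bounded number of retries. The `connect` callback and the exception type below are stand-ins for whatever the Hive client actually surfaces on connection loss:
{code}
def withRetries[T](maxAttempts: Int = 3)(connect: () => Unit)(call: => T): T = {
  var attempt = 0
  var lastError: Throwable = null
  while (attempt < maxAttempts) {
    try {
      return call
    } catch {
      case e: java.net.SocketException =>  // assumption: connection loss surfaces as an IO-level error
        lastError = e
        attempt += 1
        connect()  // re-establish the underlying connection before retrying
    }
  }
  throw lastError
}

// Hypothetical usage: withRetries()(reconnectToMetastore) { client.getTable("db", "tbl") }
{code}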
[jira] [Assigned] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6749: --- Assignee: (was: Apache Spark) Make metastore client robust to underlying socket connection loss - Key: SPARK-6749 URL: https://issues.apache.org/jira/browse/SPARK-6749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, if the metastore gets restarted, we have to restart the driver to get a new connection to the metastore client because the underlying socket connection is gone. We should make the metastore client robust to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6749: --- Assignee: Apache Spark Make metastore client robust to underlying socket connection loss - Key: SPARK-6749 URL: https://issues.apache.org/jira/browse/SPARK-6749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Apache Spark Priority: Critical Right now, if the metastore gets restarted, we have to restart the driver to get a new connection to the metastore client because the underlying socket connection is gone. We should make the metastore client robust to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593884#comment-14593884 ] Reynold Xin commented on SPARK-8477: Maybe inSet ? Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593933#comment-14593933 ] Xiangrui Meng commented on SPARK-8484: -- Assigned :) Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Martin Zapletal Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593954#comment-14593954 ] Andrew Or edited comment on SPARK-8470 at 6/19/15 9:19 PM: --- Closing this as a duplicate. We will add the regression test in SPARK-8489. was (Author: andrewor14): Closing this as FIXED. We will add the regression test in SPARK-8489. MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker Fix For: 1.4.1, 1.5.0 From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with classpath [ lots and lots of paths and jars, but not the app assembly jar] not found at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at com.ipcoop.spark.sql.SqlEnv$$typecreator1$1.apply(SqlEnv.scala:87) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) {code} Another report: {code} Hi, I use spark 0.14. 
I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath [file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] and parent being primordial
[jira] [Updated] (SPARK-8491) DAISY Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8491: - Description: DAISY (Tola et al, PAMI 2010, http://infoscience.epfl.ch/record/138785/files/tola_daisy_pami_1.pdf) is another local image descriptor utilizing histograms of local orientation similar to SIFT. However, one key difference is that the weighted sum of gradient norms used in SIFT's orientation assignment is replaced by convolution with Gaussian kernels. This provides a significant speedup in computing dense descriptors. We can implement DAISY in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the DAISY transformer should output an Array[Array[Numeric]] of the DAISY features for the provided image. The convolution operation can leverage GPU parallelism for efficiency. A C++/MATLAB reference implementation is available at http://cvlab.epfl.ch/software/daisy. was: DAISY is another local image descriptor utilizing histograms of local orientation similar to SIFT. However, one key difference is that the weighted sum of gradient norms used in SIFT's orientation assignment is replaced by convolution with Gaussian kernels. This provides a significant speedup in computing dense descriptors. We can implement DAISY in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the DAISY transformer should output an Array[Array[Numeric]] of the DAISY features for the provided image. The convolution operation can leverage GPU parallelism for efficiency. DAISY Feature Transformer - Key: SPARK-8491 URL: https://issues.apache.org/jira/browse/SPARK-8491 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang DAISY (Tola et al, PAMI 2010, http://infoscience.epfl.ch/record/138785/files/tola_daisy_pami_1.pdf) is another local image descriptor utilizing histograms of local orientation similar to SIFT. However, one key difference is that the weighted sum of gradient norms used in SIFT's orientation assignment is replaced by convolution with Gaussian kernels. This provides a significant speedup in computing dense descriptors. We can implement DAISY in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the DAISY transformer should output an Array[Array[Numeric]] of the DAISY features for the provided image. The convolution operation can leverage GPU parallelism for efficiency. A C++/MATLAB reference implementation is available at http://cvlab.epfl.ch/software/daisy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4438) Add HistoryServer RESTful API
[ https://issues.apache.org/jira/browse/SPARK-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594092#comment-14594092 ] Jonathan Kelly commented on SPARK-4438: --- This API was added in 1.4.0, right? Should this JIRA be resolved now? Add HistoryServer RESTful API - Key: SPARK-4438 URL: https://issues.apache.org/jira/browse/SPARK-4438 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Gankun Luo Attachments: HistoryServer RESTful API Design Doc.pdf Spark HistoryServer currently only supports keeping track of all completed applications through the web UI and does not provide a RESTful API for external systems to query completed application information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
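For context, the monitoring REST API that shipped in 1.4 exposes completed applications from the history server under /api/v1; a quick check from Scala (the host and port below are assumptions for a default local history server):
{code}
import scala.io.Source

val historyServer = "http://localhost:18080"  // assumed default history server address
val completedApps = Source.fromURL(s"$historyServer/api/v1/applications?status=completed").mkString
println(completedApps)  // JSON array of application summaries
{code}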
[jira] [Updated] (SPARK-8494) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3
[ https://issues.apache.org/jira/browse/SPARK-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated SPARK-8494: -- Attachment: spark-test-case.zip ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3 --- Key: SPARK-8494 URL: https://issues.apache.org/jira/browse/SPARK-8494 Project: Spark Issue Type: Bug Components: Spark Core Reporter: PJ Fanning Assignee: Patrick Wendell Attachments: spark-test-case.zip I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a ClassNotFoundException otherwise. I have a spark-assembly jar built using Spark 1.3.2-SNAPSHOT. Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster("local[4]").setAppName("Test") val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x => Some(x)).count sc.stop() } } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.collection.immutable.Range java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} {code} name := "spark-test-case" version := "1.0" scalaVersion := "2.10.4" resolvers += "spray repo" at "http://repo.spray.io" resolvers += "Scalaz Bintray Repo" at "https://dl.bintray.com/scalaz/releases" val akkaVersion = "2.3.11" val sprayVersion = "1.3.3" libraryDependencies ++= Seq( "com.h2database" % "h2" % "1.4.187", "com.typesafe.akka" %% "akka-actor" % akkaVersion, "com.typesafe.akka" %% "akka-slf4j" % akkaVersion, "ch.qos.logback" % "logback-classic" % "1.0.13", "io.spray" %% "spray-can" % sprayVersion, "io.spray" %% "spray-routing" % sprayVersion, "io.spray" %% "spray-json" % "1.3.1", "com.databricks" %% "spark-csv" % "1.0.3", "org.specs2" %% "specs2" % "2.4.17" % "test", "org.specs2" %% "specs2-junit" % "2.4.17" % "test", "io.spray" %% "spray-testkit" % sprayVersion % "test", "com.typesafe.akka" %% "akka-testkit" % akkaVersion % "test", "junit" % "junit" % "4.12" % "test" ) scalacOptions ++= Seq( "-unchecked", "-deprecation", "-Xlint", "-Ywarn-dead-code", "-language:_", "-target:jvm-1.7", "-encoding", "UTF-8" ) testOptions += Tests.Argument(TestFrameworks.JUnit, "-v") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
[ https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593898#comment-14593898 ] Olivier Girardot commented on SPARK-8332: - You're right, sorry. NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer -- Key: SPARK-8332 URL: https://issues.apache.org/jira/browse/SPARK-8332 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: spark 1.4 hadoop 2.3.0-cdh5.0.0 Reporter: Tao Li Priority: Critical Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson I compiled the new Spark 1.4.0 version. But when I run a simple WordCount demo, it throws a NoSuchMethodError {code} java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer {code} I found out that the default fasterxml.jackson.version is 2.4.4. Is there anything wrong or a conflict with the jackson version? Or does some project Maven dependency possibly contain the wrong version of jackson? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
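One common mitigation for this class of error, assuming the conflict comes from a transitive dependency pulling in a different Jackson release (not necessarily the cause here): pin the Jackson artifacts to a single version in the application's build.sbt so that jackson-module-scala and jackson-databind agree.
{code}
// Hypothetical build.sbt override; the version should match whatever Spark ships with.
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core"    % "jackson-databind"      % "2.4.4",
  "com.fasterxml.jackson.module" %% "jackson-module-scala"  % "2.4.4"
)
{code}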
[jira] [Comment Edited] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593927#comment-14593927 ] Andrew Or edited comment on SPARK-8470 at 6/19/15 8:35 PM: --- FYI, I was able to reproduce this locally. This allowed me to conclude two things: 1. It has nothing to do with YARN specifically. 2. It is caused by some code in the hive module; I could reproduce this only with HiveContext, but not with SQLContext. Small reproduction: {code} bin/spark-submit --master local --class FunTest app.jar {code} Inside app.jar: FunTest.scala {code} object FunTest { def main(args: Array[String]): Unit = { println(Runnin' my cool class) val conf = new SparkConf().setAppName(testing) val sc = new SparkContext(conf) val sqlContext = new HiveContext(sc) val coolClasses = Seq( MyCoolClass(ast, resent, uture), MyCoolClass(mamazing, papazing, fafazing)) val df = sqlContext.createDataFrame(coolClasses) df.collect() } } {code} Inside app.jar: MyCoolClass.scala {code} case class MyCoolClass(past: String, present: String, future: String) {code} Result: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class MyCoolClass not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureClassSymbol(Mirrors.scala:90) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at FunTest$$typecreator1$1.apply(FunTest.scala:13) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:426) at FunTest$.main(FunTest.scala:13) at FunTest.main(FunTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} was (Author: andrewor14): FYI, I was able to reproduce this locally. I was able to conclude two things: 1. It has nothing to do with YARN specifically. 2. It is caused by some code in the hive module; I could reproduce this only with HiveContext, but not with SQLContext. 
Small reproduction: {code} bin/spark-submit --master local --class FunTest app.jar {code} Inside app.jar: FunTest.scala {code} object FunTest { def main(args: Array[String]): Unit = { println(Runnin' my cool class) val conf = new SparkConf().setAppName(testing) val sc = new SparkContext(conf) val sqlContext = new HiveContext(sc) val coolClasses = Seq( MyCoolClass(ast, resent, uture), MyCoolClass(mamazing, papazing, fafazing)) val df = sqlContext.createDataFrame(coolClasses) df.collect() } } {code} Inside app.jar: MyCoolClass.scala {code} case class MyCoolClass(past: String, present: String, future: String) {code} Result: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class MyCoolClass not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureClassSymbol(Mirrors.scala:90) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at FunTest$$typecreator1$1.apply(FunTest.scala:13) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at
[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593928#comment-14593928 ] Davies Liu commented on SPARK-8477: --- [~rxin] [~yuu.ishik...@gmail.com] We already have `inSet` to match the Scala API `in`; we could close this one. Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8470. Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker Fix For: 1.4.1, 1.5.0 From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with classpath [ lots and lots of paths and jars, but not the app assembly jar] not found at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at com.ipcoop.spark.sql.SqlEnv$$typecreator1$1.apply(SqlEnv.scala:87) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) {code} Another report: {code} Hi, I use spark 0.14. I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? 
Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath [file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] and parent being primordial classloader with boot classpath
[jira] [Commented] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593954#comment-14593954 ] Andrew Or commented on SPARK-8470: -- Closing this as FIXED. We will add the regression test in SPARK-8489. MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker Fix For: 1.4.1, 1.5.0 From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with classpath [ lots and lots of paths and jars, but not the app assembly jar] not found at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at com.ipcoop.spark.sql.SqlEnv$$typecreator1$1.apply(SqlEnv.scala:87) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) {code} Another report: {code} Hi, I use spark 0.14. I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? 
Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath [file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] and parent being primordial classloader with boot classpath
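The quoted snippet above lost its string quoting in the archive. A minimal reconstruction of the call with quoting restored is below; the field names and types of Record4Dim_2 are hypothetical stand-ins (the real etl.Record4Dim_2 is not shown), and running locally may not reproduce the classloader problem seen under spark-submit.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical stand-in for the reporter's class; the real etl.Record4Dim_2 is not shown.
case class Record4Dim_2(a: String, b: Double, c: Double)

object ToDFRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("SPARK-8470-repro"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val pairVarRDD = sc.parallelize(Seq(Record4Dim_2("x", 1.0, 2.0)))
    // The call that fails with MissingRequirementError on 1.4.0 when the application
    // jar is missing from the classpath seen by ScalaReflection:
    val partitionedTestDF2 = pairVarRDD.toDF("column1", "column2", "column3")
    partitionedTestDF2.show()
    sc.stop()
  }
}
{code}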
[jira] [Resolved] (SPARK-8093) Spark 1.4 branch's new JSON schema inference has changed the behavior of handling inner empty JSON object.
[ https://issues.apache.org/jira/browse/SPARK-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-8093. - Resolution: Fixed Fix Version/s: 1.4.1 Issue resolved by pull request 6799 [https://github.com/apache/spark/pull/6799] Spark 1.4 branch's new JSON schema inference has changed the behavior of handling inner empty JSON object. -- Key: SPARK-8093 URL: https://issues.apache.org/jira/browse/SPARK-8093 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Harish Butani Assignee: Nathan Howell Priority: Critical Fix For: 1.4.1 Attachments: t1.json This is similar to SPARK-3365. Sample json is attached. Code to reproduce {code} var jsonDF = read.json(/tmp/t1.json) jsonDF.write.parquet(/tmp/t1.parquet) {code} The 'integration' object is empty in the json. StackTrace: {code} Caused by: java.io.IOException: Could not read footer: java.lang.IllegalStateException: Cannot build an empty group at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152) at org.apache.spark.sql.parquet.ParquetRelation2.refresh(newParquet.scala:197) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:134) ... 69 more Caused by: java.lang.IllegalStateException: Cannot build an empty group {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
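The reproduction in the report is shell shorthand; a self-contained sketch is below. It assumes /tmp/t1.json contains records with an empty inner "integration" object (the attached t1.json is not reproduced here).

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumes /tmp/t1.json contains records like {"a": 1, "integration": {}}.
object EmptyInnerObjectRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("SPARK-8093"))
    val sqlContext = new SQLContext(sc)

    val jsonDF = sqlContext.read.json("/tmp/t1.json")
    jsonDF.printSchema()  // on 1.4.0 the empty 'integration' struct appears in the inferred schema
    // Writing that schema to Parquet fails with "Cannot build an empty group"
    jsonDF.write.parquet("/tmp/t1.parquet")
    sc.stop()
  }
}
{code}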
[jira] [Created] (SPARK-8494) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3
PJ Fanning created SPARK-8494: - Summary: ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3 Key: SPARK-8494 URL: https://issues.apache.org/jira/browse/SPARK-8494 Project: Spark Issue Type: Bug Components: Spark Core Reporter: PJ Fanning Assignee: Patrick Wendell I just wanted to document this for posterity. I had an issue when running a Spark 1.0 app locally with sbt. The issue was that if you both: 1. Reference a scala class (e.g. None) inside of a closure. 2. Run your program with 'sbt run' It throws an exception. Upgrading the scalaVersion to 2.10.4 in sbt solved this issue. Somehow scala classes were not being loaded correctly inside of the executors: Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.None$ java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
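The Test application in the description lost its string quoting, the arrow in the map closure, and a closing brace in the archive. A compilable version under those assumptions:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("Test")
    val sc = new SparkContext(conf)
    // Referencing a Scala library class (Some/None) inside the closure is what
    // surfaces the ClassNotFoundException under 'sbt run' in the report above.
    sc.makeRDD(1 to 1000, 10).map(x => Some(x)).count()
    sc.stop()
  }
}
{code}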
[jira] [Updated] (SPARK-8486) SIFT/SURF Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Description: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). was: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). SIFT/SURF Feature Transformer - Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
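The proposed transformer maps one image to a collection of local descriptors. A rough skeleton against the ml pipeline API is below; SIFTTransformer and computeSift are hypothetical names, the descriptor computation is stubbed out, and depending on the Spark version a copy(extra: ParamMap) override may also be required.

{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, DoubleType}

// Hypothetical sketch, not an existing Spark class. Input: an image as rows of pixel
// values; output: one descriptor (e.g. a 128-d orientation histogram) per keypoint.
class SIFTTransformer(override val uid: String)
  extends UnaryTransformer[Seq[Seq[Double]], Seq[Seq[Double]], SIFTTransformer] {

  def this() = this(Identifiable.randomUID("sift"))

  override protected def createTransformFunc: Seq[Seq[Double]] => Seq[Seq[Double]] =
    image => computeSift(image)

  override protected def outputDataType: DataType =
    ArrayType(ArrayType(DoubleType, containsNull = false), containsNull = false)

  // Placeholder: a real implementation would build the Gaussian scale space, find
  // extrema in the Difference-of-Gaussian pyramid, and emit orientation histograms.
  private def computeSift(image: Seq[Seq[Double]]): Seq[Seq[Double]] = Seq.empty
}
{code}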
[jira] [Commented] (SPARK-8420) Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0
[ https://issues.apache.org/jira/browse/SPARK-8420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593915#comment-14593915 ] Michael Armbrust commented on SPARK-8420: - This is no longer true as we now special case equality. Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0 -- Key: SPARK-8420 URL: https://issues.apache.org/jira/browse/SPARK-8420 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Justin Yip Assignee: Michael Armbrust Priority: Blocker Labels: releasenotes I am trying out 1.4.0 and notice there are some differences in behavior with Timestamp between 1.3.1 and 1.4.0. In 1.3.1, I can compare a Timestamp with string. {code} scala val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf(2015-01-01 00:00:00)), (2, Timestamp.valueOf(2014-01-01 00:00:00 ... scala df.filter($_2 = 2014-06-01).show ... _1 _2 2 2014-01-01 00:00:... {code} However, in 1.4.0, the filter is always false: {code} scala val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf(2015-01-01 00:00:00)), (2, Timestamp.valueOf(2014-01-01 00:00:00 df: org.apache.spark.sql.DataFrame = [_1: int, _2: timestamp] scala df.filter($_2 = 2014-06-01).show +--+--+ |_1|_2| +--+--+ +--+--+ {code} Not sure if that is intended, but I cannot find any doc mentioning these inconsistencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
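Whatever the resolution on implicit coercion, an explicit cast keeps the comparison unambiguous on both 1.3 and 1.4. A sketch for the shell, assuming the reporter's sqlContext and data; the exact comparison operator was lost in the archived snippet, so <= is used here only for illustration:

{code}
import java.sql.Timestamp
import org.apache.spark.sql.functions.lit

val df = sqlContext.createDataFrame(Seq(
  (1, Timestamp.valueOf("2015-01-01 00:00:00")),
  (2, Timestamp.valueOf("2014-01-01 00:00:00"))))

// Cast the string literal explicitly instead of relying on string/timestamp coercion,
// whose behavior differs between 1.3.1 and 1.4.0.
df.filter(df("_2") <= lit("2014-06-01 00:00:00").cast("timestamp")).show()
{code}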
[jira] [Comment Edited] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593895#comment-14593895 ] Yu Ishikawa edited comment on SPARK-8477 at 6/19/15 8:27 PM: - Oh, I'm sorry. I didn't know we have already had {{inSet}}. It seems that the function of {{inSet}} is almost like that of {{in}}. was (Author: yuu.ishik...@gmail.com): Oh, I'm sorry. I didn't know we have already have {{inSet}}. It seems that the function of {{inSet}} is almost like that of {{in}}. Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8093) Spark 1.4 branch's new JSON schema inference has changed the behavior of handling inner empty JSON object.
[ https://issues.apache.org/jira/browse/SPARK-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594097#comment-14594097 ] Yin Huai commented on SPARK-8093: - With https://github.com/apache/spark/pull/6799, we have changed the behavior back to Spark 1.3's behavior. Empty inner structs will not be in the schema. Spark 1.4 branch's new JSON schema inference has changed the behavior of handling inner empty JSON object. -- Key: SPARK-8093 URL: https://issues.apache.org/jira/browse/SPARK-8093 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Harish Butani Assignee: Nathan Howell Priority: Critical Fix For: 1.4.1, 1.5.0 Attachments: t1.json This is similar to SPARK-3365. Sample json is attached. Code to reproduce {code} var jsonDF = read.json(/tmp/t1.json) jsonDF.write.parquet(/tmp/t1.parquet) {code} The 'integration' object is empty in the json. StackTrace: {code} Caused by: java.io.IOException: Could not read footer: java.lang.IllegalStateException: Cannot build an empty group at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152) at org.apache.spark.sql.parquet.ParquetRelation2.refresh(newParquet.scala:197) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:134) ... 69 more Caused by: java.lang.IllegalStateException: Cannot build an empty group {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8485) Feature transformers for image processing
Feynman Liang created SPARK-8485: Summary: Feature transformers for image processing Key: SPARK-8485 URL: https://issues.apache.org/jira/browse/SPARK-8485 Project: Spark Issue Type: New Feature Reporter: Feynman Liang Many transformers exist to convert from image representations into more compact descriptors amenable to standard ML techniques. We should implement these transformers in Spark to support machine learning on richer content types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
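As a sketch of how such image transformers would compose once implemented, here is a hypothetical pipeline built around the SIFTTransformer skeleton from the SPARK-8486 entry above; imagesDF is an assumed DataFrame with an "image" column, and none of the stage or column names are existing Spark classes.

{code}
import org.apache.spark.ml.Pipeline

// All stage and column names here are illustrative.
val sift = new SIFTTransformer().setInputCol("image").setOutputCol("descriptors")

val pipeline = new Pipeline().setStages(Array(sift))
val model = pipeline.fit(imagesDF)          // imagesDF: DataFrame with an "image" column
val described = model.transform(imagesDF)   // adds a "descriptors" column
{code}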
[jira] [Updated] (SPARK-8486) SIFT/SURF Feature Extractor
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Description: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). was: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). SIFT/SURF Feature Extractor --- Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593927#comment-14593927 ] Andrew Or commented on SPARK-8470: -- FYI, I was able to reproduce this locally. I was able to conclude two things: 1. It has nothing to do with YARN specifically. 2. It is caused by some code in the hive module; I could reproduce this only with HiveContext, but not with SQLContext. Small reproduction: {code} bin/spark-submit --master local --class FunTest app.jar {code} Inside app.jar: FunTest.scala {code} object FunTest { def main(args: Array[String]): Unit = { println(Runnin' my cool class) val conf = new SparkConf().setAppName(testing) val sc = new SparkContext(conf) val sqlContext = new HiveContext(sc) val coolClasses = Seq( MyCoolClass(ast, resent, uture), MyCoolClass(mamazing, papazing, fafazing)) val df = sqlContext.createDataFrame(coolClasses) df.collect() } } {code} Inside app.jar: MyCoolClass.scala {code} case class MyCoolClass(past: String, present: String, future: String) {code} Result: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class MyCoolClass not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureClassSymbol(Mirrors.scala:90) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at FunTest$$typecreator1$1.apply(FunTest.scala:13) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:426) at FunTest$.main(FunTest.scala:13) at FunTest.main(FunTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 
2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with
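The reproduction above lost its string quoting in the archive, and in the original it is split across FunTest.scala and MyCoolClass.scala inside app.jar, run with bin/spark-submit --master local --class FunTest app.jar. A compilable single-file reconstruction under those assumptions; the three string values passed to the first MyCoolClass are guesses, since they were mangled in the archive:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class MyCoolClass(past: String, present: String, future: String)

object FunTest {
  def main(args: Array[String]): Unit = {
    println("Runnin' my cool class")
    val conf = new SparkConf().setAppName("testing")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    // Creating a DataFrame from user case-class instances is what trips the
    // MissingRequirementError when the reflection classloader does not see app.jar.
    val coolClasses = Seq(
      MyCoolClass("past", "present", "future"),
      MyCoolClass("mamazing", "papazing", "fafazing"))
    val df = sqlContext.createDataFrame(coolClasses)
    df.collect()
  }
}
{code}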
[jira] [Created] (SPARK-8487) Update reduceByKeyAndWindow docs to highlight that filtering Function must be used
Tathagata Das created SPARK-8487: Summary: Update reduceByKeyAndWindow docs to highlight that filtering Function must be used Key: SPARK-8487 URL: https://issues.apache.org/jira/browse/SPARK-8487 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
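For reference, the variant of reduceByKeyAndWindow the doc update targets is the one taking an inverse reduce function; without a filter function, keys never leave the state even after their windowed value decays to zero. A minimal sketch with an assumed source, durations, and checkpoint path:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setMaster("local[2]").setAppName("windowed-counts"), Seconds(1))
ssc.checkpoint("/tmp/windowed-counts-checkpoint")  // required by the inverse-reduce variant

val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,                     // add counts entering the window
    (a: Int, b: Int) => a - b,                     // subtract counts leaving the window
    Seconds(30), Seconds(10),
    numPartitions = 2,
    filterFunc = { case (_, count) => count > 0 }  // drop keys whose count fell to zero
  )

counts.print()
ssc.start()
ssc.awaitTermination()
{code}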
[jira] [Closed] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8470. Resolution: Duplicate MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker Fix For: 1.4.1, 1.5.0 From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with classpath [ lots and lots of paths and jars, but not the app assembly jar] not found at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at com.ipcoop.spark.sql.SqlEnv$$typecreator1$1.apply(SqlEnv.scala:87) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) {code} Another report: {code} Hi, I use spark 0.14. I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? 
Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath [file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] and parent being primordial classloader with boot classpath
[jira] [Reopened] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-8470: -- MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker Fix For: 1.4.1, 1.5.0 From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with classpath [ lots and lots of paths and jars, but not the app assembly jar] not found at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at com.ipcoop.spark.sql.SqlEnv$$typecreator1$1.apply(SqlEnv.scala:87) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) {code} Another report: {code} Hi, I use spark 0.14. I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? 
Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath [file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] and parent being primordial classloader with boot classpath
[jira] [Created] (SPARK-8490) SURF Feature Transformer
Feynman Liang created SPARK-8490: Summary: SURF Feature Transformer Key: SPARK-8490 URL: https://issues.apache.org/jira/browse/SPARK-8490 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Speeded up robust features (SURF) (Bay et al, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf) is an image descriptor transform very similar to SIFT (SPARK-8486) but can be computed more efficiently. One key difference is using box filters (Difference of Boxes) to approximate the Laplacian of the Gaussian. We can implement SURF in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SURF transformer should output an Array[Array[Numeric]] of the SURF features for the provided image. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594005#comment-14594005 ] Apache Spark commented on SPARK-8470: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6909 MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker Fix For: 1.4.1, 1.5.0 From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with classpath [ lots and lots of paths and jars, but not the app assembly jar] not found at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at com.ipcoop.spark.sql.SqlEnv$$typecreator1$1.apply(SqlEnv.scala:87) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) {code} Another report: {code} Hi, I use spark 0.14. I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? 
Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath [file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] and parent being primordial classloader with boot classpath
[jira] [Updated] (SPARK-8494) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3
[ https://issues.apache.org/jira/browse/SPARK-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated SPARK-8494: -- Description: I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a ClassNotFoundException otherwise. Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.collection.immutable.Range java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} {code} name := spark-test-case version := 1.0 scalaVersion := 2.10.4 resolvers += spray repo at http://repo.spray.io; resolvers += Scalaz Bintray Repo at https://dl.bintray.com/scalaz/releases; val akkaVersion = 2.3.11 val sprayVersion = 1.3.3 libraryDependencies ++= Seq( com.h2database % h2 % 1.4.187, com.typesafe.akka %% akka-actor % akkaVersion, com.typesafe.akka %% akka-slf4j % akkaVersion, ch.qos.logback % logback-classic % 1.0.13, io.spray %% spray-can% sprayVersion, io.spray %% spray-routing% sprayVersion, io.spray %% spray-json % 1.3.1, com.databricks %% spark-csv% 1.0.3, org.specs2 %% specs2 % 2.4.17 % test, org.specs2 %% specs2-junit % 2.4.17 % test, io.spray %% spray-testkit% sprayVersion % test, com.typesafe.akka %% akka-testkit % akkaVersion% test, junit % junit% 4.12 % test ) scalacOptions ++= Seq( -unchecked, -deprecation, -Xlint, -Ywarn-dead-code, -language:_, -target:jvm-1.7, -encoding, UTF-8 ) testOptions += Tests.Argument(TestFrameworks.JUnit, -v) {code} was: I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. 
If I remove the spray 1.3.3 jars, the test case passes but has a SPARK-1923 Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.None$ java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3 --- Key: SPARK-8494 URL: https://issues.apache.org/jira/browse/SPARK-8494 Project: Spark Issue Type: Bug Components: Spark Core Reporter: PJ Fanning
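The build.sbt embedded in the description above lost its string quoting and line breaks in the archive. A reconstruction of the same settings with quoting restored (versions exactly as reported; the Spark dependency itself is provided via the reporter's locally built assembly jar):

{code}
name := "spark-test-case"

version := "1.0"

scalaVersion := "2.10.4"

resolvers += "spray repo" at "http://repo.spray.io"

resolvers += "Scalaz Bintray Repo" at "https://dl.bintray.com/scalaz/releases"

val akkaVersion  = "2.3.11"
val sprayVersion = "1.3.3"

libraryDependencies ++= Seq(
  "com.h2database"    %  "h2"              % "1.4.187",
  "com.typesafe.akka" %% "akka-actor"      % akkaVersion,
  "com.typesafe.akka" %% "akka-slf4j"      % akkaVersion,
  "ch.qos.logback"    %  "logback-classic" % "1.0.13",
  "io.spray"          %% "spray-can"       % sprayVersion,
  "io.spray"          %% "spray-routing"   % sprayVersion,
  "io.spray"          %% "spray-json"      % "1.3.1",
  "com.databricks"    %% "spark-csv"       % "1.0.3",
  "org.specs2"        %% "specs2"          % "2.4.17"     % "test",
  "org.specs2"        %% "specs2-junit"    % "2.4.17"     % "test",
  "io.spray"          %% "spray-testkit"   % sprayVersion % "test",
  "com.typesafe.akka" %% "akka-testkit"    % akkaVersion  % "test",
  "junit"             %  "junit"           % "4.12"       % "test"
)

scalacOptions ++= Seq("-unchecked", "-deprecation", "-Xlint", "-Ywarn-dead-code",
  "-language:_", "-target:jvm-1.7", "-encoding", "UTF-8")

testOptions += Tests.Argument(TestFrameworks.JUnit, "-v")
{code}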
[jira] [Commented] (SPARK-8492) Support BinaryType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594131#comment-14594131 ] Apache Spark commented on SPARK-8492: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/6911 Support BinaryType in UnsafeRow --- Key: SPARK-8492 URL: https://issues.apache.org/jira/browse/SPARK-8492 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8492) Support BinaryType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8492: --- Assignee: Davies Liu (was: Apache Spark) Support BinaryType in UnsafeRow --- Key: SPARK-8492 URL: https://issues.apache.org/jira/browse/SPARK-8492 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593888#comment-14593888 ] Yu Ishikawa commented on SPARK-8477: Should I rename the upper case {{In}} to {{inSet}}? Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8486) SIFT/SURF Feature Extractor
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Description: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (aka SURF) as described by Lowe can be even further improved using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). was: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian as described by Lowe can be even further improved using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). SIFT/SURF Feature Extractor --- Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (aka SURF) as described by Lowe can be even further improved using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8486) SIFT/SURF Feature Extractor
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Summary: SIFT/SURF Feature Extractor (was: SIFT Feature Extractor) SIFT/SURF Feature Extractor --- Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian as described by Lowe can be even further improved using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8486) SIFT/SURF Feature Extractor
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Description: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). was: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (aka SURF) as described by Lowe can be even further improved using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). SIFT/SURF Feature Extractor --- Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a [[org.apache.spark.ml.Transformer]]. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8486) SIFT/SURF Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Description: Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). was: Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). SIFT/SURF Feature Transformer - Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8486) SIFT Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Summary: SIFT Feature Transformer (was: SIFT/SURF/DAISY Feature Transformer) SIFT Feature Transformer Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Furthermore, the implementation should support various approximations for approximating the Laplacian of Gaussian using Difference of Gaussian (as described by Lowe). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8486) SIFT/SURF/DAISY Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Description: Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Furthermore, the implementation should support various approximations for approximating the Laplacian of Gaussian using Difference of Gaussian (as described by Lowe). was: Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Furthermore, the implementation should support various approximations for approximating the Laplacian of Gaussian. In addition to approximating using Difference of Gaussian (as described by Lowe), we should support * SURF approximation using box filters (Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf) should also be supported. * DAISY SIFT/SURF/DAISY Feature Transformer --- Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a scale and rotation invariant method to transform images into matrices describing local features. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an ArrayArray[[Numeric]] of the SIFT features for the provided image. The implementation should support computation of SIFT at predefined interest points, every kth pixel, and densely (over all pixels). Furthermore, the implementation should support various approximations for approximating the Laplacian of Gaussian using Difference of Gaussian (as described by Lowe). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa closed SPARK-8477. -- Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8492) Support BinaryType in UnsafeRow
Davies Liu created SPARK-8492: - Summary: Support BinaryType in UnsafeRow Key: SPARK-8492 URL: https://issues.apache.org/jira/browse/SPARK-8492 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8491) DAISY Feature Transformer
Feynman Liang created SPARK-8491: Summary: DAISY Feature Transformer Key: SPARK-8491 URL: https://issues.apache.org/jira/browse/SPARK-8491 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang DAISY is another local image descriptor utilizing histograms of local orientation, similar to SIFT. However, one key difference is that the weighted sum of gradient norms used in SIFT's orientation assignment is replaced by convolution with Gaussian kernels. This provides a significant speedup in computing dense descriptors. We can implement DAISY in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the DAISY transformer should output an Array[Array[Numeric]] of the DAISY features for the provided image. The convolution operation can leverage GPU parallelism for efficiency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
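As a rough illustration of the convolution-based idea described above, the sketch below builds DAISY-style orientation maps (the positive part of the image gradient projected onto each of H directions); in the full descriptor each map would then be convolved with Gaussian kernels of increasing size, which is omitted here. All names and the Array[Array[Double]] image type are assumptions for illustration, not an existing Spark API.
{code}
object DaisyOrientationMaps {
  type Image = Array[Array[Double]]

  // Central-difference gradient at (r, c), clamped at the image borders.
  private def gradient(img: Image, r: Int, c: Int): (Double, Double) = {
    val rows = img.length
    val cols = img(0).length
    val dx = img(r)(math.min(c + 1, cols - 1)) - img(r)(math.max(c - 1, 0))
    val dy = img(math.min(r + 1, rows - 1))(c) - img(math.max(r - 1, 0))(c)
    (dx, dy)
  }

  // One map per orientation o: G_o(r, c) = max(0, gradient projected onto direction o).
  def orientationMaps(img: Image, numOrientations: Int): Array[Image] = {
    Array.tabulate(numOrientations) { o =>
      val theta = 2 * math.Pi * o / numOrientations
      img.indices.map { r =>
        img(0).indices.map { c =>
          val (dx, dy) = gradient(img, r, c)
          math.max(0.0, dx * math.cos(theta) + dy * math.sin(theta))
        }.toArray
      }.toArray
    }
  }
}
{code}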
[jira] [Updated] (SPARK-8093) Spark 1.4 branch's new JSON schema inference has changed the behavior of handling inner empty JSON object.
[ https://issues.apache.org/jira/browse/SPARK-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8093: Fix Version/s: 1.5.0 Spark 1.4 branch's new JSON schema inference has changed the behavior of handling inner empty JSON object. -- Key: SPARK-8093 URL: https://issues.apache.org/jira/browse/SPARK-8093 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Harish Butani Assignee: Nathan Howell Priority: Critical Fix For: 1.4.1, 1.5.0 Attachments: t1.json This is similar to SPARK-3365. Sample json is attached. Code to reproduce {code} var jsonDF = read.json("/tmp/t1.json") jsonDF.write.parquet("/tmp/t1.parquet") {code} The 'integration' object is empty in the json. StackTrace: {code} Caused by: java.io.IOException: Could not read footer: java.lang.IllegalStateException: Cannot build an empty group at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154) at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152) at org.apache.spark.sql.parquet.ParquetRelation2.refresh(newParquet.scala:197) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:134) ... 69 more Caused by: java.lang.IllegalStateException: Cannot build an empty group {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
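Until the inference change is fixed, one possible workaround (an assumption based on the stack trace, not something stated in this ticket) is to drop the empty inner object before writing, since Parquet cannot represent an empty group:
{code}
// Hypothetical workaround sketch: "integration" is the empty inner object mentioned above.
val jsonDF = sqlContext.read.json("/tmp/t1.json")
jsonDF.drop("integration").write.parquet("/tmp/t1.parquet")
{code}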
[jira] [Updated] (SPARK-8494) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3
[ https://issues.apache.org/jira/browse/SPARK-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PJ Fanning updated SPARK-8494: -- Description: I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a ClassNotFoundException otherwise. I have a spark-assembly jar built using Spark 1.3.2-SNAPSHOT. Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.collection.immutable.Range java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} {code} name := spark-test-case version := 1.0 scalaVersion := 2.10.4 resolvers += spray repo at http://repo.spray.io; resolvers += Scalaz Bintray Repo at https://dl.bintray.com/scalaz/releases; val akkaVersion = 2.3.11 val sprayVersion = 1.3.3 libraryDependencies ++= Seq( com.h2database % h2 % 1.4.187, com.typesafe.akka %% akka-actor % akkaVersion, com.typesafe.akka %% akka-slf4j % akkaVersion, ch.qos.logback % logback-classic % 1.0.13, io.spray %% spray-can% sprayVersion, io.spray %% spray-routing% sprayVersion, io.spray %% spray-json % 1.3.1, com.databricks %% spark-csv% 1.0.3, org.specs2 %% specs2 % 2.4.17 % test, org.specs2 %% specs2-junit % 2.4.17 % test, io.spray %% spray-testkit% sprayVersion % test, com.typesafe.akka %% akka-testkit % akkaVersion% test, junit % junit% 4.12 % test ) scalacOptions ++= Seq( -unchecked, -deprecation, -Xlint, -Ywarn-dead-code, -language:_, -target:jvm-1.7, -encoding, UTF-8 ) testOptions += Tests.Argument(TestFrameworks.JUnit, -v) {code} was: I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a ClassNotFoundException otherwise. 
Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.collection.immutable.Range java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} {code} name := spark-test-case version := 1.0 scalaVersion := 2.10.4 resolvers += spray repo at http://repo.spray.io; resolvers += Scalaz Bintray Repo at https://dl.bintray.com/scalaz/releases; val akkaVersion = 2.3.11 val sprayVersion = 1.3.3 libraryDependencies
[jira] [Commented] (SPARK-8494) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3
[ https://issues.apache.org/jira/browse/SPARK-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594132#comment-14594132 ] PJ Fanning commented on SPARK-8494: --- [~pwendell] Apologies about the JIRA being assigned to you. I cloned SPARK-1923 and now can't change the Assignee. ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3 --- Key: SPARK-8494 URL: https://issues.apache.org/jira/browse/SPARK-8494 Project: Spark Issue Type: Bug Components: Spark Core Reporter: PJ Fanning Assignee: Patrick Wendell I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a ClassNotFoundException otherwise. Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.collection.immutable.Range java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} {code} name := spark-test-case version := 1.0 scalaVersion := 2.10.4 resolvers += spray repo at http://repo.spray.io; resolvers += Scalaz Bintray Repo at https://dl.bintray.com/scalaz/releases; val akkaVersion = 2.3.11 val sprayVersion = 1.3.3 libraryDependencies ++= Seq( com.h2database % h2 % 1.4.187, com.typesafe.akka %% akka-actor % akkaVersion, com.typesafe.akka %% akka-slf4j % akkaVersion, ch.qos.logback % logback-classic % 1.0.13, io.spray %% spray-can% sprayVersion, io.spray %% spray-routing% sprayVersion, io.spray %% spray-json % 1.3.1, com.databricks %% spark-csv% 1.0.3, org.specs2 %% specs2 % 2.4.17 % test, org.specs2 %% specs2-junit % 2.4.17 % test, io.spray %% spray-testkit% sprayVersion % test, com.typesafe.akka %% akka-testkit % akkaVersion% test, junit % junit% 4.12 % test ) scalacOptions ++= Seq( -unchecked, -deprecation, -Xlint, -Ywarn-dead-code, -language:_, -target:jvm-1.7, -encoding, UTF-8 ) testOptions += Tests.Argument(TestFrameworks.JUnit, -v) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
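A workaround that is often suggested for this kind of ClassNotFoundException when launching Spark applications from sbt (an assumption here, not something confirmed in this ticket) is to run the application in a forked JVM so that sbt's own classloader is not involved:
{code}
// build.sbt (sketch)
fork := true
// or, to fork only when running the application:
fork in run := true
{code}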
[jira] [Assigned] (SPARK-8492) Support BinaryType in UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8492: --- Assignee: Apache Spark (was: Davies Liu) Support BinaryType in UnsafeRow --- Key: SPARK-8492 URL: https://issues.apache.org/jira/browse/SPARK-8492 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7810) rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used
[ https://issues.apache.org/jira/browse/SPARK-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ai He updated SPARK-7810: - Issue Type: Improvement (was: Bug) rdd.py _load_from_socket cannot load data from jvm socket if ipv6 is used --- Key: SPARK-7810 URL: https://issues.apache.org/jira/browse/SPARK-7810 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Method _load_from_socket in rdd.py cannot load data from the JVM socket if IPv6 is used. The current method only works well with IPv4. The new modification should work with both protocols. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8486) SIFT/SURF Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8486: - Summary: SIFT/SURF Feature Transformer (was: SIFT/SURF Feature Extractor) SIFT/SURF Feature Transformer - Key: SPARK-8486 URL: https://issues.apache.org/jira/browse/SPARK-8486 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Scale invariant feature transform (SIFT) is a method to transform images into dense vectors describing local features which are invariant to scale and rotation. (Lowe, IJCV 2004, http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf) We can implement SIFT in Spark ML pipelines as a org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Numeric] of the SIFT features present in the image. Depending on performance, approximating Laplacian of Gaussian by Difference of Gaussian (traditional SIFT) as described by Lowe can be even further improved using box filters (aka SURF, see Bay, ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593910#comment-14593910 ] Yu Ishikawa commented on SPARK-8477: [~rxin] Should we support not only `inSet` but also the so-called `in` operator in Python? I think it is just an alias of {{inSet}}. https://github.com/apache/spark/blob/master/python%2Fpyspark%2Fsql%2Fcolumn.py#L248 Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8484: - Assignee: Martin Zapletal Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Martin Zapletal Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
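The selection loop the description refers to can be sketched against the existing 1.4 Pipelines API as follows; the eventual TrainValidationSplit API may look different, and the estimator, parameter grid, and `training` DataFrame (with the usual label/features columns) are assumptions for illustration.
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder

// training: DataFrame with "label" and "features" columns (assumed to exist)
val lr = new LogisticRegression()
val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()
val evaluator = new BinaryClassificationEvaluator()

// Single random split instead of k folds -- cheaper than CrossValidator.
val Array(train, validation) = training.randomSplit(Array(0.75, 0.25), seed = 11L)
val models = grid.map(paramMap => lr.fit(train, paramMap))
val metrics = models.map(model => evaluator.evaluate(model.transform(validation)))
val bestModel = models(metrics.indexOf(metrics.max))
{code}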
[jira] [Created] (SPARK-8488) HOG Feature Transformer
Feynman Liang created SPARK-8488: Summary: HOG Feature Transformer Key: SPARK-8488 URL: https://issues.apache.org/jira/browse/SPARK-8488 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Histogram of oriented gradients (HOG) is a method utilizing local orientation (gradients and edges) to transform images into dense image descriptors (Dalal and Triggs, CVPR 2005, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). HOG in Spark ML pipelines can be implemented as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the HOG transformer should output an Array[Array[Numeric]] of the HOG features for the provided image. HOG and SIFT are similar in that they both represent images using local orientation histograms. In contrast to SIFT, however, HOG uses overlapping spatial blocks and is computed densely across all pixels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
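For illustration, the core HOG building block (a magnitude-weighted histogram of gradient orientations for a single cell) can be sketched in plain Scala as below; a full HOG descriptor would tile the image into cells and normalize over overlapping blocks. Names and the Array[Array[Double]] image type are assumptions, not an existing Spark API.
{code}
object HogCell {
  type Image = Array[Array[Double]]

  def cellHistogram(img: Image, rowStart: Int, colStart: Int,
                    cellSize: Int, numBins: Int): Array[Double] = {
    val hist = new Array[Double](numBins)
    val rows = img.length
    val cols = img(0).length
    for {
      r <- rowStart until math.min(rowStart + cellSize, rows)
      c <- colStart until math.min(colStart + cellSize, cols)
    } {
      // Clamped central differences.
      val dx = img(r)(math.min(c + 1, cols - 1)) - img(r)(math.max(c - 1, 0))
      val dy = img(math.min(r + 1, rows - 1))(c) - img(math.max(r - 1, 0))(c)
      val magnitude = math.sqrt(dx * dx + dy * dy)
      // Unsigned orientation in [0, Pi), binned into numBins buckets, weighted by magnitude.
      val angle = (math.atan2(dy, dx) + math.Pi) % math.Pi
      val bin = math.min((angle / math.Pi * numBins).toInt, numBins - 1)
      hist(bin) += magnitude
    }
    hist
  }
}
{code}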
[jira] [Commented] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593948#comment-14593948 ] Andrew Or commented on SPARK-8470: -- Update: I verified that this is actually already fixed through https://github.com/apache/spark/pull/6891. It is ultimately caused by the same issue as SPARK-8368! I will add a regression test shortly. MissingRequirementError for ScalaReflection on user classes --- Key: SPARK-8470 URL: https://issues.apache.org/jira/browse/SPARK-8470 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Michael Armbrust Assignee: Andrew Or Priority: Blocker From the mailing list: {code} Since upgrading to Spark 1.4, I'm getting a scala.reflect.internal.MissingRequirementError when creating a DataFrame from an RDD. The error references a case class in the application (the RDD's type parameter), which has been verified to be present. Items of note: 1) This is running on AWS EMR (YARN). I do not get this error running locally (standalone). 2) Reverting to Spark 1.3.1 makes the problem go away 3) The jar file containing the referenced class (the app assembly jar) is not listed in the classpath expansion dumped in the error message. I have seen SPARK-5281, and am guessing that this is the root cause, especially since the code added there is involved in the stacktrace. That said, my grasp on scala reflection isn't strong enough to make sense of the change to say for sure. It certainly looks, though, that in this scenario the current thread's context classloader may not be what we think it is (given #3 above). Any ideas? App code: def registerTable[A : Product : TypeTag](name: String, rdd: RDD[A])(implicit hc: HiveContext) = { val df = hc.createDataFrame(rdd) df.registerTempTable(name) } Stack trace: scala.reflect.internal.MissingRequirementError: class comMyClass in JavaMirror with sun.misc.Launcher$AppClassLoader@d16e5d6 of type class sun.misc.Launcher$AppClassLoader with classpath [ lots and lots of paths and jars, but not the app assembly jar] not found at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at com.ipcoop.spark.sql.SqlEnv$$typecreator1$1.apply(SqlEnv.scala:87) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:410) {code} Another report: {code} Hi, I use spark 0.14. I tried to create dataframe from RDD below, but got scala.reflect.internal.MissingRequirementError val partitionedTestDF2 = pairVarRDD.toDF(column1,column2,column3) //pairVarRDD is RDD[Record4Dim_2], and Record4Dim_2 is a Case Class How can I fix this? 
Exception in thread main scala.reflect.internal.MissingRequirementError: class etl.Record4Dim_2 in JavaMirror with sun.misc.Launcher$AppClassLoader@30177039 of type class sun.misc.Launcher$AppClassLoader with classpath [file:/local/spark140/conf/,file:/local/spark140/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.6.0.jar,file:/local/spark140/lib/datanucleus-core-3.2.10.jar,file:/local/spark140/lib/datanucleus-rdbms-3.2.9.jar,file:/local/spark140/lib/datanucleus-api-jdo-3.2.6.jar,file:/etc/hadoop/conf/] and parent being sun.misc.Launcher$ExtClassLoader@52c8c6d9 of type class sun.misc.Launcher$ExtClassLoader with classpath [file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunec.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunjce_provider.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/sunpkcs11.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/zipfs.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/localedata.jar,file:/usr/jdk64/jdk1.7.0_67/jre/lib/ext/dnsns.jar] and parent being primordial classloader with boot classpath
[jira] [Updated] (SPARK-8420) Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0
[ https://issues.apache.org/jira/browse/SPARK-8420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-8420: Shepherd: Yin Huai Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0 -- Key: SPARK-8420 URL: https://issues.apache.org/jira/browse/SPARK-8420 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Justin Yip Assignee: Michael Armbrust Priority: Blocker Labels: releasenotes I am trying out 1.4.0 and notice there are some differences in behavior with Timestamp between 1.3.1 and 1.4.0. In 1.3.1, I can compare a Timestamp with a string. {code} scala> val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf("2015-01-01 00:00:00")), (2, Timestamp.valueOf("2014-01-01 00:00:00")))) ... scala> df.filter($"_2" <= "2014-06-01").show ... _1 _2 2 2014-01-01 00:00:... {code} However, in 1.4.0, the filter is always false: {code} scala> val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf("2015-01-01 00:00:00")), (2, Timestamp.valueOf("2014-01-01 00:00:00")))) df: org.apache.spark.sql.DataFrame = [_1: int, _2: timestamp] scala> df.filter($"_2" <= "2014-06-01").show +--+--+ |_1|_2| +--+--+ +--+--+ {code} Not sure if that is intended, but I cannot find any doc mentioning these inconsistencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
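A workaround that keeps the comparison unambiguous in both versions (a sketch, not the fix being merged for this ticket) is to cast the string explicitly, continuing the spark-shell session above:
{code}
import org.apache.spark.sql.functions.lit

df.filter($"_2" <= lit("2014-06-01").cast("timestamp")).show()
{code}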
[jira] [Updated] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8477: -- Fix Version/s: 1.3.0 Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8470) MissingRequirementError for ScalaReflection on user classes
[ https://issues.apache.org/jira/browse/SPARK-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593927#comment-14593927 ] Andrew Or edited comment on SPARK-8470 at 6/19/15 8:36 PM: --- FYI, I was able to reproduce this locally. This allowed me to conclude two things: 1. It has nothing to do with YARN specifically. 2. It is caused by some code in the hive module; I could reproduce this only with HiveContext, but not with SQLContext. Small reproduction: {code} bin/spark-submit --master local --class FunTest app.jar {code} Inside app.jar: FunTest.scala {code} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.hive.HiveContext object FunTest { def main(args: Array[String]): Unit = { println(Runnin' my cool class) val conf = new SparkConf().setAppName(testing) val sc = new SparkContext(conf) val sqlContext = new HiveContext(sc) val coolClasses = Seq( MyCoolClass(ast, resent, uture), MyCoolClass(mamazing, papazing, fafazing)) val df = sqlContext.createDataFrame(coolClasses) df.collect() } } {code} Inside app.jar: MyCoolClass.scala {code} case class MyCoolClass(past: String, present: String, future: String) {code} Result: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class MyCoolClass not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureClassSymbol(Mirrors.scala:90) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at FunTest$$typecreator1$1.apply(FunTest.scala:13) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:71) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28) at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:426) at FunTest$.main(FunTest.scala:13) at FunTest.main(FunTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} was (Author: andrewor14): FYI, I was able to reproduce this locally. This allowed me to conclude two things: 1. It has nothing to do with YARN specifically. 2. It is caused by some code in the hive module; I could reproduce this only with HiveContext, but not with SQLContext. 
Small reproduction: {code} bin/spark-submit --master local --class FunTest app.jar {code} Inside app.jar: FunTest.scala {code} object FunTest { def main(args: Array[String]): Unit = { println(Runnin' my cool class) val conf = new SparkConf().setAppName(testing) val sc = new SparkContext(conf) val sqlContext = new HiveContext(sc) val coolClasses = Seq( MyCoolClass(ast, resent, uture), MyCoolClass(mamazing, papazing, fafazing)) val df = sqlContext.createDataFrame(coolClasses) df.collect() } } {code} Inside app.jar: MyCoolClass.scala {code} case class MyCoolClass(past: String, present: String, future: String) {code} Result: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class MyCoolClass not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureClassSymbol(Mirrors.scala:90) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at FunTest$$typecreator1$1.apply(FunTest.scala:13) at
[jira] [Resolved] (SPARK-8477) Add in operator to DataFrame Column in Python
[ https://issues.apache.org/jira/browse/SPARK-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8477. --- Resolution: Implemented Target Version/s: 1.3.0 (was: 1.5.0) Add in operator to DataFrame Column in Python - Key: SPARK-8477 URL: https://issues.apache.org/jira/browse/SPARK-8477 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8489) Add regression tests for SPARK-8470
Andrew Or created SPARK-8489: Summary: Add regression tests for SPARK-8470 Key: SPARK-8489 URL: https://issues.apache.org/jira/browse/SPARK-8489 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical See SPARK-8470 for more detail. Basically the Spark Hive code silently overwrites the context class loader populated in SparkSubmit, resulting in certain classes missing when we do reflection in `SQLContext#createDataFrame`. That issue is already resolved in https://github.com/apache/spark/pull/6891, but we should add a regression test for the specific manifestation of the bug in SPARK-8470. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8485) Feature transformers for image processing
[ https://issues.apache.org/jira/browse/SPARK-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593990#comment-14593990 ] Sean Owen commented on SPARK-8485: -- I think that before you opened all these JIRAs you should have established whether this is a fit for MLlib. Please read: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Some of these are of enough use, maybe, to be included, but they can start in a separate repo. Some I am not sure about myself. Please let's back up before opening more Feature transformers for image processing - Key: SPARK-8485 URL: https://issues.apache.org/jira/browse/SPARK-8485 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Many transformers exist to convert from image representations into more compact descriptors amenable to standard ML techniques. We should implement these transformers in Spark to support machine learning on richer content types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8485) Feature transformers for image processing
[ https://issues.apache.org/jira/browse/SPARK-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8485: - Target Version/s: (was: 1.5.0) Feature transformers for image processing - Key: SPARK-8485 URL: https://issues.apache.org/jira/browse/SPARK-8485 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Many transformers exist to convert from image representations into more compact descriptors amenable to standard ML techniques. We should implement these transformers in Spark to support machine learning on richer content types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8483) Remove commons-lang3 depedency from flume-sink
[ https://issues.apache.org/jira/browse/SPARK-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8483: --- Assignee: (was: Apache Spark) Remove commons-lang3 depedency from flume-sink -- Key: SPARK-8483 URL: https://issues.apache.org/jira/browse/SPARK-8483 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Hari Shreedharan flume-sink module uses only one method from commons-lang3. Since the build would become complex if we create an assembly and would likely make it more difficult for customers, let's just remove the dependency altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8483) Remove commons-lang3 depedency from flume-sink
[ https://issues.apache.org/jira/browse/SPARK-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594058#comment-14594058 ] Apache Spark commented on SPARK-8483: - User 'harishreedharan' has created a pull request for this issue: https://github.com/apache/spark/pull/6910 Remove commons-lang3 depedency from flume-sink -- Key: SPARK-8483 URL: https://issues.apache.org/jira/browse/SPARK-8483 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Hari Shreedharan flume-sink module uses only one method from commons-lang3. Since the build would become complex if we create an assembly and would likely make it more difficult for customers, let's just remove the dependency altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8483) Remove commons-lang3 depedency from flume-sink
[ https://issues.apache.org/jira/browse/SPARK-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8483: --- Assignee: Apache Spark Remove commons-lang3 depedency from flume-sink -- Key: SPARK-8483 URL: https://issues.apache.org/jira/browse/SPARK-8483 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Hari Shreedharan Assignee: Apache Spark flume-sink module uses only one method from commons-lang3. Since the build would become complex if we create an assembly and would likely make it more difficult for customers, let's just remove the dependency altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8485) Feature transformers for image processing
[ https://issues.apache.org/jira/browse/SPARK-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594122#comment-14594122 ] Joseph K. Bradley commented on SPARK-8485: -- This is something which is going to come up in MLlib now that we have a better interface for feature transformers. I suspect a lot of people will look to Pipelines for existing transformers, including in major applications areas like NLP, vision, and audio. I think some of these are clearly useful (SIFT HOG are the ones I hear most about). For others, it would be good to look to other libraries and see what is most common. My feeling is that it would be nice to have a few such transformers in MLlib itself, but a full-fledged image processing library would belong in an external package for now. My main concerns are: * Interest/need: We should hold off on implementing these to see if the community has sufficient interest. * Data type: If we add image processing, we need to support actual images, including depth (data type) and multiple channels (e.g. RGB). This will be a significant commitment to create a UDT for images, but it would be important to lay the groundwork for further image processing work. Let's leave the JIRAs open for discussion to gather interest, use cases with Spark, and feedback. But people should discuss here before sending PRs. Feature transformers for image processing - Key: SPARK-8485 URL: https://issues.apache.org/jira/browse/SPARK-8485 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Many transformers exist to convert from image representations into more compact descriptors amenable to standard ML techniques. We should implement these transformers in Spark to support machine learning on richer content types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8420) Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0
[ https://issues.apache.org/jira/browse/SPARK-8420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594200#comment-14594200 ] Apache Spark commented on SPARK-8420: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/6914 Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0 -- Key: SPARK-8420 URL: https://issues.apache.org/jira/browse/SPARK-8420 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Justin Yip Assignee: Michael Armbrust Priority: Blocker Labels: releasenotes I am trying out 1.4.0 and notice there are some differences in behavior with Timestamp between 1.3.1 and 1.4.0. In 1.3.1, I can compare a Timestamp with string. {code} scala val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf(2015-01-01 00:00:00)), (2, Timestamp.valueOf(2014-01-01 00:00:00 ... scala df.filter($_2 = 2014-06-01).show ... _1 _2 2 2014-01-01 00:00:... {code} However, in 1.4.0, the filter is always false: {code} scala val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf(2015-01-01 00:00:00)), (2, Timestamp.valueOf(2014-01-01 00:00:00 df: org.apache.spark.sql.DataFrame = [_1: int, _2: timestamp] scala df.filter($_2 = 2014-06-01).show +--+--+ |_1|_2| +--+--+ +--+--+ {code} Not sure if that is intended, but I cannot find any doc mentioning these inconsistencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8494) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3
[ https://issues.apache.org/jira/browse/SPARK-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8494: --- Assignee: (was: Patrick Wendell) ClassNotFoundException when running with sbt, scala 2.10.4, spray 1.3.3 --- Key: SPARK-8494 URL: https://issues.apache.org/jira/browse/SPARK-8494 Project: Spark Issue Type: Bug Components: Spark Core Reporter: PJ Fanning Attachments: spark-test-case.zip I found a similar issue to SPARK-1923 but with Scala 2.10.4. I used the Test.scala from SPARK-1923 but used the libraryDependencies from a build.sbt that I am working on. If I remove the spray 1.3.3 jars, the test case passes but has a ClassNotFoundException otherwise. I have a spark-assembly jar built using Spark 1.3.2-SNAPSHOT. Application: {code} import org.apache.spark.SparkConf import org.apache.spark.SparkContext object Test { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster(local[4]).setAppName(Test) val sc = new SparkContext(conf) sc.makeRDD(1 to 1000, 10).map(x = Some(x)).count sc.stop() } {code} Exception: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.ClassNotFoundException: scala.collection.immutable.Range java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:425) java.lang.ClassLoader.loadClass(ClassLoader.java:358) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:270) org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) {code} {code} name := spark-test-case version := 1.0 scalaVersion := 2.10.4 resolvers += spray repo at http://repo.spray.io; resolvers += Scalaz Bintray Repo at https://dl.bintray.com/scalaz/releases; val akkaVersion = 2.3.11 val sprayVersion = 1.3.3 libraryDependencies ++= Seq( com.h2database % h2 % 1.4.187, com.typesafe.akka %% akka-actor % akkaVersion, com.typesafe.akka %% akka-slf4j % akkaVersion, ch.qos.logback % logback-classic % 1.0.13, io.spray %% spray-can% sprayVersion, io.spray %% spray-routing% sprayVersion, io.spray %% spray-json % 1.3.1, com.databricks %% spark-csv% 1.0.3, org.specs2 %% specs2 % 2.4.17 % test, org.specs2 %% specs2-junit % 2.4.17 % test, io.spray %% spray-testkit% sprayVersion % test, com.typesafe.akka %% akka-testkit % akkaVersion% test, junit % junit% 4.12 % test ) scalacOptions ++= Seq( -unchecked, -deprecation, -Xlint, -Ywarn-dead-code, -language:_, -target:jvm-1.7, -encoding, UTF-8 ) testOptions += Tests.Argument(TestFrameworks.JUnit, -v) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8389) Expose KafkaRDDs offsetRange in Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-8389. Resolution: Duplicate Expose KafkaRDDs offsetRange in Python -- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8389) Expose KafkaRDDs offsetRange in Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8389: - Assignee: (was: Saisai Shao) Expose KafkaRDDs offsetRange in Python -- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7786) Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594240#comment-14594240 ] Tathagata Das commented on SPARK-7786: -- [~397090770] This functionality can easily be done in user code, without actually losing any events, rather than being driven by a SparkConf setting. The user can very easily pass the name of the class by whatever means (cmdline args, etc.) into the process, and the user can use reflection to instantiate the right listener and attach it to the streaming context before starting it. The reason similar functionality was added for SparkListener is that attaching any listener after the SparkContext has been initialized will not catch all the initial events. So the system needs to attach any listener before any event has been generated, and that is why the SparkConf config was necessary. However, this is not the case for StreamingListener, as there are no events before starting the StreamingContext. So can you elaborate on scenarios where this is absolutely essential? Allow StreamingListener to be specified in SparkConf and loaded when creating StreamingContext -- Key: SPARK-7786 URL: https://issues.apache.org/jira/browse/SPARK-7786 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Reporter: yangping wu Priority: Minor As mentioned in [SPARK-5411|https://issues.apache.org/jira/browse/SPARK-5411], we can also allow users to register a StreamingListener through SparkConf settings, loaded when creating the StreamingContext. This would allow monitoring frameworks to be easily injected into Spark programs without having to modify those programs' code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
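A minimal sketch of the reflection-based approach described in the comment above (the listener class name is a made-up placeholder, e.g. taken from a command-line argument):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.StreamingListener

val conf = new SparkConf().setAppName("listener-example")
val ssc = new StreamingContext(conf, Seconds(1))

// Hypothetical user-supplied class name, e.g. parsed from the command line.
val listenerClassName = "com.example.MyStreamingListener"
val listener = Class.forName(listenerClassName).newInstance().asInstanceOf[StreamingListener]
ssc.addStreamingListener(listener)  // attached before start(), so no events are missed

// ... set up DStreams ...
ssc.start()
ssc.awaitTermination()
{code}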
[jira] [Commented] (SPARK-8483) Remove commons-lang3 depedency from flume-sink
[ https://issues.apache.org/jira/browse/SPARK-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594252#comment-14594252 ] Hari Shreedharan commented on SPARK-8483: - Well, we aren't adding a dependency - we are only removing one. So I don't see stuff breaking. We can push it out to 1.5 if this is risky. Remove commons-lang3 depedency from flume-sink -- Key: SPARK-8483 URL: https://issues.apache.org/jira/browse/SPARK-8483 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Hari Shreedharan Assignee: Hari Shreedharan flume-sink module uses only one method from commons-lang3. Since the build would become complex if we create an assembly and would likely make it more difficult for customers, let's just remove the dependency altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8495) Add a `.lintr` file to validate the SparkR files
Yu Ishikawa created SPARK-8495: -- Summary: Add a `.lintr` file to validate the SparkR files Key: SPARK-8495 URL: https://issues.apache.org/jira/browse/SPARK-8495 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa https://issues.apache.org/jira/browse/SPARK-6813 As we discussed, we are planning to go with {{lintr}} to validate the SparkR files. So we should add the rules for it as a {{.lintr}} file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8497) Graph Clique(Complete Connected Sub-graph) Discovery Algorithm
Fan Jiang created SPARK-8497: Summary: Graph Clique (Complete Connected Sub-graph) Discovery Algorithm Key: SPARK-8497 URL: https://issues.apache.org/jira/browse/SPARK-8497 Project: Spark Issue Type: New Feature Components: GraphX, ML, MLlib, Spark Core Reporter: Fan Jiang In recent years the social network industry has had a high demand for complete connected sub-graph discovery, and so has telecom. Similar to the connection graph from Twitter, the calls and other activities in the telecom world form a huge social graph, and due to the nature of the communication medium it shows the strongest inter-person relationships; graph-based analysis will reveal tremendous value from telecom connections. We need an algorithm in Spark to figure out ALL the strongest completely connected sub-graphs (so-called cliques here) for EVERY person in the network, which will be one of the starting points for understanding users' social behaviour. At Huawei, we have many real-world use cases that involve telecom social graphs with tens of billions of edges and hundreds of millions of vertices, and the number of cliques will also be in the tens of millions. The graph changes quickly, which means we need to analyse the graph pattern very often (one result per day/week for a moving time window which spans multiple months). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594199#comment-14594199 ] Tathagata Das commented on SPARK-8389: -- Aah, there is already discussion. This escaped my notice because it did not have the streaming component tag on it. I am going to close this JIRA as duplicate. Expose KafkaRDDs offsetRange in Python -- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Saisai Shao Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8389) Expose KafkaRDDs offsetRange in Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8389: - Assignee: Saisai Shao (was: Cody Koeninger) Expose KafkaRDDs offsetRange in Python -- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Saisai Shao Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8337: - Component/s: Streaming KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version -- Key: SPARK-8337 URL: https://issues.apache.org/jira/browse/SPARK-8337 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.4.0 Reporter: Amit Ramesh Priority: Critical See the following thread for context. http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594239#comment-14594239 ] Shivaram Venkataraman commented on SPARK-6813: -- Two things 1. For the variable and function names should be lowercase, this should not come up if the camelCase=NULL is being picked up correctly. I think the best way to get lintr to pick up options is to create `.lintr` file in `SPARK_HOME/R/pkg` -- I just tried this and this removed all the variable name errors. 2. The trailing whitespace is a valid problem. We should remove it. In fact [~rxin] has been doing this for all the other parts of the code recently. SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on the some of the guidelines we use and some of the best practices in R. Some examples of R style guide are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on a automatic style checking tool. https://github.com/jimhester/lintr seems promising We could have a R style guide based on the one from google [1], and adjust some of them with the conversation in Spark: 1. Line Length: maximum 100 characters 2. no limit on function name (API should be similar as in other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7148) Configure Parquet block size (row group size) for ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594261#comment-14594261 ] Joseph K. Bradley commented on SPARK-7148: -- Hm, if it's that simple, then I wonder if we can adjust parquet.block.size before saving/loading the ML models and reset the block size to its original value afterwards. I'll have to try that! Configure Parquet block size (row group size) for ML model import/export Key: SPARK-7148 URL: https://issues.apache.org/jira/browse/SPARK-7148 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor It would be nice if we could configure the Parquet buffer size when using Parquet format for ML model import/export. Currently, for some models (trees and ensembles), the schema has 13+ columns. With a default buffer size of 128MB (I think), that puts the allocated buffer way over the default memory made available by run-example. Because of this problem, users have to use spark-submit and explicitly use a larger amount of memory in order to run some ML examples. Is there a simple way to specify {{parquet.block.size}}? I'm not familiar with this part of SparkSQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
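The idea in the comment above could be sketched like this (whether it actually resolves the memory issue is exactly what remains to be tested; `model` is a placeholder for any Saveable MLlib model and the path is made up):
{code}
val hadoopConf = sc.hadoopConfiguration
val previous = hadoopConf.get("parquet.block.size")   // may be null if unset
hadoopConf.setInt("parquet.block.size", 1024 * 1024)  // e.g. 1 MB row groups for small models

model.save(sc, "/tmp/my-model")                       // hypothetical Saveable model

// Restore the original setting afterwards.
if (previous != null) hadoopConf.set("parquet.block.size", previous)
{code}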
[jira] [Comment Edited] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594281#comment-14594281 ] Yu Ishikawa edited comment on SPARK-6813 at 6/20/15 5:21 AM: - That sounds good! I created an issue to add a {{.lintr}} file as folllows. https://issues.apache.org/jira/browse/SPARK-8495 was (Author: yuu.ishik...@gmail.com): That's sounds good! I created an issue to add a {{.lintr}} file as folllows. https://issues.apache.org/jira/browse/SPARK-8495 SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on the some of the guidelines we use and some of the best practices in R. Some examples of R style guide are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on a automatic style checking tool. https://github.com/jimhester/lintr seems promising We could have a R style guide based on the one from google [1], and adjust some of them with the conversation in Spark: 1. Line Length: maximum 100 characters 2. no limit on function name (API should be similar as in other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8489) Add regression tests for SPARK-8470
[ https://issues.apache.org/jira/browse/SPARK-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-8489. - Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Issue resolved by https://github.com/apache/spark/pull/6909 (the pr used a wrong jira number). Add regression tests for SPARK-8470 --- Key: SPARK-8489 URL: https://issues.apache.org/jira/browse/SPARK-8489 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.4.1, 1.5.0 See SPARK-8470 for more detail. Basically the Spark Hive code silently overwrites the context class loader populated in SparkSubmit, resulting in certain classes missing when we do reflection in `SQLContext#createDataFrame`. That issue is already resolved in https://github.com/apache/spark/pull/6891, but we should add a regression test for the specific manifestation of the bug in SPARK-8470. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8389) Expose KafkaRDDs offsetRange in Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8389: - Summary: Expose KafkaRDDs offsetRange in Python (was: Expose KafkaRDDs offsetRange in Java and Python) Expose KafkaRDDs offsetRange in Python -- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Assignee: Cody Koeninger Priority: Critical Probably requires creating a JavaKafkaPairRDD and also using it in the Python APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
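For reference, this is roughly the existing Scala-side pattern for reading offset ranges via {{HasOffsetRanges}}; the ticket asks for an equivalent path from Python. Sketch only: the broker address and topic name are illustrative and a running StreamingContext {{ssc}} is assumed. {code} import kafka.serializer.StringDecoder import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils} val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // illustrative val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Set("topic")) stream.foreachRDD { rdd => // Each KafkaRDD carries its offset ranges; Python currently has no way to see these. val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges ranges.foreach(r => println(s"${r.topic} ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")) } {code}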
[jira] [Commented] (SPARK-8420) Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0
[ https://issues.apache.org/jira/browse/SPARK-8420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594215#comment-14594215 ] Yin Huai commented on SPARK-8420: - Will resolve this one after the 1.4 backport is merged. Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0 -- Key: SPARK-8420 URL: https://issues.apache.org/jira/browse/SPARK-8420 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Justin Yip Assignee: Michael Armbrust Priority: Blocker Labels: releasenotes Fix For: 1.5.0 I am trying out 1.4.0 and noticed there are some differences in behavior with Timestamp between 1.3.1 and 1.4.0. In 1.3.1, I can compare a Timestamp with a string. {code} scala> val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf("2015-01-01 00:00:00")), (2, Timestamp.valueOf("2014-01-01 00:00:00")))) ... scala> df.filter($"_2" <= "2014-06-01").show ... _1 _2 2 2014-01-01 00:00:... {code} However, in 1.4.0, the filter is always false: {code} scala> val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf("2015-01-01 00:00:00")), (2, Timestamp.valueOf("2014-01-01 00:00:00")))) df: org.apache.spark.sql.DataFrame = [_1: int, _2: timestamp] scala> df.filter($"_2" <= "2014-06-01").show +--+--+ |_1|_2| +--+--+ +--+--+ {code} Not sure if that is intended, but I cannot find any doc mentioning these inconsistencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
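A possible workaround (not from the ticket) while the behavior difference is sorted out is to compare against an explicit {{Timestamp}} literal instead of relying on string coercion; a sketch, assuming the {{df}} from the report above and Spark 1.4 APIs: {code} import java.sql.Timestamp import org.apache.spark.sql.functions.lit val cutoff = Timestamp.valueOf("2014-06-01 00:00:00") df.filter(df("_2") <= lit(cutoff)).show() // should again return the 2014-01-01 row {code}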
[jira] [Updated] (SPARK-8483) Remove commons-lang3 dependency from flume-sink
[ https://issues.apache.org/jira/browse/SPARK-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-8483: - Target Version/s: 1.4.1, 1.5.0 (was: 1.5.0) Remove commons-lang3 dependency from flume-sink -- Key: SPARK-8483 URL: https://issues.apache.org/jira/browse/SPARK-8483 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Hari Shreedharan Assignee: Hari Shreedharan The flume-sink module uses only one method from commons-lang3. Since the build would become more complex if we created an assembly, which would likely make things more difficult for customers, let's just remove the dependency altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
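The shape of the change would be to inline the single helper rather than depend on the whole library. A hypothetical illustration only; the method flume-sink actually uses may be different: {code} // Hypothetical: if the one commons-lang3 call were StringUtils.isBlank, // it could be replaced by a tiny local helper and the dependency dropped. // Before: org.apache.commons.lang3.StringUtils.isBlank(channelName) def isBlank(s: String): Boolean = s == null || s.trim.isEmpty val channelName: String = "" // illustrative input if (isBlank(channelName)) { println("channel name is not configured") // illustrative handling } {code}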
[jira] [Commented] (SPARK-8483) Remove commons-lang3 dependency from flume-sink
[ https://issues.apache.org/jira/browse/SPARK-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594228#comment-14594228 ] Tathagata Das commented on SPARK-8483: -- We generally do not make dependency changes between patch releases, because of potential conflicts between multiple versions of the same library at runtime, etc. But this sink runs only inside Flume, so do you think it is okay in this case? Are you sure? Remove commons-lang3 dependency from flume-sink -- Key: SPARK-8483 URL: https://issues.apache.org/jira/browse/SPARK-8483 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Hari Shreedharan Assignee: Hari Shreedharan The flume-sink module uses only one method from commons-lang3. Since the build would become more complex if we created an assembly, which would likely make things more difficult for customers, let's just remove the dependency altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594227#comment-14594227 ] Yu Ishikawa commented on SPARK-6813: [~shivaram] Those two rules look good! I modified my code to use the GitHub version of lintr instead of the CRAN version, and I tried to set the two rules in it. h3. The latest version of the script https://github.com/apache/spark/compare/master...yu-iskw:SPARK-6813 h3. The result of the script https://gist.github.com/yu-iskw/7a663dbea295ee767849 h3. Rules we should discuss - whether to enforce {{Variable and function names should be all lowercase}} - {{Trailing whitespace is superfluous}} SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style checking tool. https://github.com/jimhester/lintr seems promising. We could have an R style guide based on the one from Google [1], and adjust some of the rules based on the discussion in Spark: 1. Line length: maximum 100 characters 2. No limit on function names (the API should be similar to that in other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6813) SparkR style guide
[ https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594250#comment-14594250 ] Yu Ishikawa edited comment on SPARK-6813 at 6/20/15 1:25 AM: - Sounds great! h3. TODO Please let me know if you'd like to add anything. - We will fix the valid problems - You will create {{.lintr}} in {{SPARK_HOME/R/pkg}} - I will send a PR adding my {{lint-r}} script and merge it - We will fix the valid problems reported by my {{lint-r}} script again - We will add settings to run the {{lint-r}} script on the official Jenkins was (Author: yuu.ishik...@gmail.com): Sounds great! h3. TODO Please let me know if you'd like to add anything. - Modify valid problems - You will create {{.lintr}} in {{SPARK_HOME/R/pkg}} - I will send a PR about my {{lint-r}} script and merge it - Modify valid problems with my {{lint-r}} script again - We will add some settings to run the {{lint-r}} script on the official jenkins SparkR style guide -- Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style checking tool. https://github.com/jimhester/lintr seems promising. We could have an R style guide based on the one from Google [1], and adjust some of the rules based on the discussion in Spark: 1. Line length: maximum 100 characters 2. No limit on function names (the API should be similar to that in other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8490) SURF Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8490: - Priority: Minor (was: Major) SURF Feature Transformer Key: SPARK-8490 URL: https://issues.apache.org/jira/browse/SPARK-8490 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Priority: Minor Speeded Up Robust Features (SURF) (Bay et al., ECCV 2006, http://www.vision.ee.ethz.ch/~surf/eccv06.pdf) is an image descriptor transform very similar to SIFT (SPARK-8486) but can be computed more efficiently. One key difference is the use of box filters (Difference of Boxes) to approximate the Laplacian of Gaussian. We can implement SURF in Spark ML pipelines as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SURF transformer should output an Array[Array[Numeric]] of the SURF features for the provided image. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
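A placeholder sketch that only illustrates the proposed input/output shape (an image in, one descriptor vector per interest point out); the body is a dummy, not a real SURF implementation: {code} type Image = Array[Array[Double]] type Descriptors = Array[Array[Double]] def surfDescriptors(image: Image): Descriptors = { // A real implementation would approximate the Hessian with box filters over an // integral image, locate interest points, and emit e.g. one 64-dimensional vector each. Array(Array(image.map(_.sum).sum)) // dummy single "descriptor", illustration only } // Tiny example call on a 2x2 "image" val demo = surfDescriptors(Array(Array(0.0, 1.0), Array(1.0, 0.0))) {code}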
[jira] [Updated] (SPARK-8485) Feature transformers for image processing
[ https://issues.apache.org/jira/browse/SPARK-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8485: - Priority: Minor (was: Major) Feature transformers for image processing - Key: SPARK-8485 URL: https://issues.apache.org/jira/browse/SPARK-8485 Project: Spark Issue Type: New Feature Components: ML Reporter: Feynman Liang Priority: Minor Many transformers exist to convert from image representations into more compact descriptors amenable to standard ML techniques. We should implement these transformers in Spark to support machine learning on richer content types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8493) Fisher Vector Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8493: - Priority: Minor (was: Major) Fisher Vector Feature Transformer - Key: SPARK-8493 URL: https://issues.apache.org/jira/browse/SPARK-8493 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Priority: Minor Fisher vectors provide a vocabulary-based encoding for images (see https://hal.inria.fr/hal-00830491/file/journal.pdf). This representation is useful due to reduced dimensionality, providing regularization as well as increased scalability. An implementation of FVs in Spark ML should provide a way to both train a GMM vocabulary and compute Fisher kernel encodings of the provided images. The vocabulary trainer can be implemented as a standalone GMM pipeline. The feature transformer can be implemented as an org.apache.spark.ml.UnaryTransformer. It should accept a vocabulary (Array[Array[Double]]) as well as an image (Array[Double]) and produce the Fisher kernel encoding (Array[Double]). See Enceval (http://www.robots.ox.ac.uk/~vgg/software/enceval_toolkit/) for a reference implementation in MATLAB/C++. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
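To make the proposed I/O contract concrete, here is a heavily simplified, hypothetical Scala sketch: only the gradient with respect to the GMM means, assuming identity covariances and uniform weights, so it omits most of the normalization described in the referenced paper: {code} def fisherEncode(vocab: Array[Array[Double]], descriptor: Array[Double]): Array[Double] = { // Squared distance from the descriptor to each vocabulary word (GMM mean). val sq = vocab.map(mu => mu.zip(descriptor).map { case (m, x) => (x - m) * (x - m) }.sum) // Soft assignment (posterior) of the descriptor to each component. val weights = sq.map(d => math.exp(-0.5 * d)) val total = weights.sum val gamma = weights.map(_ / total) // Gradient w.r.t. the means: concatenate gamma_k * (x - mu_k) over all components k. vocab.zip(gamma).flatMap { case (mu, g) => mu.zip(descriptor).map { case (m, x) => g * (x - m) } } } // Example: a 2-word vocabulary over 2-dimensional descriptors yields a 4-dimensional encoding. val vocab = Array(Array(0.0, 0.0), Array(1.0, 1.0)) val fv = fisherEncode(vocab, Array(0.2, 0.9)) {code}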