Does anyone have a Spark code style guide XML file?

2016-03-01 Thread zml
Hello,

I would appreciate it if you could share an XML file that implements the following code style guide:
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

thanks.


Are there open source tools that implement draggable widgets and let an app run as a DAG?

2016-02-01 Thread zml
Hello,

I have been trying to find such tools, without success. So, as the title describes, are there open source tools that implement draggable widgets and let an application run as a DAG-like workflow?

Thanks,
Minglei.


Is there a MiniCluster-like test example in Spark, as in Hadoop?

2016-01-18 Thread zml
Hello,

I want to find test utilities in Spark that provide the same functionality as Hadoop's MiniCluster test environment, but I cannot find any. Does anyone know about this?
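
For reference, Spark jobs are often unit-tested with a local master rather than a mini-cluster; a minimal sketch, assuming a plain in-process SparkContext is enough (i.e. no HDFS/YARN mini-cluster like Hadoop's MiniDFSCluster), with all names illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal local-mode "mini cluster" substitute: an in-process Spark with two worker threads.
object LocalSparkTestSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("local-test").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Run a tiny job and check the result, as a unit test would.
      val sum = sc.parallelize(1 to 10).sum()
      assert(sum == 55.0)
    } finally {
      sc.stop()
    }
  }
}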


Fwd: Error:scalac: Error: assertion failed: List(object package$DebugNode, object package$DebugNode)

2015-12-30 Thread zml
I’m sorry, the error does not occur when I build Spark; it happens when running the example LogisticRegressionWithElasticNetExample.scala.

From: zml张明磊 [mailto:mingleizh...@ctrip.com]
Sent: 2015-12-31 15:01
To: user@spark.apache.org
Subject: Error:scalac: Error: assertion failed: List(object package$DebugNode, object package$DebugNode)

Hello,

Recently I built Spark from apache/master and am getting the following error. In the Stack Overflow post http://stackoverflow.com/questions/24165184/scalac-assertion-failed-while-run-scalatest-in-idea, the answer points at a Preferences > Scala setting in IntelliJ IDEA, but I cannot find it. SBT also does not work for me at our company; we use Maven instead. How can I fix or work around this? Lastly, happy new year to everyone.

Error:scalac: Error: assertion failed: List(object package$DebugNode, object package$DebugNode)
  java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode)
   at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678)
   at scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:2988)
   at scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:2991)
   at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1371)
   at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:120)
   at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
   at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
   at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
   at scala.tools.nsc.Global$Run.compile(Global.scala:1662)
   at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)


thanks,
Minglei.


Error:scalac: Error: assertion failed: List(object package$DebugNode, object package$DebugNode)

2015-12-30 Thread zml
Hello,

Recently I built Spark from apache/master and am getting the following error. In the Stack Overflow post http://stackoverflow.com/questions/24165184/scalac-assertion-failed-while-run-scalatest-in-idea, the answer points at a Preferences > Scala setting in IntelliJ IDEA, but I cannot find it. SBT also does not work for me at our company; we use Maven instead. How can I fix or work around this? Lastly, happy new year to everyone.

Error:scalac: Error: assertion failed: List(object package$DebugNode, object package$DebugNode)
  java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode)
   at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678)
   at scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:2988)
   at scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:2991)
   at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1371)
   at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:120)
   at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
   at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
   at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
   at scala.tools.nsc.Global$Run.compile(Global.scala:1662)
   at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)


thanks,
Minglei.
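
For what it's worth, this kind of scalac assertion about a duplicated object symbol (package$DebugNode appearing twice) is often caused by stale compiled classes left over from an earlier build. One common workaround, offered as an assumption rather than a verified fix for this exact case, is a clean Maven rebuild before reopening the project in the IDE:

build/mvn -DskipTests clean package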


How can I get column data by a specific column name and store it in an array or list?

2015-12-24 Thread zml
Hi,

I am new to Scala and Spark and am trying to find the right DataFrame API to solve the problem in the title. The only thing I can find is DataFrame.col(colName: String): Column, which returns a Column object, not the column's contents. An API like Column.toArray would be enough for me, but there isn't one. How can I achieve this?

Thanks,
Minglei.
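
For reference, one way that should work with the Spark 1.x DataFrame API is to select the column and collect it to the driver. A minimal sketch, assuming a DataFrame df with a string column named "name" (the column name is illustrative):

// Select the single column, collect its rows to the driver, and unwrap each Row.
val values: Array[String] = df.select("name").collect().map(_.getString(0))

Note that collect() pulls the whole column into driver memory, so this only makes sense for reasonably small results.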


Running a Spark application encounters an error (Maven related)

2015-12-22 Thread zml
Hi,

I am trying to figure out how Maven works. When I add a dependency to my existing pom.xml and rebuild my Spark application project, the console reports BUILD SUCCESS. However, when I run the Spark application, spark-shell is not happy and gives me the following message:

Exception: java.lang.NoClassDefFoundError: com/github/stuxuhai/jpinyin/PinyinFormat
Caused by: ClassNotFoundException: com.github.stuxuhai.jpinyin.PinyinFormat

I went to the directory .m2/repo/com/github/stuxuhai/jpinyin/1.1.1 and jpinyin-1.1.1.jar is there. What happened? Can anyone help me?

Thanks,
Minglei.
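
For reference, a NoClassDefFoundError at run time for a dependency that compiled fine usually means the jar sits in the local Maven repository but is never shipped with the job. One workaround is to hand the extra jar to spark-submit explicitly; a sketch, where the jar path assumes the default ~/.m2/repository layout and the class and application jar are placeholders:

spark-submit \
  --jars ~/.m2/repository/com/github/stuxuhai/jpinyin/1.1.1/jpinyin-1.1.1.jar \
  --class your.main.Class \
  your-application.jar

Another option is to build a fat jar (for example with the maven-shade-plugin) so the dependency travels inside the application jar itself.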


UnsupportedOperationException Schema for type String => Int is not supported

2015-12-22 Thread zml
Hi,

Spark version: 1.4.1
Running the code below, I get the following error. How can I fix the code so that it runs correctly? I don't know why the schema doesn't support this type. If I use callUDF instead of udf, everything works.

Thanks,
Minglei.

val index: (String => (String => Int)) = (value: String) => { (a: String) => if (value.equals(a)) 1 else 0 }
val sqlfunc = udf(index)
var temp = df
val meetsConditionValue = List("fergubo01m", "wrighha01m", "woodji01m", "mcbridi01m", "cravebi01m")

for (i <- 0 until j) {
  temp = temp.withColumn(columnName + "_" + meetsConditionValue(i), sqlfunc(col(columnName)))
}


Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type String => Int is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:152)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:63)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:28)
at org.apache.spark.sql.functions$.udf(functions.scala:1363)
at com.asa.ml.toolimpl.DummyImpl.create_dummy(DummyImpl.scala:60)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.asa.ml.client.Client$.main(Client.scala:26)
at com.asa.ml.client.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
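
For reference, the exception comes from udf(index) trying to derive a DataFrame schema for the UDF's return type, which here is the function type String => Int rather than a plain value, and Catalyst has no schema for function types. A sketch of one way around it, reusing df, columnName and meetsConditionValue from the snippet above (Spark 1.x API): pass the comparison value as a second argument instead of returning a nested function.

import org.apache.spark.sql.functions.{col, lit, udf}

// A two-argument UDF returning Int, a type Spark can build a schema for.
val isMatch = udf((value: String, target: String) => if (value == target) 1 else 0)

var temp = df
for (v <- meetsConditionValue) {
  // lit(v) wraps the Scala string as a constant Column.
  temp = temp.withColumn(columnName + "_" + v, isMatch(col(columnName), lit(v)))
}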



Is there a solution for transforming categorical variables into dummy variables in Scala or Spark?

2015-12-17 Thread zml
Hi,

I am new to Scala and Spark. Recently I needed to write a tool that transforms categorical variables into dummy/indicator variables. Are there tools in Scala or Spark that support this kind of transformation, like pandas.get_dummies in Python? Any examples or learning materials would be welcome.

Thanks,
Minglei.
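
For reference, Spark's ml package has StringIndexer and OneHotEncoder, which together behave roughly like pandas.get_dummies. A minimal sketch against the Spark 1.4+ API, assuming a DataFrame df with a categorical string column "category" (the column names are illustrative):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map each distinct category value to a numeric index.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)

// Expand the index into a sparse 0/1 dummy vector column.
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)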


YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2015-12-15 Thread zml
Last night I ran the jar in pseudo-distributed mode without any WARN or ERROR. Today, however, I get the WARN below, which leads directly to the ERROR. My machine has 8 GB of memory, so I don't think resources are the issue the WARN describes. What's wrong? The code hasn't changed, and neither has the environment, which is strange. Can anybody help me?

Thanks.
Minglei.

Here is the submit job script:

/bin/spark-submit --master local[*] --driver-memory 8g --executor-memory 8g --class com.ctrip.ml.client.Client /root/di-ml-tool/target/di-ml-tool-1.0-SNAPSHOT.jar

The error is below:
15/12/16 10:22:01 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/16 10:22:04 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.32.3.21:48311
15/12/16 10:22:04 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.32.3.21:48311
15/12/16 10:22:04 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@10.32.3.21:48311] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/12/16 10:22:04 ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!

Exception in thread "main" 15/12/16 10:22:04 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
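
One possible explanation, offered as an assumption from the script above rather than something the log confirms: the submit asks for --driver-memory 8g plus --executor-memory 8g on a machine with 8 GB in total, so YARN can never grant the requested containers and the job sits unscheduled until the ApplicationMaster gives up. A sketch of a smaller request, assuming the job really is meant to run against YARN (the log shows YarnClientSchedulerBackend rather than local mode):

/bin/spark-submit --master yarn-client --driver-memory 2g --executor-memory 2g --class com.ctrip.ml.client.Client /root/di-ml-tool/target/di-ml-tool-1.0-SNAPSHOT.jar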



RuntimeException: Failed to check null bit for primitive int type

2015-12-14 Thread zml
Hi,

My Spark version is spark-1.4.1-bin-hadoop2.6. When I submit a Spark job that reads data from a Hive table, I get the following error. Although it is only a WARN, it leads to the job failing. Maybe the following JIRA has already solved this, so I am confused: https://issues.apache.org/jira/browse/SPARK-3004

15/12/14 19:21:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 40.0 (TID 1255, minglei): java.lang.RuntimeException: Failed to check null bit for primitive int value.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(rows.scala:82)
at com.ctrip.ml.toolimpl.MetadataImpl$$anonfun$1.apply(MetadataImpl.scala:22)
at com.ctrip.ml.toolimpl.MetadataImpl$$anonfun$1.apply(MetadataImpl.scala:22)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:30)
at org.spark-project.guava.collect.Ordering.leastOf(Ordering.java:658)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1338)
at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$29.apply(RDD.scala:1335)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
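
For reference, this is the exception GenericRow.getInt throws when the row actually holds a NULL at that position, so it usually points at NULL values in the Hive table rather than at SPARK-3004. A minimal sketch of the usual guard, where rows stands for whatever collection of Row the map at MetadataImpl.scala:22 runs over, and the column index and default value are illustrative:

// Guard each access with isNullAt so NULL cells don't fail the task.
val ints = rows.map { row =>
  if (row.isNullAt(0)) 0 else row.getInt(0)   // substitute a default for NULLs
}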