Re: Spark Thrift doesn't start

2015-11-10 Thread fightf...@163.com
I think the exception clearly indicates that you are missing a Tez-related jar on the
Spark Thrift Server classpath.
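For example, a quick sanity check (a sketch, assuming a spark-shell launched with the
same classpath the Thrift Server uses):

// A ClassNotFoundException here confirms that the Tez jar containing
// SessionNotRunning (e.g. the tez-api jar) is not on the classpath.
try {
  Class.forName("org.apache.tez.dag.api.SessionNotRunning")
  println("Tez API classes found on the classpath")
} catch {
  case _: ClassNotFoundException =>
    println("Tez API classes missing -- add the Tez jars to the Thrift Server classpath")
}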



fightf...@163.com
 
From: DaeHyun Ryu
Date: 2015-11-11 14:47
To: user
Subject: Spark Thrift doesn't start
Hi folks,

I configured Tez as the execution engine of Hive. After doing that, whenever I
start the Spark Thrift Server, it stops automatically.
I checked the log and saw the following messages. My Spark version is 1.4.1 and
my Tez version is 0.7.0 (IBM BigInsights 4.1).
Does anyone have any idea about this?

java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning

Spark Thrift doesn't start

2015-11-10 Thread DaeHyun Ryu
Hi folks,

I configured Tez as the execution engine of Hive. After doing that, whenever I
start the Spark Thrift Server, it stops automatically.
I checked the log and saw the following messages. My Spark version is 1.4.1 and
my Tez version is 0.7.0 (IBM BigInsights 4.1).
Does anyone have any idea about this?

java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353)
at 
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:116)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
at 
org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:168)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at 
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
at 
org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at 
org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
at 
org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
at 
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
at 
org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: 
org.apache.tez.dag.api.SessionNotRunning

RE: Connecting SparkR through Yarn

2015-11-10 Thread Sun, Rui
Amit,
You can simply set "MASTER" to "yarn-client" before calling sparkR.init():
Sys.setenv("MASTER" = "yarn-client")

I assume that you have set the "YARN_CONF_DIR" environment variable required for running
Spark on YARN.

If you want to set more YARN-specific configurations, you can, for example, call
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--master yarn-client --num-executors 4 sparkr-shell")
before calling sparkR.init().

From: Amit Behera [mailto:amit.bd...@gmail.com]
Sent: Monday, November 9, 2015 2:36 AM
To: user@spark.apache.org
Subject: Connecting SparkR through Yarn

Hi All,

Spark Version = 1.5.1
Hadoop Version = 2.6.0

I set up the cluster on Amazon EC2 machines (1+5).
I am able to create a SparkContext object using the init method from RStudio.

But I do not know how I can create a SparkContext object in yarn mode.

I found the link below for running on YARN, but that document covers Spark
versions >= 0.9.0 and <= 1.2.

https://github.com/amplab-extras/SparkR-pkg/blob/master/README.md#running-on-yarn


Please help me understand how I can connect SparkR on YARN.



Thanks,
Amit.


Terasort on Spark

2015-11-10 Thread Du, Fan

Hi Spark experts

I'm using ehiggs/spark-terasort to exercise my cluster.
I don't understand how to run terasort in a standard way when using a
cluster.

Currently, all the input data and output data is put into HDFS, and I
can generate/sort/validate all the sample data. But I'm not sure it's the
right way to do it.

./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen 
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar 256g 
hdfs:///tmp/data_in
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort 
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar 
hdfs:///tmp/data_in hdfs:///test/data_out
./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate 
spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar 
hdfs:///test/data_out file:///tmp/data_validate


That being said, all 256 GB of input data is stored in HDFS, and the Mesos
slaves need to access the HDFS-based input data.

So this leads to another question: how should HDFS be set up in a standard way?

Are there any docs summarizing how to set up a standard runtime env for
terasort on Spark?


thanks.




RE: Cassandra via SparkSQL/Hive JDBC

2015-11-10 Thread Bryan
Anyone have thoughts or a similar use-case for SparkSQL / Cassandra?

Regards,

Bryan Jeffrey

-Original Message-
From: "Bryan Jeffrey" 
Sent: ‎11/‎4/‎2015 11:16 AM
To: "user" 
Subject: Cassandra via SparkSQL/Hive JDBC

Hello.


I have been working to add SparkSQL HDFS support to our application.  We're 
able to process streaming data, append to a persistent Hive table, and have 
that table available via JDBC/ODBC.  Now we're looking to access data in 
Cassandra via SparkSQL.  


In reading a number of previous posts, it appears that the way to do this is to 
instantiate a Spark Context, read the data into an RDD using the Cassandra 
Spark Connector, convert the data to a DF and register it as a temporary table. 
 The data will then be accessible via SparkSQL - although I assume that you 
would need to refresh the table on a periodic basis.
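For reference, a minimal sketch of that approach, assuming the spark-cassandra-connector
is on the classpath; the keyspace, table, and host names are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: "my_keyspace" / "my_table" and the host are placeholders.
val conf = new SparkConf()
  .setAppName("CassandraOverSparkSQL")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Read the Cassandra table through the connector's data source and expose it
// to SparkSQL as a temporary table.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()
df.registerTempTable("my_table")

// The table is now queryable via SparkSQL; the load/register step would have to be
// re-run periodically to pick up new Cassandra data.
sqlContext.sql("SELECT count(*) FROM my_table").show()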


Is there a more straightforward way to do this?  Is it possible to register the 
Cassandra table with Hive so that the SparkSQL thrift server instance can just 
read data directly?


Regards,


Bryan Jeffrey

Spark-csv error on read AWS s3a in spark 1.4.1

2015-11-10 Thread Zhang, Jingyu
I have a small CSV file in S3 and access it with s3a://key:seckey@bucketname/a.csv

It works with SparkContext:

val pixelsStr: RDD[String] = ctx.textFile(s3pathOrg)

It works with the Java spark-csv API as well:

DataFrame careerOneDF = sqlContext.read().format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(s3pathOrg);

However, it does not work from Scala; the error message is shown below.

val careerOneDF: DataFrame = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(s3pathOrg)

com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service:
Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID:
F2E11C10E6D35BF3), S3 Extended Request ID:
0tdESZAHmROgSJem6P3gYnEZs86rrt4PByrTYbxzCw0xyM9KUMCHEAX3x4lcoy5O3A8qccgHraQ=

at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(
AmazonHttpClient.java:1160)

at com.amazonaws.http.AmazonHttpClient.executeOneRequest(
AmazonHttpClient.java:748)

at com.amazonaws.http.AmazonHttpClient.executeHelper(
AmazonHttpClient.java:467)

at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:302)

at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)

at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(
AmazonS3Client.java:1050)

at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(
AmazonS3Client.java:1027)

at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(
S3AFileSystem.java:688)

at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(
S3AFileSystem.java:71)

at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)

at org.apache.hadoop.fs.Globber.glob(Globber.java:252)

at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644)

at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(
FileInputFormat.java:257)

at org.apache.hadoop.mapred.FileInputFormat.listStatus(
FileInputFormat.java:228)

at org.apache.hadoop.mapred.FileInputFormat.getSplits(
FileInputFormat.java:313)

at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
MapPartitionsRDD.scala:32)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

at scala.Option.getOrElse(Option.scala:120)

at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1255)

at org.apache.spark.rdd.RDDOperationScope$.withScope(
RDDOperationScope.scala:147)

at org.apache.spark.rdd.RDDOperationScope$.withScope(
RDDOperationScope.scala:108)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)

at org.apache.spark.rdd.RDD.take(RDD.scala:1250)

at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1290)

at org.apache.spark.rdd.RDDOperationScope$.withScope(
RDDOperationScope.scala:147)

at org.apache.spark.rdd.RDDOperationScope$.withScope(
RDDOperationScope.scala:108)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)

at org.apache.spark.rdd.RDD.first(RDD.scala:1289)

at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(
CsvRelation.scala:129)

at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:127)

at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:109)

at com.databricks.spark.csv.CsvRelation.(CsvRelation.scala:62)

at com.databricks.spark.csv.DefaultSource.createRelation(
DefaultSource.scala:115)

at com.databricks.spark.csv.DefaultSource.createRelation(
DefaultSource.scala:40)

at com.databricks.spark.csv.DefaultSource.createRelation(
DefaultSource.scala:28)

at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:269)

at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)

at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)


Thanks



Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-10 Thread Shivaram Venkataraman
I think this is happening in the driver. Could you check the classpath
of the JVM that gets started? If you use spark-submit on YARN, the
classpath is set up before R gets launched, so it should match the
behavior of Scala / Python.

Thanks
Shivaram

On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves  wrote:
> I'm trying to use the netlib-java stuff with mllib and sparkR on yarn. I've
> compiled with -Pnetlib-lgpl and see the necessary things in the spark assembly
> jar.  The nodes have /usr/lib64/liblapack.so.3, /usr/lib64/libblas.so.3,
> and /usr/lib/libgfortran.so.3.
>
>
> Running:
> data <- read.df(sqlContext, 'data.csv', 'com.databricks.spark.csv')
> mdl = glm(C2~., data, family="gaussian")
>
> But I get the error:
> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemLAPACK
> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefLAPACK
> 15/11/06 21:17:27 ERROR RBackendHandler: fitRModelFormula on
> org.apache.spark.ml.api.r.SparkRWrappers failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.AssertionError: assertion failed: lapack.dpotrs returned 18.
>at scala.Predef$.assert(Predef.scala:179)
> at
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
> at
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:114)
> at
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:166)
> at
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:65)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:138)
> at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>
> Anyone have this working?
>
> Thanks,
> Tom




Re: Python Kafka support?

2015-11-10 Thread Saisai Shao
Hi Darren,

Functionality like messageHandler is missing from the Python API; it is still not
included in version 1.5.1.

Thanks
Jerry

On Wed, Nov 11, 2015 at 7:37 AM, Darren Govoni  wrote:

> Hi,
>  I read on this page
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
> about python support for "receiverless" kafka integration (Approach 2) but
> it says its incomplete as of version 1.4.
>
> Has this been updated in version 1.5.1?
>
> Darren
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Fengdong Yu
This is the simplest announcement I have seen.



> On Nov 11, 2015, at 12:49 AM, Reynold Xin  wrote:
> 
> Hi All,
> 
> Spark 1.5.2 is a maintenance release containing stability fixes. This release 
> is based on the branch-1.5 maintenance branch of Spark. We *strongly 
> recommend* all 1.5.x users to upgrade to this release.
> 
> The full list of bug fixes is here: http://s.apache.org/spark-1.5.2 
> 
> 
> http://spark.apache.org/releases/spark-release-1-5-2.html 
> 
> 
> 



Spark SQL reading json with pre-defined schema

2015-11-10 Thread ganesh.tiwari
I have a very large JSON dataset, and I want to avoid having Spark scan the data
to infer the schema. Instead, since I already know the data, I would prefer to
provide the schema myself with

sqlContext.read().schema(mySchema).json(jsonFilePath)

However, the problem is that the JSON data format is kind of weird:

[
  {
    "apiTypeName": "someApi",
    "allFieldsAndValues": {
      "Field_1": "Value",
      "Field_2": "Value",
      "Field_3": 779.0,
      "Field_4": "Value",
      "Field_5": true
    }
  },
  {
    "apiTypeName": "someApi",
    "allFieldsAndValues": {
      "Field_1": "Value",
      "Field_2": "Value",
      "Field_3": 779.0,
      "Field_4": "Value",
      "Field_5": true
    }
  }
]

I can't seem to construct a schema for this kind of data that Spark could
use to avoid inferring the schema on its own. With every combination of
StructType, StructField, and ArrayType I have tried to build the schema,
Spark wouldn't pick it up as I intend it to.


Any help is appreciated
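For what it's worth, a sketch of an explicit schema for this layout, with the field
names taken from the sample above and the types assumed from the sample values:

import org.apache.spark.sql.types._

// Nested struct for the "allFieldsAndValues" object; types are assumptions.
val allFieldsAndValues = StructType(Seq(
  StructField("Field_1", StringType, nullable = true),
  StructField("Field_2", StringType, nullable = true),
  StructField("Field_3", DoubleType, nullable = true),
  StructField("Field_4", StringType, nullable = true),
  StructField("Field_5", BooleanType, nullable = true)
))

// Top-level record: a string plus the nested struct above.
val mySchema = StructType(Seq(
  StructField("apiTypeName", StringType, nullable = true),
  StructField("allFieldsAndValues", allFieldsAndValues, nullable = true)
))

// Hypothetical path. Note that Spark's JSON reader expects one JSON object per line,
// so a single pretty-printed top-level array may also need to be reshaped before any
// schema (inferred or explicit) applies cleanly.
val df = sqlContext.read.schema(mySchema).json("/path/to/data.json")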



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-reading-json-with-pre-defined-schema-tp25353.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Python Kafka support?

2015-11-10 Thread Darren Govoni

Hi,
 I read on this page
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
about Python support for "receiverless" Kafka integration (Approach 2),
but it says it's incomplete as of version 1.4.


Has this been updated in version 1.5.1?

Darren




Re: Is it possible Running SparkR on 2 nodes without HDFS

2015-11-10 Thread Ali Tajeldin EDU
make sure 
"/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv" is 
accessible on your slave node.
--
Ali

On Nov 9, 2015, at 6:06 PM, Sanjay Subramanian 
 wrote:

> hey guys
> 
> I have a 2-node SparkR cluster (1 master, 1 slave) on AWS using 
> spark-1.5.1-bin-without-hadoop.tgz
> 
> Running the SparkR job on the master node 
> 
> /opt/spark-1.5.1-bin-hadoop2.6/bin/sparkR --master  
> spark://ip-xx-ppp-vv-ddd:7077 --packages com.databricks:spark-csv_2.10:1.2.0  
> --executor-cores 16 --num-executors 8 --executor-memory 8G --driver-memory 8g 
>   myRprogram.R
> 
> 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 
> in stage 1.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1.0 
> (TID 103, xx.ff.rr.tt): java.io.FileNotFoundException: File 
> file:/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:140)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
>   at org.apache.hadoop.mapred.LineRecordReader.(LineRecord
> 
> 
> 
> 
> 
> myRprogram.R
> 
> library(SparkR)
> 
> sc <- sparkR.init(appName="SparkR-CancerData-example")
> sqlContext <- sparkRSQL.init(sc)
> 
> lds <- read.df(sqlContext, 
> "file:///mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv",
>  "com.databricks.spark.csv", header="true")
> sink("file:///mnt/local/1024gbxvdf1/leads_new_data_analyis.txt")
> summary(lds)
> 
> 
> This used to run when we had a single node SparkR installation
> 
> regards
> 
> sanjay
> 
> 



Re: Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
(accidental keyboard-shortcut sent the message)
... spark-shell from the spark 1.5.2 binary distribution.
Also, running "spPublishLocal" has the same effect.

thanks,
--Jakob

On 10 November 2015 at 14:55, Jakob Odersky  wrote:

> Hi,
> I ran into in error trying to run spark-shell with an external package
> that I built and published locally
> using the spark-package sbt plugin (
> https://github.com/databricks/sbt-spark-package).
>
> To my understanding, spark packages can be published simply as maven
> artifacts, yet after running "publishLocal" in my package project (
> https://github.com/jodersky/spark-paperui), the following command
>
>    spark-shell --packages ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT
>
> gives an error:
>
> ::
>
> ::  UNRESOLVED DEPENDENCIES ::
>
> ::
>
> :: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
> required from org.apache.spark#spark-submit-parent;1.0 default
>
> ::
>
>
> :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> Exception in thread "main" java.lang.RuntimeException: [unresolved
> dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
> required from org.apache.spark#spark-submit-parent;1.0 default]
> at
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
> at
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12
>
> Do I need to include some default configuration? If so where and how
> should I do it? All other packages I looked at had no such thing.
>
> Btw, I am using spark-shell from a
>
>


Spark Packages Configuration Not Found

2015-11-10 Thread Jakob Odersky
Hi,
I ran into in error trying to run spark-shell with an external package that
I built and published locally
using the spark-package sbt plugin (
https://github.com/databricks/sbt-spark-package).

To my understanding, spark packages can be published simply as maven
artifacts, yet after running "publishLocal" in my package project (
https://github.com/jodersky/spark-paperui), the following command

   spark-shell --packages ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT

gives an error:

::

::  UNRESOLVED DEPENDENCIES ::

::

:: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
required from org.apache.spark#spark-submit-parent;1.0 default

::


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved
dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
required from org.apache.spark#spark-submit-parent;1.0 default]
at
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
at
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12

Do I need to include some default configuration? If so where and how should
I do it? All other packages I looked at had no such thing.

Btw, I am using spark-shell from a


RE: thought experiment: use spark ML to real time prediction

2015-11-10 Thread Kothuvatiparambil, Viju
I have a similar issue.  I want to load a model saved by a spark machine 
learning job, in a web application.

model.save(jsc.sc(), "myModelPath");

LogisticRegressionModel model = LogisticRegressionModel.load(jsc.sc(), "myModelPath");

When I do that, I need to pass a Spark context to load the model. The model is small
and can be saved to the local file system, so is there any way to use it without the
Spark context? It looks like creating a Spark context is an expensive step that
internally starts a Jetty server, and I do not want to start one more web server
inside a web application.

A solution that I received (pasted below) was to export the model into a 
generic format such as PMML. I haven't tried it, and I am hoping to find a way 
to use the model without adding a lot more dependencies and code to the project.


On Oct 30, 2015, at 2:11 PM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:
One possibility would be to export the model as a PMML (Predictive Model Markup 
Language, an XML-based standard to describe predictive models) and then use it 
in your Web app (using something like JPMML, for 
example). You can directly export (some) models (including LinReg) since Spark 
1.4: https://databricks.com/blog/2015/07/02/pmml-support-in-spark-mllib.html

For more info on PMML support on MLlib (including model support): 
https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
For more info on the PMML standard: 
http://dmg.org/pmml/v4-2-1/GeneralStructure.html
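A minimal sketch of the PMML route, assuming a model type that MLlib can export
(for example a binary LogisticRegressionModel); the paths are placeholders:

import org.apache.spark.mllib.classification.LogisticRegressionModel

// The model still has to be loaded once with a SparkContext...
val model = LogisticRegressionModel.load(sc, "myModelPath")

// ...but it can then be exported to PMML, and the resulting XML can be evaluated
// in the web application without Spark (e.g. with a PMML engine such as JPMML).
model.toPMML(sc, "hdfs:///tmp/myModel.pmml")  // write to a Hadoop-accessible path
val pmmlString = model.toPMML()               // or keep it as a string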


Thanks
Viju





From: Andy Davidson [mailto:a...@santacruzintegration.com]
Sent: Tuesday, November 10, 2015 1:32 PM
To: user @spark
Subject: thought experiment: use spark ML to real time prediction

Lets say I have use spark ML to train a linear model. I know I can save and 
load the model to disk. I am not sure how I can use the model in a real time 
environment. For example I do not think I can return a "prediction" to the 
client using spark streaming easily. Also for some applications the extra 
latency created by the batch process might not be acceptable.

If I was not using spark I would re-implement the model I trained in my batch 
environment in a lang like Java  and implement a rest service that uses the 
model to create a prediction and return the prediction to the client. Many 
models make predictions using linear algebra. Implementing predictions is 
relatively easy if you have a good vectorized LA package. Is there a way to use 
a model I trained using spark ML outside of spark?

As a motivating example, even if its possible to return data to the client 
using spark streaming. I think the mini batch latency would not be acceptable 
for a high frequency stock trading system.

Kind regards

Andy

P.s. The examples I have seen so far use spark streaming to "preprocess" 
predictions. For example a recommender system might use what current users are 
watching to calculate "trending recommendations". These are stored on disk and 
served up to users when they use the "movie guide". If a recommendation was a 
couple of minutes old it would not affect the end users' experience.



Re: Querying nested struct fields

2015-11-10 Thread pratik khadloya
That worked!! Thanks a lot Michael.

~Pratik

On Tue, Nov 10, 2015 at 12:02 PM Michael Armbrust 
wrote:

> Oh sorry _1 is not a valid hive identifier, you need to use backticks to
> escape it:
>
> Seq(((1, 2), 2)).toDF().registerTempTable("test")
> sql("SELECT `_1`.`_1` FROM test")
>
> On Tue, Nov 10, 2015 at 11:31 AM, pratik khadloya 
> wrote:
>
>> I tried the same, didn't work :(
>>
>> scala> hc.sql("select _1.item_id from agg_imps_df limit 10").collect()
>> 15/11/10 14:30:41 INFO parse.ParseDriver: Parsing command: select
>> _1.item_id from agg_imps_df limit 10
>> org.apache.spark.sql.AnalysisException: missing \' at 'from' near
>> ''; line 1 pos 23
>> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
>> at
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>> at
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
>> at
>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>> at
>> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>>
>> On Tue, Nov 10, 2015 at 11:25 AM Michael Armbrust 
>> wrote:
>>
>>> Use a `.`:
>>>
>>> hc.sql("select _1.item_id from agg_imps_df limit 10").collect()
>>>
>>> On Tue, Nov 10, 2015 at 11:24 AM, pratik khadloya 
>>> wrote:
>>>
 Hello,

 I just saved a PairRDD as a table, but i am not able to query it
 correctly. The below and other variations does not seem to work.

 scala> hc.sql("select * from agg_imps_df").printSchema()
  |-- _1: struct (nullable = true)
  ||-- item_id: long (nullable = true)
  ||-- flight_id: long (nullable = true)
  |-- _2: struct (nullable = true)
  ||-- day_hour: string (nullable = true)
  ||-- imps: long (nullable = true)
  ||-- revenue: double (nullable = true)


 scala> hc.sql("select _1:item_id from agg_imps_df limit 10").collect()


 Can anyone please suggest the correct way to get the list of item_ids
 in the query?

 Thanks,
 ~Pratik

>>>
>>>
>


PySpark: breakdown application execution time and fine-tuning the application

2015-11-10 Thread saluc
Hello,

I am using PySpark to develop my big-data application. I have the impression
that most of the execution of my application is spent on the infrastructure
(distributing the code and the data in the cluster, IPC between the Python
processes and the JVM) rather than on the  computation itself. I would be
interested in particular in measuring the time spent in the IPC between the
Python processes and the JVM.

I would like to ask you, is there a way to breakdown the execution time in
order to have more details on how much time is effectively spent on the
different phases of the execution, so to have some kind of detailed
profiling of the execution time, and have more information for fine-tuning
the application?

Thank you very much for your help and support,
Luca



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-breakdown-application-execution-time-and-fine-tuning-the-application-tp25350.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Using model saved by MLlib with out creating spark context

2015-11-10 Thread Viju K
Thank you for your suggestion, Stefano. I was hoping for an easier solution :)

Sent from my iPhone

> On Oct 30, 2015, at 2:11 PM, Stefano Baghino  
> wrote:
> 
> One possibility would be to export the model as a PMML (Predictive Model 
> Markup Language, an XML-based standard to describe predictive models) and 
> then use it in your Web app (using something like JPMML, for example). You 
> can directly export (some) models (including LinReg) since Spark 1.4: 
> https://databricks.com/blog/2015/07/02/pmml-support-in-spark-mllib.html
> 
> For more info on PMML support on MLlib (including model support): 
> https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
> For more info on the PMML standard: 
> http://dmg.org/pmml/v4-2-1/GeneralStructure.html
> 
>> On Fri, Oct 30, 2015 at 9:33 PM, vijuks  wrote:
>> I want to load a model saved by a spark machine learning job, in a web
>> application.
>> 
>> model.save(jsc.sc(), "myModelPath");
>> 
>> LogisticRegressionModel model = 
>> LogisticRegressionModel.load(
>> jsc.sc(), "myModelPath");
>> 
>> When I do that, I need to pass a spark context for loading the model.  The
>> model is small and can be saved to local file system, so is there any way to
>> use it with out the spark context?  Looks like creating spark context is an
>> expensive step that starts http server that listens on multiple ports.
>> 
>> 15/10/30 10:53:39 INFO HttpServer: Starting HTTP Server
>> 15/10/30 10:53:39 INFO Utils: Successfully started service 'HTTP file
>> server' on port 63341.
>> 15/10/30 10:53:39 INFO SparkEnv: Registering OutputCommitCoordinator
>> 15/10/30 10:53:40 INFO Utils: Successfully started service 'SparkUI' on port
>> 4040.
>> 15/10/30 10:53:40 INFO SparkUI: Started SparkUI at http://171.142.49.18:4040
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-model-saved-by-MLlib-with-out-creating-spark-context-tp25239.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 
> 
> -- 
> BR,
> Stefano Baghino
> 
> Software Engineer @ Radicalbit


Re: Anybody hit this issue in spark shell?

2015-11-10 Thread Ted Yu
In the PR, a new Scala style rule is added that bans use of @VisibleForTesting.

Similar rules can be added as needed.

Cheers

On Tue, Nov 10, 2015 at 11:22 AM, Shixiong Zhu  wrote:

> Scala compiler stores some metadata in the ScalaSig attribute. See the
> following link as an example:
>
>
> http://stackoverflow.com/questions/10130106/how-does-scala-know-the-difference-between-def-foo-and-def-foo/10130403#10130403
>
> As maven-shade-plugin doesn't recognize ScalaSig, it cannot fix the
> reference in it. Not sure if there is a Scala version of
> `maven-shade-plugin` to deal with it.
>
> Generally, annotations that will be shaded should not be used in the Scala
> codes. I'm wondering if we can expose this issue in the PR build. Because
> SBT build doesn't do the shading, now it's hard for us to find similar
> issues in the PR build.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-11-09 18:47 GMT-08:00 Ted Yu :
>
>> Created https://github.com/apache/spark/pull/9585
>>
>> Cheers
>>
>> On Mon, Nov 9, 2015 at 6:39 PM, Josh Rosen 
>> wrote:
>>
>>> When we remove this, we should add a style-checker rule to ban the
>>> import so that it doesn't get added back by accident.
>>>
>>> On Mon, Nov 9, 2015 at 6:13 PM, Michael Armbrust >> > wrote:
>>>
 Yeah, we should probably remove that.

 On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu  wrote:

> If there is no option to let shell skip processing @VisibleForTesting
> , should the annotation be dropped ?
>
> Cheers
>
> On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin 
> wrote:
>
>> We've had this in the past when using "@VisibleForTesting" in classes
>> that for some reason the shell tries to process. QueryExecution.scala
>> seems to use that annotation and that was added recently, so that's
>> probably the issue.
>>
>> BTW, if anyone knows how Scala can find a reference to the original
>> Guava class even after shading, I'd really like to know. I've looked
>> several times and never found where the original class name is stored.
>>
>> On Mon, Nov 9, 2015 at 10:37 AM, Zhan Zhang 
>> wrote:
>> > Hi Folks,
>> >
>> > Does anybody meet the following issue? I use "mvn package -Phive
>> > -DskipTests” to build the package.
>> >
>> > Thanks.
>> >
>> > Zhan Zhang
>> >
>> >
>> >
>> > bin/spark-shell
>> > ...
>> > Spark context available as sc.
>> > error: error while loading QueryExecution, Missing dependency 'bad
>> symbolic
>> > reference. A signature in QueryExecution.class refers to term
>> annotations
>> > in package com.google.common which is not available.
>> > It may be completely missing from the current classpath, or the
>> version on
>> > the classpath might be incompatible with the version used when
>> compiling
>> > QueryExecution.class.', required by
>> >
>> /Users/zzhang/repo/spark/assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.2.0.jar(org/apache/spark/sql/execution/QueryExecution.class)
>> > :10: error: not found: value sqlContext
>> >import sqlContext.implicits._
>> >   ^
>> > :10: error: not found: value sqlContext
>> >import sqlContext.sql
>> >   ^
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


thought experiment: use spark ML to real time prediction

2015-11-10 Thread Andy Davidson
Lets say I have use spark ML to train a linear model. I know I can save and
load the model to disk. I am not sure how I can use the model in a real time
environment. For example I do not think I can return a "prediction" to the
client using spark streaming easily. Also for some applications the extra
latency created by the batch process might not be acceptable.

If I was not using spark I would re-implement the model I trained in my
batch environment in a lang like Java  and implement a rest service that
uses the model to create a prediction and return the prediction to the
client. Many models make predictions using linear algebra. Implementing
predictions is relatively easy if you have a good vectorized LA package. Is
there a way to use a model I trained using spark ML outside of spark?
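As a sketch of the "relatively easy with linear algebra" point, assuming the weights
and intercept have been extracted once from the trained model inside Spark (e.g.
model.weights.toArray and model.intercept) and shipped to the service:

// Plain-JVM scoring sketch; the weight values below are placeholders.
case class LinearModel(weights: Array[Double], intercept: Double) {
  def predict(features: Array[Double]): Double = {
    require(features.length == weights.length, "feature/weight length mismatch")
    var dot = 0.0
    var i = 0
    while (i < weights.length) { dot += weights(i) * features(i); i += 1 }
    dot + intercept
  }
}

// Usage inside an ordinary REST service, no SparkContext needed:
val model = LinearModel(Array(0.5, -1.2, 3.0), intercept = 0.1)
val score = model.predict(Array(1.0, 0.0, 2.0))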

As a motivating example, even if its possible to return data to the client
using spark streaming. I think the mini batch latency would not be
acceptable for a high frequency stock trading system.

Kind regards

Andy

P.s. The examples I have seen so far use spark streaming to "preprocess"
predictions. For example a recommender system might use what current users
are watching to calculate "trending recommendations". These are stored on
disk and served up to users when they use the "movie guide". If a
recommendation was a couple of minutes old it would not affect the end users'
experience.





Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-11-10 Thread Holden Karau
So the SF version will be this Friday at
https://foursquare.com/v/coffee-mission/561ab392498e1bc38c3a7e8d (next to
the 24th st bart) from 2pm till 5:30pm , come by with your Spark questions
:D I'll try and schedule the on-line ones shortly but I had some unexpected
travel come up.

On Tue, Oct 27, 2015 at 11:43 AM, Holden Karau  wrote:

> So I'm going to try and do these again, with an on-line (
> http://doodle.com/poll/cr9vekenwims4sna ) and SF version (
> http://doodle.com/poll/ynhputd974d9cv5y ).  You can help me pick a day
> that works for you by filling out the doodle (if none of the days fit let
> me know and I can try and arrange something else) :).
>
> On Wed, Oct 21, 2015 at 11:54 AM, Jacek Laskowski  wrote:
>
>> Hi Holden,
>>
>> What a great idea! I'd love to join, but since I'm in Europe it's not
>> gonna happen by this Fri. Any plans to visit Europe or perhaps Warsaw,
>> Poland and host office hours here? ;-)
>>
>> p.s. What about an virtual event with Google Hangout on Air on?
>>
>> Pozdrawiam,
>> Jacek
>>
>> --
>> Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl
>> Follow me at https://twitter.com/jaceklaskowski
>> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>>
>>
>> On Wed, Oct 21, 2015 at 12:55 AM, Holden Karau 
>> wrote:
>> > Hi SF based folks,
>> >
>> > I'm going to try doing some simple office hours this Friday afternoon
>> > outside of Paramo Coffee. If no one comes by I'll just be drinking
>> coffee
>> > hacking on some Spark PRs so if you just want to hangout and hack on
>> Spark
>> > as a group come by too. (See
>> > https://twitter.com/holdenkarau/status/656592409455779840 ). If you
>> have any
>> > questions you'd like to ask in particular shoot me an e-mail in advance
>> I'll
>> > even try and be prepared.
>> >
>> > Cheers,
>> >
>> > Holden :)
>> > --
>> > Cell : 425-233-8271
>> > Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: NullPointerException with joda time

2015-11-10 Thread Ted Yu
I took a look at
https://github.com/JodaOrg/joda-time/blob/master/src/main/java/org/joda/time/DateTime.java
Looks like the NPE came from the line below:

long instant = getChronology().days().add(getMillis(), days);

Maybe catch the NPE and print out the value of currentDate to see if there
is more of a clue?
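For example, a sketch of that debugging step, placed inside the while loop in
allDates and using the variable names from the code quoted below:

// Wrap the failing call so the offending DateTime is logged before rethrowing.
try {
  currentDate = currentDate.plusDays(1)
} catch {
  case e: NullPointerException =>
    val chronology = if (currentDate != null) currentDate.getChronology else "null value"
    println(s"plusDays failed for currentDate=$currentDate, chronology=$chronology")
    throw e
}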

Cheers

On Tue, Nov 10, 2015 at 12:55 PM, Romain Sagean 
wrote:

> See below a more complete version of the code.
> The firstDate (previously minDate) should not be null; I even added an extra 
> "filter( _._2 != null)" before the flatMap and the error is still there.
>
> What I don't understand is why I have the error on dateSeq.last.plusDays and 
> not on dateSeq.last.isBefore (in the condition).
>
> I also tried changing the allDates function to use a while loop but I got the 
> same error.
>
>   def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
> var dateSeq = Seq(dateStart)
> var currentDate = dateStart
> while (currentDate.isBefore(dateEnd)){
>   dateSeq = dateSeq :+ currentDate
>   currentDate = currentDate.plusDays(1)
> }
> return dateSeq
>   }
>
> val videoAllDates = events.select("player_id", "current_ts")
>   .filter("player_id is not null")
>   .filter("current_ts is not null")
>   .map(row => (row.getString(0), timestampToDate(row.getString(1))))
>   .filter(r => r._2.isAfter(minimumDate))
>   .reduceByKey(minDateTime)
>   .flatMapValues(firstDate => allDates(firstDate, endDate))
>
>
> And the stack trace.
>
> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
> output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc:50821
> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 2 is 695 bytes
> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
> output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc:50821
> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 1 is 680 bytes
> 15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage 3.0
> (TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0 (TID
> 797, R610-2.pro.hupi.loc): java.lang.NullPointerException
> at org.joda.time.DateTime.plusDays(DateTime.java:1070)
> at Heatmap$.allDates(heatmap.scala:34)
> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.R

Re: How to configure logging...

2015-11-10 Thread Hitoshi
I don't have Akka, but with just Spark I edited log4j.properties to
"log4j.rootCategory=ERROR, console", ran the following command, and was
able to get only the Time row as output.

run-example org.apache.spark.examples.streaming.JavaNetworkWordCount
localhost 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-configure-logging-tp25346p25348.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: NullPointerException with joda time

2015-11-10 Thread Romain Sagean
See below a more complete version of the code.
The firstDate (previously minDate) should not be null; I even added an
extra "filter( _._2 != null)" before the flatMap and the error is
still there.

What I don't understand is why I have the error on
dateSeq.last.plusDays and not on dateSeq.last.isBefore (in the
condition).

I also tried changing the allDates function to use a while loop but I
got the same error.

  def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
var dateSeq = Seq(dateStart)
var currentDate = dateStart
while (currentDate.isBefore(dateEnd)){
  dateSeq = dateSeq :+ currentDate
  currentDate = currentDate.plusDays(1)
}
return dateSeq
  }

val videoAllDates = events.select("player_id", "current_ts")
  .filter("player_id is not null")
  .filter("current_ts is not null")
  .map(row => (row.getString(0), timestampToDate(row.getString(1))))
  .filter(r => r._2.isAfter(minimumDate))
  .reduceByKey(minDateTime)
  .flatMapValues(firstDate => allDates(firstDate, endDate))


And the stack trace.

15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc:50821
15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 2 is 695 bytes
15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc:50821
15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 1 is 680 bytes
15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage 3.0
(TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0 (TID
797, R610-2.pro.hupi.loc): java.lang.NullPointerException
at org.joda.time.DateTime.plusDays(DateTime.java:1070)
at Heatmap$.allDates(heatmap.scala:34)
at Heatmap$$anonfun$12.apply(heatmap.scala:97)
at Heatmap$$anonfun$12.apply(heatmap.scala:97)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
at
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
at
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/11/10 21:

Re: though experiment: Can I use spark streaming to replace all of my rest services?

2015-11-10 Thread Jörn Franke
Maybe you are looking for WebSockets/STOMP to get it to the end user? Or HTTP/2 with
STOMP in the future.

> On 10 Nov 2015, at 21:28, Andy Davidson  wrote:
> 
> I just finished watching a great presentation from a recent spark summit on 
> real time movie recommendations using spark.
> https://spark-summit.org/east-2015/talk/real-time-recommendations-using-spark 
> . For the purpose of email I am going to really simplify what they did. In 
> general their real time system took in data about what all users are watching 
> and calculates the most popular/trending shows. The results are stored in a 
> data base. When an individual user goes to “movie guide” they read the top 10 
> recommendations from a database. 
> 
> My guess is the part of their system that services up recommendations to 
> users in real time is not implemented using spark. Its probably a bunch of 
> rest servers sitting behind a bunch of proxy servers and load balancers. The 
> rest servers read the recommendations calculated using spark streaming.
> 
> This got me thinking. So in general we have spark handling batch, ingestion 
> of real time data but not the part of the system that delivers the real time 
> user experience. Ideally I would like to have one unified platform.
> 
> Using spark streaming with a small window size of say 100 ms would meet my 
> SLA. Each window is going to contain many unrelated requests. In the 
> recommender system example map() would look up the user specific 
> recommendation for each request. The trick is how to return the response to 
> the correct “client”. I could publish the response to some other system 
> (kafka? Or custom proxy?) that can truly return the data to the client. Is 
> this a good idea? What do people do in practice?
> 
> Also I assume I would have to use rdd.foreach() to somehow cause 
> the response data to be sent to the correct client. 
> 
> Comments and suggestions appreciated.
> 
> Kind regards
> 
> Andy


thought experiment: Can I use spark streaming to replace all of my rest services?

2015-11-10 Thread Andy Davidson
I just finished watching a great presentation from a recent spark summit on
real time movie recommendations using spark.
https://spark-summit.org/east-2015/talk/real-time-recommendations-using-spark
For the purposes of this email I am going to really simplify what they did. In
general their real time system takes in data about what all users are
watching and calculates the most popular/trending shows. The results are
stored in a database. When an individual user goes to “movie guide” they
read the top 10 recommendations from a database.

My guess is the part of their system that serves up recommendations to
users in real time is not implemented using spark. It's probably a bunch of
rest servers sitting behind a bunch of proxy servers and load balancers. The
rest servers read the recommendations calculated using spark streaming.

This got me thinking. So in general we have spark handling batch, ingestion
of real time data but not the part of the system that delivers the real time
user experience. Ideally I would like to have one unified platform.

Using spark streaming with a small window size of say 100 ms would meet my
SLA. Each window is going to contain many unrelated requests. In the
recommender system example map() would look up the user specific
recommendation for each request. The trick is how to return the response to
the correct “client”. I could publish the response to some other system
(kafka? Or custom proxy?) that can truly return the data to the client. Is
this a good idea? What do people do in practice?

Also I assume I would have to use rdd.foreach() to somehow cause
the response data to be sent to the correct client.

Comments and suggestions appreciated.

Kind regards

Andy
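
A rough sketch of the response-routing idea discussed above. Everything here is
illustrative: `Request`, `lookupRecommendation` and `publishToClient` are placeholders
for whatever the real system provides, and the wiring assumes a StreamingContext is
already set up elsewhere.

import org.apache.spark.streaming.dstream.DStream

case class Request(clientId: String, userId: String)

// placeholder: read the user-specific recommendation computed elsewhere
def lookupRecommendation(userId: String): String = ???
// placeholder: hand the result to something that can actually reach the client,
// e.g. a Kafka topic keyed by clientId or a callback to a proxy
def publishToClient(clientId: String, response: String): Unit = ???

def handle(requests: DStream[Request]): Unit = {
  requests
    .map(r => (r.clientId, lookupRecommendation(r.userId)))   // per-request lookup
    .foreachRDD { rdd =>
      rdd.foreachPartition { responses =>
        responses.foreach { case (clientId, response) =>
          publishToClient(clientId, response)
        }
      }
    }
}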




Re: Querying nested struct fields

2015-11-10 Thread Michael Armbrust
Oh sorry _1 is not a valid hive identifier, you need to use backticks to
escape it:

Seq(((1, 2), 2)).toDF().registerTempTable("test")
sql("SELECT `_1`.`_1` FROM test")

On Tue, Nov 10, 2015 at 11:31 AM, pratik khadloya 
wrote:

> I tried the same, didn't work :(
>
> scala> hc.sql("select _1.item_id from agg_imps_df limit 10").collect()
> 15/11/10 14:30:41 INFO parse.ParseDriver: Parsing command: select
> _1.item_id from agg_imps_df limit 10
> org.apache.spark.sql.AnalysisException: missing \' at 'from' near '';
> line 1 pos 23
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
> at
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
> at
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>
> On Tue, Nov 10, 2015 at 11:25 AM Michael Armbrust 
> wrote:
>
>> Use a `.`:
>>
>> hc.sql("select _1.item_id from agg_imps_df limit 10").collect()
>>
>> On Tue, Nov 10, 2015 at 11:24 AM, pratik khadloya 
>> wrote:
>>
>>> Hello,
>>>
>>> I just saved a PairRDD as a table, but I am not able to query it
>>> correctly. The below and other variations do not seem to work.
>>>
>>> scala> hc.sql("select * from agg_imps_df").printSchema()
>>>  |-- _1: struct (nullable = true)
>>>  |    |-- item_id: long (nullable = true)
>>>  |    |-- flight_id: long (nullable = true)
>>>  |-- _2: struct (nullable = true)
>>>  |    |-- day_hour: string (nullable = true)
>>>  |    |-- imps: long (nullable = true)
>>>  |    |-- revenue: double (nullable = true)
>>>
>>>
>>> scala> hc.sql("select _1:item_id from agg_imps_df limit 10").collect()
>>>
>>>
>>> Can anyone please suggest the correct way to get the list of item_ids in
>>> the query?
>>>
>>> Thanks,
>>> ~Pratik
>>>
>>
>>


Re: Querying nested struct fields

2015-11-10 Thread pratik khadloya
I tried the same, didn't work :(

scala> hc.sql("select _1.item_id from agg_imps_df limit 10").collect()
15/11/10 14:30:41 INFO parse.ParseDriver: Parsing command: select
_1.item_id from agg_imps_df limit 10
org.apache.spark.sql.AnalysisException: missing \' at 'from' near '';
line 1 pos 23
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
at
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

On Tue, Nov 10, 2015 at 11:25 AM Michael Armbrust 
wrote:

> Use a `.`:
>
> hc.sql("select _1.item_id from agg_imps_df limit 10").collect()
>
> On Tue, Nov 10, 2015 at 11:24 AM, pratik khadloya 
> wrote:
>
>> Hello,
>>
>> I just saved a PairRDD as a table, but I am not able to query it
>> correctly. The below and other variations do not seem to work.
>>
>> scala> hc.sql("select * from agg_imps_df").printSchema()
>>  |-- _1: struct (nullable = true)
>>  |    |-- item_id: long (nullable = true)
>>  |    |-- flight_id: long (nullable = true)
>>  |-- _2: struct (nullable = true)
>>  |    |-- day_hour: string (nullable = true)
>>  |    |-- imps: long (nullable = true)
>>  |    |-- revenue: double (nullable = true)
>>
>>
>> scala> hc.sql("select _1:item_id from agg_imps_df limit 10").collect()
>>
>>
>> Can anyone please suggest the correct way to get the list of item_ids in
>> the query?
>>
>> Thanks,
>> ~Pratik
>>
>
>


Re: Querying nested struct fields

2015-11-10 Thread Michael Armbrust
Use a `.`:

hc.sql("select _1.item_id from agg_imps_df limit 10").collect()

On Tue, Nov 10, 2015 at 11:24 AM, pratik khadloya 
wrote:

> Hello,
>
> I just saved a PairRDD as a table, but I am not able to query it
> correctly. The below and other variations do not seem to work.
>
> scala> hc.sql("select * from agg_imps_df").printSchema()
>  |-- _1: struct (nullable = true)
>  |    |-- item_id: long (nullable = true)
>  |    |-- flight_id: long (nullable = true)
>  |-- _2: struct (nullable = true)
>  |    |-- day_hour: string (nullable = true)
>  |    |-- imps: long (nullable = true)
>  |    |-- revenue: double (nullable = true)
>
>
> scala> hc.sql("select _1:item_id from agg_imps_df limit 10").collect()
>
>
> Can anyone please suggest the correct way to get the list of item_ids in
> the query?
>
> Thanks,
> ~Pratik
>


Querying nested struct fields

2015-11-10 Thread pratik khadloya
Hello,

I just saved a PairRDD as a table, but I am not able to query it correctly.
The below and other variations do not seem to work.

scala> hc.sql("select * from agg_imps_df").printSchema()
 |-- _1: struct (nullable = true)
 |    |-- item_id: long (nullable = true)
 |    |-- flight_id: long (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- day_hour: string (nullable = true)
 |    |-- imps: long (nullable = true)
 |    |-- revenue: double (nullable = true)


scala> hc.sql("select _1:item_id from agg_imps_df limit 10").collect()


Can anyone please suggest the correct way to get the list of item_ids in
the query?

Thanks,
~Pratik


Re: Anybody hit this issue in spark shell?

2015-11-10 Thread Shixiong Zhu
Scala compiler stores some metadata in the ScalaSig attribute. See the
following link as an example:

http://stackoverflow.com/questions/10130106/how-does-scala-know-the-difference-between-def-foo-and-def-foo/10130403#10130403

As maven-shade-plugin doesn't recognize ScalaSig, it cannot fix the
reference in it. Not sure if there is a Scala version of
`maven-shade-plugin` to deal with it.

Generally, annotations that will be shaded should not be used in the Scala
code. I'm wondering if we can expose this issue in the PR build; because the
SBT build doesn't do the shading, it's currently hard for us to find similar
issues there.

Best Regards,
Shixiong Zhu

2015-11-09 18:47 GMT-08:00 Ted Yu :

> Created https://github.com/apache/spark/pull/9585
>
> Cheers
>
> On Mon, Nov 9, 2015 at 6:39 PM, Josh Rosen 
> wrote:
>
>> When we remove this, we should add a style-checker rule to ban the import
>> so that it doesn't get added back by accident.
>>
>> On Mon, Nov 9, 2015 at 6:13 PM, Michael Armbrust 
>> wrote:
>>
>>> Yeah, we should probably remove that.
>>>
>>> On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu  wrote:
>>>
 If there is no option to let shell skip processing @VisibleForTesting
 , should the annotation be dropped ?

 Cheers

 On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin 
 wrote:

> We've had this in the past when using "@VisibleForTesting" in classes
> that for some reason the shell tries to process. QueryExecution.scala
> seems to use that annotation and that was added recently, so that's
> probably the issue.
>
> BTW, if anyone knows how Scala can find a reference to the original
> Guava class even after shading, I'd really like to know. I've looked
> several times and never found where the original class name is stored.
>
> On Mon, Nov 9, 2015 at 10:37 AM, Zhan Zhang 
> wrote:
> > Hi Folks,
> >
> > Does anybody meet the following issue? I use "mvn package -Phive
> > -DskipTests” to build the package.
> >
> > Thanks.
> >
> > Zhan Zhang
> >
> >
> >
> > bin/spark-shell
> > ...
> > Spark context available as sc.
> > error: error while loading QueryExecution, Missing dependency 'bad
> symbolic
> > reference. A signature in QueryExecution.class refers to term
> annotations
> > in package com.google.common which is not available.
> > It may be completely missing from the current classpath, or the
> version on
> > the classpath might be incompatible with the version used when
> compiling
> > QueryExecution.class.', required by
> >
> /Users/zzhang/repo/spark/assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.2.0.jar(org/apache/spark/sql/execution/QueryExecution.class)
> > :10: error: not found: value sqlContext
> >import sqlContext.implicits._
> >   ^
> > :10: error: not found: value sqlContext
> >import sqlContext.sql
> >   ^
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


save data as unique file on each slave node

2015-11-10 Thread Chuming Chen
Hi List,

I have a paired RDD, and I want to save the data of each partition as a file with a 
unique file name (path) on each slave node. Then I will invoke an external 
program from Spark to process those files on the slave nodes.

Is it possible to do that? 

Thanks.

Chuming
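
A minimal sketch of one way to do this, assuming `rdd` is the paired RDD from the
question, that `/shared/scratch` exists on every worker, and that `/opt/tools/process_file`
stands in for the external program (both paths are placeholders):

import java.io.{File, PrintWriter}
import org.apache.spark.TaskContext

rdd.foreachPartition { iter =>
  // one file per partition; the partition id keeps the name unique across the cluster
  val path = s"/shared/scratch/part-${TaskContext.get().partitionId()}"
  val out = new PrintWriter(new File(path))
  try iter.foreach { case (k, v) => out.println(s"$k\t$v") }
  finally out.close()

  // invoke the external program on the node that just wrote the file
  val exit = new ProcessBuilder("/opt/tools/process_file", path).inheritIO().start().waitFor()
  require(exit == 0, s"external program failed for $path")
}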


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is it possible Running SparkR on 2 nodes without HDFS

2015-11-10 Thread Sanjay Subramanian
Cool thanks. I have a CDH 5.4.8 (Cloudera Starving Developers Version) cluster with 1 NN
and 4 DN, and Spark is running but it's 1.3.x. I want to leverage this HDFS/Hive cluster
for SparkR because we do all data munging here and produce datasets for ML.
I am thinking of the following idea:
1. Add 2 datanodes to the existing HDFS cluster thru Cloudera Manager
2. Don't add any Spark service to these two new nodes
3. Download and install the latest 1.5.1 Spark on these two datanodes
4. Download and install R on these 2 datanodes
5. Configure Spark as 1 master and 1 slave on one node; on the second node, configure a slave
Will report back if this works!
thanks
sanjay

From: shenLiu 
 To: Sanjay Subramanian ; User 
 
 Sent: Monday, November 9, 2015 10:23 PM
 Subject: RE: Is it possible Running SparkR on 2 nodes without HDFS
   
Hi Sanjay,
It's better to use HDFS. Otherwise you should have copies of the csv file on 
all worker nodes with the same path.
regards
Shawn



Date: Tue, 10 Nov 2015 02:06:16 +
From: sanjaysubraman...@yahoo.com.INVALID
To: user@spark.apache.org
Subject: Is it possible Running SparkR on 2 nodes without HDFS

hey guys
I have a 2 node SparkR (1 master, 1 slave) cluster on AWS using 
spark-1.5.1-bin-without-hadoop.tgz.
Running the SparkR job on the master node: 
/opt/spark-1.5.1-bin-hadoop2.6/bin/sparkR --master  
spark://ip-xx-ppp-vv-ddd:7077 --packages com.databricks:spark-csv_2.10:1.2.0  
--executor-cores 16 --num-executors 8 --executor-memory 8G --driver-memory 8g   
myRprogram.R

  org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in 
stage 1.0 failed 4 times, most recent failure: Lost task 17.3 in stage 1.0 (TID 
103, xx.ff.rr.tt): java.io.FileNotFoundException: File 
file:/mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv 
does not exist at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
 at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
 at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
 at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) 
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:140)
 at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341) 
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766) at 
org.apache.hadoop.mapred.LineRecordReader.(LineRecord




myRprogram.R
library(SparkR)
sc <- sparkR.init(appName="SparkR-CancerData-example")
sqlContext <- sparkRSQL.init(sc)
lds <- read.df(sqlContext,
  "file:///mnt/local/1024gbxvdf1/all_adleads_cleaned_commas_in_quotes_good_file.csv",
  "com.databricks.spark.csv", header="true")
sink("file:///mnt/local/1024gbxvdf1/leads_new_data_analyis.txt")
summary(lds)

This used to run when we had a single node SparkR installation
regards
sanjay

 

  

Re: NullPointerException with joda time

2015-11-10 Thread Ted Yu
Can you show the stack trace for the NPE ?

Which release of Spark are you using ?

Cheers

On Tue, Nov 10, 2015 at 8:20 AM, romain sagean 
wrote:

> Hi community,
> I try to apply the function below during a flatMapValues or a map but I
> get a nullPointerException with the plusDays(1). What did I miss ?
>
> def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] = {
> if (dateSeq.last.isBefore(dateEnd)){
>   allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
> } else {
>   dateSeq
> }
>   }
>
> val videoAllDates = .select("player_id", "mindate").flatMapValues( minDate
> => allDates(Seq(minDate), endDate))
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


[ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Reynold Xin
Hi All,

Spark 1.5.2 is a maintenance release containing stability fixes. This
release is based on the branch-1.5 maintenance branch of Spark. We
*strongly recommend* all 1.5.x users to upgrade to this release.

The full list of bug fixes is here: http://s.apache.org/spark-1.5.2

http://spark.apache.org/releases/spark-release-1-5-2.html


NullPointerException with joda time

2015-11-10 Thread romain sagean

Hi community,
I try to apply the function below during a flatMapValues or a map but I 
get a nullPointerException with the plusDays(1). What did I miss ?


def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] = {
if (dateSeq.last.isBefore(dateEnd)){
  allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
} else {
  dateSeq
}
  }

val videoAllDates = .select("player_id", "mindate").flatMapValues( 
minDate => allDates(Seq(minDate), endDate))
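
One frequent cause of an NPE from plusDays is DateTime instances whose internals do not
survive serialization to the executors (e.g. Kryo with no Joda serializer registered). A
defensive sketch that side-steps this by shipping plain epoch millis and rebuilding the
DateTime values inside the task; `pairs` is a placeholder for the (player_id, minDateMillis)
pair RDD and is not from the original code:

import org.joda.time.DateTime

def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] =
  if (dateSeq.last.isBefore(dateEnd)) allDates(dateSeq :+ dateSeq.last.plusDays(1), dateEnd)
  else dateSeq

val endMillis = endDate.getMillis  // capture a primitive, not a DateTime

val videoAllDates = pairs.flatMapValues { minMillis: Long =>
  allDates(Seq(new DateTime(minMillis)), new DateTime(endMillis))
}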



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.5 UDAF ArrayType

2015-11-10 Thread Alex Nastetsky
Hi,

I believe I ran into the same bug in 1.5.0, although my error looks like
this:

Caused by: java.lang.ClassCastException:
[Lcom.verve.spark.sql.ElementWithCount; cannot be cast to
org.apache.spark.sql.types.ArrayData
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getArray(rows.scala:47)
at
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getArray(rows.scala:247)
at
org.apache.spark.sql.catalyst.expressions.JoinedRow.getArray(JoinedRow.scala:108)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
Source)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373)

...

I confirmed that it's fixed in 1.5.1, but unfortunately I'm using AWS EMR
4.1.0 (the latest), which has Spark 1.5.0. Are there any workarounds in
1.5.0?

Thanks.


> Michael



Thank you for your prompt answer. I will repost after I try this again on
> 1.5.1 or branch-1.5. In addition a blog post on SparkSQL data types would
> be very helpful. I am familiar with the Hive data types, but there is very
> little documentation on Spark SQL data types. Regards
>


Deenar On 22 September 2015 at 19:28, Michael Armbrust <
> mich...@databricks.com>
> wrote:



> I think that you are hitting a bug (which should be fixed in Spark
> > 1.5.1). I'm hoping we can cut an RC for that this week. Until then you
> > could try building branch-1.5.
> >
> > On Tue, Sep 22, 2015 at 11:13 AM, Deenar Toraskar <
> > deenar.toras...@gmail.com> wrote:
> >
> >> Hi
> >>
> >> I am trying to write an UDAF ArraySum, that does element wise sum of
> >> arrays of Doubles returning an array of Double following the sample in
> >>
> >>
> https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html
> .
> >> I am getting the following error. Any guidance on handle complex type in
> >> Spark SQL would be appreciated.
> >>
> >> Regards
> >> Deenar
> >>
> >> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> >> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> >> import org.apache.spark.sql.Row
> >> import org.apache.spark.sql.types._
> >> import org.apache.spark.sql.functions._
> >>
> >> class ArraySum extends UserDefinedAggregateFunction {
> >> def inputSchema: org.apache.spark.sql.types.StructType =
> >> StructType(StructField("value", ArrayType(DoubleType, false)) :: Nil)
> >>
> >> def bufferSchema: StructType =
> >> StructType(StructField("value", ArrayType(DoubleType, false)) :: Nil)
> >>
> >> def dataType: DataType = ArrayType(DoubleType, false)
> >>
> >> def deterministic: Boolean = true
> >>
> >> def initialize(buffer: MutableAggregationBuffer): Unit = {
> >> buffer(0) = Nil
> >> }
> >>
> >> def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
> >> val currentSum : Seq[Double] = buffer.getSeq(0)
> >> val currentRow : Seq[Double] = input.getSeq(0)
> >> buffer(0) = (currentSum, currentRow) match {
> >> case (Nil, Nil) => Nil
> >> case (Nil, row) => row
> >> case (sum, Nil) => sum
> >> case (sum, row) => (seq, anotherSeq).zipped.map{ case (a, b) => a +
> >> b }
> >> // TODO handle different sizes arrays here
> >> }
> >> }
> >>
> >> def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> >> val currentSum : Seq[Double] = buffer1.getSeq(0)
> >> val currentRow : Seq[Double] = buffer2.getSeq(0)
> >> buffer1(0) = (currentSum, currentRow) match {
> >> case (Nil, Nil) => Nil
> >> case (Nil, row) => row
> >> case (sum, Nil) => sum
> >> case (sum, row) => (seq, anotherSeq).zipped.map{ case (a, b) => a +
> >> b }
> >> // TODO handle different sizes arrays here
> >> }
> >> }
> >>
> >> def evaluate(buffer: Row): Any = {
> >> buffer.getSeq(0)
> >> }
> >> }
> >>
> >> val arraySum = new ArraySum
> >> sqlContext.udf.register("ArraySum", arraySum)
> >>
> >> *%sql select ArraySum(Array(1.0,2.0,3.0)) from pnls where date =
> >> '2015-05-22' limit 10*
> >>
> >> gives me the following error
> >>
> >>
> >> Error in SQL statement: SparkException: Job aborted due to stage
> failure:
> >> Task 0 in stage 219.0 failed 4 times, most recent failure: Lost task
> 0.3 in
> >> stage 219.0 (TID 11242, 10.172.255.236): java.lang.ClassCastException:
> >> scala.collection.mutable.WrappedArray$ofRef cannot be cast to
> >> org.apache.spark.sql.types.ArrayData at
> >>
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getArray(rows.scala:47)
> >> at
> >>
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getArray(rows.scala:247)
> >> at
> >>
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getArray(JoinedRow.scala:108)
> >> at
> >>
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
> >> Source) at
> >>
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373)
> >> at
> >>
> org.apache.spark.sql.execution.aggre
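
For reference, a compilable sketch of the element-wise array-sum UDAF from the quoted
thread, assuming Spark 1.5.1+ (where the ArrayData cast issue discussed above is fixed);
the undefined `(seq, anotherSeq)` references in the original are replaced with the actual
buffer and input sequences, and equal-length arrays are assumed:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ArraySum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", ArrayType(DoubleType, false)) :: Nil)
  def bufferSchema: StructType = StructType(StructField("value", ArrayType(DoubleType, false)) :: Nil)
  def dataType: DataType = ArrayType(DoubleType, false)
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[Double]

  // element-wise sum; an empty side just yields the other side
  private def sumSeqs(a: Seq[Double], b: Seq[Double]): Seq[Double] =
    if (a.isEmpty) b
    else if (b.isEmpty) a
    else (a, b).zipped.map(_ + _)

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = sumSeqs(buffer.getSeq[Double](0), input.getSeq[Double](0))

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = sumSeqs(buffer1.getSeq[Double](0), buffer2.getSeq[Double](0))

  def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
}

// sqlContext.udf.register("ArraySum", new ArraySum)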

A question about accumulator

2015-11-10 Thread Tan Tim
Hi, all

There is a discussion about the accumulator in stack overflow:
http://stackoverflow.com/questions/27357440/spark-accumalator-value-is-different-when-inside-rdd-and-outside-rdd

I commented on this question (as user Tim). Based on the output I got, I
have two questions:
1. Why is the addInPlace function called twice?
2. Why is the order of the two outputs different?

Any suggestion will be appreciated.
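
A small sketch of the general behaviour behind both questions (it does not reproduce the
exact trace in the linked post): accumulator updates made inside a transformation are only
applied when an action runs, they are applied once per task execution, and the per-task
results are merged back on the driver in no fixed order.

val acc = sc.accumulator(0)
val rdd = sc.parallelize(1 to 10, 4).map { x => acc += 1; x }

println(acc.value)  // 0 -- the map has not run yet
rdd.count()
println(acc.value)  // 10
rdd.count()
println(acc.value)  // 20 -- the same updates were applied again for the second action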


Re: could not understand issue about static spark Function (map / sortBy ...)

2015-11-10 Thread Zhiliang Zhu
I have got all the issues sorted out, after quite a lot of testing.

A Function can only be defined inside a static method body, or as a static member
variable. A Function can also be defined as a static inner class; such a class can have
its own member variables and methods, the variables can be passed in when the Function
object is constructed, and those inner methods can be called from within the Function.


 On Tuesday, November 10, 2015 5:12 PM, Zhiliang Zhu  
wrote:
   

 After more testing: the Function passed to map/sortBy etc. must be defined as static, 
or it can be defined as non-static but must then be called from another static 
method. I am really confused by it. 


 On Tuesday, November 10, 2015 4:12 PM, Zhiliang Zhu 
 wrote:
   

 Hi All,
I have met a bug I cannot understand, as follows:

class A {
  private JavaRDD<Vector> _com_rdd;
  ...
  // here it must be static, but not every Function (for map etc.) would be static,
  // as in the code examples in Spark's own official docs
  static Function<Vector, Vector> mapParseRow = new Function<Vector, Vector>() {
    @Override
    public Vector call(Vector v) {
      System.out.println("mark. map log is here");
      Vector rt;
      ...   // if this needs to call some other non-static function, how can it be done?
      return rt;
    }
  };

  public void run() { // it will be called from some other public class via an A object
    ...
    JavaRDD<Vector> rdd = (this._com_rdd).map(mapParseRow); // it will cause a failure while mapParseRow is not static
    ...
  }
}

Would you help comment on it? What should be done?

Thank you in advance!
Zhiliang





 On Tuesday, November 10, 2015 11:42 AM, Deng Ching-Mallete 
 wrote:
   

 Hi Zhiliang,
You should be able to see them in the executor logs, which you can view via the 
Spark UI, in the Executors page (stderr log).
HTH,
Deng

On Tue, Nov 10, 2015 at 11:33 AM, Zhiliang Zhu  
wrote:

Hi All,
I need to debug a Spark job. My general way is to print out logs; however, some 
bugs are inside Spark functions such as mapPartitions etc., and no log printed from 
those functions could be found... Would you help point out how to get at the logs from 
inside Spark's own functions such as mapPartitions? Or, what is the general way to 
debug a Spark job?
Thank you in advance!
Zhiliang 





   

   

  

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Jörn Franke
I would have to check the Spark source code, but theoretically you can limit 
the number of threads at the JVM level. Maybe Spark does this. Alternatively, you 
can use cgroups, but this introduces other complexity.

> On 10 Nov 2015, at 14:33, Peter Rudenko  wrote:
> 
> Hi i have a question: how does the cores isolation works on spark on yarn. 
> E.g. i have a machine with 8 cores, but launched a worker with 
> --executor-cores 1, and after doing something like:
> 
> rdd.foreachPartition(=>{for all visible cores: burn core in a new thread})
> 
> Will it see 1 core or all 8 cores?
> 
> Thanks,
> Peter Rudenko
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
As I've tried cgroups - it seems the isolation is done by percentage, not by 
core count. E.g. I've set min share to 256 - I still see all 8 cores, 
but I could only load 20% of each core.


Thanks,
Peter Rudenko
On 2015-11-10 15:52, Saisai Shao wrote:
From my understanding, it depends on whether you enabled CGroup 
isolation or not in Yarn. By default it is not, which means you could 
allocate one core but bump a lot of thread in your task to occupy the 
CPU resource, this is just a logic limitation. For Yarn CPU isolation 
you may refer to this post 
(http://hortonworks.com/blog/apache-hadoop-yarn-in-hdp-2-2-isolation-of-cpu-resources-in-your-hadoop-yarn-clusters/). 



Thanks
Jerry

On Tue, Nov 10, 2015 at 9:33 PM, Peter Rudenko 
mailto:petro.rude...@gmail.com>> wrote:


Hi i have a question: how does the cores isolation works on spark
on yarn. E.g. i have a machine with 8 cores, but launched a worker
with --executor-cores 1, and after doing something like:

rdd.foreachPartition(=>{for all visible cores: burn core in a new
thread})

Will it see 1 core or all 8 cores?

Thanks,
Peter Rudenko






Re: What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
N/m, these are just profiling snapshots :) Sorry for the wide distribution.

On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg  wrote:

> We're seeing a bunch of .snapshot files being created under
> /home/spark/Snapshots, such as the following for example:
>
> CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot
> CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot
> SparkSubmit-2015-08-31-shutdown-1.snapshot
> Worker-2015-08-27-shutdown.snapshot
>
> These files are large and they blow out our disk space in some
> environments.
>
> What are these, when are they created and for what purpose?  Is there a
> way to control how they're generated and most importantly where they're
> stored?
>
> Thanks.
>


What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
We're seeing a bunch of .snapshot files being created under
/home/spark/Snapshots, such as the following for example:

CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot
CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot
SparkSubmit-2015-08-31-shutdown-1.snapshot
Worker-2015-08-27-shutdown.snapshot

These files are large and they blow out our disk space in some environments.

What are these, when are they created and for what purpose?  Is there a way
to control how they're generated and most importantly where they're stored?

Thanks.


Re: [Yarn] Executor cores isolation

2015-11-10 Thread Saisai Shao
From my understanding, it depends on whether you enabled CGroup isolation
or not in Yarn. By default it is not, which means you could allocate one
core but spawn a lot of threads in your task to occupy the CPU resource; this
is just a logical limitation. For Yarn CPU isolation you may refer to this
post (
http://hortonworks.com/blog/apache-hadoop-yarn-in-hdp-2-2-isolation-of-cpu-resources-in-your-hadoop-yarn-clusters/
).

Thanks
Jerry

On Tue, Nov 10, 2015 at 9:33 PM, Peter Rudenko 
wrote:

> Hi i have a question: how does the cores isolation works on spark on yarn.
> E.g. i have a machine with 8 cores, but launched a worker with
> --executor-cores 1, and after doing something like:
>
> rdd.foreachPartition(=>{for all visible cores: burn core in a new thread})
>
> Will it see 1 core or all 8 cores?
>
> Thanks,
> Peter Rudenko
>
>


[Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
Hi i have a question: how does the cores isolation works on spark on 
yarn. E.g. i have a machine with 8 cores, but launched a worker with 
--executor-cores 1, and after doing something like:


rdd.foreachPartition(=>{for all visible cores: burn core in a new thread})

Will it see 1 core or all 8 cores?

Thanks,
Peter Rudenko
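
A concrete version of the pseudo-code above, as a sketch you can run to see what the
executor JVM reports and how the node behaves under load (assuming `sc` is the
SparkContext; on YARN without CGroup isolation the JVM typically still reports the
machine's physical cores regardless of --executor-cores):

sc.parallelize(1 to 4, 4).foreachPartition { _ =>
  val visible = Runtime.getRuntime.availableProcessors()
  println(s"executor JVM sees $visible cores")
  val threads = (1 to visible).map { _ =>
    new Thread(new Runnable {
      def run(): Unit = {
        val stop = System.currentTimeMillis() + 5000
        while (System.currentTimeMillis() < stop) {} // burn the core for ~5s
      }
    })
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
}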



Re: AnalysisException Handling for unspecified field in Spark SQL

2015-11-10 Thread Arvin
Hi, you can add the new column to your DataFrame just before your query. The
added column will be null in your new table, so you can have what you want.
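
For example, a minimal sketch (assuming `df` is the DataFrame being queried and
"new_field" is a placeholder for the column the query expects but the source lacks):

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val patched = df.withColumn("new_field", lit(null).cast(StringType))
patched.registerTempTable("patched_table")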



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/AnalysisException-Handling-for-unspecified-field-in-Spark-SQL-tp25343p25344.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread Andrés Ivaldi
Hi,
Cassandra looks very interesting and it seems to fit right, but it looks
like it needs too much work to get the proper configuration, which depends
on the data. What we need is a generic structure with as little
configuration as possible, because the end users don't have the know-how to do
that.

Please let me know if we made a bad interpretation of Cassandra, so we
can take a look at it again.

Best Regards!!


On Mon, Nov 9, 2015 at 11:11 PM, fightf...@163.com 
wrote:

> Hi,
>
> Have you ever considered cassandra as a replacement? We have now almost
> the same usage as your engine, e.g. using mysql to store
>
> initial aggregated data. Can you share more about your kind of Cube
> queries ? We are very interested in that arch too : )
>
> Best,
> Sun.
> --
> fightf...@163.com
>
>
> *From:* Andrés Ivaldi 
> *Date:* 2015-11-10 07:03
> *To:* tsh 
> *CC:* fightf...@163.com; user ; dev
> 
> *Subject:* Re: OLAP query using spark dataframe with cassandra
> Hi,
> I'm also considering something similar, Spark plain is too slow for my
> case, a possible solution is use Spark as Multiple Source connector and
> basic transformation layer, then persist the information (actually is a
> RDBM), after that with our engine we build a kind of Cube queries, and the
> result is processed again by Spark adding Machine Learning.
> Our missing part is to replace the RDBM with something more suitable and
> scalable than an RDBM; we don't care about pre-processing the information if, after
> pre-processing, the queries are fast.
>
> Regards
>
> On Mon, Nov 9, 2015 at 3:56 PM, tsh  wrote:
>
>> Hi,
>>
>> I'm in the same position right now: we are going to implement something
>> like OLAP BI + Machine Learning explorations on the same cluster.
>> Well, the question is quite ambivalent: from one hand, we have terabytes
>> of versatile data and the necessity to make something like cubes (Hive and
>> Hive on HBase are unsatisfactory). From the other, our users get accustomed
>> to Tableau + Vertica.
>> So, right now I consider the following choices:
>> 1) Platfora (not free, I don't know price right now) + Spark
>> 2) AtScale + Tableau(not free, I don't know price right now) + Spark
>> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
>> storage
>> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
>> Flume (has somebody use it in production?)
>> 5) Spark + Tableau  (cubes?)
>>
>> For myself, I decided not to dive into Mesos. Cassandra is hardly
>> configurable, you'll have to dedicate special employee to support it.
>>
>> I'll be glad to hear other ideas & propositions as we are at the
>> beginning of the process too.
>>
>> Sincerely yours, Tim Shenkao
>>
>>
>> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>>
>> Hi,
>>
>> Thanks for suggesting. Actually we are now evaluating and stressing the
>> spark sql on cassandra, while
>>
>> trying to define business models. FWIW, the solution mentioned here is
>> different from traditional OLAP
>>
>> cube engine, right ? So we are hesitating on the common sense or
>> direction choice of olap architecture.
>>
>> And we are happy to hear more use case from this community.
>>
>> Best,
>> Sun.
>>
>> --
>> fightf...@163.com
>>
>>
>> *From:* Jörn Franke 
>> *Date:* 2015-11-09 14:40
>> *To:* fightf...@163.com
>> *CC:* user ; dev 
>> *Subject:* Re: OLAP query using spark dataframe with cassandra
>>
>> Is there any distributor supporting these software components in
>> combination? If no and your core business is not software then you may want
>> to look for something else, because it might not make sense to build up
>> internal know-how in all of these areas.
>>
>> In any case - it depends all highly on your data and queries. You will
>> have to do your own experiments.
>>
>> On 09 Nov 2015, at 07:02, "fightf...@163.com"  wrote:
>>
>> Hi, community
>>
>> We are specially interested about this featural integration according to
>> some slides from [1]. The SMACK(Spark+Mesos+Akka+Cassandra+Kafka)
>>
>> seems good implementation for lambda architecure in the open-source
>> world, especially non-hadoop based cluster environment. As we can see,
>>
>> the advantages obviously consist of :
>>
>> 1 the feasibility and scalability of spark datafram api, which can also
>> make a perfect complement for Apache Cassandra native cql feature.
>>
>> 2 both streaming and batch process availability using the ALL-STACK
>> thing, cool.
>>
>> 3 we can both achieve compacity and usability for spark with cassandra,
>> including seemlessly integrating with job scheduling and resource
>> management.
>>
>> Only one concern goes to the OLAP query performance issue, which mainly
>> caused by frequent aggregation work between daily increased large tables,
>> for
>>
>> both spark sql and cassandra. I can see that the [1] use case facilitates
>> FiloDB to achieve columnar storage and query performance, but we had
>> nothing more
>>
>> knowledg

Re: Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread Andrés Ivaldi
Hi,

We have been evaluating Apache Kylin; how flexible is it? I mean, we need
to create the cube structure dynamically and populate it from different
sources. The processing time is not too important; what is important is the
response time on queries.

Thanks.

On Mon, Nov 9, 2015 at 11:01 PM, fightf...@163.com 
wrote:

> Hi,
>
> According to my experience, I would recommend option 3) using Apache Kylin
> for your requirements.
>
> This is a suggestion based on the open-source world.
>
> For the per cassandra thing, I accept your advice for the special support
> thing. But the community is very
>
> open and convinient for prompt response.
>
> --
> fightf...@163.com
>
>
> *From:* tsh 
> *Date:* 2015-11-10 02:56
> *To:* fightf...@163.com; user ; dev
> 
> *Subject:* Re: OLAP query using spark dataframe with cassandra
> Hi,
>
> I'm in the same position right now: we are going to implement something
> like OLAP BI + Machine Learning explorations on the same cluster.
> Well, the question is quite ambivalent: from one hand, we have terabytes
> of versatile data and the necessity to make something like cubes (Hive and
> Hive on HBase are unsatisfactory). From the other, our users get accustomed
> to Tableau + Vertica.
> So, right now I consider the following choices:
> 1) Platfora (not free, I don't know price right now) + Spark
> 2) AtScale + Tableau(not free, I don't know price right now) + Spark
> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
> storage
> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
> Flume (has somebody use it in production?)
> 5) Spark + Tableau  (cubes?)
>
> For myself, I decided not to dive into Mesos. Cassandra is hardly
> configurable, you'll have to dedicate special employee to support it.
>
> I'll be glad to hear other ideas & propositions as we are at the beginning
> of the process too.
>
> Sincerely yours, Tim Shenkao
>
> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>
> Hi,
>
> Thanks for suggesting. Actually we are now evaluating and stressing the
> spark sql on cassandra, while
>
> trying to define business models. FWIW, the solution mentioned here is
> different from traditional OLAP
>
> cube engine, right ? So we are hesitating on the common sense or direction
> choice of olap architecture.
>
> And we are happy to hear more use case from this community.
>
> Best,
> Sun.
>
> --
> fightf...@163.com
>
>
> *From:* Jörn Franke 
> *Date:* 2015-11-09 14:40
> *To:* fightf...@163.com
> *CC:* user ; dev 
> *Subject:* Re: OLAP query using spark dataframe with cassandra
>
> Is there any distributor supporting these software components in
> combination? If no and your core business is not software then you may want
> to look for something else, because it might not make sense to build up
> internal know-how in all of these areas.
>
> In any case - it depends all highly on your data and queries. You will
> have to do your own experiments.
>
> On 09 Nov 2015, at 07:02, "fightf...@163.com"  wrote:
>
> Hi, community
>
> We are specially interested about this featural integration according to
> some slides from [1]. The SMACK(Spark+Mesos+Akka+Cassandra+Kafka)
>
> seems good implementation for lambda architecure in the open-source world,
> especially non-hadoop based cluster environment. As we can see,
>
> the advantages obviously consist of :
>
> 1 the feasibility and scalability of spark datafram api, which can also
> make a perfect complement for Apache Cassandra native cql feature.
>
> 2 both streaming and batch process availability using the ALL-STACK thing,
> cool.
>
> 3 we can both achieve compacity and usability for spark with cassandra,
> including seemlessly integrating with job scheduling and resource
> management.
>
> Only one concern goes to the OLAP query performance issue, which mainly
> caused by frequent aggregation work between daily increased large tables,
> for
>
> both spark sql and cassandra. I can see that the [1] use case facilitates
> FiloDB to achieve columnar storage and query performance, but we had
> nothing more
>
> knowledge.
>
> Question is : Any guy had such use case for now, especially using in your
> production environment ? Would be interested in your architeture for
> designing this
>
> OLAP engine using spark +  cassandra. What do you think the comparison
> between the scenario with traditional OLAP cube design? Like Apache Kylin
> or
>
> pentaho mondrian ?
>
> Best Regards,
>
> Sun.
>
>
> [1]
> 
> http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>
> --
> fightf...@163.com
>
>
>


-- 
Ing. Ivaldi Andres


Save to distributed file system from worker processes

2015-11-10 Thread bikash.mnr
I am quite new with pyspark. In my application with pyspark, I want to
achieve following things:

-- Create an RDD using a Python list and partition it into some partitions.
-- Now use rdd.foreachPartition(func)
-- Here, the function "func" performs an iterative operation which
reads the content of a saved file into a local variable (e.g. a numpy array),
performs some updates using the RDD partition data, and again saves the content
of the variable to some common file system.

I am not able to figure out how to read and write a variable inside a worker
process to some common shared system which is accessible to all processes.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Save-to-distributed-file-system-from-worker-processes-tp25342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NoSuchElementException: key not found

2015-11-10 Thread Ankush Khanna

Any suggestions, anyone?
Using version 1.5.1.

Regards
Ankush Khanna

On Nov 10, 2015, at 11:37 AM, Ankush Khanna  wrote:

Hi,

I was working with a simple task (running locally): just reading a file (35 MB) 
with about 200 features and building a random forest with 5 trees of depth 5. 
While saving the file with:
predictions.select("VisitNumber", "probability")
   .write.format("json") // tried different formats
   .mode(SaveMode.Overwrite)
   .option("header", "true")
   .save("finalResult2")

I get an error: java.util.NoSuchElementException: key not found: 
-2.379675967804967E-16 (Stack trace below)

Just for your info, I was not getting this error earlier; it started some time 
ago and I am not able to get rid of it. 
My SPARK CONFIG is simple: 
val conf = new SparkConf() 
.setAppName("walmart")
.setMaster("local[2]")

== Physical Plan ==
Project 
[VisitNumber#52,UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)) AS 
features#81,UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59))) AS 
indexedFeatures#167,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59
 AS 
rawPrediction#168,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)
 AS 
probability#169,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)
 AS 
prediction#170,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59))
 AS predictedLabel#171]
   SortBasedAggregate(key=[VisitNumber#52], functions=   [  
(ConcatenateString(DepartmentDescription#56),mode=Final,isDistinct=false)], 
output=[VisitNumber#52,Agg-DepartmentDescription#59])
ConvertToSafe
TungstenSort [VisitNumber#52 ASC], false, 0
TungstenExchange hashpartitioning(VisitNumber#52)
ConvertToUnsafe
SortBasedAggregate(key=[VisitNumber#52], 
functions=[(ConcatenateString(DepartmentDescription#56),mode=Partial,isDistinct=false)],
 output=[VisitNumber#52,concatenate#65])
ConvertToSafe
TungstenSort [VisitNumber#52 ASC], false, 0
TungstenProject [VisitNumber#52,DepartmentDescription#56]
Scan 
CsvRelation(src/main/resources/test.csv,true,,,",null,#,PERMISSIVE,COMMONS,false,false,null,UTF-8,true)[VisitNumber#52,Weekday#53,Upc#54L,ScanCount#55,DepartmentDescription#56,FinelineNumber#57]


# STACK TRACE #
java.util.NoSuchElementException: key not found: -2.379675967804967E-16
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:308)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:307)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:307)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:301)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:343)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:343)
at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:241)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/11/10 11:35:0
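
The trace points into VectorIndexerModel's category maps, which typically indicates a
feature value seen at transform time that was never seen when the indexer was fit (a
near-zero value like -2.379675967804967E-16 suggests a continuous column being treated as
categorical). A sketch of the usual mitigations, assuming a VectorIndexer stage and
hypothetical `trainingPlusTest` / `testData` DataFrames:

import org.apache.spark.ml.feature.VectorIndexer

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)  // columns with more than this many distinct values stay continuous

// fit on data that contains every value you will later transform
val indexerModel = indexer.fit(trainingPlusTest)
val indexed = indexerModel.transform(testData)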

Re: Overriding Derby in hive-site.xml giving strange results...

2015-11-10 Thread Gaurav Tiwari
So which one will be used during query execution? What we observed is that, no
matter what, queries are getting executed on Hive with Derby.

Also, there is no dialect defined for HSQLDB; do you think this will work?

On Tue, Nov 10, 2015 at 5:07 AM, Michael Armbrust 
wrote:

> We have two copies of hive running in order to support multiple versions
> of hive with a single version of Spark.  You are see log messages for the
> version that we use for execution (it just creates a temporary derby
> metastore).
>
> On Mon, Nov 9, 2015 at 3:32 PM, mayurladwa 
> wrote:
>
>> Hello, this question is around the hive thrift server that comes packaged
>> with spark 1.5.1, I am trying to change the default metastore from derby.
>>
>> From googling I see the more commonly documented alternatives to derby are
>> MySQL, but as it exposes a JDBC interface I want to try and get this
>> working
>> with HSQL (2.0).
>>
>> I'm overriding the following in the hive-site.xml:
>>
>> <property>
>>   <name>javax.jdo.option.ConnectionURL</name>
>>   <value>jdbc:hsqldb:hsql://myhost:58090/default</value>
>>   <description>JDBC connect string for a JDBC metastore</description>
>> </property>
>>
>> <property>
>>   <name>javax.jdo.option.ConnectionDriverName</name>
>>   <value>org.hsqldb.jdbc.JDBCDriver</value>
>>   <description>Driver class name for a JDBC metastore</description>
>> </property>
>>
>> <property>
>>   <name>javax.jdo.option.ConnectionUserName</name>
>>   <value>user</value>
>>   <description>username to use against metastore database</description>
>> </property>
>>
>> <property>
>>   <name>javax.jdo.option.ConnectionPassword</name>
>>   <value>pwd</value>
>>   <description>password to use against metastore database</description>
>> </property>
>>
>> What's really strange is that I see some hive tables created in my HSQL
>> database when my spark hive thrift server is running, but when I do a
>> query
>> I see it switches back to derby! I get logs like this:
>>
>> /15/11/09 11:59:16 DEBUG ObjectStore: *Overriding
>> javax.jdo.option.ConnectionURL value null from  jpox.properties with
>> jdbc:hsqldb:hsql://myhost:58090/default*/
>>
>> And then later I see this:
>>
>> /15/11/09 11:59:18 DEBUG ObjectStore: *Overriding
>> javax.jdo.option.ConnectionURL value null from  jpox.properties with
>>
>> jdbc:derby:;databaseName=/tmp/spark-acb48194-09a7-4beb-b5fc-*ffc0216449c8/metastore;create=true/
>> ...
>> /15/11/09 11:59:18 DEBUG Transaction: Transaction committed in 1 ms
>> 15/11/09 11:59:18 INFO MetaStoreDirectSql: Using direct SQL, *underlying
>> DB
>> is DERBY*
>> 15/11/09 11:59:18 DEBUG ObjectStore: RawStore:
>> org.apache.hadoop.hive.metastore.ObjectStore@7ecab68e, with
>> PersistenceManager: org.datanucleus.api.jdo.JDOPersistenceManager@61e68dab
>> created in the thread with id: 1
>> 15/11/09 11:59:18 INFO ObjectStore: Initialized ObjectStore/
>>
>> So not totally sure how this is getting switched back to derby, or why it
>> thinks later on that the jpox.properties I am overriding in the
>> hive-site.xml area is suddenly null?
>>
>> Any help would be much appreciated :)
>>
>> Many thanks
>>
>> Mayur
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Overriding-Derby-in-hive-site-xml-giving-strange-results-tp25333.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


NoSuchElementException: key not found

2015-11-10 Thread Ankush Khanna

Hi,

I was working with a simple task (running locally): just reading a file (35 MB) 
with about 200 features and building a random forest with 5 trees of depth 5. 
While saving the file with:
predictions.select("VisitNumber", "probability")
   .write.format("json") // tried different formats
   .mode(SaveMode.Overwrite)
   .option("header", "true")
   .save("finalResult2")

I get an error: java.util.NoSuchElementException: key not found: 
-2.379675967804967E-16 (Stack trace below)

Just for your info, I was not getting this error earlier; it started some time 
ago and I am not able to get rid of it. 
My SPARK CONFIG is simple: 
val conf = new SparkConf() 
.setAppName("walmart")
.setMaster("local[2]")

== Physical Plan ==
Project 
[VisitNumber#52,UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)) AS 
features#81,UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59))) AS 
indexedFeatures#167,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59
 AS 
rawPrediction#168,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)
 AS 
probability#169,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59)
 AS 
prediction#170,UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Agg-DepartmentDescription#59))
 AS predictedLabel#171]
   SortBasedAggregate(key=[VisitNumber#52], functions=   [  
(ConcatenateString(DepartmentDescription#56),mode=Final,isDistinct=false)], 
output=[VisitNumber#52,Agg-DepartmentDescription#59])
ConvertToSafe
TungstenSort [VisitNumber#52 ASC], false, 0
TungstenExchange hashpartitioning(VisitNumber#52)
ConvertToUnsafe
SortBasedAggregate(key=[VisitNumber#52], 
functions=[(ConcatenateString(DepartmentDescription#56),mode=Partial,isDistinct=false)],
 output=[VisitNumber#52,concatenate#65])
ConvertToSafe
TungstenSort [VisitNumber#52 ASC], false, 0
TungstenProject [VisitNumber#52,DepartmentDescription#56]
Scan 
CsvRelation(src/main/resources/test.csv,true,,,",null,#,PERMISSIVE,COMMONS,false,false,null,UTF-8,true)[VisitNumber#52,Weekday#53,Upc#54L,ScanCount#55,DepartmentDescription#56,FinelineNumber#57]


# STACK TRACE #
java.util.NoSuchElementException: key not found: -2.379675967804967E-16
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:308)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:307)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:307)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:301)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:343)
at 
org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:343)
at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
at 
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:241)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/11/10 11:35:05 ERROR DefaultWriterContainer: Task attempt 
attempt_201511101135_0068_m_00_0 aborted.
15/11/10 11:35:05 ERROR Executo

spark shared RDD

2015-11-10 Thread Ben
Hi,
After reading some documentation about Spark and Ignite,
I am wondering if shared RDDs from Ignite can be used to share data in
memory without any duplication between multiple Spark jobs.
Running on Mesos I can collocate them, but will this be enough to avoid
memory duplication or not?
I am also confused by how Tachyon usage compares to Apache Ignite,
which seem to overlap at some points.
Thanks for your help
Regards



Re: Why is Kryo not the default serializer?

2015-11-10 Thread ozawa_h
Array issue was also discussed in Apache Hive forum. This problem seems like it 
can be resolved by using Kryo 3.x. Will upgrading to Kryo 3.x allow Kryo to 
become the default SerDes?
https://issues.apache.org/jira/browse/HIVE-12174

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unwanted SysOuts in Spark Parquet

2015-11-10 Thread Cheng Lian
This is because of PARQUET-369, which prevents 
users or other libraries from overriding Parquet's JUL logging settings via 
SLF4J. It has been fixed in the most recent parquet-format master (PR #32), but 
unfortunately there hasn't been a release yet.


Cheng
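
Until a fixed parquet-format release is available, one best-effort stop-gap (not
guaranteed to help in every setup, precisely because of PARQUET-369) is to reconfigure
java.util.logging for the "parquet" logger after the Parquet classes have loaded, in the
JVMs that actually emit the messages (for a cluster job, that is the executors):

import java.util.logging.{Level, Logger}

val parquetLogger = Logger.getLogger("parquet")            // keep a strong reference
parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
parquetLogger.setLevel(Level.SEVERE)
parquetLogger.setUseParentHandlers(false)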

On 11/9/15 3:40 PM, swetha wrote:

Hi,

I see a lot of unwanted SysOuts when I try to save an RDD as parquet file.
Following is the code and
SysOuts. Any idea as to how to avoid the unwanted SysOuts?


 ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

 AvroParquetOutputFormat.setSchema(job, ActiveSession.SCHEMA$)
activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void],
classOf[ActiveSession],
   classOf[ParquetOutputFormat[ActiveSession]], job.getConfiguration)

Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.codec.CodecConfig: Compression
set to false
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.codec.CodecConfig: Compression:
UNCOMPRESSED
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
block size to 134217728
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
page size to 1048576
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
dictionary page size to 1048576
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary
is on
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Validation
is off
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Writer
version is: PARQUET_1_0
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.InternalParquetRecordWriter:
Flushing mem columnStore to file. allocated memory: 29,159,377
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.codec.CodecConfig: Compression
set to false
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.codec.CodecConfig: Compression:
UNCOMPRESSED
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
block size to 134217728
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
page size to 1048576
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet
dictionary page size to 1048576
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary
is on
Nov 8, 2015 11:35:33 PM INFO: parquet.hadoop.ParquetOutputFormat: Validation
is off



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unwanted-SysOuts-in-Spark-Parquet-tp25325.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
I’ve seen this before during an extreme outage on the cluster, where the kafka 
offsets checkpointed by the directstreamRdd were bigger than what kafka 
reported. The checkpoint was therefore corrupted.
I don’t know the root cause but since I was stressing the cluster during a 
reliability test I can only assume that one of the Kafka partitions was 
restored from an out-of-sync replica and did not contain all the data. Seems 
extreme but I don’t have another idea.

@Cody – do you know of a way to recover from a situation like this? Can someone 
manually delete folders from the checkpoint directory to help the job recover? 
E.g. go two steps back, hoping that Kafka still has those offsets.

-adrian

From: swetha kasireddy
Date: Monday, November 9, 2015 at 10:40 PM
To: Cody Koeninger
Cc: "user@spark.apache.org"
Subject: Re: Kafka Direct does not recover automatically when the Kafka Stream 
gets messed up?

OK. But one thing I observed is that when there is a problem with the Kafka 
stream, the streaming job does not restart unless I delete the checkpoint 
directory. I guess it tries to retry the failed tasks and, if it is not able to 
recover, fails again. Sometimes it fails with a StackOverflowError.

Why does the streaming job not restart from the checkpoint directory when it 
failed earlier because the Kafka brokers got messed up? We have the checkpoint 
directory in our HDFS.

On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org> wrote:
I don't think deleting the checkpoint directory is a good way to restart the 
streaming job, you should stop the spark context or at the very least kill the 
driver process, then restart.
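
For the restart itself, a minimal sketch of a clean stop against the Spark 1.x 
streaming API (a hedged illustration, not a drop-in solution):

import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingShutdown {
    // Finish the in-flight batches, then stop the underlying SparkContext as well.
    public static void shutdown(JavaStreamingContext jssc) {
        boolean stopSparkContext = true;
        boolean stopGracefully = true;
        jssc.stop(stopSparkContext, stopGracefully);
    }
}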

On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:
Hi Cody,

Our job is our failsafe, as we don't have control over the Kafka stream as of now. 
Can setting rebalance max retries help? We do not have any monitors set up as of 
now; we need to set up the monitors.

My idea is to have some kind of cron job that queries the Streaming API for 
monitoring, say every 5 minutes, then sends an email alert and automatically 
restarts the streaming job by deleting the checkpoint directory. Would that help?



Thanks!
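
One way to implement the cron-style health check described above is to poll the Spark 
monitoring REST API (available since Spark 1.4). The host, port, and alert/restart 
actions below are placeholders; this is a sketch, not a drop-in solution:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class StreamingHealthCheck {
    public static void main(String[] args) throws Exception {
        // The driver's UI port (4040 by default) also serves /api/v1/applications.
        URL url = new URL("http://driver-host:4040/api/v1/applications");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        if (conn.getResponseCode() != 200) {
            System.err.println("Spark UI not reachable; alert and restart the job here");
            return;
        }
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON describing the running application(s)
            }
        }
    }
}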

On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org> wrote:
The direct stream will fail the task if there is a problem with the Kafka 
broker.  Spark will retry failed tasks automatically, which should handle 
broker rebalances that happen in a timely fashion. spark.task.maxFailures 
controls the maximum number of retries before failing the job.  The direct stream 
isn't any different from any other Spark task in that regard.
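
As a hedged illustration of that setting with the Spark 1.x direct stream API 
(broker addresses, topic name, and app name are placeholders):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class DirectStreamRetries {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
            .setAppName("kafka-direct-retries")
            // default is 4; raising it gives short broker hiccups more time to heal
            .set("spark.task.maxFailures", "8");

        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092");
        Set<String> topics = new HashSet<>(Arrays.asList("events"));

        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
            jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
            kafkaParams, topics);

        // ... build the processing pipeline on `stream`, then:
        jssc.start();
        jssc.awaitTermination();
    }
}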

The question of what kind of monitoring you need is more a question for your 
particular infrastructure and what you're already using for monitoring.  We put 
all metrics (application level or system level) into graphite and alert from 
there.

I will say that if you've regularly got problems with kafka falling over for 
half an hour, I'd look at fixing that before worrying about spark monitoring...


On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com> wrote:
Hi,

How can Kafka Direct be made to recover automatically when there is a problem with
the Kafka brokers? Sometimes our Kafka brokers get messed up and the entire
streaming job blows up, unlike some other consumers which do recover
automatically. How can I make sure that Kafka Direct recovers automatically
when a broker fails for some time, say 30 minutes? What kind of monitors
should be in place to recover the job?

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.








Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-10 Thread Adrian Tanase
Can you be a bit more specific about what “blow up” means? Also, what do you 
mean by “messed up” brokers? Imbalance? Broker(s) dead?

We’re also using the direct consumer and so far nothing dramatic has happened:
- on READ, it automatically reads from replicas if the leader is dead (machine gone)
- on READ, if there is a huge imbalance (partitions/leaders) the job might slow 
down if you don’t have enough cores on the machine with many partitions
- on WRITE, we’ve seen a weird delay of ~7 seconds that I don’t know how to 
re-configure; there’s a timeout that delays the job, but it eventually writes 
data to a replica
- it only died when there were no more brokers left and there were partitions 
without a leader; this happened when almost half the cluster was dead during a 
reliability test

Regardless, I would look at the source and try to monitor the Kafka cluster for 
things like partitions without leaders or big imbalances.

Hope this helps,
-adrian





On 11/9/15, 8:26 PM, "swetha"  wrote:

>Hi,
>
>How to recover Kafka Direct automatically when the there is a problem with
>Kafka brokers? Sometimes our Kafka Brokers gets messed up and the entire
>Streaming job blows up unlike some other consumers which do recover
>automatically. How can I make sure that Kafka Direct recovers automatically
>when the broker fails for sometime say 30 minutes? What kind of monitors
>should be in place to recover the job?
>
>Thanks,
>Swetha 
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


could not understand issue about static spark Function (map / sortBy ...)

2015-11-10 Thread Zhiliang Zhu
After more testing: a Function used by map/sortBy etc. must be defined as static, or 
it can be defined as non-static but must then be called from some other static 
function. I am really confused by this.


 On Tuesday, November 10, 2015 4:12 PM, Zhiliang Zhu 
 wrote:
   

Hi All,
I have run into a bug that I cannot understand, as follows:
class A {
  private JavaRDD<Vector> _com_rdd;
  ...

  // here it must be static, but not every Function (map etc.) is static
  // in the code examples in Spark's own official docs
  static Function<Vector, Vector> mapParseRow = new Function<Vector, Vector>() {
    @Override
    public Vector call(Vector v) {
      System.out.println("mark. map log is here");
      Vector rt;
      ...  // if it needs to call some other non-static function here, how can that be done?
      return rt;
    }
  };

  public void run() {  // it will be called from some other public class via an A object
    ...
    JavaRDD<Vector> rdd = (this._com_rdd).map(mapParseRow);  // it will cause failure when mapParseRow is not static
    ...
  }
}

Would you help comment on it? What should be done?

Thank you in advance!
Zhiliang
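
A likely explanation, not confirmed in this thread: a non-static anonymous Function 
holds a reference to the enclosing A instance, so Spark has to serialize A itself when 
it ships the task, and that fails if A is not serializable. Declaring the function as a 
static field (or calling it from a static context) avoids capturing this, which matches 
the behaviour described above. A hedged sketch of two alternatives, assuming the Vector 
here is org.apache.spark.mllib.linalg.Vector and keeping the names from the snippet:

import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;

public class A implements Serializable {          // option 1: if a non-static function is kept, the enclosing class must be serializable
    private transient JavaRDD<Vector> _com_rdd;   // RDD handles should not be serialized with the class

    // option 2: a static nested class captures no reference to the enclosing A instance,
    // so only the function itself travels with the task
    static class ParseRow implements Function<Vector, Vector> {
        @Override
        public Vector call(Vector v) {
            System.out.println("mark. map log is here");  // shows up in the executor's stdout log
            return v;  // placeholder: real parsing logic goes here
        }
    }

    public void run() {
        JavaRDD<Vector> rdd = _com_rdd.map(new ParseRow());  // works without a static field
        // ...
    }
}

Any non-static helper the function needs must itself be serializable (or be rebuilt 
inside call()), because it is shipped to the executors along with the function.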





 On Tuesday, November 10, 2015 11:42 AM, Deng Ching-Mallete 
 wrote:
   

Hi Zhiliang,
You should be able to see them in the executor logs, which you can view via the 
Spark UI, in the Executors page (stderr log).
HTH,
Deng

On Tue, Nov 10, 2015 at 11:33 AM, Zhiliang Zhu  
wrote:

Hi All,
I need to debug a Spark job. My general way is to print out logs; however, some 
bugs are inside Spark functions such as mapPartitions, and no log printed from 
those functions could be found... Would you help point out how to see the logs 
from inside Spark's own functions such as mapPartitions? Or, what is the general 
way to debug a Spark job?
Thank you in advance!
Zhiliang





   

  

Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread David Morales
Hi there,

Please consider our real-time aggregation engine, sparkta, fully open
source (Apache2 License).

Here you have some slides about the project:

   - http://www.slideshare.net/Stratio/strata-sparkta

And the source code:


   - https://github.com/Stratio/sparkta

Sparkta is a real-time aggregation engine based on spark streaming. You can
define your aggregation policy in a declarative way and choose the output
of your rollups, too. In addition, you can store the raw data and transform
data on-the-fly, among other features.

When working with Cassandra, it could be useful to use the lucene
integration that we have also released at Stratio:


   - http://www.slideshare.net/Stratio/cassandra-meetup-20150217
   - https://github.com/Stratio/cassandra-lucene-index


Ready for use with sparkSQL or in your CQL queries.

We are now working on a SQL layer to work with the cubes in a flexible way,
but this is not available at the moment.

Do not hesitate to contact us if you have any doubt.


Regards.

2015-11-10 8:16 GMT+01:00 Luke Han :

> Some friends refer me this thread about OLAP/Kylin and Spark...
>
> Here's my 2 cents..
>
> If you are trying to setup OLAP, Apache Kylin should be one good idea for
> you to evaluate.
>
> The project has developed more than 2 years and going to graduate to
> Apache Top Level Project [1].
> There are many production deployments already, including
> eBay, Exponential, JD.com, VIP.com and others; refer to the powered-by page [2].
>
> Apache Kylin's Spark engine is also on the way; there's a discussion about
> tuning the performance [3].
>
> A variety of clients are available to interact with Kylin via
> ANSI SQL, including Tableau, Zeppelin, Pentaho/Mondrian and Saiku/Mondrian,
> and Excel/PowerBI support will roll out this week.
>
> Apache Kylin is young but mature, with validation at huge scale (the biggest
> cube in eBay contains 85+B rows, and the production platform's 90th-percentile
> query latency is a few seconds).
>
> Streaming OLAP is coming in Kylin v2.0 with a pluggable architecture; there's
> already one real production case inside eBay. Please refer to our design
> deck [4].
>
> We are really welcome everyone to join and contribute to Kylin as OLAP
> engine for Big Data:-)
>
> Please feel free to contact our community or me for any question.
>
> Thanks.
>
> 1. http://s.apache.org/bah
> 2. http://kylin.incubator.apache.org/community/poweredby.html
> 3. http://s.apache.org/lHA
> 4.
> http://www.slideshare.net/lukehan/1-apache-kylin-deep-dive-streaming-and-plugin-architecture-apache-kylin-meetup-shanghai
> 5. http://kylin.io
>
>
> Best Regards!
> -
>
> Luke Han
>
> On Tue, Nov 10, 2015 at 2:56 AM, tsh  wrote:
>
>> Hi,
>>
>> I'm in the same position right now: we are going to implement something
>> like OLAP BI + Machine Learning explorations on the same cluster.
>> Well, the question is quite ambivalent: on one hand, we have terabytes
>> of versatile data and the necessity to build something like cubes (Hive and
>> Hive on HBase are unsatisfactory). On the other, our users are accustomed
>> to Tableau + Vertica.
>> So, right now I consider the following choices:
>> 1) Platfora (not free, I don't know price right now) + Spark
>> 2) AtScale + Tableau(not free, I don't know price right now) + Spark
>> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
>> storage
>> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
>> Flume (has somebody used it in production?)
>> 5) Spark + Tableau  (cubes?)
>>
>> For myself, I decided not to dive into Mesos. Cassandra is hard to
>> configure; you'll have to dedicate a special employee to support it.
>>
>> I'll be glad to hear other ideas & propositions as we are at the
>> beginning of the process too.
>>
>> Sincerely yours, Tim Shenkao
>>
>>
>> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>>
>> Hi,
>>
>> Thanks for the suggestions. Actually we are now evaluating and stress-testing
>> Spark SQL on Cassandra, while trying to define business models. FWIW, the
>> solution mentioned here is different from a traditional OLAP cube engine,
>> right? So we are hesitating over the overall direction of our OLAP
>> architecture.
>>
>> And we are happy to hear more use cases from this community.
>>
>> Best,
>> Sun.
>>
>> --
>> fightf...@163.com
>>
>>
>> *From:* Jörn Franke 
>> *Date:* 2015-11-09 14:40
>> *To:* fightf...@163.com
>> *CC:* user ; dev 
>> *Subject:* Re: OLAP query using spark dataframe with cassandra
>>
>> Is there any distributor supporting these software components in
>> combination? If not, and your core business is not software, then you may want
>> to look for something else, because it might not make sense to build up
>> internal know-how in all of these areas.
>>
>> In any case - it depends all highly on your data and queries. You will
>> have to do your own experiments.
>>
>> On 09 Nov 2015, at 07:02, "fightf...@163.com"  wrote:

static spark Function as map

2015-11-10 Thread Zhiliang Zhu
Hi All,
I have run into a bug that I cannot understand, as follows:
class A {
  private JavaRDD<Vector> _com_rdd;
  ...

  // here it must be static, but not every Function (map etc.) is static
  // in the code examples in Spark's own official docs
  static Function<Vector, Vector> mapParseRow = new Function<Vector, Vector>() {
    @Override
    public Vector call(Vector v) {
      System.out.println("mark. map log is here");
      Vector rt;
      ...  // if it needs to call some other non-static function here, how can that be done?
      return rt;
    }
  };

  public void run() {  // it will be called from some other public class via an A object
    ...
    JavaRDD<Vector> rdd = (this._com_rdd).map(mapParseRow);  // it will cause failure when mapParseRow is not static
    ...
  }
}

Would you help comment on it? What should be done?

Thank you in advance!
Zhiliang





 On Tuesday, November 10, 2015 11:42 AM, Deng Ching-Mallete 
 wrote:
   

Hi Zhiliang,
You should be able to see them in the executor logs, which you can view via the 
Spark UI, in the Executors page (stderr log).
HTH,
Deng

On Tue, Nov 10, 2015 at 11:33 AM, Zhiliang Zhu  
wrote:

Hi All,
I need to debug a Spark job. My general way is to print out logs; however, some 
bugs are inside Spark functions such as mapPartitions, and no log printed from 
those functions could be found... Would you help point out how to see the logs 
from inside Spark's own functions such as mapPartitions? Or, what is the general 
way to debug a Spark job?
Thank you in advance!
Zhiliang