Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Timur Shenkao
Hi! Do you have a real Hive installation? Have you built Spark 2.1 and Spark 2.0 with Hive support (-Phive -Phive-thriftserver)? It seems that you use Spark's "default" Hive 1.2.1. Your metadata is stored in a local Derby DB, which is visible to one concrete Spark installation but not to all. On Wed,

[structured streaming] How to remove outdated data when use Window Operations

2016-11-29 Thread Xinyu Zhang
Hi, I want to use window operations. However, if I don't remove any data, the "complete" table will grow larger and larger as time goes on. So I want to remove outdated data from the complete table that I will never use again. Is there any method to meet my requirement? Thanks!
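A minimal sketch of one way to bound that state, assuming Spark 2.1+ (where event-time watermarking is available) and a streaming DataFrame with an event-time column named "ts" (both assumptions, not from the thread):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// Hedged sketch: the watermark bounds how late data may arrive, so the engine
// can drop aggregation state for windows older than the threshold instead of
// letting the result table grow forever. `events` and "ts" are assumptions.
def windowedCounts(events: DataFrame): DataFrame =
  events
    .withWatermark("ts", "10 minutes")
    .groupBy(window(col("ts"), "5 minutes"))
    .count()
```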

Unsubscribe

2016-11-29 Thread Bibudh Lahiri
Unsubscribe -- Bibudh Lahiri Senior Data Scientist, Impetus Technologies 720 University Avenue, Suite 130 Los Gatos, CA 95129 http://knowthynumbers.blogspot.com/

Re: Best approach to schedule Spark jobs

2016-11-29 Thread Sandeep Samudrala
Here at InMobi, we use Apache Falcon (with Oozie). The pipelines are fully functional in production. You can look at the Apache Falcon site for more details. On Wed, Nov 30, 2016 at 7:36 AM, Tiago Albineli Motta wrote: > Here at Globo.com we use

SVM regression in Spark

2016-11-29 Thread roni
Hi All, I am trying to port my R code to Spark. I am using SVM regression in R. It seems that Spark provides only SVM classification. How can I get regression results? In my R code I call the svm() function in library("e1071") (
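For context, a hedged sketch of one substitute: since MLlib exposes SVM only for classification, a different spark.ml regressor (gradient-boosted trees here, a plainly different algorithm from SVR) can stand in for e1071's svm() regression. The DataFrame and the default "label"/"features" columns are assumptions:

```scala
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.DataFrame

// Hedged sketch: GBTRegressor as a stand-in for SVM regression.
// trainingDF must carry "label" (Double) and "features" (Vector) columns.
def fitRegressor(trainingDF: DataFrame) = {
  val gbt = new GBTRegressor()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxIter(50)
  gbt.fit(trainingDF)
}
```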

Re: Best approach to schedule Spark jobs

2016-11-29 Thread Tiago Albineli Motta
Here at Globo.com we use Airflow to schedule and manage our Spark pipeline. We use the YARN API in the Airflow DAGs to control things like guaranteeing that a job is not already running before starting another batch. Tiago Albineli Motta Software Developer - Globo.com ICQ: 32107100
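A hedged sketch of that kind of guard, using the YARN ResourceManager REST API (/ws/v1/cluster/apps); the RM address and the application name are assumptions for illustration:

```scala
import scala.io.Source

// Hedged sketch: crude check against the YARN RM REST API for a RUNNING app
// with a given name before submitting the next batch. A real implementation
// would parse the JSON properly instead of substring matching.
val rm = "http://resourcemanager:8088"
val runningApps = Source.fromURL(s"$rm/ws/v1/cluster/apps?states=RUNNING").mkString
val alreadyRunning = runningApps.contains("\"name\":\"daily-spark-batch\"")

if (!alreadyRunning) {
  // safe to kick off the next batch here
}
```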

Controlling data placement / locality

2016-11-29 Thread Michael Johnson
I'm reading in data from a single file. I do some computations to get good groupings of the data. Future computations in my program operate on a single group at a time. (E.g., I might do frequent itemset mining of members within each group.) How do I tell Spark that all members of a

Fault-tolerant Accumulators in a DStream-only transformations.

2016-11-29 Thread Amit Sela
Hi all, In order to recover Accumulators (functionally) from a Driver failure, it is recommended to use them within a foreachRDD/transform and use the RDD context with a Singleton wrapping the Accumulator, as shown in the examples
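A sketch of that singleton pattern, modeled on the streaming examples: the object lazily (re)creates the accumulator from the RDD's SparkContext, so it is re-registered after recovery from a checkpoint rather than lost with the failed driver. Names here are illustrative:

```scala
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.streaming.dstream.DStream

object DroppedRecordsCounter {
  @volatile private var instance: Accumulator[Long] = null

  // Double-checked locking: create the accumulator once per driver instance.
  def getInstance(sc: SparkContext): Accumulator[Long] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.accumulator(0L, "DroppedRecordsCounter")
        }
      }
    }
    instance
  }
}

def countDropped(stream: DStream[String]): Unit = stream.foreachRDD { rdd =>
  // Fetch (or re-create) the accumulator from the RDD's own context.
  val dropped = DroppedRecordsCounter.getInstance(rdd.sparkContext)
  rdd.foreach { record => if (record.isEmpty) dropped += 1L }
}
```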

Spark Job not exited and shows running

2016-11-29 Thread Selvam Raman
Hi, I have submitted a Spark job in YARN client mode. The executors and cores were dynamically allocated. In the job I have 20 partitions, so 5 containers, each with 4 cores, were allocated. It has processed almost all the records, but the job never exits, and in the application master container I am

Best approach to schedule Spark jobs

2016-11-29 Thread Bruno Faria
I have a standalone Spark cluster with some jobs scheduled using crontab. It works, but I don't have real-time monitoring to get emails or to control a flow, for example. I thought about using the Spark "hidden" API for better control, but it seems the API is not officially

Re: Does MapWithState follow with a shuffle ?

2016-11-29 Thread Shixiong(Ryan) Zhu
Right. And you can specify the partitioner via "StateSpec.partitioner(partitioner: Partitioner)". On Tue, Nov 29, 2016 at 1:16 PM, Amit Sela wrote: > Hi all, > > I've been digging into MapWithState code (branch 1.6), and I came across > the compute >
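A minimal sketch of that, with an illustrative update function (the key/value/state types and the partition count here are assumptions):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Illustrative update function: keep a running sum per key.
def trackState(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val sum = state.getOption().getOrElse(0) + value.getOrElse(0)
  state.update(sum)
  (key, sum)
}

// Pin the partitioner so the state RDD shares the input stream's partitioning.
val spec = StateSpec
  .function(trackState _)
  .partitioner(new HashPartitioner(16))

def withState(keyedStream: DStream[(String, Int)]) = keyedStream.mapWithState(spec)
```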

Does MapWithState follow with a shuffle ?

2016-11-29 Thread Amit Sela
Hi all, I've been digging into MapWithState code (branch 1.6), and I came across the compute implementation in *InternalMapWithStateDStream*. Looking at

Re: Multilabel classification with Spark MLlib

2016-11-29 Thread Yuhao Yang
If problem transformation is not an option ( https://en.wikipedia.org/wiki/Multi-label_classification#Problem_transformation_methods), I would try to develop a customized algorithm based on MultilayerPerceptronClassifier, in which you probably need to rewrite LabelConverter. 2016-11-29 9:02
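If problem transformation is acceptable after all, binary relevance is the simplest of the methods at that link: one independent binary classifier per label. A hedged sketch, assuming the labels have already been exploded into 0/1 columns alongside a "features" vector column (all names illustrative):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

// Hedged sketch of binary relevance: fit one binary classifier per label.
// Column names label_0..label_2 and "features" are assumptions.
def fitBinaryRelevance(trainingDF: DataFrame): Map[String, LogisticRegressionModel] = {
  val labelCols = Seq("label_0", "label_1", "label_2")
  labelCols.map { lc =>
    lc -> new LogisticRegression()
      .setLabelCol(lc)
      .setFeaturesCol("features")
      .fit(trainingDF)
  }.toMap
}
```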

Re: Porting LIBSVM models to Spark

2016-11-29 Thread Maciej Szymkiewicz
Hi, Not directly. You could try some workaround with converting to PMML and importing with JPMML-Spark (but you'd have to create your own Python wrapper). On a side note, please avoid cross-posting between Stack Overflow and the user list, and be sure to read the guidelines

Spark 2 Alternative to SparkContext clearJars()?

2016-11-29 Thread lukasbradley
The method clearJars() has been removed from the SparkContext in Spark 2. Is there an alternative to this in Spark 2? Out of curiosity, does anyone know why it was removed?

Re: Do I have to wrap akka around spark streaming app?

2016-11-29 Thread shyla deshpande
Thanks Vincent for the feedback. I appreciate it. On Tue, Nov 29, 2016 at 1:34 AM, vincent gromakowski < vincent.gromakow...@gmail.com> wrote: > You can still achieve it by implementing an actor in each partition but I > am not sure it's a good design regarding scalability because your >

Porting LIBSVM models to Spark

2016-11-29 Thread Pat Blachly
Is it possible to read LIBSVM model files into PySpark? Naively, I'm thinking of something like: scaler_path = "path to LIBSVM model generated with svm-scale" a = MinMaxScaler().load(scaler_path) While this example is shown for a feature transformation model, I would also be interested in

Multilabel classification with Spark MLlib

2016-11-29 Thread Md. Rezaul Karim
Hello All, Has anyone developed multilabel classification applications with Spark? I found an example class in the Spark distribution (i.e., *JavaMultiLabelClassificationMetricsExample.java*), which is not a classifier but an evaluator for multilabel classification. Moreover, the

Re: build models in parallel

2016-11-29 Thread Georg Heiler
They use such functionality via PySpark: https://www.youtube.com/watch?v=R-6nAwLyWCI. Xiaomeng Wan wrote on Tue., 29 Nov. 2016 at 17:54: > I want to divide big data into groups (eg groupby some id), and build one > model for each group. I am wondering whether I can

build models in parallel

2016-11-29 Thread Xiaomeng Wan
I want to divide big data into groups (e.g., group by some id) and build one model for each group. I am wondering whether I can parallelize the model-building process by implementing a UDAF (e.g., running linear regression in its evaluate method). Is that good practice? Does anybody have experience with this? Thanks!
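A hedged alternative to the UDAF idea: collect the distinct group ids on the driver and fit one ordinary spark.ml model per group. The ids and column names below are illustrative, and the per-group fits run sequentially here (a parallel collection over the ids is one way to overlap them):

```scala
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

// Hedged sketch: one LinearRegression model per distinct "id" value.
// df is assumed to carry "id", "features" (Vector) and "label" (Double).
def fitPerGroup(df: DataFrame) = {
  val groupIds = df.select("id").distinct().collect().map(_.get(0))
  groupIds.map { gid =>
    gid -> new LinearRegression().fit(df.filter(df("id") === gid))
  }.toMap
}
```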

Re: null values returned by max() over a window function

2016-11-29 Thread Yong Zhang
This is not a bug, but the intended behavior of window functions. Using max + rowsBetween is kind of a strange requirement. rowsBetween is more typically used to calculate a moving sum or average, which effectively treats null as 0. But in your case, you want your grouping window to be 2 rows before +
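For reference, a sketch of the construct under discussion (column names are assumptions): a max over a frame of the two preceding rows plus the current row, with coalesce substituting a default when the whole frame is null:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lit, max}

// Hedged sketch: moving max over (2 preceding, current) rows per group.
// If every "value" in the frame is null, max yields null; coalesce maps it to 0.
def movingMax(df: DataFrame): DataFrame = {
  val w = Window.partitionBy("group").orderBy("ts").rowsBetween(-2, 0)
  df.withColumn("runningMax", coalesce(max(df("value")).over(w), lit(0)))
}
```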

python environments with "local" and "yarn-client" - Boto failing on HDP2.5

2016-11-29 Thread Andrew Holway
Hey, I am making some calls with Boto3 in my PySpark app, which works fine in master=local mode, but when I switch to master=yarn I get "NoCredentialsError: Unable to locate credentials", which is a bit annoying as I cannot work out why! I have been running this application fine on Mesos

Re: createDataFrame causing a strange error.

2016-11-29 Thread Andrew Holway
Hi Marco, I was not able to find out what was causing the problem but a "git stash" seems to have fixed it :/ Thanks for your help... :) On Mon, Nov 28, 2016 at 10:50 PM, Marco Mistroni wrote: > Hi Andrew, > sorry but to me it seems s3 is the culprit > I have

Java Collections.emptyList inserted as null object in cassandra

2016-11-29 Thread Selvam Raman
Field type in Cassandra: List. I am trying to insert Collections.emptyList() from Spark into a Cassandra list field. In Cassandra it is stored as a null object. How can I avoid null values here? -- Selvam Raman "Shun bribery; stand tall"

Re: Spark ignoring partition names without equals (=) separator

2016-11-29 Thread Steve Loughran
On 29 Nov 2016, at 05:19, Prasanna Santhanam wrote: On Mon, Nov 28, 2016 at 4:39 PM, Steve Loughran wrote: irrespective of naming, know that deep directory trees are performance killers when

Re: Do I have to wrap akka around spark streaming app?

2016-11-29 Thread vincent gromakowski
You can still achieve it by implementing an actor in each partition, but I am not sure it's a good design in terms of scalability, because your distributed actors would send a message for each event to your single app actor, and that would be a huge load. If you want to experiment with this, and because actor

Re: Spark Streaming + Kinesis : Receiver MaxRate is violated

2016-11-29 Thread dav009
Possibly a bug, please check: https://issues.apache.org/jira/browse/SPARK-18620