Re: MLlib: Feature Importances API

2015-12-17 Thread Yanbo Liang
Hi Asim, The "featureImportances" is only exposed at ML not MLlib. You need to update your code to use RandomForestClassifier of ML to train and get one RandomForestClassificationModel. Then you can call RandomForestClassificationModel.featureImportances

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-17 Thread Yanbo Liang
Spark 1.5 officially uses Parquet 1.7.0, but Spark 1.3 uses Parquet 1.6.0. It's better to check which version of Parquet is used in your environment. 2015-12-17 10:26 GMT+08:00 Joseph Bradley : > This method is tested in the Spark 1.5 unit tests, so I'd guess it's a >

Is there a solution for transforming category variables into dummy variables in Scala or Spark?

2015-12-17 Thread zml张明磊
Hi, I am new to Scala and Spark. Recently, I need to write a tool that transforms category variables into dummy/indicator variables. I want to know whether there are tools in Scala or Spark that support this transformation, like pandas.get_dummies in Python? Any example or study

Re: Need clarifications in Regression

2015-12-17 Thread Yanbo Liang
Hi Arunkumar, There are two implementations of LinearRegression, one under the ml package and another one

Re: Linear Regression with OLS

2015-12-17 Thread Yanbo Liang
Hi Arunkumar, You can refer to the official LinearRegression example under the ML package ( https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala ). If you want to train this LinearRegressionModel with OLS, you

java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-17 Thread Priya Ch
Hi All, When running a streaming application, I am seeing the below error: java.io.FileNotFoundException: /data1/yarn/nm/usercache/root/appcache/application_1450172646510_0004/blockmgr-a81f42cd-6b52-4704-83f3-2cfc12a11b86/02/temp_shuffle_589ddccf-d436-4d2c-9935-e5f8c137b54b (Too many open

Re: Content based window operation on Time-series data

2015-12-17 Thread Sandy Ryza
Hi Arun, A Java API was actually recently added to the library. It will be available in the next release. -Sandy On Thu, Dec 10, 2015 at 12:16 AM, Arun Verma wrote: > Thank you for your reply. It is a Scala and Python library. Is similar > library exists for Java? >

Re: Is there a solution for transforming category variables into dummy variables in Scala or Spark?

2015-12-17 Thread Yanbo Liang
Hi Minglei, Spark ML provides a transformer named "OneHotEncoder" to map a column of category indices to a column of binary vectors. It's similar to pandas.get_dummies and sklearn's OneHotEncoder, but the output will be a single column of vector type rather than multiple columns. You can refer to the
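
A short sketch of the usual two-step flow (the DataFrame and its "category" column are assumptions):

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // first map the string categories to numeric indices...
    val indexed = new StringIndexer()
      .setInputCol("category").setOutputCol("categoryIndex")
      .fit(df).transform(df)

    // ...then encode each index as a binary vector in a single column
    val encoded = new OneHotEncoder()
      .setInputCol("categoryIndex").setOutputCol("categoryVec")
      .transform(indexed)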

Re: Cluster mode dependent jars not working

2015-12-17 Thread vimal dinakaran
--driver-class-path needs to be added with the jars needed. But this is not mentioned in the Spark documentation. On Tue, Dec 15, 2015 at 9:13 PM, Ted Yu wrote: > Please use --conf spark.executor.extraClassPath=XXX to specify dependent > jars. > > On Tue, Dec 15, 2015 at

Dynamic jar loading

2015-12-17 Thread amarouni
Hello guys, Do you know if the method SparkContext.addJar("file:///...") can be used on a running context (an already started spark-shell)? And if so, does it add the jar to the classpath of the Spark workers (YARN containers in the case of yarn-client)? Thanks,
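
For what it's worth, the call itself is legal on a live context; as far as I know it ships the jar to executors for tasks submitted afterwards, but it does not alter the classpath of the already-running driver JVM. A sketch (the path is illustrative):

    // in an already started spark-shell
    sc.addJar("file:///opt/jars/my-udfs.jar")
    // subsequent tasks can load classes from the jar on the executors,
    // but the driver's own classpath is unchanged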

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-17 Thread Saiph Kappa
I am not sure how the process works and if patches are applied to all upcoming versions of spark. Is it likely that the fix is available in this build (spark 1.6.0 17-Dec-2015 09:02)? http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ Thanks! On Wed, Dec 16, 2015 at 9:22

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-17 Thread Bartłomiej Alberski
I prepared a simple example to help reproduce the problem: https://github.com/alberskib/spark-streaming-broadcast-issue I think that this way it will be easier for you to understand the problem and find a solution (if any exists) Thanks Bartek 2015-12-16 23:34 GMT+01:00 Bartłomiej Alberski

Some tasks take a long time to find local block

2015-12-17 Thread patrick256
I'm using Spark 1.5.2 and my RDD has 512 equally sized partitions and is 100% cached in memory across 512 executors. I have a filter-map-collect job with 512 tasks. Sometimes this job completes sub-second. On other occasions when I run it 50% of the tasks complete sub-second, 45% of the tasks

Re: Kafka - streaming from multiple topics

2015-12-17 Thread Cody Koeninger
Using spark.streaming.concurrentJobs for this probably isn't a good idea, as it allows the next batch to start processing before current one is finished, which may have unintended consequences. Why can't you use a single stream with all the topics you care about, or multiple streams if you're

Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-17 Thread Ted Yu
As far as I can tell, it is not in 1.6.0 RC. You can comment on the JIRA, requesting backport to 1.6.1 Cheers On Thu, Dec 17, 2015 at 5:28 AM, Saiph Kappa wrote: > I am not sure how the process works and if patches are applied to all > upcoming versions of spark. Is it

Spark streaming: Consistency of multiple streams in Spark

2015-12-17 Thread Ashwin
Hi, I have been looking into using Spark streaming for the specific use case of joining events of data from multiple time-series streams. The part that I am having a hard time understanding is the consistency semantics of this across multiple streams. As per [1] Section 4.3.4, I understand

Matrix Inverse

2015-12-17 Thread Arunkumar Pillai
Hi, I want to find the matrix inverse of (X^T * X). PFB my code. This code does not work for even slightly larger datasets. Please help me check whether the approach is correct. val sqlQuery = "SELECT column1, column2, column3 FROM " + tableName val matrixDF = sqlContext.sql(sqlQuery) var
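
One approach that scales with the number of rows (a sketch; `rows` is an assumed RDD[Vector] built from the DataFrame): let MLlib compute the small Gramian X^T * X distributedly, then invert it locally with Breeze.

    import breeze.linalg.{inv, DenseMatrix => BDM}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)            // rows: RDD[mllib.linalg.Vector]
    val gram = mat.computeGramianMatrix()    // X^T * X, only numCols x numCols
    // small enough to invert on the driver
    val inverse = inv(new BDM(gram.numRows, gram.numCols, gram.toArray))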

One task hangs and never finishes

2015-12-17 Thread Daniel Haviv
Hi, I have an application running a set of transformations that finishes with saveAsTextFile. Out of 80 tasks all finish pretty fast but one just hangs and outputs these messages to STDERR: 15/12/17 17:22:19 INFO collection.ExternalAppendOnlyMap: Thread 82 spilling in-memory map of 4.0 GB to

How to submit spark job to YARN from scala code

2015-12-17 Thread Saiph Kappa
Hi, Since it is not currently possible to submit a Spark job to a Spark cluster running in standalone mode (cluster mode - it's not currently possible to specify this deploy mode within the code), can I do it with YARN? I tried to do something like this (but in Scala): « ... // Client object -

[SparkML] RandomForestModel vs PipelineModel API on a Driver.

2015-12-17 Thread Eugene Morozov
Hi! I'm looking for a way to run prediction for a learned model in the most performant way. Some users might want to predict just a couple of samples (literally one or two), while others would run prediction for tens of thousands. It's not a surprise that there is an overhead to

Re: Kafka - streaming from multiple topics

2015-12-17 Thread Jean-Pierre OCALAN
Hi Cody, First of all, thanks for the note about spark.streaming.concurrentJobs. I guess this is why it's not mentioned in the actual Spark streaming docs. Since those 3 topics contain completely different data, on which I need to apply different kinds of transformations, I am not sure joining them

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-17 Thread Mark Hamstra
Ah, sorry for leading you astray a bit. I was working from memory instead of looking at the code, and was probably thinking back all the way to Reynold's initial implementation of SparkContext#killJob(), which was public. I'd have to do some digging to determine exactly when and why

How to access resources added with SQL: ADD FILE

2015-12-17 Thread Antonio Piccolboni
Hi, I need to access a file from a UDF. In standalone mode, if I add the file /tmp/somedata, it ends up in /private/tmp/somedata, as I found out by keeping an eye on the logs. That is actually the same file because of a link between the directories; nothing related to Spark. My expectation reading some
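
For files shipped with sc.addFile, the documented lookup inside a task is SparkFiles.get; whether SQL's ADD FILE goes through the same mechanism is an assumption worth verifying:

    import org.apache.spark.SparkFiles

    // inside a task/UDF: resolve the local copy by file name,
    // not by its original absolute path
    val localPath = SparkFiles.get("somedata")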

Re: Kafka - streaming from multiple topics

2015-12-17 Thread Cody Koeninger
You could stick them all in a single stream, and do mapPartitions, then switch on the topic for that partition. It's probably cleaner to do separate jobs, just depends on how you want to organize your code. On Thu, Dec 17, 2015 at 11:11 AM, Jean-Pierre OCALAN wrote: > Hi
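
A sketch of that suggestion (ssc, kafkaParams, and the topic names are placeholders); with the direct stream, RDD partitions correspond 1:1 to Kafka topic-partitions, so the topic can be looked up per partition:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topicA", "topicB"))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        ranges(i).topic match {
          case "topicA" => iter.map { case (_, v) => v.toUpperCase } // per-topic logic (placeholder)
          case _        => iter.map { case (_, v) => v }
        }
      }.count() // placeholder action
    }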

pyspark + kafka + streaming = NoSuchMethodError

2015-12-17 Thread Christos Mantas
Hello, I am trying to set up a simple example with Spark Streaming (Python) and Kafka on a single machine deployment. My Kafka broker/server is also on the same machine (localhost:1281) and I am using Spark Version: spark-1.5.2-bin-hadoop2.6 Python code ... ssc = StreamingContext(sc,

Spark 1.6 - YARN Cluster Mode

2015-12-17 Thread syepes
Hello, This week I have been testing 1.6 (#d509194b) on our HDP 2.3 platform and it's been working pretty well, with the exception of the YARN cluster deployment mode. Note that with 1.5, using the same "spark-props.conf" and "spark-env.sh" config files, cluster mode works as expected. Has anyone

Can't run spark on yarn

2015-12-17 Thread Eran Witkon
Hi, I am trying to install Spark 1.5.2 on Apache Hadoop 2.6 with Hive and YARN. spark-env.sh: export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop bash_profile: #HADOOP VARIABLES START export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ export HADOOP_INSTALL=/usr/local/hadoop export

unsubscribe

2015-12-17 Thread Roman Garcia
please

Re: How to submit spark job to YARN from scala code

2015-12-17 Thread Steve Loughran
On 17 Dec 2015, at 16:50, Saiph Kappa wrote: Hi, Since it is not currently possible to submit a Spark job to a Spark cluster running in standalone mode (cluster mode - it's not currently possible to specify this deploy mode within the

Re: Large number of conf broadcasts

2015-12-17 Thread Prasad Ravilla
Hi Anders, I am running into the same issue as yours. I am trying to read about 120 thousand avro files into a single data frame. Is your patch part of a pull request from the master branch in github? Thanks, Prasad. From: Anders Arpteg Date: Thursday, October 22, 2015 at 10:37 AM To: Koert

Spark Path Wildcards Question

2015-12-17 Thread Mark Vervuurt
Hi Guys, a quick verification question: Spark's methods like textFile(…) and sequenceFile(…) support wildcards. However, if I have a directory structure like "hdfs:///data/year/month/day" (e.g. "hdfs:///data/2015/12/17"), is it possible to crawl a whole year of data consisting of sequence files
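
Globbing goes through Hadoop's FileSystem.globStatus, so patterns over the date levels should work; a sketch (paths and the String key/value types are assumptions):

    val wholeYear = sc.textFile("hdfs:///data/2015/*/*")
    val december  = sc.sequenceFile[String, String]("hdfs:///data/2015/12/*")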

Re: Large number of conf broadcasts

2015-12-17 Thread Koert Kuipers
https://github.com/databricks/spark-avro/pull/95 On Thu, Dec 17, 2015 at 3:35 PM, Prasad Ravilla wrote: > Hi Anders, > > I am running into the same issue as yours. I am trying to read about 120 > thousand avro files into a single data frame. > > Is your patch part of a pull

Re: Large number of conf broadcasts

2015-12-17 Thread Prasad Ravilla
Thanks, Koert. Regards, Prasad. From: Koert Kuipers Date: Thursday, December 17, 2015 at 1:06 PM To: Prasad Ravilla Cc: Anders Arpteg, user Subject: Re: Large number of conf broadcasts

Re: number of blocks in ALS/recommendation API

2015-12-17 Thread Burak Yavuz
Copying the first part from the scaladoc: " This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on
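
For concreteness, the parameter appears directly in the MLlib training call; a sketch (`ratings` is an assumed RDD[Rating]):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // the last argument is the number of user/product blocks;
    // -1 lets MLlib auto-configure it based on the data
    val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */,
      0.01 /* lambda */, -1 /* blocks */)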

Difference between Local Hive Metastore server and A Hive-based Metastore server

2015-12-17 Thread Divya Gehlot
Hi, I am a newbie to Spark, using 1.4.1. I got confused between a local metastore server and a Hive-based metastore server. Can somebody share the use cases for when to use which one, and the pros and cons? I am using HDP 2.3.2, in which hive-site.xml is already in the Spark configuration directory, which means

Serialization error in Apache Spark job

2015-12-17 Thread Pankaj Narang
I am encountering the below error. Can somebody guide me? Something similar is on this link https://github.com/elastic/elasticsearch-hadoop/issues/298 actor.MentionCrawlActor java.io.NotSerializableException: actor.MentionCrawlActor at
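
A frequent cause is a Spark closure that references a field or method of the actor, which pulls the whole non-serializable `this` into the serialized task. One common remedy, sketched with hypothetical names (`keyword`, `rdd`):

    // copy the needed field to a local val so the closure captures only
    // the String, not the enclosing (non-serializable) actor instance
    val localKeyword = keyword
    val hits = rdd.filter(line => line.contains(localKeyword)).count()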

Is DataFrame.groupBy supposed to preserve order within groups?

2015-12-17 Thread Timothée Carayol
Hi all, I tried to do something like the following in Spark: df.orderBy('col1, 'col2).groupBy('col1).agg(first('col3)) I was hoping to get, within each col1 group, the value of col3 that corresponds to the highest value of col2. This only works if the order on col2 is
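
A pattern that does not rely on ordering being preserved through groupBy (assuming your Spark version supports ordering on struct types) is to aggregate with max over a struct, which compares by the first field, then the second:

    import org.apache.spark.sql.functions._

    val result = df.groupBy(col("col1"))
      .agg(max(struct(col("col2"), col("col3"))).as("top"))
      .select(col("col1"), col("top.col3")) // col3 of the row with the largest col2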

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-17 Thread Shixiong Zhu
Streaming checkpoint doesn't support Accumulators or Broadcast variables. See https://issues.apache.org/jira/browse/SPARK-5206 Here is a workaround: https://issues.apache.org/jira/browse/SPARK-5206?focusedCommentId=14506806&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14506806
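
The linked workaround boils down to a lazily instantiated singleton that is rebuilt after a restart instead of being restored from the checkpoint; roughly (the broadcast contents are placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object BlacklistHolder {
      @volatile private var instance: Broadcast[Seq[String]] = null
      def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              // re-created on first use after restart, not checkpointed
              instance = sc.broadcast(Seq("a", "b", "c"))
            }
          }
        }
        instance
      }
    }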

ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow.

2015-12-17 Thread Anderson de Andrade
Hi. The following code is raising the warning in the title: I read a similar thread about this. However, I do not think I'm joining two VertexRDDs. Is this the best way to go about stacking aggregateMessages calls? Thank you.

Re: Spark REST API shows Error 503 Service Unavailable

2015-12-17 Thread prateek arora
Hi Vikram, as per a Cloudera person: "There is a minor bug with the way the classpath is set up for the Spark HistoryServer in 5.5.0, which causes the observed error when using the REST API (as a result of bad Jersey versions (1.9) being included). This will be fixed in CDH and CM 5.5.2 (yet to

Re: Spark REST API shows Error 503 Service Unavailable

2015-12-17 Thread Vikram Kone
Hi Prateek, Were you able to figure out why this is happening? I'm seeing the same error on my Spark standalone cluster. Any pointers, anyone? On Fri, Dec 11, 2015 at 2:05 PM, prateek arora wrote: > > > Hi > > I am trying to access Spark using the REST API but got the below

Re: Spark REST API shows Error 503 Service Unavailable

2015-12-17 Thread Marcelo Vanzin
Hi Prateek, Are you using CDH 5.5 by any chance? We fixed this bug in an upcoming patch. Unfortunately there's no workaround at the moment... it doesn't affect upstream Spark either. On Fri, Dec 11, 2015 at 2:05 PM, prateek arora wrote: > > > Hi > > I am trying to

Re: Spark REST API shows Error 503 Service Unavailable

2015-12-17 Thread Vikram Kone
No, we are using standard Spark with DataStax Cassandra. I'm able to see some JSON when I hit http://10.1.40.16:7080/json/v1/applications but I get the following errors for http://10.1.40.16:7080/api/v1/applications HTTP ERROR 503 Problem accessing /api/v1/applications. Reason: Service

Re: Spark REST API shows Error 503 Service Unavailable

2015-12-17 Thread Marcelo Vanzin
On Thu, Dec 17, 2015 at 3:31 PM, Vikram Kone wrote: > No, we are using standard Spark with DataStax Cassandra. I'm able to see some > JSON when I hit http://10.1.40.16:7080/json/v1/applications > but I get the following errors for >

Re: looking for Spark streaming unit example written in Java

2015-12-17 Thread Andy Davidson
Hi Ted, I added the following hack to my Gradle project. I am now able to run Spark streaming unit tests in my project. Hopefully others will find this helpful. Andy dependencies { provided group: 'commons-cli', name: 'commons-cli', version: '1.3+' provided group:

Re: Download Problem with Spark 1.5.2 pre-built for Hadoop 1.X

2015-12-17 Thread Jean-Baptiste Onofré
Hi, we have a JIRA about that (let me find it): by default, a suffix is appended, causing an issue resolving the artifact. Let me find the JIRA and the workaround. Regards JB On 12/17/2015 12:48 PM, abc123 wrote: I get an error message when I try to download Spark 1.5.2 pre-built for Hadoop 1.X.

Re: MLlib: Feature Importances API

2015-12-17 Thread Yanbo Liang
Hi Asim, I think it's not necessary to backport featureImportances to mllib.tree.RandomForest. You can use ml.RandomForestClassifier and ml.RandomForestRegressor directly. Yanbo 2015-12-17 19:39 GMT+08:00 Asim Jalis : > Yanbo, > > Thanks for the reply. > > Is there a JIRA

Python 3.x support

2015-12-17 Thread YaoPau
I found the JIRA for Python 3 support, but it looks like support for 3.4 is still unresolved. Which Python 3 versions are supported by Spark 1.5?

Re: HBase ERROR

2015-12-17 Thread Jeff Zhang
I believe this is an HBase issue; you'd better ask on the HBase mailing list. On Fri, Dec 18, 2015 at 9:57 AM, censj wrote: > hi, all: > I write data to HBase, but HBase raises this ERROR. Could you help me? > > > r.KeeperException$SessionExpiredException: KeeperErrorCode =

Re: HiveContext Self join not reading from cache

2015-12-17 Thread Gourav Sengupta
Hi Ted, The self join works fine when the HiveContext tables are direct Hive tables, e.g. table1 = hiveContext.sql("select columnA, columnB from hivetable1") table1.registerTempTable("table1") table1.cache() table1.count() and if I do a self join on table1 things are quite fine

number of blocks in ALS/recommendation API

2015-12-17 Thread Roberto Pagliari
What is the meaning of the 'blocks' input argument in mllib ALS implementation, and how does that relate to the number of executors and/or size of the input data? Thank you,

Re: MLlib: Feature Importances API

2015-12-17 Thread Asim Jalis
Yanbo, Thanks for the reply. Is there a JIRA for exposing featureImportances on org.apache.spark.mllib.tree.RandomForest, or could you create one? I am unable to create an issue in JIRA against Spark. Thanks. Asim On Thu, Dec 17, 2015 at 12:07 AM, Yanbo Liang wrote: >

Download Problem with Spark 1.5.2 pre-built for Hadoop 1.X

2015-12-17 Thread abc123
I get an error message when I try to download Spark 1.5.2 pre-built for Hadoop 1.X. Can someone help me please? Error: http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop1.tgz NoSuchKey The specified key does not exist. spark-1.5.2-bin-hadoop1.tgz

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-17 Thread Jakob Odersky
It might be a good idea to see how many files are open and try increasing the open-file limit (this is done at the OS level). In some application use cases it is actually a legitimate need. If that doesn't help, make sure you close any unused files and streams in your code. It will also be easier

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-17 Thread Jacek Laskowski
Thanks Mark! That helped a lot, and my takeaway from it is to... back away now! :) I'm following the advice as there's simply too much to learn in Spark at the moment. Regards, Jacek Jacek Laskowski | https://medium.com/@jaceklaskowski/ Mastering Apache Spark ==>

HBase ERROR

2015-12-17 Thread censj
hi, all: I write data to HBase, but HBase raises this ERROR. Could you help me? > > r.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired > for /hbase-unsecure/rs/byd0157,16020,1449106975377 > 2015-12-17 21:24:29,854 WARN [regionserver/byd0157/192.168.0.157:16020] >

Re: Can't run spark on yarn

2015-12-17 Thread Saisai Shao
Please check the YARN AM log to see why the AM failed to start. That's why using `sc` produces that complaint. On Fri, Dec 18, 2015 at 4:25 AM, Eran Witkon wrote: > Hi, > I am trying to install Spark 1.5.2 on Apache Hadoop 2.6 with Hive and YARN > > spark-env.sh >

Re: pyspark + kafka + streaming = NoSuchMethodError

2015-12-17 Thread Shixiong Zhu
What's the Scala version of your Spark? Is it 2.10? Best Regards, Shixiong Zhu 2015-12-17 10:10 GMT-08:00 Christos Mantas : > Hello, > > I am trying to set up a simple example with Spark Streaming (Python) and > Kafka on a single machine deployment. > My Kafka

Re: pyspark + kafka + streaming = NoSuchMethodError

2015-12-17 Thread Luciano Resende
Unless you built your own Spark distribution with Scala 2.11, you want to use the 2.10 dependency: --packages org.apache.spark:spark-streaming-kafka_2.10:1.5.2 On Thu, Dec 17, 2015 at 10:10 AM, Christos Mantas wrote: > Hello, > > I am trying to set up a simple

Re: Can't run spark on yarn

2015-12-17 Thread Alexander Pivovarov
Try starting AWS EMR 4.2.0 with the Hadoop and Spark applications on spot instances, then look at how Hadoop and Spark are configured and configure yours in a similar way. On Dec 17, 2015 6:09 PM, "Saisai Shao" wrote: > Please check the YARN AM log to see why the AM

RE: How to submit spark job to YARN from scala code

2015-12-17 Thread Alexander Pivovarov
spark-submit --master yarn-cluster. See the docs for more details. On Dec 17, 2015 5:00 PM, "Forest Fang" wrote: > Maybe I'm not understanding your question correctly, but would it be > possible for you to piece together your job submission information as if you were > operating

Re: Content based window operation on Time-series data

2015-12-17 Thread Davies Liu
Could you try this? df.groupBy(((col("timeStamp") - start) / bucketLengthSec).cast(IntegerType)).agg(max("timestamp"), max("value")).collect() On Wed, Dec 9, 2015 at 8:54 AM, Arun Verma wrote: > Hi all, > > We have an RDD (main) of sorted time-series data. We want to split it

RE: How to submit spark job to YARN from scala code

2015-12-17 Thread Forest Fang
Maybe I'm not understanding your question correctly, but would it be possible for you to piece together your job submission information as if you were operating spark-submit? If so, you could just call org.apache.spark.deploy.SparkSubmit and pass your regular spark-submit arguments. This is how I do
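
Another option since Spark 1.4 is the launcher API that ships with Spark; a sketch (paths, class name, and conf values are placeholders):

    import org.apache.spark.launcher.SparkLauncher

    val proc = new SparkLauncher()
      .setAppResource("/path/to/app.jar")     // placeholder
      .setMainClass("com.example.Main")       // placeholder
      .setMaster("yarn-cluster")
      .setConf("spark.executor.memory", "2g")
      .launch()                               // returns a java.lang.Process
    proc.waitFor()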

Writing output fails when spark.unsafe.offHeap is enabled

2015-12-17 Thread Mayuresh Kunjir
I am testing a simple sort program written using the DataFrame API. When I enable spark.unsafe.offHeap, the output stage fails with an NPE. The exception when run on spark-1.5.1 is copied below. Job aborted due to stage failure: Task 23 in stage 3.0 failed 4 times, most recent failure: Lost task