Re: Problem with Generalized Regression Model

2017-01-09 Thread sethah
This likely indicates that the IRLS solver for GLR has encountered a singular matrix. Can you check if you have linearly dependent columns in your data? Also, this error message should be fixed in the latest version of Spark, after: https://issues.apache.org/jira/browse/SPARK-11918
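
For anyone hitting the same error, here is a minimal sketch (not from this thread; the data is illustrative) of one way to check for linearly dependent columns: count the near-zero singular values of the feature matrix with MLlib's RowMatrix.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Third column = first + second, so the matrix is rank-deficient.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(2.0, 1.0, 3.0),
      Vectors.dense(4.0, 0.0, 4.0)))
    val svd = new RowMatrix(rows).computeSVD(3)
    // Singular values near zero indicate linearly dependent columns.
    val rank = svd.s.toArray.count(_ > 1e-9)
    println(s"effective rank: $rank of 3 columns")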

Re: spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Palash Gupta
Hello Mr. Burton, can you share example code showing how you implemented this, for other users to see? "So I think what we did is a repartition too large and now we ran out of memory in spark shell." Thanks! P.Gupta Sent from Yahoo Mail on Android On Tue, 10 Jan, 2017 at 8:20 am, Kevin

Re: spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Kevin Burton
Ah, ok. I think I know what's happening now. I think we found this problem when running a job and doing a repartition(). Spark is just way, way too sensitive to memory configuration. The 2GB-per-shuffle limit is also insanely silly in 2017. So I think what we did is a repartition too
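
A rough sketch of the sizing logic implied here, under stated assumptions (the input size and the 128MB-per-partition target are illustrative): picking the partition count from the data size keeps each shuffle block well under the 2GB block limit mentioned above.

    val df = spark.range(0, 4L * 1000 * 1000 * 1000)    // stand-in dataset
    // Aim for ~128MB per partition instead of an arbitrary huge count.
    val estimatedBytes = 512L * 1024 * 1024 * 1024      // assumed input size
    val targetPartitionBytes = 128L * 1024 * 1024
    val numPartitions = (estimatedBytes / targetPartitionBytes).toInt  // 4096
    val repartitioned = df.repartition(numPartitions)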

Re: spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Steven Ruppert
The spark-shell process alone shouldn't take up that much memory, at least in my experience. Have you dumped the heap to see what's in there? What environment are you running Spark in? Doing things like RDD.collect() or .countByKey() can pull a lot of data into the spark-shell heap.
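
To illustrate the point about the shell heap, a small sketch (the HDFS paths are hypothetical): collect() materializes the entire dataset on the driver, while take(n) or writing results back to storage keeps driver memory bounded.

    val rdd = sc.textFile("hdfs:///some/large/input")   // hypothetical path
    val sample = rdd.take(10)    // bounded: only 10 records reach the driver
    // val all = rdd.collect()   // unbounded: can OOM a 6GB spark-shell
    rdd.map(line => (line.length, 1L)).reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///tmp/line-length-counts")  // stays off the driver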

Re: How to connect Tableau to databricks spark?

2017-01-09 Thread Raymond Xie
To rule out the firewall blocking the port, I added a rule in Windows Firewall to allow all inbound and outbound traffic on port 1. I then tried telnet ec2-35-160-128-113.us-west-2.compute.amazonaws.com 1 in PuTTY and it still doesn't work. Sincerely

spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Kevin Burton
We've had various OOM issues with Spark and have been trying to track them down one by one. Now we have one in spark-shell, which is super surprising. We currently allocate 6GB to the spark shell, as confirmed via 'ps'. Why the heck would the *shell* need that much memory? I'm going to try to give

Re: unsubscribe

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org HTH! On Mon, Jan 9, 2017 4:40 PM, william tellme williamtellme...@gmail.com wrote:

Re: UNSUBSCRIBE

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org HTH! On Mon, Jan 9, 2017 4:41 PM, Chris Murphy - ChrisSMurphy.com cont...@chrissmurphy.com wrote: PLEASE!!

UNSUBSCRIBE

2017-01-09 Thread Chris Murphy - ChrisSMurphy.com
PLEASE!!

unsubscribe

2017-01-09 Thread william tellme

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Abhishek Bhandari
Glad that you found it. On Mon, Jan 9, 2017 at 3:29 PM, Richard Siebeling wrote: > Probably found it, it turns out that Mesos should be explicitly added > while building Spark, I assumed I could use the old build command that I > used for building Spark 2.0.0... Didn't

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Richard Siebeling
Probably found it: it turns out that Mesos support should be explicitly added while building Spark. I assumed I could use the old build command that I used for building Spark 2.0.0... Didn't see the two lines added in the documentation... Maybe these kinds of changes could be added to the changelog under
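
For reference, the change being described (per the Spark 2.1.0 "Building Spark" documentation): Mesos support is now behind an explicit build profile, so a 2.0-era build command produces a distribution that cannot parse mesos:// master URLs.

    # Spark 2.0.x era build command (no Mesos profile required then):
    ./build/mvn -DskipTests clean package
    # Spark 2.1.0: Mesos support must be enabled explicitly:
    ./build/mvn -Pmesos -DskipTests clean package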

Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Richard Siebeling
Hi, I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error. Mesos is running fine (both the master and the slave; it's a single-machine configuration). I really don't understand why this is happening since the same

Re: How to connect Tableau to databricks spark?

2017-01-09 Thread Silvio Fiorito
Also, meant to add the link to the docs: https://docs.databricks.com/user-guide/faq/tableau.html From: Silvio Fiorito Date: Monday, January 9, 2017 at 2:59 PM To: Raymond Xie , user Subject: Re: How to connect Tableau

Re: Spark 2.x OFF_HEAP persistence

2017-01-09 Thread Gene Pang
Yes, as far as I can tell, your description is accurate. Thanks, Gene On Wed, Jan 4, 2017 at 9:37 PM, Vin J wrote: > Thanks for the reply Gene. Looks like this means, with Spark 2.x, one has > to change from rdd.persist(StorageLevel.OFF_HEAP) to >
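
For context, a minimal sketch of Spark 2.x OFF_HEAP persistence (sizes illustrative): blocks go to executor-managed off-heap memory, which must be enabled via the spark.memory.offHeap.* settings.

    // Launch with off-heap memory enabled, e.g.:
    //   spark-shell --conf spark.memory.offHeap.enabled=true \
    //               --conf spark.memory.offHeap.size=2g
    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.OFF_HEAP)
    rdd.count()   // materializes the cached blocks off-heap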

Re: Spark Master stops abruptly

2017-01-09 Thread streamly tester
After investigating thoroughly, I found out that it was ZooKeeper causing the issue. In fact, one of the three ZooKeeper nodes I configured was down. On Mon, Jan 9, 2017 at 7:16 PM, streamly tester wrote: > Hi, > > I have setup spark in cluster of 4 machines, with 2

Re: How to connect Tableau to databricks spark?

2017-01-09 Thread Silvio Fiorito
Hi Raymond, Are you using a Spark 2.0 or 1.6 cluster? With Spark 2.0 it’s just a matter of entering the hostname of your Databricks environment, the HTTP path from the cluster page, and your Databricks credentials. Thanks, Silvio From: Raymond Xie Date: Sunday, January

Spark UI not coming up in EMR

2017-01-09 Thread Saurabh Malviya (samalviy)
The Spark web UI for detailed monitoring of streaming jobs stops rendering after 2 weeks; it just keeps looping while fetching the page. Is there any way I can get that page, or logs where I can see how many events are coming into Spark for each interval? -Saurabh

Re: Spark SQL 1.6.3 ORDER BY and partitions

2017-01-09 Thread Yong Zhang
I am not sure what you mean by the "table" being comprised of 200/1200 partitions. A partition could mean the dataset (RDD/DataFrame) being chunked within Spark and then processed, or it could mean the partition metadata you define for the table in Hive. If you mean the first one, so

Spark Master stops abruptly

2017-01-09 Thread streamly tester
Hi, I have set up Spark on a cluster of 4 machines, with 2 masters and 2 workers, and ZooKeeper doing the master election. Here is my configuration. spark-env.sh contains: export SPARK_MASTER_IP=master1,master2 conf/slaves contains: worker1 worker2 conf/spark-defaults.conf
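
For comparison, a sketch of the standby-masters setup from the Spark standalone HA documentation (hostnames are placeholders): the ZooKeeper recovery settings go through SPARK_DAEMON_JAVA_OPTS, and each master gets a single SPARK_MASTER_IP rather than a comma-separated list.

    # conf/spark-env.sh on each master (placeholder hosts)
    export SPARK_MASTER_IP=master1     # master2 on the other machine
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"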

Re: What is the difference between hive on spark and spark on hive?

2017-01-09 Thread Nicholas Hakobian
Hive on Spark is Hive which takes SQL statements in and creates Spark jobs for processing, instead of MapReduce or Tez. There is no such thing as "Spark on Hive", but there is SparkSQL. SparkSQL can accept programmatic statements, or it can parse SQL statements to produce a native Spark
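
A short illustration of the SparkSQL path described above (the sales data is made up): the SQL string is parsed and executed as native Spark jobs, with no Hive execution engine involved.

    import spark.implicits._

    Seq(("books", 2), ("music", 5)).toDF("category", "qty")
      .createOrReplaceTempView("sales")
    // Parsed by SparkSQL's own parser/optimizer, not Hive's engine.
    spark.sql("SELECT category, count(*) AS n FROM sales GROUP BY category").show()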

Re: Spark GraphFrame ConnectedComponents

2017-01-09 Thread Ankur Srivastava
Hi Steve, I could get the application working by setting "spark.hadoop.fs.default.name". Thank you!! And thank you for your input on using S3 for checkpoint. I am still working on PoC so will consider using HDFS for the final implementation. Thanks Ankur On Fri, Jan 6, 2017 at 9:57 AM, Steve
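
A sketch of the fix being described, with placeholder HDFS URIs: set the default filesystem and point the checkpoint directory, which connectedComponents requires, at HDFS instead of S3.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.hadoop.fs.default.name", "hdfs://namenode:8020")  // placeholder
      .getOrCreate()
    spark.sparkContext.setCheckpointDir("hdfs://namenode:8020/tmp/checkpoints")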

Re: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Md. Rezaul Karim
Hi, Currently I have been using Spark 2.1.0 for ML, and so far I have not experienced any critical issues. It's much more stable than Spark 2.0.1/2.0.2, I would say. Regards, _ *Md. Rezaul Karim*, BSc, MSc PhD Researcher, INSIGHT Centre for Data Analytics National

RE: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Ankur Jain
Thanks Rezaul… Does Spark 2.1.0 still have any issues w.r.t. stability? Regards, Ankur From: Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org] Sent: Monday, January 09, 2017 5:02 PM To: Ankur Jain Cc: user@spark.apache.org Subject: Re: Machine Learning in Spark 1.6

[PySpark] py4j.Py4JException: PythonFunction Does Not Exist

2017-01-09 Thread Tegan Snyder
Hello, I recently set up a small 3-node cluster of Spark on an existing Hadoop installation. I'm running into an error message when attempting to use the pyspark shell. I can reproduce the error in the pyspark shell with the following example: from operator import add text =

Re: Entering the variables in the Argument part in Submit job section to run a spark code on Google Cloud

2017-01-09 Thread Dinko Srkoč
Not knowing what the code that handles those arguments looks like, I would, in the "Arguments" field for submitting a Dataproc job, put: --trainFile=gs://Anahita/small_train.dat --testFile=gs://Anahita/small_test.dat --numFeatures=9947 --numRounds=100 ... provided you still keep those files in
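
Since the actual argument-handling code isn't shown in the thread, here is a hypothetical sketch of parsing such "--key=value" strings as they would arrive in main(args):

    // Hypothetical parser for "--key=value" arguments.
    val args = Array(
      "--trainFile=gs://Anahita/small_train.dat",
      "--numRounds=100")
    val opts = args.collect {
      case s if s.startsWith("--") && s.contains("=") =>
        val Array(k, v) = s.stripPrefix("--").split("=", 2)
        k -> v
    }.toMap
    println(opts("trainFile"))   // gs://Anahita/small_train.dat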

Re: Tuning spark.executor.cores

2017-01-09 Thread Aaron Perrin
That setting defines the total number of tasks that an executor can run in parallel. Each node is partitioned into executors, each with identical heap and cores. So, it can be a balancing act to optimally set these values, particularly if the goal is to maximize CPU usage with memory and other
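
A worked example of that balancing act, under stated assumptions (a 16-core, 64GB node, with a core and some memory reserved for the OS and daemons):

    val coresPerNode = 16
    val reservedCores = 1                 // OS / node daemons
    val executorCores = 5                 // spark.executor.cores
    val executorsPerNode = (coresPerNode - reservedCores) / executorCores  // 3
    val memPerNodeGb = 64
    val reservedMemGb = 9                 // OS + overhead headroom
    val executorMemGb = (memPerNodeGb - reservedMemGb) / executorsPerNode  // 18
    // -> spark.executor.cores=5, spark.executor.memory=18g (illustrative)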

What is the difference between hive on spark and spark on hive?

2017-01-09 Thread 李斌松
What is the difference between hive on spark and spark on hive?

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi Takeshi, Thanks for the answer. My UDAF aggregates data into an array of rows. Apparently this makes it ineligible for hash-based aggregation, based on the logic at:

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Takeshi Yamamuro
Hi, Spark always uses hash-based aggregates if the types of the aggregated data are supported there; otherwise it cannot use hash-based aggregation and falls back to sort-based aggregation. See:
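
A quick way to see that fallback for yourself (operator names can vary by Spark version; the comments below reflect Spark 2.1 behavior): compare explain() output for a primitive aggregation buffer against an array-building one.

    import org.apache.spark.sql.functions._

    val ds = spark.range(100).withColumn("k", col("id") % 10)
    ds.groupBy("k").agg(sum("id")).explain()
    //   -> HashAggregate (primitive buffer type)
    ds.groupBy("k").agg(collect_list("id")).explain()
    //   -> SortAggregate (array buffer is not hash-eligible)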

How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi all, It appears to me that Dataset.groupBy().agg(udaf) requires a full sort, which is very inefficient for certain aggregations. The code is very simple: - I have a UDAF - What I want to do is: dataset.groupBy(cols).agg(udaf).count() The physical plan I got was: *HashAggregate(keys=[],

Tuning spark.executor.cores

2017-01-09 Thread Appu K
Are there use cases for which it is advisable to set spark.executor.cores to a value greater than the actual number of cores?

Fwd: Entering the variables in the Argument part in Submit job section to run a spark code on Google Cloud

2017-01-09 Thread Anahita Talebi
Dear friends, I am trying to run a Spark job on Google Cloud using Submit Job. https://cloud.google.com/dataproc/docs/tutorials/spark-scala My question is about the "argument" part. In my Spark code, there are some variables whose values are defined in a shell file (.sh), as

Re: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Md. Rezaul Karim
Hello Jain, I would recommend using Spark MLlib (and ML) of *Spark 2.1.0* with the following features: - ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering -

Re: AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread Santosh.B
Yes, it supports that, but from what I have seen it updates line by line. Please see the link below: https://gist.github.com/QwertyManiac/4724582 This is very slow because of the Avro append. I am thinking of something like what we normally do for text files, where we buffer the data up to a size and then flush the buffer.

Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Ankur Jain
Hi Team, I want to start a new project with ML, but wanted to know which version of Spark is more stable and has more features w.r.t. ML. Please suggest your opinion... Thanks in Advance... Thanks & Regards Ankur Jain Technical Architect - Big Data | IoT |

Re: AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread Jörn Franke
Avro itself supports it, but I am not sure if this functionality is available through the Spark API. Just out of curiosity: if your use case is only writing to HDFS, then you might simply use Flume. > On 9 Jan 2017, at 09:58, awkysam wrote: > > Currently for our

AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread awkysam
Currently for our project we are collecting data and pushing it into Kafka, with messages in Avro format. We need to push this data into HDFS using Spark Streaming, and in HDFS it is also stored in Avro format. We are partitioning the data per day. So when we write data into HDFS we
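
For reference, a hedged sketch of the write path being described, assuming avro-mapred (hadoop2 classifier) is on the classpath; the schema, records, and output path are all placeholders. Note that each batch creates new files; nothing is appended.

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job

    val schemaJson =
      """{"type":"record","name":"Event","fields":[{"name":"msg","type":"string"}]}"""
    // Avro Schema objects are not serializable, so parse per partition.
    val records = sc.parallelize(Seq("a", "b")).mapPartitions { it =>
      val schema = new Schema.Parser().parse(schemaJson)
      it.map { s =>
        val r = new GenericData.Record(schema)
        r.put("msg", s)
        (new AvroKey[GenericRecord](r), NullWritable.get())
      }
    }
    val job = Job.getInstance()
    AvroJob.setOutputKeySchema(job, new Schema.Parser().parse(schemaJson))
    records.saveAsNewAPIHadoopFile(
      "hdfs:///data/events/day=2017-01-09",   // per-day partition dir (placeholder)
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[GenericRecord]],
      job.getConfiguration)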

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-09 Thread Appu K
That explains it. Appreciate the help Hyukjin! Thank you On 9 January 2017 at 1:08:02 PM, Hyukjin Kwon (gurwls...@gmail.com) wrote: Hi Appu, I believe that textFile and filter came from...
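
For anyone else puzzled by the extra job: it typically comes from eagerly reading the file to infer the header/schema. A small sketch (path and columns hypothetical) that avoids the inference pass by supplying the schema up front:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))
    // No inference pass over the file is needed when the schema is given.
    val df = spark.read.schema(schema).option("header", "true")
      .csv("/path/to/file.csv")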