spark cluster performance decreases by adding more nodes

2017-05-17 Thread Junaid Nasir
I have a large data set of 1B records and want to run analytics using Apache Spark because of the scaling it provides, but I am seeing an anti-pattern here: the more nodes I add to the Spark cluster, the longer the completion time. The data store is Cassandra, and queries are run through Zeppelin. I have tried many

Re: spark cluster performance decreases by adding more nodes

2017-05-17 Thread darren
Maybe your master or Zeppelin server is running out of memory, and the more data it receives the more memory swapping it has to do. Something to check. On Wed, May 17, 2017 at 11:14 AM -0400, "Junaid Nasir" wrote: I have a large data

Re: spark cluster performance decreases by adding more nodes

2017-05-17 Thread Jörn Franke
The issue might be the group by, which under certain circumstances can cause a lot of traffic to one node. With fewer nodes there is of course less of this transfer. Have you checked what the UI reports? > On 17. May 2017, at 17:13, Junaid Nasir wrote: > > I have a large
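If the grouping key is skewed, most of that shuffle traffic lands on a single node. As a diagnostic (a minimal sketch, not from this thread; `df` and the column name "key" are placeholder names), the per-key row counts can show whether a handful of hot keys dominate:

    // Count rows per grouping key and list the heaviest ones.
    // `df` and the column "key" are placeholders for the actual data.
    import org.apache.spark.sql.functions._

    val keyCounts = df.groupBy("key").count()
    keyCounts.orderBy(desc("count")).show(20, false)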

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-17 Thread kant kodali
Thanks Mike & Ryan. Now I can finally see my 5KB messages. However I am running into the following error. OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00073470, 530579456, 0) failed; error='Cannot allocate memory' (errno=12) # There is insufficient memory for the Java

Re: Cloudera 5.8.0 and spark 2.1.1

2017-05-17 Thread Arkadiusz Bicz
It is working fine, but it is not supported by Cloudera. On May 17, 2017 1:30 PM, "issues solution" wrote: > Hi, > is it possible to use the prebuilt version of Spark 2.1 inside Cloudera 5.8 > where Scala is 2.1.0 not 2.1.1 and Java is 1.7 not 1.8 > > Why? > >

Spark Launch programmatically - Basics!

2017-05-17 Thread Nipun Arora
Hi, I am trying to get a simple Spark application to run programmatically. I looked at http://spark.apache.org/docs/2.1.0/api/java/index.html?org/apache/spark/launcher/package-summary.html, at the following code. public class MyLauncher { public static void main(String[] args) throws
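For reference, the launcher package in that link can be driven from Scala as well as Java. A minimal sketch (the jar path, main class, and master are placeholder values) that submits an application and waits for it to exit:

    import org.apache.spark.launcher.SparkLauncher

    object MyLauncher {
      def main(args: Array[String]): Unit = {
        // All paths and names below are placeholders for your own application.
        val process = new SparkLauncher()
          .setAppResource("/path/to/my-app.jar")
          .setMainClass("com.example.MyApp")
          .setMaster("local[*]")
          .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
          .launch()
        // launch() returns a java.lang.Process; block until the child JVM exits.
        process.waitFor()
      }
    }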

Re: spark cluster performance decreases by adding more nodes

2017-05-17 Thread ayan guha
How many nodes do you have in the Cassandra cluster? On Thu, 18 May 2017 at 1:33 am, Jörn Franke wrote: > The issue might be the group by, which under certain circumstances can cause > a lot of traffic to one node. With fewer nodes there is of course less of this > transfer. >

How to flatten struct into a dataframe?

2017-05-17 Thread kant kodali
Hi, I have the following schema, and I am trying to put the structure below into a DataFrame or Dataset such that each field inside the struct is a column in the DataFrame. I tried to follow this link and
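One way this is commonly done (a minimal sketch; `df` and the struct column name "payload" are placeholders, not the poster's actual schema) is to select the struct's fields with a wildcard, which promotes each nested field to a top-level column:

    // Promote every field of the struct column to a top-level column.
    // "payload" stands in for the actual struct column name.
    val flattened = df.selectExpr("payload.*")
    flattened.printSchema()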

spark ML Recommender program

2017-05-17 Thread Arun
Hi, I am writing a Spark ML movie recommender program in IntelliJ on Windows 10. The dataset is 2MB with 10 datapoints, and my laptop has 8GB of memory. When I set the number of iterations to 10 it works fine; when I set the number of iterations to 20 I get a StackOverflowError. What's the solution? Thanks.
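The replies further down in this digest suggest sharing the code and increasing memory. Another commonly mentioned mitigation for stack overflows at higher iteration counts, not suggested in this thread and therefore only an assumption about the cause, is to enable checkpointing so the lineage of the iterative job does not grow unbounded. A minimal Scala sketch (column names, checkpoint path, and `ratingsDF` are placeholders):

    import org.apache.spark.ml.recommendation.ALS

    // Truncate the RDD lineage periodically so a deep iteration count
    // does not blow the stack. The directory is a placeholder path.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    val als = new ALS()
      .setMaxIter(20)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")
      .setCheckpointInterval(5) // checkpoint every 5 iterations

    // val model = als.fit(ratingsDF) // ratingsDF is a placeholder DataFrame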

Re: Spark <--> S3 flakiness

2017-05-17 Thread lucas.g...@gmail.com
Steve, just to clarify: "FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random option. " Are you talking about the hadoop-aws lib or hadoop
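For context, the quoted fs.s3a.experimental.fadvise option is a Hadoop (S3A) property from the 2.8 line, so it can be handed to Spark through the spark.hadoop.* prefix. A hedged sketch (the bucket path is a placeholder):

    import org.apache.spark.sql.SparkSession

    // Pass the S3A option through to the Hadoop configuration via the
    // spark.hadoop.* prefix; requires the Hadoop 2.8+ S3A client on the classpath.
    val spark = SparkSession.builder()
      .appName("s3a-random-fadvise")
      .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
      .getOrCreate()

    // "s3a://my-bucket/path" is a placeholder location.
    val df = spark.read.parquet("s3a://my-bucket/path")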

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Kun Liu
What is the problem here? To use Toree there are some setup steps. Best, On Wed, May 17, 2017 at 7:22 PM, upendra 1991 wrote: > What's the best way to use Jupyter with Scala Spark? I tried Apache Toree > and created a kernel but did not get it working. I believe

Re: scalastyle violation on mvn install but not on mvn package

2017-05-17 Thread Marcelo Vanzin
scalastyle runs on the "verify" phase, which is after package but before install. On Wed, May 17, 2017 at 5:47 PM, yiskylee wrote: > ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean > package > works, but > ./build/mvn -Pyarn -Phadoop-2.4

How to see the full contents of a dataset or dataframe in structured streaming?

2017-05-17 Thread kant kodali
Hi All, how can I see the full contents of a dataset or dataframe in structured streaming, just like we normally do with *df.show(false)*? Is there any parameter I can pass in to the code below? val df1 = df.selectExpr("payload.data.*"); df1.writeStream().outputMode("append").format("console").start()
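For what it's worth, the console sink takes options of its own in Spark 2.x ("truncate" and "numRows"); treat the exact option names as something to verify against the version in use. A Scala sketch built on the code above:

    // Console sink options: print full cell contents and more rows per batch.
    val df1 = df.selectExpr("payload.data.*")

    val query = df1.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", "false")
      .option("numRows", "50")
      .start()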

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Mark Vervuurt
Hi Upendra, I got toree to work and I described it in the following JIRA issue. See the last comment of the issue. https://issues.apache.org/jira/browse/TOREE-336 Mark > On 18 May 2017, at 04:22, upendra 1991

Jupyter spark Scala notebooks

2017-05-17 Thread upendra 1991
What's the best way to use Jupyter with Scala Spark? I tried Apache Toree and created a kernel but did not get it working. I believe there is a better way. Please suggest any best practices.

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Richard Moorhead
Take a look at Apache Zeppelin; it has both Python and Scala interpreters. https://zeppelin.apache.org/ Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive

Re: spark ML Recommender program

2017-05-17 Thread Kun Liu
Hi Arun, would you like to show us the code? On Wed, May 17, 2017 at 8:15 PM, Arun wrote: > Hi > > I am writing a Spark ML movie recommender program in IntelliJ on Windows 10 > The dataset is 2MB with 10 datapoints, my laptop has 8GB memory > > When I set the number of iterations

Re: spark ML Recommender program

2017-05-17 Thread Mark Vervuurt
If you are running locally, try increasing driver memory to, for example, 4G and executor memory to 3G. Regards, Mark > On 18 May 2017, at 05:15, Arun wrote: > > Hi > > I am writing a Spark ML movie recommender program in IntelliJ on Windows 10 >

scalastyle violation on mvn install but not on mvn package

2017-05-17 Thread yiskylee
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package works, but ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean install triggers a scalastyle violation error. Is the scalastyle check not used on package but only on install? To install, should

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Stephen Boesch
Jupyter with Toree works well for my team. Jupyter is much more refined than Zeppelin as far as notebook features and usability: shortcuts, editing, etc. The caveat is that it is better to run a separate server instance for Python/PySpark vs Scala/Spark. 2017-05-17 19:27 GMT-07:00 Richard Moorhead

Re: Jupyter spark Scala notebooks

2017-05-17 Thread kanth909
Which of these notebooks can help populate real-time graphs through a web socket or some sort of push mechanism? > On May 17, 2017, at 8:50 PM, Stephen Boesch wrote: > > Jupyter with Toree works well for my team. Jupyter is much more refined than >

Re: checkpointing without streaming?

2017-05-17 Thread Tathagata Das
Why not just save the RDD to a proper file? Text file, sequence file, many options. Then it's standard to read it back in a different program. On Wed, May 17, 2017 at 12:01 AM, neelesh.sa wrote: > Is it possible to checkpoint an RDD in one run of my application and
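A minimal sketch of that suggestion (the output path is a placeholder), using the example RDD from the question below: write it out in one run and read it back in a later run:

    // Run 1: compute and persist the RDD to durable storage.
    val x = List(1, 2, 3, 4)
    val y = sc.parallelize(x, 2).map(c => c * 2)
    y.saveAsObjectFile("/tmp/my-rdd")   // or saveAsTextFile for a readable format

    // Run 2 (a separate application): load it back and reuse it.
    val restored = sc.objectFile[Int]("/tmp/my-rdd")
    restored.count()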

RE: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2017-05-17 Thread Mendelson, Assaf
Thanks for the response. I will try with log4j. That said, I am running on Windows using winutils.exe and am still getting the warning. Thanks, Assaf. From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Tuesday, May 16, 2017 6:55 PM To: Mendelson, Assaf Cc:
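Since the plan above is to go through log4j (Spark 2.x ships log4j 1.x), a minimal sketch of silencing only that logger from application code; the same effect can be had with one line in conf/log4j.properties setting the same logger to ERROR:

    import org.apache.log4j.{Level, Logger}

    // Silence only the native-code loader warning, leaving other Hadoop logging untouched.
    Logger.getLogger("org.apache.hadoop.util.NativeCodeLoader").setLevel(Level.ERROR)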

Re: checkpointing without streaming?

2017-05-17 Thread neelesh.sa
Is it possible to checkpoint an RDD in one run of my application and use the saved RDD in the next run of my application? For example, with the following code: val x = List(1,2,3,4) val y = sc.parallelize(x, 2).map(c => c * 2) y.checkpoint y.count Is it possible to read the checkpointed RDD in

Parquet file amazon s3a timeout

2017-05-17 Thread Karin Valisova
Hello! I'm working with some Parquet files saved on an Amazon service and loading them into a dataframe with Dataset df = spark.read().parquet(parketFileLocation); however, after some time I get the "Timeout waiting for connection from pool" exception. I hope I'm not mistaken, but I think that there's
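The reply further down in this digest is truncated, so as an assumption rather than the thread's answer: that message usually comes from the S3A client's HTTP connection pool running dry, and the pool size is governed by fs.s3a.connection.maximum. A hedged Scala sketch of raising it (the value and the bucket path are placeholders):

    import org.apache.spark.sql.SparkSession

    // Enlarge the S3A HTTP connection pool; the default is small and
    // parallel Parquet reads can exhaust it. Values below are placeholders.
    val spark = SparkSession.builder()
      .appName("s3a-parquet-read")
      .config("spark.hadoop.fs.s3a.connection.maximum", "100")
      .getOrCreate()

    val df = spark.read.parquet("s3a://my-bucket/parquet-path")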

Re: Spark <--> S3 flakiness

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote: Steve, thanks for the reply. Digging through all the documentation now. Much appreciated! FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads,

Re: Parquet file amazon s3a timeout

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 11:13, Karin Valisova wrote: Hello! I'm working with some Parquet files saved on an Amazon service and loading them into a dataframe with Dataset df = spark.read().parquet(parketFileLocation); however, after some time I get the

Cloudera 5.8.0 and spark 2.1.1

2017-05-17 Thread issues solution
Hi, is it possible to use the prebuilt version of Spark 2.1 inside Cloudera 5.8, where Scala is 2.1.0 not 2.1.1 and Java is 1.7 not 1.8? Why? I am in a corporate environment and I want to test the latest version of Spark, but my problem is that I don't know whether version 2.1.1 of Spark can work with this

Re: s3 bucket access/read file

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 00:10, jazzed wrote: How did you solve the problem with V4? Which V4 problem? Authentication? You need to declare the explicit s3a endpoint via fs.s3a.endpoint, otherwise you get a generic "bad auth" message which
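A minimal sketch of the endpoint setting described above, for a V4-signing region (the endpoint value is just an example; use the one matching the bucket's region):

    import org.apache.spark.sql.SparkSession

    // Point S3A at the region-specific endpoint so V4 request signing is used.
    // "s3.eu-central-1.amazonaws.com" is an example endpoint, not the poster's.
    val spark = SparkSession.builder()
      .appName("s3a-v4-endpoint")
      .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
      .getOrCreate()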