Re: Connect the two tables in spark sql

2016-03-01 Thread Mich Talebzadeh
If you are using Spark-sql as opposed to spark-shell, then you can just use UNION as in SQL for this. Pretty straightforward. SELECT * from TABLE_A UNION SELECT * from TABLE_B ORDER BY COLUMN_A, COLUMN_B; Example spark-sql> SELECT * FROM dummy where id = 1 > UNION > SELECT *
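
For reference, a minimal sketch of the same UNION driven from the Scala side (assuming the SQL dialect in use accepts UNION, as in the spark-sql example above; TABLE_A, TABLE_B and the columns are the placeholders used in the thread):

    // assumes sc is an existing SparkContext and both tables are registered
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // UNION de-duplicates rows; use UNION ALL to keep duplicates
    val combined = sqlContext.sql(
      "SELECT * FROM TABLE_A UNION SELECT * FROM TABLE_B ORDER BY COLUMN_A, COLUMN_B")
    combined.show()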

Re: SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
Hi Mich, thanks a ton for your kind response, but this error was happening because of loading derby classes more than once. In my second email I mentioned the steps that I took in order to resolve the issue. Thanks and Regards, Gourav On Tue, Mar 1, 2016 at 8:54 PM, Mich Talebzadeh

Re: Spark UI standalone "crashes" after an application finishes

2016-03-01 Thread Gourav Sengupta
Hi Teng, I was not asking the question, I was responding in terms of what to expect from the SPARK UI depending on how you start your SPARK application. Thanks and Regards, Gourav On Tue, Mar 1, 2016 at 8:30 PM, Teng Qiu wrote: > as Gourav said, the application UI on port 4040

RE: Connect the two tables in spark sql

2016-03-01 Thread Mao, Wei
It should be a “union” operation instead of “join”. And besides Ted’s answer, if you are working with the DataSet API: def union(other: Dataset[T]): Dataset[T] = withPlan[T](other){ (left, right) => Thanks, William From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, March 2, 2016 11:41

Re: Spark executor Memory profiling

2016-03-01 Thread Nirav Patel
Thanks Nilesh, thanks for sharing those docs. I have come across most of those tunings in the past and, believe me, I have tuned the heck out of this job. What I can't believe is that Spark needs 4x more resources than MapReduce to run the same job (for a dataset of magnitude >100GB). I was able to run my job

Re: Connect the two tables in spark sql

2016-03-01 Thread Ted Yu
You only showed one record from each table. Have you looked at the following method in DataFrame ? def unionAll(other: DataFrame): DataFrame = withPlan { On Tue, Mar 1, 2016 at 7:13 PM, Angel Angel wrote: > Hello Sir/Madam, > > I am using the spark sql for the data
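
A minimal sketch of the unionAll route (assuming Spark 1.x DataFrames with identical schemas; the table names are placeholders):

    val df1 = sqlContext.table("table1")
    val df2 = sqlContext.table("table2")
    // unionAll keeps duplicate rows; both inputs must have the same columns in the same order
    val result = df1.unionAll(df2)
    result.show()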

Connect the two tables in spark sql

2016-03-01 Thread Angel Angel
Hello Sir/Madam, I am using spark sql for the data operation. I have two tables with the same fields. Table 1 name address phone Number sagar india Table 2 name address phone Number jaya india 222 I want to join these tables in the following way Result Table

Does anyone have spark code style guide xml file ?

2016-03-01 Thread Minglei Zhang
Hello, I would appreciate it if you have an xml file for the following code style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide thanks.

Re: Does anyone have spark code style guide xml file ?

2016-03-01 Thread Ted Yu
See this in source repo: ./.idea/projectCodeStyle.xml On Tue, Mar 1, 2016 at 6:55 PM, zml张明磊 wrote: > Hello, > > > > Appreciate if you have xml file with the following style code ? > > https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide > > > >

Does anyone have spark code style guide xml file ?

2016-03-01 Thread zml张明磊
Hello, I would appreciate it if you have an xml file for the following code style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide thanks.

Re: Spark executor killed without apparent reason

2016-03-01 Thread Ted Yu
Using pastebin seems to be better. The attachment may not go through. FYI On Tue, Mar 1, 2016 at 6:07 PM, Jeff Zhang wrote: > Do you mind to attach the whole yarn app log ? > > On Wed, Mar 2, 2016 at 10:03 AM, Nirav Patel > wrote: > >> Hi Ryan, >> >>

Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-01 Thread Nirav Patel
Hi, I have a spark job that runs on yarn and keeps failing at the line where I do: val hConf = HBaseConfiguration.create hConf.setInt("hbase.client.scanner.caching", 1) hConf.setBoolean("hbase.cluster.distributed", true) new PairRDDFunctions(hbaseRdd).saveAsHadoopDataset(jobConfig)
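
For context, a hedged sketch of what the jobConfig for saveAsHadoopDataset typically looks like when writing to HBase with the old-style mapred TableOutputFormat (the table name is a placeholder; hbaseRdd is assumed to be an RDD of (ImmutableBytesWritable, Put) pairs):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.rdd.PairRDDFunctions

    val hConf = HBaseConfiguration.create
    hConf.setInt("hbase.client.scanner.caching", 1)
    hConf.setBoolean("hbase.cluster.distributed", true)

    val jobConfig = new JobConf(hConf, this.getClass)
    jobConfig.setOutputFormat(classOf[TableOutputFormat])
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, "my_table")  // placeholder table name

    new PairRDDFunctions(hbaseRdd).saveAsHadoopDataset(jobConfig)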

Re: Spark executor killed without apparent reason

2016-03-01 Thread Jeff Zhang
Do you mind attaching the whole yarn app log? On Wed, Mar 2, 2016 at 10:03 AM, Nirav Patel wrote: > Hi Ryan, > > I did search "OutOfMemoryError" earlier and just now but it doesn't > indicate anything else. > > Another thing is Job fails at "saveAsHadoopDataset" call to

Re: Spark executor killed without apparent reason

2016-03-01 Thread Nirav Patel
Hi Ryan, I did search "OutOfMemoryError" earlier and just now but it doesn't indicate anything else. Another thing is the job fails at the "saveAsHadoopDataset" call on a huge rdd. Most of the executors fail at this stage. I don't understand that as well, because that should be a straight write job to

Re: Update edge weight in graphx

2016-03-01 Thread 王肇康
Naveen, You can "modify" an RDD by creating a new RDD based on the existing one. You can add vertices and edges via a **union** operation and create a new graph with the new vertex RDD and the edge RDD. In this way, you can "modify" the old graph. Best wishes, Zhaokang > On Mar 2 2016,
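
A small sketch of that union-and-rebuild pattern (assuming an existing Graph[String, Double] named graph; the sample vertex and edge values are placeholders):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    val newVertices = sc.parallelize(Seq((100L: VertexId, "new-node")))
    val newEdges    = sc.parallelize(Seq(Edge(1L, 100L, 0.5)))

    // Build a brand-new graph from the unioned RDDs; the old, immutable graph is untouched
    val updated = Graph(graph.vertices.union(newVertices), graph.edges.union(newEdges))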

Re: Configure Spark Resource on AWS CLI Not Working

2016-03-01 Thread Jonathan Kelly
Weiwei, Please see this documentation for configuring Spark and other apps on EMR 4.x: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html This documentation about what has changed between 3.x and 4.x should also be helpful:

Building a REST Service with Spark back-end

2016-03-01 Thread Don Drake
I'm interested in building a REST service that utilizes a Spark SQL Context to return records from a DataFrame (or IndexedRDD?) and even add/update records. This will be a simple REST API, with only a few end-points. I found this example: https://github.com/alexmasselot/spark-play-activator

Re: scikit learn on EMR PySpark

2016-03-01 Thread Jonathan Kelly
Hi, Myles, We do not install scikit-learn or spark-sklearn on EMR clusters by default, but you may install them yourself by just doing "sudo pip install scikit-learn spark-sklearn" (either by ssh'ing to the master instance and running this manually, or by running it as an EMR Step). ~ Jonathan

Re: DataSet Evidence

2016-03-01 Thread Michael Armbrust
Hey Steve, This isn't possible today, but it would not be hard to allow. You should open a feature request JIRA. Michael On Mon, Feb 29, 2016 at 4:55 PM, Steve Lewis wrote: > I have a relatively complex Java object that I would like to use in a > dataset > > if I say

Re: Mapper side join with DataFrames API

2016-03-01 Thread Michael Armbrust
It's helpful to always include the output of df.explain(true) when you are asking about performance. On Mon, Feb 29, 2016 at 6:14 PM, Deepak Gopalakrishnan wrote: > Hello All, > > I'm trying to join 2 dataframes A and B with a > > sqlContext.sql("SELECT * FROM A INNER JOIN B ON

Re: Checkpoint RDD ReliableCheckpointRDD at foreachRDD has different number of partitions from original RDD MapPartitionsRDD at reduceByKeyAndWindow

2016-03-01 Thread RK
I had an incorrect variable name in line 70 while sanitizing the code for this email. Here is the actual code: 45    val windowedEventCounts = events.reduceByKeyAndWindow(_ + _, _ - _, 30, 5, filterFunc = filterFunction)         val usefulEvents = windowedEventCounts.filter { case (event,

scikit learn on EMR PySpark

2016-03-01 Thread Gartland, Myles
New to Spark and MLlib. Coming from scikit-learn. I am launching my Spark 1.6 instance through AWS EMR and pyspark. All the examples using Mllib work fine. But I have seen a couple of examples where you can combine scikit-learn packages and syntax with mllib. Like in this example-

Checkpoint RDD ReliableCheckpointRDD at foreachRDD has different number of partitions from original RDD MapPartitionsRDD at reduceByKeyAndWindow

2016-03-01 Thread RK
Here is a code snippet in my spark job. I added the numbers at the start of code lines to show the relevant line numbers in exception. 45    val windowedEventCounts = events.reduceByKeyAndWindow(_ + _, _ - _, 30, 5, filterFunc = filterFunction)         val usefulEvents =

RE: Update edge weight in graphx

2016-03-01 Thread Mohammed Guller
Like RDDs, Graphs are also immutable. Mohammed Author: Big Data Analytics with Spark -Original Message- From: naveen.marri [mailto:naveenkumarmarri6...@gmail.com] Sent: Monday, February 29, 2016 9:11 PM To: user@spark.apache.org Subject: Update edge weight in graphx Hi, I'm

Re: [Proposal] Enabling time series analysis on spark metrics

2016-03-01 Thread Reynold Xin
Is the suggestion just to use a different config (and maybe fallback to appid) in order to publish metrics? Seems reasonable. On Tue, Mar 1, 2016 at 8:17 AM, Karan Kumar wrote: > +dev mailing list > > Time series analysis on metrics becomes quite useful when running

Re: Converting array to DF

2016-03-01 Thread Ashok Kumar
Thanks great val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toSeq.toDF("weights","value").orderBy(desc("value")).collect.foreach(println) On Tuesday, 1 March 2016, 20:52, Shixiong(Ryan) Zhu wrote: For Array,

Re: Spark executor killed without apparent reason

2016-03-01 Thread Shixiong(Ryan) Zhu
Could you search "OutOfMemoryError" in the executor logs? It could be "OutOfMemoryError: Direct Buffer Memory" or something else. On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel wrote: > Hi, > > We are using spark 1.5.2 on yarn. We have a spark application utilizing > about

How to control the number of parquet files getting created under a partition ?

2016-03-01 Thread SRK
Hi, How can I control the number of parquet files getting created under a partition? I have my sqlContext queries to create a table and insert the records as follows. It seems to create around 250 parquet files under each partition though I was expecting that to create around 2 or 3 files. Due to
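
One commonly used lever, sketched here under the assumption that the insert can be done through the DataFrame writer (names, path and target file count are placeholders): reduce the number of partitions right before the write, since each partition writes its own file. The number of shuffle partitions (spark.sql.shuffle.partitions) is the other knob that usually drives that ~200-file count.

    val df = sqlContext.sql("SELECT * FROM source_table")
    df.coalesce(3)                       // aim for ~3 files per output partition
      .write
      .partitionBy("event_date")         // the partition column
      .mode("append")
      .parquet("/path/to/output/table")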

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-03-01 Thread Teng Qiu
and also make sure that hbase-site.xml is set in your classpath on all nodes, both master and workers, and also the client. Normally I put it into $SPARK_HOME/conf/ and then the spark cluster will be started with this conf file. btw. @Ted, did you try inserting into an hbase table with spark's HiveContext?

Re: Converting array to DF

2016-03-01 Thread Shixiong(Ryan) Zhu
For Array, you need to call `toSeq` first. Scala can convert Array to ArrayOps automatically. However, it's not a `Seq` and you need to call `toSeq` explicitly. On Tue, Mar 1, 2016 at 1:02 AM, Ashok Kumar wrote: > Thank you sir > > This works OK > import
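
A compact sketch of the conversion being described:

    import sqlContext.implicits._   // brings toDF into scope

    val weights = Array(("a", 3), ("b", 2), ("c", 5))
    // Array itself has no toDF, so convert to a Seq first
    val df = weights.toSeq.toDF("weights", "value")
    df.show()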

Re: Shuffle guarantees

2016-03-01 Thread Corey Nolet
Nevermind, a look @ the ExternalSorter class tells me that the iterator for each key that's only partially ordered ends up being merge sorted by equality after the fact. Wanted to post my finding on here for others who may have the same questions. On Tue, Mar 1, 2016 at 3:05 PM, Corey Nolet

Re: Using a non serializable third party JSON serializable on a spark worker node throws NotSerializableException

2016-03-01 Thread Shixiong(Ryan) Zhu
Could you show the full companion object? It looks weird that having `override` in a companion object of a case class. On Tue, Mar 1, 2016 at 11:16 AM, Yuval Itzchakov wrote: > As I said, it is the method which eventually serializes the object. It is > declared inside a

Re: Spark UI standalone "crashes" after an application finishes

2016-03-01 Thread Teng Qiu
as Gourav said, the application UI on port 4040 will no longer be available after your spark app finishes. you should go to the spark master's UI (port 8080), and take a look at "completed applications"... refer to the doc: http://spark.apache.org/docs/latest/monitoring.html and read the first "note that" :)
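
As a hedged sketch, keeping finished applications inspectable usually comes down to event logging plus the history server; the directory below is a placeholder and must exist and be readable by the history server:

    // in the application (or equivalently in spark-defaults.conf)
    val conf = new org.apache.spark.SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-event-logs")
    // the history server (sbin/start-history-server.sh) then reads the same
    // directory via spark.history.fs.logDirectory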

Re: Spark UI standalone "crashes" after an application finishes

2016-03-01 Thread Gourav Sengupta
Hi, in case you are submitting your SPARK jobs, then the UI is only available while the job is running. If instead you are starting a SPARK cluster in standalone mode, or on HADOOP, etc., then the SPARK UI remains alive. The other way to keep the SPARK UI alive is to use the Jupyter notebook for Python

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Gourav Sengupta
Hi, which region are you using the EMR clusters from? Is there any tweaking of the HADOOP parameters that you are doing before starting the clusters? If you are using the AWS CLI to start the cluster, just send across the command. I have never, to date, faced any such issues in the Ireland region.

Re: Shuffle guarantees

2016-03-01 Thread Corey Nolet
The reason I'm asking, I see a comment in the ExternalSorter class that says this: "If we need to aggregate by key, we either use a total ordering from the ordering parameter, or read the keys with the same hash code and compare them with each other for equality to merge values". How can this be

Shuffle guarantees

2016-03-01 Thread Corey Nolet
So if I'm using reduceByKey() with a HashPartitioner, I understand that the hashCode() of my key is used to create the underlying shuffle files. Is anything other than hashCode() used in the shuffle files when the data is pulled into the reducers and run through the reduce function? The reason
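
For concreteness, the setup being asked about (pairs stands in for a key/value RDD; the partition count is a placeholder):

    import org.apache.spark.HashPartitioner

    // keys are routed to shuffle files by key.hashCode % numPartitions
    val counts = pairs.reduceByKey(new HashPartitioner(8), _ + _)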

Re: Spark 1.5 on Mesos

2016-03-01 Thread Timothy Chen
Can you go through the Mesos UI and look at the driver/executor log from the stderr file and see what the problem is? Tim > On Mar 1, 2016, at 8:05 AM, Ashish Soni wrote: > > Not sure what is the issue but i am getting below error when i try to run > spark PI example > >

Re: SparkR Count vs Take performance

2016-03-01 Thread Sean Owen
Yeah one surprising result is that you can't call isEmpty on an RDD of nonserializable objects. You can't do much with an RDD of nonserializable objects anyway, but they can exist as an intermediate stage. We could fix that pretty easily with a little copy and paste of the take() code; right now

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Alexander Pivovarov
EMR-4.3.0 and Spark-1.6.0 works fine for me I use r3.2xlarge boxes (spot) (even 3 slave boxes works fine) I use the following settings (in json) [ { "Classification": "spark-defaults", "Properties": { "spark.driver.extraJavaOptions": "-Dfile.encoding=UTF-8",

Re: Using a non serializable third party JSON serializable on a spark worker node throws NotSerializableException

2016-03-01 Thread Yuval Itzchakov
As I said, it is the method which eventually serializes the object. It is declared inside a companion object of a case class. The problem is that Spark will still try to serialize the method, as it needs to execute on the worker. How will that change the fact that `EncodeJson[T]` is not

Re: Sample sql query using pyspark

2016-03-01 Thread Maurin Lenglart
Hi, Thanks for the hint. I have tried to remove the limit from the query but the result is still the same. If I understand correctly, the func "sample()” is taking a sample of the result of the query and not sampling the original table that I am querying. I have a business use case to sample a
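
Expressed with the Scala DataFrame API, sampling the source table before aggregating looks roughly like this (the table name is taken from the thread; the fraction, seed and temp-table name are placeholders):

    val base = sqlContext.table("groupon_dropbox")
    // sample the source rows, then run the aggregation over the sample
    val sampled = base.sample(withReplacement = false, fraction = 0.1, seed = 42)
    sampled.registerTempTable("groupon_dropbox_sample")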

Re: SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Great, I hadn't noticed this isEmpty method. Well, serialization has been a problem in this project; we have noticed a lot of time being spent serializing and deserializing things to send to and get from the cluster. 2016-03-01 15:47 GMT-03:00 Sean Owen : > There is an "isEmpty"

Re: Using a non serializable third party JSON serializable on a spark worker node throws NotSerializableException

2016-03-01 Thread Shixiong(Ryan) Zhu
Don't know where "argonaut.EncodeJson$$anon$2" comes from. However, you can always put your code into a method of an "object". Then just call it like a Java static method. On Tue, Mar 1, 2016 at 10:30 AM, Yuval.Itzchakov wrote: > I have a small snippet of code which relies
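
A generic sketch of that suggestion — the JSON string below is a stand-in for the real argonaut encoder, and PageView's fields are assumed for illustration:

    case class PageView(url: String, durationMs: Long)

    object PageViewJson {
      // Anything non-serializable built in here lives inside the object on the executor;
      // the closure only references the method, like a Java static call.
      def toJson(pv: PageView): String =
        s"""{"url":"${pv.url}","durationMs":${pv.durationMs}}"""
    }

    // inside mapWithState / the send path:
    //   sendMessage would then use PageViewJson.toJson(pageView)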

Re: SparkR Count vs Take performance

2016-03-01 Thread Sean Owen
There is an "isEmpty" method that basically does exactly what your second version does. I have seen it be unusually slow at times because it must copy 1 element to the driver, and it's possible that's slow. It still shouldn't be slow in general, and I'd be surprised if it's slower than a count in
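
For reference, the two checks being compared:

    // isEmpty is essentially a take(1) under the hood, so these are equivalent:
    val e1 = rdd.isEmpty()
    val e2 = rdd.take(1).isEmpty
    // rdd.count() == 0, by contrast, scans every partition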

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Daniel Siegmann
How many core nodes does your cluster have? On Tue, Mar 1, 2016 at 4:15 AM, Oleg Ruchovets wrote: > Hi, I have installed EMR 4.3.0 with spark. I tried to enter the spark shell but > it looks like it doesn't work and throws exceptions. > Please advise: > > [hadoop@ip-172-31-39-37

Using a non serializable third party JSON serializable on a spark worker node throws NotSerializableException

2016-03-01 Thread Yuval.Itzchakov
I have a small snippet of code which relies on argonaut for JSON serialization, which is run from a `PairRDDFunctions.mapWithState` once a session is completed. This is the code snippet (not that important): override def sendMessage(pageView: PageView): Unit = {

Re: Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
I haven’t tried spark.hadoop.validateOutputSpecs. However, it seems that has to do with the existence of the output directory itself and not the files. Maybe I’m wrong? Peter > On Mar 1, 2016, at 11:53 AM, Sabarish Sasidharan > wrote: > > Have you tried

Re: SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
Hi, FIRST ATTEMPT: Use build.sbt in IntelliJ and it was giving me nightmares with several incompatibility and library issues though the sbt version was compliant with the scala version SECOND ATTEMPT: Created a new project with no entries in build.sbt file and imported all the files in

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Cody Koeninger
You don't need an equal number of executor cores to partitions. An executor can and will work on multiple partitions within a batch, one after the other. The real issue is whether you are able to keep your processing time under your batch time, so that delay doesn't increase. On Tue, Mar 1,

SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Hello all. I have a script that create a dataframe from this operation: mytable <- sql(sqlContext,("SELECT ID_PRODUCT, ... FROM mytable")) rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe) dFrame <- join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT) After filtering

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Jatin Kumar
Thanks Cody! I understand what you said and if I am correct it will be using 224 executor cores just for fetching + stage-1 processing of 224 partitions. I will obviously need more cores for processing further stages and fetching next batch. I will start with higher number of executor cores and

I need some help making datasets with known columns from a JavaBean

2016-03-01 Thread Steve Lewis
I asked a similar question a day or so ago but this is a much more concrete example showing the difficulty I am running into. I am trying to use DataSets. I have an object which I want to encode with its fields as columns. The object is a well behaved Java Bean. However one field is an object (or

Re: Save DataFrame to Hive Table

2016-03-01 Thread Mich Talebzadeh
Do these files have the same schema? Probably yes. Can they each be read as an RDD -> converted to a DF and then registered as temporary tables, with a UNION ALL on those temporary tables? Alternatively, if these files have different names, they can be put in the same HDFS staging sub-directory and read
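
A rough sketch of the temp-table route (paths are the placeholders from the thread; assumes all files share a schema):

    val df1 = sqlContext.read.parquet("/path/to/parquets/parquet1")
    val df2 = sqlContext.read.parquet("/path/to/parquets/parquet2")
    df1.registerTempTable("t1")
    df2.registerTempTable("t2")
    // UNION ALL keeps every row from both inputs
    val all = sqlContext.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2")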

Re: Save DataFrame to Hive Table

2016-03-01 Thread Silvio Fiorito
Just do: val df = sqlContext.read.load(“/path/to/parquets/*”) If you do df.explain it’ll show the multiple input paths. From: "andres.fernan...@wellsfargo.com" > Date: Tuesday, March

What version of twitter4j should I use with Spark Streaming? UPDATING thread

2016-03-01 Thread Alonso Isidoro Roman
hi, I read this post recommending the use of twitter4j-3.0.3* and it is not working for me. I want to load these jars within spark-shell, without any luck. This is the output 1. *MacBook-Pro-Retina-de-Alonso:spark-1.6 aironman$ ls *jar* 2. mysql-connector-java-5.1.30.jar

RE: Union Parquet, DataFrame

2016-03-01 Thread Andres.Fernandez
Worked perfectly. Thanks very much Silvio. From: Silvio Fiorito [mailto:silvio.fior...@granturing.com] Sent: Tuesday, March 01, 2016 2:14 PM To: Fernandez, Andres; user@spark.apache.org Subject: Re: Union Parquet, DataFrame Just replied to your other email, but here’s the same thing: Just do:

Fwd: Starting SPARK application in cluster mode from an IDE

2016-03-01 Thread Gourav Sengupta
Hi, I will be grateful if someone could kindly respond back to this query. Thanks and Regards, Gourav Sengupta -- Forwarded message -- From: Gourav Sengupta Date: Sat, Feb 27, 2016 at 12:39 AM Subject: Starting SPARK application in cluster mode from

Re: Union Parquet, DataFrame

2016-03-01 Thread Silvio Fiorito
Just replied to your other email, but here’s the same thing: Just do: val df = sqlContext.read.load(“/path/to/parquets/*”) If you do df.explain it’ll show the multiple input paths. From: "andres.fernan...@wellsfargo.com"

SPARK SQL HiveContext Error

2016-03-01 Thread Gourav Sengupta
Hi, I am getting the error "java.lang.SecurityException: sealing violation: can't seal package org.apache.derby.impl.services.locks: already loaded" after running the following code in SCALA. I do not have any other instances of sparkContext running from my system. I will be grateful if

Union Parquet, DataFrame

2016-03-01 Thread Andres.Fernandez
Good day colleagues. Quick question on Parquet and Dataframes. Right now I have the 4 parquet files stored in HDFS under the same path: /path/to/parquets/parquet1, /path/to/parquets/parquet2, /path/to/parquets/parquet3, /path/to/parquets/parquet4… I want to perform a union on all this parquet

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-01 Thread Jules Damji
Hello Joshua, comments are inline... > On Mar 1, 2016, at 5:03 AM, Joshua Sorrell wrote: > > I haven't used Spark in the last year and a half. I am about to start a > project with a new team, and we need to decide whether to use pyspark or > Scala. Indeed, good questions,

RE: Save DataFrame to Hive Table

2016-03-01 Thread Andres.Fernandez
Good day colleagues. Quick question on Parquet and Dataframes. Right now I have the 4 parquet files stored in HDFS under the same path: /path/to/parquets/parquet1, /path/to/parquets/parquet2, /path/to/parquets/parquet3, /path/to/parquets/parquet4… I want to perform a union on all this parquet

Real time anomaly system

2016-03-01 Thread Priya Ch
Hi, I am trying to build a real-time anomaly detection system using Spark, Kafka, Cassandra and Akka. I have the network intrusion dataset (KDD Cup 1999). How can I build the system using this? I understand that a certain part of the data I am considering as historical data for my model training and

Re: Save DataFrame to Hive Table

2016-03-01 Thread Mich Talebzadeh
Hi It seems that your code is not specifying in which database your table is created Try this scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> // Choose a database scala> HiveContext.sql("show databases").show scala> HiveContext.sql("use test") // I chose test
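
Continuing that snippet, a hedged sketch of saving into the chosen database (df is assumed to be a DataFrame created from the same HiveContext; database and table names are placeholders):

    HiveContext.sql("use test")                          // select the target database
    df.write.mode("overwrite").saveAsTable("my_table")   // lands in database `test`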

Spark Submit using Convert to Marathon REST API

2016-03-01 Thread Ashish Soni
Hi All, Can someone please help me translate the below spark-submit into a Marathon JSON request? docker run -it --rm -e SPARK_MASTER="mesos://10.0.2.15:5050" -e SPARK_IMAGE="spark_driver:latest" spark_driver:latest /opt/spark/bin/spark-submit --name "PI Example" --class

performance of personalized page rank

2016-03-01 Thread Cesar Flores
I would like to know if someone can help me with the following question. I have a network of around ~5 billion edges and ~500 million nodes. I need to generate a personalized page rank for each node of my network. How many executors do you think I may need to execute this task in a reasonable amount

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-03-01 Thread Robin East
Mohammed and I both obviously have a certain bias here but I have to agree with him - the documentation is pretty good but other sources are necessary to supplement. (Good) books are a curated source of information that can short-cut a lot of the learning.

RE: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-03-01 Thread Mohammed Guller
I agree that the Spark official documentation is pretty good. However, a book also serves a useful purpose. It provides a structured roadmap for learning a new technology. Everything is nicely organized for the reader. For somebody who has just started learning Spark, the amount of material on

Re: [Proposal] Enabling time series analysis on spark metrics

2016-03-01 Thread Karan Kumar
+dev mailing list Time series analysis on metrics becomes quite useful when running spark jobs using a workflow manager like oozie. Would love to take this up if the community thinks its worthwhile. On Tue, Feb 23, 2016 at 2:59 PM, Karan Kumar wrote: > HI > > Spark

Re: Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
http://pastebin.com/vbbFzyzb The problem seems to be twofold. First, the ParquetFileWriter in Hadoop allows for an overwrite flag that Spark doesn’t allow to be set. The second is that the DirectParquetOutputCommitter has an abortTask that’s empty. I see SPARK-8413 open on this too,

Re: Spark 1.5 on Mesos

2016-03-01 Thread Ashish Soni
Not sure what the issue is, but I am getting the below error when I try to run the spark PI example: Blacklisting Mesos slave value: "5345asdasdasdkas234234asdasdasdasd" due to too many failures; is Spark installed on it? WARN TaskSchedulerImpl: Initial job has not accepted any resources; check

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-03-01 Thread Sachin Aggarwal
Hi, I used this code for the graceful shutdown of my streaming app; this may not be the best way, so correct me: sys.ShutdownHookThread { println("Gracefully stopping Spark Streaming Application") ssc.stop(true, true) println("Application stopped") } class StopContextThread(ssc: StreamingContext)

Re: Get rid of FileAlreadyExistsError

2016-03-01 Thread Ted Yu
Do you mind pastebin'ning the stack trace with the error so that we know which part of the code is under discussion ? Thanks On Tue, Mar 1, 2016 at 7:48 AM, Peter Halliday wrote: > I have a Spark application that has a Task seem to fail, but it actually > did write out some

Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
I have a Spark application where a Task seems to fail, but it actually did write out some of the files that were assigned to it. Spark then assigns that task to another executor, and it gets a FileAlreadyExistsException. The Hadoop code seems to allow for files to be overwritten, but I see the

Re: local class incompatible: stream classdesc

2016-03-01 Thread Ted Yu
RDD serialized by one release of Spark is not guaranteed to be readable by another release of Spark. Please check whether there are mixed Spark versions. FYI: http://stackoverflow.com/questions/10378855/java-io-invalidclassexception-local-class-incompatible On Tue, Mar 1, 2016 at 7:35 AM,

Re: Spark Streaming: java.lang.StackOverflowError

2016-03-01 Thread Cody Koeninger
What code is triggering the stack overflow? On Mon, Feb 29, 2016 at 11:13 PM, Vinti Maheshwari wrote: > Hi All, > > I am getting below error in spark-streaming application, i am using kafka > for input stream. When i was doing with socket, it was working fine. But > when i

local class incompatible: stream classdesc

2016-03-01 Thread Nesrine BEN MUSTAPHA
Hi, I installed a standalone spark cluster with two workers. I developed a Java Application that uses the maven dependency of spark (same version as the spark cluster). In my Spark jobs class I have only two methods, considered as two different jobs: the first one is the example of spark word

Re: Spark UI standalone "crashes" after an application finishes

2016-03-01 Thread Sumona Routh
Thanks Shixiong! To clarify for others, yes, I was speaking of the UI at port 4040, and I do have event logging enabled, so I can review jobs after the fact. We hope to upgrade our version of Spark soon, so I'll write back if that resolves it. Sumona On Mon, Feb 29, 2016 at 8:27 PM Sea

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Cody Koeninger
> "How do I keep a balance of executors which receive data from Kafka and which process data" I think you're misunderstanding how the direct stream works. The executor which receives data is also the executor which processes data, there aren't separate receivers. If it's a single stage worth of

Re: Sample sql query using pyspark

2016-03-01 Thread James Barney
Maurin, I don't know the technical reason why but: try removing the 'limit 100' part of your query. I was trying to do something similar the other week and what I found is that each executor doesn't necessarily get the same 100 rows. Joins would fail or result with a bunch of nulls when keys

Spark executor killed without apparent reason

2016-03-01 Thread Nirav Patel
Hi, We are using spark 1.5.2 on yarn. We have a spark application utilizing about 15GB executor memory and 1500 overhead. However, at a certain stage we notice higher GC time (almost the same as task time) spent. These executors are bound to get killed at some point. However, nodemanager or resource

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-03-01 Thread Ted Yu
16/03/01 01:36:31 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal): java.lang.RuntimeException: hbase-default.xml file seems to be for an older version of HBase (null), this version is 1.1.2.2.3.4.0-3485 The above was likely caused by some

Re: Suggested Method to Write to Kafka

2016-03-01 Thread Sathish Dhinakaran
http://stackoverflow.com/questions/31590592/how-to-write-to-kafka-from-spark-streaming On Tue, Mar 1, 2016 at 8:54 AM, Bryan Jeffrey wrote: > Hello. > > Is there a suggested method and/or some example code to write results from > a Spark streaming job back to Kafka? > >
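
A commonly suggested pattern, sketched here with the kafka-clients producer (broker address, topic name and message type are placeholders): open one producer per partition inside foreachRDD/foreachPartition so nothing non-serializable is shipped from the driver.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)   // one producer per partition
        records.foreach { r =>
          producer.send(new ProducerRecord[String, String]("output-topic", r.toString))
        }
        producer.close()
      }
    }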

Suggested Method to Write to Kafka

2016-03-01 Thread Bryan Jeffrey
Hello. Is there a suggested method and/or some example code to write results from a Spark streaming job back to Kafka? I'm using Scala and Spark 1.4.1. Regards, Bryan Jeffrey

Re: Spark for client

2016-03-01 Thread Todd Nist
You could also look at Apache Toree, http://toree.apache.org/ , github : https://github.com/apache/incubator-toree. This used to be the Spark Kernel from IBM but has been contributed to Apache. Good overview here on its features,

Does pyspark still lag far behind the Scala API in terms of features

2016-03-01 Thread Joshua Sorrell
I haven't used Spark in the last year and a half. I am about to start a project with a new team, and we need to decide whether to use pyspark or Scala. We are NOT a java shop. So some of the build tools/procedures will require some learning overhead if we go the Scala route. What I want to know

Dropping parquet file partitions

2016-03-01 Thread sparkuser2345
Is there a way to drop parquet file partitions through Spark? I'm partitioning a parquet file by a date field and I would like to drop old partitions in a file system agnostic manner. I guess I could read the whole parquet file into a DataFrame, filter out the dates to be dropped, and overwrite
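
One file-system-agnostic sketch is to delete the partition directories directly through the Hadoop FileSystem API (base path, partition column name and cutoff are placeholders):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val base = new Path("/data/mytable")
    fs.listStatus(base)
      .filter(_.getPath.getName.startsWith("date="))                  // partition directories
      .filter(_.getPath.getName.stripPrefix("date=") < "2016-01-01")  // older than the cutoff
      .foreach(status => fs.delete(status.getPath, true))             // recursive delete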

Re: Mllib Logistic Regression performance relative to Mahout

2016-03-01 Thread Sonal Goyal
You can also check if you are caching your input so that features are not being read/computed every iteration. Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015
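
A minimal sketch of that caching advice with MLlib's logistic regression (rawLines stands in for the input RDD[String], and the parsing step is a placeholder for however the features are actually built):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val training = rawLines.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
    }.cache()   // without cache() the parsing/feature step reruns on every iteration

    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)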

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Jeff Zhang
Checked the code again. Looks like currently the task result will be loaded into memory no matter whether it is a DirectTaskResult or an IndirectTaskResult. Previously I thought an IndirectTaskResult could be loaded into memory later, which can save memory; RDD#collectAsIterator is what I thought might save memory.

EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Oleg Ruchovets
Hi, I have installed EMR 4.3.0 with spark. I tried to enter the spark shell but it looks like it doesn't work and throws exceptions. Please advise: [hadoop@ip-172-31-39-37 conf]$ cd /usr/bin/ [hadoop@ip-172-31-39-37 bin]$ ./spark-shell OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M;

Re: Spark for client

2016-03-01 Thread Mich Talebzadeh
Thanks Mohannad. Installed Anaconda 3 that contains Jupyter. Now I want to access Spark on Scala from Jupyter. What is the easiest way of doing it without using Python! Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Reynold Xin
How big of a deal is this though? If I am reading your email correctly, either way this job will fail. You simply want it to fail earlier on the executor side, rather than collecting it and failing on the driver side? On Sunday, February 28, 2016, Jeff Zhang wrote: > data skew

Re: Converting array to DF

2016-03-01 Thread Jeff Zhang
Change Array to Seq and import sqlContext.implicits._ On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar wrote: > Hi, > > I have this > > val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), > ("f", 4), ("g", 6)) > weights.toDF("weights","value") > > I

Converting array to DF

2016-03-01 Thread Ashok Kumar
Hi, I have this val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)) weights.toDF("weights","value") I want to convert the Array to a DF but I get this: weights: Array[(String, Int)] = Array((a,3), (b,2), (c,5), (d,1), (e,9), (f,4), (g,6)) :33: error: value

Re: Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang
I may not have expressed it clearly. This method is trying to create the virtualenv before the python worker starts, and this virtualenv is application scoped; after the spark application job finishes, the virtualenv will be cleaned up. And the virtualenvs don't need to be at the same path on each node (In my POC, it is

Re: Spark for client

2016-03-01 Thread Mohannad Ali
Jupyter (http://jupyter.org/) also supports Spark, and generally it's a beast that allows you to do so much more. On Mar 1, 2016 00:25, "Mich Talebzadeh" wrote: > Thank you very much both > > Zeppelin looks promising. Basically as I understand runs an agent on a > given port

Re: Support virtualenv in PySpark

2016-03-01 Thread Mohannad Ali
Hello Jeff, Well this would also mean that you have to manage the same virtualenv (same path) on all nodes and install your packages to it the same way you would install the packages to the default python path. In any case, at the moment you can already do what you proposed by

Sample sql query using pyspark

2016-03-01 Thread Maurin Lenglart
Hi, I am trying to take a sample of a sql query in order to make the query run faster. My query looks like this: SELECT `Category` as `Category`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews` FROM groupon_dropbox WHERE `event_date` >= '2015-11-14' AND `event_date` <= '2016-02-19' GROUP

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-03-01 Thread ai he
Hi Divya, I guess the error is thrown from spark-csv. Spark-csv tries to parse string "null" to double. The workaround is to add nullValue option, like .option("nullValue", "null"). But this nullValue feature is not included in current spark-csv 1.3. Just checkout the master of spark-csv and use
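
The workaround looks roughly like this (the path is a placeholder, and it requires a spark-csv build that supports the nullValue option):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("nullValue", "null")    // treat the literal string "null" as a SQL null
      .load("/path/to/file.csv")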
