Left Join Yields Results And Not Results

2016-09-24 Thread Aaron Jackson
Hi, I'm using pyspark (1.6.2) to do a little bit of ETL and have noticed a very odd situation. I have two dataframes, base and updated. The "updated" dataframe contains a constrained subset of the data from "base" that I wish to exclude. Something like this: updated = base.where(base.X == F.lit(1000
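
One way to express "all of base except the rows in updated" in that era's DataFrame API is a left outer join followed by a null filter on the right-hand key (newer Spark versions also offer a left anti join for this). A minimal sketch along those lines, assuming X is the join key; the sample data and the X_upd alias are made up for illustration:

    # Sketch: keep only the rows of `base` whose X does not appear in `updated`.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    import pyspark.sql.functions as F

    sc = SparkContext("local[2]", "left-join-exclude")
    sqlContext = SQLContext(sc)

    base = sqlContext.createDataFrame([(1, "a"), (1000, "b"), (2, "c")], ["X", "payload"])
    updated = base.where(base.X == F.lit(1000))          # the subset to exclude

    remaining = (base
                 .join(updated.select(updated.X.alias("X_upd")),
                       base.X == F.col("X_upd"),
                       "left_outer")
                 .where(F.col("X_upd").isNull())          # keep only non-matching rows
                 .drop("X_upd"))

    remaining.show()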

spark-submit failing but job running from scala ide

2016-09-24 Thread vr spark
Hi, I have a simple Scala app which works fine when I run it as a Scala application from the Scala IDE for Eclipse. But when I export it as a jar and run it with spark-submit, I get the error below. Please suggest. *bin/spark-submit --class com.x.y.vr.spark.first.SimpleApp test.jar* 16/09/24 23:1
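
The error itself is cut off above, so the following is only a hedged checklist rather than a diagnosis: a job that runs in the IDE but fails under spark-submit usually differs in how the master and the dependencies are supplied. A sketch of a more explicit invocation (the --master value is an assumption; on a cluster you would pass the real master URL, and the exported jar must actually contain or ship its non-Spark dependencies):

    bin/spark-submit \
      --class com.x.y.vr.spark.first.SimpleApp \
      --master local[*] \
      test.jar

A related common pitfall is hard-coding .setMaster("local") in the app for IDE runs; settings made in code take precedence over spark-submit flags, so cluster submissions can then behave unexpectedly.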

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-24 Thread Timur Shenkao
Hello, everybody! Maybe it's not the cause of your problem, but I've noticed this line in your comments: *java version "1.8.0_51"* It's strongly advised to use Java 1.8.0_66+; I use Java 1.8.0_101. On Tue, Sep 20, 2016 at 1:09 AM, janardhan shetty wrote: > Yes Sujit I have tried that op

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Jörn Franke
As Cody said, Spark is not going to help you here. There are two issues you need to look at: duplicate (or more) messages being processed by two different processes, and the failure of any component (including the message broker). Keep in mind that duplicated messages can even occur

Spark 1.6.2 Concurrent append to a HDFS folder with different partition key

2016-09-24 Thread Shing Hing Man
I am trying to prototype using a single SQLContext instance and using it to append DataFrames, partitioned by a field, to the same HDFS folder from multiple threads. (Each thread works with a DataFrame having a different partition column value.) I get the exception: 16/09/24 16:45:12 ERROR [ForkJoinP
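
For reference, a minimal single-threaded sketch of the append-with-partitionBy pattern being described; the column names and output path are placeholders, and this does not reproduce the multi-threaded failure:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "partitioned-append")
    sqlContext = SQLContext(sc)

    df = sqlContext.createDataFrame([(1, "2016-09-24", "a")], ["id", "day", "payload"])

    (df.write
       .partitionBy("day")      # each thread in the original setup writes a different "day" value
       .mode("append")
       .parquet("/tmp/partitioned_output"))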

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Cody Koeninger
Spark alone isn't going to solve this problem, because you have no reliable way of making sure a given worker has a consistent shard of the messages seen so far, especially if there's an arbitrary amount of delay between duplicate messages. You need a DHT or something equivalent. On Sep 24, 2016
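
A hedged sketch of what "something equivalent" to a DHT can look like in practice: each task consults a store keyed by message id before emitting a record. Here a plain Python set stands in for that store, so it only de-duplicates within a single task over a single partition; the point of the advice above is that a real deployment needs a shared external store so the check survives across batches, workers, and restarts. The message-id field is an assumption:

    from pyspark import SparkContext

    sc = SparkContext("local[1]", "dedup-sketch")

    def drop_duplicates(records):
        seen = set()    # stand-in for a shared external key-value store / DHT
        for msg_id, payload in records:
            if msg_id not in seen:          # in production: an atomic "set if absent" call
                seen.add(msg_id)
                yield (msg_id, payload)

    batch = sc.parallelize([(1, "a"), (2, "b"), (1, "a again")], 1)
    print(batch.mapPartitions(drop_duplicates).collect())    # (1, ...) is kept only once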

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-24 Thread Yash Sharma
We have too many (large) files: about 30k partitions holding roughly 4 years' worth of data, and we need to process the entire history in a one-time monolithic job. I would like to know how Spark decides the number of executors requested. I've seen test cases where the max executors count is Integer's M
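
With dynamic allocation enabled, the driver requests executors based on the backlog of pending tasks, bounded only by spark.dynamicAllocation.maxExecutors, whose default is Integer.MAX_VALUE; tens of thousands of input splits can therefore translate into an enormous request like the one above. A hedged sketch of capping it (the value 200 is an arbitrary example, not a recommendation):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("bounded-executors")
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.shuffle.service.enabled", "true")     # required for dynamic allocation
            .set("spark.dynamicAllocation.maxExecutors", "200"))

    sc = SparkContext(conf=conf)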

Re: databricks spark-csv: linking coordinates are what?

2016-09-24 Thread Anastasios Zouzias
Hi Dan, If you use Spark <= 1.6, you can also do $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 to quickly link the spark-csv jars into the Spark shell. Otherwise, as Holden suggested, you add it to your Maven/sbt dependencies. The Spark guys assume that their users have a good
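
For completeness, a hedged sketch of reading a CSV once spark-csv is on the classpath (for example via the --packages flag above, which works the same way for bin/pyspark); the path and options are placeholders:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "csv-read")
    sqlContext = SQLContext(sc)

    df = (sqlContext.read
            .format("com.databricks.spark.csv")     # the external spark-csv data source
            .option("header", "true")
            .option("inferSchema", "true")
            .load("/data/input.csv"))

    df.printSchema()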

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

2016-09-24 Thread ayan guha
Do you have too many small files you are trying to read? The number of executors is very high. On 24 Sep 2016 10:28, "Yash Sharma" wrote: > Have been playing around with configs to crack this. Adding them here > where it would be helpful to others :) > Number of executors and timeout seemed like the
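
If the input really is a huge pile of small files, one hedged mitigation (the path, file format, and target partition count are placeholders) is to coalesce to fewer, larger partitions right after reading, so that downstream stages schedule far fewer tasks:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[2]", "coalesce-small-files")
    sqlContext = SQLContext(sc)

    df = sqlContext.read.parquet("/data/events")    # ~30k small partitions in the thread above
    df = df.coalesce(1000)                          # fewer, larger partitions -> fewer tasks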