Re: Old version of Spark [v1.2.0]

2017-01-16 Thread Jacek Laskowski
Hi Ayan, Although my first reaction was "Why would anyone ever want to download older versions" after a brief thinking about it I concluded that there might be a merit having it. Could you please file an issue in the issue tracker -> https://issues.apache.org/jira/browse/SPARK? Pozdrawiam,

filter push down on har file

2017-01-16 Thread Yang Cao
Hi, My team just do a archive on last year’s parquet files. I wonder whether the filter push down optimization still work when I read data through “har:///path/to/data/“? THX. Best, - To unsubscribe e-mail:

[Spark DataFrames/Streaming]: Bad performance with window function in streaming job

2017-01-16 Thread Julian Keppel
Hi, I use Spark 2.0.2 and want to do the following: I extract features in a streaming job and than apply the records to a k-means model. Some of the features are simple ones which are calculated directly from the record. But I also have more complex features which depend on records from a

Flume and Spark Streaming

2017-01-16 Thread Guillermo Ortiz
I'm wondering to use Flume (channel file)-Spark Streaming. I have some doubts about it: 1.The RDD size is all data what it comes in a microbatch which you have defined. Risght? 2.If there are 2Gb of data, how many are RDDs generated? just one and I have to make a repartition? 3.When is the ACK

Re: ML PIC

2017-01-16 Thread Nick Pentreath
The JIRA for this is here: https://issues.apache.org/jira/browse/SPARK-15784 There is a PR open already for it, which still needs to be reviewed. On Wed, 21 Dec 2016 at 18:01 Robert Hamilton wrote: > Thank you Nick that is good to know. > > Would this have some

Re: Flume and Spark Streaming

2017-01-16 Thread ayan guha
With Flume, what would be your sink? On Mon, Jan 16, 2017 at 10:44 PM, Guillermo Ortiz wrote: > I'm wondering to use Flume (channel file)-Spark Streaming. > > I have some doubts about it: > > 1.The RDD size is all data what it comes in a microbatch which you have >

Re: Flume and Spark Streaming

2017-01-16 Thread Guillermo Ortiz
Avro sink --> Spark Streaming 2017-01-16 13:55 GMT+01:00 ayan guha : > With Flume, what would be your sink? > > > > On Mon, Jan 16, 2017 at 10:44 PM, Guillermo Ortiz > wrote: > >> I'm wondering to use Flume (channel file)-Spark Streaming. >> >> I have

groupByKey vs mapPartitions for efficient grouping within a Partition

2017-01-16 Thread Patrick
Hi, Does groupByKey has intelligence associated with it, such that if all the keys resides in the same partition, it should not do the shuffle? Or user should write mapPartitions( scala groupBy code). Which would be more efficient and what are the memory considerations? Thanks

Importing a github project on sbt

2017-01-16 Thread marcos rebelo
Hi all, I have this project: https://github.com/oleber/aws-stepfunctions I have a second project that should import the first one. On the second project I did something like: lazy val awsStepFunctions = RootProject(uri("git://

Re: Importing a github project on sbt

2017-01-16 Thread Marco Mistroni
UhmNot a SPK issueAnyway...Had similar issues with sbt The quick sol. To get u going is to place ur dependency in your lib folder The notsoquick is to build the sbt dependency and do a sbt publish-local, or deploy local But I consider both approaches hacks. Hth On 16 Jan 2017 2:00

Re: Old version of Spark [v1.2.0]

2017-01-16 Thread Debasish Das
You may want to pull up release/1.2 branch and 1.2.0 tag to build it yourself incase the packages are not available. On Jan 15, 2017 2:55 PM, "Md. Rezaul Karim" wrote: > Hi Ayan, > > Thanks a million. > > Regards, > _ > *Md. Rezaul

Re: Spark streaming app that processes Kafka DStreams produces no output and no error

2017-01-16 Thread shyla deshpande
Hello, I checked the log file on the worker node and don't see any error there. This is the first time I am asked to run on such a small cluster. I feel its the resources issue, but it will be great help is somebody can confirm this or share your experience. Thanks On Sat, Jan 14, 2017 at 4:01

Re: Spark vs MongoDB: saving DataFrame to db raises missing database name exception

2017-01-16 Thread Palash Gupta
Hi, Example: dframe = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("spark.mongodb.input.uri", " mongodb://user:pass@172.26.7.192:27017/db_name.collection_name").load()dframe.printSchema() One more thing if you create one db in mongo, please create a collection with a

Re: Spark vs MongoDB: saving DataFrame to db raises missing database name exception

2017-01-16 Thread Marco Mistroni
Uh. Many thanksWill try it out On 17 Jan 2017 6:47 am, "Palash Gupta" wrote: > Hi Marco, > > What is the user and password you are using for mongodb connection? Did > you enable authorization? > > Better to include user & pass in mongo url. > > I remember I tested

Re: Apache Spark example split/merge shards

2017-01-16 Thread Takeshi Yamamuro
Hi, It seems you hit this issue: https://issues.apache.org/jira/browse/SPARK-18020 // maropu On Tue, Jan 17, 2017 at 11:51 AM, noppanit wrote: > I'm totally new to Spark and I'm trying to learn from the example. I'm > following this example > >

GroupBy and Spark Performance issue

2017-01-16 Thread KhajaAsmath Mohammed
Hi, I am trying to group by data in spark and find out maximum value for group of data. I have to use group by as I need to transpose based on the values. I tried repartition data by increasing number from 1 to 1.Job gets run till the below stage and it takes long time to move ahead. I was

Apache Spark example split/merge shards

2017-01-16 Thread noppanit
I'm totally new to Spark and I'm trying to learn from the example. I'm following this example https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala. It works well. But I do have one question. Every time I

Re: About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
Hello Spark Folks, Other weird experience i have with Spark with SqlContext is when i created Dataframe sometime this error throws exception and sometime not ! scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> val stdDf =

Re: partition size inherited from parent: auto coalesce

2017-01-16 Thread Takeshi Yamamuro
Hi, The coalesce does not automatically happen now and you need to control the number for yourself. Basically, #partitions respect a `spark.default.parallelism` number, by default, #cores for your computer. http://spark.apache.org/docs/latest/configuration.html#execution-behavior // maropu On

partition size inherited from parent: auto coalesce

2017-01-16 Thread Suzen, Mehmet
Hello List, I was wondering what is the design principle that partition size of an RDD is inherited from the parent. See one simple example below [*]. 'ngauss_rdd2' has significantly less data, intuitively in such cases, shouldn't spark invoke coalesce automatically for performance? What would

Re: Spark vs MongoDB: saving DataFrame to db raises missing database name exception

2017-01-16 Thread Palash Gupta
Hi Marco, What is the user and password you are using for mongodb connection? Did you enable authorization? Better to include user & pass in mongo url. I remember I tested with python successfully. Best Regards,Palash Sent from Yahoo Mail on Android On Tue, 17 Jan, 2017 at 5:37 am, Marco

ScalaReflectionException (class not found) error for user class in spark 2.1.0

2017-01-16 Thread Koert Kuipers
i am experiencing a ScalaReflectionException exception when doing an aggregation on a spark-sql DataFrame. the error looks like this: Exception in thread "main" scala.ScalaReflectionException: class in JavaMirror with sun.misc.Launcher$AppClassLoader@28d93b30 of type class

Weird experience Hive with Spark Transformations

2017-01-16 Thread Chetan Khatri
Hello, I have following services are configured and installed successfully: Hadoop 2.7.x Spark 2.0.x HBase 1.2.4 Hive 1.2.1 *Installation Directories:* /usr/local/hadoop /usr/local/spark /usr/local/hbase *Hive Environment variables:* #HIVE VARIABLES START export HIVE_HOME=/usr/local/hive

Re: Running Spark on EMR

2017-01-16 Thread Everett Anderson
On Sun, Jan 15, 2017 at 11:09 AM, Andrew Holway < andrew.hol...@otternetworks.de> wrote: > use yarn :) > > "spark-submit --master yarn" > Doesn't this require first copying out various Hadoop configuration XML files from the EMR master node to the machine running the spark-submit? Or is there a

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
Hi Pradeep, That is a good idea. My customized RDDs are similar to the NewHadoopRDD. If we have billions of InputSplit, will it be bottlenecked for the performance? That is, will too many data need to be transferred from master node to computing nodes by networking? Thanks, Fei On Mon, Jan 16,

filter rows by all columns

2017-01-16 Thread Shawn Wan
I need to filter out outliers from a dataframe by all columns. I can manually list all columns like: df.filter(x=>math.abs(x.get(0).toString().toDouble-means(0))<=3*stddevs(0)) .filter(x=>math.abs(x.get(1).toString().toDouble-means(1))<=3*stddevs(1 )) ... But I want to turn it into a

Metadata is not propagating with Dataset.map()

2017-01-16 Thread tovbinm
Hello, It seems that metadata is not propagating when using Dataset.map(). Is there a workaround? Below are the steps to reproduce: import spark.implicits._ val columnName = "col1" val meta = new MetadataBuilder().putString("foo", "bar").build() val schema =

mapPartition iterator

2017-01-16 Thread AnilKumar B
Hi For my use case, I need to call a third party function(which is in memory based) for each complete partition data. So I am partitioning RDD logically using repartition on index column and applying function f on mapPartitions(f). When, I iterate through mapPartition iterator. Can, I assume

Spark vs MongoDB: saving DataFrame to db raises missing database name exception

2017-01-16 Thread Marco Mistroni
hi all i have the folllowign snippet which loads a dataframe from a csv file and tries to save it to mongodb. For some reason, the MongoSpark.save method raises the following exception Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the

Metadata is not propagating with Dataset.map()

2017-01-16 Thread Matthew Tovbin
Hello, It seems that metadata is not propagating when using Dataset.map(). Is there a workaround? Below are the steps to reproduce: import spark.implicits._ val columnName = "col1" val meta = new MetadataBuilder().putString("foo", "bar").build() val schema =

Spark 2.0 vs MongoDb /Cannot find dependency using sbt

2017-01-16 Thread Marco Mistroni
HI all in searching on how to use Spark 2.0 with mongo i came across this link https://jira.mongodb.org/browse/SPARK-20 i amended my build.sbt (content below), however the mongodb dependency was not found Could anyone assist? kr marco name := "SparkExamples" version := "1.0" scalaVersion :=

Re: Spark 2.0 vs MongoDb /Cannot find dependency using sbt

2017-01-16 Thread Marco Mistroni
sorry. should have done more research before jumping to the list the version of the connector is 2.0.0, available from maven repors sorry On Mon, Jan 16, 2017 at 9:32 PM, Marco Mistroni wrote: > HI all > in searching on how to use Spark 2.0 with mongo i came across this

partition size inherited from parent: auto coalesce

2017-01-16 Thread Suzen, Mehmet
Hello List, I was wondering what is the design principle that partition size of an RDD is inherited from the parent. See one simple example below [*]. 'ngauss_rdd2' has significantly less data, intuitively in such cases, shouldn't spark invoke coalesce automatically for performance? What would

Re: groupByKey vs mapPartitions for efficient grouping within a Partition

2017-01-16 Thread Andy Dang
groupByKey() is a wide dependency and will cause a full shuffle. It's advised against using this transformation unless you keys are balanced (well-distributed) and you need a full shuffle. Otherwise, what you want is aggregateByKey() or reduceByKey() (depending on the output). These actions are

Re: filter rows by all columns

2017-01-16 Thread Hyukjin Kwon
Hi Shawn, Could we do this as below? for any of true scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b") df: org.apache.spark.sql.DataFrame = [a: bigint, b: double] scala> df.filter(_.toSeq.exists(v => v == 1)).show() +---+---+ | a| b| +---+---+ | 1|0.5| | 2|1.0|

how to use newAPIHadoopFile

2017-01-16 Thread lk_spark
hi,all I have a test with spark 2.0: I have a test file: field delimiter with \t kevin 30 2016 shen 30 2016 kai 33 2016 wei 30 2016 after useing: var datas: RDD[(LongWritable, String)] = sc.newAPIHadoopFile(inputPath+filename, classOf[TextInputFormat], classOf[LongWritable],