find the matching and get the value

2016-03-22 Thread Divya Gehlot
Hi, I am using Spark 1.5.2. My requirement is as below: df.withColumn("NoOfDays", lit(datediff(df("Start_date"), df("end_date" Now I have to add one more column where my datediff(Start_date, end_date) should match the map keys. The map looks like MyMap(1->1D, 2->2D, 3->3M, 4->4W). I want to do
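
A minimal sketch (not from the thread) of one way to do this: compute the day difference with datediff and map it onto the labels with a small UDF. The column names and map values come from the question; the "NA" default and everything else are assumptions.

```scala
// Sketch only: map the datediff result onto the labels in MyMap via a UDF.
// df is the DataFrame from the question; the "NA" fallback is an assumption.
import org.apache.spark.sql.functions.{datediff, udf}

val myMap = Map(1 -> "1D", 2 -> "2D", 3 -> "3M", 4 -> "4W")
val lookup = udf((days: Int) => myMap.getOrElse(days, "NA"))

val withDays = df.withColumn("NoOfDays", datediff(df("Start_date"), df("end_date")))
val result   = withDays.withColumn("Label", lookup(withDays("NoOfDays")))
```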

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Thamme Gowda N.
Hi Jeff, Yes, you are absolutely right. It is because of the RecordReader reusing the Writable Instance. I did not anticipate this as it worked for text files. Thank you so much for doing this. Your answer is accepted! Best, Thamme -- *Thamme Gowda N. * Grad Student at usc.edu Twitter:

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
Zhan's reply on stackoverflow is correct. down vote Please refer to the comments in sequenceFile. /** Get an RDD for a Hadoop SequenceFile with given key and value types. * * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each * record, directly caching

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
I think I got the root cause: you can use Text.toString() to solve this issue. Because the Text object is shared, the last record is displayed multiple times. On Wed, Mar 23, 2016 at 11:37 AM, Jeff Zhang wrote: > Looks like a spark bug. I can reproduce it for sequence file, but it
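
A minimal sketch of the fix Jeff describes, assuming Text keys and values: materialize the reused Writable contents (e.g. via toString) before caching, so each cached record keeps its own copy. The path and types are placeholders.

```scala
// Sketch of the fix discussed above: copy the contents of the reused Writable
// (here via toString) before caching; the path and Text/Text types are assumptions.
import org.apache.hadoop.io.Text

val rdd = sc.sequenceFile("/path/to/seqfile", classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) } // materialize before caching
  .cache()
```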

Re: [Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Jeff Zhang
Looks like a spark bug. I can reproduce it for a sequence file, but it works for a text file. On Wed, Mar 23, 2016 at 10:56 AM, Thamme Gowda N. wrote: > Hi spark experts, > > I am facing issues with cached RDDs. I noticed that a few entries > get duplicated n times when the RDD

Does anyone install netlib-java on AWS EMR Spark?

2016-03-22 Thread greg huang
Hi All, I want to enable the netlib-java feature for the Spark ML module on AWS EMR. But the EMR cluster comes with Spark installed by default, unless I install it myself and configure the whole cluster. Does anyone have an idea how to enable netlib-java on the standard EMR Spark cluster?

Re: Re: Is there a way to submit spark job to YARN by REST API?

2016-03-22 Thread Saisai Shao
My question is: is this ACL control provided by Yarn, or do you have an in-house facility to handle it? If you're referring to ContainerLaunchContext#setApplicationACLs, I think current Spark on Yarn doesn't support this. At the feature level, this is doable in the current yarn/client code, no

Re: DataFrame vs RDD

2016-03-22 Thread Vinay Kashyap
As mentioned earlier, since a DataFrame is associated with a schema, it makes sense that it is created from sqlContext. So your statement holds true with that understanding. On Wed, Mar 23, 2016 at 8:28 AM asethia wrote: > creating RDD is done via spark context where as creating

Re: DataFrame vs RDD

2016-03-22 Thread asethia
creating an RDD is done via the SparkContext, whereas creating a DataFrame is done from the SQLContext... so DataFrame is part of Spark SQL whereas RDD is Spark core -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-vs-RDD-tp26570p26573.html Sent from the Apache
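
A small illustration of the distinction above, using the Spark 1.x entry points; the input paths are placeholders.

```scala
// RDDs come from the SparkContext (Spark core); DataFrames from the SQLContext (Spark SQL).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("rdd-vs-dataframe")
val sc = new SparkContext(conf)               // Spark core entry point
val sqlContext = new SQLContext(sc)           // Spark SQL entry point

val rdd = sc.textFile("/data/input.txt")            // RDD: no schema attached
val df  = sqlContext.read.json("/data/input.json")  // DataFrame: schema inferred
```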

[Critical] Issue with cached RDDs created from hadoop sequence files

2016-03-22 Thread Thamme Gowda N.
Hi spark experts, I am facing issues with cached RDDs. I noticed that a few entries get duplicated n times when the RDD is cached. I asked a question on Stackoverflow with my code snippet to reproduce it. I would really appreciate it if you can visit http://stackoverflow.com/q/36168827/1506477 and

Re: DataFrame vs RDD

2016-03-22 Thread Arun Sethia
Thanks Vinay. Is it fair to say that creating an RDD and creating a DataFrame from Cassandra uses Spark SQL, with the help of the Spark-Cassandra Connector API? On Tue, Mar 22, 2016 at 9:32 PM, Vinay Kashyap wrote: > DataFrame is when there is a schema associated with your RDD.. > For any of

Re: sliding Top N window

2016-03-22 Thread Jatin Kumar
Lars, can you please point to an example? On Mar 23, 2016 2:16 AM, "Lars Albertsson" wrote: > @Jatin, I touched that case briefly in the linked presentation. > > You will have to decide on a time slot size, and then aggregate slots > to form windows. E.g. if you select a time

Re: DataFrame vs RDD

2016-03-22 Thread Vinay Kashyap
A DataFrame is when there is a schema associated with your RDD. If you have a defined schema for your transformations on the data, then it is always advised to use DataFrame, as there are efficient supporting APIs for it. It is neatly explained in the official docs. Thanks and regards

Re: DataFrame vs RDD

2016-03-22 Thread Jeff Zhang
Please check the offical doc http://spark.apache.org/docs/latest/ On Wed, Mar 23, 2016 at 10:08 AM, asethia wrote: > Hi, > > I am new to Spark, would like to know any guidelines when to use Data Frame > vs. RDD. > > Thanks, > As > > > > > > -- > View this message in

DataFrame vs RDD

2016-03-22 Thread asethia
Hi, I am new to Spark, would like to know any guidelines when to use Data Frame vs. RDD. Thanks, As -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-vs-RDD-tp26570.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

MLlib for Spark

2016-03-22 Thread Abid Muslim Malik
Dear all, Is there any library/framework for Spark that uses accelerators for machine learning algorithms? Thanks, -- Abid M. Malik ** "I have learned silence from the talkative, toleration from the intolerant, and kindness from the

Re: pyspark sql convert long to timestamp?

2016-03-22 Thread Andy Davidson
Thanks, createdAt is a long. from_unixtime(createdAt / 1000, 'yyyy-MM-dd HH:mm:ss z') as fromUnix worked. From: Akhil Das Date: Monday, March 21, 2016 at 11:56 PM To: Andrew Davidson Cc: "user @spark"

Cached Parquet file paths problem

2016-03-22 Thread Piotr Smoliński
Hi, After migrating from Spark 1.5.2 to 1.6.1 I faced a strange issue. I have a Parquet directory with partitions. Each partition (month) is the subject of an incremental ETL that takes the current Avro files and replaces the corresponding Parquet files. Now there is a problem that appeared in 1.6.x: I

Using Spark to retrieve a HDFS file protected by Kerberos

2016-03-22 Thread Nkechi Achara
I am having issues setting up my spark environment to read from a kerberized HDFS file location. At the moment I have tried to do the following: def ugiDoAs[T](ugi: Option[UserGroupInformation])(code: => T) = ugi match { case None => code case Some(u) => u.doAs(new
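
A hedged sketch completing the truncated helper above with the standard Hadoop UserGroupInformation API; the principal, keytab path and HDFS path are placeholders, and sc is an existing SparkContext.

```scala
// Completion sketch of ugiDoAs: run the block inside doAs when a UGI is supplied.
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def ugiDoAs[T](ugi: Option[UserGroupInformation])(code: => T): T = ugi match {
  case None    => code
  case Some(u) => u.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = code
  })
}

// Example: obtain a UGI from a keytab and read the protected file inside doAs.
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "user@EXAMPLE.COM", "/path/to/user.keytab")
val count = ugiDoAs(Some(ugi)) { sc.textFile("hdfs://namenode/secure/data").count() }
```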

Re: Mapping csv columns and reformatting date string

2016-03-22 Thread Mich Talebzadeh
I was just being too lazy; it should be defined as a custom UDF: def ChangeDate(word : String) : String = { return word.substring(6,10)+"-"+word.substring(3,5)+"-"+word.substring(0,2) } Register it as a custom UDF: sqlContext.udf.register("ChangeDate", ChangeDate(_:String)) And use it in mapping: scala>
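
A sketch of applying the registered UDF end to end; the column name Payment_date is hypothetical and used only for illustration.

```scala
// Hypothetical usage of the UDF above against the CSV load from the thread.
def ChangeDate(word: String): String =
  word.substring(6, 10) + "-" + word.substring(3, 5) + "-" + word.substring(0, 2)

sqlContext.udf.register("ChangeDate", ChangeDate(_: String))

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true").option("header", "true")
  .load("/data/stg/table2")

df.registerTempTable("table2")
// "Payment_date" is a hypothetical column name.
sqlContext.sql("SELECT ChangeDate(Payment_date) AS iso_date FROM table2").show()
```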

Mapping csv columns and reformatting date string

2016-03-22 Thread Mich Talebzadeh
Hi, I have the following CSV load val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2") I have defined this UDF def ChangeDate(word : String) : String = { return

Re: sliding Top N window

2016-03-22 Thread Lars Albertsson
@Jatin, I touched that case briefly in the linked presentation. You will have to decide on a time slot size, and then aggregate slots to form windows. E.g. if you select a time slot of an hour, you build a CMS and a heavy hitter list for the current hour slot, and start new ones at 00 minutes. In
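
A simplified sketch of the slot-and-aggregate idea above: exact per-slot counts stand in for the CMS (which would replace them at scale), then slots are merged to form a window and the global top N taken. The record type, hourly slot size and field names are assumptions.

```scala
// Count per (slot, product), then merge slots covering a window and take top N.
import org.apache.spark.rdd.RDD

case class View(productId: String, timestampMs: Long)

def slotOf(tsMs: Long, slotMs: Long = 3600 * 1000L): Long = tsMs / slotMs

// Exact counts per (slot, product); a CMS per slot would replace this at scale.
def slotCounts(views: RDD[View]): RDD[((Long, String), Long)] =
  views.map(v => ((slotOf(v.timestampMs), v.productId), 1L)).reduceByKey(_ + _)

// Merge the slots covering [fromSlot, toSlot] and take the top N products.
def topNInWindow(counts: RDD[((Long, String), Long)],
                 fromSlot: Long, toSlot: Long, n: Int): Array[(String, Long)] =
  counts
    .filter { case ((slot, _), _) => slot >= fromSlot && slot <= toSlot }
    .map { case ((_, product), c) => (product, c) }
    .reduceByKey(_ + _)
    .top(n)(Ordering.by[(String, Long), Long](_._2))
```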

Re: Spark Metrics Framework?

2016-03-22 Thread Silvio Fiorito
Hi Mike, It’s been a while since I worked on a custom Source but I think all you need to do is make your Source in the org.apache.spark package. Thanks, Silvio From: Mike Sukmanowsky > Date: Tuesday, March 22, 2016 at 3:13 PM To:

Re: Spark Metrics Framework?

2016-03-22 Thread Ted Yu
See related thread: http://search-hadoop.com/m/q3RTtuwg442GBwKh On Tue, Mar 22, 2016 at 12:13 PM, Mike Sukmanowsky < mike.sukmanow...@gmail.com> wrote: > The Source class is private >

Re: Spark Metrics Framework?

2016-03-22 Thread Mike Sukmanowsky
The Source class is private to the spark package and any new Sources added to the metrics registry must be of type Source

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread Sebastian Piu
As you said, create a folder for each different minute; you can also use rdd.time as the timestamp. You might also want to have a look at the window function for the batching. On Tue, 22 Mar 2016, 17:43 vetal king, wrote: > Hi Cody, > > Thanks for your reply. > > Five
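
A sketch of that suggestion: derive the output folder from the batch time, truncated to the minute. It assumes a one-minute batch interval so that each batch writes exactly one new folder; the output root is a placeholder.

```scala
// Write each batch to a folder named after its batch time, rounded to the minute.
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

def writePerMinute(stream: DStream[String], outputRoot: String): Unit =
  stream.foreachRDD { (rdd, time: Time) =>
    val minute = new SimpleDateFormat("yyyy-MM-dd_HH-mm")
      .format(new Date(time.milliseconds - (time.milliseconds % 60000)))
    if (!rdd.isEmpty()) rdd.saveAsTextFile(s"$outputRoot/$minute")
  }
```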

Expose spark pre-computed data via thrift server

2016-03-22 Thread rjtokenring
Hi, I'd like to submit a possible use case and have some guidance on the overall architecture. I have 2 different datasources (a relational PostgreSQL and a Cassandra cluster) and I'd like to provide to user the ability to query data 'joining' the 2 worlds. So, an idea that comes to my mind is:

Additional classpaths / java options

2016-03-22 Thread Ashic Mahtab
Hello, Is it possible to specify additional class paths / java options "in addition to" those specified in spark-defaults.conf? I see that if I specify spark.executor.extraJavaOptions or spark.executor.extraClassPath in defaults, and then specify --conf

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread vetal king
Hi Cody, Thanks for your reply. Five-second batches and a one-minute publishing interval are just a representative example. What we want is to group data over a certain frequency. That frequency is configurable. One way we think it can be achieved is that a "directory" will be created per this frequency,

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread Cody Koeninger
If you want 1 minute granularity, why not use a 1 minute batch time? Also, HDFS is not a great match for this kind of thing, because of the small files issue. On Tue, Mar 22, 2016 at 12:26 PM, vetal king wrote: > We are using Spark 1.4 for Spark Streaming. Kafka is data

Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread vetal king
We are using Spark 1.4 for Spark Streaming. Kafka is data source for the Spark Stream. Records are published on Kafka every second. Our requirement is to store records published on Kafka in a single folder per minute. The stream will read records every five seconds. For instance records published

Re: Direct Kafka input stream and window(…) function

2016-03-22 Thread Cody Koeninger
I definitely have direct stream jobs that use window() without problems... Can you post a minimal code example that reproduces the problem? Using print() will confuse the issue, since print() will try to only use the first partition. Use foreachRDD { rdd => rdd.foreach(println) or something
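
A minimal sketch of the pattern Cody describes: a direct Kafka stream, a window(), and output via foreachRDD rather than print(). The broker and topic are placeholders; window and slide durations are chosen as multiples of the batch interval.

```scala
// Direct Kafka stream + window(), with output via foreachRDD instead of print().
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("direct-stream-window")
val ssc = new StreamingContext(sparkConf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.map(_._2)
  .window(Seconds(60), Seconds(10))             // both multiples of the 5s batch
  .foreachRDD { rdd => rdd.foreach(println) }   // avoid print(), per the advice above

ssc.start()
ssc.awaitTermination()
```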

RE: Rename Several Aggregated Columns

2016-03-22 Thread Andres.Fernandez
Thank you! Yes that's the way to go taking care of selecting them in the proper order first. Added a select before the toDF and it does the trick. From: Sunitha Kambhampati [mailto:skambha...@gmail.com] Sent: Friday, March 18, 2016 5:46 PM To: Fernandez, Andres Cc: user@spark.apache.org Subject:

Re: Handling Missing Values in MLLIB Decision Tree

2016-03-22 Thread Joseph Bradley
It does not currently handle surrogate splits. You will need to preprocess your data to remove or fill in missing values. I'd recommend using the DataFrame API for that since it comes with a number of na methods. Joseph On Thu, Mar 17, 2016 at 9:51 PM, Abir Chakraborty
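
For reference, a tiny sketch of the DataFrame na methods Joseph mentions; the column names are assumptions.

```scala
// Fill (or drop) missing values before training; column names are placeholders.
import org.apache.spark.sql.DataFrame

def fillMissing(df: DataFrame): DataFrame =
  df.na.fill(Map("age" -> 0.0, "income" -> 0.0)) // or: df.na.drop(Seq("age", "income"))
```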

Direct Kafka input stream and window(…) function

2016-03-22 Thread Martin Soch
Hi all, I am using direct-Kafka-input-stream in my Spark app. When I use window(...) function in the chain it will cause the processing pipeline to stop - when I open the Spark-UI I can see that the streaming batches are being queued and the pipeline reports to process one of the first

Re: new object store driver for Spark

2016-03-22 Thread Benjamin Kim
Hi Gil, Currently, our company uses S3 heavily for data storage. Can you further explain the benefits of this in relation to S3 when the pending patch does come out? Also, I have heard of Swift from others. Can you explain to me the pros and cons of Swift compared to HDFS? It can be just a

Re: Re: Is there a way to submit spark job to YARN by REST API?

2016-03-22 Thread tony....@tendcloud.com
Hi Saisai, Thanks a lot for your reply. We want a way to programmatically control which user submits Spark jobs, so that we can enforce security controls for data safety. So is there any good way to do that? Zhitao Yan (Tony), Beijing TalkingData Technology Co., Ltd.

Re: Spark schema evolution

2016-03-22 Thread Chris Miller
With Avro you solve this by using a default value for the new field... maybe Parquet is the same? -- Chris Miller On Tue, Mar 22, 2016 at 9:34 PM, gtinside wrote: > Hi , > > I have a table sourced from* 2 parquet files* with few extra columns in one > of the parquet file.

Re: Serialization issue with Spark

2016-03-22 Thread Ted Yu
Can you show code snippet and the exception for 'Task is not serializable' ? Please see related JIRA: SPARK-10251 whose pull request contains code for registering classes with Kryo. Cheers On Tue, Mar 22, 2016 at 7:00 AM, Hafsa Asif wrote: > Hello, > I am facing

Re: ALS setIntermediateRDDStorageLevel

2016-03-22 Thread Sean Owen
https://github.com/apache/spark/blob/branch-1.6/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L185 ? On Tue, Mar 22, 2016 at 2:53 PM, Roberto Pagliari wrote: > I have and it's under class ALS private > > On 22/03/2016 10:58, "Sean Owen"

Serialization issue with Spark

2016-03-22 Thread Hafsa Asif
Hello, I am facing a Spark serialization issue in Spark (1.4.1 - Java Client) with the Spring Framework. It is known that Spark needs serialization and requires every class to implement java.io.Serializable. But, in the documentation link:

Re: ALS setIntermediateRDDStorageLevel

2016-03-22 Thread Roberto Pagliari
I have and it's under class ALS private On 22/03/2016 10:58, "Sean Owen" wrote: >No, it's been there since 1.1 and still is there: >setIntermediateRDDStorageLevel. Double-check your code. > >On Mon, Mar 21, 2016 at 10:09 PM, Roberto Pagliari >

new object store driver for Spark

2016-03-22 Thread Gil Vernik
We recently released an object store connector for Spark. https://github.com/SparkTC/stocator Currently this connector contains driver for the Swift based object store ( like SoftLayer or any other Swift cluster ), but it can easily support additional object stores. There is a pending patch to

Re: Issue while applying filters/conditions in DataFrame in Spark

2016-03-22 Thread Hafsa Asif
springBootVersion = '1.2.8.RELEASE' springDIVersion = '0.5.4.RELEASE' thriftGradleVersion = '0.3.1' Other Gradle configs: compile "org.apache.thrift:libthrift:0.9.3" compile 'org.slf4j:slf4j-api:1.7.14' compile 'org.apache.kafka:kafka_2.11:0.9.0.0' compile

Spark schema evolution

2016-03-22 Thread gtinside
Hi, I have a table sourced from *2 parquet files* with a few extra columns in one of the parquet files. Simple queries work fine, but queries with a predicate on the extra column don't work and I get "column not found". *Column resp_party_type exists in just one parquet file* a) Query that works:

Re: Issue while applying filters/conditions in DataFrame in Spark

2016-03-22 Thread Ted Yu
The NullPointerEx came from Spring. Which version of Spring do you use ? Thanks > On Mar 22, 2016, at 6:08 AM, Hafsa Asif wrote: > > yes I know it is because of NullPointerEception, but could not understand > why? > The complete stack trace is : > [2016-03-22

Re: Issue while applying filters/conditions in DataFrame in Spark

2016-03-22 Thread Hafsa Asif
Even if I'm using this query, it also gives a NullPointerException: "SELECT clientId FROM activePush" -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-wihle-applying-filters-conditions-in-DataFrame-in-Spark-tp26560p26562.html Sent from the Apache Spark

Re: Issue while applying filters/conditions in DataFrame in Spark

2016-03-22 Thread Hafsa Asif
Yes, I know it is because of a NullPointerException, but I could not understand why. The complete stack trace is: [2016-03-22 13:40:14.894] boot - 10493 WARN [main] --- AnnotationConfigApplicationContext: Exception encountered during context initialization - cancelling refresh attempt:

Re: Issue while applying filters/conditions in DataFrame in Spark

2016-03-22 Thread Ted Yu
bq. Caused by: java.lang.NullPointerException Can you show the remaining stack trace ? Thanks On Tue, Mar 22, 2016 at 5:43 AM, Hafsa Asif wrote: > Hello everyone, > I am trying to get benefits of DataFrames (to perform all SQL BASED > operations like 'Where Clause',

Re: sliding Top N window

2016-03-22 Thread Jatin Kumar
I am sorry, the signature of compare() is different. It should be: implicit val order = new scala.Ordering[(String, Long)] { override def compare(a1: (String, Long), a2: (String, Long)): Int = { a1._2.compareTo(a2._2) } } -- Thanks Jatin On Tue, Mar 22, 2016 at 5:53 PM,

Issue while applying filters/conditions in DataFrame in Spark

2016-03-22 Thread Hafsa Asif
Hello everyone, I am trying to get the benefits of DataFrames (to perform all SQL-based operations like 'where' clauses, joining etc.) as mentioned in https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html. I am using Aerospike and the Spark (1.4.1) Java client in Spring

Re: sliding Top N window

2016-03-22 Thread Jatin Kumar
Hello Yakubovich, I have been looking into a similar problem. @Lars please note that he wants to maintain the top N products over a sliding window, whereas the CountMinSketch algorithm is useful if we want to maintain a global top N products list. Please correct me if I am wrong here. I tried using

Re: Work out date column in CSV more than 6 months old (datediff or something)

2016-03-22 Thread James Hammerton
On 22 March 2016 at 10:57, Mich Talebzadeh wrote: > Thanks Silvio. > > The problem I have is that somehow string comparison does not work. > > Case in point > > val df = > sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", >

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-22 Thread Manivannan Selvadurai
I should have phrased it differently: the Avro schema has additional properties like 'required' etc. Right now the JSON data that I have gets stored as optional fields in the Parquet file. Is there a way to model the Parquet file schema close to the Avro schema? I tried using the

Re: ALS setIntermediateRDDStorageLevel

2016-03-22 Thread Sean Owen
No, it's been there since 1.1 and still is there: setIntermediateRDDStorageLevel. Double-check your code. On Mon, Mar 21, 2016 at 10:09 PM, Roberto Pagliari wrote: > According to this thread > >
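
A sketch of calling the setter Sean points to, which (per the linked source) lives on the org.apache.spark.mllib.recommendation.ALS builder class rather than the static ALS.train helpers; sc is an existing SparkContext and the ratings and parameters are placeholders.

```scala
// Build ALS explicitly so the intermediate RDD storage level can be set.
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 3.0)))

val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
  .run(ratings)
```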

Re: Work out date column in CSV more than 6 months old (datediff or something)

2016-03-22 Thread Mich Talebzadeh
Thanks Silvio. The problem I have is that somehow string comparison does not work. Case in point val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2") val current_date = sqlContext.sql("SELECT
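
A hedged sketch of one way to avoid string comparison entirely: parse the column into a proper date and compare it against current_date() shifted back six months (Spark 1.5+ functions). The column name "Payment date" and its dd/MM/yyyy format are assumptions; df is the CSV DataFrame loaded above.

```scala
// Convert the string column to a date, then filter rows older than six months.
import org.apache.spark.sql.functions.{add_months, current_date, to_date, unix_timestamp}

val withDate = df.withColumn(
  "payment_dt",
  to_date(unix_timestamp(df("Payment date"), "dd/MM/yyyy").cast("timestamp")))

val olderThan6Months =
  withDate.filter(withDate("payment_dt") < add_months(current_date(), -6))
```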

Re: Find all invoices more than 6 months from csv file

2016-03-22 Thread James Hammerton
On 21 March 2016 at 17:57, Mich Talebzadeh wrote: > > Hi, > > For test purposes I am ready a simple csv file as follows: > > val df = > sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", > "true").option("header", "true").load("/data/stg/table2")

trying to implement mini-batch GD in pyspark

2016-03-22 Thread sethirot
Hello, I want to be able to update an existing model with new data without the need to do a batch GD again on all the data. I would rather use the native mllib functions and avoid the streaming module. The way I thought about doing this is to use the *initialWeights* input argument to load my previous

Re: Is there a way to submit spark job to YARN by REST API?

2016-03-22 Thread Saisai Shao
I'm afraid submitting an application through the Yarn REST API is not currently supported by Spark. However, Yarn AMRMClient is functionally equal to the REST API; I'm not sure which specific features you are referring to. Thanks Saisai On Tue, Mar 22, 2016 at 5:27 PM, tony@tendcloud.com <

Add org.apache.spark.mllib model .predict() method to models in org.apache.spark.ml?

2016-03-22 Thread James Hammerton
Hi, The machine learning models in org.apache.spark.mllib have a .predict() method that can be applied to a Vector to return a prediction. However this method does not appear on the new models on org.apache.spark.ml and you have to wrap up a Vector in a DataFrame to send a prediction in. This
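
A sketch of the wrapping James describes: put a single Vector into a one-row DataFrame with a "features" column and read the prediction back from transform(). The model is any already-fitted spark.ml Transformer; the feature values are placeholders.

```scala
// Predict for a single Vector by wrapping it in a one-row DataFrame.
import org.apache.spark.ml.Transformer
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.SQLContext

def predictOne(model: Transformer, features: Vector, sqlContext: SQLContext): Double = {
  val single = sqlContext.createDataFrame(Seq(Tuple1(features))).toDF("features")
  model.transform(single).select("prediction").head().getDouble(0)
}

// Example: predictOne(fittedModel, Vectors.dense(0.5, 1.2, -0.3), sqlContext)
```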

Re: Issues facing while Running Spark Streaming Job in YARN cluster mode

2016-03-22 Thread Saisai Shao
I guess in local mode you're using the local FS instead of HDFS; here the exception was mainly thrown from HDFS when running on Yarn. I think it would be better to check the status and configuration of HDFS to see whether it is normal or not. Thanks Saisai On Tue, Mar 22, 2016 at 5:46 PM, Soni spark

Re: [Streaming] Difference between windowed stream and stream with large batch size?

2016-03-22 Thread Jatin Kumar
Hello all, I am also looking for the answer of the same. Can someone please answer on the pros and cons of using a larger batch size or putting a window operation on smaller batch size? -- Thanks Jatin On Wed, Mar 16, 2016 at 2:30 PM, Hao Ren wrote: > Any ideas ? > > Feel

Issues facing while Running Spark Streaming Job in YARN cluster mode

2016-03-22 Thread Soni spark
Hi, I am able to run a Spark streaming job in local mode; when I try to run the same job on my YARN cluster, it throws errors. Any help is appreciated in this regard. Here are my exception logs: Exception 1: java.net.SocketTimeoutException: 48 millis timeout while waiting for channel to

Re: Spark 1.6.0 op CDH 5.6.0

2016-03-22 Thread Sebastian YEPES FERNANDEZ
Hello Michel, I had a similar issue when running my custom-built Spark 1.6.1; in the end I resolved the issue by building Spark and my jar with the CDH built-in JVM: export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera/ Regards and hope this helps, Sebastian On Tue, Mar 22, 2016 at 10:13 AM, Michel

Is there a way to submit spark job to YARN by REST API?

2016-03-22 Thread tony....@tendcloud.com
Hi All, We are trying to build a data processing workflow which will call different Spark jobs, and we are using YARN. Because we want to constrain ACLs for those Spark jobs, we need to submit them via the Yarn REST API (through which we can pass application ACLs as parameters). So is there any

Spark 1.6.0 op CDH 5.6.0

2016-03-22 Thread Michel Hubert
Hi, I'm trying to run a Spark 1.6.0 application on a CDH 5.6.0 cluster. How do I submit the uber-jar so it's totally self-reliant? With kind regards, Mitchel spark-submit --class TEST --master yarn-cluster ./uber-TEST-1.0-SNAPSHOT.jar Spark 1.6.1 Version: Cloudera Express 5.6.0 16/03/22

Re: Issue with wholeTextFiles

2016-03-22 Thread Akhil Das
Can you paste the exception stack here? Thanks Best Regards On Mon, Mar 21, 2016 at 1:42 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > I'm using Hadoop 1.0.4 and Spark 1.2.0. > > I'm facing a strange issue. I have a requirement to read a small file from > HDFS and all

Re: pyspark sql convert long to timestamp?

2016-03-22 Thread Akhil Das
Have a look at the from_unixtime() functions. https://spark.apache.org/docs/1.5.0/api/python/_modules/pyspark/sql/functions.html#from_unixtime Thanks Best Regards On Tue, Mar 22, 2016 at 4:49 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > Any idea how I have a col in a data frame

Re: Spark SQL Optimization

2016-03-22 Thread Rishi Mishra
What we have observed so far is that Spark picks the join order in the same order as the tables are specified in the from clause. Sometimes reordering benefits the join query. This could be an inbuilt optimization in Spark. But again, it's not going to be straightforward, where rather than table size, the selectivity of

Re: sliding Top N window

2016-03-22 Thread Rishi Mishra
Hi Alexy, We are also trying to solve similar problems using approximation. Would like to hear more about your usage. We can discuss this offline without boring others. :) Regards, Rishitesh Mishra, SnappyData . (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Tue,