Cache in Spark

2015-10-09 Thread vinod kumar
Hi Guys, May I know whether cache is enabled in Spark by default? Thanks, Vinod

Different partition number of GroupByKey leads to different result

2015-10-09 Thread Devin Huang
Hi everyone, I ran into a problem these days, and I don't know whether it is a bug in Spark. When I use GroupByKey on our SequenceFile data, I find that different partition numbers lead to different results, and the same goes for ReduceByKey. I think the problem happens in the shuffle stage. I read the source code,

run “dev/mima” error in spark1.4.1

2015-10-09 Thread wangxiaojing
[info] spark-core: found 18 potential binary incompatibilities (filtered 423) [error] * method getServletHandlers()Array[org.spark-project.jetty.servlet.ServletContextHandler] in class org.apache.spark.metrics.MetricsSystem has now a different result type; was:

Re: Different partition number of GroupByKey leads to different result

2015-10-09 Thread Devin Huang
Forgive me for not understanding what you mean. The sequence file key is UserWritable, and the value is TagsWritable. Both of them implement WritableComparable and Serializable and override clone(). The String key is extracted from UserWritable through a map transformation. Have you ever read

Datastore or DB for spark

2015-10-09 Thread Rahul Jeevanandam
Hi Guys, I wanted to know which databases you associate with Spark. -- Regards, *Rahul J*

Re: Different partition number of GroupByKey leads to different result

2015-10-09 Thread Sean Owen
If you are not copying or cloning the value (TagsWritable) object, then that is likely the problem. The value is not immutable and is changed by the InputFormat code reading the file, because it is reused. On Fri, Oct 9, 2015 at 11:04 AM, Devin Huang wrote: > Forgive me for not
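(A minimal sketch of the cloning Sean describes; UserWritable/TagsWritable and the TagsWritable copy constructor are assumptions taken from this thread, not verified classes — use whatever copy mechanism your Writable actually provides.)

```scala
// Sketch only: copy the reused Writable objects out before any shuffle.
// `path`, UserWritable and TagsWritable are placeholders from the thread.
val grouped = sc.sequenceFile(path, classOf[UserWritable], classOf[TagsWritable])
  .map { case (user, tags) =>
    // Hadoop reuses these instances while reading, so materialize copies here.
    (user.toString, new TagsWritable(tags)) // assumes a copy constructor exists
  }
  .groupByKey()
```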

Re: Different partition number of GroupByKey leads to different result

2015-10-09 Thread Devin Huang
Let me add: the problem is that GroupByKey cannot divide our sequence data into groups correctly and produces wrong key/value pairs. The shuffle stage might not be executed correctly, and I don't know what causes this. The type of the key is String, and the type of the value is TagsWritable. I take out one

Re: Cache in Spark

2015-10-09 Thread vinod kumar
Thanks Natu. If so, can you please share the Spark SQL query to check whether a given table is cached or not, if you know it? Thanks, Vinod On Fri, Oct 9, 2015 at 2:26 PM, Natu Lauchande wrote: > > I don't think so. > > Spark is not keeping the results in memory unless

RE: Insert via HiveContext is slow

2015-10-09 Thread Cheng, Hao
I think DF performs the same as the SQL API does in the multi-inserts, if you don’t use the cached table. Hao From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] Sent: Friday, October 9, 2015 3:09 PM To: Cheng, Hao Cc: user Subject: Re: Insert via HiveContext is slow Thanks Hao. It

Re: Cache in Spark

2015-10-09 Thread Natu Lauchande
I don't think so. Spark is not keeping the results in memory unless you tell it to. You have to explicitly call the cache method on your RDD: linesWithSpark.cache() Thanks, Natu On Fri, Oct 9, 2015 at 10:47 AM, vinod kumar wrote: > Hi Guys, > > May I know
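(For reference, a minimal caching sketch; the file path is a placeholder, and nothing is held in memory until an action runs after cache().)

```scala
// Sketch: mark the RDD for caching, materialize it with an action, then release it.
val lines = sc.textFile("hdfs:///path/to/data.txt")     // placeholder path
val linesWithSpark = lines.filter(_.contains("Spark"))
linesWithSpark.cache()      // only marks the RDD; nothing is cached yet
linesWithSpark.count()      // first action populates the cache
linesWithSpark.count()      // now served from memory
linesWithSpark.unpersist()  // free the cached blocks when done
```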

Re: Different partition number of GroupByKey leads to different result

2015-10-09 Thread Sean Owen
Another guess, since you say the key is String (offline): you are not cloning the value of TagsWritable. Hadoop reuses the object under the hood, and so is changing your object value. You can't save references to the object you get from reading a SequenceFile. On Fri, Oct 9, 2015 at 10:22 AM,

Re: Different partition number of GroupByKey leads to different result

2015-10-09 Thread Sean Owen
First guess: your key class does not implement hashCode/equals On Fri, Oct 9, 2015 at 10:05 AM, Devin Huang wrote: > Hi everyone, > > I got a trouble these days,and I don't know whether it is a bug of > spark.When I use GroupByKey for our sequenceFile Data,I find that

Re: OutOfMemoryError

2015-10-09 Thread Ramkumar V
How to increase the Xmx of the workers ? *Thanks*, On Mon, Oct 5, 2015 at 3:48 PM, Ramkumar V wrote: > No. I didn't try to increase xmx. > > *Thanks*, > > > > On Mon, Oct 5, 2015 at

Re: Kafka and Spark combination

2015-10-09 Thread Xiao Li
Please see the following discussion: http://search-hadoop.com/m/YGbbS0SqClMW5T1 Thanks, Xiao Li 2015-10-09 6:17 GMT-07:00 Nikhil Gs : > Has anyone worked with Kafka in a scenario where the Streaming data from > the Kafka consumer is picked by Spark (Java)

SQLcontext changing String field to Long

2015-10-09 Thread Abhisheks
Hi there, I have saved my records in Parquet format and am using Spark 1.5. But when I try to fetch the columns it throws the exception *java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.unsafe.types.UTF8String*. This field was saved as String while writing the Parquet file, so

Re: Kafka and Spark combination

2015-10-09 Thread Tathagata Das
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ Recently it has been merged into HBase https://issues.apache.org/jira/browse/HBASE-13992 There are other options to use. See spark-packages.org. On Fri, Oct 9, 2015 at 4:33 PM, Xiao Li wrote: >

Re: Re: Re: Re: Error in load hbase on spark

2015-10-09 Thread roywang1024
Finally I fixed it. It was just caused by "ClassNotFoundException: org.apache.htrace.Trace". I can't see this message in the logs on the driver node, but it can be found on the worker node. I modified "spark.executor.extraClassPath" in spark-default.conf, which still did not work, and also modified classpath.txt on every node. It

Re: Best storage format for intermediate process

2015-10-09 Thread Xiao Li
Hi, Saif, This depends on your use cases. For example, you want to do a table scan every time? or you want to get a specific row? or you want to get a temporal query? Do you have a security concern when you choose your target-side data store? Offloading a huge table is also very expensive. It is

Re: Issue with the class generated from avro schema

2015-10-09 Thread Bartłomiej Alberski
I knew that one possible solution would be to map the loaded object into another class just after reading it from HDFS. I was looking for a solution enabling reuse of the Avro-generated classes. It could be useful when your record has more than 22 fields, because you do not need to write boilerplate

Re: How to compile Spark with customized Hadoop?

2015-10-09 Thread Matei Zaharia
You can publish your version of Hadoop to your Maven cache with mvn publish (just give it a different version number, e.g. 2.7.0a) and then pass that as the Hadoop version to Spark's build (see http://spark.apache.org/docs/latest/building-spark.html

Fixed writer version as version 1 for Parquet when writing a Parquet file

2015-10-09 Thread Hyukjin Kwon
Hi all, While writing some Parquet files with Spark, I found that it actually only writes the Parquet files with writer version 1. This changes the encoding types of the file. Is this intentionally fixed for some reason? I changed the code and tested writing with writer version 2 and it looks fine. In more

Re: Insert via HiveContext is slow

2015-10-09 Thread Daniel Haviv
Thanks Hao. That seems like one issue. The other issue, to me, seems to be the renaming of files at the end of the insert. Would DF.save perform the task better? Thanks, Daniel On Fri, Oct 9, 2015 at 3:35 AM, Cheng, Hao wrote: > I think that’s a known performance issue(Compared to

Re: sql query orc slow

2015-10-09 Thread patcharee
Yes, the predicate pushdown is enabled, but it still takes longer than the first method. BR, Patcharee On 08. okt. 2015 18:43, Zhan Zhang wrote: Hi Patcharee, Did you enable the predicate pushdown in the second method? Thanks. Zhan Zhang On Oct 8, 2015, at 1:43 AM, patcharee

Cannot connect to standalone spark cluster

2015-10-09 Thread ekraffmiller
Hi, I'm trying to run a java application that connects to a local standalone spark cluster. I start the cluster with the default configuration, using start-all.sh. When I go to the web page for the cluster, it is started ok. I can connect to this cluster with SparkR, but when I use the same

Re: Spark checkpoint restore failure due to s3 consistency issue

2015-10-09 Thread Tathagata Das
That won't help, really. What we need to see is the lifecycle of the file before the failure, so we need the log4j logs. On Fri, Oct 9, 2015 at 2:34 PM, Spark Newbie wrote: > Unfortunately I don't have the before stop logs anymore since the log was > overwritten in my

Re: Spark checkpoint restore failure due to s3 consistency issue

2015-10-09 Thread Spark Newbie
Unfortunately I don't have the before-stop logs anymore since the log was overwritten in my next run. I created a rdd-_$folder$ file in S3 which was missing compared to the other checkpointed rdd- entries. The app then started without the IllegalArgumentException. Do you still need the after-restart log4j

Re: Datastore or DB for spark

2015-10-09 Thread Xiao Li
FYI, in my local environment, Spark is connected to DB2 on z/OS but that requires a special JDBC driver. Xiao Li 2015-10-09 8:38 GMT-07:00 Rahul Jeevanandam : > Hi Jörn Franke > > I was sure that relational database wouldn't be a good option for Spark. > But what about

Spark checkpoint restore failure due to s3 consistency issue

2015-10-09 Thread Spark Newbie
Hi Spark Users, I'm seeing checkpoint restore failures causing the application startup to fail with the below exception. When I do "ls" on the S3 path, sometimes I see the key listed and sometimes I don't. There are no part files (checkpointed files) in the specified S3 path. This is possible

Question about GraphX connected-components

2015-10-09 Thread John Lilley
Greetings, We are looking into using the GraphX connected-components algorithm on Hadoop for grouping operations. Our typical data is on the order of 50-200M vertices with an edge:vertex ratio between 2 and 30. While there are pathological cases of very large groups, they tend to be small. I

Re: Streaming Application Unable to get Stream from Kafka

2015-10-09 Thread Terry Hoo
Hi Prateek, How many cores (threads) do you assign to spark in local mode? It is very likely the local spark does not have enough resource to proceed. You can check http://yourip:4040 to check the details. Thanks! Terry On Fri, Oct 9, 2015 at 10:34 PM, Prateek . wrote: >

Jar is cached in yarn-cluster mode?

2015-10-09 Thread Rex Xiong
I use "spark-submit -master yarn-cluster hdfs://.../a.jar .." to submit my app to yarn. Then I update this a.jar in HDFS, run the command again, I found a line of log that was been removed still exist in "yarn logs ". Is there a cache mechanism I need to disable? Thanks

akka.event.Logging$LoggerInitializationException

2015-10-09 Thread luohui20001
Hi there: when my colleague runs multiple Spark apps simultaneously, some of them fail with akka.event.Logging$LoggerInitializationException. Caused by: akka.event.Logging$LoggerInitializationException: Logger log1-Slf4jLogger did not respond with LoggerInitialized, sent instead

Re: Cache in Spark

2015-10-09 Thread Ted Yu
For RDD, I found this method: def getStorageLevel: StorageLevel = storageLevel FYI On Fri, Oct 9, 2015 at 2:46 AM, vinod kumar wrote: > Thanks Natu, > > If so,Can you please share me the Spark SQL query to check whether the > given table is cached or not? if you
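(A small sketch combining the answers in this thread; myTable is a placeholder, and isCached/CACHE TABLE are the table-level counterparts of getStorageLevel on an RDD.)

```scala
// Sketch: check caching status for an RDD and for a registered table.
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).cache()
rdd.count()                                        // materialize the cache
println(rdd.getStorageLevel != StorageLevel.NONE)  // true once a storage level is set

sqlContext.sql("CACHE TABLE myTable")              // myTable is a placeholder table name
println(sqlContext.isCached("myTable"))            // true
```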

Re: Issue with the class generated from avro schema

2015-10-09 Thread Igor Berman
I think there is a deepCopy method on the generated Avro classes. On 9 October 2015 at 23:32, Bartłomiej Alberski wrote: > I knew that one possible solution will be to map loaded object into > another class just after reading from HDFS. > I was looking for solution enabling reuse

Re: Spark checkpoint restore failure due to s3 consistency issue

2015-10-09 Thread Tathagata Das
Can you provide the before stop and after restart log4j logs for this? On Fri, Oct 9, 2015 at 2:13 PM, Spark Newbie wrote: > Hi Spark Users, > > I'm seeing checkpoint restore failures causing the application startup to > fail with the below exception. When I do "ls"

Re: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread Ted Yu
This is related: SPARK-10501 On Fri, Oct 9, 2015 at 7:28 AM, java8964 wrote: > Hi, Sparkers: > > In this case, I want to use Spark as an ETL engine to load the data from > Cassandra, and save it into HDFS. > > Here is the environment specified information: > > Spark 1.3.1

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
Hi Patcharee, From the query, it looks like only column pruning will be applied. Partition pruning and predicate pushdown do not have an effect. Do you see a big IO difference between the two methods? The potential reason for the speed difference that I can think of may be the different versions of

Issue with the class generated from avro schema

2015-10-09 Thread alberskib
Hi all, I have a piece of code written in Spark that loads data from HDFS into Java classes generated from Avro IDL. On an RDD created in that way I am executing a simple operation whose result depends on whether I cache the RDD beforehand or not, i.e. if I run the code below val loadedData =

Re: Kafka streaming "at least once" semantics

2015-10-09 Thread pushkar priyadarshi
Spark 1.5 Kafka direct, I think, does not store messages; rather, it fetches messages as and when they are consumed in the pipeline. That would prevent you from having data loss. On Fri, Oct 9, 2015 at 7:34 AM, bitborn wrote: > Hi all, > > My company is using Spark streaming

Re: Datastore or DB for spark

2015-10-09 Thread Ted Yu
There are connectors for hbase, Cassandra, etc. Which data store do you use now ? Cheers > On Oct 9, 2015, at 3:10 AM, Rahul Jeevanandam wrote: > > Hi Guys, > > I wanted to know what is the databases that you associate with spark? > > -- > Regards, > Rahul J

Re: spark-submit hive connection through spark Initial job has not accepted any resources

2015-10-09 Thread vinayak
Java code which I am trying to invoke. import org.apache.spark.SparkContext; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.hive.HiveContext; public class SparkHiveInsertor { public static void main(String[]

Kafka streaming "at least once" semantics

2015-10-09 Thread bitborn
Hi all, My company is using Spark streaming and the Kafka API's to process an event stream. We've got most of our application written, but are stuck on "at least once" processing. I created a demo to show roughly what we're doing here:

spark-submit hive connection through spark Initial job has not accepted any resources

2015-10-09 Thread vinayak
Hi, I am able to fetch data, create tables, and put data from the spark shell (Scala command line) from Spark to Hive, but when I write Java code to do the same and submit it through spark-submit I am getting *"Initial job has not accepted any resources; check your cluster UI to ensure that workers are

Re: Kafka streaming "at least once" semantics

2015-10-09 Thread pushkar priyadarshi
I am referring to the back pressure implementation here. On Fri, Oct 9, 2015 at 8:30 AM, pushkar priyadarshi < priyadarshi.push...@gmail.com> wrote: > Spark 1.5 kafka direct i think does not store messages rather than it > fetches messages as in when consumed in the pipeline.That would prevent you >

RE: Streaming Application Unable to get Stream from Kafka

2015-10-09 Thread Prateek .
Hi All, In my application I have a serializable class which takes an InputDStream from Kafka. The InputDStream contains JSON which is stored in a serializable case class. Transformations are applied and saveToCassandra() is executed. I was getting a task-not-serializable exception, so I made the

How to handle the UUID in Spark 1.3.1

2015-10-09 Thread java8964
Hi, Sparkers: In this case, I want to use Spark as an ETL engine to load the data from Cassandra, and save it into HDFS. Here is the environment information: Spark 1.3.1, Cassandra 2.1, HDFS/Hadoop 2.2. I am using the Cassandra Spark Connector 1.3.x, with which I have no problem querying the C*

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
In your case, you manually set an AND pushdown, and the predicate is right based on your setting: leaf-0 = (EQUALS x 320). The right way is to enable the predicate pushdown as follows: sqlContext.setConf("spark.sql.orc.filterPushdown", "true") Thanks. Zhan Zhang On Oct 9, 2015, at 9:58
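(A short sketch of the suggested setting, using the table and predicate from this thread as placeholders.)

```scala
// Sketch: turn on ORC predicate pushdown before running the filtered query.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val df = sqlContext.sql(
  "SELECT date, month, year, hh, z FROM 4D WHERE x = 320 AND y = 117")
df.show()
```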

Re: "Too many open files" exception on reduceByKey

2015-10-09 Thread tian zhang
You are right, I did find that Mesos overrides this with a smaller number. So we will modify that and try to run again. Thanks! Tian On Thursday, October 8, 2015 4:18 PM, DB Tsai wrote: Try to run to see actual ulimit. We found that mesos overrides the ulimit which

RE: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread java8964
Thanks, Ted. Does this mean I am out of luck for now? If I use HiveContext, and cast the UUID as string, will it work? Yong Date: Fri, 9 Oct 2015 09:09:38 -0700 Subject: Re: How to handle the UUID in Spark 1.3.1 From: yuzhih...@gmail.com To: java8...@hotmail.com CC: user@spark.apache.org This
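(A hedged sketch of the cast Yong proposes for Spark 1.3.1; the table and column names are placeholders, and whether the Cassandra connector accepts the cast depends on the connector version.)

```scala
// Sketch: read the UUID column as a string through HiveContext, then write to HDFS.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.sql(
  "SELECT CAST(id AS STRING) AS id_str, payload FROM cass_table")  // placeholder table/columns
df.saveAsParquetFile("hdfs:///tmp/etl_output")                     // placeholder output path
```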

Re: sql query orc slow

2015-10-09 Thread patcharee
I set hiveContext.setConf("spark.sql.orc.filterPushdown", "true"), but the log shows no ORC pushdown predicate for my query with a WHERE clause: 15/10/09 19:16:01 DEBUG OrcInputFormat: No ORC pushdown predicate I don't understand what is wrong with this. BR, Patcharee On 09. okt. 2015 19:10,

Re: sql query orc slow

2015-10-09 Thread patcharee
Hi Zhan Zhang, Actually my query has a WHERE clause: "select date, month, year, hh, (u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 and z <= 8". Columns "x" and "y" are not partition columns; the others are

Re: OutOfMemoryError

2015-10-09 Thread Ted Yu
You can add it in conf/spark-defaults.conf # spark.executor.extraJavaOptions -XX:+PrintGCDetails FYI On Fri, Oct 9, 2015 at 3:07 AM, Ramkumar V wrote: > How to increase the Xmx of the workers ? > > *Thanks*, > > > > On
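(For the original question about executor heap size, a hedged sketch; spark.executor.memory controls the executor Xmx, while extraJavaOptions carries additional JVM flags. The 4g value is only an example.)

```scala
// Sketch: set executor heap and extra JVM options programmatically.
// The same keys can go in conf/spark-defaults.conf or be passed to spark-submit with --conf.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "4g")                             // executor Xmx (example value)
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")  // extra flags, not for -Xmx
```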

Re: Re: Re: Error in load hbase on spark

2015-10-09 Thread Ted Yu
Can you pastebin log snippet showing hbase related errors ? Please also consider posting the question on vendor's forum. On Thu, Oct 8, 2015 at 10:17 PM, roywang1024 wrote: > > I add hbase-conf-dir to spark/conf/classpath.txt,but still error. > > > > > > At 2015-10-09

Kafka and Spark combination

2015-10-09 Thread Nikhil Gs
Has anyone worked with Kafka in a scenario where the streaming data from the Kafka consumer is picked up by Spark (Java) functionality and directly placed in HBase? Regards, Gs.

Re: ExecutorLostFailure when working with RDDs

2015-10-09 Thread Ivan Héda
The solution is to set 'spark.shuffle.io.preferDirectBufs' to 'false'. Then it is working. Cheers! On Fri, Oct 9, 2015 at 3:13 PM, Ivan Héda wrote: > Hi, > > I'm facing an issue with PySpark (1.5.1, 1.6.0-SNAPSHOT) running over Yarn > (2.6.0-cdh5.4.4). Everything seems

Re: Datastore or DB for spark

2015-10-09 Thread Jörn Franke
I am not aware of any empirical evidence, but I think Hadoop (HDFS) as a datastore for Spark is quite common. With relational databases you usually do not have so much data and you do not benefit from data locality. On Fri, Oct 9, 2015 at 15:16, Rahul Jeevanandam wrote: >

ExecutorLostFailure when working with RDDs

2015-10-09 Thread Ivan Héda
Hi, I'm facing an issue with PySpark (1.5.1, 1.6.0-SNAPSHOT) running over Yarn (2.6.0-cdh5.4.4). Everything seems fine when working with dataframes, but when I need RDDs the workers start to fail, as in the following code: table1 = sqlContext.table('someTable') table1.count() ## OK ## cca 500

Re: Error in load hbase on spark

2015-10-09 Thread Guru Medasani
Hi Roy, Here is a cloudera-labs project SparkOnHBase that makes it really simple to read HBase data into Spark. https://github.com/cloudera-labs/SparkOnHBase Link to blog that explains how to use the package.

Re: Error in load hbase on spark

2015-10-09 Thread Ted Yu
Work for hbase-spark module is still ongoing https://issues.apache.org/jira/browse/HBASE-14406 > On Oct 9, 2015, at 6:18 AM, Guru Medasani wrote: > > Hi Roy, > > Here is a cloudera-labs project SparkOnHBase that makes it really simple to > read HBase data into Spark. > >

Re: Datastore or DB for spark

2015-10-09 Thread Rahul Jeevanandam
I want to know what everyone is using. Which datastore is popular among the Spark community? On Fri, Oct 9, 2015 at 6:16 PM, Ted Yu wrote: > There are connectors for hbase, Cassandra, etc. > > Which data store do you use now ? > > Cheers > > On Oct 9, 2015, at 3:10 AM, Rahul

Re: Kafka streaming "at least once" semantics

2015-10-09 Thread Nikhil Gs
Hello Everyone, Has anyone worked with Kafka in a scenario where the streaming data from the Kafka consumer is picked up by Spark (Java) functionality and directly placed in HBase? Please let me know, we are completely new to this scenario. That would be very helpful. Regards, GS. Regards, Nik.

Streaming Application Unable to get Stream from Kafka

2015-10-09 Thread Prateek .
Hi, I have a Spark Streaming application running with the following log on the console. I don't get any exception but I am not able to receive the data from the Kafka stream. Can anyone please provide any insight into what is happening with Spark Streaming? Is the receiver not able to read the stream? How

Re: sql query orc slow

2015-10-09 Thread Zhan Zhang
That is weird. Unfortunately, there is no debug info available on this part. Can you please open a JIRA to add some debug information on the driver side? Thanks. Zhan Zhang On Oct 9, 2015, at 10:22 AM, patcharee wrote: I set

Best storage format for intermediate process

2015-10-09 Thread Saif.A.Ellafi
Hi all, I am in the process of learning big data. Right now, I am bringing huge databases through JDBC into Spark (a 250-million-row table can take around 3 hours), and then re-saving them as JSON, which is fast, simple, distributed, fail-safe and stores data types, although without any
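(A minimal sketch of that JDBC-to-JSON offload, assuming Spark 1.4+; the connection details and paths are placeholders.)

```scala
// Sketch: pull a table over JDBC and re-save it as JSON on HDFS.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb",  // placeholder connection string
  "dbtable" -> "big_table",                           // placeholder table name
  "driver"  -> "org.postgresql.Driver"                // placeholder driver
)).load()
df.write.json("hdfs:///staging/big_table_json")       // placeholder output path
```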

Re: weird issue with sqlContext.createDataFrame - pyspark 1.3.1

2015-10-09 Thread ping yan
Thanks. It does seem like my pandas installation is corrupted. Thanks! On Fri, Oct 9, 2015 at 11:04 AM, Davies Liu wrote: > Is it possible that you have an very old version of pandas, that does > not have DataFrame (or in different submodule). > > Could you try

Re: How to handle the UUID in Spark 1.3.1

2015-10-09 Thread Ted Yu
I guess that should work :-) On Fri, Oct 9, 2015 at 10:46 AM, java8964 wrote: > Thanks, Ted. > > Does this mean I am out of luck for now? If I use HiveContext, and cast > the UUID as string, will it work? > > Yong > > -- > Date: Fri, 9 Oct 2015

Re: Fixed writer version as version 1 for Parquet when writing a Parquet file

2015-10-09 Thread Cheng Lian
Hi Hyukjin, Thanks for bringing this up. Could you please make a PR for this one? We didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, but we should let users choose the writer version, as long as PARQUET_1_0 remains the default option. Cheng On 10/8/15 11:04 PM,

Re: weird issue with sqlContext.createDataFrame - pyspark 1.3.1

2015-10-09 Thread Davies Liu
Is it possible that you have a very old version of pandas that does not have DataFrame (or has it in a different submodule)? Could you try this: ``` >>> import pandas >>> pandas.__version__ '0.14.0' ``` On Thu, Oct 8, 2015 at 10:28 PM, ping yan wrote: > I really cannot figure out

Re: Issue with the class generated from avro schema

2015-10-09 Thread Igor Berman
You should create a copy of your Avro data before working with it, i.e. just after loadFromHDFS, map it into a new instance that is a deep copy of the object. It's connected to the way the Spark/Avro reader reads Avro files (it reuses some buffer or something). On 9 October 2015 at 19:05, alberskib
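(A hedged sketch of that deep copy, assuming the classes generated from the Avro IDL are SpecificRecord subclasses; MyRecord and loadFromHDFS are placeholders taken from this thread.)

```scala
// Sketch: deep-copy each Avro object right after loading, before caching or shuffling,
// because the reader reuses the underlying instances.
import org.apache.avro.specific.SpecificData
import org.apache.spark.rdd.RDD

val loadedData: RDD[MyRecord] = loadFromHDFS(path)   // placeholders from the thread
val safeData = loadedData.map(r => SpecificData.get().deepCopy(r.getSchema, r))
safeData.cache()
```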

Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-09 Thread Michael Armbrust
I'm thinking there must be a typo somewhere else as this works for me on Spark 1.4: Seq(("1231234", 1)).toDF("barcode", "items").registerTempTable("goods") sql("SELECT barcode, IF(items IS NULL, 0, items) FROM goods").collect() res1: Array[org.apache.spark.sql.Row] = Array([1231234,1]) I'll

Re: error in sparkSQL 1.5 using count(1) in nested queries

2015-10-09 Thread Michael Armbrust
Thanks for reporting: https://issues.apache.org/jira/browse/SPARK-11032 You can probably workaround this by aliasing the count and just doing a filter on that value afterwards. On Thu, Oct 8, 2015 at 8:47 PM, Jeff Thompson < jeffreykeatingthomp...@gmail.com> wrote: > After upgrading from 1.4.1

How to calculate percentile of a column of DataFrame?

2015-10-09 Thread unk1102
Hi, how do I calculate the percentile of a column in a DataFrame? I can't find any percentile_approx function among Spark's aggregation functions. For example, in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable"); I can see

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from dataframes. On Fri, Oct 9, 2015 at 12:01 PM, unk1102 wrote: > Hi how to calculate percentile of a column in a DataFrame? I cant find any > percentile_approx function in Spark aggregation functions. For

How to tune unavoidable group by query?

2015-10-09 Thread unk1102
Hi, I have the following group-by query which I tried both with the DataFrame API and with hiveContext.sql(), but both shuffle huge amounts of data and are slow. I have around 8 fields passed in as group-by fields: sourceFrame.select("blabla").groupBy("col1","col2","col3",..."col8").agg("bla bla"); OR

RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Where can we find other available functions such as lit()? I can't find lit in the API. Thanks From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, October 09, 2015 4:04 PM To: unk1102 Cc: user Subject: Re: How to calculate percentile of a column of DataFrame? You can use

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
I found it in the 1.3 documentation; lit says something else, not percentile: public static Column lit(Object literal) Creates a Column of literal

How to compile Spark with customized Hadoop?

2015-10-09 Thread Dogtail L
Hi all, I have modified Hadoop source code, and I want to compile Spark with my modified Hadoop. Do you know how to do that? Great thanks!

RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Yes, but I mean, this is rather curious. How does def lit(literal: Any) become a percentile function, lit(25)? Thanks for the clarification. Saif From: Umesh Kacha [mailto:umesh.ka...@gmail.com] Sent: Friday, October 09, 2015 4:10 PM To: Ellafi, Saif A. Cc: Michael Armbrust; user Subject: Re: How to

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
This is confusing because I made a typo... callUDF("percentile_approx", col("mycol"), lit(0.25)) The first argument is the name of the UDF, all other arguments need to be columns that are passed in as arguments. lit is just saying to make a literal column that always has the value 0.25. On
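(Putting Michael's correction together, a sketch; df and the column name are placeholders, and this relies on a HiveContext so that the Hive percentile_approx UDAF is available.)

```scala
// Sketch: approximate 25th percentile of "mycol" through the Hive UDAF, from the DataFrame API.
import org.apache.spark.sql.functions.{callUDF, col, lit}

val result = df.agg(callUDF("percentile_approx", col("mycol"), lit(0.25)))
result.show()
```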

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
thanks much Michael let me try. On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust wrote: > This is confusing because I made a typo... > > callUDF("percentile_approx", col("mycol"), lit(0.25)) > > The first argument is the name of the UDF, all other arguments need to be >

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
I have a doubt, Michael. I tried to use callUDF in the following code and it does not work: sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25))) The above code does not compile because callUdf() takes only two arguments: the function name as a String and a Column. Please guide. On Sat,

Re: spark.mesos.coarse impacts memory performance on mesos

2015-10-09 Thread Utkarsh Sengar
Hi Tim, Any way I can provide more info on this? On Thu, Oct 1, 2015 at 4:21 PM, Utkarsh Sengar wrote: > Not sure what you mean by that, I shared the data which I see in spark UI. > Can you point me to a location where I can precisely get the data you need? > > When I

Re: Kafka streaming "at least once" semantics

2015-10-09 Thread Cody Koeninger
To be clear, have you tried compiling and running the idempotent example from my repo? Is that behaving as you'd expect? On Fri, Oct 9, 2015 at 6:34 AM, bitborn wrote: > Hi all, > > My company is using Spark streaming and the Kafka API's to process an event > stream.

Create hashmap using two RDD's

2015-10-09 Thread kali.tumm...@gmail.com
Hi all, I am trying to create a hashmap using two RDDs, but I am running into a "key not found" issue. Do I need to convert the RDDs to lists first? 1) one RDD has the key data 2) the other RDD has the value data Key RDD: val quotekey=file.map(x => x.split("\\|")).filter(line => line(0).contains("1017")).map(x => x(5)+x(4))
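(One hedged way to build the map is to key both RDDs, join them, and collect to the driver; otherFile and the value-side column layout below are assumptions, not taken from the original post.)

```scala
// Sketch: build a lookup Map from two RDDs by joining on a shared key.
val quoteKey = file.map(_.split("\\|"))
  .filter(line => line(0).contains("1017"))
  .map(x => (x(5) + x(4), x))                   // (key, full record)

val valueKey = otherFile.map(_.split("\\|"))    // otherFile: placeholder for the value RDD's source
  .map(x => (x(5) + x(4), x(1)))                // assumed: same key layout, value in column 1

val lookup: Map[String, String] =
  quoteKey.join(valueKey).mapValues(_._2).collectAsMap().toMap
```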

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-09 Thread Khandeshi, Ami
It seems the problem is with creating Usage: RBackend From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, October 07, 2015 10:23 PM To: Khandeshi, Ami; Hossein Cc: akhandeshi; user@spark.apache.org Subject: RE: SparkR Error in sparkR.init(master=“local”) in RStudio Can you extract the

RE: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-09 Thread Khandeshi, Ami
Thank you for your help! I was able to resolve it by changing my working directory to a local one. The default was a mapped drive. From: Khandeshi, Ami Sent: Friday, October 09, 2015 11:23 AM To: 'Sun, Rui'; Hossein Cc: akhandeshi; user@spark.apache.org Subject: RE: SparkR Error in

Re: Datastore or DB for spark

2015-10-09 Thread Rahul Jeevanandam
Hi Jörn Franke, I was sure that a relational database wouldn't be a good option for Spark. But what about distributed databases like HBase, Cassandra, etc.? On Fri, Oct 9, 2015 at 7:21 PM, Jörn Franke wrote: > I am not aware of any empirical evidence, but I think hadoop