Re: How Broadcast variable works

2015-05-30 Thread ayan guha
the updated variable? Thanks. -- bit1...@163.com -- Best Regards, Ayan Guha

Re: Batch aggregation by sliding window + join

2015-05-29 Thread ayan guha
of daily block. On 29 May 2015 at 01:51, ayan guha guha.a...@gmail.com wrote: Which version of spark? In 1.4 window queries will show up for these kinds of scenarios. One thing I can suggest is to keep daily aggregates materialised and partitioned by key and sorted by key-day combination using

Re: Format RDD/SchemaRDD contents to screen?

2015-05-29 Thread ayan guha
Depending on your spark version, you can convert the schemaRDD to a dataframe and then use .show() On 30 May 2015 10:33, Minnow Noir minnown...@gmail.com wrote: I'm trying to debug query results inside spark-shell, but finding it cumbersome to save to file and then use file system utils to explore

Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-28 Thread ayan guha
Probably a naive question: can you try the same in the hive CLI and see if your SQL is working? Looks like a hive thing to me as spark is faithfully delegating the query to hive. On 29 May 2015 03:22, Abhishek Tripathi trackissue...@gmail.com wrote: Hi , I'm using CDH5.4.0 quick start VM and tried

Re: Batch aggregation by sliding window + join

2015-05-28 Thread ayan guha
Which version of spark? In 1.4 window queries will show up for these kinds of scenarios. One thing I can suggest is to keep daily aggregates materialised, partitioned by key and sorted by key-day combination using the repartitionAndSortWithinPartitions method. It allows you to use a custom partitioner and a custom sorter.
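
A minimal PySpark sketch of that idea, using a made-up (key, day, value) dataset, an assumed partition count, and a hypothetical output path; sc is the usual shell context:

    # A toy dataset standing in for the real daily data.
    raw = sc.parallelize([("k1", "2015-05-01", 10), ("k1", "2015-05-02", 7),
                          ("k2", "2015-05-01", 3)])
    daily = raw.map(lambda r: ((r[0], r[1]), r[2]))               # ((key, day), value)
    daily_agg = daily.reduceByKey(lambda a, b: a + b)             # one row per key-day

    # Partition on key only, sort each partition by (key, day).
    sorted_agg = daily_agg.repartitionAndSortWithinPartitions(
        numPartitions=4,                                          # assumed partition count
        partitionFunc=lambda kd: hash(kd[0]),
        keyfunc=lambda kd: kd)

    sorted_agg.saveAsTextFile("hdfs:///aggregates/daily")         # hypothetical path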

Re: How to give multiple directories as input ?

2015-05-27 Thread ayan guha
What about /blah/*/blah/out*.avro? On 27 May 2015 18:08, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am doing that now. Is there no other way ? On Wed, May 27, 2015 at 12:40 PM, Akhil Das ak...@sigmoidanalytics.com wrote: How about creating two and union [ sc.union(first, second) ] them?
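
For reference, a small sketch of the glob approach in PySpark; the paths below are made up and sc is the usual shell context:

    # Hadoop input paths accept globs and comma-separated lists, so a single
    # call can cover many directories without building a union by hand.
    rdd1 = sc.textFile("/blah/*/blah/out*")                   # glob across directories
    rdd2 = sc.textFile("/data/2015-05-01,/data/2015-05-02")   # explicit comma-separated list
    combined = rdd1.union(rdd2)                               # or sc.union([rdd1, rdd2])
    print(combined.count())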

Re: How many executors can I acquire in standalone mode ?

2015-05-27 Thread ayan guha
You can request number of cores and amount of memory for each executor. On 27 May 2015 18:25, canan chen ccn...@gmail.com wrote: Thanks Arush. My scenario is that In standalone mode, if I have one worker, when I start spark-shell, there will be one executor launched. But if I have 2 workers,

Re: DataFrame. Conditional aggregation

2015-05-27 Thread ayan guha
in the aggregation inserting a lambda function or something else. Thanks Regards. Miguel. On Wed, May 27, 2015 at 1:06 AM, ayan guha guha.a...@gmail.com wrote: For this, I can give you a SQL solution: joinedData.registerTempTable('j') Res=ssc.sql('select col1,col2, count(1) counter, min(col3
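
A rough reconstruction of that SQL approach, assuming a DataFrame named joinedData with columns col1, col2, col3 and endrscp as in the thread; the comparison "> 100" is assumed (the archive stripped the original operator) and sqlContext stands in for the thread's ssc:

    # Conditional aggregation expressed in SQL over a registered temp table.
    joinedData.registerTempTable("j")
    res = sqlContext.sql("""
        SELECT col1, col2,
               COUNT(1) AS counter,
               MIN(col3) AS minimum,
               SUM(CASE WHEN endrscp > 100 THEN 1 ELSE 0 END) AS test
        FROM j
        GROUP BY col1, col2""")
    res.show()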

Re: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread ayan guha
Yes, you are on the right path. The only thing to remember is to place hive-site.xml in the correct path so spark can talk to the hive metastore. Best Ayan On 28 May 2015 10:53, Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: hey guys On the Hive/Hadoop ecosystem we have using Cloudera

Re: DataFrame. Conditional aggregation

2015-05-26 Thread ayan guha
know how it works. For example: val result = joinedData.groupBy(col1,col2).agg( count(lit(1)).as(counter), min(col3).as(minimum), sum(case when endrscp > 100 then 1 else 0 end).as(test) ) How can I do it? Thanks Regards. Miguel. On Tue, May 26, 2015 at 12:35 AM, ayan guha

Re: Running Javascript from scala spark

2015-05-26 Thread ayan guha
Yes you are in the right mailing list, for sure :) Regarding your question, I am sure you are well versed with how spark works. Essentially you can run any arbitrary function with a map call and it will run on remote nodes. Hence you need to install any needed dependencies on all nodes. You can also pass

Re: Using Spark like a search engine

2015-05-25 Thread ayan guha
Yes, spark will be useful for following areas of your application: 1. Running same function on every CV in parallel and score 2. Improve scoring function by better access to classification and clustering algorithms, within and beyond mllib. These are first benefits you can start with and then

Re: DataFrame. Conditional aggregation

2015-05-25 Thread ayan guha
Case when col2 > 100 then 1 else col2 end On 26 May 2015 00:25, Masf masfwo...@gmail.com wrote: Hi. In a dataframe, how can I execute a conditional expression in an aggregation? For example, can I translate this SQL statement to DataFrame?: SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE

Re: DataFrame groupBy vs RDD groupBy

2015-05-23 Thread ayan guha
Hi Michael This is great info. I am currently using repartitionandsort function to achieve the same. Is this the recommended way till 1.3 or is there any better way? On 23 May 2015 07:38, Michael Armbrust mich...@databricks.com wrote: DataFrames have a lot more information about the data, so

Re: partitioning after extracting from a hive table?

2015-05-22 Thread ayan guha
I guess not. Spark partitions correspond to number of splits. On 23 May 2015 00:02, Cesar Flores ces...@gmail.com wrote: I have a table in a Hive database partitioning by date. I notice that when I query this table using HiveContext the created data frame has an specific number of partitions.

Re: Hive on Spark VS Spark SQL

2015-05-20 Thread ayan guha
And if I am not wrong, spark SQL api is intended to move closer to SQL standards. I feel it's a clever decision on spark's part to keep both APIs operational. These short-term confusions are worth the long-term benefits. On 20 May 2015 17:19, Sean Owen so...@cloudera.com wrote: I don't think that's

Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread ayan guha
Thanks a bunch On 21 May 2015 07:11, Davies Liu dav...@databricks.com wrote: The docs had been updated. You should convert the DataFrame to RDD by `df.rdd` On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote: Hi Just upgraded to Spark 1.3.1. I am getting an warning

Re: Spark SQL on large number of columns

2015-05-19 Thread ayan guha
and create a logical plan. Even if I have just one row, it's taking more than 1 hour just to get past the parsing. Any idea how to optimize in these kinds of scenarios? Regards, Madhukara Phatak http://datamantra.io/ -- Best Regards, Ayan Guha

Re: Spark Job not using all nodes in cluster

2015-05-19 Thread ayan guha
What does your spark env file say? Are you setting the number of executors in the spark context? On 20 May 2015 13:16, Shailesh Birari sbirar...@gmail.com wrote: Hi, I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB of RAM. I have around 600,000+ Json files on HDFS. Each file

Re: Spark SQL on large number of columns

2015-05-19 Thread ayan guha
are you using Sent from my iPhone On 19 May 2015, at 18:29, ayan guha guha.a...@gmail.com wrote: can you kindly share your code? On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to run spark sql aggregation on a file with 26k columns. No. of rows is very small. I am

Re: Processing multiple columns in parallel

2015-05-18 Thread ayan guha
My first thought would be creating 10 rdds and running your word count on each of them. I think the spark scheduler is going to resolve the dependencies in parallel and launch 10 jobs. Best Ayan On 18 May 2015 23:41, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, Consider I have a tab delimited text

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread ayan guha
Hi So to be clear, do you want to run one operation in multiple threads within a function, or do you want to run multiple jobs using multiple threads? I am wondering why the python thread module can't be used? Or have you already given it a try? On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in
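
For what it's worth, a minimal sketch of the plain-threads idea: one SparkContext, two jobs submitted concurrently from Python threads. The RDDs here are toy data and sc is the usual shell context:

    import threading

    def run_count(rdd, results, slot):
        results[slot] = rdd.count()          # any action; each call is its own Spark job

    rdd_a = sc.parallelize(range(100000))
    rdd_b = sc.parallelize(range(50000))
    results = [None, None]
    threads = [threading.Thread(target=run_count, args=(rdd_a, results, 0)),
               threading.Thread(target=run_count, args=(rdd_b, results, 1))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)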

Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread ayan guha
the schema, I am specifying every field as nullable. So I believe, it should not throw this error. Can anyone help me fix this error. Thank you. Regards, Anand.C -- Best Regards, Ayan Guha

Re: IF in SQL statement

2015-05-16 Thread ayan guha
() thx, Antony. -- Best Regards, Ayan Guha

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
Here is from documentation: Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently Spark SQL is based on Hive 0.12.0 and 0.13.1. On Sun, May 17, 2015 at 1:48 AM, ayan guha guha.a...@gmail.com wrote: Hi Try with Hive 0.13. If I am not wrong, Hive 0.14

Re: Custom Aggregate Function for DataFrame

2015-05-16 Thread ayan guha
the performance. Thanks. Justin On Fri, May 15, 2015 at 6:32 AM, ayan guha guha.a...@gmail.com wrote: can you kindly elaborate on this? it should be possible to write udafs in similar lines of sum/min etc. On Fri, May 15, 2015 at 5:49 AM, Justin Yip yipjus...@prediction.io wrote: Hello, May I

Re: Custom Aggregate Function for DataFrame

2015-05-15 Thread ayan guha
-Function-for-DataFrame-tp22893.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com. -- Best Regards, Ayan Guha

Re: Worker Spark Port

2015-05-15 Thread ayan guha
...@gmail.com wrote: I understated that this port value is randomly selected. Is there a way to enforce which spark port a Worker should use? -- Best Regards, Ayan Guha

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread ayan guha
batches, I would need to handle update in case the hdfs directory already exists. Is this a common approach? Are there any other approaches that I can try? Thank you! Nisrina. -- Best Regards, Ayan Guha

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread ayan guha
at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha

Re: reduceByKey

2015-05-14 Thread ayan guha
: *2553: 0,0,0,1,0,1,0,0* 46551: 0,1,0,0,0,0,0,0 266: 0,1,0,0,0,0,0,0 *225546: 0,0,0,0,0,2,0,0* Anyone can help me getting that? Thank you. Have a nice day. yasemin -- hiç ender hiç -- Best Regards, Ayan Guha

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread ayan guha
) lines.count() On Thu, May 14, 2015 at 4:17 AM, ayan guha guha.a...@gmail.com wrote: Jo Thanks for the reply, but _jsc does not have anything to pass hadoop configs. can you illustrate your answer a bit more? TIA... On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha sriharsha@gmail.com wrote

Re: Spark performance in cluster mode using yarn

2015-05-14 Thread ayan guha
With this information it is hard to predict. What's the performance you are getting? What's your desired performance? Maybe you can post your code and experts can suggest improvements? On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote: Hi Friends, please someone can give the

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread ayan guha
(jsc) https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext through which you can access the hadoop configuration On Tue, May 12, 2015 at 6:39 AM, ayan guha guha.a...@gmail.com wrote: Hi I found this method in scala API but not in python API (1.3.1). Basically, I

Re: how to set random seed

2015-05-14 Thread ayan guha
the seed (call random.seed()) once on each worker? -- *From:* ayan guha guha.a...@gmail.com *Sent:* Tuesday, May 12, 2015 11:17 PM *To:* Charles Hayden *Cc:* user *Subject:* Re: how to set random seed Easiest way is to broadcast it. On 13 May 2015 10:40, Charles

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread ayan guha
Your stack trace says it can't convert date to integer. Are you sure about the column positions? On 13 May 2015 21:32, Ishwardeep Singh ishwardeep.si...@impetus.co.in wrote: Hi , I am using Spark SQL 1.3.1. I have created a dataFrame using jdbc data source and am using saveAsTable() method but got

Re: how to set random seed

2015-05-13 Thread ayan guha
Easiest way is to broadcast it. On 13 May 2015 10:40, Charles Hayden charles.hay...@atigeo.com wrote: In pySpark, I am writing a map with a lambda that calls random.shuffle. For testing, I want to be able to give it a seed, so that successive runs will produce the same shuffle. I am looking
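
A minimal sketch of the broadcast idea (the seed value and data are made up); each partition seeds its own generator from the broadcast value, so successive runs shuffle identically:

    import random

    seed_bc = sc.broadcast(42)                       # assumed seed

    def shuffle_partition(rows):
        rng = random.Random(seed_bc.value)           # same seed on every worker
        rows = list(rows)
        rng.shuffle(rows)
        return rows

    shuffled = sc.parallelize(range(20), 4).mapPartitions(shuffle_partition)
    print(shuffled.collect())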

Using sc.HadoopConfiguration in Python

2015-05-12 Thread ayan guha
, how? -- Best Regards, Ayan Guha

Re: Python - SQL (geonames dataset)

2015-05-11 Thread ayan guha
Try this: Res = ssc.sql(your SQL without limit) print Res.first() Note: your SQL looks wrong as count will need a group by clause. Best Ayan On 11 May 2015 16:22, Tyler Mitchell tyler.mitch...@actian.com wrote: I'm using Python to setup a dataframe, but for some reason it is not being made

Re: Reading Nested Fields in DataFrames

2015-05-11 Thread ayan guha
Typically you would use . notation to access, same way you would access a map. On 12 May 2015 00:06, Ashish Kumar Singh ashish23...@gmail.com wrote: Hi , I am trying to read Nested Avro data in Spark 1.3 using DataFrames. I need help to retrieve the Inner element data in the Structure below.
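
A small sketch of the dot notation against a made-up nested schema; the field names and the jsonRDD source are only for illustration:

    nested = sc.parallelize(
        ['{"name": "a", "address": {"city": "Sydney", "zip": "2000"}}'])
    df = sqlContext.jsonRDD(nested)                  # infers the nested schema
    df.select("name", "address.city").show()         # dot notation into the struct
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE address.zip = '2000'").show()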

Re: can we start a new thread in foreachRDD in spark streaming?

2015-05-11 Thread ayan guha
It depends on how you want to run your application. You can always save 100 batches as data files and run another app to read those files. In that case you have separate contexts and you will find both applications running simultaneously in the cluster but on different JVMs. But if you do not want

Re: custom join using complex keys

2015-05-10 Thread ayan guha
with a given predicate to implement this ? (I would probably also need to provide a partitioner, and some sorting predicate). Left and right RDD are 1-10 millions lines long. Any idea ? Thanks Mathieu -- Best Regards, Ayan Guha

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread ayan guha
How did you end up with thousands of dfs? Are you using streaming? In that case you can do foreachRDD and keep merging incoming rdds into a single rdd and then save it through your own checkpoint mechanism. If not, please share your use case. On 11 May 2015 00:38, Peter Aberline

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread ayan guha
file. They have the same schema. There is also the option of appending each DF to the parquet file, but then I can't maintain them as separate DF when reading back in without filtering. I'll rethink maintaining each CSV file as a single DF. Thanks, Peter On 10 May 2015 at 15:51, ayan guha

Re: spark and binary files

2015-05-09 Thread ayan guha
-- Best Regards, Ayan Guha

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread ayan guha
From S3, as the dependency of the df will be on S3, and because rdds are not replicated. On 8 May 2015 23:02, Peter Rudenko petro.rude...@gmail.com wrote: Hi, I have a next question: val data = sc.textFile(s3:///)val df = data.toDF df.saveAsParquetFile(hdfs://) df.someAction(...) if during

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread ayan guha
I am just wondering if create table supports the syntax Create table db.tablename instead of the two-step process of use db and then create table tablename? On 9 May 2015 08:17, Michael Armbrust mich...@databricks.com wrote: Actually, I was talking about the support for inferring different but

Re: Map one RDD into two RDD

2015-05-08 Thread ayan guha
Do as Evo suggested. Rdd1=rdd.filter, rdd2=rdd.filter On 9 May 2015 05:19, anshu shukla anshushuk...@gmail.com wrote: Any update to above mail and Can anyone tell me logic - I have to filter tweets and submit tweets with particular #hashtag1 to SparkSQL databases and tweets with
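
A sketch of that two-filter split (the hashtags and path are hypothetical); caching the parent avoids reading the source twice:

    tweets = sc.textFile("hdfs:///tweets").cache()            # made-up path
    tag1_tweets = tweets.filter(lambda t: "#hashtag1" in t)   # e.g. goes to SparkSQL tables
    tag2_tweets = tweets.filter(lambda t: "#hashtag2" in t)   # e.g. goes elsewhere
    print(tag1_tweets.count(), tag2_tweets.count())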

Re: How can I force operations to complete and spool to disk

2015-05-07 Thread ayan guha
be forced. Any ideas? -- Best Regards, Ayan Guha

Re: Partition Case Class RDD without ParRDDFunctions

2015-05-06 Thread ayan guha
it to a tuple2 seems like a waste of space/computation. It looks like the PairRDDFunctions..partitionBy() uses a ShuffleRDD[K,V,C] requires K,V,C? Could I create a new ShuffleRDD[MyClass,MyClass,MyClass](caseClassRdd, new HashParitioner)? Cheers, N -- Best Regards, Ayan Guha

Re: Creating topology in spark streaming

2015-05-06 Thread ayan guha
Every transformation on a dstream will create another dstream. You may want to take a look at foreachrdd? Also, kindly share your code so people can help better On 6 May 2015 17:54, anshu shukla anshushuk...@gmail.com wrote: Please help guys, Even After going through all the examples given i

Re: Receiver Fault Tolerance

2015-05-06 Thread ayan guha
. Is the above understanding correct? or is there more to it? -- Best Regards, Ayan Guha

Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
And how important is it to have a production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark

Re: Unable to join table across data sources using sparkSQL

2015-05-05 Thread ayan guha
.1001560.n3.nabble.com/Unable-to-join-table-across-data-sources-using-sparkSQL-tp22761p22768.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com. -- Best Regards, Ayan Guha

Re: Maximum Core Utilization

2015-05-05 Thread ayan guha
Also, if not already done, you may want to try repartitioning your data into 50 partitions. On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote: Hi All, For a job I am running on Spark with a dataset of say 350,000 lines (not big), I am finding that even though my cluster has a large

Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
for Spark certification, learning in group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote: And how important is to have production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files into your hdfs from the local filesystem? Looks like it's an hdfs issue rather than a spark thing. On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote: I have searched all replies to this question and not found an answer. I am running standalone Spark 1.3.1 and

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread ayan guha
You can use a custom partitioner to redistribute the data using partitionBy. On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote: I'm currently trying to join two large tables (order 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt. I'm
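
A sketch of that idea in PySpark terms, assuming left_rdd and right_rdd are the two large RDDs keyed on their first column; the partition count and key position are made up:

    def by_key(k):
        return hash(k)                        # any deterministic key -> int mapping

    left_pairs = left_rdd.map(lambda r: (r[0], r)).partitionBy(400, by_key)
    right_pairs = right_rdd.map(lambda r: (r[0], r)).partitionBy(400, by_key)
    joined = left_pairs.join(right_pairs)     # both sides share the same partitioner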

Re: Hardware requirements

2015-05-04 Thread ayan guha
Hi How do you figure out 500gig~3900 partitions? I am trying to do the math. If I assume 64mb block size then 1G~16 blocks and 500g~8000 blocks. If we assume split and block sizes are same, shouldn't we end up with 8k partitions? On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote:

Re: mapping JavaRDD to jdbc DataFrame

2015-05-04 Thread ayan guha
? Thanks, Lior -- Best Regards, Ayan Guha

Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread ayan guha
...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha

Python Custom Partitioner

2015-05-04 Thread ayan guha
path? b) How can I do partitionby? Specifically, when I call DF.rdd.partitionBy, what gets passed to the custom function? tuple? row? how to access (say 3rd column of a tuple inside partitioner function)? -- Best Regards, Ayan Guha

Re: Python Custom Partitioner

2015-05-04 Thread ayan guha
Thanks, but is there a non-broadcast solution? On 5 May 2015 01:34, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have implemented map-side join with broadcast variables and the code is on mailing list (scala). On Mon, May 4, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote: Hi Can

Re: Spark distributed SQL: JSON Data set on all worker node

2015-05-03 Thread ayan guha
Yes it is possible. You need to use the jsonFile method on the SQL context and then create a dataframe from the rdd. Then register it as a table. Should be 3 lines of code, thanks to spark. You may see a few YouTube videos, esp. for unifying pipelines. On 3 May 2015 19:02, Jai jai4l...@gmail.com wrote: Hi,
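
The "3 lines of code", roughly, with a hypothetical path and table name (sqlContext is the shell's SQL context):

    df = sqlContext.jsonFile("hdfs:///data/events.json")     # infers the schema from the JSON
    df.registerTempTable("events")
    sqlContext.sql("SELECT count(*) AS n FROM events").show()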

Re: directory loader in windows

2015-05-02 Thread ayan guha
(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) -- Best Regards, Ayan Guha

Re: How to add a column to a spark RDD with many columns?

2015-05-01 Thread ayan guha
Do you have an rdd or a dataframe? Rdd rows are kind of tuples. You can add a new column to it by a map. Rdds are immutable, so you will get another rdd. On 1 May 2015 14:59, Carter gyz...@hotmail.com wrote: Hi all, I have a RDD with *MANY *columns (e.g., *hundreds*), how do I add one more column at the
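
A tiny sketch of the map approach on a toy RDD of tuples (standing in for the hundreds of columns):

    rdd = sc.parallelize([(1, 2, 3), (4, 5, 6)])
    with_extra = rdd.map(lambda row: row + (sum(row),))   # new RDD with one extra column
    print(with_extra.collect())                           # [(1, 2, 3, 6), (4, 5, 6, 15)]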

RE: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-30 Thread ayan guha
it using DataFrame? Can you give an example code snipet? Thanks Ningjun *From:* ayan guha [mailto:guha.a...@gmail.com] *Sent:* Wednesday, April 29, 2015 5:54 PM *To:* Wang, Ningjun (LNG-NPV) *Cc:* user@spark.apache.org *Subject:* Re: HOw can I merge multiple DataFrame and remove duplicated key

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread ayan guha
And if I may ask, how long does it take in the hbase CLI? I would not expect spark to improve the performance of hbase. At best spark will push down the filter to hbase. So I would try to optimise any additional overhead like bringing data into spark. On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote:

Re: DataFrame filter referencing error

2015-04-30 Thread ayan guha
PM ayan guha guha.a...@gmail.com wrote: Looks like you DF is based on a MySQL DB using jdbc, and error is thrown from mySQL. Can you see what SQL is finally getting fired in MySQL? Spark is pushing down the predicate to mysql so its not a spark problem perse On Wed, Apr 29, 2015 at 9:56 PM

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread ayan guha
Regards, Ayan Guha

Re: DataFrame filter referencing error

2015-04-29 Thread ayan guha
) at java.lang.Thread.run(Thread.java:745) Does filter work only on columns of the integer type? What is the exact behaviour of the filter function and what is the best way to handle the query I am trying to execute? Thank you, Francesco -- Best Regards, Ayan Guha

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread ayan guha
I guess what you mean is not streaming. If you create a stream context at time t, you will receive data coming through starting after time t, not before time t. Looks like you want a queue. Let Kafka write to a queue, consume msgs from the queue and stop when the queue is empty. On 29 Apr 2015 14:35,

Re: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread ayan guha
It's no different; you would use group by and an aggregate function to do so. On 30 Apr 2015 02:15, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have multiple DataFrame objects each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to

Re: Compute pairwise distance

2015-04-29 Thread ayan guha
This is my first thought, please suggest any further improvement: 1. Create an rdd of your dataset 2. Do a cross join to generate pairs 3. Apply reduceByKey and compute the distance. You will get an rdd with key pairs and distances. Best Ayan On 30 Apr 2015 06:11, Driesprong, Fokko fo...@driesprong.frl
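
A small sketch of steps 1-2 with cartesian as the cross join and a plain map for the distance; the points are toy data, and reduceByKey would only come in if partial distances per pair had to be combined:

    import math

    points = sc.parallelize([("a", (0.0, 0.0)), ("b", (3.0, 4.0)), ("c", (6.0, 8.0))])

    def euclid(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    pairs = points.cartesian(points).filter(lambda pq: pq[0][0] < pq[1][0])  # unique pairs only
    distances = pairs.map(lambda pq: ((pq[0][0], pq[1][0]), euclid(pq[0][1], pq[1][1])))
    print(distances.collect())   # e.g. (('a', 'b'), 5.0), (('a', 'c'), 10.0), (('b', 'c'), 5.0)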

Re: How to group multiple row data ?

2015-04-29 Thread ayan guha
commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha

Re: Understanding Spark's caching

2015-04-28 Thread ayan guha
Hi, I replied to you on SO. If option A had an action call then it should suffice too. On 28 Apr 2015 05:30, Eran Medan eran.me...@gmail.com wrote: Hi Everyone! I'm trying to understand how Spark's cache works. Here is my naive understanding, please let me know if I'm missing something: val

Re: 1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-28 Thread ayan guha
Can you show your code please? On 28 Apr 2015 13:20, sranga sra...@gmail.com wrote: Hi I am getting the following error when persisting an RDD in parquet format to an S3 location. This is code that was working in the 1.2 version. The version that it is failing to work is 1.3.1. Any help is

Re: How to add jars to standalone pyspark program

2015-04-28 Thread ayan guha
It's a Windows thing. Please escape the backslashes in the path string. Basically it is not able to find the file. On 28 Apr 2015 22:09, Fabian Böhnlein fabian.boehnl...@gmail.com wrote: Can you specify 'running via PyCharm'. How are you executing the script, with spark-submit? In PySpark I guess you used

Re: Initial tasks in job take time

2015-04-28 Thread ayan guha
Is your driver running on the same m/c as the master? On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running short spark jobs on rdds cached in memory. I'm also using a long running job context. I want to be able to complete my jobs (on the cached rdd) in under 1 sec.

Re: New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-28 Thread ayan guha
The alias function is not in python yet. I suggest writing SQL if your data suits it. On 28 Apr 2015 14:42, Don Drake dondr...@gmail.com wrote: https://issues.apache.org/jira/browse/SPARK-7182 Can anyone suggest a workaround for the above issue? Thanks. -Don -- Donald Drake Drake Consulting

Re: Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread ayan guha
The answer is it depends :) The fact that query runtime increases indicates more shuffle. You may want to construct rdds based on keys you use. You may want to specify what kind of node you are using and how many executors you are using. You may also want to play around with executor memory

Re: Automatic Cache in SparkSQL

2015-04-27 Thread ayan guha
Spark keeps jobs in memory by default for the kind of performance gains you are seeing. Additionally, depending on your query, spark runs stages and at any point in time spark's code behind the scenes may issue an explicit cache. If you hit any such scenario you will find those cached objects in the UI under

Re: Scalability of group by

2015-04-27 Thread ayan guha
Hi Can you test on a smaller dataset to identify if it is a cluster issue or a scaling issue in spark? On 28 Apr 2015 11:30, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I am running a group by on a dataset of 2B of RDD[Row [id, time, value]] in Spark 1.3 as follows: “select id,

Re: Querying Cluster State

2015-04-26 Thread ayan guha
that are currently available using API calls and then take some appropriate action based on the information I get back, like restart a dead Master or Worker. Is this possible? does Spark provide such API? -- Best Regards, Ayan Guha

Re: Querying Cluster State

2015-04-26 Thread ayan guha
On Sun, Apr 26, 2015 at 10:12 AM, ayan guha guha.a...@gmail.com wrote: In my limited understanding, there must be a single leader master in the cluster. If there are multiple leaders, it will lead to an unstable cluster as each master will keep scheduling independently. You should use zookeeper

Re: directory loader in windows

2015-04-25 Thread ayan guha
) print newsY.count() On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote: Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-25 Thread ayan guha
that this is different than the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL). On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote: What is the specific usecase? I can think of couple of ways (write to hdfs and then read from spark or stream

Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread ayan guha
, Ayan Guha

directory loader in windows

2015-04-25 Thread ayan guha
:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) -- Best Regards, Ayan

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-24 Thread ayan guha
What is the specific use case? I can think of a couple of ways (write to hdfs and then read from spark, or stream data to spark). Also I have seen people using mysql jars to bring data in. Essentially you want to simulate creation of an rdd. On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
you! Best, Wenlei -- Best Regards, Ayan Guha

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread ayan guha
I just tested your pr On 25 Apr 2015 10:18, Ali Bajwa ali.ba...@gmail.com wrote: Any ideas on this? Any sample code to join 2 data frames on two columns? Thanks Ali On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote: Hi experts, Sorry if this is a n00b question or has
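
For the archive, a sketch of a two-column equi-join in the 1.3-era DataFrame API; the DataFrames and column names below are hypothetical:

    df1 = sqlContext.createDataFrame([("eng", 2015, 10)], ["dept", "year", "headcount"])
    df2 = sqlContext.createDataFrame([("eng", 2015, 1.2)], ["dept", "year", "budget"])
    joined = df1.join(df2, (df1.dept == df2.dept) & (df1.year == df2.year), "inner")
    joined.show()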

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
you so much for the help! On Sat, Apr 25, 2015 at 12:41 AM, ayan guha guha.a...@gmail.com wrote: can you give an example set of data and desired output On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie wenlei@gmail.com wrote: Hi, I would like to answer the following customized aggregation

Re: Pipeline in pyspark

2015-04-23 Thread ayan guha
I do not think you can share data across spark contexts. So as long as you can pass it around you should be good. On 23 Apr 2015 17:12, Suraj Shetiya surajshet...@gmail.com wrote: Hi, I have come across ways of building pipeline of input/transform and output pipelines with Java (Google

Re: Spark SQL performance issue.

2015-04-23 Thread ayan guha
Quick questions: why are you caching both the rdd and the table? Which stage of the job is slow? On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote: Hi, I have Spark SQL performance issue. My code contains a simple JavaBean: public class Person implements Externalizable {

Re: Map-Side Join in Spark

2015-04-21 Thread ayan guha
If you are using a pairrdd, then you can use the partitionBy method to provide your partitioner. On 21 Apr 2015 15:04, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: What is re-partition ? On Tue, Apr 21, 2015 at 10:23 AM, ayan guha guha.a...@gmail.com wrote: In my understanding you need to create

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Best Regards, Ayan Guha

Re: Custom Partitioning Spark

2015-04-21 Thread ayan guha
solely for the person(s) named and may be confidential and/or privileged.If you are not the intended recipient,please delete it,notify me and do not copy,use,or disclose its content.* -- Best Regards, Ayan Guha

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread ayan guha
/Column-renaming-after-DataFrame-groupBy-tp22586.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com. -- Best Regards, Ayan Guha
