the updated variable?
Thanks.
--
bit1...@163.com
--
Best Regards,
Ayan Guha
of daily block.
On 29 May 2015 at 01:51, ayan guha guha.a...@gmail.com wrote:
Which version of Spark? In 1.4, window queries will show up for these kinds
of scenarios.
One thing I can suggest is to keep daily aggregates materialised and
partitioned by key and sorted by key-day combination using
Depending on your Spark version, you can convert the SchemaRDD to a
dataframe and then use .show()
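For instance, a minimal sketch assuming Spark 1.3+, where sql() already
returns a DataFrame (the table name is illustrative):
  val df = sqlContext.sql("SELECT * FROM my_results")
  df.show()    // prints the first 20 rows as a formatted table
  df.show(50)  // or ask for more rows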
On 30 May 2015 10:33, Minnow Noir minnown...@gmail.com wrote:
I'm trying to debug query results inside spark-shell, but finding it
cumbersome to save to file and then use file system utils to explore
Probably a naive question: can you try the same in the Hive CLI and see if
your SQL is working? Looks like a Hive thing to me, as Spark is faithfully
delegating the query to Hive.
On 29 May 2015 03:22, Abhishek Tripathi trackissue...@gmail.com wrote:
Hi ,
I'm using CDH5.4.0 quick start VM and tried
Which version of Spark? In 1.4, window queries will show up for these kinds
of scenarios.
One thing I can suggest is to keep daily aggregates materialised and
partitioned by key and sorted by key-day combination using the
repartitionAndSortWithinPartitions method. It allows you to use a custom
partitioner and a custom sort order.
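A sketch of that approach, assuming a spark-shell sc, an illustrative
composite (key, day) key, and a hypothetical partitioner that ignores the
day component:

  import org.apache.spark.Partitioner

  // Hypothetical partitioner: every row for a given key lands in the same
  // partition, whatever the day component of the composite key is.
  class KeyPartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = key match {
      case (k: String, _) => math.abs(k.hashCode) % numPartitions
      case _              => 0
    }
  }

  // Keys are (key, day); the implicit tuple ordering then sorts each
  // partition by key first and day second.
  val daily = sc.parallelize(Seq(
    (("user1", "2015-05-01"), 10L),
    (("user1", "2015-05-02"), 7L),
    (("user2", "2015-05-01"), 3L)))
  val sorted = daily.repartitionAndSortWithinPartitions(new KeyPartitioner(4))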
What about /blah/*/blah/out*.avro?
On 27 May 2015 18:08, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I am doing that now.
Is there no other way?
On Wed, May 27, 2015 at 12:40 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
How about creating two and union [ sc.union(first, second) ] them?
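For example (paths are illustrative; textFile is used here just for
brevity):
  val first  = sc.textFile("/blah/1/blah/out")
  val second = sc.textFile("/blah/2/blah/out")
  val both   = sc.union(first, second)  // equivalently first.union(second)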
You can request number of cores and amount of memory for each executor.
On 27 May 2015 18:25, canan chen ccn...@gmail.com wrote:
Thanks Arush.
My scenario is that in standalone mode, if I have one worker, when I start
spark-shell, there will be one executor launched. But if I have 2 workers,
in the
aggregation inserting a lambda function or something else.
Thanks
Regards.
Miguel.
On Wed, May 27, 2015 at 1:06 AM, ayan guha guha.a...@gmail.com wrote:
For this, I can give you a SQL solution:
joinedData.registerTempTable('j')
res = ssc.sql('select col1, col2, count(1) counter, min(col3
Yes, you are on the right path. The only thing to remember is placing
hive-site.xml in the correct path so Spark can talk to the Hive metastore.
Best
Ayan
On 28 May 2015 10:53, Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid wrote:
hey guys
On the Hive/Hadoop ecosystem we have been using Cloudera
know how it works. For example:
val result = joinedData.groupBy(col1, col2).agg(
  count(lit(1)).as(counter),
  min(col3).as(minimum),
  sum(case when endrscp > 100 then 1 else 0 end).as(test)
)
How can I do it?
Thanks
Regards.
Miguel.
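For reference, a hedged sketch of how this can be written on the DataFrame
side, using when/otherwise from org.apache.spark.sql.functions (available
from Spark 1.4; column names are taken from the question above):

  import org.apache.spark.sql.functions._

  val result = joinedData.groupBy("col1", "col2").agg(
    count(lit(1)).as("counter"),
    min("col3").as("minimum"),
    sum(when(col("endrscp") > 100, 1).otherwise(0)).as("test")
  )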
On Tue, May 26, 2015 at 12:35 AM, ayan guha
Yes, you are in the right mailing list, for sure :)
Regarding your question, I am sure you are well versed with how Spark
works. Essentially you can run any arbitrary function with a map call and
it will run on remote nodes. Hence you need to install any needed
dependency on all nodes. You can also pass
Yes, Spark will be useful for the following areas of your application:
1. Running the same function on every CV in parallel and scoring them.
2. Improving the scoring function through better access to classification
and clustering algorithms, within and beyond MLlib.
These are the first benefits you can start with, and then
CASE WHEN col2 > 100 THEN 1 ELSE col2 END
On 26 May 2015 00:25, Masf masfwo...@gmail.com wrote:
Hi.
In a DataFrame, how can I execute a conditional expression in an
aggregation? For example, can I translate this SQL statement to a
DataFrame?:
SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE
Hi Michael
This is great info. I am currently using the repartitionAndSort function
to achieve the same. Is this the recommended way till 1.3, or is there any
better way?
On 23 May 2015 07:38, Michael Armbrust mich...@databricks.com wrote:
DataFrames have a lot more information about the data, so
I guess not. Spark partitions correspond to the number of input splits.
On 23 May 2015 00:02, Cesar Flores ces...@gmail.com wrote:
I have a table in a Hive database partitioned by date. I notice that when
I query this table using HiveContext the created data frame has a specific
number of partitions.
And if I am not wrong, the Spark SQL API is intended to move closer to SQL
standards. I feel it's a clever decision on Spark's part to keep both APIs
operational. These short-term confusions are worth the long-term benefits.
On 20 May 2015 17:19, Sean Owen so...@cloudera.com wrote:
I don't think that's
Thanks a bunch
On 21 May 2015 07:11, Davies Liu dav...@databricks.com wrote:
The docs had been updated.
You should convert the DataFrame to RDD by `df.rdd`
On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote:
Hi
Just upgraded to Spark 1.3.1.
I am getting a warning
and create a logical plan. Even if I have
just one row, it's taking more than 1 hour just to get past the parsing.
Any idea how to optimize in these kinds of scenarios?
Regards,
Madhukara Phatak
http://datamantra.io/
--
Best Regards,
Ayan Guha
What does your spark-env file say? Are you setting the number of executors
in the Spark context?
On 20 May 2015 13:16, Shailesh Birari sbirar...@gmail.com wrote:
Hi,
I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB
of RAM.
I have around 600,000+ JSON files on HDFS. Each file
are you using
Sent from my iPhone
On 19 May 2015, at 18:29, ayan guha guha.a...@gmail.com wrote:
can you kindly share your code?
On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com
wrote:
Hi,
I am trying to run Spark SQL aggregation on a file with 26k columns. The
number of rows is very small. I am
My first thought would be creating 10 RDDs and running your word count on
each of them... I think the Spark scheduler is going to resolve the
dependencies in parallel and launch 10 jobs.
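A sketch of that idea (paths are hypothetical); note that each count()
blocks its calling thread, so having all ten jobs in flight at once needs
the actions wrapped in Futures:

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  val rdds = (0 until 10).map(i => sc.textFile(s"/data/slice-$i"))

  // Submit each word count from its own thread; the scheduler can then
  // run the ten jobs concurrently if the cluster has capacity.
  val jobs = rdds.map(rdd =>
    Future(rdd.flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).count()))
  val counts = jobs.map(f => Await.result(f, Duration.Inf))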
Best
Ayan
On 18 May 2015 23:41, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote:
Hi,
Consider I have a tab delimited text
Hi
So to be clear, do you want to run one operation in multiple threads
within a function, or do you want to run multiple jobs using multiple
threads? I am wondering why the Python threading module can't be used. Or
have you already given it a try?
On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in
the schema, I am specifying every field
as nullable, so I believe it should not throw this error. Can anyone help
me fix this error? Thank you.
Regards,
Anand.C
--
Best Regards,
Ayan Guha
()
thx,
Antony.
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha
Here is from documentation:
Spark SQL is designed to be compatible with the Hive Metastore, SerDes and
UDFs. Currently Spark SQL is based on Hive 0.12.0 and 0.13.1.
On Sun, May 17, 2015 at 1:48 AM, ayan guha guha.a...@gmail.com wrote:
Hi
Try with Hive 0.13. If I am not wrong, Hive 0.14
the performance.
Thanks.
Justin
On Fri, May 15, 2015 at 6:32 AM, ayan guha guha.a...@gmail.com wrote:
Can you kindly elaborate on this? It should be possible to write UDAFs
along similar lines to sum/min etc.
On Fri, May 15, 2015 at 5:49 AM, Justin Yip yipjus...@prediction.io
wrote:
Hello,
May I
--
Best Regards,
Ayan Guha
...@gmail.com
wrote:
I understand that this port value is randomly selected.
Is there a way to enforce which port a Spark Worker should use?
--
Best Regards,
Ayan Guha
batches, I would need to handle updates in case the HDFS
directory already exists.
Is this a common approach? Are there any other approaches that I can try?
Thank you!
Nisrina.
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha
:
*2553: 0,0,0,1,0,1,0,0*
46551: 0,1,0,0,0,0,0,0
266: 0,1,0,0,0,0,0,0
*225546: 0,0,0,0,0,2,0,0*
Anyone can help me getting that?
Thank you.
Have a nice day.
yasemin
--
hiç ender hiç
--
Best Regards,
Ayan Guha
)
lines.count()
On Thu, May 14, 2015 at 4:17 AM, ayan guha guha.a...@gmail.com wrote:
Jo
Thanks for the reply, but _jsc does not have anything to pass hadoop
configs. Can you illustrate your answer a bit more? TIA...
On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha sriharsha@gmail.com
wrote
With this information it is hard to predict. What's the performance you
are getting? What's your desired performance? Maybe you can post your code
so experts can suggest improvements?
On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote:
Hi Friends,
please, can someone give the
(jsc)
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
through which you can access the hadoop configuration
On Tue, May 12, 2015 at 6:39 AM, ayan guha guha.a...@gmail.com wrote:
Hi
I found this method in scala API but not in python API (1.3.1).
Basically, I
the seed (call
random.seed()) once on each worker?
--
*From:* ayan guha guha.a...@gmail.com
*Sent:* Tuesday, May 12, 2015 11:17 PM
*To:* Charles Hayden
*Cc:* user
*Subject:* Re: how to set random seed
Easiest way is to broadcast it.
On 13 May 2015 10:40, Charles
Your stack trace says it can't convert a date to an integer. Are you sure
about the column positions?
On 13 May 2015 21:32, Ishwardeep Singh ishwardeep.si...@impetus.co.in
wrote:
Hi ,
I am using Spark SQL 1.3.1.
I have created a dataFrame using the jdbc data source and am using the
saveAsTable()
method but got
Easiest way is to broadcast it.
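The thread is pySpark, but the idea sketched in Scala: broadcast one seed
and derive a deterministic per-partition seed, so every run shuffles the
same way (names are illustrative):

  val seedBc = sc.broadcast(42L)
  val data = sc.parallelize(1 to 100)

  val shuffled = data.mapPartitionsWithIndex { (idx, iter) =>
    // Same broadcast seed + partition index => identical shuffle each run.
    val rng = new scala.util.Random(seedBc.value + idx)
    rng.shuffle(iter.toSeq).iterator
  }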
On 13 May 2015 10:40, Charles Hayden charles.hay...@atigeo.com wrote:
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs
will produce the same shuffle.
I am looking
, how?
--
Best Regards,
Ayan Guha
Try this:
res = ssc.sql("your SQL without limit")
print res.first()
Note: your SQL looks wrong, as count will need a group by clause.
Best
Ayan
On 11 May 2015 16:22, Tyler Mitchell tyler.mitch...@actian.com wrote:
I'm using Python to set up a dataframe, but for some reason it is not
being made
Typically you would use dot notation to access it, the same way you would
access a map.
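For example, with a hypothetical nested schema of user.address.city:

  // Dot notation reaches into nested structs, in both the DataFrame API
  // and SQL (names here are hypothetical).
  df.select("user.address.city").show()
  df.registerTempTable("events")
  sqlContext.sql("SELECT user.address.city FROM events").show()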
On 12 May 2015 00:06, Ashish Kumar Singh ashish23...@gmail.com wrote:
Hi ,
I am trying to read Nested Avro data in Spark 1.3 using DataFrames.
I need help to retrieve the Inner element data in the Structure below.
It depends on how you want to run your application. You can always save
100 batches as a data file and run another app to read those files. In
that case you have separate contexts and you will find both applications
running simultaneously in the cluster but on different JVMs. But if you do
not want
with a given predicate to
implement this? (I would probably also need to provide a partitioner, and
some sorting predicate.)
Left and right RDD are 1-10 millions lines long.
Any idea ?
Thanks
Mathieu
--
Best Regards,
Ayan Guha
How did you end up with thousands of DFs? Are you using streaming? In that
case you can do foreachRDD and keep merging incoming RDDs into a single
RDD, and then save it through your own checkpoint mechanism.
If not, please share your use case.
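A rough sketch of that merge loop, assuming a DStream[String] named
stream (the growing union lineage should be checkpointed periodically):

  import org.apache.spark.rdd.RDD

  var merged: RDD[String] = sc.emptyRDD[String]
  stream.foreachRDD { rdd =>
    // Fold each micro-batch into the running RDD, then save `merged`
    // on your own schedule instead of once per batch.
    merged = merged.union(rdd)
  }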
On 11 May 2015 00:38, Peter Aberline
file. They have the same schema.
There is also the option of appending each DF to the parquet file, but
then I can't maintain them as separate DFs when reading back in without
filtering.
I'll rethink maintaining each CSV file as a single DF.
Thanks,
Peter
On 10 May 2015 at 15:51, ayan guha
--
Best Regards,
Ayan Guha
From S3, as the dependency of the DF will be on S3, and because RDDs are
not replicated.
On 8 May 2015 23:02, Peter Rudenko petro.rude...@gmail.com wrote:
Hi, I have a question:
val data = sc.textFile("s3://...")
val df = data.toDF
df.saveAsParquetFile("hdfs://...")
df.someAction(...)
if during
I am just wondering if CREATE TABLE supports the syntax
CREATE TABLE db.tablename
instead of the two-step process of USE db and then CREATE TABLE tablename?
On 9 May 2015 08:17, Michael Armbrust mich...@databricks.com wrote:
Actually, I was talking about the support for inferring different but
Do as Evo suggested: rdd1 = rdd.filter(...), rdd2 = rdd.filter(...)
On 9 May 2015 05:19, anshu shukla anshushuk...@gmail.com wrote:
Any update to the above mail?
Also, can anyone tell me the logic - I have to filter tweets and submit
tweets with particular #hashtag1 to SparkSQL databases and tweets with
be forced.
Any ideas?
--
Best Regards,
Ayan Guha
it to a tuple2 seems like a waste of space/computation.
It looks like PairRDDFunctions.partitionBy() uses a ShuffleRDD[K,V,C],
which requires K, V, C. Could I create a new
ShuffleRDD[MyClass,MyClass,MyClass](caseClassRdd, new HashPartitioner)?
Cheers,
N
--
Best Regards,
Ayan Guha
Every transformation on a DStream will create another DStream. You may
want to take a look at foreachRDD. Also, kindly share your code so people
can help better.
On 6 May 2015 17:54, anshu shukla anshushuk...@gmail.com wrote:
Please help, guys. Even after going through all the examples given I
.
Is the above understanding correct, or is there more to it?
--
Best Regards,
Ayan Guha
And how important is it to have a production environment?
On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:
There are questions in all three languages.
2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:
I too have a similar question.
My understanding is since Spark
--
Best Regards,
Ayan Guha
Also, if not already done, you may want to try repartitioning your data to
50 partitions
On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote:
Hi All,
For a job I am running on Spark with a dataset of say 350,000 lines (not
big), I am finding that even though my cluster has a large
for Spark certification; learning in a group makes learning easy and fun.
Kartik
On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote:
And how important is it to have a production environment?
On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:
There are questions in all three languages
What happens when you try to put files into HDFS from the local
filesystem? Looks like it's an HDFS issue rather than a Spark thing.
On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:
I have searched all replies to this question and not found an answer.
I am running standalone Spark 1.3.1 and
You can use a custom partitioner to redistribute the data using partitionBy
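A sketch of that suggestion, assuming two pair RDDs leftRaw and rightRaw
(the partition count is illustrative):

  import org.apache.spark.HashPartitioner

  // Partition both sides identically up front; matching keys are then
  // co-located, so the join avoids a second full shuffle.
  val part   = new HashPartitioner(400)
  val left   = leftRaw.partitionBy(part).persist()
  val right  = rightRaw.partitionBy(part).persist()
  val joined = left.join(right)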
On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:
I'm currently trying to join two large tables (order 1B rows each) using
Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
a halt.
I'm
Hi
How do you figure 500 GB ~ 3900 partitions? I am trying to do the math.
If I assume a 64 MB block size then 1 GB ~ 16 blocks and 500 GB ~ 8000
blocks. If we assume split and block sizes are the same, shouldn't we end
up with 8k partitions?
On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote:
?
Thanks,
Lior
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha
path?
b) How can I do partitionBy? Specifically, when I call DF.rdd.partitionBy,
what gets passed to the custom function? A tuple? A row? How do I access
(say, the 3rd column of a tuple) inside the partitioner function?
--
Best Regards,
Ayan Guha
Thanks, but is there a non-broadcast solution?
On 5 May 2015 01:34, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have implemented a map-side join with broadcast variables and the code
is on the mailing list (Scala).
On Mon, May 4, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote:
Hi
Can
Yes, it is possible. You need to use the jsonFile method on the SQL
context and then create a dataframe from the RDD. Then register it as a
table. Should be 3 lines of code, thanks to Spark.
You may see a few YouTube videos, esp. on unifying pipelines.
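Roughly those three lines, against the Spark 1.3-era API (path and table
name are illustrative):

  val df = sqlContext.jsonFile("hdfs:///data/events.json")  // schema inferred
  df.registerTempTable("events")
  val sample = sqlContext.sql("SELECT * FROM events LIMIT 10")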
On 3 May 2015 19:02, Jai jai4l...@gmail.com wrote:
Hi,
(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
--
Best Regards,
Ayan Guha
Do you have an RDD or a DataFrame? RDD rows are like tuples; you can add a
new column to them with a map.
RDDs are immutable, so you will get another RDD.
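A minimal sketch with an illustrative two-column RDD:

  val rdd = sc.parallelize(Seq((1, "alice"), (2, "bob")))

  // map yields a new RDD with the derived column appended; the
  // original RDD is untouched.
  val withExtra = rdd.map { case (id, name) => (id, name, name.length) }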
On 1 May 2015 14:59, Carter gyz...@hotmail.com wrote:
Hi all,
I have an RDD with *MANY* columns (e.g., *hundreds*); how do I add one
more column at the
it using DataFrame? Can you
give an example code snippet?
Thanks
Ningjun
*From:* ayan guha [mailto:guha.a...@gmail.com]
*Sent:* Wednesday, April 29, 2015 5:54 PM
*To:* Wang, Ningjun (LNG-NPV)
*Cc:* user@spark.apache.org
*Subject:* Re: HOw can I merge multiple DataFrame and remove duplicated key
And if I may ask, how long does it take in the HBase CLI? I would not
expect Spark to improve the performance of HBase. At best Spark will push
down the filter to HBase. So I would try to optimise any additional
overhead, like bringing data into Spark.
On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote:
PM ayan guha guha.a...@gmail.com wrote:
Looks like your DF is based on a MySQL DB using JDBC, and the error is
thrown from MySQL. Can you see what SQL is finally getting fired in MySQL?
Spark is pushing down the predicate to MySQL, so it's not a Spark problem
per se
On Wed, Apr 29, 2015 at 9:56 PM
Regards,
Ayan Guha
)
at java.lang.Thread.run(Thread.java:745)
Does filter work only on columns of the integer type? What is the exact
behaviour of the filter function and what is the best way to handle the
query I am trying to execute?
Thank you,
Francesco
--
Best Regards,
Ayan Guha
I guess what you mean is not streaming. If you create a stream context at
time t, you will receive data coming through starting at time t++, not
before time t.
Looks like you want a queue. Let Kafka write to a queue, consume messages
from the queue, and stop when the queue is empty.
On 29 Apr 2015 14:35,
It's no different; you would use group by and an aggregate function to do so.
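A sketch with the column names from the question; unionAll requires
identical schemas, and recovering the value column of the winning row
would need a join back:

  import org.apache.spark.sql.functions._

  // df1..df3 assumed loaded from the parquet files.
  val merged  = df1.unionAll(df2).unionAll(df3)
  val deduped = merged.groupBy("id").agg(max("timeStamp").as("timeStamp"))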
On 30 Apr 2015 02:15, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com
wrote:
I have multiple DataFrame objects, each stored in a parquet file. Each
DataFrame just contains 3 columns (id, value, timeStamp). I need to
This is my first thought (sketch below); please suggest any further
improvement:
1. Create an RDD of your dataset.
2. Do a cross join to generate pairs.
3. Apply reduceByKey and compute the distance. You will get an RDD with
key pairs and distances.
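A sketch of those steps with a made-up Point type and Euclidean distance:

  case class Point(id: Long, x: Double, y: Double)

  val points = sc.parallelize(Seq(Point(1, 0, 0), Point(2, 3, 4), Point(3, 6, 8)))

  // Step 2: cross join, keeping each unordered pair once.
  val pairs = points.cartesian(points).filter { case (a, b) => a.id < b.id }

  // Step 3: distance per pair, then reduceByKey, e.g. down to the
  // nearest neighbour of each point.
  val distances = pairs.map { case (a, b) =>
    (a.id, (b.id, math.hypot(a.x - b.x, a.y - b.y)))
  }
  val nearest = distances.reduceByKey((p, q) => if (p._2 < q._2) p else q)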
Best
Ayan
On 30 Apr 2015 06:11, Driesprong, Fokko fo...@driesprong.frl
--
Best Regards,
Ayan Guha
Hi, I replied to you on SO. If option A had an action call then it should
suffice too.
On 28 Apr 2015 05:30, Eran Medan eran.me...@gmail.com wrote:
Hi Everyone!
I'm trying to understand how Spark's cache works.
Here is my naive understanding, please let me know if I'm missing
something:
val
Can you show your code please?
On 28 Apr 2015 13:20, sranga sra...@gmail.com wrote:
Hi
I am getting the following error when persisting an RDD in parquet format
to an S3 location. This is code that was working with version 1.2. The
version where it fails is 1.3.1.
Any help is
It's a Windows thing. Please escape the backslashes in the string.
Basically it is not able to find the file.
On 28 Apr 2015 22:09, Fabian Böhnlein fabian.boehnl...@gmail.com wrote:
Can you specify 'running via PyCharm'? How are you executing the script,
with spark-submit?
In PySpark I guess you used
Is your driver running on the same machine as the master?
On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote:
Hi,
I'm running short spark jobs on rdds cached in memory. I'm also using a
long running job context. I want to be able to complete my jobs (on the
cached rdd) in under 1 sec.
The alias function is not in Python yet. I suggest writing SQL if your
data suits it.
On 28 Apr 2015 14:42, Don Drake dondr...@gmail.com wrote:
https://issues.apache.org/jira/browse/SPARK-7182
Can anyone suggest a workaround for the above issue?
Thanks.
-Don
--
Donald Drake
Drake Consulting
The answer is: it depends :)
The fact that query runtime increases indicates more shuffle. You may want
to construct RDDs based on the keys you use.
You may want to specify what kind of nodes you are using and how many
executors you are using. You may also want to play around with executor
memory
Spark keeps the job in memory by default for the kind of performance gains
you are seeing. Additionally, depending on your query, Spark runs stages,
and at any point in time Spark's code behind the scenes may issue an
explicit cache. If you hit any such scenario you will find those cached
objects in the UI under
Hi
Can you test on a smaller dataset to identify whether it is a cluster
issue or a scaling issue in Spark?
On 28 Apr 2015 11:30, Ulanov, Alexander alexander.ula...@hp.com wrote:
Hi,
I am running a group-by on a dataset of 2B rows of RDD[Row[id, time,
value]] in Spark 1.3 as follows:
"select id,
that are currently available using API calls and
then take some appropriate action based on the information I get back,
like restarting a dead Master or Worker.
Is this possible? Does Spark provide such an API?
--
Best Regards,
Ayan Guha
On Sun, Apr 26, 2015 at 10:12 AM, ayan guha guha.a...@gmail.com wrote:
In my limited understanding, there must be a single leader master in
the cluster. If there are multiple leaders, it will lead to an unstable
cluster as each master will keep scheduling independently. You should use
zookeeper
)
print newsY.count()
On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote:
Hi
I am facing this weird issue.
I am on Windows, and I am trying to load all files within a folder. Here
is my code -
loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds"
newsY = sc.textFile(loc
that this is different than the Spark SQL
JDBC server, which allows other applications to run queries using Spark
SQL).
On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote:
What is the specific usecase? I can think of a couple of ways (write to
HDFS and then read from Spark, or stream
,
Ayan Guha
:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
--
Best Regards,
Ayan
What is the specific usecase? I can think of a couple of ways (write to
HDFS and then read from Spark, or stream data to Spark). Also I have seen
people using MySQL jars to bring data in. Essentially you want to simulate
the creation of an RDD.
On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com
you!
Best,
Wenlei
--
Best Regards,
Ayan Guha
I just tested your pr
On 25 Apr 2015 10:18, Ali Bajwa ali.ba...@gmail.com wrote:
Any ideas on this? Any sample code to join 2 data frames on two columns?
Thanks
Ali
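A hedged sketch of one way, with illustrative DataFrame and column names:

  // Join on the conjunction of two column equalities.
  val joined = df1.join(df2,
    df1("col1") === df2("col1") && df1("col2") === df2("col2"))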
On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote:
Hi experts,
Sorry if this is a n00b question or has
you so much for the help!
On Sat, Apr 25, 2015 at 12:41 AM, ayan guha guha.a...@gmail.com wrote:
Can you give an example set of data and the desired output?
On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie wenlei@gmail.com wrote:
Hi,
I would like to answer the following customized aggregation
I do not think you can share data across spark contexts. So as long as you
can pass it around you should be good.
On 23 Apr 2015 17:12, Suraj Shetiya surajshet...@gmail.com wrote:
Hi,
I have come across ways of building input/transform and output
pipelines with Java (Google
Quick question: why are you caching both the RDD and the table?
Which stage of the job is slow?
On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote:
Hi,
I have Spark SQL performance issue. My code contains a simple JavaBean:
public class Person implements Externalizable {
If you are using a pair RDD, then you can use the partitionBy method to
provide your partitioner
On 21 Apr 2015 15:04, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
What is re-partition?
On Tue, Apr 21, 2015 at 10:23 AM, ayan guha guha.a...@gmail.com wrote:
In my understanding you need to create
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha